author    Dave Rodgman <dave.rodgman@arm.com>              2019-03-07 19:30:40 -0500
committer Linus Torvalds <torvalds@linux-foundation.org>   2019-03-07 21:32:02 -0500
commit    5ee4014af99f77dac89e01961b717d13ff1a8ea5 (patch)
tree      33987106adbb2f59723c420154b83b94b122f90d
parent    761b3238504858bbc630dc957eed1659dd7eaff1 (diff)
lib/lzo: implement run-length encoding
Patch series "lib/lzo: run-length encoding support", v5.

Following on from the previous lzo-rle patchset
(https://lkml.org/lkml/2018/11/30/972), this patchset contains only the RLE
patches, and should be applied on top of the non-RLE patches
(https://lkml.org/lkml/2019/2/5/366).

Previously, some questions were raised around the RLE patches.  I've done some
additional benchmarking to answer these questions.  In short:

 - RLE offers significant additional performance (data-dependent)
 - I didn't measure any regressions that were clearly outside the noise

One concern with this patchset was around performance - specifically, measuring
RLE impact separately from Matt Sealey's patches (CTZ & fast copy).  I have
done some additional benchmarking which I hope clarifies the benefits of each
part of the patchset.

Firstly, I've captured some memory via /dev/fmem from a Chromebook with many
tabs open which is starting to swap, and then split this into 4178 4k pages.
I've excluded the all-zero pages (as zram does), and also the no-zero pages
(which won't tell us anything about RLE performance).  This should give a
realistic test dataset for zram.  What I found was that the data is VERY
bimodal: 44% of pages in this dataset contain 5% or fewer zeros, and 44%
contain over 90% zeros (30% if you include the no-zero pages).  This supports
the idea of special-casing zeros in zram.

Next, I've benchmarked four variants of lzo on these pages (on 64-bit Arm at
max frequency): baseline LZO; baseline + Matt Sealey's patches (aka MS);
baseline + RLE only; baseline + MS + RLE.  Numbers are for weighted roundtrip
throughput (the weighting reflects that zram does more compression than
decompression).

  https://drive.google.com/file/d/1VLtLjRVxgUNuWFOxaGPwJYhl_hMQXpHe/view?usp=sharing

Matt's patches help in all cases for Arm (and have no effect on Intel), as
expected.  RLE also behaves as expected: with few zeros present, it makes no
difference; above ~75% zeros, it gives a good improvement (50 - 300 MB/s on top
of the benefit from Matt's patches).  Best performance is seen with both MS and
RLE patches.

Finally, I have benchmarked the same dataset on an x86-64 device.  Here, the MS
patches make no difference (as expected); RLE helps, similarly to Arm.  There
were no definite regressions; allowing for observational error, 0.1% (3/4178)
of cases had a regression > 1 standard deviation, of which the largest was 4.6%
(1.2 standard deviations).  I think this is probably within the noise.

  https://drive.google.com/file/d/1xCUVwmiGD0heEMx5gcVEmLBI4eLaageV/view?usp=sharing

One point to note is that the graphs show RLE appears to help very slightly
with no zeros present!  This is because the extra code causes the clang
optimiser to change code layout in a way that happens to have a significant
benefit.  Taking baseline LZO and adding a do-nothing line like
"__builtin_prefetch(out_len);" immediately before the "goto next" has the same
effect.  So this is a real, but basically spurious effect - it's small enough
not to upset the overall findings.

This patch (of 3):

When using zram, we frequently encounter long runs of zero bytes.  This adds a
special case which identifies runs of zeros and encodes them using run-length
encoding.  This is faster for both compression and decompression.  For
high-entropy data which doesn't hit this case, the impact is minimal.
Compression ratio is within a few percent in all cases.

This modifies the bitstream in a way which is backwards compatible (i.e., we
can decompress old bitstreams, but old versions of lzo cannot decompress new
bitstreams).

Link: http://lkml.kernel.org/r/20190205155944.16007-2-dave.rodgman@arm.com
Signed-off-by: Dave Rodgman <dave.rodgman@arm.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Markus F.X.J. Oberhumer <markus@oberhumer.com>
Cc: Matt Sealey <matt.sealey@arm.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <nitingupta910@gmail.com>
Cc: Richard Purdie <rpurdie@openedhand.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: Sonny Rao <sonnyrao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
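For illustration only (not part of the patch): a minimal userspace sketch of the
zero-run instruction this patch introduces.  The helper name emit_zero_run and
its interface are invented here; only the byte layout follows the format added
to Documentation/lzo.txt below.

#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, not the kernel code: encode a run of n zero bytes
 * (4..2051) as the new 4-byte zero-run instruction.  Byte 0 is the M4
 * marker with H = 1 plus the low 3 bits of the run; bytes 1-2 hold the
 * 16-bit value 0xfffc (stored little-endian as 0xfc, 0xff), i.e. all D
 * bits set with the two state bits clear; byte 3 is X, the high 8 bits
 * of the run.  Decoded run length = ((X << 3) | LLL) + 4.
 */
static size_t emit_zero_run(uint8_t *out, size_t n)
{
        size_t field;

        if (n < 4 || n > 2051)          /* MIN/MAX_ZERO_RUN_LENGTH */
                return 0;               /* caller must split or fall back */
        field = n - 4;

        out[0] = 0x18 | (field & 0x7);
        out[1] = 0xfc;  /* low two bits later receive the 0..3 trailing-literal state */
        out[2] = 0xff;
        out[3] = field >> 3;
        return 4;
}

In the actual compressor the stream additionally begins with the two bytes 17
and LZO_VERSION, which is how a decompressor distinguishes v1 data from v0 data.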
-rw-r--r--  Documentation/lzo.txt            |  35
-rw-r--r--  include/linux/lzo.h              |   2
-rw-r--r--  lib/lzo/lzo1x_compress.c         | 100
-rw-r--r--  lib/lzo/lzo1x_decompress_safe.c  |  75
-rw-r--r--  lib/lzo/lzodefs.h                |  12
5 files changed, 181 insertions, 43 deletions
diff --git a/Documentation/lzo.txt b/Documentation/lzo.txt
index 6fa6a93d0949..306c60344ca7 100644
--- a/Documentation/lzo.txt
+++ b/Documentation/lzo.txt
@@ -78,16 +78,30 @@ Description
  is an implementation design choice independent on the algorithm or
  encoding.
 
+Versions
+
+0: Original version
+1: LZO-RLE
+
+Version 1 of LZO implements an extension to encode runs of zeros using run
+length encoding. This improves speed for data with many zeros, which is a
+common case for zram. This modifies the bitstream in a backwards compatible way
+(v1 can correctly decompress v0 compressed data, but v0 cannot read v1 data).
+
 Byte sequences
 ==============
 
   First byte encoding::
 
-      0..17   : follow regular instruction encoding, see below. It is worth
-                noting that codes 16 and 17 will represent a block copy from
-                the dictionary which is empty, and that they will always be
+      0..16   : follow regular instruction encoding, see below. It is worth
+                noting that code 16 will represent a block copy from the
+                dictionary which is empty, and that it will always be
                 invalid at this place.
 
+      17      : bitstream version. If the first byte is 17, the next byte
+                gives the bitstream version. If the first byte is not 17,
+                the bitstream version is 0.
+
       18..21  : copy 0..3 literals
                 state = (byte - 17) = 0..3  [ copy <state> literals ]
                 skip byte
@@ -140,6 +154,11 @@ Byte sequences
                 state = S (copy S literals after this block)
                 End of stream is reached if distance == 16384
 
+        In version 1, this instruction is also used to encode a run of zeros if
+        distance = 0xbfff, i.e. H = 1 and the D bits are all 1.
+        In this case, it is followed by a fourth byte, X.
+        run length = ((X << 3) | (0 0 0 0 0 L L L)) + 4.
+
       0 0 1 L L L L L   (32..63)
                 Copy of small block within 16kB distance (preferably less than 34B)
                 length = 2 + (L ?: 31 + (zero_bytes * 255) + non_zero_byte)
@@ -165,7 +184,9 @@ Authors
 =======
 
 This document was written by Willy Tarreau <w@1wt.eu> on 2014/07/19 during an
-analysis of the decompression code available in Linux 3.16-rc5. The code is
-tricky, it is possible that this document contains mistakes or that a few
-corner cases were overlooked. In any case, please report any doubt, fix, or
-proposed updates to the author(s) so that the document can be updated.
+analysis of the decompression code available in Linux 3.16-rc5, and updated
+by Dave Rodgman <dave.rodgman@arm.com> on 2018/10/30 to introduce run-length
+encoding. The code is tricky, it is possible that this document contains
+mistakes or that a few corner cases were overlooked. In any case, please
+report any doubt, fix, or proposed updates to the author(s) so that the
+document can be updated.
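To make the byte-sequence description above concrete, a small standalone sketch
(not the kernel decompressor; expand_zero_run and its interface are invented for
illustration) that parses the version-1 zero-run instruction exactly as
documented:

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper, not the kernel code: recognise and expand the
 * version-1 zero-run instruction.  'in' points at the instruction byte
 * and must have at least 4 readable bytes; returns the 4 bytes consumed,
 * or 0 if this is not a zero-run instruction (or it would not fit in out).
 */
static size_t expand_zero_run(const uint8_t *in, uint8_t *out, size_t out_max,
                              int bitstream_version)
{
        uint16_t dist = (uint16_t)in[1] | ((uint16_t)in[2] << 8);
        size_t run;

        if (!bitstream_version || (in[0] & 0xf8) != 0x18)
                return 0;
        if ((dist & 0xfffc) != 0xfffc)          /* ordinary M4 copy instead */
                return 0;
        /* dist & 3 is the trailing-literal state, ignored in this sketch */

        run = (((size_t)in[3] << 3) | (in[0] & 0x7)) + 4;
        if (run > out_max)
                return 0;
        memset(out, 0, run);
        return 4;
}

For example, a run of 100 zero bytes is stored as the bytes 18 fc ff 0c (hex):
X = 12, L = 0, and ((12 << 3) | 0) + 4 = 100.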
diff --git a/include/linux/lzo.h b/include/linux/lzo.h
index 2ae27cb89927..547a86c71e1b 100644
--- a/include/linux/lzo.h
+++ b/include/linux/lzo.h
@@ -18,7 +18,7 @@
 #define LZO1X_1_MEM_COMPRESS    (8192 * sizeof(unsigned short))
 #define LZO1X_MEM_COMPRESS      LZO1X_1_MEM_COMPRESS
 
-#define lzo1x_worst_compress(x) ((x) + ((x) / 16) + 64 + 3)
+#define lzo1x_worst_compress(x) ((x) + ((x) / 16) + 64 + 3 + 2)
 
 /* This requires 'wrkmem' of size LZO1X_1_MEM_COMPRESS */
 int lzo1x_1_compress(const unsigned char *src, size_t src_len,
diff --git a/lib/lzo/lzo1x_compress.c b/lib/lzo/lzo1x_compress.c
index 236eb21167b5..89cd561201ff 100644
--- a/lib/lzo/lzo1x_compress.c
+++ b/lib/lzo/lzo1x_compress.c
@@ -20,7 +20,7 @@
 static noinline size_t
 lzo1x_1_do_compress(const unsigned char *in, size_t in_len,
                     unsigned char *out, size_t *out_len,
-                    size_t ti, void *wrkmem)
+                    size_t ti, void *wrkmem, signed char *state_offset)
 {
         const unsigned char *ip;
         unsigned char *op;
@@ -35,27 +35,85 @@ lzo1x_1_do_compress(const unsigned char *in, size_t in_len,
         ip += ti < 4 ? 4 - ti : 0;
 
         for (;;) {
-                const unsigned char *m_pos;
+                const unsigned char *m_pos = NULL;
                 size_t t, m_len, m_off;
                 u32 dv;
+                u32 run_length = 0;
 literal:
                 ip += 1 + ((ip - ii) >> 5);
 next:
                 if (unlikely(ip >= ip_end))
                         break;
                 dv = get_unaligned_le32(ip);
-                t = ((dv * 0x1824429d) >> (32 - D_BITS)) & D_MASK;
-                m_pos = in + dict[t];
-                dict[t] = (lzo_dict_t) (ip - in);
-                if (unlikely(dv != get_unaligned_le32(m_pos)))
-                        goto literal;
+
+                if (dv == 0) {
+                        const unsigned char *ir = ip + 4;
+                        const unsigned char *limit = ip_end
+                                < (ip + MAX_ZERO_RUN_LENGTH + 1)
+                                ? ip_end : ip + MAX_ZERO_RUN_LENGTH + 1;
+#if defined(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) && \
+        defined(LZO_FAST_64BIT_MEMORY_ACCESS)
+                        u64 dv64;
+
+                        for (; (ir + 32) <= limit; ir += 32) {
+                                dv64 = get_unaligned((u64 *)ir);
+                                dv64 |= get_unaligned((u64 *)ir + 1);
+                                dv64 |= get_unaligned((u64 *)ir + 2);
+                                dv64 |= get_unaligned((u64 *)ir + 3);
+                                if (dv64)
+                                        break;
+                        }
+                        for (; (ir + 8) <= limit; ir += 8) {
+                                dv64 = get_unaligned((u64 *)ir);
+                                if (dv64) {
+# if defined(__LITTLE_ENDIAN)
+                                        ir += __builtin_ctzll(dv64) >> 3;
+# elif defined(__BIG_ENDIAN)
+                                        ir += __builtin_clzll(dv64) >> 3;
+# else
+# error "missing endian definition"
+# endif
+                                        break;
+                                }
+                        }
+#else
+                        while ((ir < (const unsigned char *)
+                                        ALIGN((uintptr_t)ir, 4)) &&
+                                        (ir < limit) && (*ir == 0))
+                                ir++;
+                        for (; (ir + 4) <= limit; ir += 4) {
+                                dv = *((u32 *)ir);
+                                if (dv) {
+# if defined(__LITTLE_ENDIAN)
+                                        ir += __builtin_ctz(dv) >> 3;
+# elif defined(__BIG_ENDIAN)
+                                        ir += __builtin_clz(dv) >> 3;
+# else
+# error "missing endian definition"
+# endif
+                                        break;
+                                }
+                        }
+#endif
+                        while (likely(ir < limit) && unlikely(*ir == 0))
+                                ir++;
+                        run_length = ir - ip;
+                        if (run_length > MAX_ZERO_RUN_LENGTH)
+                                run_length = MAX_ZERO_RUN_LENGTH;
+                } else {
+                        t = ((dv * 0x1824429d) >> (32 - D_BITS)) & D_MASK;
+                        m_pos = in + dict[t];
+                        dict[t] = (lzo_dict_t) (ip - in);
+                        if (unlikely(dv != get_unaligned_le32(m_pos)))
+                                goto literal;
+                }
 
                 ii -= ti;
                 ti = 0;
                 t = ip - ii;
                 if (t != 0) {
                         if (t <= 3) {
-                                op[-2] |= t;
+                                op[*state_offset] |= t;
                                 COPY4(op, ii);
                                 op += t;
                         } else if (t <= 16) {
@@ -88,6 +146,17 @@ next:
                         }
                 }
 
+                if (unlikely(run_length)) {
+                        ip += run_length;
+                        run_length -= MIN_ZERO_RUN_LENGTH;
+                        put_unaligned_le32((run_length << 21) | 0xfffc18
+                                           | (run_length & 0x7), op);
+                        op += 4;
+                        run_length = 0;
+                        *state_offset = -3;
+                        goto finished_writing_instruction;
+                }
+
                 m_len = 4;
                 {
 #if defined(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) && defined(LZO_USE_CTZ64)
@@ -170,7 +239,6 @@ m_len_done:
 
                 m_off = ip - m_pos;
                 ip += m_len;
-                ii = ip;
                 if (m_len <= M2_MAX_LEN && m_off <= M2_MAX_OFFSET) {
                         m_off -= 1;
                         *op++ = (((m_len - 1) << 5) | ((m_off & 7) << 2));
@@ -207,6 +275,9 @@ m_len_done:
                         *op++ = (m_off << 2);
                         *op++ = (m_off >> 6);
                 }
+                *state_offset = -2;
+finished_writing_instruction:
+                ii = ip;
                 goto next;
         }
         *out_len = op - out;
@@ -221,6 +292,12 @@ int lzo1x_1_compress(const unsigned char *in, size_t in_len,
         unsigned char *op = out;
         size_t l = in_len;
         size_t t = 0;
+        signed char state_offset = -2;
+
+        // LZO v0 will never write 17 as first byte,
+        // so this is used to version the bitstream
+        *op++ = 17;
+        *op++ = LZO_VERSION;
 
         while (l > 20) {
                 size_t ll = l <= (M4_MAX_OFFSET + 1) ? l : (M4_MAX_OFFSET + 1);
@@ -229,7 +306,8 @@ int lzo1x_1_compress(const unsigned char *in, size_t in_len,
                         break;
                 BUILD_BUG_ON(D_SIZE * sizeof(lzo_dict_t) > LZO1X_1_MEM_COMPRESS);
                 memset(wrkmem, 0, D_SIZE * sizeof(lzo_dict_t));
-                t = lzo1x_1_do_compress(ip, ll, op, out_len, t, wrkmem);
+                t = lzo1x_1_do_compress(ip, ll, op, out_len,
+                                        t, wrkmem, &state_offset);
                 ip += ll;
                 op += *out_len;
                 l -= ll;
@@ -242,7 +320,7 @@ int lzo1x_1_compress(const unsigned char *in, size_t in_len,
         if (op == out && t <= 238) {
                 *op++ = (17 + t);
         } else if (t <= 3) {
-                op[-2] |= t;
+                op[state_offset] |= t;
         } else if (t <= 18) {
                 *op++ = (t - 3);
         } else {
diff --git a/lib/lzo/lzo1x_decompress_safe.c b/lib/lzo/lzo1x_decompress_safe.c
index a1c387f6afba..6d2600ea3b55 100644
--- a/lib/lzo/lzo1x_decompress_safe.c
+++ b/lib/lzo/lzo1x_decompress_safe.c
@@ -46,11 +46,23 @@ int lzo1x_decompress_safe(const unsigned char *in, size_t in_len,
         const unsigned char * const ip_end = in + in_len;
         unsigned char * const op_end = out + *out_len;
 
+        unsigned char bitstream_version;
+
         op = out;
         ip = in;
 
         if (unlikely(in_len < 3))
                 goto input_overrun;
+
+        if (likely(*ip == 17)) {
+                bitstream_version = ip[1];
+                ip += 2;
+                if (unlikely(in_len < 5))
+                        goto input_overrun;
+        } else {
+                bitstream_version = 0;
+        }
+
         if (*ip > 17) {
                 t = *ip++ - 17;
                 if (t < 4) {
@@ -154,32 +166,49 @@ copy_literal_run:
                         m_pos -= next >> 2;
                         next &= 3;
                 } else {
-                        m_pos = op;
-                        m_pos -= (t & 8) << 11;
-                        t = (t & 7) + (3 - 1);
-                        if (unlikely(t == 2)) {
-                                size_t offset;
-                                const unsigned char *ip_last = ip;
+                        NEED_IP(2);
+                        next = get_unaligned_le16(ip);
+                        if (((next & 0xfffc) == 0xfffc) &&
+                            ((t & 0xf8) == 0x18) &&
+                            likely(bitstream_version)) {
+                                NEED_IP(3);
+                                t &= 7;
+                                t |= ip[2] << 3;
+                                t += MIN_ZERO_RUN_LENGTH;
+                                NEED_OP(t);
+                                memset(op, 0, t);
+                                op += t;
+                                next &= 3;
+                                ip += 3;
+                                goto match_next;
+                        } else {
+                                m_pos = op;
+                                m_pos -= (t & 8) << 11;
+                                t = (t & 7) + (3 - 1);
+                                if (unlikely(t == 2)) {
+                                        size_t offset;
+                                        const unsigned char *ip_last = ip;
 
                                         while (unlikely(*ip == 0)) {
                                                 ip++;
                                                 NEED_IP(1);
                                         }
                                         offset = ip - ip_last;
                                         if (unlikely(offset > MAX_255_COUNT))
                                                 return LZO_E_ERROR;
 
                                         offset = (offset << 8) - offset;
                                         t += offset + 7 + *ip++;
                                         NEED_IP(2);
+                                        next = get_unaligned_le16(ip);
+                                }
+                                ip += 2;
+                                m_pos -= next >> 2;
+                                next &= 3;
+                                if (m_pos == op)
+                                        goto eof_found;
+                                m_pos -= 0x4000;
                         }
-                        next = get_unaligned_le16(ip);
-                        ip += 2;
-                        m_pos -= next >> 2;
-                        next &= 3;
-                        if (m_pos == op)
-                                goto eof_found;
-                        m_pos -= 0x4000;
                 }
                 TEST_LB(m_pos);
 #if defined(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS)
diff --git a/lib/lzo/lzodefs.h b/lib/lzo/lzodefs.h
index fa0a45fed8c4..ac64159ee344 100644
--- a/lib/lzo/lzodefs.h
+++ b/lib/lzo/lzodefs.h
@@ -13,6 +13,12 @@
  */
 
 
+/* Version
+ * 0: original lzo version
+ * 1: lzo with support for RLE
+ */
+#define LZO_VERSION 1
+
 #define COPY4(dst, src) \
         put_unaligned(get_unaligned((const u32 *)(src)), (u32 *)(dst))
 #if defined(CONFIG_X86_64) || defined(CONFIG_ARM64)
@@ -28,6 +34,7 @@
 #elif defined(CONFIG_X86_64) || defined(CONFIG_ARM64)
 #define LZO_USE_CTZ64  1
 #define LZO_USE_CTZ32  1
+#define LZO_FAST_64BIT_MEMORY_ACCESS
 #elif defined(CONFIG_X86) || defined(CONFIG_PPC)
 #define LZO_USE_CTZ32  1
 #elif defined(CONFIG_ARM) && (__LINUX_ARM_ARCH__ >= 5)
@@ -37,7 +44,7 @@
 #define M1_MAX_OFFSET  0x0400
 #define M2_MAX_OFFSET  0x0800
 #define M3_MAX_OFFSET  0x4000
-#define M4_MAX_OFFSET  0xbfff
+#define M4_MAX_OFFSET  0xbffe
 
 #define M1_MIN_LEN     2
 #define M1_MAX_LEN     2
@@ -53,6 +60,9 @@
 #define M3_MARKER      32
 #define M4_MARKER      16
 
+#define MIN_ZERO_RUN_LENGTH    4
+#define MAX_ZERO_RUN_LENGTH    (2047 + MIN_ZERO_RUN_LENGTH)
+
 #define lzo_dict_t     unsigned short
 #define D_BITS         13
 #define D_SIZE         (1u << D_BITS)
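As a final illustrative cross-check (again, not part of the patch) of how the
new constants relate to the documented run-length field:

#include <assert.h>

#define MIN_ZERO_RUN_LENGTH     4
#define MAX_ZERO_RUN_LENGTH     (2047 + MIN_ZERO_RUN_LENGTH)

int main(void)
{
        /* The stored field ((X << 3) | LLL) is 8 + 3 = 11 bits wide. */
        unsigned int max_field = (0xffu << 3) | 0x7u;

        assert(max_field == 2047);
        /* Longest encodable run of zeros: 2047 + 4 = 2051 bytes. */
        assert(max_field + MIN_ZERO_RUN_LENGTH == MAX_ZERO_RUN_LENGTH);
        return 0;
}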