Optimize CRC16 lookup table size #2300
base: unstable
Conversation
Thank you! This is very interesting. Some first thoughts:
- Have you done any benchmarks?
  My guess is that the large table is faster if all of the table is in L1 cache, so if the algorithm runs isolated in a benchmark loop, the old one will be faster. However, if the table is not in L1 cache, this new one will be faster.
  An idea is to just test using valkey-benchmark against a single cluster. Another idea is to use perf and/or a flame graph and look at the part of the time spent on crc16. If you are not familiar with benchmarking Valkey, there are others who can help with this (e.g. @xbasel, @SoftlyRaining, @roshkhatri).
- Since the whole file is completely rewritten, we can also drop the copyright and licence header and replace it with our shorter variant:

  /*
   * Copyright Valkey Contributors.
   * All rights reserved.
   * SPDX-License-Identifier: BSD 3-Clause
   */
- Let's move the old implementation with the large table to a unit test and use it to compare the old and new implementation. Let's create a new unit test in src/unit/test_crc16.c. Follow the other files in that directory to see how it's done. It's a very minimal unit test framework.
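A rough sketch of what such a test could look like (the TEST_ASSERT macro from test_help.h and the test-function signature are assumptions based on the neighboring files, so double-check them there; in the real test, the old 256-entry-table implementation should be pasted in, while this sketch uses a short bit-by-bit reference so that it stays self-contained):

#include <stdint.h>
#include <string.h>
#include "test_help.h"

uint16_t crc16(const char *buf, int len); /* the new implementation under test */

/* Bit-by-bit CRC-16/XMODEM (poly 0x1021), used here as an independent oracle.
 * In the real test this would be replaced by the old large-table crc16(). */
static uint16_t crc16_reference(const char *buf, int len) {
    uint16_t crc = 0;
    for (int i = 0; i < len; i++) {
        crc ^= (uint16_t)((uint8_t)buf[i]) << 8;
        for (int bit = 0; bit < 8; bit++)
            crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ 0x1021) : (uint16_t)(crc << 1);
    }
    return crc;
}

int test_crc16_against_reference(int argc, char **argv, int flags) {
    (void)argc;
    (void)argv;
    (void)flags;
    const char *inputs[] = {"", "123456789", "foo{bar}baz", "Lorem ipsum dolor sit amet"};
    for (size_t i = 0; i < sizeof(inputs) / sizeof(inputs[0]); i++) {
        int len = (int)strlen(inputs[i]);
        TEST_ASSERT(crc16(inputs[i], len) == crc16_reference(inputs[i], len));
    }
    TEST_ASSERT(crc16("123456789", 9) == 0x31C3); /* well-known XMODEM check value */
    return 0;
}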
src/crc16.c
Outdated
 /* CRC16 implementation according to CCITT standards.
  *
- * Note by @antirez: this is actually the XMODEM CRC 16 algorithm, using the
- * following parameters:
+ * Note by @cudachen: this is actually the XMODEM CRC-16 algorithm (with
+ * bit-reflected), using the following parameters:
  *
  * Name : "XMODEM", also known as "ZMODEM", "CRC-16/ACORN"
  * Width : 16 bit
- * Poly : 1021 (That is actually x^16 + x^12 + x^5 + 1)
- * Initialization : 0000
- * Reflect Input byte : False
- * Reflect Output CRC : False
- * Xor constant to output CRC : 0000
- * Output for "123456789" : 31C3
+ * Poly : 0x8408 (That is actually 0x1021, or x^16 + x^12 + x^5 + 1)
+ * Initialization : 0x0000
+ * Reflect Input byte : True
+ * Reflect Output CRC : True
+ * Xor constant to output CRC : 0x0000
+ * Output for "123456789" : 0x31C3
  */
Why change the description? Is the old description incorrect, or is it equivalent to your new description, just written in a reflected way...?
I think we can keep the old description, if it's not incorrect. It seems to be the most common way to describe this CRC-16 variant. For example, according to this page, the Xmodem variant has the parameters mentioned by Antirez.
If a reflected and a non-reflected variant are equivalent with just different numbers in the table, is it possible to avoid the reverse_8_bit() and reverse_16_bit() calls by just using different numbers in the table?
but yours is written in a reflected way
In my opinion, writing it in a reflected way has these advantages:
- No need to 1) shift the input byte up before XOR-ing it into the CRC, or 2) shift the CRC down when retrieving the entry from the LUT (masking the low nibble is enough).
  For example:

  /* a reflected way */
  for (i = 0; i < len; i++) {
      uint8_t v = (uint8_t)buf[i];
      v = reverse_8_bit(v); /* REQUIRED: reverse all bits */
      crc ^= v;
      crc = (crc >> 4) ^ tbl_reduced[crc & 0x000f];
      crc = (crc >> 4) ^ tbl_reduced[crc & 0x000f];
  }
  return reverse_16_bit(crc); /* REQUIRED: reverse all bits */

  /* a normal way */
  for (i = 0; i < len; i++) {
      uint16_t v = (uint16_t)(uint8_t)buf[i] << 8;
      crc ^= v;
      crc = (crc << 4) ^ tbl_forward_reduced[crc >> 12];
      crc = (crc << 4) ^ tbl_forward_reduced[crc >> 12];
  }
  return crc;
- Looks more intuitive.
  Many implementations use the reflected way (including carry-less-multiplication ones), so choosing it makes the code easy to compare against them and more readable.
However, my current implementation needs two additional XOR operations in order to get the same result as a normal calculation.
If a reflected and non-reflected variant are equivalent with just different numbers in the table, is it possible to avoid the reverse_8_bit() and reverse_16_bit() calls, if we just use different numbers in the table?
From my experiments, there is also a non-reflected variant of the half-byte table solution, shown below:
/* Adapted from: https://www.eevblog.com/forum/programming/crc32-half-bytenibble-lookup-table-algorithm/ */
// A forward LUT, just retrieve the first 16 entries from ordinary LUT.
static const uint16_t tbl_forward_reduced[16] = {
0x0000,0x1021,0x2042,0x3063,0x4084,0x50a5,0x60c6,0x70e7,
0x8108,0x9129,0xa14a,0xb16b,0xc18c,0xd1ad,0xe1ce,0xf1ef,
};
// CRC-16 main subroutine
uint16_t crc16(const char *buf, int len) {
int counter;
uint16_t crc = 0;
for(counter = 0; counter < len; counter++) {
crc ^= (uint16_t)buf[counter] << 8;
crc = (crc << 4) ^ tbl_forward_reduced[crc >> 12];
crc = (crc << 4) ^ tbl_forward_reduced[crc >> 12];
}
return crc;
}
To my current understanding, a normal calculation is equivalent to a bit-reversed calculation combined with reflecting the bits of the input bytes and the output CRC, and vice versa (e.g., CRC-32C is defined as a normal calculation with reflected input bytes and output CRC, yet many implementations choose the bit-reversed calculation because it removes the need to reflect the input bytes and the output CRC).
You may see this post for further information: https://github.com/Michaelangel007/crc32?tab=readme-ov-file#tldr-crc32-summary
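As a side note, both 16-entry tables can be regenerated directly from the polynomial: 0x1021 for the forward form and its bit-reversed counterpart 0x8408 for the reflected form. A small stand-alone sketch (my own illustration, using only standard C):

#include <stdint.h>
#include <stdio.h>

/* Prints tbl_forward_reduced[] (forward: nibble enters the top 4 bits, poly 0x1021)
 * and tbl_reduced[] (reflected: nibble enters the low 4 bits, poly 0x8408). */
int main(void) {
    printf("forward:\n");
    for (int n = 0; n < 16; n++) {
        uint16_t crc = (uint16_t)(n << 12);
        for (int bit = 0; bit < 4; bit++)
            crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ 0x1021) : (uint16_t)(crc << 1);
        printf("0x%04x,%s", crc, (n % 8 == 7) ? "\n" : "");
    }
    printf("reflected:\n");
    for (int n = 0; n < 16; n++) {
        uint16_t crc = (uint16_t)n;
        for (int bit = 0; bit < 4; bit++)
            crc = (crc & 1) ? (uint16_t)((crc >> 1) ^ 0x8408) : (uint16_t)(crc >> 1);
        printf("0x%04x,%s", crc, (n % 8 == 7) ? "\n" : "");
    }
    return 0;
}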
To sum up, I think we can try the normal calculation and then benchmark to choose the best implementation.
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@ Coverage Diff @@
## unstable #2300 +/- ##
============================================
+ Coverage 71.30% 71.45% +0.15%
============================================
Files 123 123
Lines 67122 67128 +6
============================================
+ Hits 47859 47967 +108
+ Misses 19263 19161 -102
Hi @zuiderkwast,
I haven't done benchmarks yet. By the way, how should I reach out to the people you mentioned for help (e.g., via GitHub issues)?
I will replace the copyright and licence header in future commits.
Thanks for saving my day! I was just wondering how Valkey executes its unit tests.
Hi @zuiderkwast, I would like to give an update on the benchmarking process.
In short, I observe the following phenomena:
I also provide the L1 cache miss counts in the tables below:
Test Environments
CPU
Code for Testing

#include <stdio.h>
#include <stdint.h>
#include <string.h>
#define NROUND 100000
static const uint16_t crc16tab[256]= {
0x0000,0x1021,0x2042,0x3063,0x4084,0x50a5,0x60c6,0x70e7,
0x8108,0x9129,0xa14a,0xb16b,0xc18c,0xd1ad,0xe1ce,0xf1ef,
0x1231,0x0210,0x3273,0x2252,0x52b5,0x4294,0x72f7,0x62d6,
0x9339,0x8318,0xb37b,0xa35a,0xd3bd,0xc39c,0xf3ff,0xe3de,
0x2462,0x3443,0x0420,0x1401,0x64e6,0x74c7,0x44a4,0x5485,
0xa56a,0xb54b,0x8528,0x9509,0xe5ee,0xf5cf,0xc5ac,0xd58d,
0x3653,0x2672,0x1611,0x0630,0x76d7,0x66f6,0x5695,0x46b4,
0xb75b,0xa77a,0x9719,0x8738,0xf7df,0xe7fe,0xd79d,0xc7bc,
0x48c4,0x58e5,0x6886,0x78a7,0x0840,0x1861,0x2802,0x3823,
0xc9cc,0xd9ed,0xe98e,0xf9af,0x8948,0x9969,0xa90a,0xb92b,
0x5af5,0x4ad4,0x7ab7,0x6a96,0x1a71,0x0a50,0x3a33,0x2a12,
0xdbfd,0xcbdc,0xfbbf,0xeb9e,0x9b79,0x8b58,0xbb3b,0xab1a,
0x6ca6,0x7c87,0x4ce4,0x5cc5,0x2c22,0x3c03,0x0c60,0x1c41,
0xedae,0xfd8f,0xcdec,0xddcd,0xad2a,0xbd0b,0x8d68,0x9d49,
0x7e97,0x6eb6,0x5ed5,0x4ef4,0x3e13,0x2e32,0x1e51,0x0e70,
0xff9f,0xefbe,0xdfdd,0xcffc,0xbf1b,0xaf3a,0x9f59,0x8f78,
0x9188,0x81a9,0xb1ca,0xa1eb,0xd10c,0xc12d,0xf14e,0xe16f,
0x1080,0x00a1,0x30c2,0x20e3,0x5004,0x4025,0x7046,0x6067,
0x83b9,0x9398,0xa3fb,0xb3da,0xc33d,0xd31c,0xe37f,0xf35e,
0x02b1,0x1290,0x22f3,0x32d2,0x4235,0x5214,0x6277,0x7256,
0xb5ea,0xa5cb,0x95a8,0x8589,0xf56e,0xe54f,0xd52c,0xc50d,
0x34e2,0x24c3,0x14a0,0x0481,0x7466,0x6447,0x5424,0x4405,
0xa7db,0xb7fa,0x8799,0x97b8,0xe75f,0xf77e,0xc71d,0xd73c,
0x26d3,0x36f2,0x0691,0x16b0,0x6657,0x7676,0x4615,0x5634,
0xd94c,0xc96d,0xf90e,0xe92f,0x99c8,0x89e9,0xb98a,0xa9ab,
0x5844,0x4865,0x7806,0x6827,0x18c0,0x08e1,0x3882,0x28a3,
0xcb7d,0xdb5c,0xeb3f,0xfb1e,0x8bf9,0x9bd8,0xabbb,0xbb9a,
0x4a75,0x5a54,0x6a37,0x7a16,0x0af1,0x1ad0,0x2ab3,0x3a92,
0xfd2e,0xed0f,0xdd6c,0xcd4d,0xbdaa,0xad8b,0x9de8,0x8dc9,
0x7c26,0x6c07,0x5c64,0x4c45,0x3ca2,0x2c83,0x1ce0,0x0cc1,
0xef1f,0xff3e,0xcf5d,0xdf7c,0xaf9b,0xbfba,0x8fd9,0x9ff8,
0x6e17,0x7e36,0x4e55,0x5e74,0x2e93,0x3eb2,0x0ed1,0x1ef0
};
static const uint16_t tbl_reduced[16] = {
0x0000,0x1081,0x2102,0x3183,0x4204,0x5285,0x6306,0x7387,0x8408,0x9489,0xa50a,0xb58b,0xc60c,0xd68d,0xe70e,0xf78f,
};
static const uint16_t tbl_forward_reduced[16] = {
0x0000,0x1021,0x2042,0x3063,0x4084,0x50a5,0x60c6,0x70e7,
0x8108,0x9129,0xa14a,0xb16b,0xc18c,0xd1ad,0xe1ce,0xf1ef,
};
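/* Reverses the bit order within each byte of n. */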
#define REVERSE_8_bit(n) \
do { \
n = ((n & 0xf0f0f0f0) >> 4) | ((n & 0x0f0f0f0f) << 4); \
n = ((n & 0xcccccccc) >> 2) | ((n & 0x33333333) << 2); \
n = ((n & 0xaaaaaaaa) >> 1) | ((n & 0x55555555) << 1); \
} while(0)
static inline uint8_t reverse_bit_8(uint8_t n) {
REVERSE_8_bit(n);
return n;
}
static inline uint16_t reverse_bit_16(uint16_t n) {
n = ((n & 0xff00ff00) >> 8) | ((n & 0x00ff00ff) << 8);
REVERSE_8_bit(n);
return n;
}
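/* Original implementation: full 256-entry LUT, one table lookup per input byte. */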
__attribute__((noinline)) uint16_t crc16_orig(const char *buf, int len) {
int counter;
uint16_t crc = 0;
for (counter = 0; counter < len; counter++)
crc = (crc<<8) ^ crc16tab[((crc>>8) ^ *buf++)&0x00FF];
return crc;
}
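/* Reflected half-byte variant: 16-entry LUT, two lookups per byte; input bytes and the final CRC are bit-reversed. */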
static inline uint16_t crc16_reverse_base(uint16_t crc, uint8_t v) {
crc ^= v;
crc = (crc >> 4) ^ tbl_reduced[crc & 0x000f];
crc = (crc >> 4) ^ tbl_reduced[crc & 0x000f];
return crc;
}
__attribute__((noinline)) uint16_t crc16_reverse(const char *buf, int len) {
int counter;
uint16_t crc = 0;
for(counter = 0; counter < len; counter++) {
uint8_t tmp = reverse_bit_8((uint8_t)buf[counter]);
crc = crc16_reverse_base(crc, tmp);
}
return reverse_bit_16(crc);
}
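/* Forward (non-reflected) half-byte variant: 16-entry LUT, two lookups per byte, no bit reversal needed. */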
static inline uint16_t crc16_forward_base(uint16_t crc, uint8_t v) {
crc ^= v << 8;
crc = (crc << 4) ^ tbl_forward_reduced[crc >> 12];
crc = (crc << 4) ^ tbl_forward_reduced[crc >> 12];
return crc;
}
__attribute__((noinline)) static uint16_t crc16_forward(const char *buf, int len) {
int counter;
uint16_t crc = 0;
for(counter = 0; counter < len; counter++) {
uint8_t tmp = (uint8_t)buf[counter];
crc = crc16_forward_base(crc, tmp);
}
return crc;
}
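/* Optimization barrier: the empty asm with a "memory" clobber keeps the compiler from optimizing the benchmark loop away. */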
__attribute__((always_inline))
static inline void escape(void *p) {
__asm__ volatile ("" : : "r"(p) : "memory");
}
int main() {
const char li[] = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed "
"do eiusmod tempor incididunt ut labore et dolore magna "
"aliqua. Ut enim ad minim veniam, quis nostrud exercitation "
"ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis "
"aute irure dolor in reprehenderit in voluptate velit esse "
"cillum dolore eu fugiat nulla pariatur. Excepteur sint "
"occaecat cupidatat non proident, sunt in culpa qui officia "
"deserunt mollit anim id est laborum.";
uint16_t ret;
for(int i = 0; i < NROUND; i++) {
escape(li);
ret = crc16_forward(li, strlen(li));
escape(li);
}
printf("0x%x\n", ret);
return 0;
}
Statistics of Each Implementation
Detailed L1 Cache Miss Counts of Each Implementation
original
half-byte LUT forward calculation
half-byte LUT reverse calculation
Sorry for the delay. I have time off now during the summer. I will get to it eventually.

I'm looking at your micro benchmark. The reason the large LUT is faster is because the LUT is in L1 cache, isn't it? We should test this in a real scenario to see if the LUT will be in L1 cache or not. If it isn't in L1 cache, then I guess the small LUT will be faster, for a ~20 byte payload.

Another idea (regardless of big or small LUT) can be to prefetch the LUT into L1 cache some short time before we need it, i.e. we could provide some API to call.

I hope more people can find the time to look at this. @hpatro?
Hi @zuiderkwast,
It's okay. I figured you were busy with more than 10 other issues to review.
After thinking about it for half a day, here is my guess at what happens under the hood:
I consider this might be a direction worth exploring. What's more, I have an idea: we could choose the computation method based on the "entropy" of the input bytes (such as the distance between bytes) when prefetching the contents of the LUT.
I have been busy with an interview these days. I will start the other benchmarks after this week.
@Cuda-Chen Thanks for the update. Feel free to take your time and share the results at your convenience.
Hi @zuiderkwast and @hpatro, I would like to give an update on the benchmarking. For running the benchmark I use
Benchmark Results
original
half-byte LUT forward calculation
half-byte LUT reverse calculation
Findings
What I will Do Next
First will be prefetching the LUT, as mentioned in the previous comment. What's more, I would like to experiment with prefetching the input data as well (as the half-byte LUT takes 32 bytes, it may be beneficial to prefetch the input data if we set the maximum size of the input data to 32 bytes, utilizing the whole 64 bytes of L1 cache).
This just runs the integration test of the benchmark tool. It only tests that the benchmark tool is working correctly; it does not run any real benchmark. To run benchmarks, use

Additionally, the server needs to run in cluster mode. Otherwise, CRC-16 is not used at all.
Actually, it might be very difficult to see any difference at all in a benchmark. I think it is very hard to benchmark these differences without good knowledge of large parts of this code base. I guess we would need to use perf or a flame graph to see the part of the execution that consists of the crc16 computation. The crc16 computation is already a very small part of the total execution time. I wish I had some time to assist you with this, but I'm off for vacation. I guess others are off too during the summer. Maybe it's better to wait until people are back from vacation.
To understand when to call this, we need to read code where the crc16 is called and read backwards to find a good point in time to do the prefetching. It should be done early enough to give enough time for the memory to be read into L1 cache in the background while some other code is executing. Then, when the crc16 code needs the LUT it will already be in L1 cache. We can't really do it inside crc16 itself just before we need to use it. That would be too late. The main place crc16 is called is from keyHashSlot() which is called from clusterSlotByCommand() in cluster.c. My guess is that the beginning of clusterSlotByCommand() can be a good place to prefetch the LUT. It could be enough time before the crc16 will actually access the LUT.
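A minimal sketch of what such a prefetch helper could look like (assuming GCC/Clang's __builtin_prefetch; the helper name is made up for illustration, and it assumes the table, here the original 512-byte crc16tab which is currently static in crc16.c, is made reachable from the call site):

#include <stdint.h>

extern const uint16_t crc16tab[256]; /* assumption: table exported from crc16.c */

/* Touch each of the table's 8 cache lines (64 bytes = 32 uint16_t entries per line)
 * so it is likely in L1 by the time crc16() actually reads it. */
static inline void crc16_prefetch_table(void) {
    for (int i = 0; i < 256; i += 32)
        __builtin_prefetch(&crc16tab[i], 0, 3); /* read access, high temporal locality */
}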
64 bytes is not the total size of the L1 cache. It is the size of one cache line. Memory is fetched into the cache in consecutive blocks of this size. The LUT and the input data will almost certainly be in different cache lines because they are not stored together in the same 64-byte range in memory. The total L1 cache size is small, but not that small; it can store at least a few hundred cache lines, or something like that.

Btw, the input has typically just been accessed by the command parser before the crc16 calculation happens, so I would assume it is already in the L1 cache, or at least in L2 or L3 cache, so my guess is that it is not useful to prefetch the input data.
Hi @zuiderkwast,
Thanks for pointing that out. I will adjust the benchmarking setup accordingly.
Got it. Please go at your own pace.
Thanks for the hint. I will find the functions that call crc16 and run some experiments.
c454158 to 670fbe0
Apply half-byte lookup table for fitting the whole table into CPU's L1 cacheline (usually with size of 64 Bytes). For an ordinary CRC16 lookup table, it needs 512 Bytes (2 Bytes for each polynomial * 256 entries for the bit permutation of one Byte). Whereas this implementation merely needs 32 Bytes (2 Bytes for each polynomial * 16 entries for the bit permutation of one Byte). Mention valkey-io#2031.

Signed-off-by: Cuda Chen <clh960524@gmail.com>
670fbe0 to 1607895
Apply half-byte lookup table for fitting the whole table into CPU's L1 cacheline (usually with size of 64 Bytes).
For an ordinary CRC16 lookup table, it needs 512 Bytes (2 Bytes for each polynomial * 256 entries for the bit permutation of one Byte), whereas this implementation merely needs 32 Bytes (2 Bytes for each polynomial * 16 entries for the bit permutation of one Byte).
Mention #2031.