
Optimize CRC16 lookup table size #2300


Open
wants to merge 1 commit into base: unstable

Conversation


@Cuda-Chen Cuda-Chen commented Jul 3, 2025

Apply a half-byte (nibble) lookup table so that the whole table fits into a CPU L1 cache line (usually 64 bytes in size).

An ordinary CRC16 lookup table needs 512 bytes (2 bytes per entry *
256 entries, one for each possible value of a byte),
whereas this implementation needs only 32 bytes (2 bytes per entry *
16 entries, one for each possible value of a nibble).

Mention #2031.
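
As a side note (not part of the patch), the 16-entry half-byte table can be generated directly from the CRC-16/XMODEM polynomial 0x1021. A minimal generator sketch:

#include <stdint.h>
#include <stdio.h>

/* Generate the 16-entry nibble table for the non-reflected CRC-16/XMODEM
 * polynomial 0x1021: entry n is the remainder left after shifting nibble n
 * out of the top 4 bits of the CRC register. */
int main(void) {
    uint16_t tbl[16];
    for (int n = 0; n < 16; n++) {
        uint16_t r = (uint16_t)(n << 12); /* place the nibble in the top 4 bits */
        for (int k = 0; k < 4; k++)
            r = (r & 0x8000) ? (uint16_t)((r << 1) ^ 0x1021) : (uint16_t)(r << 1);
        tbl[n] = r;
    }
    for (int n = 0; n < 16; n++)
        printf("0x%04x,%c", tbl[n], (n % 8 == 7) ? '\n' : ' ');
    return 0;
}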

Contributor

@zuiderkwast zuiderkwast left a comment


Thank you! This is very interesting. Some first thoughts:

  1. Have you done any benchmarks?

    My guess is that the large table is faster if all of the table is in L1 cache, so if the algorithm runs in isolation in a benchmark loop, the old one will be faster. However, if the table is not in L1 cache, this new one will be faster.

    An idea is to just test using valkey-benchmark against a single cluster. Another idea is to use perf and/or a flame graph and look at the part spent on crc16. If you are not familiar with benchmarking Valkey, there are others who can help with this (e.g. @xbasel, @SoftlyRaining, @roshkhatri).

  2. Since the whole file is completely rewritten, we can also drop the copyright and licence header and replace it with our shorter variant

    /*
     * Copyright Valkey Contributors.
     * All rights reserved.
     * SPDX-License-Identifier: BSD 3-Clause
     */
    
  3. Let's move the old implementation with the large table to a unit test and use it to compare the old and new implementations (a rough sketch follows below). Let's create a new unit test in src/unit/test_crc16.c. Follow the other files in that directory to see how it's done. It's a very minimal unit test framework.
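
A rough sketch of what such a test could look like is below. The include lines, the test function signature, and the TEST_ASSERT/UNUSED macros are assumptions about the conventions in src/unit/ (check the existing files there for the real ones), and the reference implementation here is a short bit-by-bit CRC instead of the old 256-entry table (which would be pasted in instead), so treat it only as a starting point.

/* src/unit/test_crc16.c (sketch) */
#include <stdint.h>
#include <string.h>

#include "../crc16.c"  /* assumed include style; adjust to the real convention */
#include "test_help.h" /* assumed to provide TEST_ASSERT and UNUSED */

/* Reference: plain bit-by-bit CRC-16/XMODEM (poly 0x1021, init 0, no reflection).
 * In the real test, the old 256-entry-table implementation would go here instead. */
static uint16_t crc16_reference(const char *buf, int len) {
    uint16_t crc = 0;
    for (int i = 0; i < len; i++) {
        crc ^= (uint16_t)((uint8_t)buf[i]) << 8;
        for (int bit = 0; bit < 8; bit++)
            crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ 0x1021) : (uint16_t)(crc << 1);
    }
    return crc;
}

int test_crc16_matches_reference(int argc, char **argv, int flags) {
    UNUSED(argc);
    UNUSED(argv);
    UNUSED(flags);
    const char *inputs[] = {"", "123456789", "foobar", "key:{hashtag}"};
    for (size_t i = 0; i < sizeof(inputs) / sizeof(inputs[0]); i++) {
        int len = (int)strlen(inputs[i]);
        TEST_ASSERT(crc16(inputs[i], len) == crc16_reference(inputs[i], len));
    }
    TEST_ASSERT(crc16("123456789", 9) == 0x31C3); /* documented check value */
    return 0;
}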

src/crc16.c (outdated)
  /* CRC16 implementation according to CCITT standards.
   *
 - * Note by @antirez: this is actually the XMODEM CRC 16 algorithm, using the
 - * following parameters:
 + * Note by @cudachen: this is actually the XMODEM CRC-16 algorithm (with
 + * bit-reflected), using the following parameters:
   *
   * Name                       : "XMODEM", also known as "ZMODEM", "CRC-16/ACORN"
   * Width                      : 16 bit
 - * Poly                       : 1021 (That is actually x^16 + x^12 + x^5 + 1)
 - * Initialization             : 0000
 - * Reflect Input byte         : False
 - * Reflect Output CRC         : False
 - * Xor constant to output CRC : 0000
 - * Output for "123456789"     : 31C3
 + * Poly                       : 0x8408 (That is actually 0x1021, or x^16 + x^12 + x^5 + 1)
 + * Initialization             : 0x0000
 + * Reflect Input byte         : True
 + * Reflect Output CRC         : True
 + * Xor constant to output CRC : 0x0000
 + * Output for "123456789"     : 0x31C3
   */
Contributor

@zuiderkwast zuiderkwast Jul 4, 2025


Why change the description? Is the old description not correct, or is it equivalent to your new description, but yours is written in a reflected way...?

I think we can keep the old description, if it's not incorrect. It seems to be the most common way to describe this CRC-16 variant. For example, according to this page, the Xmodem variant has the parameters mentioned by Antirez:

If a reflected and non-reflected variant are equivalent with just different numbers in the table, is it possible to avoid the reverse_8_bit() and reverse_16_bit() calls, if we just use different numbers in the table?

@Cuda-Chen (Author)


but yours is written in a reflected way

In my opinion, writing it in a reflected way has these advantages:

  1. No need to 1) shift the input byte up before XORing it into the CRC, or 2) shift the CRC value down before indexing the LUT.
    For example:

    // a reflected way
    for (i = 0; i < len; i++) {
        uint8_t v = (uint8_t)buf[i];
        v = reverse_8_bit(v); // REQUIRED: reverse all bits
        crc ^= v;
        crc = (crc >> 4) ^ tbl_reduced[crc & 0x000f];
        crc = (crc >> 4) ^ tbl_reduced[crc & 0x000f];
    }
    return reverse_16_bit(crc); // REQUIRED: reverse all bits

    // a normal way
    for (i = 0; i < len; i++) {
        uint16_t v = (uint16_t)(uint8_t)buf[i] << 8;
        crc ^= v;
        crc = (crc << 4) ^ tbl_forward_reduced[crc >> 12];
        crc = (crc << 4) ^ tbl_forward_reduced[crc >> 12];
    }

    return crc;
  2. Looks more intuitive.
    Many implementations use the reflected way (including carry-less multiplication ones), so choosing it makes comparison against them easier and improves code readability.

However, my current implementation needs two additional bit-reversal operations in order to get the same result as a normal calculation.

If a reflected and non-reflected variant are equivalent with just different numbers in the table, is it possible to avoid the reverse_8_bit() and reverse_16_bit() calls, if we just use different numbers in the table?

From my experiments, there is also a non-reflected variant of the half-byte table solution, shown below:

/* Adapted from: https://www.eevblog.com/forum/programming/crc32-half-bytenibble-lookup-table-algorithm/ */

// A forward LUT: simply the first 16 entries of the ordinary 256-entry LUT.
static const uint16_t tbl_forward_reduced[16] = { 
    0x0000,0x1021,0x2042,0x3063,0x4084,0x50a5,0x60c6,0x70e7,
    0x8108,0x9129,0xa14a,0xb16b,0xc18c,0xd1ad,0xe1ce,0xf1ef,
};

// CRC-16 main subroutine
uint16_t crc16(const char *buf, int len) {
    int counter;
    uint16_t crc = 0;
    for(counter = 0; counter < len; counter++) {
        crc ^= (uint16_t)buf[counter] << 8;
        crc = (crc << 4) ^ tbl_forward_reduced[crc >> 12];
        crc = (crc << 4) ^ tbl_forward_reduced[crc >> 12];
    }  
    return crc;
}

As I currently understand it, a normal calculation is equivalent to a bit-reversed calculation combined with reflecting the bits of the input bytes and the output CRC, and vice versa. For example, CRC32C is defined as a normal calculation with reflected input bytes and output CRC, yet many implementations choose the bit-reversed calculation because it removes the need to reflect the input bytes and the output CRC.
You may see this post for further information: https://github.com/Michaelangel007/crc32?tab=readme-ov-file#tldr-crc32-summary

To sum up, I think we can try the normal calculation and then benchmark to choose the best implementation.
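
As a quick sanity check of the forward (non-reflected) half-byte variant, here is a self-contained snippet (mine, not part of the patch) that reproduces the documented check value 0x31C3 for the ASCII string "123456789":

#include <stdint.h>
#include <stdio.h>
#include <string.h>

static const uint16_t tbl_forward_reduced[16] = {
    0x0000,0x1021,0x2042,0x3063,0x4084,0x50a5,0x60c6,0x70e7,
    0x8108,0x9129,0xa14a,0xb16b,0xc18c,0xd1ad,0xe1ce,0xf1ef,
};

/* Forward half-byte (nibble-at-a-time) CRC-16/XMODEM, as in the snippet above. */
static uint16_t crc16_nibble(const char *buf, int len) {
    uint16_t crc = 0;
    for (int i = 0; i < len; i++) {
        crc ^= (uint16_t)((uint8_t)buf[i]) << 8;
        crc = (uint16_t)((crc << 4) ^ tbl_forward_reduced[crc >> 12]);
        crc = (uint16_t)((crc << 4) ^ tbl_forward_reduced[crc >> 12]);
    }
    return crc;
}

int main(void) {
    const char *s = "123456789";
    printf("0x%04X\n", crc16_nibble(s, (int)strlen(s))); /* expected: 0x31C3 */
    return 0;
}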

@zuiderkwast zuiderkwast requested a review from xbasel July 4, 2025 17:30

codecov bot commented Jul 4, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 71.45%. Comparing base (663dac9) to head (1607895).

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #2300      +/-   ##
============================================
+ Coverage     71.30%   71.45%   +0.15%     
============================================
  Files           123      123              
  Lines         67122    67128       +6     
============================================
+ Hits          47859    47967     +108     
+ Misses        19263    19161     -102     
Files with missing lines Coverage Δ
src/crc16.c 100.00% <100.00%> (ø)

... and 12 files with indirect coverage changes


@Cuda-Chen
Author

Hi @zuiderkwast ,

  1. Have you done any benchmarks?
    ...
    If you are not familiar with benchmarking Valkey, there are others who can help with this (e.g. @xbasel, @SoftlyRaining, @roshkhatri).

I haven't done benchmarks yet.
Still, thanks for pointing out some directions for benchmarking, including an initial entry point into how Valkey does benchmarking. I will get familiar with the test framework over the next few days.

By the way, how should I reach out to the people you mentioned for help (e.g., via GitHub issues)?

  2. Since the whole file is completely rewritten, we can also drop the copyright and licence header and replace it with our shorter variant

I will replace the copyright and licence header in future commits.

  3. Let's move the old implementation with the large table to a unit test and use it to compare the old and new implementations. Let's create a new unit test in src/unit/test_crc16.c. Follow the other files in that directory to see how it's done. It's a very minimal unit test framework.

Thanks for saving my day! I was just wondering how Valkey runs its unit tests.

@Cuda-Chen
Author

Cuda-Chen commented Jul 7, 2025

Hi @zuiderkwast ,

I would like to give an update on the benchmarking process.
The code and the results are posted in the following comments.
For testing, here are the key factors:

  • I created a code snippet for testing the implementations (original, forward, and reverse calculation).
  • The test program is compiled as follows: gcc test.c -O2 -Wall -g -fno-omit-frame-pointer
  • Use perf record -g -e cycles,instructions,L1-dcache-load-misses ./a.out to save the results.
  • Use perf report -g graph,0.5,caller to view the detailed results.
  • Use perf stat -g -e cycles,instructions,L1-dcache-load-misses ./a.out to get an overview of the results.

In short, I observe the following phenomena:

  1. Both half-byte LUT implementations have longer execution times.
  2. The forward calculation has fewer L1 cache misses compared to the original one.
  3. The reverse calculation has more L1 cache misses compared to the original one.
    • The reason is that the bit-reversal functions add some L1 cache misses.

I also provide the L1 cache miss counts as a table:

original   forward   reverse
10727      9809      12288

@Cuda-Chen
Author

Test Environments

CPU

$ lscpu
Architecture:             x86_64
  CPU op-mode(s):         32-bit, 64-bit
  Address sizes:          36 bits physical, 48 bits virtual
  Byte Order:             Little Endian
CPU(s):                   4
  On-line CPU(s) list:    0-3
Vendor ID:                GenuineIntel
  Model name:             Intel(R) Core(TM) i5-3230M CPU @ 2.60GHz
    CPU family:           6
    Model:                58
    Thread(s) per core:   2
    Core(s) per socket:   2
    Socket(s):            1
    Stepping:             9
    CPU max MHz:          3200.0000
    CPU min MHz:          1200.0000
    BogoMIPS:             5187.90
    Flags:                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts a
                          cpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon peb
                          s bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monit
                          or ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadl
                          ine_timer aes xsave avx f16c rdrand lahf_lm cpuid_fault epb pti ssbd ibrs ibpb stibp 
                          tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms xsaveopt dtherm ida arat pln
                           pts md_clear flush_l1d
Virtualization features:  
  Virtualization:         VT-x
Caches (sum of all):      
  L1d:                    64 KiB (2 instances)
  L1i:                    64 KiB (2 instances)
  L2:                     512 KiB (2 instances)
  L3:                     3 MiB (1 instance)
NUMA:                     
  NUMA node(s):           1
  NUMA node0 CPU(s):      0-3
Vulnerabilities:          
  Gather data sampling:   Not affected
  Itlb multihit:          KVM: Mitigation: VMX disabled
  L1tf:                   Mitigation; PTE Inversion; VMX conditional cache flushes, SMT vulnerable
  Mds:                    Mitigation; Clear CPU buffers; SMT vulnerable
  Meltdown:               Mitigation; PTI
  Mmio stale data:        Unknown: No mitigations
  Reg file data sampling: Not affected
  Retbleed:               Not affected
  Spec rstack overflow:   Not affected
  Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl and seccomp
  Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:             Mitigation; Retpolines; IBPB conditional; IBRS_FW; STIBP conditional; RSB filling; PB
                          RSB-eIBRS Not affected; BHI Not affected
  Srbds:                  Vulnerable: No microcode
  Tsx async abort:        Not affected

Code for Testing

#include <stdio.h>
#include <stdint.h>
#include <string.h>

#define NROUND 100000

static const uint16_t crc16tab[256]= {
    0x0000,0x1021,0x2042,0x3063,0x4084,0x50a5,0x60c6,0x70e7,
    0x8108,0x9129,0xa14a,0xb16b,0xc18c,0xd1ad,0xe1ce,0xf1ef,
    0x1231,0x0210,0x3273,0x2252,0x52b5,0x4294,0x72f7,0x62d6,
    0x9339,0x8318,0xb37b,0xa35a,0xd3bd,0xc39c,0xf3ff,0xe3de,
    0x2462,0x3443,0x0420,0x1401,0x64e6,0x74c7,0x44a4,0x5485,
    0xa56a,0xb54b,0x8528,0x9509,0xe5ee,0xf5cf,0xc5ac,0xd58d,
    0x3653,0x2672,0x1611,0x0630,0x76d7,0x66f6,0x5695,0x46b4,
    0xb75b,0xa77a,0x9719,0x8738,0xf7df,0xe7fe,0xd79d,0xc7bc,
    0x48c4,0x58e5,0x6886,0x78a7,0x0840,0x1861,0x2802,0x3823,
    0xc9cc,0xd9ed,0xe98e,0xf9af,0x8948,0x9969,0xa90a,0xb92b,
    0x5af5,0x4ad4,0x7ab7,0x6a96,0x1a71,0x0a50,0x3a33,0x2a12,
    0xdbfd,0xcbdc,0xfbbf,0xeb9e,0x9b79,0x8b58,0xbb3b,0xab1a,
    0x6ca6,0x7c87,0x4ce4,0x5cc5,0x2c22,0x3c03,0x0c60,0x1c41,
    0xedae,0xfd8f,0xcdec,0xddcd,0xad2a,0xbd0b,0x8d68,0x9d49,
    0x7e97,0x6eb6,0x5ed5,0x4ef4,0x3e13,0x2e32,0x1e51,0x0e70,
    0xff9f,0xefbe,0xdfdd,0xcffc,0xbf1b,0xaf3a,0x9f59,0x8f78,
    0x9188,0x81a9,0xb1ca,0xa1eb,0xd10c,0xc12d,0xf14e,0xe16f,
    0x1080,0x00a1,0x30c2,0x20e3,0x5004,0x4025,0x7046,0x6067,
    0x83b9,0x9398,0xa3fb,0xb3da,0xc33d,0xd31c,0xe37f,0xf35e,
    0x02b1,0x1290,0x22f3,0x32d2,0x4235,0x5214,0x6277,0x7256,
    0xb5ea,0xa5cb,0x95a8,0x8589,0xf56e,0xe54f,0xd52c,0xc50d,
    0x34e2,0x24c3,0x14a0,0x0481,0x7466,0x6447,0x5424,0x4405,
    0xa7db,0xb7fa,0x8799,0x97b8,0xe75f,0xf77e,0xc71d,0xd73c,
    0x26d3,0x36f2,0x0691,0x16b0,0x6657,0x7676,0x4615,0x5634,
    0xd94c,0xc96d,0xf90e,0xe92f,0x99c8,0x89e9,0xb98a,0xa9ab,
    0x5844,0x4865,0x7806,0x6827,0x18c0,0x08e1,0x3882,0x28a3,
    0xcb7d,0xdb5c,0xeb3f,0xfb1e,0x8bf9,0x9bd8,0xabbb,0xbb9a,
    0x4a75,0x5a54,0x6a37,0x7a16,0x0af1,0x1ad0,0x2ab3,0x3a92,
    0xfd2e,0xed0f,0xdd6c,0xcd4d,0xbdaa,0xad8b,0x9de8,0x8dc9,
    0x7c26,0x6c07,0x5c64,0x4c45,0x3ca2,0x2c83,0x1ce0,0x0cc1,
    0xef1f,0xff3e,0xcf5d,0xdf7c,0xaf9b,0xbfba,0x8fd9,0x9ff8,
    0x6e17,0x7e36,0x4e55,0x5e74,0x2e93,0x3eb2,0x0ed1,0x1ef0
};

static const uint16_t tbl_reduced[16] = {
    0x0000,0x1081,0x2102,0x3183,0x4204,0x5285,0x6306,0x7387,0x8408,0x9489,0xa50a,0xb58b,0xc60c,0xd68d,0xe70e,0xf78f,
};

static const uint16_t tbl_forward_reduced[16] = { 
    0x0000,0x1021,0x2042,0x3063,0x4084,0x50a5,0x60c6,0x70e7,
    0x8108,0x9129,0xa14a,0xb16b,0xc18c,0xd1ad,0xe1ce,0xf1ef,
};

#define REVERSE_8_bit(n) \
    do { \
        n = ((n & 0xf0f0f0f0) >> 4) | ((n & 0x0f0f0f0f) << 4); \
        n = ((n & 0xcccccccc) >> 2) | ((n & 0x33333333) << 2); \
        n = ((n & 0xaaaaaaaa) >> 1) | ((n & 0x55555555) << 1); \
    } while (0)

static inline uint8_t reverse_bit_8(uint8_t n) {
    REVERSE_8_bit(n);
    return n;
}

static inline uint16_t reverse_bit_16(uint16_t n) {
    n = ((n & 0xff00ff00) >> 8) | ((n & 0x00ff00ff) << 8);
    REVERSE_8_bit(n);
    return n;
}

__attribute__((noinline)) uint16_t crc16_orig(const char *buf, int len) {
    int counter;
    uint16_t crc = 0;
    for (counter = 0; counter < len; counter++)
            crc = (crc<<8) ^ crc16tab[((crc>>8) ^ *buf++)&0x00FF];
    return crc;
}


static inline uint16_t crc16_reverse_base(uint16_t crc, uint8_t v) {
    crc ^= v;
    crc = (crc >> 4) ^ tbl_reduced[crc & 0x000f];
    crc = (crc >> 4) ^ tbl_reduced[crc & 0x000f];
    return crc;
}

__attribute__((noinline)) uint16_t crc16_reverse(const char *buf, int len) {
    int counter;
    uint16_t crc = 0;
    for(counter = 0; counter < len; counter++) {
        uint8_t tmp = reverse_bit_8((uint8_t)buf[counter]);
        crc = crc16_reverse_base(crc, tmp);
    }
    return reverse_bit_16(crc); 
}

static inline uint16_t crc16_forward_base(uint16_t crc, uint8_t v) {
    crc ^= v << 8;
    crc = (crc << 4) ^ tbl_forward_reduced[crc >> 12];
    crc = (crc << 4) ^ tbl_forward_reduced[crc >> 12];
    return crc;
}

__attribute__((noinline)) static uint16_t crc16_forward(const char *buf, int len) {
    int counter;
    uint16_t crc = 0;
    for(counter = 0; counter < len; counter++) {
        uint8_t tmp = (uint8_t)buf[counter];
        crc = crc16_forward_base(crc, tmp);
    }  

    return crc;
}

__attribute__((always_inline))
static inline void escape(void *p) {
    __asm__ volatile ("" : : "r"(p) : "memory");
}

int main() {
    const char li[] = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed "
                         "do eiusmod tempor incididunt ut labore et dolore magna "
                         "aliqua. Ut enim ad minim veniam, quis nostrud exercitation "
                         "ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis "
                         "aute irure dolor in reprehenderit in voluptate velit esse "
                         "cillum dolore eu fugiat nulla pariatur. Excepteur sint "
                         "occaecat cupidatat non proident, sunt in culpa qui officia "
                         "deserunt mollit anim id est laborum.";
    uint16_t ret;
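    /* NOTE: crc16_forward is measured here; presumably crc16_orig or
     * crc16_reverse was swapped in for the other two perf runs. */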
    for(int i = 0; i < NROUND; i++) {
        escape(li);
        ret = crc16_forward(li, strlen(li));
        escape(li);
    }
    printf("0x%x\n", ret);
    return 0;
}

@Cuda-Chen
Author

Statistics of Each Implementation

# original
 Performance counter stats for './a.out':

       403,048,189      cycles                                                      
       402,922,173      instructions              #    1.00  insn per cycle         
            24,123      L1-dcache-load-misses                                       

       0.133358807 seconds time elapsed

       0.133398000 seconds user
       0.000000000 seconds sys
       
# forward
 Performance counter stats for './a.out':

       798,062,363      cycles                                                      
       713,941,689      instructions              #    0.89  insn per cycle         
            31,513      L1-dcache-load-misses                                       

       0.260542423 seconds time elapsed

       0.260686000 seconds user
       0.000000000 seconds sys
  
# reverse
 Performance counter stats for './a.out':

       688,547,303      cycles                                                      
     1,250,619,386      instructions              #    1.82  insn per cycle         
            35,560      L1-dcache-load-misses                                       

       0.221963854 seconds time elapsed

       0.222047000 seconds user
       0.000000000 seconds sys       

@Cuda-Chen
Author

Detailed L1 Cache Miss Counts of Each Implementation

original

Samples: 56  of event 'L1-dcache-load-misses', 4000 Hz, Event count (approx.): 41442
crc16_orig  /home/jio/c_code/a.out [Percent: local period]
Period      │    ↓ jle     40                                                                                 ▒
            │      movslq  %esi,%rsi                                                                          ▒
            │    uint16_t crc = 0;                                                                            ▒
            │      xor     %eax,%eax                                                                          ▒
            │      lea     (%rdi,%rsi,1),%rcx                                                                 ▒
            │      lea     crc16tab,%rsi                                                                      ▒
            │      nop                                                                                        ▒
            │    crc = (crc<<8) ^ crc16tab[((crc>>8) ^ *buf++)&0x00FF];                                       ▒
            │20:   mov     %eax,%edx                                                                          ▒
            │      add     $0x1,%rdi                                                                          ▒
            │      shl     $0x8,%eax                                                                          ▒
        434 │      shr     $0x8,%dx                                                                           ▒
            │      xor     -0x1(%rdi),%dl                                                                     ▒
        957 │      movzbl  %dl,%edx                                                                           ▒
        575 │      xor     (%rsi,%rdx,2),%ax                                                                  ▒
            │    for (counter = 0; counter < len; counter++)                                                  ▒
       8761 │      cmp     %rcx,%rdi                                                                          ▒
            │    ↑ jne     20                                                                                 ▒
            │    ← ret                                                                                        ▒
            │      nop                                                                                        ▒
            │    uint16_t crc = 0;                                                                            ▒
            │40:   xor     %eax,%eax                                                                          ▒
            │    return crc;                                                                                  ▒
            │    }                                                                                            ▒
            │    ← ret

half-byte LUT forward calculation

Samples: 94  of event 'L1-dcache-load-misses', 4000 Hz, Event count (approx.): 55209
crc16_forward.constprop.0  /home/jio/c_code/a.out [Percent: local period]
Period      │    for(counter = 0; counter < len; counter++) {                                                 ▒
            │      add    $0x1,%rdi                                                                           ▒
            │    crc16_forward_base():                                                                        ▒
            │    crc ^= v << 8;                                                                               ▒
            │      shl    $0x8,%edx                                                                           ▒
        701 │      xor    %edx,%eax                                                                           ▒
            │    crc = (crc << 4) ^ tbl_forward_reduced[crc >> 12];                                           ▒
            │      mov    %eax,%edx                                                                           ▒
            │      shl    $0x4,%eax                                                                           ▒
            │      shr    $0xc,%dx                                                                            ▒
            │      and    $0xf,%edx                                                                           ◆
            │      xor    (%rcx,%rdx,2),%ax                                                                   ▒
            │    crc = (crc << 4) ^ tbl_forward_reduced[crc >> 12];                                           ▒
       1158 │      mov    %eax,%edx                                                                           ▒
            │      shl    $0x4,%eax                                                                           ▒
        637 │      shr    $0xc,%dx                                                                            ▒
         58 │      and    $0xf,%edx                                                                           ▒
            │      xor    (%rcx,%rdx,2),%ax                                                                   ▒
            │    crc16_forward():                                                                             ▒
            │    for(counter = 0; counter < len; counter++) {                                                 ▒
       7255 │      cmp    %rdi,%rsi                                                                           ▒
            │    ↑ jne    10                                                                                  ▒
            │    }                                                                                            ▒
            │                                                                                                 ▒
            │    return crc;

half-byte LUT reverse calculation

Samples: 82  of event 'L1-dcache-load-misses', 4000 Hz, Event count (approx.): 49869
crc16_reverse  /home/jio/c_code/a.out [Percent: local period]
Period      │    for(counter = 0; counter < len; counter++) {                                                 ▒
            │      test    %esi,%esi                                                                          ▒
            │    ↓ jle     c0                                                                                 ▒
       3132 │      lea     -0x1(%rsi),%eax                                                                    ▒
            │    uint16_t crc = 0;                                                                            ▒
            │      xor     %r8d,%r8d                                                                          ▒
            │      lea     tbl_reduced,%rcx                                                                   ▒
            │      lea     0x1(%rdi,%rax,1),%rsi                                                              ▒
            │      xchg    %ax,%ax                                                                            ▒
            │    reverse_bit_8():                                                                             ▒
            │    REVERSE_8_bit(n);                                                                            ◆
            │20:   movzbl  (%rdi),%eax                                                                        ▒
            │    crc16_reverse():                                                                             ▒
            │    for(counter = 0; counter < len; counter++) {                                                 ▒
            │      add     $0x1,%rdi                                                                          ▒
            │    reverse_bit_8():                                                                             ▒
            │    REVERSE_8_bit(n);                                                                            ▒
            │      rol     $0x4,%al                                                                           ▒
       1692 │      mov     %eax,%edx                                                                          ▒
            │      shl     $0x2,%eax                                                                          ▒
            │      shr     $0x2,%dl                                                                           ▒
            │      and     $0xffffffcc,%eax                                                                   ▒
        966 │      and     $0x33,%edx                                                                         ▒
            │      or      %eax,%edx                                                                          ▒
            │      mov     %edx,%eax
            │      add     %edx,%edx                                                                          ▒
       1450 │      shr     %al                                                                                ▒
            │      and     $0xffffffaa,%edx                                                                   ▒
            │      and     $0x55,%eax                                                                         ▒
            │      or      %edx,%eax                                                                          ▒
            │    crc16_reverse_base():                                                                        ▒
            │    crc ^= v;                                                                                    ▒
       1643 │      movzbl  %al,%eax                                                                           ▒
            │      xor     %r8d,%eax                                                                          ▒
            │    crc = (crc >> 4) ^ tbl_reduced[crc & 0x000f];                                                ◆
            │      mov     %eax,%edx                                                                          ▒
            │      and     $0xf,%eax                                                                          ▒
       1180 │      shr     $0x4,%dx                                                                           ▒
            │      xor     (%rcx,%rax,2),%dx                                                                  ▒
        546 │      mov     %edx,%eax                                                                          ▒
            │    crc = (crc >> 4) ^ tbl_reduced[crc & 0x000f];                                                ▒
        281 │      mov     %edx,%r8d                                                                          ▒
            │      and     $0xf,%eax                                                                          ▒
        220 │      shr     $0x4,%r8w                                                                          ▒
            │      xor     (%rcx,%rax,2),%r8w                                                                 ▒
            │    crc16_reverse():                                                                             ▒
            │    for(counter = 0; counter < len; counter++) {                                                 ▒
       1178 │      cmp     %rdi,%rsi                                                                          ▒
            │    ↑ jne     20

@Cuda-Chen Cuda-Chen requested a review from zuiderkwast July 13, 2025 15:45
@zuiderkwast
Contributor

Sorry for the delay. I have time off now during the summer. I will get to it eventually.

I'm looking at your micro benchmark. The reason the large LUT is faster is because the LUT is in L1 cache, isn't it? We should test this in a real scenario to see if the LUT will be in L1 cache or not. If it isn't in L1 cache, then I guess the small LUT will be faster, for a ~20 byte payload.

Another idea (regardless of big or small LUT) can be to prefetch the LUT into L1 cache some short time before we need it, i.e. we could provide some API to call __builtin_prefetch(LUT), so later when we do the computation, it's already in L1.
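
A minimal sketch of what such a hook could look like (the function name crc16_prefetch_table is hypothetical, crc16tab stands for whichever LUT crc16.c ends up using, and __builtin_prefetch is a GCC/Clang builtin):

/* Hypothetical helper: ask the CPU to pull the CRC16 LUT into cache ahead of
 * time. A caller that knows a crc16() call is coming soon (e.g. early in
 * command dispatch) would invoke this so the table is warm when the
 * computation starts. */
static inline void crc16_prefetch_table(void) {
#if defined(__GNUC__) || defined(__clang__)
    __builtin_prefetch(crc16tab, 0 /* read */, 3 /* high temporal locality */);
#endif
}

Note that the 512-byte table spans 8 cache lines, so it would need several prefetches (one per 64-byte line) to be fully resident, whereas the proposed 32-byte table is covered by a single prefetch.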

I hope more people can find the time to look at this. @hpatro?

@Cuda-Chen
Author

Hi @zuiderkwast ,

Sorry for the delay. I have time off now during the summer. I will get to it eventually.

It's okay. I figured you were caught up with more than 10 issues to review.

The reason the large LUT is faster is because the LUT is in L1 cache, isn't it?

After thinking about it for half a day, here is my personal guess at what happens under the hood:

  1. For the original implementation, we have a table with 256 entries occupying 512 bytes of memory. An ordinary L1 cache line is 64 bytes, so one line can hold at most 32 entries (each entry costs 2 bytes). Recall that CRC is basically division (XOR operations, precisely); if the permutation of the input bytes only touches the 32 entries already stored in L1 cache, the original implementation will run faster than the currently proposed implementations.
  2. For the proposed implementations, the bottleneck is in computing the CRC value: we need two table lookups per input byte, and these lookups must be done sequentially.

Another idea (regardless of big or small LUT) can be to prefetch the LUT into L1 cache some short time before we need it, i.e. we could provide some API to call __builtin_prefetch(LUT), so later when we do the computation, it's already in L1.

I think this could be a direction. What's more, I have an idea: we could choose the computing method based on the "entropy" of the input bytes (such as the distance between bytes) when deciding whether to prefetch the contents of the LUT.

@Cuda-Chen
Author

I have just been caught up with an interview these days. I will start the other benchmarks after this week.

@hpatro
Collaborator

hpatro commented Jul 17, 2025

@Cuda-Chen Thanks for the update. Feel free to take your time and share the results at your ease.

@Cuda-Chen
Author

Hi @zuiderkwast and @hpatro ,

I would like to give an update on the benchmarking.
I will put the results and my findings in the following comments.

For running the benchmark I use ./runtest --single integration/valkey-benchmark.

@Cuda-Chen
Author

Benchmark Results

original

$ ./runtest --single integration/valkey-benchmark               
Cleanup: may take some time... OK        
Starting test server at port 21079                
[ready]: 37876                                         
Testing integration/valkey-benchmark                                                                           
[ready]: 37877                                         
[ready]: 37878                                                                                                 
[ready]: 37879                                         
[ready]: 37880                                         
[ready]: 37881
[ready]: 37882            
[ready]: 37883
[ready]: 37884                                         
[ready]: 37885                                         
[ready]: 37886
[ready]: 37887                                         
[ready]: 37888
[ready]: 37889                                         
[ready]: 37890                                         
[ready]: 37891
[ok]: benchmark: set,get (7 ms)
[ok]: benchmark: connecting using URI set,get (7 ms)
[ok]: benchmark: connecting using URI with authentication set,get (7 ms)
[ok]: benchmark: full test suite (97 ms)
[ok]: benchmark: multi-thread set,get (512 ms)
[ok]: benchmark: pipelined full set,get (100 ms)
[ok]: benchmark: arbitrary command (7 ms)
[ok]: benchmark: arbitrary command sequence (7 ms)
value:VXKeHogKgJ=[5V9_X^b?48OKF2jGA<f:iR@50o7dS3
[ok]: benchmark: arbitrary command with data placeholder (7 ms)
[ok]: benchmark: keyspace length (35 ms)
[ok]: benchmark: clients idle mode should return error when reached maxclients limit (5 ms)
[ok]: benchmark: read last argument from stdin (14 ms)
[1/1 done]: integration/valkey-benchmark (1 seconds)

                   The End

Execution time of different units:
  1 seconds - integration/valkey-benchmark

\o/ All tests passed without errors!

Cleanup: may take some time... OK

half-byte LUT forward calculation

$ ./runtest --single integration/valkey-benchmark
Cleanup: may take some time... OK
Starting test server at port 21079
[ready]: 16624
Testing integration/valkey-benchmark
[ready]: 16625
[ready]: 16623
[ready]: 16626
[ready]: 16628
[ready]: 16629
[ready]: 16627
[ready]: 16630
[ready]: 16631
[ready]: 16633
[ready]: 16634
[ready]: 16635
[ready]: 16636
[ready]: 16637
[ready]: 16632
[ready]: 16638
[ok]: benchmark: set,get (6 ms)
[ok]: benchmark: connecting using URI set,get (5 ms)
[ok]: benchmark: connecting using URI with authentication set,get (7 ms)
[ok]: benchmark: full test suite (92 ms)
[ok]: benchmark: multi-thread set,get (520 ms)
[ok]: benchmark: pipelined full set,get (100 ms)
[ok]: benchmark: arbitrary command (6 ms)
[ok]: benchmark: arbitrary command sequence (10 ms)
value:VXKeHogKgJ=[5V9_X^b?48OKF2jGA<f:iR@50o7dS3
[ok]: benchmark: arbitrary command with data placeholder (8 ms)
[ok]: benchmark: keyspace length (37 ms)
[ok]: benchmark: clients idle mode should return error when reached maxclients limit (3 ms)
[ok]: benchmark: read last argument from stdin (12 ms)
[1/1 done]: integration/valkey-benchmark (1 seconds)

                   The End

Execution time of different units:
  1 seconds - integration/valkey-benchmark

\o/ All tests passed without errors!

Cleanup: may take some time... OK

half-byte LUT reverse calculation

$ ./runtest --single integration/valkey-benchmark               
Cleanup: may take some time... OK        
Starting test server at port 21079                
[ready]: 32023                                         
Testing integration/valkey-benchmark                                                   
[ready]: 32025                                         
[ready]: 32024                                                                                                 
[ready]: 32026                                         
[ready]: 32029                                         
[ready]: 32027             
[ready]: 32028            
[ready]: 32031             
[ready]: 32030                                         
[ready]: 32032                                         
[ready]: 32034             
[ready]: 32033                                         
[ready]: 32035             
[ready]: 32036                                         
[ready]: 32037                                         
[ready]: 32038                                         
[ok]: benchmark: set,get (8 ms)                        
[ok]: benchmark: connecting using URI set,get (6 ms)
[ok]: benchmark: connecting using URI with authentication set,get (7 ms)                                       
[ok]: benchmark: full test suite (99 ms)               
[ok]: benchmark: multi-thread set,get (512 ms)
[ok]: benchmark: pipelined full set,get (110 ms)
[ok]: benchmark: arbitrary command (9 ms)
[ok]: benchmark: arbitrary command sequence (7 ms)
value:VXKeHogKgJ=[5V9_X^b?48OKF2jGA<f:iR@50o7dS3
[ok]: benchmark: arbitrary command with data placeholder (8 ms)
[ok]: benchmark: keyspace length (33 ms)
[ok]: benchmark: clients idle mode should return error when reached maxclients limit (5 ms)
[ok]: benchmark: read last argument from stdin (13 ms)
[1/1 done]: integration/valkey-benchmark (1 seconds)

                   The End

Execution time of different units:
  1 seconds - integration/valkey-benchmark

\o/ All tests passed without errors!

Cleanup: may take some time... OK

@Cuda-Chen
Author

Findings

  • The running times of the original, forward, and reverse calculations show little difference.
    • As the reverse calculation needs additional bit-reversing operations, I would like to stick with the forward calculation for simpler coding logic.

What I will Do Next

First, prefetching the LUT as mentioned in the previous comment.

What's more, I would like to run an experiment on prefetching the input data as well (as the half-byte LUT takes 32 bytes, it may be beneficial to prefetch the input data if we cap its size at 32 bytes, utilizing the whole 64 bytes of L1 cache).

@zuiderkwast
Contributor

zuiderkwast commented Jul 20, 2025

For running the benchmark I use ./runtest --single integration/valkey-benchmark.

This just runs the integration test of the benchmark tool. It just tests that the benchmark tool is working correctly. It does not run any real benchmark.

To run benchmarks, use src/valkey-benchmark with a running server started separately using src/valkey-server. Use src/valkey-benchmark --help for help.

Additionally, the server needs to run in cluster mode. Otherwise, CRC-16 is not used at all.

@zuiderkwast
Contributor

Actually it might be very difficult to see any difference at all in a benchmark.

I think it is very difficult to benchmark these differences without good knowledge of large parts of this code base. I guess we would need to use perf or a flame graph to see the part of the execution that consists of the crc16 computation. The crc16 computation is already a very small part of the total execution time.

I wish I had some time to assist you with this, but I'm off for vacation. I guess others are off too during the summer. Maybe it's better to wait for later when people are back from vacation.

What I will Do Next

First will be prefetching LUT as mentioned in the previous comment.

To understand when to call this, we need to read the code where crc16 is called and work backwards to find a good point in time to do the prefetching. It should be done early enough to give enough time for the memory to be read into L1 cache in the background while some other code is executing. Then, when the crc16 code needs the LUT it will already be in L1 cache. We can't really do it inside crc16 itself just before we need to use it. That would be too late.

The main place crc16 is called is from keyHashSlot() which is called from clusterSlotByCommand() in cluster.c. My guess is that the beginning of clusterSlotByCommand() can be a good place to prefetch the LUT. It could be enough time before the crc16 will actually access the LUT.

What's more, I would like to conduct experiment on prefetching input data as well (as the half-byte LUT takes 32 Bytes, it may be beneficial to prefetch input data as we set the maximum size of input data to 32 Bytes, utilizing the whole 64 Bytes of L1 cache).

64 bytes is not the total size of the L1 cache. It is the size of one cache line. Memory is fetched to cache in consecutive blocks of this size. The LUT and the input data will almost definitely be in different cache lines because they are not stored together in the same 64 byte range in memory.

The total L1 cache size is small, but not that small. At least it can store some hundred cache lines, or something like that.
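
As an aside, on Linux with glibc both numbers can be inspected at runtime via sysconf (the _SC_LEVEL1_* names are glibc extensions, and the values may be reported as 0 when unknown):

#include <stdio.h>
#include <unistd.h>

int main(void) {
    long line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE); /* one cache line, typically 64 bytes */
    long size = sysconf(_SC_LEVEL1_DCACHE_SIZE);     /* total L1 data cache size */
    printf("L1d line: %ld bytes, L1d total: %ld bytes\n", line, size);
    return 0;
}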

Btw, the input has typically just been accessed by the command parser before the crc16 calculation happens, so I would assume it is already in the L1 cache, or at least in L2 or L3 cache, so my guess is that it is not useful to do prefetching of the input data.

@Cuda-Chen
Author

Hi @zuiderkwast ,

For running the benchmark I use ./runtest --single integration/valkey-benchmark.

This just runs the integration test of the benchmark tool. It just tests that the benchmark tool is working correctly. It does not run any real benchmark.

To run benchmarks, use src/valkey-benchmark with a running server started separately using src/valkey-server. Use src/valkey-benchmark --help for help.

Additionally, the server needs to run in cluster mode. Otherwise, CRC-16 is not used at all.

Thanks for pointing this out. I will get familiar with the benchmarking tools.

I wish I had some time to assist you with this, but I'm off for vacation.

Got it. Take your time.

To understand when to call this, we need to read code where the crc16 is called and read backwards to find a good point in time to do the prefetching.

Thanks for the hint. I will identify the functions that call crc16 and run some experiments.

@Cuda-Chen Cuda-Chen force-pushed the optimize-crc16-lut branch 4 times, most recently from c454158 to 670fbe0 Compare July 20, 2025 09:00
Apply a half-byte lookup table so that the whole table fits into
a CPU L1 cache line (usually 64 bytes in size).

An ordinary CRC16 lookup table needs 512 bytes (
2 bytes per entry *
256 entries, one for each possible value of a byte),
whereas this implementation needs only 32 bytes (
2 bytes per entry *
16 entries, one for each possible value of a nibble).

Mention valkey-io#2031.

Signed-off-by: Cuda Chen <clh960524@gmail.com>
@Cuda-Chen Cuda-Chen force-pushed the optimize-crc16-lut branch from 670fbe0 to 1607895 Compare July 20, 2025 10:24