Optimize CRC16 lookup table size #2300
base: unstable
Conversation
Thank you! This is very interesting. Some first thoughts:
- Have you done any benchmarks?
  My guess is that the large table is faster if all of the table is in L1 cache, so if the algorithm runs isolated in a benchmark loop, the old one will be faster. However, if the table is not in L1 cache, this new one will be faster.
  An idea is to just test using valkey-benchmark against a single cluster. Another idea is to use perf and/or a flame graph and look at the part of the time spent on crc16. If you are not familiar with benchmarking Valkey, there are others who can help with this (e.g. @xbasel, @SoftlyRaining, @roshkhatri).
- Since the whole file is completely rewritten, we can also drop the copyright and licence header and replace it with our shorter variant:

  /*
   * Copyright Valkey Contributors.
   * All rights reserved.
   * SPDX-License-Identifier: BSD 3-Clause
   */
- Let's move the old implementation with the large table to a unit test and use it to compare the old and new implementation. Let's create a new unit test in src/unit/test_crc16.c. Follow the other files in that directory to see how it's done. It's a very minimal unit test framework.
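A rough sketch of what such a test could look like (the TEST_ASSERT macro from test_help.h and the test-function signature are assumptions based on the neighboring files, so double-check them there; in the real test, the old 256-entry-table implementation should be pasted in, while this sketch uses a short bit-by-bit reference so that it stays self-contained):

#include <stdint.h>
#include <string.h>
#include "test_help.h"

uint16_t crc16(const char *buf, int len); /* the new implementation under test */

/* Bit-by-bit CRC-16/XMODEM (poly 0x1021), used here as an independent oracle.
 * In the real test this would be replaced by the old large-table crc16(). */
static uint16_t crc16_reference(const char *buf, int len) {
    uint16_t crc = 0;
    for (int i = 0; i < len; i++) {
        crc ^= (uint16_t)((uint8_t)buf[i]) << 8;
        for (int bit = 0; bit < 8; bit++)
            crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ 0x1021) : (uint16_t)(crc << 1);
    }
    return crc;
}

int test_crc16_against_reference(int argc, char **argv, int flags) {
    (void)argc;
    (void)argv;
    (void)flags;
    const char *inputs[] = {"", "123456789", "foo{bar}baz", "Lorem ipsum dolor sit amet"};
    for (size_t i = 0; i < sizeof(inputs) / sizeof(inputs[0]); i++) {
        int len = (int)strlen(inputs[i]);
        TEST_ASSERT(crc16(inputs[i], len) == crc16_reference(inputs[i], len));
    }
    TEST_ASSERT(crc16("123456789", 9) == 0x31C3); /* well-known XMODEM check value */
    return 0;
}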
src/crc16.c
Outdated
 /* CRC16 implementation according to CCITT standards.
  *
- * Note by @antirez: this is actually the XMODEM CRC 16 algorithm, using the
- * following parameters:
+ * Note by @cudachen: this is actually the XMODEM CRC-16 algorithm (with
+ * bit-reflected), using the following parameters:
  *
  * Name : "XMODEM", also known as "ZMODEM", "CRC-16/ACORN"
  * Width : 16 bit
- * Poly : 1021 (That is actually x^16 + x^12 + x^5 + 1)
- * Initialization : 0000
- * Reflect Input byte : False
- * Reflect Output CRC : False
- * Xor constant to output CRC : 0000
- * Output for "123456789" : 31C3
+ * Poly : 0x8408 (That is actually 0x1021, or x^16 + x^12 + x^5 + 1)
+ * Initialization : 0x0000
+ * Reflect Input byte : True
+ * Reflect Output CRC : True
+ * Xor constant to output CRC : 0x0000
+ * Output for "123456789" : 0x31C3
  */
Why change the description? Is the old description incorrect, or is it equivalent to your new description, just written in a reflected way...?
I think we can keep the old description, if it's not incorrect. It seems to be the most common way to describe this CRC-16 variant. For example, according to this page, the Xmodem variant has the parameters mentioned by Antirez.
If a reflected and a non-reflected variant are equivalent with just different numbers in the table, is it possible to avoid the reverse_8_bit() and reverse_16_bit() calls by just using different numbers in the table?
but yours is written in a reflected way
In my opinion, writing it in a reflected way has these advantages:
- No need to 1) shift the input byte up before XOR-ing it into the CRC, or 2) shift the CRC down when retrieving the entry from the LUT (masking the low nibble is enough).
  For example:

  /* a reflected way */
  for (i = 0; i < len; i++) {
      uint8_t v = (uint8_t)buf[i];
      v = reverse_8_bit(v); /* REQUIRED: reverse all bits */
      crc ^= v;
      crc = (crc >> 4) ^ tbl_reduced[crc & 0x000f];
      crc = (crc >> 4) ^ tbl_reduced[crc & 0x000f];
  }
  return reverse_16_bit(crc); /* REQUIRED: reverse all bits */

  /* a normal way */
  for (i = 0; i < len; i++) {
      uint16_t v = (uint16_t)(uint8_t)buf[i] << 8;
      crc ^= v;
      crc = (crc << 4) ^ tbl_forward_reduced[crc >> 12];
      crc = (crc << 4) ^ tbl_forward_reduced[crc >> 12];
  }
  return crc;
- Looks more intuitive.
  Many implementations use the reflected way (including carry-less-multiplication ones), so choosing it makes the code easy to compare against them and more readable.
However, my current implementation needs two additional XOR operations in order to get the same result as a normal calculation.
If a reflected and non-reflected variant are equivalent with just different numbers in the table, is it possible to avoid the reverse_8_bit() and reverse_16_bit() calls, if we just use different numbers in the table?
From my experiments, there is also a non-reflected variant of the half-byte table solution, shown below:
/* Adapted from: https://www.eevblog.com/forum/programming/crc32-half-bytenibble-lookup-table-algorithm/ */
// A forward LUT, just retrieve the first 16 entries from ordinary LUT.
static const uint16_t tbl_forward_reduced[16] = {
0x0000,0x1021,0x2042,0x3063,0x4084,0x50a5,0x60c6,0x70e7,
0x8108,0x9129,0xa14a,0xb16b,0xc18c,0xd1ad,0xe1ce,0xf1ef,
};
// CRC-16 main subroutine
uint16_t crc16(const char *buf, int len) {
int counter;
uint16_t crc = 0;
for(counter = 0; counter < len; counter++) {
crc ^= (uint16_t)buf[counter] << 8;
crc = (crc << 4) ^ tbl_forward_reduced[crc >> 12];
crc = (crc << 4) ^ tbl_forward_reduced[crc >> 12];
}
return crc;
}
To my current understanding, a normal calculation is equivalent to a bit-reversed calculation combined with reflecting the bits of the input bytes and the output CRC, and vice versa (e.g., CRC-32C is defined as a normal calculation with reflected input bytes and output CRC, yet many implementations choose the bit-reversed calculation because it removes the need to reflect the input bytes and the output CRC).
You may see this post for further information: https://github.com/Michaelangel007/crc32?tab=readme-ov-file#tldr-crc32-summary
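As a side note, both 16-entry tables can be regenerated directly from the polynomial: 0x1021 for the forward form and its bit-reversed counterpart 0x8408 for the reflected form. A small stand-alone sketch (my own illustration, using only standard C):

#include <stdint.h>
#include <stdio.h>

/* Prints tbl_forward_reduced[] (forward: nibble enters the top 4 bits, poly 0x1021)
 * and tbl_reduced[] (reflected: nibble enters the low 4 bits, poly 0x8408). */
int main(void) {
    printf("forward:\n");
    for (int n = 0; n < 16; n++) {
        uint16_t crc = (uint16_t)(n << 12);
        for (int bit = 0; bit < 4; bit++)
            crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ 0x1021) : (uint16_t)(crc << 1);
        printf("0x%04x,%s", crc, (n % 8 == 7) ? "\n" : "");
    }
    printf("reflected:\n");
    for (int n = 0; n < 16; n++) {
        uint16_t crc = (uint16_t)n;
        for (int bit = 0; bit < 4; bit++)
            crc = (crc & 1) ? (uint16_t)((crc >> 1) ^ 0x8408) : (uint16_t)(crc >> 1);
        printf("0x%04x,%s", crc, (n % 8 == 7) ? "\n" : "");
    }
    return 0;
}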
To sum up, I think we can try the normal calculation and then benchmark to choose the best implementation.
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@ Coverage Diff @@
## unstable #2300 +/- ##
============================================
+ Coverage 71.30% 71.45% +0.15%
============================================
Files 123 123
Lines 67122 67128 +6
============================================
+ Hits 47859 47967 +108
+ Misses 19263 19161 -102
Hi @zuiderkwast,
I haven't done benchmarks yet. By the way, how should I reach out to the people you mentioned for help (e.g., via GitHub issues)?
I will replace the copyright and licence header in future commits.
Thanks for saving my day! I was just wondering how Valkey executes its unit tests.
Hi @zuiderkwast, I would like to give an update on the benchmarking process.
In short, I observe the following phenomena:
I also provide the L1 cache miss counts in the tables below:
Test Environments
CPU
Code for Testing

#include <stdio.h>
#include <stdint.h>
#include <string.h>
#define NROUND 100000
static const uint16_t crc16tab[256]= {
0x0000,0x1021,0x2042,0x3063,0x4084,0x50a5,0x60c6,0x70e7,
0x8108,0x9129,0xa14a,0xb16b,0xc18c,0xd1ad,0xe1ce,0xf1ef,
0x1231,0x0210,0x3273,0x2252,0x52b5,0x4294,0x72f7,0x62d6,
0x9339,0x8318,0xb37b,0xa35a,0xd3bd,0xc39c,0xf3ff,0xe3de,
0x2462,0x3443,0x0420,0x1401,0x64e6,0x74c7,0x44a4,0x5485,
0xa56a,0xb54b,0x8528,0x9509,0xe5ee,0xf5cf,0xc5ac,0xd58d,
0x3653,0x2672,0x1611,0x0630,0x76d7,0x66f6,0x5695,0x46b4,
0xb75b,0xa77a,0x9719,0x8738,0xf7df,0xe7fe,0xd79d,0xc7bc,
0x48c4,0x58e5,0x6886,0x78a7,0x0840,0x1861,0x2802,0x3823,
0xc9cc,0xd9ed,0xe98e,0xf9af,0x8948,0x9969,0xa90a,0xb92b,
0x5af5,0x4ad4,0x7ab7,0x6a96,0x1a71,0x0a50,0x3a33,0x2a12,
0xdbfd,0xcbdc,0xfbbf,0xeb9e,0x9b79,0x8b58,0xbb3b,0xab1a,
0x6ca6,0x7c87,0x4ce4,0x5cc5,0x2c22,0x3c03,0x0c60,0x1c41,
0xedae,0xfd8f,0xcdec,0xddcd,0xad2a,0xbd0b,0x8d68,0x9d49,
0x7e97,0x6eb6,0x5ed5,0x4ef4,0x3e13,0x2e32,0x1e51,0x0e70,
0xff9f,0xefbe,0xdfdd,0xcffc,0xbf1b,0xaf3a,0x9f59,0x8f78,
0x9188,0x81a9,0xb1ca,0xa1eb,0xd10c,0xc12d,0xf14e,0xe16f,
0x1080,0x00a1,0x30c2,0x20e3,0x5004,0x4025,0x7046,0x6067,
0x83b9,0x9398,0xa3fb,0xb3da,0xc33d,0xd31c,0xe37f,0xf35e,
0x02b1,0x1290,0x22f3,0x32d2,0x4235,0x5214,0x6277,0x7256,
0xb5ea,0xa5cb,0x95a8,0x8589,0xf56e,0xe54f,0xd52c,0xc50d,
0x34e2,0x24c3,0x14a0,0x0481,0x7466,0x6447,0x5424,0x4405,
0xa7db,0xb7fa,0x8799,0x97b8,0xe75f,0xf77e,0xc71d,0xd73c,
0x26d3,0x36f2,0x0691,0x16b0,0x6657,0x7676,0x4615,0x5634,
0xd94c,0xc96d,0xf90e,0xe92f,0x99c8,0x89e9,0xb98a,0xa9ab,
0x5844,0x4865,0x7806,0x6827,0x18c0,0x08e1,0x3882,0x28a3,
0xcb7d,0xdb5c,0xeb3f,0xfb1e,0x8bf9,0x9bd8,0xabbb,0xbb9a,
0x4a75,0x5a54,0x6a37,0x7a16,0x0af1,0x1ad0,0x2ab3,0x3a92,
0xfd2e,0xed0f,0xdd6c,0xcd4d,0xbdaa,0xad8b,0x9de8,0x8dc9,
0x7c26,0x6c07,0x5c64,0x4c45,0x3ca2,0x2c83,0x1ce0,0x0cc1,
0xef1f,0xff3e,0xcf5d,0xdf7c,0xaf9b,0xbfba,0x8fd9,0x9ff8,
0x6e17,0x7e36,0x4e55,0x5e74,0x2e93,0x3eb2,0x0ed1,0x1ef0
};
static const uint16_t tbl_reduced[16] = {
0x0000,0x1081,0x2102,0x3183,0x4204,0x5285,0x6306,0x7387,0x8408,0x9489,0xa50a,0xb58b,0xc60c,0xd68d,0xe70e,0xf78f,
};
static const uint16_t tbl_forward_reduced[16] = {
0x0000,0x1021,0x2042,0x3063,0x4084,0x50a5,0x60c6,0x70e7,
0x8108,0x9129,0xa14a,0xb16b,0xc18c,0xd1ad,0xe1ce,0xf1ef,
};
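/* Reverses the bit order within each byte of n. */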
#define REVERSE_8_bit(n) \
do { \
n = ((n & 0xf0f0f0f0) >> 4) | ((n & 0x0f0f0f0f) << 4); \
n = ((n & 0xcccccccc) >> 2) | ((n & 0x33333333) << 2); \
n = ((n & 0xaaaaaaaa) >> 1) | ((n & 0x55555555) << 1); \
} while(0)
static inline uint8_t reverse_bit_8(uint8_t n) {
REVERSE_8_bit(n);
return n;
}
static inline uint16_t reverse_bit_16(uint16_t n) {
n = ((n & 0xff00ff00) >> 8) | ((n & 0x00ff00ff) << 8);
REVERSE_8_bit(n);
return n;
}
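/* Original implementation: full 256-entry LUT, one table lookup per input byte. */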
__attribute__((noinline)) uint16_t crc16_orig(const char *buf, int len) {
int counter;
uint16_t crc = 0;
for (counter = 0; counter < len; counter++)
crc = (crc<<8) ^ crc16tab[((crc>>8) ^ *buf++)&0x00FF];
return crc;
}
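/* Reflected half-byte variant: 16-entry LUT, two lookups per byte; input bytes and the final CRC are bit-reversed. */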
static inline uint16_t crc16_reverse_base(uint16_t crc, uint8_t v) {
crc ^= v;
crc = (crc >> 4) ^ tbl_reduced[crc & 0x000f];
crc = (crc >> 4) ^ tbl_reduced[crc & 0x000f];
return crc;
}
__attribute__((noinline)) uint16_t crc16_reverse(const char *buf, int len) {
int counter;
uint16_t crc = 0;
for(counter = 0; counter < len; counter++) {
uint8_t tmp = reverse_bit_8((uint8_t)buf[counter]);
crc = crc16_reverse_base(crc, tmp);
}
return reverse_bit_16(crc);
}
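/* Forward (non-reflected) half-byte variant: 16-entry LUT, two lookups per byte, no bit reversal needed. */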
static inline uint16_t crc16_forward_base(uint16_t crc, uint8_t v) {
crc ^= v << 8;
crc = (crc << 4) ^ tbl_forward_reduced[crc >> 12];
crc = (crc << 4) ^ tbl_forward_reduced[crc >> 12];
return crc;
}
__attribute__((noinline)) static uint16_t crc16_forward(const char *buf, int len) {
int counter;
uint16_t crc = 0;
for(counter = 0; counter < len; counter++) {
uint8_t tmp = (uint8_t)buf[counter];
crc = crc16_forward_base(crc, tmp);
}
return crc;
}
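/* Optimization barrier: the empty asm with a "memory" clobber keeps the compiler from optimizing the benchmark loop away. */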
__attribute__((always_inline))
static inline void escape(void *p) {
__asm__ volatile ("" : : "r"(p) : "memory");
}
int main() {
const char li[] = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed "
"do eiusmod tempor incididunt ut labore et dolore magna "
"aliqua. Ut enim ad minim veniam, quis nostrud exercitation "
"ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis "
"aute irure dolor in reprehenderit in voluptate velit esse "
"cillum dolore eu fugiat nulla pariatur. Excepteur sint "
"occaecat cupidatat non proident, sunt in culpa qui officia "
"deserunt mollit anim id est laborum.";
uint16_t ret;
for(int i = 0; i < NROUND; i++) {
escape(li);
ret = crc16_forward(li, strlen(li));
escape(li);
}
printf("0x%x\n", ret);
return 0;
}
Statistics of Each Implementation
Detailed L1 Cache Miss Counts of Each Implementation
original
half-byte LUT forward calculation
half-byte LUT reverse calculation
Sorry for the delay. I have time off now during the summer. I will get to it eventually.

I'm looking at your micro benchmark. The reason the large LUT is faster is because the LUT is in L1 cache, isn't it? We should test this in a real scenario to see if the LUT will be in L1 cache or not. If it isn't in L1 cache, then I guess the small LUT will be faster, for a ~20 byte payload.

Another idea (regardless of big or small LUT) can be to prefetch the LUT into L1 cache some short time before we need it, i.e. we could provide some API to call.

I hope more people can find the time to look at this. @hpatro?
Hi @zuiderkwast,
It's okay. I figured you were busy with more than 10 other issues to review.
After thinking about it for half a day, here is my guess at what happens under the hood:
I consider this might be a direction worth exploring. What's more, I have an idea: we could choose the computation method based on the "entropy" of the input bytes (such as the distance between bytes) when prefetching the contents of the LUT.
I have been busy with an interview these days. I will start the other benchmarks after this week.
@Cuda-Chen Thanks for the update. Feel free to take your time and share the results at your convenience.
Hi @zuiderkwast and @hpatro, I would like to give an update on the benchmarking. For running the benchmark I use
Benchmark Results
original
half-byte LUT forward calculation
half-byte LUT reverse calculation
Findings
What I will Do Next
First will be prefetching the LUT, as mentioned in the previous comment. What's more, I would like to experiment with prefetching the input data as well (as the half-byte LUT takes 32 bytes, it may be beneficial to prefetch the input data if we set the maximum size of the input data to 32 bytes, utilizing the whole 64 bytes of L1 cache).
This just runs the integration test of the benchmark tool. It only tests that the benchmark tool is working correctly; it does not run any real benchmark. To run benchmarks, use

Additionally, the server needs to run in cluster mode. Otherwise, CRC-16 is not used at all.
Actually, it might be very difficult to see any difference at all in a benchmark. I think it is very hard to benchmark these differences without good knowledge of large parts of this code base. I guess we would need to use perf or a flame graph to see the part of the execution that consists of the crc16 computation. The crc16 computation is already a very small part of the total execution time. I wish I had some time to assist you with this, but I'm off for vacation. I guess others are off too during the summer. Maybe it's better to wait until people are back from vacation.
To understand when to call this, we need to read code where the crc16 is called and read backwards to find a good point in time to do the prefetching. It should be done early enough to give enough time for the memory to be read into L1 cache in the background while some other code is executing. Then, when the crc16 code needs the LUT it will already be in L1 cache. We can't really do it inside crc16 itself just before we need to use it. That would be too late. The main place crc16 is called is from keyHashSlot() which is called from clusterSlotByCommand() in cluster.c. My guess is that the beginning of clusterSlotByCommand() can be a good place to prefetch the LUT. It could be enough time before the crc16 will actually access the LUT.
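A minimal sketch of what such a prefetch helper could look like (assuming GCC/Clang's __builtin_prefetch; the helper name is made up for illustration, and it assumes the table, here the original 512-byte crc16tab which is currently static in crc16.c, is made reachable from the call site):

#include <stdint.h>

extern const uint16_t crc16tab[256]; /* assumption: table exported from crc16.c */

/* Touch each of the table's 8 cache lines (64 bytes = 32 uint16_t entries per line)
 * so it is likely in L1 by the time crc16() actually reads it. */
static inline void crc16_prefetch_table(void) {
    for (int i = 0; i < 256; i += 32)
        __builtin_prefetch(&crc16tab[i], 0, 3); /* read access, high temporal locality */
}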
64 bytes is not the total size of the L1 cache. It is the size of one cache line. Memory is fetched into the cache in consecutive blocks of this size. The LUT and the input data will almost certainly be in different cache lines because they are not stored together in the same 64-byte range in memory. The total L1 cache size is small, but not that small; it can store at least a few hundred cache lines, or something like that.

Btw, the input has typically just been accessed by the command parser before the crc16 calculation happens, so I would assume it is already in the L1 cache, or at least in L2 or L3 cache, so my guess is that it is not useful to prefetch the input data.
Hi @zuiderkwast,
Thanks for pointing that out. I will adjust the benchmarking setup accordingly.
Got it. Please go at your own pace.
Thanks for the hint. I will find the functions that call crc16 and run some experiments.
c454158 to 670fbe0
Apply half-byte lookup table for fitting the whole table into CPU's L1 cacheline (usually with size of 64 Bytes). For an ordinary CRC16 lookup table, it needs 512 Bytes (2 Bytes for each polynomial * 256 entries for the bit permutation of one Byte). Whereas this implementation merely needs 32 Bytes (2 Bytes for each polynomial * 16 entries for the bit permutation of one Byte). Mention valkey-io#2031.

Signed-off-by: Cuda Chen <clh960524@gmail.com>
670fbe0 to 1607895
Apply half-byte lookup table for fitting the whole table into CPU's L1 cacheline (usually with size of 64 Bytes).
For an ordinary CRC16 lookup table, it needs 512 Bytes (2 Bytes for each polynomial * 256 entries for the bit permutation of one Byte), whereas this implementation merely needs 32 Bytes (2 Bytes for each polynomial * 16 entries for the bit permutation of one Byte).
Mention #2031.