Improve VPCLMULQDQ to use 512-bit wide registers #8

Open: onethumb wants to merge 4 commits into main


@onethumb (Contributor) commented Jun 7, 2025

The Problem

The current implementation uses 4 x 256-bit registers, but CPUs that support VPCLMULQDQ together with AVX-512 can run the same carry-less multiplies on 512-bit registers, which should be faster.

The Solution

Implement 4 x 512-bit operations for increased throughput. CRC-64/NVME throughput went from ~56.4 GiB/s to ~96.1 GiB/s on a Sapphire Rapids AWS EC2 c7i.8xlarge instance.
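
To illustrate what one 512-bit folding step does, here is a portable scalar model, not the code from this PR. It assumes the usual CLMUL folding shape, where each 128-bit lane multiplies its low and high halves by a coefficient pair and XORs in the next block of input. The names `clmul64`, `fold_lane`, and `fold_512` are hypothetical, and which coefficient pairs with which half depends on the implementation's conventions.

```rust
/// Carry-less (GF(2)) multiply of two 64-bit values, producing a 128-bit
/// product. This models what a single lane of a CLMUL instruction computes.
pub fn clmul64(a: u64, b: u64) -> u128 {
    let mut acc = 0u128;
    for i in 0..64 {
        if (b >> i) & 1 == 1 {
            acc ^= (a as u128) << i;
        }
    }
    acc
}

/// Fold one 128-bit lane forward: multiply each half by its coefficient,
/// then XOR in the next 128 bits of input.
/// Assumption: k_lo pairs with the low half and k_hi with the high half.
pub fn fold_lane(state: u128, next: u128, k_lo: u64, k_hi: u64) -> u128 {
    let lo = state as u64;
    let hi = (state >> 64) as u64;
    clmul64(lo, k_lo) ^ clmul64(hi, k_hi) ^ next
}

/// Model of one 512-bit register: four independent 128-bit lanes folded
/// with the same coefficient pair.
pub fn fold_512(state: [u128; 4], next: [u128; 4], k_lo: u64, k_hi: u64) -> [u128; 4] {
    std::array::from_fn(|i| fold_lane(state[i], next[i], k_lo, k_hi))
}
```

With four such registers in flight, each loop iteration consumes 4 x 64 = 256 bytes of input, which is why the folding distance grows from 128 to 256 bytes.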

Changes

  • Calculate a new 256-byte distance folding coefficient for all CRC variants
  • Update VPCLMULQDQ calculations to use 512-bit wide registers and intrinsics
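
As a rough sketch of where such coefficients come from: folding constants are remainders of powers of x modulo the CRC polynomial over GF(2). The example below computes x^n mod P for a 64-bit polynomial in normal (non-reflected) form. The helper names are hypothetical, and the exponents returned by `fold_exponents` are illustrative assumptions; the exact offsets, and any bit-reversal for reflected CRCs such as CRC-64/NVME, depend on the implementation.

```rust
/// CRC-64/NVME generator polynomial in normal form; the x^64 term is implicit.
pub const CRC64_NVME_POLY: u64 = 0xAD93_D235_94C9_3659;

/// Compute x^n mod P over GF(2), where `poly` holds the low 64 bits of a
/// degree-64 polynomial P. Each step multiplies by x and reduces if needed.
pub fn xpow_mod(n: u32, poly: u64) -> u64 {
    let mut r: u64 = 1; // x^0
    for _ in 0..n {
        let carry = r >> 63; // would the shift produce an x^64 term?
        r <<= 1;
        if carry == 1 {
            r ^= poly; // x^64 is congruent to the low bits of P
        }
    }
    r
}

/// Illustrative exponent pair for a given folding distance in bytes.
/// Assumption: (8 * distance, 8 * distance + 64); real code may use offsets.
pub fn fold_exponents(distance_bytes: u32) -> (u32, u32) {
    let bits = distance_bytes * 8;
    (bits, bits + 64)
}
```

For a 256-byte distance, `fold_exponents(256)` gives the exponents 2048 and 2112, and `xpow_mod(2048, CRC64_NVME_POLY)` would be one candidate coefficient under these assumptions.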

Planned version bump

  • Which: MINOR
  • Why: non-breaking new functionality (gated behind nightly and the vpclmulqdq feature flag)
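
A minimal sketch of how this kind of gating can look, assuming a Cargo feature named `vpclmulqdq` as described above. The function `pick_backend` and its return strings are hypothetical and are not taken from this PR. The compile-time `cfg` keeps the wide path out of default builds, and the runtime check avoids executing it on CPUs that lack the instructions.

```rust
/// Choose a backend name. The 512-bit path is only considered when the
/// crate is built with the (assumed) `vpclmulqdq` feature on x86_64.
#[allow(unexpected_cfgs)]
pub fn pick_backend() -> &'static str {
    #[cfg(all(target_arch = "x86_64", feature = "vpclmulqdq"))]
    {
        // Runtime detection: the CPU must support both AVX-512F and VPCLMULQDQ.
        if std::arch::is_x86_feature_detected!("avx512f")
            && std::arch::is_x86_feature_detected!("vpclmulqdq")
        {
            return "vpclmulqdq-512";
        }
    }
    // Default builds, other architectures, and older CPUs all land here.
    "fallback"
}
```

Built without the feature flag, `pick_backend()` always returns `"fallback"`, which is what keeps the change non-breaking for existing users.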

onethumb added 4 commits June 6, 2025 20:07
When using 512-bit registers, we need coefficient pairs for folding
at 256-byte distances, as opposed to the 128-byte folding distances
used with smaller registers.

Only for x86_64 CPUs supporting VPCLMULQDQ. Gated behind builds using
+nightly with the “vpclmulqdq” feature flag.

Provides nearly a 2x boost in throughput. CRC-64/NVME is now ~96 GiB/s on
Intel Sapphire Rapids (AWS EC2 c7i.metal-48xl), up from ~56 GiB/s.
Represents the new performance impact from the wider AVX-512 registers.