PDX is a data layout that stores vectors in a transposed, column-major order. This layout unleashes the true potential of dimension pruning, accelerating vanilla IVF indexes by large factors:
- ⚡ Up to 10x faster IVF searches than FAISS+AVX512.
- ⚡ Up to 30x faster exhaustive search.
- ⚡ Up to 2x faster raw distance kernels.
Pruning means deciding whether a vector is a neighbour of a query without checking all of its dimensions. The PDX layout unleashes the true potential of pruning algorithms such as ADSampling.
Pruning is especially effective for large embeddings (`d > 512`) and when targeting high recalls (`> 0.90`) or nearly exact results.
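The core idea can be sketched in plain Python (an illustration of generic partial-distance pruning, not ADSampling's actual statistical test): a candidate is discarded as soon as its partial distance over a prefix of dimensions already exceeds a pruning threshold, e.g., the distance of the current K-th nearest neighbour. The function name is hypothetical.

```python
def partial_l2_prune(query, candidate, threshold, step=16):
    """Accumulate the squared L2 distance `step` dimensions at a time;
    stop early once the partial distance already exceeds the threshold."""
    acc = 0.0
    for start in range(0, len(query), step):
        for i in range(start, min(start + step, len(query))):
            d = query[i] - candidate[i]
            acc += d * d
        if acc > threshold:   # already worse than the current K-th NN
            return None       # pruned: no need to scan more dimensions
    return acc                # survived: this is the exact distance

q = [0.0] * 1024
far = [1.0] * 1024    # pruned after the first 16 dimensions
near = [0.01] * 1024  # survives the full scan
d_far = partial_l2_prune(q, far, threshold=5.0)    # None (pruned)
d_near = partial_l2_prune(q, near, threshold=5.0)  # ≈ 0.1024 (exact)
```

The larger `d` is, the more work pruning saves: here only 16 of 1024 dimensions of `far` are ever touched.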
Down below, you will find benchmarks against FAISS.
Try PDX with your data using our Python bindings and examples. We have implemented PDX on Flat (`float32`) and Quantized (`8-bit`) IVF indexes, as well as in exhaustive-search settings.
- PDX is available for x86_64 (with AVX512), ARM, and Apple silicon
- Python 3.11 or higher
- FAISS with Python Bindings
- Clang++17 or higher
- CMake 3.26 or higher
- Clone and init submodules
```sh
git clone https://github.com/cwida/PDX
git submodule init
git submodule update
```
- [Optional] Install FFTW (for higher throughput)
```sh
wget https://www.fftw.org/fftw-3.3.10.tar.gz
tar -xvzf fftw-3.3.10.tar.gz
cd fftw-3.3.10
./configure --enable-float --enable-shared
sudo make
sudo make install
ldconfig
```
- Install Python dependencies and the bindings.
```sh
export CXX="/usr/bin/clang++-18" # Set the proper CXX first
pip install -r requirements.txt
python setup.py clean --all
python -m pip install .
```
- Run the examples under `/examples`:
```sh
# Creates an IVF index with FAISS on random data
# Then compares the search performance of PDXearch and FAISS
python ./examples/pdx_simple.py
```
For more details on the available examples and how to use your own data, refer to `/examples/README.md`.
Note
We heavily rely on FAISS to create the underlying IVF indexes.
We present single-threaded benchmarks against FAISS+AVX512 on an `r7iz.xlarge` instance (Intel Sapphire Rapids).
IVF2 tackles a bottleneck of IVF indexes: finding the nearest centroids. By clustering the original IVF centroids, we can use PDX to quickly scan them (thanks to pruning) without sacrificing recall. This achieves significant throughput improvements when paired with `8-bit` quantization.
Here, PDX paired with the pruning algorithm ADSampling on `float32` achieves significant speedups.
An exhaustive search scans every vector in the collection. An IVF index stored with the PDX layout can accelerate this dramatically without sacrificing recall, thanks to the reliable pruning of ADSampling.
The key observation here is that thanks to the underlying IVF index, the exhaustive search starts with the most promising clusters. A tight threshold is found early on, which enables the quick pruning of most candidates.
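This effect can be sketched in a few lines of Python (illustrative only; the function names are hypothetical, and the real PDXearch kernels prune partial distances rather than skipping full distance computations): visiting clusters in order of centroid distance fills the top-K early, so the K-th-best distance becomes a tight threshold for everything scanned afterwards.

```python
import heapq

def l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def exhaustive_ordered(query, clusters, k=2):
    """Exhaustively scan all clusters, most promising first; the running
    K-th-best distance acts as the pruning threshold."""
    order = sorted(clusters, key=lambda c: l2(query, c["centroid"]))
    heap = []    # max-heap via negated distances, size <= k
    pruned = 0
    for cluster in order:
        for vec in cluster["vectors"]:
            threshold = -heap[0][0] if len(heap) == k else float("inf")
            d = l2(query, vec)
            if d >= threshold:
                pruned += 1   # a pruning kernel would have stopped scanning
                continue      # this vector's dimensions early
            heapq.heappush(heap, (-d,))
            if len(heap) > k:
                heapq.heappop(heap)
    return sorted(-nd for (nd,) in heap), pruned

clusters = [
    {"centroid": [10.0, 10.0], "vectors": [[10.0, 10.0], [9.0, 9.0]]},
    {"centroid": [0.0, 0.0], "vectors": [[0.0, 1.0], [1.0, 1.0]]},
]
top, pruned = exhaustive_ordered([0.0, 0.0], clusters, k=2)
# top == [1.0, 2.0]; both vectors of the far cluster are pruned
```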
By creating random clusters with the PDX layout, you can still accelerate exhaustive search without an index. Unlike ADSampling, with BOND (our pruning algorithm), you can use the raw vectors. Gains vary depending on the distribution of the data.
Even without pruning, PDX distance kernels can be faster than traditional SIMD kernels on most CPU microarchitectures. For detailed information, check Figure 3 of our publication. You can also try it yourself in our playground here.
PDX is a transposed layout (a.k.a. columnar, or decomposed layout), which means that the same dimensions of different vectors are stored sequentially. This decomposition occurs within a block (e.g., a cluster in an IVF index).
We have evolved our layout from the one presented in our publication to reduce random access, and adapted it to work with `8-bit` and (in the future) `1-bit` vectors.
For `float32`, the first 25% of the dimensions are fully decomposed; we refer to this as the "vertical block." The remaining 75% are decomposed into sub-vectors of 64 dimensions; we refer to this as the "horizontal block." The vertical block is used for efficient pruning, and the horizontal block is accessed only for the candidates that were not pruned. Since the horizontal block is still decomposed every 64 dimensions, we keep a chance to prune the few remaining candidates every 64 dimensions.
The following image shows this layout. Storage is sequential from left to right, and from top to bottom.
Smaller data types are not friendly to PDX, as we must accumulate distances in wider types, resulting in asymmetry. We work around this by changing the PDX layout. For `8-bit` data, the vertical block is decomposed every 4 dimensions. This allows us to use dot-product instructions (`VPDPBUSD` on x86 and `UDOT`/`SDOT` on NEON) to compute L2 or IP kernels while still benefiting from PDX. The horizontal block remains decomposed every 64 dimensions.
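A scalar sketch of this grouping (hypothetical function names; the inner 4-wide multiply-accumulate loop stands in for one `VPDPBUSD`/`UDOT` step, which consumes exactly 4 byte pairs per vector per group):

```python
def u8_vertical_layout(vectors, group=4):
    """8-bit vertical block: decomposed every `group` dimensions, so each
    vector contributes 4 contiguous bytes per group."""
    d = len(vectors[0])
    out = []
    for start in range(0, d, group):
        for vec in vectors:
            out.extend(vec[start:start + group])
    return out

def ip_from_layout(query, layout, n, d, group=4):
    """Accumulate inner products of the query with all n vectors, reading
    4 contiguous bytes per vector per group (one dot-product step each)."""
    acc = [0] * n
    pos = 0
    for start in range(0, d, group):
        q = query[start:start + group]
        for v in range(n):
            chunk = layout[pos:pos + group]
            acc[v] += sum(x * y for x, y in zip(q, chunk))
            pos += group
    return acc

a = [1, 2, 3, 4, 5, 6, 7, 8]
b = [8, 7, 6, 5, 4, 3, 2, 1]
layout = u8_vertical_layout([a, b])
ips = ip_from_layout([1, 0, 1, 0, 1, 0, 1, 0], layout, n=2, d=8)
# ips == [16, 20], matching the naive dot products
```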
For Hamming/Jaccard kernels, we use a layout decomposed every 8 dimensions (naturally grouped into bytes), so the population counts can be accumulated in bytes. If `d > 256`, we flush the popcounts into a wider type every 32 words (corresponding to 256 dimensions). This has not been implemented in this repository yet, but you can find some promising benchmarks here.
- Add PDX to the VIBE benchmark.
- Benchmark filtering capabilities in PDX.
- Add unit tests.
- Implement multi-threading capabilities.
- Adaptive quantization on 8-bit and 4-bit.
- Write documentation.
Important
PDX is an ongoing research project. In its current state, it is not production-ready code.
To run our benchmark suite in C++, refer to BENCHMARKING.md.
If you use PDX for your research, consider citing us:
@article{kuffo2025pdx,
title={PDX: A Data Layout for Vector Similarity Search},
author={Kuffo, Leonardo and Krippner, Elena and Boncz, Peter},
journal={Proceedings of the ACM on Management of Data},
volume={3},
number={3},
pages={1--26},
year={2025},
publisher={ACM New York, NY, USA}
}
The code used for the experiments presented at SIGMOD'25 can be found in the `sigmod` branch.