FPGA Accelerator For Graph-based Vector Search

This is the repository for our VLDB'25 paper "Fast Graph Vector Search via Hardware Acceleration and Delayed-Synchronization Traversal".

@article{jiang2024accelerating,
  title={Fast Graph Vector Search via Hardware Acceleration and Delayed-Synchronization Traversal},
  author={Jiang, Wenqi and Hu, Hang and Hoefler, Torsten and Alonso, Gustavo},
  journal={Proceedings of the VLDB Endowment},
  year={2025}
}

Software baseline: https://github.com/HangHu-sys/vector_search_baselines

Related Projects

ISCA'25 RAGO: Systematic Performance Optimization for Retrieval-Augmented Generation Serving

System Performance Optimization for RAG.

Code: https://github.com/google/rago

@inproceedings{rago:isca:2025,
  title={RAGO: Systematic Performance Optimization for Retrieval-Augmented Generation Serving},
  author={Jiang, Wenqi and Subramanian, Suvinay and Graves, Cat and Alonso, Gustavo and Yazdanbakhsh, Amir and Dadu, Vidushi},
  booktitle={Proceedings of the 52nd Annual International Symposium on Computer Architecture},
  year={2025}
}

KDD'25 PipeRAG: Fast retrieval-augmented generation via adaptive pipeline parallelism

Efficient algorithm for iterative RAG.

Code: https://github.com/amazon-science/piperag

@article{jiang2025piperag,
  title={PipeRAG: Fast retrieval-augmented generation via adaptive pipeline parallelism},
  author={Jiang, Wenqi and Zhang, Shuai and Han, Boran and Wang, Jie and Wang, Yuyang Bernie and Kraska, Tim},
  journal={Proceedings of the 31st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining},
  year={2025}
}

VLDB'25 Chameleon: A Heterogeneous and Disaggregated Accelerator System for Retrieval-Augmented Language Models

Chameleon is a heterogeneous accelerator system for RAG serving. It prototypes FPGA-based accelerators for retrieval and GPU-based LLM inference.

Code: https://github.com/fpgasystems/Chameleon-RAG-Acceleration

@article{jiang2023chameleon,
  title={Chameleon: a heterogeneous and disaggregated accelerator system for retrieval-augmented language models},
  author={Jiang, Wenqi and Zeller, Marco and Waleffe, Roger and Hoefler, Torsten and Alonso, Gustavo},
  journal={Proceedings of the VLDB Endowment},
  year={2025}
}

SC'23 Co-design Hardware and Algorithm for Vector Search

Accelerating product-quantization-based vector search.

Code: https://github.com/WenqiJiang/SC-ANN-FPGA

@inproceedings{jiang2023co,
  title={Co-design hardware and algorithm for vector search},
  author={Jiang, Wenqi and Li, Shigang and Zhu, Yu and de Fine Licht, Johannes and He, Zhenhao and Shi, Runbin and Renggli, Cedric and Zhang, Shuai and Rekatsinas, Theodoros and Hoefler, Torsten and others},
  booktitle={Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis},
  pages={1--15},
  year={2023}
}

All Bitstreams & Experiments

Local Bitstreams & Experiments

Bitstreams

├── FPGA_multi_DDR
│   ├── FPGA_inter_query_v1.3_longer_FIFO_alt_PR
│   ├── FPGA_intra_query_v1.5_support_batching_longer_FIFO

Each bitstream may use different FIFO lengths so that it can pass place and route (P&R); these lengths are already set in constant.hpp. Also double-check the connectivity.cfg file, because different dimensionalities may need different P&R directives.

3 bitstreams for inter-query:

FPGA_inter_query_v1.3_longer_FIFO_alt_PR, with D=96, 128, 384 for Deep, SIFT, and SBERT; 4 channels

3 x 3 = 9 bitstreams for intra-query:

FPGA_intra_query_v1.5_support_batching_longer_FIFO, with D=96, 128, 384 for Deep, SIFT, and SBERT; and 1, 2, 4 channels for scalability tests
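The local bitstream matrix listed above can be enumerated with a short script, which is handy when scripting builds or experiments. This is only a minimal sketch: the names `DATASET_DIM`, `LOCAL_KERNELS`, and `local_bitstream_configs` are hypothetical and not part of the repository; only the kernel folder names, dimensionalities, and channel counts are taken from the text.

```python
from itertools import product

# Vector dimensionality per dataset, as listed above.
DATASET_DIM = {"Deep": 96, "SIFT": 128, "SBERT": 384}

# Memory channel counts built for each local kernel, as listed above.
LOCAL_KERNELS = {
    "FPGA_inter_query_v1.3_longer_FIFO_alt_PR": [4],
    "FPGA_intra_query_v1.5_support_batching_longer_FIFO": [1, 2, 4],
}


def local_bitstream_configs():
    """Return one (kernel, dataset, D, channels) tuple per local bitstream."""
    configs = []
    for kernel, channels in LOCAL_KERNELS.items():
        for (dataset, dim), ch in product(DATASET_DIM.items(), channels):
            configs.append((kernel, dataset, dim, ch))
    return configs


if __name__ == "__main__":
    for kernel, dataset, dim, ch in local_bitstream_configs():
        print(f"{kernel}: dataset={dataset}, D={dim}, channels={ch}")
```

Each tuple corresponds to one bitstream that would need its own FIFO lengths in constant.hpp and, possibly, its own connectivity.cfg.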

Experiments

See perf_test_scripts/README.md for the evaluation of throughput, latency, and recall for various settings.

Networked bitstreams

Bitstreams

├── networked_FPGA
│   ├── kernel
│   │   └── user_krnl
│   │       ├── FPGA_inter_query_v1_3
│   │       ├── FPGA_intra_query_v1_5

Each bitstream may use different FIFO lengths so that it can pass place and route (P&R); these lengths are already set in constant.hpp. Also double-check the connectivity.cfg file, because different dimensionalities may need different P&R directives.

For the networked version, the SBERT dataset does not work because of P&R problems with D=384.

2 bitstreams for inter-query:

FPGA_inter_query_v1_3, with D=96, 128; 4 channels

2 bitstreams for intra-query:

FPGA_intra_query_v1_5, with D=96, 128; 4 channels
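The networked configurations can be derived in the same spirit by dropping the dimensionality that fails P&R. This is a sketch under stated assumptions: `UNSUPPORTED_DIMS` and `networked_bitstream_configs` are hypothetical names, and the dataset-to-dimensionality mapping is the one given for the local bitstreams (D=96 for Deep, D=128 for SIFT, D=384 for SBERT).

```python
# Dataset dimensionalities, as given for the local bitstreams.
DATASET_DIM = {"Deep": 96, "SIFT": 128, "SBERT": 384}

# Networked kernel folders under networked_FPGA/kernel/user_krnl.
NETWORKED_KERNELS = ["FPGA_inter_query_v1_3", "FPGA_intra_query_v1_5"]

# All networked bitstreams listed above use 4 channels.
NETWORKED_CHANNELS = 4

# D=384 (SBERT) does not pass P&R in the networked version.
UNSUPPORTED_DIMS = {384}


def networked_bitstream_configs():
    """Return one (kernel, dataset, D, channels) tuple per networked bitstream."""
    return [
        (kernel, dataset, dim, NETWORKED_CHANNELS)
        for kernel in NETWORKED_KERNELS
        for dataset, dim in DATASET_DIM.items()
        if dim not in UNSUPPORTED_DIMS
    ]


if __name__ == "__main__":
    for cfg in networked_bitstream_configs():
        print(cfg)
```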

Experiments

See networked_FPGA/CPU_programs/README.md for the evaluation of latency for various settings.

Folder Organization

The most useful folders are shown below:

├── FPGA_multi_DDR
│   ├── FPGA_inter_query_v1.3_longer_FIFO_alt_PR
│   ├── FPGA_intra_query_v1.5_support_batching_longer_FIFO
├── FPGA_single_DDR
├── networked_FPGA
│   ├── CPU_programs
│   ├── kernel
│   │   └── user_krnl
│   │       ├── FPGA_inter_query_v1_3
│   │       ├── FPGA_intra_query_v1_5
│   ├── host
│   │   ├── FPGA_inter_query_v1_3
│   │   ├── FPGA_intra_query_v1_5
├── perf_test_scripts
│   └── saved_df
├── plots
│   ├── images
├── test_dataflow_feedback
└── unit_tests
    ├── bloom_and_hash
    ├── bloom_fetch_compute
    ├── compute_PE
    ├── fetch_vectors
    └── priority_queue

FPGA_multi_DDR: multi-DDR version FPGA implementations

networked_FPGA: networked version of the multi-DDR FPGA implementations, with exactly the same kernels as those in FPGA_multi_DDR

perf_test_scripts: all the scripts used to measure the performance of local FPGAs

plots: all the plotting scripts

(unused) FPGA_single_DDR: single-DDR version FPGA implementations. Used only for development.

(unused) test_dataflow_feedback: tests the behavior of the dataflow feedback loop. Used only for development.

(unused) unit_tests: benchmarks the performance of different building blocks. Used only for development.

fpgasystems/Falcon-accelerate-graph-vector-search
