Approximate Bag-of-Words Top-k Corpus Graphs

This repository contains the code for the ECIR 2025 short paper "Approximate Bag-of-Words Top-k Corpus Graphs" by Lachlan Dunn, Luke Gallagher, and Joel Mackenzie.

Citation information

```
@inproceedings{dg+25ecir,
 title = {Approximate Bag-of-Words Top-$k$ Corpus Graphs},
 author = {L. Dunn and L. Gallagher and J. Mackenzie},
 booktitle = {Proc. ECIR},
 year = {2025},
 pages = {174--182},
}
```

Acknowledgements

This work builds on prior work by Kulkarni et al., "Lexically-Accelerated Dense Retrieval", and MacAvaney et al., "Adaptive Re-Ranking with a Corpus Graph".

Setup

Configure the Python environment.
```
$ mkdir -p ~/.venvs
$ python3 -m venv ~/.venvs/docgraph
$ source ~/.venvs/docgraph/bin/activate
$ pip install -r requirements.txt
```

Download the data.
```
$ ./tools/download_data.sh
```

Set up dependencies.

Build PISA from revision bb2b3df and apply the LimitPairs patch implemented by Joel Mackenzie.

```
$ mkdir -p deps
$ git clone https://github.com/pisa-engine/pisa deps/pisa
$ cd deps/pisa
$ git reset --hard bb2b3df
$ git submodule update --init --recursive --depth 1
$ git am ../../graph/0001-joel-limitpairs.patch
$ mkdir -p build
$ cd build
$ cmake -DPISA_ENABLE_TESTING=OFF -DPISA_ENABLE_BENCHMARKING=OFF ..
$ make -j$(nproc)
$ cd ../../..  # return to the repository root for the remaining steps
```

The ciff2pisa binary is assumed to be available. If it is not, refer to the PISA ciff repo for installation instructions.
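
One way to confirm that ciff2pisa can be found before building the indexes is sketched below. The `require_bin` helper is hypothetical and not part of this repository; it only checks that a command name resolves in the current shell.

```shell
# Hypothetical helper (not part of this repo): succeed only if the named
# command can be resolved by the current shell.
require_bin() {
    if command -v "$1" >/dev/null 2>&1; then
        return 0
    fi
    echo "missing required binary: $1" >&2
    return 1
}
```

For example, `require_bin ciff2pisa || exit 1` at the top of a wrapper script would stop early with a readable message instead of failing partway through the index build.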

Build inverted indexes.
```
$ unxz -v data/msmarco-passage.pisa.bp.ciff.xz data/msmarco-passage.dt5q.pisa.bp.ciff.xz
$ ./index/build.sh
```
The indexes are provided in CIFF format. For reproducibility, note that document reordering was performed using faster graph bisection (revision 4ba3bb2) with the loggap gain function and a minimum postings length of 128.

Running the graph construction experiments

Run the graph construction timings.
```
$ unxz -v data/*.xz
$ ./graph/build.sh
```
Timing results can be found in graph/*.log.

Running the retrieval experiments

Run the non-graph baselines.

```
$ ./sysrun/baseline
```

The system runfiles will be in the runs directory. The stage0 runfiles are the combined BM25 runfiles from each track and are used in the re-ranking experiments.

```
runs
├── dt5q-bm25-dl19.res.gz
├── dt5q-bm25-dl20.res.gz
├── dt5q-bm25-tasb-dl19.res.gz
├── dt5q-bm25-tasb-dl20.res.gz
├── dt5q-stage0.res.gz
├── original-bm25-dl19.res.gz
├── original-bm25-dl20.res.gz
├── original-bm25-tasb-dl19.res.gz
├── original-bm25-tasb-dl20.res.gz
├── original-stage0.res.gz
├── tasb-dl19.res.gz
└── tasb-dl20.res.gz
```
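
To illustrate what a combined stage0 runfile could look like, the sketch below concatenates per-track gzipped runfiles into one. This is an assumption about how such a combination might be done, not a description of what ./sysrun/baseline actually does; the `combine_runs` helper is hypothetical, and it assumes the query ids of the input runs do not overlap.

```shell
# Hypothetical helper (not part of this repo): concatenate gzipped TREC
# runfiles into a single gzipped runfile.
# Usage: combine_runs OUT.gz IN1.gz [IN2.gz ...]
# Assumes query ids are disjoint across the inputs, so plain
# concatenation yields a valid combined run.
combine_runs() {
    if [ "$#" -lt 2 ]; then
        echo "usage: combine_runs OUT.gz IN.gz [IN.gz ...]" >&2
        return 1
    fi
    out=$1
    shift
    # Decompress every input in order and recompress as one stream.
    gzip -dc "$@" | gzip -c > "$out"
}
```

A call such as `combine_runs /tmp/stage0-check.res.gz runs/original-bm25-dl19.res.gz runs/original-bm25-dl20.res.gz` writes to a scratch path, leaving the runfiles produced by the baseline script untouched.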

Run the re-ranking phase.
```
$ ./sysrun/timing
```
Timing results and runfiles can be found in the runs directory.

Evaluation

This work used trec_eval (v9.0.8) for evaluation.
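
As a sketch of how a gzipped runfile from the runs directory might be scored, the helper below decompresses the run to a temporary file and passes it to trec_eval. The `eval_run` helper, the `TREC_EVAL` override variable, the default measure, and any qrels path you supply are assumptions for illustration; the exact evaluation settings used in the paper are not stated in this README.

```shell
# Hypothetical helper (not part of this repo): score one gzipped runfile.
# Usage: eval_run QRELS RUN.gz [MEASURE]
# TREC_EVAL may point at the trec_eval binary; defaults to "trec_eval".
eval_run() {
    qrels=$1
    run_gz=$2
    measure=${3:-ndcg_cut.10}   # assumed default, not taken from the paper
    tmp_run=$(mktemp) || return 1
    # Decompress the run to a plain-text temporary file first.
    gzip -dc "$run_gz" > "$tmp_run" || { rm -f "$tmp_run"; return 1; }
    # -m selects the measure that trec_eval reports.
    "${TREC_EVAL:-trec_eval}" -m "$measure" "$qrels" "$tmp_run"
    status=$?
    rm -f "$tmp_run"
    return "$status"
}
```

With a qrels file at a path of your choosing, `eval_run path/to/qrels.txt runs/tasb-dl19.res.gz` would report the default measure, and a third argument such as `recip_rank` selects a different one.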

Query reduction heuristics

Pre-computed query sets are provided by download_data.sh. To build them yourself instead, run ./tools/build_qryheur.sh. Each query reduction heuristic (title+url, tfidf, and dt5q) has an associated script in the tools directory.
