This repo contains the code corresponding to the ECIR 2025 short paper Approximate Bag-of-Words Top-k Corpus Graphs by Lachlan Dunn, Luke Gallagher, and Joel Mackenzie.
@inproceedings{dg+25ecir,
title = {Approximate Bag-of-Words Top-$k$ Corpus Graphs},
author = {L. Dunn and L. Gallagher and J. Mackenzie},
booktitle = {Proc. ECIR},
year = {2025},
pages = {174--182},
}
This work builds on prior work by Kulkarni et al. (Lexically-Accelerated Dense Retrieval) and MacAvaney et al. (Adaptive Re-Ranking with a Corpus Graph).
-
Configure Python environment
$ mkdir -p ~/.venvs
$ python3 -m venv ~/.venvs/docgraph
$ source ~/.venvs/docgraph/bin/activate
$ pip install -r requirements.txt
-
Download data
$ ./tools/download_data.sh
-
Set up dependencies.
Build PISA from revision bb2b3df and apply the patch for LimitPairs implemented by Joel Mackenzie.
$ mkdir -p deps
$ git clone https://github.com/pisa-engine/pisa deps/pisa
$ cd deps/pisa
$ git reset --hard bb2b3df
$ git submodule update --init --recursive --depth 1
$ git am ../../graph/0001-joel-limitpairs.patch
$ mkdir -p build
$ cd build
$ cmake -DPISA_ENABLE_TESTING=OFF -D PISA_ENABLE_BENCHMARKING=OFF ..
$ make -j$(nproc)
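Before running the build and indexing steps, it can help to confirm that the required command-line tools are installed. The following is a minimal sketch and not part of this repo: `require_tools` is a hypothetical helper, and the tool list in the usage note is an assumption drawn from the commands shown in this README.

```shell
# Hypothetical helper (not part of this repo).
# Reports which of the given commands are missing from PATH.
# Returns 0 when all are present, 1 otherwise.
require_tools() {
    local missing=0 tool
    for tool in "$@"; do
        if ! command -v "$tool" > /dev/null 2>&1; then
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

For example, `require_tools git cmake make python3 ciff2pisa` prints one line per missing tool and returns a non-zero status if any are absent.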
It is assumed that the ciff2pisa binary is available. If required, refer to the PISA ciff repo for installation instructions.
-
Build inverted indexes.
$ unxz -v data/msmarco-passage.pisa.bp.ciff.xz data/msmarco-passage.dt5q.pisa.bp.ciff.xz
$ ./index/build.sh
The indexes are provided in CIFF format. For reproducibility, note that document reordering was performed using faster graph bisection (revision 4ba3bb2) with the loggap gain function and minimum postings length of 128.
-
Run the graph construction timings.
$ unxz -v data/*.xz
$ ./graph/build.sh
-
Timing results can be found in graph/*.log.
-
Run the non-graph baselines.
$ ./sysrun/baseline
The system runfiles will be in the runs directory. The stage0 runfiles are the combined BM25 runfiles from each track and are used in the re-ranking experiments.
runs
├── dt5q-bm25-dl19.res.gz
├── dt5q-bm25-dl20.res.gz
├── dt5q-bm25-tasb-dl19.res.gz
├── dt5q-bm25-tasb-dl20.res.gz
├── dt5q-stage0.res.gz
├── original-bm25-dl19.res.gz
├── original-bm25-dl20.res.gz
├── original-bm25-tasb-dl19.res.gz
├── original-bm25-tasb-dl20.res.gz
├── original-stage0.res.gz
├── tasb-dl19.res.gz
└── tasb-dl20.res.gz
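As a quick sanity check after the baselines finish, the number of distinct queries in a runfile can be counted. This is a minimal sketch under one assumption: that the .res.gz files are gzipped, whitespace-separated text with the query ID in the first column (as in the standard TREC run format). `count_run_queries` is a hypothetical helper, not a script shipped with this repo.

```shell
# Hypothetical helper (not part of this repo).
# Prints the number of distinct query IDs in a gzipped runfile.
# Assumes the first whitespace-separated column is the query ID.
count_run_queries() {
    gunzip -c "$1" | awk '!seen[$1]++ { n++ } END { print n + 0 }'
}
```

For example, `count_run_queries runs/original-bm25-dl19.res.gz` prints a single integer, which can be compared against the number of queries expected for that track.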
-
Run the re-ranking phase.
$ ./sysrun/timing
-
Timing results and runfiles can be found in the runs directory.
This work used trec_eval (v9.0.8) for evaluation.
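The exact evaluation commands are not listed here, so the following is only a sketch of how a gzipped runfile might be scored with trec_eval. `eval_run` is a hypothetical helper, the qrels path in the usage note is a placeholder, and the metrics shown are illustrative rather than the ones reported in the paper. The `TREC_EVAL` variable is an assumption of this sketch that lets the binary location be overridden.

```shell
# Hypothetical helper (not part of this repo).
# Usage: eval_run QRELS RUN_GZ [trec_eval options...]
# TREC_EVAL may point at the trec_eval binary; defaults to trec_eval on PATH.
eval_run() {
    local qrels="$1" run_gz="$2"
    shift 2
    local tmp
    tmp="$(mktemp)" || return 1
    # Decompress to a regular file so trec_eval can be given a plain path.
    if ! gunzip -c "$run_gz" > "$tmp"; then
        rm -f "$tmp"
        return 1
    fi
    # Remaining arguments are passed through as trec_eval options.
    "${TREC_EVAL:-trec_eval}" "$@" "$qrels" "$tmp"
    local rc=$?
    rm -f "$tmp"
    return "$rc"
}
```

For example, `eval_run path/to/qrels.txt runs/original-bm25-dl19.res.gz -m ndcg_cut.10 -m recip_rank` would report NDCG@10 and reciprocal rank for that run, given a matching qrels file.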
The query sets can be built from scratch rather than using the pre-computed versions from download_data.sh. Each query reduction heuristic (title+url, tfidf and dt5q) has an associated script in the tools directory. To build them, run ./tools/build_qryheur.sh.