
lgrz/approx-bow-corpusgraph


Approximate Bag-of-Words Top-k Corpus Graphs

This repo contains the code corresponding to the ECIR 2025 short paper Approximate Bag-of-Words Top-k Corpus Graphs by Lachlan Dunn, Luke Gallagher, and Joel Mackenzie.

Citation information

@inproceedings{dg+25ecir,
 title = {Approximate Bag-of-Words Top-$k$ Corpus Graphs},
 author = {L. Dunn and L. Gallagher and J. Mackenzie},
 booktitle = {Proc. ECIR},
 year = {2025},
 pages = {174--182},
}

Acknowledgements

This work builds on prior work by Kulkarni et al., Lexically-Accelerated Dense Retrieval, and MacAvaney et al., Adaptive Re-Ranking with a Corpus Graph.

Setup

  1. Configure Python environment

    $ mkdir -p ~/.venvs
    $ python3 -m venv ~/.venvs/docgraph
    $ source ~/.venvs/docgraph/bin/activate
    $ pip install -r requirements.txt
    
  2. Download data

    $ ./tools/download_data.sh
    
  3. Set up dependencies.

    Build PISA from revision bb2b3df and apply the LimitPairs patch implemented by Joel Mackenzie.

    $ mkdir -p deps
    $ git clone https://github.com/pisa-engine/pisa deps/pisa
    $ cd deps/pisa
    $ git reset --hard bb2b3df
    $ git submodule update --init --recursive --depth 1
    $ git am ../../graph/0001-joel-limitpairs.patch
    $ mkdir -p build
    $ cd build
    $ cmake -DPISA_ENABLE_TESTING=OFF -DPISA_ENABLE_BENCHMARKING=OFF ..
    $ make -j$(nproc)
    

    The remaining steps assume that the ciff2pisa binary is available. If it is not, refer to the PISA ciff repo for installation instructions.

  4. Build inverted indexes.

    $ unxz -v data/msmarco-passage.pisa.bp.ciff.xz data/msmarco-passage.dt5q.pisa.bp.ciff.xz
    $ ./index/build.sh
    

    The indexes are provided in CIFF format. For reproducibility, note that documents were reordered using faster graph bisection (revision 4ba3bb2) with the loggap gain function and a minimum postings length of 128.
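
Before working through the setup steps above, it can help to confirm that the command-line tools they invoke are installed. The following is a minimal sketch and is not part of this repository: the function name `check_tools` is hypothetical, and the tool list is simply taken from the commands shown in this section.

```shell
#!/bin/sh
# Hypothetical preflight check (not part of this repo).
# Prints each missing tool and returns non-zero if any is absent.
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Tools invoked by the setup steps in this section.
check_tools python3 git cmake make unxz ciff2pisa \
    || echo "install the missing tools before continuing"
```

If `ciff2pisa` is reported as missing, the PISA ciff repo mentioned in step 3 is the place to look.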

Running the graph construction experiments

  1. Run the graph construction timings.

    $ unxz -v data/*.xz
    $ ./graph/build.sh
    
  2. Timing results can be found in graph/*.log.
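
To skim the timing logs without opening each one, a small helper along these lines may be useful. This is only a sketch: `show_logs` is a hypothetical name, and because the log format is not described here, it makes no attempt to parse the timings and only prints the tail of each file.

```shell
#!/bin/sh
# Hypothetical helper (not part of this repo): print the last few lines
# of every .log file in a directory (default: graph, 5 lines each).
show_logs() {
    dir=${1:-graph}
    lines=${2:-5}
    for f in "$dir"/*.log; do
        [ -e "$f" ] || continue   # skip when the glob matched nothing
        echo "== $f"
        tail -n "$lines" "$f"
    done
}

show_logs graph
```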

Running the retrieval experiments

  1. Run the non-graph baselines.

    $ ./sysrun/baseline
    

    The system runfiles will be in the runs directory. The stage0 runfiles are the combined BM25 runfiles from each track and are used in the re-ranking experiments.

    runs
    ├── dt5q-bm25-dl19.res.gz
    ├── dt5q-bm25-dl20.res.gz
    ├── dt5q-bm25-tasb-dl19.res.gz
    ├── dt5q-bm25-tasb-dl20.res.gz
    ├── dt5q-stage0.res.gz
    ├── original-bm25-dl19.res.gz
    ├── original-bm25-dl20.res.gz
    ├── original-bm25-tasb-dl19.res.gz
    ├── original-bm25-tasb-dl20.res.gz
    ├── original-stage0.res.gz
    ├── tasb-dl19.res.gz
    └── tasb-dl20.res.gz
    
  2. Run the re-ranking phase.

    $ ./sysrun/timing
    
  3. Timing results and runfiles can be found in the runs directory.
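
A quick sanity check on a runfile is to count the queries it covers and the results per query. The sketch below assumes the `.res.gz` files use the standard six-column TREC run format (`qid Q0 docid rank score tag`), which is not stated above. The helper names `count_queries` and `results_per_query` are hypothetical.

```shell
#!/bin/sh
# Assumption: runfiles are gzip-compressed TREC runs with columns
#   qid Q0 docid rank score tag

# Print the number of distinct query ids in a runfile.
count_queries() {
    gunzip -c "$1" | awk '!seen[$1]++ { n++ } END { print n + 0 }'
}

# Print one "qid count" line per query, sorted by qid.
results_per_query() {
    gunzip -c "$1" | awk '{ c[$1]++ } END { for (q in c) print q, c[q] }' | sort
}

run=runs/original-bm25-dl19.res.gz
if [ -f "$run" ]; then
    count_queries "$run"
    results_per_query "$run" | head -n 5
fi
```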

Evaluation

This work used trec_eval (v9.0.8) for evaluation.
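
As a rough sketch of how a runfile might be scored, the helper below decompresses a `.res.gz` file and passes it to trec_eval. The measures shown (`ndcg_cut.10` and `recip_rank`) are illustrative and are not necessarily those reported in the paper. The qrels path is a placeholder, and `eval_run` and the `TREC_EVAL` override variable are hypothetical names.

```shell
#!/bin/sh
# trec_eval does not read gzip-compressed input, so decompress the runfile first.
# TREC_EVAL (hypothetical) may be set to point at a specific trec_eval binary.
eval_run() {
    qrels=$1
    run_gz=$2
    run_txt=$(mktemp) || return 1
    gunzip -c "$run_gz" > "$run_txt" || { rm -f "$run_txt"; return 1; }
    status=0
    # -m selects a measure; these two are examples only.
    "${TREC_EVAL:-trec_eval}" -m ndcg_cut.10 -m recip_rank "$qrels" "$run_txt" \
        || status=$?
    rm -f "$run_txt"
    return "$status"
}

# Placeholder qrels path; substitute the file for the track being evaluated.
qrels=data/dl19.qrels
if [ -f "$qrels" ] && command -v trec_eval >/dev/null 2>&1; then
    eval_run "$qrels" runs/original-bm25-dl19.res.gz
fi
```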

Query reduction heuristics

To build the query sets yourself, rather than using the pre-computed versions fetched by download_data.sh, run ./tools/build_qryheur.sh. Each query reduction heuristic (title+url, tfidf, and dt5q) has an associated script in the tools directory.
