UniIR for Pyserini

🌐 Homepage | 🤗 Dataset(M-BEIR Benchmark) | 🤗 Checkpoints(UniIR models) | 📖 arXiv | Original UniIR GitHub

This repository contains a fork of the original UniIR codebase, modified for easy Pyserini integration and repackaged as a PyPI package.

current_version = "0.1.0"

Installation

Install the package directly from PyPI:

pip install uniir_for_pyserini

Or, install from source:

git clone https://github.com/castorini/UniIR-for-Pyserini.git
cd UniIR-for-Pyserini
pip install .

Then, install the CLIP model:

pip install git+https://github.com/openai/CLIP.git

Quick Start

The following code snippet shows how UniIR models can be used with Pyserini's encoding and indexing pipeline. In this example, clip-sf-large model is used to encode the cirr_task7 corpus into dense vector representations. Similar steps can be done for on-the-fly query encoding using the QueryEncoder.

For full compatible use and features, please use/refer to these wrapper classes in Pyserini.

# Encoding and Indexing Steps
from pyserini.encode import JsonlCollectionIterator
from pyserini.encode.optional import FaissRepresentationWriter
from uniir_for_pyserini.uniir_corpus_encoder import CorpusEncoder

MBEIR_FIELDS = ['img_path', 'txt', 'modality', 'did']

mbeir_corpus_encoder = CorpusEncoder("clip_sf_large")

collection_iterator = JsonlCollectionIterator(  
    'collections/M-BEIR/mbeir_cirr_task7_cand_pool.jsonl',  
    fields=MBEIR_FIELDS,
    docid_field='did'
)

embedding_writer = FaissRepresentationWriter(
    'indexes/cirr.clip-sf-large'
)

with embedding_writer:
    for batch_info in collection_iterator(32):
        kwargs = {'fp16': True}
        for field_name in MBEIR_FIELDS:
            kwargs[f'{field_name}s'] = batch_info[field_name] 
        
        embeddings = mbeir_corpus_encoder.encode(**kwargs)
        batch_info['vector'] = embeddings
        embedding_writer.write(batch_info, MBEIR_FIELDS) 

# Searching Step
from pyserini.search.faiss import FaissSearcher
from pyserini.query_iterator import MBEIRQueryIterator
from uniir_for_pyserini.uniir_query_encoder import QueryEncoder

mbeir_query_encoder = QueryEncoder("clip_sf_large")

searcher = FaissSearcher(  
        'indexes/cirr.clip-sf-large',
        mbeir_query_encoder  
    )

query_iterator = MBEIRQueryIterator.from_topics('mbeir_cirr_task7_test.jsonl')

results = {}    
for qid, query_data in query_iterator:  
    # query_data now contains the structured M-BEIR format:  
    # {'qid', 'query_txt', 'query_img_path', 'query_modality', 'pos_cand_list'}  
      
    hits = searcher.search(query_data, k=1000) 
    results[qid] = [(hit.docid, hit.score) for hit in hits]

Available Models

Note: L2 Norm isn't applied during encoding because it is applied in the UniIR wrapper classes in Pyserini

This package supports the following UniIR models from the TIGER-Lab UniIR Hugging Face Hub:

clip_sf_large
blip_ff_large

Contact

For contact regarding the Pyserini integration section, please email Sahel Sharifymoghaddam or Daniel Guo.

For contact regarding the original UniIR codebase, please email the authors of the original UniIR repository.

Citation

If you use this work with Pyserini, please cite Pyserini in addition to the original UniIR paper:

@article{wei2023uniir,
  title={Uniir: Training and benchmarking universal multimodal information retrievers},
  author={Wei, Cong and Chen, Yang and Chen, Haonan and Hu, Hexiang and Zhang, Ge and Fu, Jie and Ritter, Alan and Chen, Wenhu},
  journal={arXiv preprint arXiv:2311.17136},
  year={2023}
}

@INPROCEEDINGS{Lin_etal_SIGIR2021_Pyserini,
   author = "Jimmy Lin and Xueguang Ma and Sheng-Chieh Lin and Jheng-Hong Yang and Ronak Pradeep and Rodrigo Nogueira",
   title = "{Pyserini}: A {Python} Toolkit for Reproducible Information Retrieval Research with Sparse and Dense Representations",
   booktitle = "Proceedings of the 44th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2021)",
   year = 2021,
   pages = "2356--2362",
}

📄 License

This project is licensed under the Apache 2.0 License. See the LICENSE file for details.

Name		Name	Last commit message	Last commit date
Latest commit History 99 Commits
docs/images		docs/images
src		src
.gitignore		.gitignore
LICENSE		LICENSE
MANIFEST.in		MANIFEST.in
README.md		README.md
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

UniIR for Pyserini

Installation

Quick Start

Available Models

Contact

Citation

📄 License

About

Uh oh!

Releases

Packages

Languages

License

castorini/UniIR-for-Pyserini

Folders and files

Latest commit

History

Repository files navigation

UniIR for Pyserini

Installation

Quick Start

Available Models

Contact

Citation

📄 License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages