This repository was archived by the owner on Oct 31, 2023. It is now read-only.

mismatch between encoded results and wiki passages #250

@Hannibal046

Hi, thanks so much for the great work. I have a question about the number of wiki passages versus the number of encoded embeddings in the index. After downloading the data as instructed, I found that the index holds fewer entries than there are passages:

import csv
import pickle

# Count the entries across all 50 embedding shards.
n_embedding = 0
for idx in range(50):
    index_path = f"DPR/dpr/downloads/data/retriever_results/nq/single/wikipedia_passages_{idx}.pkl"
    with open(index_path, "rb") as f:
        data = pickle.load(f)
    n_embedding += len(data)

# Count the passages in the TSV, skipping the header row.
n_doc = 0
wikidata_path = "DPR/dpr/downloads/data/wikipedia_split/psgs_w100.tsv"
with open(wikidata_path) as f:
    reader = csv.reader(f, delimiter="\t")
    for row in reader:
        if row[0] == "id":
            continue
        n_doc += 1

print("n_embedding=", n_embedding)
print("n_doc=", n_doc)

The results are:

n_embedding= 21015300
n_doc= 21015324
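
That is a gap of 24 passages. One way to narrow it down is to list the passage ids that have no embedding. Below is a minimal sketch, not a confirmed diagnosis. It assumes each shard unpickles to a sequence of `(passage_id, vector)` pairs and that ids are written the same way in both files. If the shard ids carry a prefix, they need normalizing first. The helper names `iter_embedding_ids`, `iter_passage_ids`, and `missing_ids` are made up for illustration.

```python
import csv
import pickle
from typing import Iterable, Iterator, List


def iter_embedding_ids(shard_paths: Iterable[str]) -> Iterator[str]:
    """Yield the passage id of every entry in every embedding shard.

    Assumption: each shard is a pickled sequence of (passage_id, vector) pairs.
    """
    for path in shard_paths:
        with open(path, "rb") as f:
            for entry in pickle.load(f):
                yield str(entry[0])


def iter_passage_ids(tsv_path: str) -> Iterator[str]:
    """Yield the id column of the passage TSV, skipping the header row."""
    with open(tsv_path, newline="") as f:
        for row in csv.reader(f, delimiter="\t"):
            if row and row[0] != "id":
                yield row[0]


def missing_ids(passage_ids: Iterable[str], embedding_ids: Iterable[str]) -> List[str]:
    """Return passage ids with no matching embedding id, in passage order."""
    encoded = set(embedding_ids)  # holds every embedding id in memory
    return [pid for pid in passage_ids if pid not in encoded]
```

To use it, build `shard_paths` from the 50 `wikipedia_passages_{idx}.pkl` paths in the script above and call `missing_ids(iter_passage_ids(wikidata_path), iter_embedding_ids(shard_paths))`. The set of roughly 21 million id strings needs a few gigabytes of RAM.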
