This repository was archived by the owner on Oct 31, 2023. It is now read-only.

'dpr_all_documents' is not defined #249

@golubovic

Hi,

I am running into an issue with the global variable ‘dpr_all_documents’ that appears to involve tokenizer parallelism; please see the logs below. This issue has been raised before in the DPR repo.

Note that:

  1. The all_docs size is as expected (for testing purposes I use a test document set of 5 entries rather than the wiki dataset; please see the logs below).
  2. validation_workers is set to 1 in dense_retriever.yaml (that setting isn't the problem; I set it to 1 only as a safety measure).
  3. I have tried setting TOKENIZERS_PARALLELISM=false (it makes no difference). NOTE: Transformers library "0.8.0rc4" currently has an issue with this setting not taking effect.
  4. I have tried downgrading the transformers and tokenizers libraries to previous versions (no success). A good article/comment by [Allohvk] on what is going on with the Rust tokenizers used by Hugging Face can be found here: https://stackoverflow.com/questions/62691279/how-to-disable-tokenizers-parallelism-true-false-warning
  5. I have tried refactoring dpr_all_documents into a regular function parameter and removing the ‘global’ definition; that, however, results in a ‘KeyError’ exception for the id_prefix of the datasource defined in default_sources.yaml.
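
To help narrow down points 2 and 3, here is a minimal diagnostic sketch (not part of DPR; the function name describe_environment is hypothetical). It assumes the environment variable has to be set before transformers/tokenizers is imported, and it reports which multiprocessing start method the interpreter uses, since that decides whether worker processes inherit globals assigned in the parent.

```python
import multiprocessing as mp
import os

# Assumption: set this before importing transformers/tokenizers so the
# Rust tokenizers see it when they are first loaded.
os.environ["TOKENIZERS_PARALLELISM"] = "false"


def describe_environment():
    """Return the settings that matter for the worker pool (hypothetical helper)."""
    return {
        # "fork", "spawn" or "forkserver"; with "spawn" a worker starts from a
        # fresh interpreter and does not see globals assigned in the parent.
        "start_method": mp.get_start_method(),
        "tokenizers_parallelism": os.environ.get("TOKENIZERS_PARALLELISM"),
    }


if __name__ == "__main__":
    print(describe_environment())
```

If the reported start method is "spawn" (the default on macOS since Python 3.8), the tokenizer setting is probably unrelated to the NameError.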

Please let me know if you have any questions.

Thanks,
Mladen

Logs:
[2023-09-21 07:35:40,997][dpr.models.hf_models][INFO] - Initializing HF BERT Encoder. cfg_name=bert-base-uncased
[2023-09-21 07:35:43,260][dpr.models.hf_models][INFO] - Initializing HF BERT Encoder. cfg_name=bert-base-uncased
[2023-09-21 07:35:44,405][root][INFO] - Loading saved model state ...
[2023-09-21 07:35:44,611][root][INFO] - Selecting standard question encoder
[2023-09-21 07:35:44,677][root][INFO] - Encoder vector_size=768
[2023-09-21 07:35:44,677][root][INFO] - qa_dataset: dpr_ds_retreiving_questions
[2023-09-21 07:35:44,680][root][INFO] - questions len 6
[2023-09-21 07:35:44,680][root][INFO] - questions_text len 0
[2023-09-21 07:35:44,680][root][INFO] - Local Index class <class 'dpr.indexer.faiss_indexers.DenseFlatIndexer'>
[2023-09-21 07:35:44,680][root][INFO] - Using special token None
[2023-09-21 07:35:45,875][root][INFO] - Total encoded queries tensor torch.Size([6, 768])
[2023-09-21 07:35:45,877][root][INFO] - ctx_sources: <class 'dpr.data.retriever_data.CsvCtxSrc'>
[2023-09-21 07:35:45,877][root][INFO] - id_prefixes per dataset: ['ds_default_sources_yaml_prefix:']
[2023-09-21 07:35:45,877][root][INFO] - ctx_files_patterns: ['/Users/directory/Developer/DPR-main/checkpoints/generated_embeddings_0']
[2023-09-21 07:35:45,878][root][INFO] - Embeddings files id prefixes: ['ds_default_sources_yaml_prefix:']
[2023-09-21 07:35:45,878][root][INFO] - Reading all passages data from files: ['/Users/directory/Developer/DPR-main/checkpoints/generated_embeddings_0']
[2023-09-21 07:35:45,878][root][INFO] - Reading file /Users/directory/Developer/DPR-main/checkpoints/generated_embeddings_0
[2023-09-21 07:35:45,880][root][INFO] - data indexed 5
[2023-09-21 07:35:45,880][root][INFO] - Total data indexed 5
[2023-09-21 07:35:45,880][root][INFO] - Data indexing completed.
[2023-09-21 07:35:45,880][root][INFO] - Serializing index to /Users/directory/Developer/DPR-main/checkpoints/faiss_index_ctx
[2023-09-21 07:35:45,883][root][INFO] - index search time: 0.002260 sec.
[2023-09-21 07:35:45,884][dpr.data.retriever_data][INFO] - Reading file /Users/directory/Developer/DPR-main/dpr/downloads/data/wikipedia_split/psgs_w100-s.tsv
[2023-09-21 07:35:45,885][root][INFO] - Loaded ctx data: 5
[2023-09-21 07:35:45,885][root][INFO] - validating passages. size=5
[2023-09-21 07:35:45,885][dpr.data.qa_validation][INFO] - all_docs size 5
[2023-09-21 07:35:45,885][dpr.data.qa_validation][INFO] - dpr_all_documents size 5
[2023-09-21 07:35:45,925][dpr.data.qa_validation][INFO] - Matching answers in top docs...
2023-09-21 07:35:49,689 [INFO] faiss.loader: Loading faiss with AVX2 support.
2023-09-21 07:35:49,717 [INFO] faiss.loader: Successfully loaded faiss with AVX2 support.
/Users/directory/Developer/DPR-main/dense_retriever.py:472: UserWarning:
The version_base parameter is not specified.
Please specify a compatability version level, or None.
Will assume defaults for version 1.1
@hydra.main(config_path="conf", config_name="dense_retriever")
Error executing job with overrides: []
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/Users/directory/.pyenv/versions/3.9.12/lib/python3.9/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/Users/directory/.pyenv/versions/3.9.12/lib/python3.9/multiprocessing/pool.py", line 48, in mapstar
    return list(map(*args))
  File "/Users/directory/Developer/DPR-main/dpr/data/qa_validation.py", line 127, in check_answer
    doc = dpr_all_documents[doc_id]
NameError: name 'dpr_all_documents' is not defined
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/directory/Developer/DPR-main/dense_retriever.py", line 628, in main
    questions_doc_hits = validate(
  File "/Users/directory/Developer/DPR-main/dense_retriever.py", line 309, in validate
    match_stats = calculate_matches(passages, answers, result_ctx_ids, workers_num, match_type)
  File "/Users/directory/Developer/DPR-main/dpr/data/qa_validation.py", line 68, in calculate_matches
    scores = processes.map(get_score_partial, questions_answers_docs)
  File "/Users/directory/.pyenv/versions/3.9.12/lib/python3.9/multiprocessing/pool.py", line 364, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/Users/directory/.pyenv/versions/3.9.12/lib/python3.9/multiprocessing/pool.py", line 771, in get
    raise self._value
NameError: name 'dpr_all_documents' is not defined
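
The traceback is consistent with the worker process never receiving the global. On macOS with Python 3.8+ the default multiprocessing start method is "spawn", so each worker re-imports the modules (the faiss loader and Hydra messages repeated after "Matching answers in top docs..." point the same way), and a global assigned at runtime in the parent does not exist there. A minimal sketch of one possible workaround is to hand the documents to every worker through the pool initializer. Everything below is a simplified stand-in, not the actual DPR code: _worker_docs, init_worker, and the reduced check_answer and calculate_matches signatures are assumptions made for illustration.

```python
import multiprocessing as mp

# Hypothetical stand-in for the module-level global in dpr/data/qa_validation.py.
_worker_docs = None


def init_worker(all_docs):
    """Runs once inside every worker process and defines the global there."""
    global _worker_docs
    _worker_docs = all_docs


def check_answer(item):
    """Simplified check: does any answer string occur in the passage text?"""
    doc_id, answers = item
    text = _worker_docs[doc_id].lower()
    return any(answer.lower() in text for answer in answers)


def calculate_matches(all_docs, items, workers_num=1):
    # The initializer is called in each worker regardless of the start method,
    # so the lookup table is available under both "fork" and "spawn".
    with mp.Pool(processes=workers_num,
                 initializer=init_worker,
                 initargs=(all_docs,)) as pool:
        return pool.map(check_answer, items)


if __name__ == "__main__":
    docs = {"p:1": "Paris is the capital of France.",
            "p:2": "Berlin is in Germany."}
    items = [("p:1", ["paris"]), ("p:2", ["Madrid"])]
    print(calculate_matches(docs, items))
```

With this pattern the document dictionary is pickled once per worker rather than once per task, which matters for the full wiki dataset more than for a 5-entry test set. The keys in the dictionary still have to match the doc ids passed to the workers, including any id_prefix.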
