What are the files that Haystack downloads to the cache directory, and how can I create a local package? #2066
Hi, when I am using a new model, I notice there are files downloaded to the cache directory. I tried downloading the model locally from Hugging Face, but I still need what is in the cache directory.
Hi @asharm0662, if you would like to pre-fill the cache, I recommend that you have a look at how we create our Docker files with the models already cached: https://github.com/deepset-ai/haystack/pull/1978/files

However, when you run your cloud-deployed application for the first time (with an internet connection), the cache will be filled automatically. Is there any particular reason why you want to make sure that the model is already in the cache even before using it for the first time?
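As a minimal sketch of that idea (not the exact approach in the linked PR), you can pre-fill the cache at Docker build time by loading the model once with the transformers auto classes; the model name here is just the one discussed in this thread:

```python
def prefill_cache(model_name: str) -> None:
    """Download a model and its tokenizer into the local transformers cache.

    Sketch only: assumes the `transformers` package is installed (haystack
    pulls it in as a dependency). The import is done inside the function so
    the sketch can be defined even where transformers is not installed.
    """
    from transformers import AutoModelForQuestionAnswering, AutoTokenizer

    AutoModelForQuestionAnswering.from_pretrained(model_name)
    AutoTokenizer.from_pretrained(model_name)


if __name__ == "__main__":
    # Invoking this once in a RUN step of your Dockerfile bakes the model
    # files into the image, so the container never downloads at query time.
    prefill_cache("deepset/roberta-base-squad2")
```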
Hi @asharm0662,

The files downloaded to `~/.cache/huggingface/transformers/` are language models that interpret queries and find an answer in a text document (for example), and they include the corresponding tokenizers that split arbitrary input strings, e.g. queries, into sequences of tokens. When you run `reader = TransformersReader(tokenizer="deepset/roberta-base-squad2", use_gpu=-1)`, the tokenizer is loaded from https://huggingface.co/deepset/roberta-base-squad2. These files are about 2 GB in size, sometimes even larger. We cache these models inside the transformers library because it would take a long time to download them on-the-fly every time you want to run a query. The models won't change…
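If you want to see exactly which files ended up in the cache and how large they are, a small stdlib-only helper can walk the cache directory (a hypothetical helper for illustration, not part of haystack or transformers):

```python
import os


def cache_report(cache_dir: str):
    """Return a sorted list of (relative_path, size_in_bytes) for every
    file under the given cache directory."""
    report = []
    for root, _dirs, files in os.walk(cache_dir):
        for name in files:
            path = os.path.join(root, name)
            report.append((os.path.relpath(path, cache_dir),
                           os.path.getsize(path)))
    return sorted(report)


# Example: inspect the transformers cache
# for path, size in cache_report(
#         os.path.expanduser("~/.cache/huggingface/transformers")):
#     print(f"{size / 1e9:6.2f} GB  {path}")
```

This makes it easy to spot which model weights and tokenizer files account for the ~2 GB mentioned above.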