Hugging Face Hub integration
The Annif integration with Hugging Face Hub makes it easy to upload and download projects and their associated vocabularies to and from Hugging Face Hub model repositories. This integration provides a convenient way to share Annif projects and collaborate on text classification tasks within the Hugging Face ecosystem.
Downloads from public repositories work without logging in to Hugging Face Hub. For all uploads, and for downloads from private repositories, you must either log in with the huggingface-cli login command or pass a User Access Token with the --token option of the annif upload or annif download command.
The integration uses the Hugging Face Hub cache system. To explore the cache, use the huggingface-cli scan-cache command; to delete items from the cache, use huggingface-cli delete-cache.
Note that glob wildcards (e.g. * or ?) in the project ID patterns of the commands below can be expanded by the shell to matching filenames instead of the intended projects, because shell expansion happens before Annif processes the command arguments. In particular, an unquoted * pattern meant to target all projects is expanded by the shell to all filenames in the current directory. To avoid shell expansion, put quotation marks around the pattern, e.g. "*".
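The difference between a quoted and an unquoted pattern can be illustrated with a short Python sketch. It is not Annif's own code: it assumes that project ID patterns behave like fnmatch-style globs, and the project IDs and file names in it are hypothetical.

```python
import fnmatch
import glob
import os
import tempfile

# Hypothetical project IDs available in a repository (illustration only).
project_ids = ["yso-mllm-en", "yso-tfidf-en", "yso-mllm-fi"]


def match_projects(pattern, ids):
    """Return the IDs matching a glob pattern, i.e. what a quoted
    pattern selects when it reaches the program unchanged."""
    return fnmatch.filter(ids, pattern)


def shell_expand(pattern, directory):
    """Mimic what an unquoted pattern turns into: the shell replaces it
    with the matching file names in the working directory, or leaves it
    unchanged if nothing matches (the default behaviour of bash)."""
    full_pattern = os.path.join(glob.escape(directory), pattern)
    matches = sorted(os.path.basename(p) for p in glob.glob(full_pattern))
    return matches or [pattern]


with tempfile.TemporaryDirectory() as workdir:
    # Hypothetical files that happen to be in the working directory.
    for name in ("projects.cfg", "notes.txt"):
        open(os.path.join(workdir, name), "w").close()

    quoted = match_projects("*", project_ids)
    unquoted = shell_expand("*", workdir)

print(quoted)    # all three project IDs
print(unquoted)  # ['notes.txt', 'projects.cfg']
```

With the quoted pattern the program receives * itself and can match it against project IDs; without quotes it receives file names that have nothing to do with the projects.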
The download command downloads selected projects and their associated vocabularies from a specified Hugging Face Hub repository. It retrieves the project and vocabulary archives along with the project configuration files from the repository, extracts the archives to the local data/ directory, and places the configuration files in the projects.d/ directory. After the download, the projects are directly usable by Annif, provided that no projects.cfg or projects.toml file exists or that the --projects option is used to override such a file.
For example, to download the yso-mllm-en project from the NatLibFi/FintoAI-data-YSO repository and show the fetched files:
$ annif download yso-mllm-en NatLibFi/FintoAI-data-YSO
Downloading project(s): yso-mllm-en
projects/yso-mllm-en.zip: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2.11M/2.11M [00:00<00:00, 8.72MB/s]
yso-mllm-en.cfg: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 165/165 [00:00<00:00, 256kB/s]
vocabs/yso.zip: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 61.6M/61.6M [00:05<00:00, 11.2MB/s]
$ tree data projects.d
data
├── projects
│   └── yso-mllm-en
│       └── mllm-model.gz
└── vocabs
    └── yso
        ├── subjects.csv
        ├── subjects.dump.gz
        └── subjects.ttl
projects.d
└── yso-mllm-en.cfg
To download all English-language YSO projects, assuming their project IDs end with "en", use the project ID pattern "yso-*en" in place of yso-mllm-en.
If a file contained in an archive already exists locally, the extraction skips it rather than overwriting the local file. To force overwriting of existing local files, use the --force/-f option.
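The skip-or-overwrite behaviour can be sketched with the standard zipfile module. This is a minimal illustration under stated assumptions, not Annif's implementation: the function extract_archive, its force parameter, and the archive contents are hypothetical names chosen for the example.

```python
import os
import tempfile
import zipfile


def extract_archive(archive_path, dest_dir, force=False):
    """Extract a zip archive into dest_dir.

    Files that already exist locally are skipped unless force is True.
    Returns the member names that were actually written.
    """
    written = []
    with zipfile.ZipFile(archive_path) as archive:
        for member in archive.infolist():
            if member.is_dir():
                continue
            target = os.path.join(dest_dir, member.filename)
            if os.path.exists(target) and not force:
                continue  # keep the local file untouched
            archive.extract(member, dest_dir)
            written.append(member.filename)
    return written


with tempfile.TemporaryDirectory() as tmp:
    # Build a small hypothetical archive containing one file.
    archive_path = os.path.join(tmp, "project.zip")
    with zipfile.ZipFile(archive_path, "w") as archive:
        archive.writestr("projects/demo/model.txt", "remote version")

    dest = os.path.join(tmp, "data")
    first = extract_archive(archive_path, dest)  # file is new: extracted

    # Simulate a local modification of the extracted file.
    local_file = os.path.join(dest, "projects", "demo", "model.txt")
    with open(local_file, "w") as f:
        f.write("local edit")

    second = extract_archive(archive_path, dest)  # exists: skipped
    with open(local_file) as f:
        after_skip = f.read()

    third = extract_archive(archive_path, dest, force=True)  # overwritten
    with open(local_file) as f:
        after_force = f.read()

print(first, second, third)
print(after_skip, "->", after_force)
```

Skipping by default protects local changes, such as a retrained model, from being silently replaced by a repeated download.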
A specific revision (commit hash, branch name or tag) to download can be selected with the --revision option; see below for versioning projects.
The upload command uploads selected projects and their vocabularies to a specified Hugging Face Hub repository. The command zips the project directories and vocabularies of the projects whose project IDs match the given pattern into archive files, and uploads these archives along with the project configurations to the designated repository.
The yso-mllm-en project can be uploaded to the NatLibFi/FintoAI-data-YSO repository like this:
$ annif upload yso-mllm-en NatLibFi/FintoAI-data-YSO
Uploading project(s): yso-mllm-en
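The archiving step that precedes the upload can be sketched as follows. This is only an illustration with the standard zipfile module: the function zip_directory and the path layout inside the archive are assumptions, and the archives Annif actually produces may be organised differently.

```python
import os
import tempfile
import zipfile


def zip_directory(src_dir, archive_path):
    """Write every file under src_dir into a zip archive, storing paths
    relative to the parent of src_dir so the directory name is kept.
    Returns the stored member names."""
    base = os.path.dirname(os.path.abspath(src_dir))
    names = []
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for root, _dirs, files in os.walk(src_dir):
            for filename in sorted(files):
                full_path = os.path.join(root, filename)
                arcname = os.path.relpath(full_path, base)
                archive.write(full_path, arcname)
                names.append(arcname.replace(os.sep, "/"))
    return names


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical project directory with a single model file.
    project_dir = os.path.join(tmp, "projects", "demo")
    os.makedirs(project_dir)
    with open(os.path.join(project_dir, "model.txt"), "w") as f:
        f.write("model data")

    archive_path = os.path.join(tmp, "demo.zip")
    stored = zip_directory(project_dir, archive_path)
    with zipfile.ZipFile(archive_path) as archive:
        listed = archive.namelist()

print(stored)  # ['demo/model.txt']
```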
The upload command also selects projects with a project ID pattern, so "yso-*en" could be used to upload all local English YSO projects.
The commit message can be specified with the --commit-message option; the default commit message is "Upload project(s) <project-ID-pattern> with Annif".
The branch to upload to can be set with the --revision option; the default branch is main. Note that the branch must exist before the upload; see below for how to create branches.
Git branches and tags can be used for versioning Annif projects in Hugging Face Hub. Currently (as of release 0.22) the Hugging Face Hub command-line tool does not support accessing git branches or tags, although this is expected to change in the future. However, branches and tags can be accessed and manipulated using the Hugging Face Hub Python client. For example, to list the branches and tags in the NatLibFi/FintoAI-data-YSO repository, use the command
$ python -c "from huggingface_hub import HfApi; client=HfApi(); refs=client.list_repo_refs(repo_id='NatLibFi/FintoAI-data-YSO'); print([t.ref for t in refs.branches]); print([t.ref for t in refs.tags])"
A new branch, e.g. release-2024-04, can be created with the command
$ python -c "from huggingface_hub import HfApi; client=HfApi(); client.create_branch(repo_id='NatLibFi/FintoAI-data-YSO', branch='release-2024-04')"
A new tag, e.g. 2024-04, can be created with the command
$ python -c "from huggingface_hub import HfApi; client=HfApi(); client.create_tag(repo_id='NatLibFi/FintoAI-data-YSO', tag='2024-04')"
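When these steps are repeated for every release, it may be convenient to combine them into a small helper that creates a branch or tag only if it is missing. The sketch below is hypothetical: ensure_refs is not part of Annif or huggingface_hub, and FakeApi is an offline stand-in so the example runs without network access. The helper only relies on the list_repo_refs, create_branch and create_tag calls shown above.

```python
from types import SimpleNamespace


def ensure_refs(api, repo_id, branch=None, tag=None):
    """Create the branch and/or tag in repo_id only if it is missing.

    `api` is expected to behave like huggingface_hub.HfApi.
    Returns the names of the refs that were created.
    """
    refs = api.list_repo_refs(repo_id=repo_id)
    existing_branches = {ref.name for ref in refs.branches}
    existing_tags = {ref.name for ref in refs.tags}
    created = []
    if branch and branch not in existing_branches:
        api.create_branch(repo_id=repo_id, branch=branch)
        created.append(branch)
    if tag and tag not in existing_tags:
        api.create_tag(repo_id=repo_id, tag=tag)
        created.append(tag)
    return created


class FakeApi:
    """Offline stand-in for HfApi; for demonstration only."""

    def __init__(self):
        self.branches = ["main"]
        self.tags = []

    def list_repo_refs(self, repo_id):
        return SimpleNamespace(
            branches=[SimpleNamespace(name=b) for b in self.branches],
            tags=[SimpleNamespace(name=t) for t in self.tags],
        )

    def create_branch(self, repo_id, branch):
        self.branches.append(branch)

    def create_tag(self, repo_id, tag):
        self.tags.append(tag)


api = FakeApi()
repo = "NatLibFi/FintoAI-data-YSO"
first = ensure_refs(api, repo, branch="release-2024-04", tag="2024-04")
second = ensure_refs(api, repo, branch="release-2024-04", tag="2024-04")
print(first)   # ['release-2024-04', '2024-04']
print(second)  # []

# Real usage (requires huggingface_hub and write access to the repository):
#   from huggingface_hub import HfApi
#   ensure_refs(HfApi(), repo, branch="release-2024-04", tag="2024-04")
```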