feat: File hashing and duplicate prevention during import #7765
Feature
This feature introduces the ability to record the SHA256 hash of uploaded files and prevent task creation from files that have a duplicate hash within the same project.

This functionality is controlled by the `DATA_UPLOAD_IGNORE_DUPLICATES` environment variable. If set to `True`, files will be hashed, and duplicates (based on content hash) will be ignored during the import process. If `False` or not set, the system behaves as before.
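A minimal sketch of how the flag might be wired up in `label_studio/core/settings/base.py`; Label Studio's settings normally parse environment variables through their own helpers, so the plain `os.getenv` form below is an assumption, not the PR's literal code:

```python
import os

# When enabled, uploaded files are hashed with SHA256 and files whose content
# hash already exists in the same project are skipped during import.
DATA_UPLOAD_IGNORE_DUPLICATES = os.getenv(
    'DATA_UPLOAD_IGNORE_DUPLICATES', 'false'
).lower() in ('true', '1', 'yes')
```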
**Duplicate File Handling:**

- `label_studio/core/settings/base.py`: Added the `DATA_UPLOAD_IGNORE_DUPLICATES` setting to allow skipping duplicate file uploads based on file hashes.
- `label_studio/data_import/uploader.py`: Added the `get_file_hash` function to compute SHA256 hashes for files and updated `create_file_upload` to check for duplicates using the hash. If duplicates are ignored, the function returns `None`. [1] [2] (See the sketch after this list.)
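A rough sketch of the two uploader helpers; the exact `create_file_upload` signature and the `user`/`project` fields on `FileUpload` are assumptions drawn from the description, not the PR's actual diff:

```python
import hashlib

from django.conf import settings

from .models import FileUpload


def get_file_hash(uploaded_file):
    """Compute the SHA256 hex digest of a file by streaming it in chunks."""
    digest = hashlib.sha256()
    for chunk in uploaded_file.chunks():  # works for Django UploadedFile/FieldFile
        digest.update(chunk)
    uploaded_file.seek(0)  # rewind so the file can still be stored afterwards
    return digest.hexdigest()


def create_file_upload(user, project, file):
    """Create a FileUpload; return None when duplicate content should be ignored."""
    file_hash = None
    if settings.DATA_UPLOAD_IGNORE_DUPLICATES:
        file_hash = get_file_hash(file)
        if FileUpload.objects.filter(project=project, file_hash=file_hash).exists():
            return None  # same content already uploaded to this project
    return FileUpload.objects.create(
        user=user, project=project, file=file, file_hash=file_hash
    )
```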
**Database Changes:**

- `label_studio/data_import/migrations/0003_fileupload_file_hash.py`: Added a migration to introduce the `file_hash` field in the `FileUpload` model and populate it for existing records using the `full_files_hash` function.
- `label_studio/data_import/models.py`: Updated the `FileUpload` model to include the `file_hash` field with indexing for efficient duplicate checks. (See the sketch after this list.)
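A sketch of the new field and the backfill migration. The field parameters (a nullable, indexed `CharField(max_length=64)`, enough for a SHA256 hex digest) and the migration's dependency name are assumptions:

```python
# label_studio/data_import/models.py (sketch)
from django.db import models


class FileUpload(models.Model):
    # ... existing fields (user, project, file) omitted ...

    # SHA256 hex digest (64 chars) of the uploaded file; indexed so duplicate
    # checks on (project, file_hash) stay cheap. Nullable because existing rows
    # are backfilled by the data migration below.
    file_hash = models.CharField(max_length=64, null=True, blank=True, db_index=True)


# label_studio/data_import/migrations/0003_fileupload_file_hash.py (sketch)
from django.db import migrations


def backfill_hashes(apps, schema_editor):
    # Populate file_hash for uploads created before this migration, reusing the
    # uploader helper described in the PR.
    from data_import.uploader import full_files_hash

    full_files_hash()


class Migration(migrations.Migration):
    dependencies = [
        ('data_import', '0002_auto_previous'),  # placeholder for the real parent migration
    ]

    operations = [
        migrations.AddField(
            model_name='fileupload',
            name='file_hash',
            field=models.CharField(max_length=64, null=True, blank=True, db_index=True),
        ),
        migrations.RunPython(backfill_hashes, migrations.RunPython.noop),
    ]
```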
**Utility Functions:**

- `label_studio/data_import/uploader.py`: Added the `full_files_hash` function to calculate hashes for all `FileUpload` instances missing a hash, ensuring backward compatibility with existing records. (See the sketch after this list.)
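And a sketch of the backfill helper itself, assuming it lives next to `get_file_hash` in `uploader.py` and reuses it:

```python
# label_studio/data_import/uploader.py (sketch, continuing the helpers above)
from .models import FileUpload


def full_files_hash():
    """Backfill file_hash for FileUpload records created before the field existed."""
    for file_upload in FileUpload.objects.filter(file_hash__isnull=True):
        if not file_upload.file:
            continue  # nothing to hash if the stored file is missing
        file_upload.file_hash = get_file_hash(file_upload.file)  # helper sketched earlier
        file_upload.save(update_fields=['file_hash'])
```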