Release v1.4.2: Python > 3.10 is supported; Data Attribution OPs; External OPs are supported; Install with "uv" · modelscope/data-juicer

Major Updates

💪🏻 Data-Juicer is now compatible with Python 3.11 & 3.12. #749
🧩 5 OPs for data attribution have been added. #735
🤝 Data-Juicer now supports registering and applying custom OPs from external paths via the custom_operator_paths argument. #758
🔧 "uv" is now the first choice for installing Data-Juicer because it can resolve dependency conflicts. #760

Validation-free
- llm_perplexity_filter: Filter to keep samples whose perplexity score, computed using a specified LLM, falls within a specific range. #735
- instruction_following_difficulty_filter: Filter to keep texts whose instruction-following difficulty (IFD, https://arxiv.org/abs/2308.12032) falls within a specific range. #735
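
To make the two validation-free scores concrete, here is a minimal sketch of how they can be computed from per-token log-probabilities. It is not the Data-Juicer implementation: the function names are hypothetical, the log-probabilities are assumed to come from whichever LLM you choose, and the IFD formula assumes the definition in the linked paper (mean answer loss conditioned on the instruction, divided by mean answer loss without it).

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def ifd_score(answer_logprobs_with_instruction, answer_logprobs_alone):
    """Assumed IFD definition: conditioned answer loss / unconditioned answer loss.

    Both arguments are natural-log probabilities of the same answer tokens,
    scored once with the instruction as context and once without it.
    """
    if not answer_logprobs_with_instruction or not answer_logprobs_alone:
        raise ValueError("need at least one token log-probability per list")
    cond_loss = -sum(answer_logprobs_with_instruction) / len(answer_logprobs_with_instruction)
    uncond_loss = -sum(answer_logprobs_alone) / len(answer_logprobs_alone)
    return cond_loss / uncond_loss


def keep_in_range(score, min_score, max_score):
    """Range check shared by both filters (bounds are inclusive in this sketch)."""
    return min_score <= score <= max_score
```

A sample would be kept when `keep_in_range(perplexity(logprobs), lo, hi)` is true; the bounds `lo` and `hi` are placeholders for whatever range you configure.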
Validation-based
- in_context_influence_filter: Filter to keep texts whose in-context influence on the validation set falls within a specific range. #735
- llm_task_relevance_filter: Filter to keep samples with a high relevance score to the validation tasks, as estimated by an LLM. #735
- text_embd_similarity_filter: Filter to keep texts whose average embedding similarity to a set of given validation texts falls within a specific range. #735
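
The embedding-similarity filter above can be illustrated with a small sketch. It assumes cosine similarity as the similarity measure and precomputed embeddings as plain lists of floats; neither detail is stated in these notes, and the function names are hypothetical rather than Data-Juicer APIs.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def average_similarity(sample_embedding, validation_embeddings):
    """Mean similarity of one sample to every validation embedding."""
    if not validation_embeddings:
        raise ValueError("need at least one validation embedding")
    sims = [cosine_similarity(sample_embedding, v) for v in validation_embeddings]
    return sum(sims) / len(sims)


def filter_by_similarity(samples, validation_embeddings, min_score, max_score):
    """Keep the texts whose average similarity lies in [min_score, max_score].

    `samples` is a list of (text, embedding) pairs.
    """
    return [
        text
        for text, embedding in samples
        if min_score <= average_similarity(embedding, validation_embeddings) <= max_score
    ]
```

For example, with validation embeddings `[[1.0, 0.0], [0.0, 1.0]]`, a sample embedded as `[1.0, 0.0]` has an average similarity of 0.5 and would pass a `[0.0, 1.0]` range.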

A new environment variable DATA_JUICER_EXTERNAL_MODELS_HOME is added so that private or read-only paths can be specified for storing external and extra models. #740
Optimize the video link transformation and multi-version maintenance in the docs. Update demo videos with higher-resolution versions. #746
Support custom save_dir for OPs that produce extra multimodal data. #751
Add official and detailed docs about Data-Juicer Agent. #759
Enhance unit tests: show the name of the current test case; recycle resources after each test case in ray mode. #749
Refine the developer guide for better practice on building new OPs. #760
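
For the DATA_JUICER_EXTERNAL_MODELS_HOME variable listed above, a launcher script might prepare the environment as in the sketch below. The helper name and the path are hypothetical, and how Data-Juicer resolves models under that directory is not described in these notes.

```python
import os


def env_with_external_models_home(path, base_env=None):
    """Return a copy of the environment with the external models path set.

    `path` is a placeholder for a private or read-only model directory.
    The current process environment is used when `base_env` is None.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["DATA_JUICER_EXTERNAL_MODELS_HOME"] = path
    return env
```

The returned mapping can be passed as the `env` argument of `subprocess.run` when starting a Data-Juicer job, which leaves the parent process environment unchanged.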

Move the updating of special tokens of multimodal data into the initialization of base_op, which fixes a bug where special tokens might not be synced with the main process when processing data in parallel. #752
Fix some test cases. #754

Full Changelog: v1.4.1...v1.4.2