Release v1.4.2: Python versions > 3.10 are supported; Data Attribution OPs; External OPs are supported; Install with "uv"
Major Updates
- 💪🏻 Data-Juicer is now compatible with Python 3.11 & 3.12. #749
- 🧩 5 OPs for data attribution are added. #735
- 🤝 Data-Juicer now supports registering and applying custom OPs from external paths via the `custom_operator_paths` argument. #758
- 🔧 "uv" is now the first choice for installing Data-Juicer, thanks to its ability to resolve dependency conflicts. #760
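To illustrate the general idea behind loading OPs from external paths, here is a minimal sketch. It is not Data-Juicer's implementation: the `load_custom_operators` function, the `OP_NAME` class attribute, and the plain-dict registry are all hypothetical names chosen for this example. The sketch only shows how Python files outside a package can be imported by path and their classes collected.

```python
import importlib.util
import sys
from pathlib import Path


def load_custom_operators(paths):
    """Import every .py file found under `paths` and collect OP classes.

    Hypothetical convention for this sketch: a class counts as an OP
    when it defines a string attribute `OP_NAME`.
    """
    registry = {}
    for raw in paths:
        path = Path(raw)
        # A path may be a single file or a directory of files.
        files = sorted(path.glob("*.py")) if path.is_dir() else [path]
        for file in files:
            module_name = f"external_ops_{file.stem}"
            spec = importlib.util.spec_from_file_location(module_name, file)
            module = importlib.util.module_from_spec(spec)
            sys.modules[module_name] = module
            spec.loader.exec_module(module)  # runs the file's top-level code
            for obj in vars(module).values():
                if isinstance(obj, type) and hasattr(obj, "OP_NAME"):
                    registry[obj.OP_NAME] = obj
    return registry
```

In Data-Juicer itself, the external paths are passed through the `custom_operator_paths` argument; refer to the project docs for the actual registration mechanism.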
New Operators
Filter
- Validation-free
  - `llm_perplexity_filter`: Filter to keep samples whose perplexity score, computed using a specified LLM, falls within a specific range. #735
  - `instruction_following_difficulty_filter`: Filter to keep texts whose instruction-following difficulty (IFD, https://arxiv.org/abs/2308.12032) falls within a specific range. #735
- Validation-based
  - `in_context_influence_filter`: Filter to keep texts whose in-context influence on the validation set falls within a specific range. #735
  - `llm_task_relevance_filter`: Filter to keep samples with a high relevance score to validation tasks, as estimated by an LLM. #735
  - `text_embd_similarity_filter`: Filter to keep texts whose average embedding similarity to a set of given validation texts falls within a specific range. #735
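Several of these filters share one pattern: compute a score per sample, then keep the sample if the score falls within a range. The sketch below shows three such scores in plain Python. It is an illustration only, not the OPs' actual code: the function names and the `min_score`/`max_score` parameters are hypothetical, and the per-token log-probabilities, losses, and embeddings are assumed to come from a model that is not shown here.

```python
import math


def perplexity(token_logprobs):
    """Perplexity: exp of the mean negative log-likelihood per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def ifd_score(loss_with_instruction, loss_without_instruction):
    """IFD: answer loss given the instruction divided by the answer loss alone."""
    return loss_with_instruction / loss_without_instruction


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def mean_similarity(embedding, validation_embeddings):
    """Average cosine similarity of one embedding to a validation set."""
    total = sum(cosine(embedding, v) for v in validation_embeddings)
    return total / len(validation_embeddings)


def keep_in_range(score, min_score, max_score):
    """Range check shared by the filters: True means the sample is kept."""
    return min_score <= score <= max_score
```

For example, a sample whose tokens each have probability 0.5 has a perplexity of 2, so `keep_in_range(perplexity(logprobs), 1.0, 10.0)` would keep it.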
Enhancements
- A new environment variable DATA_JUICER_EXTERNAL_MODELS_HOME is added to let users specify private or read-only paths where external and extra models are stored. #740
- Optimize the video link transformation and multi-version maintenance in the docs. Update demo videos with higher-resolution versions. #746
- Support custom save_dir for OPs that produce extra multimodal data. #751
- Add official and detailed docs about Data-Juicer Agent. #759
- Enhance unit tests: show the name of the current test case; recycle resources after each test case in ray mode. #749
- Refine the developer guide with better practices for building new OPs. #760
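As a rough illustration of how an extra, possibly read-only model location such as DATA_JUICER_EXTERNAL_MODELS_HOME might be consulted, consider the sketch below. The `resolve_model_path` function, the `default_home` parameter, and the lookup order are assumptions made for this example, and the variable is treated as a single directory; Data-Juicer's actual lookup logic may differ.

```python
import os
from pathlib import Path


def resolve_model_path(model_name, default_home):
    """Return the first existing path for `model_name`, or None.

    Assumed lookup order for this sketch: the default model cache
    first, then the directory named by DATA_JUICER_EXTERNAL_MODELS_HOME.
    """
    candidates = [Path(default_home)]
    external = os.environ.get("DATA_JUICER_EXTERNAL_MODELS_HOME")
    if external:
        candidates.append(Path(external))
    for home in candidates:
        path = home / model_name
        if path.exists():
            return path  # only read here, so a read-only directory works
    return None
```

Because the external directory is only read and never written, it can point at a shared or read-only location.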
Bugs Fixed
- Move the update of special tokens for multimodal data into the initialization of base_op, which fixes a bug where special tokens might not be synced with the main process when processing data in parallel. #752
- Fix some test cases. #754
Acknowledgement
- @ShenQianli made their first contribution with 5 new OPs. #735
Full Changelog: v1.4.1...v1.4.2