Releases: modelscope/data-juicer
Release v1.4.3: OP Doc Enhancement; Optimized Auto Parallelism; Optimized Sandbox
Major Updates
- 🤝 OP Document Updates: optimized multi-version docs; docstrings are rewritten and enhanced by qwen-max. #755 #765 #768 #769 #787
- 💪🏻 Auto Parallelism Optimization: support cpu/gpu/mem requirement specification for each OP; optimize calculate_np for ray mode. #679 #774 #782 #786
- 🛠️ Sandbox Optimization: support iterative pipelines and early-stop targets; refactor the context infos; a new example on auto prompt optimization and several related hooks are added. #757
- 📈 Upgrade spacy from 3.8.0 to 3.8.7 because the previous version was yanked. #763
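To illustrate the idea behind per-OP resource requirements, here is a minimal sketch of how a worker count could be derived from cpu/gpu/mem requirements. The function name `estimate_np` and all of its parameters are hypothetical; this is not Data-Juicer's actual calculate_np implementation.

```python
import math


def estimate_np(cpu_required, mem_required_gb, gpu_required,
                total_cpus, free_mem_gb, total_gpus):
    """Hypothetical helper: how many parallel workers fit the resource budget.

    Each worker needs `cpu_required` CPUs, `mem_required_gb` GB of memory and
    `gpu_required` GPUs. A requirement of 0 means this resource imposes no limit.
    """
    limits = []
    if cpu_required > 0:
        limits.append(math.floor(total_cpus / cpu_required))
    if mem_required_gb > 0:
        limits.append(math.floor(free_mem_gb / mem_required_gb))
    if gpu_required > 0:
        limits.append(math.floor(total_gpus / gpu_required))
    # With no stated requirement, fall back to one worker per CPU.
    workers = min(limits) if limits else total_cpus
    # Always run at least one worker, even if the budget is too small.
    return max(1, workers)


# A CPU-only OP needing 1 CPU and 4 GB per worker on a 16-CPU, 32 GB machine:
print(estimate_np(1, 4, 0, total_cpus=16, free_mem_gb=32, total_gpus=0))  # 8
```

The tightest resource decides the result: memory allows 8 workers here even though 16 CPUs are available.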
New OPs
- image_detection_yolo_mapper: perform object detection (with YOLO) on images and return the bounding box values and class labels. #764
- optimize_prompt_mapper: optimize prompts based on the existing ones. #757
Enhancements
- Support shard_size and extra args for write methods in export_extra_args for RayExporter. #739
- Support min/max_closed_interval args to control filtering with open/closed intervals and reversed_range arg to allow keeping samples outside a specified range for Filters. #741
- Support API models for existing optimize_qa_mapper. #771
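The interval arguments for Filters can be pictured with a short sketch. The argument names min_closed_interval, max_closed_interval, and reversed_range come from the note above; the function `keep_sample` and the exact semantics shown are assumptions for illustration, not the library's code.

```python
def keep_sample(value, min_val, max_val,
                min_closed_interval=True, max_closed_interval=True,
                reversed_range=False):
    """Hypothetical range check mirroring the three filter arguments."""
    # A closed bound includes the endpoint; an open bound excludes it.
    lower_ok = value >= min_val if min_closed_interval else value > min_val
    upper_ok = value <= max_val if max_closed_interval else value < max_val
    in_range = lower_ok and upper_ok
    # reversed_range keeps samples that fall OUTSIDE the range instead.
    return not in_range if reversed_range else in_range


print(keep_sample(10, 10, 20))                             # True
print(keep_sample(10, 10, 20, min_closed_interval=False))  # False
print(keep_sample(25, 10, 20, reversed_range=True))        # True
```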
Fixed Bugs
- Fix and re-enable the disabled op_list_to_trace argument. #766
- Add missing skip tag to several API-based test cases for forked repos. #767
- Limit the version of transformers to "<4.55.0" to avoid computing on None value. #781
- Fix out-of-date invoking methods in several tools. #785 (from issue #750)
- Fix 500 error in API service. #785 (from issue #777)
- Remove specified_xxx_filter from NON_STATS_FILTER. #785 (from issue #783)
Full Changelog: v1.4.2...v1.4.3
Release v1.4.2: Python > 3.10 are supported; Data Attribution OPs; External OPs are supported; Install with "uv"
Major Updates
- 💪🏻 Data-Juicer is now compatible with Python 3.11 & 3.12. #749
- 🧩 5 OPs for data attribution are added. #735
- 🤝 Data-Juicer now supports registering and applying custom OPs from external paths via the custom_operator_paths argument. #758
- 🔧 "uv" is now the first choice for installing Data-Juicer because it can resolve dependency conflicts. #760
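As a rough picture of how operators defined in external files can be registered by path, the sketch below imports Python files dynamically so that a registration decorator runs. `OP_REGISTRY`, `register_op`, `load_custom_ops`, and the demo operator are all hypothetical names; Data-Juicer's real registry and the exact handling of custom_operator_paths may differ.

```python
import importlib.util
import os
import tempfile

# Hypothetical stand-in for an operator registry; not Data-Juicer's own.
OP_REGISTRY = {}


def register_op(name):
    """Decorator that records an operator class under `name`."""
    def decorator(cls):
        OP_REGISTRY[name] = cls
        return cls
    return decorator


def load_custom_ops(paths):
    """Import every given .py file so its decorators run and register OPs."""
    for path in paths:
        module_name = os.path.splitext(os.path.basename(path))[0]
        spec = importlib.util.spec_from_file_location(module_name, path)
        module = importlib.util.module_from_spec(spec)
        # Expose the decorator as a global of the external module before running it.
        module.register_op = register_op
        spec.loader.exec_module(module)


# Demo: write a throwaway external OP file, then load it by path.
source = (
    "@register_op('my_upper_mapper')\n"
    "class MyUpperMapper:\n"
    "    def process(self, text):\n"
    "        return text.upper()\n"
)
with tempfile.TemporaryDirectory() as tmp_dir:
    op_path = os.path.join(tmp_dir, "my_upper_mapper.py")
    with open(op_path, "w", encoding="utf-8") as f:
        f.write(source)
    load_custom_ops([op_path])

print(OP_REGISTRY["my_upper_mapper"]().process("hello"))  # HELLO
```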
New Operators
Filter
- Validation-free
  - llm_perplexity_filter: Filter to keep samples whose perplexity score, computed using a specified LLM, falls within a specific range. #735
  - instruction_following_difficulty_filter: Filter to keep texts whose instruction-following difficulty (IFD, https://arxiv.org/abs/2308.12032) falls within a specific range. #735
- Validation-based
  - in_context_influence_filter: Filter to keep texts whose in-context influence upon the validation set falls within a specific range. #735
  - llm_task_relevance_filter: Filter to keep samples with a high relevance score to validation tasks, as estimated by an LLM. #735
  - text_embd_similarity_filter: Filter to keep texts whose average embedding similarity to a set of given validation texts falls within a specific range. #735
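The two validation-free scores can be sketched from per-token log-probabilities. This assumes perplexity is the exponential of the mean negative log-likelihood, and that IFD is the ratio of the answer loss conditioned on the instruction to the answer loss without it, as in the cited paper. The function names and the toy log-probabilities are made up, and the OPs' actual implementations may differ.

```python
import math


def mean_nll(token_logprobs):
    """Average negative log-likelihood over a token sequence."""
    return -sum(token_logprobs) / len(token_logprobs)


def perplexity(token_logprobs):
    """Perplexity: exponential of the average negative log-likelihood."""
    return math.exp(mean_nll(token_logprobs))


def ifd_score(answer_logprobs_given_question, answer_logprobs_alone):
    """IFD: answer loss with the instruction divided by answer loss without it.

    A ratio near or above 1 suggests the instruction gives the model little help.
    """
    return mean_nll(answer_logprobs_given_question) / mean_nll(answer_logprobs_alone)


def in_range(score, min_score, max_score):
    """Keep a sample only when its score lies inside [min_score, max_score]."""
    return min_score <= score <= max_score


# Toy log-probabilities of the same answer tokens, with and without the question.
with_question = [-0.2, -0.4]
without_question = [-0.5, -0.7]

ifd = ifd_score(with_question, without_question)
print(round(ifd, 2), in_range(ifd, 0.2, 0.9))
print(round(perplexity(without_question), 3))
```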
Enhancements
- A new environment variable DATA_JUICER_EXTERNAL_MODELS_HOME is added to allow specifying private or read-only paths that store external and extra models. #740
- Optimize the video link transformation and multi-version maintenance in the docs. Update demo videos with higher-resolution versions. #746
- Support custom save_dir for OPs that produce extra multimodal data. #751
- Add official and detailed docs about Data-Juicer Agent. #759
- Enhance unit tests: show the name of the current test case; recycle resources after each test case for ray mode. #749
- Refine the developer guide for better practice on building new OPs. #760
Bugs Fixed
- Move the update of special tokens for multimodal data into the initialization of base_op, which fixes a bug where special tokens might not be synced with the main process when processing data in parallel. #752
- Fix some test cases. #754
Acknowledgement
- @ShenQianli made their first contribution to 5 new OPs. #735
Full Changelog: v1.4.1...v1.4.2
Release v1.4.1: MCP server; GPU-based Minhash deduplicator; Improved unit test coverage.
Major Updates
- 🔧 Introduce Data-Juicer MCP server. Users can make use of the data processing capabilities in the MCP way conveniently. #690 #737
- 💪🏻 Unit test coverage rate is improved to 85%+ and several bugs in test cases are resolved (OOM, encoding error, and so on), which makes Data-Juicer more reliable. #698 #717 #720 #727
- 🤝 Minhash deduplication based on GPU is supported, collaborated with developers from Nvidia. #694 #644
- 🧩 RayExporter supports more formats to export a ray dataset in addition to json/jsonl. #687
- 🎥 Two demo videos are added to introduce the Data-Juicer core functions, agentic usages, and sandbox. #738
New Operators
- download_file_mapper: downloads data from URLs to local files or specified fields. #709
Enhancements
- New analysis method: correlation analysis among stats is added. #663
- Several core dependencies are updated and fixed to a newer version, and dependency conflicts are resolved. #715 #717 #723
- The EasyAnimate pipelines in the sandbox are updated to follow the refactoring of sandbox. #710
- Apply more reliable pre-commit tools to improve the code style of Data-Juicer. #714
- Support storing and processing bytes data of images in the dataset. #725
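Correlation analysis among stats can be pictured as a pairwise Pearson correlation over per-sample stat columns. The sketch below uses only the standard library; the stat names and values are invented for illustration, and the project's analysis module may compute and present this differently.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(stats):
    """Map every pair of stat names to their Pearson correlation."""
    names = list(stats)
    return {a: {b: pearson(stats[a], stats[b]) for b in names} for a in names}


# Invented per-sample stats: one value per sample for each stat.
stats = {
    "text_len": [10, 20, 30, 40],
    "num_words": [2, 4, 6, 8],
    "special_char_ratio": [0.4, 0.3, 0.2, 0.1],
}
matrix = correlation_matrix(stats)
for name, row in matrix.items():
    print(name, {other: round(value, 2) for other, value in row.items()})
```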
Bugs Fixed
- The wheel & docker image building bug is fixed. #706
- Fix bugs in log_summarization. #710
- Fix "no module named data_juicer" error after installing from the wheel file. #727
Acknowledgement
- @fanronghai helps to fix the param error in dataset_splitting_by_language tool. #713
- @ayushdg helps to support a GPU-version Minhash deduplicator. #644
- @ricksun2023 helps to fix the bugs when there are more than one same-name OPs in the configs. #730
Full Changelog: v1.4.0...v1.4.1
v1.4.0 Major Refactor for Env Management, Doc, Sandbox; Derivative Works (TPAMI Survey; Trinity-RFT & DetailMaster)
Summarization: 200+ files changed with 18,535 additions and 3,720 deletions.
🔧 Major Refactors & Improvements
- 🔄 Sandbox Usability (#686):
  - Support for multiple pipelines, context info, and an environment manager to run different commands in various environments.
  - Includes the InternVL example as a showcase.
- 📘 DJ-Doc Redesign (#675):
  - Now with multilingual support (English / Chinese) and a modernized style.
- 📦 Dependency Management Update (#660, #680):
  - Migrated to uv for faster dependency resolution.
  - Added sub-groups for better organization.
🌍 New Features & Integrations (#683, #688, #692)
- 🆕 Additional Repo Supported:
  - Trinity-RFT now supported by Data-Juicer.
- 📜 DJ-Awesome-List:
  - A survey paper accepted by TPAMI'25!
- 🧪 Synthetic Benchmark Added:
  - DetailMaster: a new benchmark for synthetic data evaluation.
- 🛠️ New Operators Introduced (#673, #701):
  - llm_analysis_filter
  - general_field_filter
🚀 Core Optimizations & Bug Fixes
- ✅ Ray Executor Enhancements (#697):
  - File extension detection added.
  - Support for more data formats.
- ⏱️ Startup Time Optimization:
  - Improved startup performance. (#684)
- 🧠 Text Embedding Support:
  - Added support for text embedding via API and local model. (#681)
- 🐳 Docker Build Improvement:
  - Ignore installed distutils libraries during Docker image building. (#668)
- 🛠️ Mapper Module Fix:
  - Fixed issue with module initialization. (#700)
- 🗑️ Warning Suppression:
  - Suppressed unnecessary warnings from fasttext. (#696)
📚 Full Changelog
Release v1.3.3: Sandbox is accepted as Spotlight by ICML 2025; Add Img-Diff recipes.
Major Updates
- 🎉 Our work of Data-Juicer Sandbox has been accepted as a Spotlight by ICML 2025 (top 2.6% of all submissions)!
- Add new OPs and recipes for Img-Diff. #658
Enhancements
- Support HF llm for two llm_xxx_score_filter OPs. #655
- Sync docker image to Aliyun OSS for downloading when Docker Hub is not accessible. #657
- Split standalone and distributed unit tests to save time when re-running failed ones. #666
Bugs Fixed
- Address possibly missing cfg in unify_format. #653
- Improve clarity & fix bad links for some docs. #659
Acknowledgement
Full Changelog: v1.3.2...v1.3.3
Release v1.3.2: Enhancements on usability & two OPs; some bugs fixes
What's Changed
- Human OP enhancements, in #642 #645
- update label-studio version
- make service script more robust
- add documentation
- optimizing fields mapping
- OP efficiency optimization of document_minhash_deduplicator, in #639
- set temp_parser.usage to argparse.SUPPRESS, skip too much help log in #643
- fix date typo in #648
- Fix docker building failure in #650
- Fix StreamToLoguru compatibility issue with torch._dynamo in #651
- add init file for annotation module, fix dj-process command error in #652
New Contributor
Release v1.3.1: added HumanOPs & fixed some bugs
Major Updates
- 💥 prototype Implementation for HumanOps (annotation). #617 Included features:
- boilerplate code for supporting label studio powered human annotation ops
- a human preference annotation reference implementation is provided
- label studio service script; can start up local instance using docker or pip, whichever is available
- reference configs and data
- event driven and notification mixins framework for ops
New OPs
- extract_tables_from_html_mapper: extract tables from html texts. #634
- general_fused_op: an explicitly fused operator designed to execute multiple sequential operations (OPs) on the same batch, enabling fine-grained control over data processing. #626
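The idea of an explicitly fused operator can be sketched as a single callable that applies several operations in order to one batch before moving on. `fuse_ops` and the three toy operations below are hypothetical and only illustrate the control flow, not the general_fused_op interface.

```python
def fuse_ops(ops):
    """Return one callable that runs `ops` in order on the same batch."""
    def fused(batch):
        for op in ops:
            batch = op(batch)
        return batch
    return fused


def strip_whitespace(batch):
    return [text.strip() for text in batch]


def lowercase(batch):
    return [text.lower() for text in batch]


def drop_empty(batch):
    return [text for text in batch if text]


# The batch passes through all three steps in a single call.
fused = fuse_ops([strip_whitespace, lowercase, drop_empty])
print(fused(["  Hello ", "   ", "WORLD"]))  # ['hello', 'world']
```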
Bug Fixed
- fix dataset builder initialization failure #630
- update Executor references from Executor to DefaultExecutor #632 #633
- switch the backend of plt to avoid sub-process/thread error #633
- fix some boundary condition bugs in several deduplicators #635 #637
Others
- check the dataset when loading so that a dataset can be passed to the DefaultExecutor.run method. #633
- update docs to highlight the light env installation part. #636
Acknowledgement
- @liuyuhanalex helps to add a new OP and fix some of the boundary condition bugs. #634 #635
Full Changelog: v1.3.0...v1.3.1
Release v1.3.0: Refactor of dataset builder and executor!
The Big Change 🚀
Refactor of dataset builder and executor, see #537, @cyruszhang
📜 YAML explicitly defines different sources of datasets; local and remote are defined separately.
🔧 More flexible parameterized control; supports source-specific parameters, validations, and extensible configurations.
🔌 Unbind Executor's hardcode support: No longer restricted to local JSON formats; input format is determined dynamically via formatters/downloaders.
🚀 Enhanced Executor extensibility to natively support engines like Nemo, Dask, Spark, etc.
🔍 Add data format validation to ensure consistency and correctness.
🌐 Expanded data source support:
a. 📦 ModelScope integration.
b. 📚 ArXiv dataset import (download, decompress, ingest).
c. 📚 Wikipedia dataset support (download, decompress, ingest).
d. 🌐 Common Crawl integration (download, decompress, ingest).
🔗 Backward compatibility with existing dataset_path command-line syntax.
🔀 Support for data mixtures to combine multiple datasets dynamically.
🔧 Support for empty formatters/generated datasets without pre-defined config files.
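One simple way to picture a data mixture is weighted sampling across several sources. The sketch below draws samples with replacement, choosing the source dataset by weight each time. `mix_datasets`, the dataset names, and the weights are made up for illustration; the refactored dataset builder may mix data differently.

```python
import random


def mix_datasets(datasets, weights, total, seed=42):
    """Draw `total` samples, picking the source dataset by weight each time."""
    rng = random.Random(seed)  # fixed seed keeps the mixture reproducible
    names = list(datasets)
    name_weights = [weights[name] for name in names]
    mixed = []
    for _ in range(total):
        source = rng.choices(names, weights=name_weights, k=1)[0]
        mixed.append(rng.choice(datasets[source]))
    return mixed


datasets = {
    "wiki": ["wiki_0", "wiki_1", "wiki_2"],
    "arxiv": ["arxiv_0", "arxiv_1"],
}
# Roughly three wiki samples for every arxiv sample.
mixture = mix_datasets(datasets, {"wiki": 3, "arxiv": 1}, total=8)
print(mixture)
```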
Others 💡
🔊 New audio processing operator: audio_add_gaussian_noise (PR #622), @liuyuhanalex
📊 Added dynamic coverage rate badge to the README for transparency (PR #625)
Release v1.2.2
Major Updates
- 🧪 Add document for API service. Add parameter transmission using json.dumps to support API calls for arbitrary registration functions and classes. #613
- 🚀 Add unit tests for the analysis module and utils module to increase test coverage. #604 #616
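Serializing each parameter with json.dumps lets nested or non-string values survive a trip through a query string. The round-trip sketch below shows the idea with the standard library only. `encode_params` and `decode_params` are hypothetical helpers, and the parameter names are invented; the actual API service's endpoints and request format are not shown here.

```python
import json
from urllib.parse import parse_qs, urlencode


def encode_params(params):
    """Serialize each value with json.dumps so its type survives a query string."""
    return urlencode({key: json.dumps(value) for key, value in params.items()})


def decode_params(query):
    """Inverse step: parse the query string and json.loads each value."""
    return {key: json.loads(values[0]) for key, values in parse_qs(query).items()}


params = {
    "lang": "en",
    "min_len": 10,
    "text_keys": ["text", "meta"],
    "strict": True,
    "extra": {"threshold": None},
}
query = encode_params(params)
print(query)
print(decode_params(query) == params)  # True
```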
A new data synthesis method is proposed, which encourages LLMs to self-generate challenging cognitive questions, achieving superior data efficiency, cross-modality generalization, and SFT effects over SOTA baselines (e.g., 16% gain on MathVision using only 400 samples). See more details in MindGym: Enhancing Vision-Language Models via Synthetic Self-Challenging Questions.
New OPs
- llm_quality_score_filter: Filter to keep samples with a high quality score estimated by an LLM; supports both API calls and local vLLM calls. #606 #614 #620
- llm_difficulty_score_filter: Filter to keep samples with a high difficulty score estimated by an LLM; supports both API calls and local vLLM calls. #606 #614 #620
Others
Release v1.2.1
Major Updates
DJ has been integrated in Ray's official Ecosystem and Example Gallery. Besides, our patch in DJ2.0 for the streaming JSON reader has been officially integrated by Apache Arrow.
Our work on contrastive data synthesis, ImgDiff, has been accepted by CVPR 2025!
- Unit test optimization:
- split unit tests to partial and regression: partial test is triggered by PR and only test on corresponding test cases of changed files; regression test on all cases and triggered at 7:00 on every Friday in Beijing time. #598
- use primitive @unittest.skip and remove SKIPPED_TESTS. #586
- upload test coverage reports to GitHub artifacts. #586
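For reference, this is how the standard-library @unittest.skip decorator marks a test as skipped. The test class, method names, and skip reason below are invented for illustration and are not taken from the project's test suite.

```python
import unittest


class ApiBasedTest(unittest.TestCase):

    @unittest.skip("needs an API key that is not available in this environment")
    def test_remote_call(self):
        # Never executed: the decorator reports this test as skipped.
        self.fail("this body should not run")

    def test_local_logic(self):
        self.assertEqual("a,b".split(","), ["a", "b"])


suite = unittest.defaultTestLoader.loadTestsFromTestCase(ApiBasedTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(len(result.skipped), result.wasSuccessful())  # 1 True
```

A skipped test is recorded in the report with its reason and does not count as a failure.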
New OPs
- image_remove_background_mapper: remove the background of images. #589
Others
- add missing LOADED_AUDIOS to ALL_INTER_VARS to enable OP fusion and context sharing. #585
- only build doc for py3.10. #586
- move dependency on ray to minimal requirements. #586 #594 #595
- allow executor and other tool functions to consume a loaded dataset in addition to the config file. #596 #597
- fix undefined fileno bug of the logger. #594
Acknowledgement
- @liuyuhanalex helps simplify the code logic of OP fusion, add a new OP image_remove_background_mapper, and fix some minor bugs. #581 #585 #589
- @co63oc helps to fix typos in code and documents. #582 #583 #588 #591 #593
- @danielhjz helps to fix the implicit memory leak problem in image_nsfw_filter. #590