Releases: modelscope/data-juicer
Release v1.3.3: Sandbox is accepted as Spotlight by ICML 2025; Add Img-Diff recipes.
Major Updates
- 🎉 Our work on Data-Juicer Sandbox has been accepted as a Spotlight by ICML 2025 (top 2.6% of all submissions)!
- Add new OPs and recipes for Img-Diff. #658
Enhancements
- Support HF llm for two llm_xxx_score_filter OPs. #655
- Sync docker image to Aliyun OSS for downloading when Docker Hub is not accessible. #657
- Split standalone and distributed unit tests to save time when re-running failed ones. #666
Bugs Fixed
- Address possibly missing cfg in unify_format. #653
- Improve clarity & fix bad links for some docs. #659
Acknowledgement
Full Changelog: v1.3.2...v1.3.3
Release v1.3.2: Enhancements on usability & two OPs; some bugs fixes
What's Changed
- Human OP enhancements, in #642 #645
- update label-studio version
- make service script more robust
- add documentation
- optimizing fields mapping
- OP efficiency optimization of document_minhash_deduplicator, in #639
- set temp_parser.usage to argparse.SUPPRESS to avoid excessive help logging, in #643
- fix date typo in #648
- Fix docker building failure in #650
- Fix StreamToLoguru compatibility issue with torch._dynamo in #651
- add init file for annotation module, fix dj-process command error in #652
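The argparse.SUPPRESS change in #643 relies on standard argparse behaviour: a parser whose usage is set to argparse.SUPPRESS emits no usage line. The following is a minimal sketch of that behaviour only; the parser flags and the argument list are hypothetical and are not taken from the Data-Juicer code base.

```python
import argparse

# Hypothetical stand-in for a temporary parser that pre-parses a few flags.
temp_parser = argparse.ArgumentParser(add_help=False)
temp_parser.add_argument("--config", type=str, default=None)

# With usage suppressed, argparse produces no usage text for this parser.
temp_parser.usage = argparse.SUPPRESS

# parse_known_args tolerates flags that this temporary parser does not define.
args, unknown = temp_parser.parse_known_args(
    ["--config", "demo.yaml", "--np", "4"]
)

print(args.config)
print(unknown)
print(repr(temp_parser.format_usage()))
```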
Release v1.3.1: added HumanOPs & fixed some bugs
Major Updates
- 💥 prototype Implementation for HumanOps (annotation). #617 Included features:
- boilerplate code for supporting label studio powered human annotation ops
- a human preference annotation reference implementation is provided
- label studio service script; can start up local instance using docker or pip, whichever is available
- reference configs and data
- event driven and notification mixins framework for ops
New OPs
- extract_tables_from_html_mapper: extract tables from html texts. #634
- general_fused_op: an explicitly fused operator designed to execute multiple sequential operations (OPs) on the same batch, enabling fine-grained control over data processing. #626
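The idea behind an explicitly fused operator, running several sequential operations on one batch before moving on, can be illustrated with a small generic sketch. It is not the general_fused_op implementation; fuse_ops and the three toy operations below are hypothetical names used only for illustration.

```python
from typing import Callable, Dict, List

Batch = Dict[str, List]
Op = Callable[[Batch], Batch]


def fuse_ops(ops: List[Op]) -> Op:
    """Return one callable that applies every op, in order, to the same batch."""

    def fused(batch: Batch) -> Batch:
        for op in ops:
            batch = op(batch)
        return batch

    return fused


def strip_text(batch: Batch) -> Batch:
    return {**batch, "text": [t.strip() for t in batch["text"]]}


def lowercase_text(batch: Batch) -> Batch:
    return {**batch, "text": [t.lower() for t in batch["text"]]}


def drop_empty(batch: Batch) -> Batch:
    # Keep only the rows whose text is non-empty, across all columns.
    keep = [bool(t) for t in batch["text"]]
    return {
        key: [v for v, flag in zip(values, keep) if flag]
        for key, values in batch.items()
    }


fused_op = fuse_ops([strip_text, lowercase_text, drop_empty])
result = fused_op({"text": ["  Hello ", "   ", "World"], "id": [1, 2, 3]})
print(result)
```

The batch passes through all three steps in a single call, so intermediate results do not need to be written out between operations.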
Bug Fixed
- fix dataset builder initialization failure #630
- update Executor references from Executor to DefaultExecutor #632 #633
- switch the backend of plt to avoid sub-process/thread error #633
- fix some boundary condition bugs in several deduplicators #635 #637
Others
- check dataset when loading to support passing a dataset to the DefaultExecutor.run method. #633
- update docs to highlight light env installation part. #636
Acknowledgement
- @liuyuhanalex helps to add a new OP and fix some of the boundary condition bugs. #634 #635
Full Changelog: v1.3.0...v1.3.1
Release v1.3.0: Refactor of dataset builder and executor!
The Big Change 🚀
Refactor of dataset builder and executor, see #537, @cyruszhang
📜 YAML explicitly defines different sources of datasets; local and remote are defined separately.
🔧 More flexible parameterized control; supports source-specific parameters, validations, and extensible configurations.
🔌 Unbind Executor's hardcode support: No longer restricted to local JSON formats; input format is determined dynamically via formatters/downloaders.
🚀 Enhanced Executor extensibility to natively support engines like Nemo, Dask, Spark, etc.
🔍 Add data format validation to ensure consistency and correctness.
🌐 Expanded data source support:
a. 📦 ModelScope integration.
b. 📚 ArXiv dataset import (download, decompress, ingest).
c. 📚 Wikipedia dataset support (download, decompress, ingest).
d. 🌐 Common Crawl integration (download, decompress, ingest).
🔗 Backward compatibility with existing dataset_path command-line syntax.
🔀 Support for data mixtures to combine multiple datasets dynamically.
🔧 Support for empty formatters/generated datasets without pre-defined config files.
Others 💡
🔊 New audio processing operator: audio_add_gaussian_noise (PR #622), @liuyuhanalex
📊 Added dynamic coverage rate badge to the README for transparency (PR #625)
Release v1.2.2
Major Updates
- 🧪 Add document for API service. Add parameter transmission using json.dumps to support API calls for arbitrary registration functions and classes. #613
- 🚀 Add unit tests for the analysis module and utils module to increase test coverage. #604 #616
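Passing parameters via json.dumps lets structured values (lists, booleans, nested dicts) travel as plain strings and be restored on the receiving side. The sketch below shows that round trip under stated assumptions: REGISTRY, register, encode_params, and call_registered are hypothetical helpers, and the actual wire format of the Data-Juicer API service is not described in these notes.

```python
import json

# Hypothetical registry of callable functions, keyed by name.
REGISTRY = {}


def register(name):
    def decorator(func):
        REGISTRY[name] = func
        return func

    return decorator


@register("count_tokens")
def count_tokens(text, lowercase=False, stopwords=None):
    tokens = text.split()
    if lowercase:
        tokens = [t.lower() for t in tokens]
    stop = set(stopwords or [])
    return len([t for t in tokens if t not in stop])


def encode_params(**kwargs):
    # Client side: JSON-encode each value so it can be sent as a string.
    return {key: json.dumps(value) for key, value in kwargs.items()}


def call_registered(name, encoded):
    # Server side: decode each value, then call the registered function.
    kwargs = {key: json.loads(value) for key, value in encoded.items()}
    return REGISTRY[name](**kwargs)


encoded = encode_params(
    text="The quick brown fox", lowercase=True, stopwords=["the"]
)
result = call_registered("count_tokens", encoded)
print(encoded)
print(result)
```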
A new data synthesis method is proposed, which encourages LLMs to self-generate challenging cognitive questions, achieving superior data efficiency, cross-modality generalization, and SFT effects over SOTA baselines (e.g., 16% gain on MathVision using only 400 samples). See more details in MindGym: Enhancing Vision-Language Models via Synthetic Self-Challenging Questions.
New OPs
- llm_quality_score_filter: Filter to keep samples with a high quality score estimated by an LLM, supporting both API calling and local vLLM calling. #606 #614 #620
- llm_difficulty_score_filter: Filter to keep samples with a high difficulty score estimated by an LLM, supporting both API calling and local vLLM calling. #606 #614 #620
Release v1.2.1
Major Updates
DJ has been integrated in Ray's official Ecosystem and Example Gallery. Besides, our patch in DJ2.0 for the streaming JSON reader has been officially integrated by Apache Arrow.
Our work on contrastive data synthesis, ImgDiff, has been accepted by CVPR 2025!
- Unit test optimization:
- split unit tests into partial and regression tests: the partial test is triggered by PRs and only runs the test cases corresponding to changed files; the regression test runs all cases and is triggered at 7:00 every Friday, Beijing time. #598
- use primitive @unittest.skip and remove SKIPPED_TESTS. #586
- upload test coverage reports to GitHub artifacts. #586
New OPs
- image_remove_background_mapper: remove the background of images. #589
Others
- add missing LOADED_AUDIOS to ALL_INTER_VARS to enable OP fusion and context sharing. #585
- only build doc for py3.10. #586
- move dependency on ray to minimal requirements. #586 #594 #595
- allow executor and other tool functions to consume a loaded dataset in addition to the config file. #596 #597
- fix undefined fileno bug of the logger. #594
Acknowledgement
- @liuyuhanalex helps simplify the code logic of OP fusion, add a new OP image_remove_background_mapper, and fix some minor bugs. #581 #585 #589
- @co63oc helps to fix typos in code and documents. #582 #583 #588 #591 #593
- @danielhjz helps to fix the implicit memory leak problem in image_nsfw_filter. #590
v1.2.0 Doc refactored; New algorithm proposed
What's New
- 📚 The DJ doc is refactored and improved, e.g., RecipeGallery, DeveloperGuide, DistributedProcess, DJ-related Competitions, typos, and bad links
- 🔎 More unit-tests added.
- 🎛 The data pre-split and export are improved.
- 🔮 A new data selection method, DaaR, is proposed. See Diversity as a Reward: Fine-Tuning LLMs on a Mixture of Domain-Undetermined Data.
Detailed PRs
- fix export error when export_stats columns is null in #557
- Resplit input dataset in ray mode in #549
- Refactor and improve doc for RecipeGallery, DeveloperGuide, DistributedProcess and DJ-related Competitions in #561
- Resolve most skipped unit-tests in #559
- fix translation error in #562
- Add unittest for ray text dedup in #540
- [Typo]correct a small typo in #563
- update the 2.0 paper link & the DaaR news in #566
- Fix typos in #571
- Optimization for sdxl_prompt2prompt_mapper dependency importing in #570
- Fix typos in #572
Acknowledgment
- @liuyuhanalex @co63oc made their first PRs
Full Changelog: v1.1.0...v1.2.0
Release v1.1.0
Major Updates
- 🧪 Users can now run ray-based distributed data processing under the guidance of added docs. #523
- 🧪 The DJ-Cookbook has gathered numerous high-quality data processing recipes from various vertical fields, and the related documents have been updated on the homepage. #542
- 💥 Change Task mode to Actor mode for ray deduplication, allowing users to use these operators without installing Redis. #526
- 🚀 Append a log summary of warnings and errors at the end of each run to make them recognizable under the sample fault tolerance mechanism. #534
- 🚀 Automatically update relevant documents when adding OPs to reduce the development burden. #527
- 🛝 Add usability tags for OPs:
- alpha tag for OPs in which only the basic OP implementations are finished;
- beta tag for OPs in which unittests are added based on the alpha version;
- stable tag for OPs in which OP optimizations related to DJ (e.g. model management, batched processing, OP fusion, ...) are added based on the beta version.
New OPs
- image_segment_mapper: Perform segment-anything on images and return the bounding boxes. #550
- mllm_mapper: Mapper to use MLLMs to generate texts for images. #550
- sdxl_prompt2prompt_mapper: Use the generative model SDXL and image editing technique Prompt-to-Prompt to generate pairs of similar images. #550
- sentence_augmentation_mapper: Augment sentences using LLMs. #550
- text_pair_similarity_filter: Filter samples according to the similarity score between the text pair. #550
Bug Fixed
- Add global skip_op_error param to enable fault tolerance when executing the Data-Juicer analyzer and executor, but disable fault tolerance for unit tests. #528
- Fix model force download bug. #529
- Fix IndexError if the number of samples in the result dataset is less than the number of workers when saving dataset to disk. #536
- Fix missing field meta tag on ray mode. #538
- Update max_tokens or max_new_tokens for vllm-based OPs to avoid too short generation. #544
- Fix bug in the role playing data generation demo. #545
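The IndexError fixed in #536 arises when a dataset has fewer samples than workers. One common guard, shown here as a hedged sketch and not as the actual patch, is to cap the number of shards at the number of samples. The helpers effective_num_shards and split_into_shards are hypothetical.

```python
from typing import List, Sequence


def effective_num_shards(num_samples: int, num_workers: int) -> int:
    # Never use more shards than samples, and always use at least one.
    return max(1, min(num_samples, num_workers))


def split_into_shards(samples: Sequence, num_workers: int) -> List[list]:
    n = effective_num_shards(len(samples), num_workers)
    # Round-robin assignment: shard i takes samples i, i + n, i + 2n, ...
    return [list(samples[i::n]) for i in range(n)]


shards = split_into_shards(["a", "b", "c"], num_workers=8)
print(shards)
print(split_into_shards([], num_workers=8))
```

With three samples and eight workers, only three shards are produced, so no worker is asked to index into an empty range.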
Others
- Enhance unit test for API calling OPs. #528
- Remove sandbox requirements installation from Dockerfile. #530
- Update the datasource related APIs to be compatible with the latest version of Ray. #532
- Limit the generated qa num for each text in generate_qa_from_text_mapper. #541
- Update docs for preparing DJ2.0 release. #542
- Update a quick cdn link for arch figure. #543
- Add a video demo for role playing data generation. #545
- Optimize op doc for global textual search. #552
- Use a more stable and fast translator than google translator for automatic OP doc building. #554
Acknowledgement
- @Qirui-jiao made great contributions to enrich the Data-Juicer OP pool. #550
Release v1.0.3: More Powerful Distributed MinHashLSH Deduplicator; Post-Tuning Formats & OPs; Ray Actor for GPU-based OPs
Major Updates
- 💥 Support Ray-based MinHashLSH deduplicator, which implements a multi-process Union-Find set based on Ray Actor and the BTS algorithm to complete equivalence class merging. #502
- 💥 Support post-tuning dataset formats in LLaMA-Factory and ModelScope-Swift.
- Data-Juicer chooses the Query-Response format as the intermediate format for the post-tuning dataset. #514
- Refine the overall intermediate format of Data-Juicer to support various dataset formats better. (meta, stats) #514 #518
- Provide several format conversion tools for converting to Data-Juicer format and vice versa. #514
- 🚀 Add 10 more post-tuning OPs to process post-tuning datasets better. They are listed in detail in the New OPs section below. #513
- 🚀 Support Ray Actor mode for GPU-based OPs. #511
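The equivalence class merging mentioned for the MinHashLSH deduplicator is the classic job of a Union-Find (disjoint-set) structure. The sketch below is a single-process illustration only: it does not use Ray Actors or the BTS algorithm, and the candidate pairs are made-up values standing in for pairs that an LSH stage might report as near-duplicates.

```python
class UnionFind:
    """Minimal single-process disjoint-set with path compression."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        # First pass: walk up to the root.
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Second pass: point every node on the path directly at the root.
        while self.parent[x] != root:
            nxt = self.parent[x]
            self.parent[x] = root
            x = nxt
        return root

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            # Use the smaller id as representative for deterministic output.
            self.parent[max(ra, rb)] = min(ra, rb)


# Hypothetical near-duplicate pairs over document ids 1..6.
candidate_pairs = [(1, 2), (2, 3), (5, 6)]

uf = UnionFind()
for a, b in candidate_pairs:
    uf.union(a, b)

clusters = {}
for doc_id in [1, 2, 3, 4, 5, 6]:
    clusters.setdefault(uf.find(doc_id), []).append(doc_id)

# Keep one representative per equivalence class.
kept = sorted(clusters)
print(clusters)
print(kept)
```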
New OPs
Post-tuning OPs for fine-grained analysis of dialog data. #513
Mapper
- dialog_intent_detection_mapper: Mapper to generate user's intent labels in feed back dialog data.
- dialog_sentiment_detection_mapper: Mapper to generate user's sentiment labels in feed back dialog data.
- dialog_sentiment_intensity_mapper: Mapper to predict user's sentiment intensity (from -5 to 5 in default prompt) in feed back dialog data.
- dialog_topic_detection_mapper: Mapper to generate user's topic labels in feed back dialog data.
- query_intent_detection_mapper: Mapper to predict user's Intent label in a query.
- query_sentiment_detection_mapper: Mapper to predict user's sentiment label ('negative', 'neutral' and 'positive') in a query.
- query_topic_detection_mapper: Mapper to predict user's topic label in a query.
Aggregator
- meta_tags_aggregator: Merge similar meta tags to one tag.
Selector
- tags_specified_field_selector: Select samples based on the tags of specified field.
Grouper
- naive_reverse_grouper: Split a batched sample into samples.
Bug Fixed
- Fix the wrong argument passing in generate_qa_from_example_mapper. #517
- Update the out-of-date Dingding QR code on the main page. #513
Acknowledgement
- @jackylee-ch made their first contribution to help fix several invalid links in the document. #521
Full Changelog: v1.0.2...v1.0.3
Release v1.0.2
Major Updates
- Added more mapper/grouper/aggregator OPs for post-tuning scenarios.
- Optimized the distributed mode performance and usability with more automatic features.
DJ-Operators
- extract_support_text_mapper, relation_identity_mapper, python_file_mapper, #500
- naive_grouper, key_value_grouper, #500
- nested_aggregator, entity_attribute_aggregator, most_relavant_entities_aggregator, #500
- video_extract_frames_mapper, #507
Performance
- Optimize ray mode performance, #442
- Patch for Performance Benchmark in CI/CD workflows, #506
- DJ Ray mode supports streaming loading of jsonl files, #515
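Streaming loading of jsonl files means records are parsed line by line and handed on in small batches, without reading the whole file into memory. The sketch below shows the general idea in plain Python; it does not use Ray or Arrow, and iter_jsonl and iter_batches are hypothetical helper names.

```python
import io
import json
from typing import Iterable, Iterator, List, TextIO


def iter_jsonl(stream: TextIO) -> Iterator[dict]:
    """Yield one parsed record per non-empty line, reading lazily."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


def iter_batches(records: Iterable[dict], batch_size: int) -> Iterator[List[dict]]:
    """Group a lazy record stream into lists of at most batch_size items."""
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


# An in-memory stream stands in for an open jsonl file.
demo = io.StringIO('{"text": "a"}\n{"text": "b"}\n\n{"text": "c"}\n')
batches = list(iter_batches(iter_jsonl(demo), batch_size=2))
print(batches)
```

Replacing the in-memory stream with an open file handle gives the same behaviour on a file on disk.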
Usability and Analysis
- support dj-install in recipe-level, #508
- support dj-analyze with --auto mode, #512
- support op-wise insight auto mining, #516
Acknowledgment
Thanks to Data-Juicer users and contributors for their helpful feedback, issues and PRs!