09 May 10:20

HYLcool

444537e

Release v1.3.3: Sandbox is accepted as Spotlight by ICML 2025; Add Img-Diff recipes. Latest

Major Updates

🎉 Our work on Data-Juicer Sandbox has been accepted as a Spotlight at ICML 2025 (top 2.6% of all submissions)!
Add new OPs and recipes for Img-Diff. #658

Enhancements

Support Hugging Face LLMs for the two llm_xxx_score_filter OPs. #655
Sync the Docker image to Aliyun OSS so it can be downloaded when Docker Hub is not accessible. #657
Split standalone and distributed unit tests to save time when re-running failed ones. #666

Bugs Fixed

Address possibly missing cfg in unify_format. #653
Improve clarity & fix bad links for some docs. #659

Acknowledgement

@co63oc helped fix some typos. #654

Full Changelog: v1.3.2...v1.3.3

Contributors

co63oc


25 Apr 11:17

yxdyc

v1.3.2

2172698

Release v1.3.2: Enhancements on usability & two OPs; some bug fixes

What's Changed

Human OP enhancements, in #642 #645
- update label-studio version
- make service script more robust
- add documentation
- optimizing fields mapping
OP efficiency optimization of document_minhash_deduplicator, in #639
Set temp_parser.usage to argparse.SUPPRESS to avoid printing excessive help logs, in #643
Fix date typo in #648
Fix docker building failure in #650
Fix StreamToLoguru compatibility issue with torch._dynamo in #651
add init file for annotation module, fix dj-process command error in #652
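
For the argparse.SUPPRESS item above (#643), here is a minimal sketch of the standard-library behavior it relies on. The name temp_parser comes from the note; the --config option and the sample arguments are hypothetical and only for illustration, and this is not Data-Juicer's actual parser setup.

```python
import argparse

# Hypothetical stand-in for a temporary parser; not Data-Juicer's real one.
temp_parser = argparse.ArgumentParser(add_help=False)
temp_parser.add_argument("--config", type=str, default=None)

# By default argparse generates a "usage: ..." line for the parser.
default_usage = temp_parser.format_usage()

# Setting usage to argparse.SUPPRESS makes argparse omit that line.
temp_parser.usage = argparse.SUPPRESS
suppressed_usage = temp_parser.format_usage()

# Parsing itself is unaffected by the suppressed usage text.
args, unknown = temp_parser.parse_known_args(
    ["--config", "demo.yaml", "--other"]
)
```

Only the usage text is affected; known arguments are still parsed and unrecognized ones are still returned separately by parse_known_args.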

New Contributor

@cmgzn made their first contribution in #651

Contributors

cmgzn


11 Apr 09:48

HYLcool

v1.3.1

e90a759

Release v1.3.1: added HumanOPs & fixed some bugs

Major Updates

💥 Prototype implementation of HumanOps (annotation). #617 Included features:
- boilerplate code for supporting label studio powered human annotation ops
- a human preference annotation reference implementation is provided
- label studio service script; can start up local instance using docker or pip, whichever is available
- reference configs and data
- event driven and notification mixins framework for ops

New OPs

extract_tables_from_html_mapper: extract tables from html texts. #634
general_fused_op: an explicitly fused operator designed to execute multiple sequential operations (OPs) on the same batch, enabling fine-grained control over data processing. #626
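
The idea behind an explicitly fused operator can be sketched in plain Python as follows. This is only an illustration of running several sequential steps on the same batch; the function names (fuse, strip_text, lowercase_text) and the dict-of-lists batch layout are assumptions made for the example, not the general_fused_op API.

```python
from typing import Callable, Dict, List

Batch = Dict[str, List[str]]


def strip_text(batch: Batch) -> Batch:
    # Hypothetical step 1: trim surrounding whitespace.
    return {**batch, "text": [t.strip() for t in batch["text"]]}


def lowercase_text(batch: Batch) -> Batch:
    # Hypothetical step 2: lowercase every text.
    return {**batch, "text": [t.lower() for t in batch["text"]]}


def fuse(ops: List[Callable[[Batch], Batch]]) -> Callable[[Batch], Batch]:
    """Return one callable that applies `ops` in order to the same batch."""

    def fused(batch: Batch) -> Batch:
        for op in ops:
            batch = op(batch)
        return batch

    return fused


fused_op = fuse([strip_text, lowercase_text])
result = fused_op({"text": ["  Hello ", "WORLD"]})
```

Because the steps run back to back on one batch, the order of the list passed to fuse decides the order of processing.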

Bug Fixed

fix dataset builder initialization failure #630
update Executor references from Executor to DefaultExecutor #632 #633
switch the backend of plt to avoid sub-process/thread error #633
fix some boundary condition bugs in several deduplicators #635 #637

Others

Check the dataset when loading, to support passing a dataset to the DefaultExecutor.run method. #633
update docs to highlight light env installation part. #636

Acknowledgement

@liuyuhanalex helps to add a new OP and fix some of the boundary condition bugs. #634 #635

Full Changelog: v1.3.0...v1.3.1

Contributors

liuyuhanalex


28 Mar 12:08

yxdyc

v1.3.0

1b9afd1

Release v1.3.0: Refactor of dataset builder and executor!

The Big Change 🚀

Refactor of dataset builder and executor, see #537, @cyruszhang
📜 YAML explicitly defines different sources of datasets; local and remote are defined separately.
🔧 More flexible parameterized control; supports source-specific parameters, validations, and extensible configurations.
🔌 Remove the Executor's hard-coded format support: no longer restricted to local JSON formats; the input format is determined dynamically via formatters/downloaders.
🚀 Enhanced Executor extensibility to natively support engines like Nemo, Dask, Spark, etc.
🔍 Add data format validation to ensure consistency and correctness.
🌐 Expanded data source support:
a. 📦 ModelScope integration.
b. 📚 ArXiv dataset import (download, decompress, ingest).
c. 📚 Wikipedia dataset support (download, decompress, ingest).
d. 🌐 Common Crawl integration (download, decompress, ingest).
🔗 Backward compatibility with existing dataset_path command-line syntax.
🔀 Support for data mixtures to combine multiple datasets dynamically.
🔧 Support for empty formatters/generated datasets without pre-defined config files.

Others 💡

🔊 New audio processing operator: audio_add_gaussian_noise (PR #622), @liuyuhanalex
📊 Added dynamic coverage rate badge to the README for transparency (PR #625)

Contributors

cyruszhang and liuyuhanalex


14 Mar 09:58

BeachWang

v1.2.2

8d09410

Release v1.2.2

Major Updates

🧪 Add document for API service. Add parameter transmission using json.dumps to support API calls for arbitrary registration functions and classes. #613
🚀 Add unit tests for the analysis module and utils module to increase test coverage. #604 #616
A new data synthesis method is proposed, which encourages LLMs to self-generate challenging cognitive questions, achieving superior data efficiency, cross-modality generalization, and SFT effects over SOTA baselines (e.g., 16% gain on MathVision using only 400 samples). See more details in MindGYM: Enhancing Vision-Language Models via Synthetic Self-Challenging Questions.

New OPs

llm_quality_score_filter: Filter that keeps samples with a high quality score estimated by an LLM; supports both API calls and local vLLM calls. #606 #614 #620
llm_difficulty_score_filter: Filter that keeps samples with a high difficulty score estimated by an LLM; supports both API calls and local vLLM calls. #606 #614 #620

Others

Fix config in LLaVa pretrain recipe. #610
Update news for MindGYM and fix doc. #615
Fix decode error through UTF-8 decoding. #618


28 Feb 07:50

HYLcool

v1.2.1

6014bcc

Release v1.2.1

Major Updates

DJ has been integrated in Ray's official Ecosystem and Example Gallery. Besides, our patch in DJ2.0 for the streaming JSON reader has been officially integrated by Apache Arrow.
Our work on contrastive data synthesis, ImgDiff, has been accepted by CVPR 2025!
Unit test optimization:
- split unit tests into partial and regression tests: the partial test is triggered by a PR and runs only the test cases corresponding to the changed files; the regression test runs all cases and is triggered at 7:00 every Friday, Beijing time. #598
- use primitive @unittest.skip and remove SKIPPED_TESTS. #586
- upload test coverage reports to GitHub artifacts. #586
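
The switch to the built-in @unittest.skip decorator mentioned above can be illustrated with the standard library alone. The test class and the skip reason below are hypothetical; the sketch shows only how a skipped test is recorded by unittest, not how Data-Juicer's test suite is organized.

```python
import unittest


class DemoTest(unittest.TestCase):
    def test_runs(self):
        self.assertEqual(1 + 1, 2)

    # The decorator records the test as skipped instead of running it.
    @unittest.skip("hypothetical reason: needs a GPU")
    def test_skipped(self):
        self.fail("never executed")


suite = unittest.defaultTestLoader.loadTestsFromTestCase(DemoTest)
result = unittest.TestResult()
suite.run(result)

# Each entry in result.skipped is a (test, reason) pair.
skip_reasons = [reason for _test, reason in result.skipped]
```

Keeping the skip next to the test makes the reason visible in the test report, which a separate list of skipped test names does not provide by itself.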

New OPs

image_remove_background_mapper: remove the background of images. #589

Others

add missing LOADED_AUDIOS to ALL_INTER_VARS to enable OP fusion and context sharing. #585
only build doc for py3.10. #586
move dependency on ray to minimal requirements. #586 #594 #595
allow executor and other tool functions to consume a loaded dataset in addition to the config file. #596 #597
fix undefined fileno bug of the logger. #594

Acknowledgement

@liuyuhanalex helps simplify the code logic of OP fusion, add a new OP image_remove_background_mapper, and fix some minor bugs. #581 #585 #589
@co63oc helps to fix typos in code and documents. #582 #583 #588 #591 #593
@danielhjz helps to fix the implicit memory leak problem in image_nsfw_filter. #590

Contributors

co63oc, danielhjz, and liuyuhanalex


14 Feb 09:40

yxdyc

v1.2.0

7820a4d

v1.2.0 Doc refactored; New algorithm proposed

What's New

📚 The DJ doc is refactored and improved, e.g., RecipeGallery, DeveloperGuide, DistributedProcess, DJ-related Competitions, typos, and bad links.
🔎 More unit-tests added.
🎛 The data pre-split and export are improved.
🔮 A new data selection method, DaaR, is proposed. See Diversity as a Reward: Fine-Tuning LLMs on a Mixture of Domain-Undetermined Data.

Detailed PRs

fix export error when export_stats columns is null in #557
Resplit input dataset in ray mode in #549
Refactor and improve doc for RecipeGallery, DeveloperGuide, DistributedProcess and DJ-related Competitions in #561
Resolve most skipped unit-tests in #559
fix translation error in #562
Add unittest for ray text dedup in #540
[Typo]correct a small typo in #563
update the 2.0 paper link & the DaaR news in #566
Fix typos in #571
Optimization for sdxl_prompt2prompt_mapper dependency importing in #570
Fix typos in #572

Acknowledgment

@liuyuhanalex and @co63oc made their first PRs

Full Changelog: v1.1.0...v1.2.0

Contributors

co63oc and liuyuhanalex


17 Jan 09:46

BeachWang

v1.1.0

030e786

Release v1.1.0

Major Updates

🧪 Users can now run ray-based distributed data processing under the guidance of the added docs. #523
🧪 The DJ-Cookbook has gathered numerous high-quality data processing recipes from various vertical fields, and the related documents have been updated on the homepage. #542
💥 Change Task mode to Actor mode for ray deduplication, allowing users to use these operators without installing Redis. #526
🚀 Append a summary of warnings and errors at the end of a run so that they remain recognizable under the sample fault tolerance mechanism. #534
🚀 Automatically update relevant documents when adding OPs to reduce the development burden. #527
🛝 Add usability tags for OPs:
- alpha tag for OPs in which only the basic OP implementations are finished;
- beta tag for OPs in which unittests are added based on the alpha version;
- stable tag for OPs in which OP optimizations related to DJ (e.g. model management, batched processing, OP fusion, ...) are added based on the beta version.

New OPs

image_segment_mapper: Perform segment-anything on images and return the bounding boxes. #550
mllm_mapper: Mapper to use MLLMs to generate texts for images. #550
sdxl_prompt2prompt_mapper: Use the generative model SDXL and image editing technique Prompt-to-Prompt to generate pairs of similar images. #550
sentence_augmentation_mapper: Augment sentences using LLMs. #550
text_pair_similarity_filter: Filter samples according to the similarity score between the text pair. #550

Bug Fixed

Add a global skip_op_error param to enable fault tolerance when executing the Data-Juicer analyzer and executor, while disabling fault tolerance for unit tests. #528
Fix model force download bug. #529
Fix IndexError if the number of samples in the result dataset is less than the number of workers when saving dataset to disk. #536
Fix missing field meta tag on ray mode. #538
Update max_tokens or max_new_tokens for vllm-based OPs to avoid too short generation. #544
Fix bug in the role playing data generation demo. #545
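
The fault-tolerance switch described in the skip_op_error item above can be sketched roughly as follows. Only the parameter name skip_op_error is taken from the note; the run_op helper, the text_length operator, and the choice to drop failing samples are assumptions for illustration, and Data-Juicer's real handling of failed samples may differ.

```python
from typing import Callable, Dict, List


def run_op(op: Callable[[Dict], Dict], samples: List[Dict],
           skip_op_error: bool = True) -> List[Dict]:
    """Apply `op` to each sample; on failure either skip it or re-raise."""
    results = []
    for sample in samples:
        try:
            results.append(op(sample))
        except Exception as exc:
            if not skip_op_error:
                raise  # strict mode, e.g. what a unit test would want
            # Tolerant mode: report the failure and continue.
            print(f"skipped sample {sample!r}: {exc!r}")
    return results


def text_length(sample: Dict) -> Dict:
    # Hypothetical operator; raises KeyError when "text" is missing.
    return {**sample, "length": len(sample["text"])}


samples = [{"text": "abc"}, {"no_text": 1}, {"text": "hello"}]
tolerant = run_op(text_length, samples, skip_op_error=True)
```

With skip_op_error=False the same call raises the KeyError from the second sample, which is the behavior a unit test needs in order to surface the bug.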

Others

Enhance unit test for API calling OPs. #528
Remove sandbox requirements installation from Dockerfile. #530
Update the datasource related APIs to be compatible with the latest version of Ray. #532
Limit the generated qa num for each text in generate_qa_from_text_mapper. #541
Update docs for preparing DJ2.0 release. #542
Update a quick cdn link for arch figure. #543
Add a video demo for role playing data generation. #545
Optimize op doc for global textual search. #552
Use a more stable and fast translator than google translator for automatic OP doc building. #554

Acknowledgement

@Qirui-jiao made great contributions to enrich the Data-Juicer OP pool. #550

Contributors

Qirui-jiao


03 Jan 10:59

HYLcool

v1.0.3

87efd5e

Release v1.0.3: More Powerful Distributed MinHashLSH Deduplicator; Post-Tuning Formats & OPs; Ray Actor for GPU-based OPs

Major Updates

💥 Support a Ray-based MinHashLSH deduplicator, which implements a multi-process Union-Find set based on Ray Actors and the BTS algorithm to merge equivalence classes. #502
💥 Support post-tuning dataset formats in LLaMA-Factory and ModelScope-Swift.
- Data-Juicer chooses the Query-Response format as the intermediate format for the post-tuning dataset. #514
- Refine the overall intermediate format of Data-Juicer to support various dataset formats better. (meta, stats) #514 #518
- Provide several format conversion tools for converting to Data-Juicer format and vice versa. #514
🚀 Add 10 more post-tuning OPs to process post-tuning datasets better. They are listed in detail in the New OPs section below. #513
🚀 Support Ray Actor mode for GPU-based OPs. #511
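
The equivalence-class merging mentioned for the MinHashLSH deduplicator above rests on a Union-Find (disjoint-set) structure. The following is a single-process sketch of that structure only; it does not use Ray and does not implement the BTS algorithm, and the names UnionFind and keep_representatives as well as the sample ids are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


class UnionFind:
    """Minimal disjoint-set over integer ids with path compression."""

    def __init__(self) -> None:
        self.parent: Dict[int, int] = {}

    def find(self, x: int) -> int:
        self.parent.setdefault(x, x)
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the path at the root.
        while self.parent[x] != root:
            nxt = self.parent[x]
            self.parent[x] = root
            x = nxt
        return root

    def union(self, a: int, b: int) -> None:
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            # The smaller id becomes the root of the merged class.
            self.parent[max(ra, rb)] = min(ra, rb)


def keep_representatives(ids: Iterable[int],
                         duplicate_pairs: Iterable[Tuple[int, int]]) -> List[int]:
    """Keep one id (the smallest) per class of near-duplicates."""
    uf = UnionFind()
    for a, b in duplicate_pairs:
        uf.union(a, b)
    return sorted({uf.find(i) for i in ids})


# Pairs as an LSH stage might report them: 0~1, 1~2, 4~5.
kept = keep_representatives(range(6), [(0, 1), (1, 2), (4, 5)])
```

Merging pairs transitively is what turns pairwise near-duplicate hits into whole duplicate groups, so that a single sample per group can be retained.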

New OPs

Post-tuning OPs for fine-grained analysis of dialog data. #513

Mapper

dialog_intent_detection_mapper: Mapper to generate user's intent labels in feed back dialog data.
dialog_sentiment_detection_mapper: Mapper to generate user's sentiment labels in feed back dialog data.
dialog_sentiment_intensity_mapper: Mapper to predict user's sentiment intensity (from -5 to 5 in default prompt) in feed back dialog data.
dialog_topic_detection_mapper: Mapper to generate user's topic labels in feed back dialog data.
query_intent_detection_mapper: Mapper to predict user's intent label in a query.
query_sentiment_detection_mapper: Mapper to predict user's sentiment label ('negative', 'neutral' and 'positive') in a query.
query_topic_detection_mapper: Mapper to predict user's topic label in a query.

Aggregator

meta_tags_aggregator: Merge similar meta tags to one tag.

Selector

tags_specified_field_selector: Select samples based on the tags of specified field.

Grouper

naive_reverse_grouper: Split batched samples into individual samples.
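
What this kind of splitting can look like is sketched below, assuming a batched sample is a dict of equal-length lists (one list per field). That layout and the function name split_batched_sample are assumptions for illustration; this is not the naive_reverse_grouper implementation.

```python
from typing import Any, Dict, List


def split_batched_sample(batched: Dict[str, List[Any]]) -> List[Dict[str, Any]]:
    """Turn a column-oriented batch into a list of row-oriented samples."""
    keys = list(batched)
    columns = [batched[k] for k in keys]
    # zip(*columns) walks the columns in lockstep, one row at a time.
    return [dict(zip(keys, row)) for row in zip(*columns)]


batched = {"text": ["a", "b"], "score": [0.1, 0.9]}
samples = split_batched_sample(batched)
```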

Bug Fixed

Fix the wrong argument passing in generate_qa_from_example_mapper. #517
Update the out-of-date Dingding QR code on the main page. #513

Acknowledgement

@jackylee-ch made their first contribution to help fix several invalid links in the document. #521

Full Changelog: v1.0.2...v1.0.3

Contributors

jackylee-ch


20 Dec 12:15

yxdyc

v1.0.2

a26dcc7

Release v1.0.2

Major Updates

Added more mapper/grouper/aggregator OPs for post-tuning scenarios.
Optimized the distributed mode performance and usability with more automatic features.

DJ-Operators

extract_support_text_mapper, relation_identity_mapper, python_file_mapper, #500
naive_grouper, key_value_grouper, #500
nested_aggregator, entity_attribute_aggregator, most_relavant_entities_aggregator, #500
video_extract_frames_mapper, #507

Performance

Optimize ray mode performance, #442
Patch for Performance Benchmark in CI/CD workflows, #506
DJ Ray mode supports streaming loading of jsonl files, #515

Usability and Analysis

support dj-install in recipe-level, #508
support dj-analyze with --auto mode, #512
support op-wise insight auto mining, #516

Acknowledgment

Thanks to Data-Juicer users and contributors for their helpful feedback, issues and PRs!


Releases: modelscope/data-juicer
