
Releases: modelscope/data-juicer

Release v1.3.3: Sandbox is accepted as Spotlight by ICML 2025; Add Img-Diff recipes.

09 May 10:20
444537e

Major Updates

  • 🎉 Our work on Data-Juicer Sandbox has been accepted as a Spotlight by ICML 2025 (top 2.6% of all submissions)!
  • Add new OPs and recipes for Img-Diff. #658

Enhancements

  • Support Hugging Face LLMs for the two llm_xxx_score_filter OPs. #655
  • Sync the Docker image to Aliyun OSS so it can be downloaded when Docker Hub is not accessible. #657
  • Split standalone and distributed unit tests to save time when re-running failed ones. #666

Bugs Fixed

  • Address possibly missing cfg in unify_format. #653
  • Improve clarity & fix bad links for some docs. #659


Full Changelog: v1.3.2...v1.3.3

Release v1.3.2: Enhancements on usability & two OPs; some bug fixes

25 Apr 11:17
2172698

What's Changed

  • Human OP enhancements, in #642 #645
    • update label-studio version
    • make service script more robust
    • add documentation
    • optimize field mapping
  • OP efficiency optimization of document_minhash_deduplicator, in #639
  • set temp_parser.usage to argparse.SUPPRESS to avoid excessive help output in the log, in #643
  • fix date typo in #648
  • Fix docker building failure in #650
  • Fix StreamToLoguru compatibility issue with torch._dynamo in #651
  • add init file for annotation module, fix dj-process command error in #652
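
The usage-suppression change above can be illustrated with the standard library alone. This is a minimal sketch, assuming temp_parser is an argparse-style helper parser; the --config and --np arguments are hypothetical and only serve the illustration.

```python
import argparse

# Hypothetical helper parser; only `usage = argparse.SUPPRESS` mirrors the note above.
temp_parser = argparse.ArgumentParser(add_help=False)
temp_parser.add_argument("--config", type=str, default=None)

# With SUPPRESS, the formatter skips the "usage: ..." block entirely.
temp_parser.usage = argparse.SUPPRESS

# parse_known_args returns unrecognized arguments instead of exiting with an error.
args, unknown = temp_parser.parse_known_args(
    ["--config", "demo.yaml", "--np", "4"]
)

help_text = temp_parser.format_help()
print(args.config, unknown)
```

The helper parser still reads the arguments it knows about, while the help text it can emit no longer starts with a usage line.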


Release v1.3.1: added HumanOPs & fixed some bugs

11 Apr 09:48
e90a759

Major Updates

  • 💥 Prototype implementation for HumanOps (annotation). #617 Included features:
    • boilerplate code for supporting label studio powered human annotation ops
    • a human preference annotation reference implementation is provided
    • label studio service script; can start up local instance using docker or pip, whichever is available
    • reference configs and data
    • event driven and notification mixins framework for ops

New OPs

  • extract_tables_from_html_mapper: extract tables from html texts. #634
  • general_fused_op: an explicitly fused operator designed to execute multiple sequential operations (OPs) on the same batch, enabling fine-grained control over data processing. #626
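
The idea behind an explicitly fused operator can be sketched as follows. This is not the Data-Juicer implementation: FusedOp, strip_whitespace, and lowercase are hypothetical names, and a batch is assumed to be a dict of column lists.

```python
from typing import Callable, Dict, List, Sequence

Batch = Dict[str, List[str]]
Op = Callable[[Batch], Batch]


def strip_whitespace(batch: Batch) -> Batch:
    """Hypothetical mapper: trim surrounding whitespace from each text."""
    return {**batch, "text": [t.strip() for t in batch["text"]]}


def lowercase(batch: Batch) -> Batch:
    """Hypothetical mapper: lowercase each text."""
    return {**batch, "text": [t.lower() for t in batch["text"]]}


class FusedOp:
    """Run several ops back to back on the same in-memory batch."""

    def __init__(self, ops: Sequence[Op]):
        self.ops = list(ops)

    def __call__(self, batch: Batch) -> Batch:
        # The batch is handed from one op to the next without being
        # written back to the dataset in between.
        for op in self.ops:
            batch = op(batch)
        return batch


fused = FusedOp([strip_whitespace, lowercase])
result = fused({"text": ["  Hello ", "WORLD"]})
print(result)
```

Listing the ops explicitly gives the user control over which operations run together and in what order.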

Bug Fixed

  • fix dataset builder initialization failure #630
  • update Executor references from Executor to DefaultExecutor #632 #633
  • switch the backend of plt to avoid sub-process/thread error #633
  • fix some boundary condition bugs in several deduplicators #635 #637

Others

  • check the dataset when loading so that a dataset can be passed to the DefaultExecutor.run method. #633
  • update docs to highlight the light environment installation part. #636


Full Changelog: v1.3.0...v1.3.1

Release v1.3.0: Refactor of dataset builder and executor!

28 Mar 12:08
1b9afd1

The Big Change 🚀

Refactor of dataset builder and executor, see #537, @cyruszhang
📜 YAML explicitly defines different sources of datasets; local and remote are defined separately.
🔧 More flexible parameterized control; supports source-specific parameters, validations, and extensible configurations.
🔌 Remove the Executor's hardcoded format support: it is no longer restricted to local JSON formats; the input format is determined dynamically via formatters/downloaders.
🚀 Enhanced Executor extensibility to natively support engines like Nemo, Dask, Spark, etc.
🔍 Add data format validation to ensure consistency and correctness.
🌐 Expanded data source support:
a. 📦 ModelScope integration.
b. 📚 ArXiv dataset import (download, decompress, ingest).
c. 📚 Wikipedia dataset support (download, decompress, ingest).
d. 🌐 Common Crawl integration (download, decompress, ingest).
🔗 Backward compatibility with existing dataset_path command-line syntax.
🔀 Support for data mixtures to combine multiple datasets dynamically.
🔧 Support for empty formatters/generated datasets without pre-defined config files.

Others 💡

🔊 New audio processing operator: audio_add_gaussian_noise (PR #622), @liuyuhanalex
📊 Added dynamic coverage rate badge to the README for transparency (PR #625)

Release v1.2.2

14 Mar 09:58
8d09410

Major Updates

  • 🧪 Add documentation for the API service. Pass parameters via json.dumps to support API calls to arbitrary registered functions and classes. #613
  • 🚀 Add unit tests for the analysis module and utils module to increase test coverage. #604 #616
  • A new data synthesis method is proposed, which encourages LLMs to self-generate challenging cognitive questions, achieving superior data efficiency, cross-modality generalization, and SFT effects over SOTA baselines (e.g., 16% gain on MathVision using only 400 samples). See more details in MindGym: Enhancing Vision-Language Models via Synthetic Self-Challenging Questions.

New OPs

  • llm_quality_score_filter: Filter to keep samples with a high quality score estimated by an LLM, supporting both API calls and local vLLM calls. #606 #614 #620
  • llm_difficulty_score_filter: Filter to keep samples with a high difficulty score estimated by an LLM, supporting both API calls and local vLLM calls. #606 #614 #620

Others

  • Fix config in LLaVa pretrain recipe. #610
  • Update news for MindGYM and fix doc. #615
  • Fix decode error through UTF-8 decoding. #618

Release v1.2.1

28 Feb 07:50
6014bcc

Major Updates

  • DJ has been integrated in Ray's official Ecosystem and Example Gallery. Besides, our patch in DJ2.0 for the streaming JSON reader has been officially integrated by Apache Arrow.
  • Our work on contrastive data synthesis, ImgDiff, has been accepted by CVPR 2025!
  • Unit test optimization:
    • split unit tests into partial and regression tests: the partial test is triggered by a PR and only runs the test cases corresponding to the changed files; the regression test runs all cases and is triggered at 7:00 every Friday, Beijing time. #598
    • use primitive @unittest.skip and remove SKIPPED_TESTS. #586
    • upload test coverage reports to GitHub artifacts. #586
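
For reference, skipping with the built-in decorator looks like the sketch below. The test class, test names, and skip reason are hypothetical and are not taken from the Data-Juicer test suite.

```python
import unittest


class DemoOpTest(unittest.TestCase):
    def test_fast_path(self):
        self.assertEqual(sum([1, 2, 3]), 6)

    # The skip reason is recorded in the test result, so no separate
    # skip list has to be maintained next to the tests.
    @unittest.skip("hypothetical reason: needs a GPU runner")
    def test_gpu_path(self):
        self.fail("never executed")


suite = unittest.defaultTestLoader.loadTestsFromTestCase(DemoOpTest)
result = unittest.TestResult()
suite.run(result)

print(len(result.skipped), result.wasSuccessful())
```

Keeping the skip marker on the test itself makes the reason visible in reports and lets the test be re-enabled by deleting one line.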

New OPs

  • image_remove_background_mapper: remove the background of images. #589

Others

  • add missing LOADED_AUDIOS to ALL_INTER_VARS to enable OP fusion and context sharing. #585
  • only build doc for py3.10. #586
  • move dependency on ray to minimal requirements. #586 #594 #595
  • allow executor and other tool functions to consume a loaded dataset in addition to the config file. #596 #597
  • fix undefined fileno bug of the logger. #594


v1.2.0 Doc refactored; New algorithm proposed

14 Feb 09:40
7820a4d

What's New

Detailed PRs

  • fix export error when export_stats columns is null in #557
  • Resplit input dataset in ray mode in #549
  • Refactor and improve doc for RecipeGallery, DeveloperGuide, DistributedProcess and DJ-related Competitions in #561
  • Resolve most skipped unit-tests in #559
  • fix translation error in #562
  • Add unittest for ray text dedup in #540
  • [Typo]correct a small typo in #563
  • update the 2.0 paper link & the DaaR news in #566
  • Fix typos in #571
  • Optimization for sdxl_prompt2prompt_mapper dependency importing in #570
  • Fix typos in #572


Full Changelog: v1.1.0...v1.2.0

Release v1.1.0

17 Jan 09:46
030e786

Major Updates

  • 🧪 Users can now run ray-based distributed data processing under the guidance of added docs. #523
  • 🧪 The DJ-Cookbook has gathered numerous high-quality data processing recipes from various vertical fields, and the related documents have been updated on the homepage. #542
  • 💥 Change Task mode to Actor mode for ray deduplication, allowing users to use these operators without installing Redis. #526
  • 🚀 Append a summary of warnings and errors to the log at the end of a run to make them recognizable under the sample fault tolerance mechanism. #534
  • 🚀 Automatically update relevant documents when adding OPs to reduce the development burden. #527
  • 🛝 Add usability tags for OPs:
    • alpha tag for OPs in which only the basic OP implementations are finished;
    • beta tag for OPs in which unittests are added based on the alpha version;
    • stable tag for OPs in which OP optimizations related to DJ (e.g. model management, batched processing, OP fusion, ...) are added based on the beta version.

New OPs

  • image_segment_mapper: Perform segment-anything on images and return the bounding boxes. #550
  • mllm_mapper: Mapper to use MLLMs to generate texts for images. #550
  • sdxl_prompt2prompt_mapper: Use the generative model SDXL and image editing technique Prompt-to-Prompt to generate pairs of similar images. #550
  • sentence_augmentation_mapper: Augment sentences using LLMs. #550
  • text_pair_similarity_filter: Filter samples according to the similarity score between the text pair. #550

Bug Fixed

  • Add a global skip_op_error param to enable fault tolerance when executing the Data-Juicer analyzer and executor, while disabling fault tolerance for unit tests. #528
  • Fix model force download bug. #529
  • Fix IndexError if the number of samples in the result dataset is less than the number of workers when saving dataset to disk. #536
  • Fix missing field meta tag on ray mode. #538
  • Update max_tokens or max_new_tokens for vllm-based OPs to avoid too short generation. #544
  • Fix bug in the role playing data generation demo. #545
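
How a global skip_op_error switch can behave is sketched below. This is only an illustration under stated assumptions: run_op and its return shape are hypothetical, and the actual Data-Juicer handling of failed samples may differ.

```python
from typing import Any, Callable, Iterable, List, Tuple


def run_op(
    op: Callable[[Any], Any],
    samples: Iterable[Any],
    skip_op_error: bool = True,
) -> Tuple[List[Any], List[Tuple[Any, str]]]:
    """Apply `op` to each sample, optionally tolerating per-sample errors."""
    kept: List[Any] = []
    errors: List[Tuple[Any, str]] = []
    for sample in samples:
        try:
            kept.append(op(sample))
        except Exception as exc:
            if not skip_op_error:
                # Strict mode (e.g. unit tests): surface the error immediately.
                raise
            # Tolerant mode: record the failure and move on to the next sample.
            errors.append((sample, repr(exc)))
    return kept, errors


kept, errors = run_op(lambda s: 10 // s, [1, 0, 5], skip_op_error=True)
print(kept, len(errors))
```

Collecting the failures instead of discarding them is what makes an end-of-run summary of warnings and errors possible.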

Others

  • Enhance unit test for API calling OPs. #528
  • Remove sandbox requirements installation from Dockerfile. #530
  • Update the datasource related APIs to be compatible with the latest version of Ray. #532
  • Limit the generated qa num for each text in generate_qa_from_text_mapper. #541
  • Update docs for preparing DJ2.0 release. #542
  • Update a quick cdn link for arch figure. #543
  • Add a video demo for role playing data generation. #545
  • Optimize op doc for global textual search. #552
  • Use a more stable and faster translator than Google Translate for automatic OP doc building. #554

Acknowledgement

  • @Qirui-jiao made great contributions to enrich the Data-Juicer OP pool. #550

Release v1.0.3: More Powerful Distributed MinHashLSH Deduplicator; Post-Tuning Formats & OPs; Ray Actor for GPU-based OPs

03 Jan 10:59
87efd5e

Major Updates

  • 💥 Support a Ray-based MinHashLSH deduplicator, which implements a multi-process Union-Find set based on Ray Actors and the BTS algorithm to complete equivalence class merging. #502
  • 💥 Support post-tuning dataset formats in LLaMA-Factory and ModelScope-Swift.
    • Data-Juicer chooses the Query-Response format as the intermediate format for the post-tuning dataset. #514
    • Refine the overall intermediate format of Data-Juicer to support various dataset formats better. (meta, stats) #514 #518
    • Provide several format conversion tools for converting to Data-Juicer format and vice versa. #514
  • 🚀 Add 10 more post-tuning OPs to process post-tuning datasets better. They are listed in detail in the New OPs section below. #513
  • 🚀 Support Ray Actor mode for GPU-based OPs. #511
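
Equivalence class merging with a Union-Find set can be sketched in a single process as follows. This is not the Ray Actor and BTS-based implementation from #502: the UnionFind class and the candidate pairs are hypothetical, and the pairs stand in for near-duplicate candidates that an LSH stage would produce.

```python
from typing import Dict, Hashable


class UnionFind:
    """Minimal disjoint-set structure with path compression."""

    def __init__(self) -> None:
        self.parent: Dict[Hashable, Hashable] = {}

    def find(self, x: Hashable) -> Hashable:
        self.parent.setdefault(x, x)
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the path directly at the root.
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a: Hashable, b: Hashable) -> None:
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            # Keep the smaller id as the representative of the merged class.
            self.parent[max(ra, rb)] = min(ra, rb)


# Hypothetical near-duplicate pairs over sample ids 0..5.
candidate_pairs = [(0, 3), (3, 5), (1, 4)]

uf = UnionFind()
for a, b in candidate_pairs:
    uf.union(a, b)

# Keep one representative per equivalence class; sample 2 has no duplicates.
kept = sorted({uf.find(i) for i in range(6)})
print(kept)
```

Samples that are connected through any chain of candidate pairs end up in the same class, so only one sample per class survives deduplication.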

New OPs

Post-tuning OPs for fine-grained analysis of dialog data. #513

Mapper

  • dialog_intent_detection_mapper: Mapper to generate user's intent labels in feed back dialog data.
  • dialog_sentiment_detection_mapper: Mapper to generate user's sentiment labels in feed back dialog data.
  • dialog_sentiment_intensity_mapper: Mapper to predict user's sentiment intensity (from -5 to 5 in default prompt) in feed back dialog data.
  • dialog_topic_detection_mapper: Mapper to generate user's topic labels in feed back dialog data.
  • query_intent_detection_mapper: Mapper to predict user's intent label in a query.
  • query_sentiment_detection_mapper: Mapper to predict user's sentiment label ('negative', 'neutral' and 'positive') in a query.
  • query_topic_detection_mapper: Mapper to predict user's topic label in a query.

Aggregator

  • meta_tags_aggregator: Merge similar meta tags to one tag.

Selector

  • tags_specified_field_selector: Select samples based on the tags of specified field.

Grouper

  • naive_reverse_grouper: Split batched samples into individual samples.

Bug Fixed

  • Fix the wrong argument passing in generate_qa_from_example_mapper. #517
  • Update the out-of-date Dingding QR code on the main page. #513

Acknowledgement

  • @jackylee-ch made their first contribution to help fix several invalid links in the document. #521

Full Changelog: v1.0.2...v1.0.3

Release v1.0.2

20 Dec 12:15
a26dcc7

Major Updates

  • Added more mapper/grouper/aggregator OPs for post-tuning scenarios.
  • Optimized the distributed mode performance and usability with more automatic features.

DJ-Operators

  • extract_support_text_mapper, relation_identity_mapper, python_file_mapper, #500
  • naive_grouper, key_value_grouper, #500
  • nested_aggregator, entity_attribute_aggregator, most_relavant_entities_aggregator, #500
  • video_extract_frames_mapper, #507

Performance

  • Optimize ray mode performance, #442
  • Patch for Performance Benchmark in CI/CD workflows, #506
  • DJ Ray mode supports streaming loading of jsonl files, #515

Usability and Analysis

  • support dj-install in recipe-level, #508
  • support dj-analyze with --auto mode, #512
  • support op-wise insight auto mining, #516

Acknowledgment

Thanks to Data-Juicer users and contributors for their helpful feedback, issues and PRs!