Release v1.4.1: MCP server; GPU-based Minhash deduplicator; Improved unit test coverage.

@HYLcool HYLcool released this 16 Jul 13:05
· 38 commits to main since this release
7505686

Major Updates

  • 🔧 Introduce the Data-Juicer MCP server, which makes Data-Juicer's data processing capabilities conveniently available through MCP. #690 #737
  • 💪🏻 Unit test coverage is raised to 85%+, and several bugs in test cases (OOM, encoding errors, and so on) are resolved, making Data-Juicer more reliable. #698 #717 #720 #727
  • 🤝 GPU-based Minhash deduplication is supported, developed in collaboration with developers from Nvidia. #694 #644
  • 🧩 RayExporter can now export a Ray dataset in more formats besides json/jsonl. #687
  • 🎥 Two demo videos are added to introduce the Data-Juicer core functions, agentic usage, and sandbox. #738
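
For readers unfamiliar with Minhash deduplication, the idea can be sketched on the CPU in plain Python. This is only a conceptual illustration, not the GPU implementation added in this release. The function names, the word 3-gram shingling, the 64 hash functions, and the use of salted `blake2b` are all assumptions made for this example.

```python
import hashlib
import re


def shingles(text, n=3):
    """Split text into a set of lowercase word n-grams (assumed shingling scheme)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) <= n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def minhash_signature(text, num_perm=64):
    """Keep, for each salted hash function, the smallest hash over all shingles."""
    grams = shingles(text)
    signature = []
    for seed in range(num_perm):
        # A different salt stands in for a different hash function.
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(g.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little")
            for g in grams))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree; estimates Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)
```

In a deduplication pass, pairs of samples whose estimated similarity exceeds a chosen threshold would be treated as near-duplicates and only one of them kept. The threshold itself is a tuning choice and is not specified in these notes.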

New Operators

  • download_file_mapper downloads data from URLs to local files or specified fields. #709

Enhancements

  • A new analysis method, correlation analysis among stats, is added. #663
  • Several core dependencies are updated and pinned to newer versions, and dependency conflicts are resolved. #715 #717 #723
  • The EasyAnimate pipelines in the sandbox are updated to follow the refactoring of the sandbox. #710
  • Apply more reliable pre-commit tools to improve the code style of Data-Juicer. #714
  • Support storing and processing image bytes data in the dataset. #725
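
To illustrate what correlation analysis among stats means, here is a minimal sketch that computes pairwise Pearson correlations over per-sample stats. It does not use the Data-Juicer analysis API. The helper names and the stat keys (`text_len`, `num_words`, `special_char_ratio`) are hypothetical and chosen only for the example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # correlation is undefined for a constant column
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(rows):
    """Build a nested dict of pairwise correlations from a list of stat dicts."""
    keys = sorted(rows[0])
    columns = {k: [row[k] for row in rows] for k in keys}
    return {a: {b: pearson(columns[a], columns[b]) for b in keys} for a in keys}


# Hypothetical per-sample stats, one dict per sample.
sample_stats = [
    {"text_len": 10, "num_words": 2, "special_char_ratio": 0.40},
    {"text_len": 20, "num_words": 4, "special_char_ratio": 0.30},
    {"text_len": 30, "num_words": 6, "special_char_ratio": 0.20},
    {"text_len": 40, "num_words": 8, "special_char_ratio": 0.10},
]
matrix = correlation_matrix(sample_stats)
```

Strongly correlated stats carry overlapping information, so a matrix like this can help decide which stats are worth filtering on separately.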

Bugs Fixed

  • The bug in building the wheel and Docker image is fixed. #706
  • Fix bugs in log_summarization. #710
  • Fix "no module named data_juicer" error after installing from the wheel file. #727

Acknowledgement

  • @fanronghai helped fix the param error in the dataset_splitting_by_language tool. #713
  • @ayushdg helped add support for a GPU version of the Minhash deduplicator. #644
  • @ricksun2023 helped fix the bugs that occur when the configs contain more than one OP with the same name. #730

Full Changelog: v1.4.0...v1.4.1