Release v1.4.1: MCP server; GPU-based Minhash deduplicator; Improved unit test coverage.
Major Updates
- 🔧 Introduce the Data-Juicer MCP server, which lets users conveniently access Data-Juicer's data processing capabilities through MCP. #690 #737
- 💪🏻 Unit test coverage is raised to 85%+, and several bugs in test cases (OOM, encoding errors, and so on) are resolved, making Data-Juicer more reliable. #698 #717 #720 #727
- 🤝 GPU-based Minhash deduplication is now supported, in collaboration with developers from NVIDIA. #694 #644
- 🧩 RayExporter can now export a Ray dataset in more formats beyond json/jsonl. #687
- 🎥 Two demo videos are added to introduce Data-Juicer's core functions, agentic usage, and sandbox. #738
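
For readers unfamiliar with the Minhash deduplication mentioned above, the following is a minimal CPU-only sketch of the general idea: each document is reduced to a short signature, and the fraction of matching signature slots estimates the Jaccard similarity of the documents' shingle sets. It is not the Data-Juicer or GPU implementation; the function names (`shingles`, `minhash_signature`, `estimated_jaccard`) and the parameters (`n`, `num_perm`) are hypothetical and chosen only for illustration.

```python
import hashlib


def shingles(text, n=3):
    """Return the set of word n-grams (shingles) of a text."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def minhash_signature(items, num_perm=64):
    """Build a signature: for each seeded hash function, keep the minimum hash value."""
    signature = []
    for seed in range(num_perm):
        # The seed is used as a salt so that each slot uses a different hash function.
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for item in items
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Estimate Jaccard similarity as the fraction of signature slots that agree."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


doc_a = "the quick brown fox jumps over the lazy dog near the river bank"
doc_b = "the quick brown fox jumps over the lazy dog near the old bridge"
doc_c = "completely unrelated sentence about training large language models today"

sig_a = minhash_signature(shingles(doc_a))
sig_b = minhash_signature(shingles(doc_b))
sig_c = minhash_signature(shingles(doc_c))

print("near-duplicate pair:", estimated_jaccard(sig_a, sig_b))
print("unrelated pair:", estimated_jaccard(sig_a, sig_c))
```

A deduplicator built on this idea would typically drop one document of any pair whose estimated similarity exceeds a chosen threshold; the threshold and the way candidate pairs are found are outside the scope of this sketch.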
New Operators
- download_file_mapper: downloads data from URLs to local files or specified fields. #709
Enhancements
- New analysis method: correlation analysis among stats is added. #663
- Several core dependencies are updated and pinned to newer versions, and dependency conflicts are resolved. #715 #717 #723
- The EasyAnimate pipelines in the sandbox are updated to follow the sandbox refactoring. #710
- Apply more reliable pre-commit tools to improve the code style of Data-Juicer. #714
- Support storing and processing image bytes data in the dataset. #725
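
As a rough illustration of what correlation analysis among stats means, the sketch below computes pairwise Pearson correlations between per-sample stat columns. It is only an assumption about the general technique, not the analysis code shipped in this release; the stat names and values are made-up examples, and `pearson` and `correlation_matrix` are hypothetical helpers.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # Correlation is undefined for a constant column.
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(stats):
    """Return a nested dict with the correlation of every pair of stat columns."""
    names = list(stats)
    return {a: {b: pearson(stats[a], stats[b]) for b in names} for a in names}


# Illustrative per-sample stats (hypothetical values, one entry per sample).
stats = {
    "text_len": [10, 20, 30, 40],
    "num_words": [2, 4, 6, 8],
    "special_char_ratio": [0.4, 0.3, 0.2, 0.1],
}

matrix = correlation_matrix(stats)
for name, row in matrix.items():
    print(name, {other: round(value, 3) for other, value in row.items()})
```

Strongly correlated stats carry overlapping information, so a result like this can help decide which stat-based filters are redundant in a recipe.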
Bugs Fixed
- The wheel & docker image building bug is fixed. #706
- Fix bugs in log_summarization. #710
- Fix "no module named data_juicer" error after installing from the wheel file. #727
Acknowledgement
- @fanronghai helped fix the param error in the dataset_splitting_by_language tool. #713
- @ayushdg helped add a GPU-based Minhash deduplicator. #644
- @ricksun2023 helped fix the bugs that occur when more than one OP with the same name appears in the configs. #730
Full Changelog: v1.4.0...v1.4.1