Release v1.4.2: Python 3.11 & 3.12 supported; Data Attribution OPs; External OPs supported; Install with "uv"

@HYLcool HYLcool released this 18 Aug 03:22
14f6594

Major Updates

  • 💪🏻 Data-Juicer is now compatible with Python 3.11 & 3.12. #749
  • 🧩 Five OPs for data attribution are added. #735
  • 🤝 Data-Juicer now supports registering and applying custom OPs from external paths via the custom_operator_paths argument. #758
  • 🔧 "uv" is now the recommended way to install Data-Juicer because it can resolve dependency conflicts. #760

New Operators

Filter

  • Validation-free
    • llm_perplexity_filter: Filter to keep samples whose perplexity score, computed using a specified LLM, falls within a specific range. #735
    • instruction_following_difficulty_filter: Filter to keep texts whose instruction-following difficulty (IFD, https://arxiv.org/abs/2308.12032) falls within a specific range. #735
  • Validation-based
    • in_context_influence_filter: Filter to keep texts whose in-context influence on the validation set falls within a specific range. #735
    • llm_task_relevance_filter: Filter to keep samples with a high relevance score to the validation tasks, as estimated by an LLM. #735
    • text_embd_similarity_filter: Filter to keep texts whose average embedding similarity to a set of given validation texts falls within a specific range. #735
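
The scores behind three of these filters can be sketched in plain Python. This is an illustrative sketch only, not Data-Juicer's implementation: the function names are hypothetical, the token log-probabilities and embeddings are assumed to come from whatever model is configured, and the IFD ratio follows the definition in the linked paper (answer loss conditioned on the instruction divided by answer loss without it).

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity: exp of the mean negative log-likelihood per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def ifd_score(answer_logprobs_given_instruction: Sequence[float],
              answer_logprobs_alone: Sequence[float]) -> float:
    """IFD: mean answer loss with the instruction / mean answer loss without it."""
    loss_conditioned = (-sum(answer_logprobs_given_instruction)
                        / len(answer_logprobs_given_instruction))
    loss_alone = -sum(answer_logprobs_alone) / len(answer_logprobs_alone)
    return loss_conditioned / loss_alone


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_similarity(embedding: Sequence[float],
                    validation_embeddings: Sequence[Sequence[float]]) -> float:
    """Average cosine similarity of one sample to all validation embeddings."""
    total = sum(cosine(embedding, v) for v in validation_embeddings)
    return total / len(validation_embeddings)


def in_range(score: float, min_score: float, max_score: float) -> bool:
    """Range check shared by all three filters: keep the sample if True."""
    return min_score <= score <= max_score
```

Each filter then reduces to computing one score per sample and keeping the sample when `in_range` returns true for the configured bounds.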

Enhancements

  • A new environment variable, DATA_JUICER_EXTERNAL_MODELS_HOME, lets users specify private or read-only paths for storing external and extra models. #740
  • Optimize the video link transformation and multi-version maintenance in the docs. Update demo videos with higher-resolution versions. #746
  • Support custom save_dir for OPs that produce extra multimodal data. #751
  • Add official and detailed docs about Data-Juicer Agent. #759
  • Enhance unit tests: show the name of the current test case; recycle resources after each test case in ray mode. #749
  • Refine the developer guide for better practices when building new OPs. #760
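
As a rough illustration of how an environment variable such as DATA_JUICER_EXTERNAL_MODELS_HOME can point a tool at a private or read-only model directory, the sketch below looks for a model there before falling back to a default location. The helper name resolve_model_path, the default_home parameter, and the lookup order are all assumptions made for illustration; the release notes do not describe how Data-Juicer resolves the variable internally.

```python
import os
from pathlib import Path
from typing import Optional

ENV_VAR = "DATA_JUICER_EXTERNAL_MODELS_HOME"


def resolve_model_path(model_name: str, default_home: Path) -> Optional[Path]:
    """Hypothetical lookup: external directory first, then the default one."""
    candidates = []
    external_home = os.environ.get(ENV_VAR)
    if external_home:
        # Only read from this directory; it may be mounted read-only.
        candidates.append(Path(external_home) / model_name)
    candidates.append(default_home / model_name)
    for candidate in candidates:
        if candidate.exists():
            return candidate
    return None  # Caller decides whether to download or fail.
```

In a shell, the variable would be set before launching the job, for example `export DATA_JUICER_EXTERNAL_MODELS_HOME=/path/to/models`, where the path is a placeholder.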

Bugs Fixed

  • Move the update of special tokens for multimodal data into the initialization of base_op, which fixes a bug where special tokens might not be synced with the main process when processing data in parallel. #752
  • Fix some test cases. #754

Acknowledgement

Full Changelog: v1.4.1...v1.4.2