Skip to content

Release v1.3.0: Refactor of dataset builder and executor!

Compare
Choose a tag to compare
@yxdyc yxdyc released this 28 Mar 12:08
· 33 commits to main since this release
1b9afd1

The Big Change 🚀

Refactor of dataset builder and executor, see #537, @cyruszhang
📜 YAML explicitly defines different sources of datasets; local and remote are defined separately.
🔧 More flexible parameterized control; supports source-specific parameters, validations, and extensible configurations.
🔌 Unbind Executor's hardcode support: No longer restricted to local JSON formats; input format is determined dynamically via formatters/downloaders.
🚀 Enhanced Executor extensibility to natively support engines like Nemo, Dask, Spark, etc.
🔍 Add data format validation to ensure consistency and correctness.
🌐 Expanded data source support:
a. 📦 ModelScope integration.
b. 📚 ArXiv dataset import (download, decompress, ingest).
c. 📚 Wikipedia dataset support (download, decompress, ingest).
d. 🌐 Common Crawl integration (download, decompress, ingest).
🔗 Backward compatibility with existing dataset_path command-line syntax.
🔀 Support for data mixtures to combine multiple datasets dynamically.
🔧 Support for empty formatters/generated datasets without pre-defined config files.

Others 💡

🔊 New audio processing operator: audio_add_gaussian_noise (PR #622), @liuyuhanalex
📊 Added dynamic coverage rate badge to the README for transparency (PR #625)