update to v1.3.0 (#627)

yxdyc · web-flow · commit 1b9afd169dba · 2025-03-28T20:01:07.000+08:00
* improve docs for refactoring PR

* update to v1.3.0

* remove redundant toc
diff --git a/README.md b/README.md
@@ -76,36 +76,6 @@ Besides, our paper is also updated to [v3](https://arxiv.org/abs/2309.02033).
 Table of Contents
 =================
 
-- [Data Processing for and with Foundation Models](#data-processing-for-and-with-foundation-models)
-  - [News](#news)
-- [Table of Contents](#table-of-contents)
-  - [Why Data-Juicer?](#why-data-juicer)
-  - [DJ-Cookbook](#dj-cookbook)
-    - [Curated Resources](#curated-resources)
-    - [Coding with Data-Juicer (DJ)](#coding-with-data-juicer-dj)
-    - [Use Cases \& Data Recipes](#use-cases--data-recipes)
-    - [Interactive Examples](#interactive-examples)
-  - [Installation](#installation)
-    - [Prerequisites](#prerequisites)
-    - [From Source](#from-source)
-    - [Using pip](#using-pip)
-    - [Using Docker](#using-docker)
-    - [Installation check](#installation-check)
-    - [For Video-related Operators](#for-video-related-operators)
-  - [Quick Start](#quick-start)
-    - [Dataset configuration](#dataset-configuration)
-    - [Data Processing](#data-processing)
-    - [Distributed Data Processing](#distributed-data-processing)
-    - [Data Analysis](#data-analysis)
-    - [Data Visualization](#data-visualization)
-    - [Build Up Config Files](#build-up-config-files)
-    - [Sandbox](#sandbox)
-    - [Preprocess Raw Data (Optional)](#preprocess-raw-data-optional)
-    - [For Docker Users](#for-docker-users)
-  - [License](#license)
-  - [Contributing](#contributing)
-  - [Acknowledgement](#acknowledgement)
-  - [References](#references)
 - [News](#news)
 - [Why Data-Juicer?](#why-data-juicer)
 - [DJ-Cookbook](#dj-cookbook)
@@ -121,6 +91,7 @@ Table of Contents
   - [Installation check](#installation-check)
   - [For Video-related Operators](#for-video-related-operators)
 - [Quick Start](#quick-start)
+  - [Dataset Configuration](#dataset-configuration)
   - [Data Processing](#data-processing)
   - [Distributed Data Processing](#distributed-data-processing)
   - [Data Analysis](#data-analysis)
@@ -332,19 +303,19 @@ Check if your environment path is set correctly by running the ffmpeg command fr
 DJ supports various dataset input types, including local files, remote datasets like huggingface; it also supports data validation and data mixture.
 
 Two ways to configure a input file
-- legacy way 
+- Simple scenarios, single path for local/HF file
 ```yaml
 dataset_path: '/path/to/your/dataset'  # path to your dataset directory or file
 ```
-- updated way
+- Advanced method, supports sub-configuration items and more features
 ```yaml
 dataset:
   configs:
     - type: 'local'
       path: 'path/to/your/dataset' # path to your dataset directory or file
 ```
 
-Refer to [Dataset Configuration Guide](data_juicer/core/data/README.md) for more details.
+Refer to [Dataset Configuration Guide](docs/DatasetCfg.md) for more details.
 
 
 
diff --git a/README_ZH.md b/README_ZH.md
@@ -85,6 +85,7 @@ Data-Juicer正在积极更新和维护中，我们将定期强化和新增更多
   - [安装校验](#安装校验)
   - [使用视频相关算子](#使用视频相关算子)
 - [快速上手](#快速上手)
+  - [数据集配置](#数据集配置)
   - [数据处理](#数据处理)
   - [分布式数据处理](#分布式数据处理)
   - [数据分析](#数据分析)
@@ -281,6 +282,24 @@ print(dj.__version__)
 <p align="right"><a href="#table">🔼 back to index</a></p>
 
 ## 快速上手
+### 数据集配置
+
+DJ 支持多种数据集输入类型，包括本地文件、远程数据集（如 huggingface）；还支持数据验证和数据混合。
+
+配置输入文件的两种方法
+- 简单场景，本地/HF 文件的单一路径
+```yaml
+dataset_path: '/path/to/your/dataset' # 数据集目录或文件的路径
+```
+- 高级方法，支持子配置项和更多功能
+```yaml
+dataset:
+  configs:
+    - type: 'local'
+      path: 'path/to/your/dataset' # 数据集目录或文件的路径
+```
+
+更多详细信息，请参阅 [数据集配置指南](docs/DatasetCfg_ZH.md)。
 
 ### 数据处理
 
diff --git a/data_juicer/__init__.py b/data_juicer/__init__.py
@@ -1,4 +1,4 @@
-__version__ = '1.2.2'
+__version__ = '1.3.0'
 
 import os
 import subprocess
diff --git a/docs/DatasetCfg.md b/docs/DatasetCfg.md
@@ -1,4 +1,5 @@
 # Dataset Configuration Guide
+EN | [中文](DatasetCfg_ZH.md)
 
 This guide provides an overview of how to configure datasets using YAML format in the Data-Juicer framework. The configurations allow you to specify local and remote datasets, with data validation rules.
 
diff --git a/docs/DatasetCfg_ZH.md b/docs/DatasetCfg_ZH.md
@@ -0,0 +1,125 @@
+# 数据集配置指南
+中文 | [EN](DatasetCfg.md)
+
+本指南概述了如何在 Data-Juicer 框架中使用 YAML 格式配置数据集。允许您指定本地和远程数据集以及数据验证规则。
+
+## 支持的数据集格式
+
+### 本地数据集
+
+`local_json.yaml` 配置文件用于指定以 JSON 格式本地存储的数据集。*path* 是必需的，用于指定本地数据集路径，可以是单个文件或目录。*format* 是可选的，用于指定数据集格式。
+对于本地文件，DJ 将自动检测文件格式并相应地加载数据集。支持 parquet、jsonl、json、csv、tsv、txt 和 jsonl.gz 等格式。
+有关更多详细信息，请参阅 [local_json.yaml](https://github.com/data-juicer/data-juicer/blob/main/configs/datasets/local_json.yaml)。
+```yaml
+dataset:
+  configs:
+    - type: local
+      path: path/to/your/local/dataset.json
+      format: json
+```
+
+```yaml
+dataset:
+  configs:
+    - type: local
+      path: path/to/your/local/dataset.parquet
+      format: parquet
+```
+
+### 远程 Huggingface 数据集
+
+`remote_huggingface.yaml` 配置文件用于指定 huggingface 数据集。*type* 和 *source* 固定为 'remote' 和 'huggingface'，以定位 huggingface 加载逻辑。*path* 是必需的，用于标识 huggingface 数据集。*name*、*split* 和 *limit* 是可选的，用于指定数据集名称/拆分并限制要加载的样本数量。
+更多详细信息请参阅 [remote_huggingface.yaml](https://github.com/data-juicer/data-juicer/blob/main/configs/datasets/remote_huggingface.yaml)。
+
+```yaml
+dataset:
+  configs:
+    - type: 'remote'
+      source: 'huggingface'
+      path: "HuggingFaceFW/fineweb"
+      name: "CC-MAIN-2024-10"
+      split: "train"
+      limit: 1000
+```
+
+### 远程 Arxiv 数据集
+
+`remote_arxiv.yaml` 配置文件用于指定以 JSON 格式远程存储的数据集。*type* 和 *source* 固定为 'remote' 和 'arxiv'，以定位 arxiv 加载逻辑。 *lang*、*dump_date*、*force_download* 和 *url_limit* 是可选的，用于指定数据集语言、转储日期、强制下载和 URL 限制。
+有关更多详细信息，请参阅 [remote_arxiv.yaml](https://github.com/data-juicer/data-juicer/blob/main/configs/datasets/remote_arxiv.yaml)。
+
+```yaml
+dataset:
+  configs:
+    - type: 'remote'
+      source: 'arxiv'
+      lang: 'en'
+      dump_date: 'latest'
+      force_download: false
+      url_limit: 2
+```
+
+### 其他支持的数据集格式
+
+有关更多详细信息和支持的数据集格式，请参阅 [load_strategy.py](https://github.com/data-juicer/data-juicer/blob/main/data_juicer/core/data/load_strategy.py)。
+
+## 其他功能
+
+### 数据混合
+
+`mixture.yaml` 配置文件演示了如何指定数据混合规则。DJ 将通过对数据集的一部分进行采样并应用适当的权重来混合数据集。
+有关更多详细信息，请参阅 [mixture.yaml](https://github.com/data-juicer/data-juicer/blob/main/configs/datasets/mixture.yaml)。
+```yaml
+dataset:
+  max_sample_num: 10000
+  configs:
+    - type: 'local'
+      weight: 1.0
+      path: 'path/to/json/file'
+    - type: 'local'
+      weight: 1.0
+      path: 'path/to/csv/file'
+```
+
+### 数据验证
+
+`validator.yaml` 配置文件演示了如何指定数据验证规则。DJ 将通过对数据集的一部分进行采样并应用验证规则来验证数据集。
+有关更多详细信息和支持的验证器，请参阅 [data_validator.py](https://github.com/data-juicer/data-juicer/blob/main/data_juicer/core/data/data_validator.py)。
+```yaml
+dataset:
+  configs:
+    - type: local
+      path: path/to/data.json
+
+validators:
+  - type: swift_messages
+    min_turns: 2
+    max_turns: 20
+    sample_size: 1000
+  - type: required_fields
+    required_fields:
+      - "text"
+      - "metadata"
+      - "language"
+    field_types:
+      text: "str"
+      metadata: "dict"
+      language: "str"
+```
+
+### 旧版 dataset_path 配置
+
+`dataset_path` 配置是指定数据集路径的历史版本方式。它简单易用，但缺乏灵活性。它可以在 yaml 或命令行输入中使用。一些示例：
+
+命令行输入：
+```bash
+# 命令行输入
+dj-process --dataset_path path/to/your/dataset.json
+
+# 带权重的命令行输入
+dj-process --dataset_path 0.5 path/to/your/dataset1.json 0.5 path/to/your/dataset2.json
+```
+
+Yaml 输入：
+```yaml
+dataset_path: path/to/your/dataset.json
+```
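
The docs in this change describe two equivalent ways to point DJ at input data: the single `dataset_path` value (optionally with interleaved weights on the command line) and the nested `dataset.configs` list. As a rough illustration of how the two relate, here is a minimal Python sketch that maps the legacy form onto the advanced layout. The helper names `parse_dataset_path` and `normalize_dataset_cfg` are hypothetical and are not Data-Juicer APIs; the default weight of 1.0 and the assumption that every legacy entry is of type `local` are also assumptions made for this example.

```python
from typing import Any, Dict, List


def parse_dataset_path(value: str) -> List[Dict[str, Any]]:
    """Split a legacy dataset_path string into per-dataset entries.

    Accepts a single path, or alternating "weight path" pairs such as
    "0.5 a.json 0.5 b.json". Hypothetical helper, not part of Data-Juicer.
    """
    entries: List[Dict[str, Any]] = []
    weight = 1.0  # assumed default when no weight precedes a path
    for token in value.split():
        try:
            # A token that parses as a number is taken as the next weight.
            weight = float(token)
        except ValueError:
            # Anything else is treated as a local path (assumption).
            entries.append({"type": "local", "weight": weight, "path": token})
            weight = 1.0
    return entries


def normalize_dataset_cfg(cfg: Dict[str, Any]) -> Dict[str, Any]:
    """Return the dataset settings in the advanced dataset.configs layout."""
    if "dataset" in cfg:
        # Already in the advanced form: pass it through unchanged.
        return cfg["dataset"]
    if "dataset_path" in cfg:
        return {"configs": parse_dataset_path(cfg["dataset_path"])}
    raise ValueError("neither 'dataset' nor 'dataset_path' is set")


legacy = {
    "dataset_path": "0.5 path/to/your/dataset1.json 0.5 path/to/your/dataset2.json"
}
for entry in normalize_dataset_cfg(legacy)["configs"]:
    print(entry["weight"], entry["path"])
```

The sketch splits on whitespace, so it would misread paths that contain spaces or that look like numbers; the loader in the repository may well handle these cases differently.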
