Release/v0.1.3 (#171)

HYLcool · web-flow · commit a3c8310bf084 · 2024-01-05T15:19:44.000+08:00
* * change simhash-py to simhash-pybind
+ update docs for new version

* * install pip for unit-test machine explicitly

* * install pip for unit-test machine explicitly

* * update wechat QR code

* * update dynamic QR code for WeChat group

* * update unittest
* add missing dependency

* * update news list

* * update version number

* * update release date

* * bold key content in README_ZH.md like the English version

* * minor changes on ZH docs

* * move infos about discussion groups to the front
diff --git a/README.md b/README.md
@@ -33,12 +33,20 @@ This project is being actively updated and maintained, and we will periodically
 If you find Data-Juicer useful for your research or development, please kindly 
 cite our [work](#references).
 
+Welcome to join our [Slack channel](https://join.slack.com/t/data-juicer/shared_invite/zt-23zxltg9d-Z4d3EJuhZbCLGwtnLWWUDg?spm=a2c22.12281976.0.0.7a8253f30mgpjw), [DingDing group](https://qr.dingtalk.com/action/joingroup?spm=a2c22.12281976.0.0.7a8253f30mgpjw&code=v1,k1,C0DI7CwRFrg7gJP5aMC95FUmsNuwuKJboT62BqP5DAk=&_dt_no_comment=1&origin=11), or WeChat group (scan the QR code below with WeChat) for discussion.
+
+ <img src="https://img.alicdn.com/imgextra/i3/O1CN01QbwHJa1EV5uZwmU9c_!!6000000000356-2-tps-400-400.png" width = "100" height = "100" alt="QR Code for WeChat group" align=center />
+
 
 ----
 
 ## News
-- ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2023-10-13] Our first data-centric LLM competition begins! Please
-  visit the competition's official websites, **FT-Data Ranker** ([1B Track](https://tianchi.aliyun.com/competition/entrance/532157), [7B Track](https://tianchi.aliyun.com/competition/entrance/532158)), for more information.
+- ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2024-01-05] We release **Data-Juicer v0.1.3** now! 
+In this new version, we support **more Python versions** (3.7-3.10), and support **multimodal** dataset [converting](tools/multimodal/README.md)/[processing](docs/Operators.md) (Including texts, images, and audios. More modalities will be supported in the future).
+Besides, our paper is also updated to [v3](https://arxiv.org/abs/2309.02033).
+
+- [2023-10-13] Our first data-centric LLM competition begins! Please
+  visit the competition's official websites, FT-Data Ranker ([1B Track](https://tianchi.aliyun.com/competition/entrance/532157), [7B Track](https://tianchi.aliyun.com/competition/entrance/532158)), for more information.
 
 - [2023-10-8] We update our paper to the 2nd version and release the corresponding version 0.1.2 of Data-Juicer!
 
@@ -98,7 +106,7 @@ Table of Contents
 
 ## Prerequisites
 
-- Recommend Python==3.8
+- Recommend Python>=3.7,<=3.10
 - gcc >= 5 (at least C++14 support)
 
 ## Installation
@@ -330,7 +338,7 @@ We are in a rapidly developing field and greatly welcome contributions of new
 features, bug fixes and better documentations. Please refer to 
 [How-to Guide for Developers](docs/DeveloperGuide.md).
 
-Welcome to join our [Slack channel](https://join.slack.com/t/data-juicer/shared_invite/zt-23zxltg9d-Z4d3EJuhZbCLGwtnLWWUDg?spm=a2c22.12281976.0.0.7a8253f30mgpjw), or [DingDing group](https://qr.dingtalk.com/action/joingroup?spm=a2c22.12281976.0.0.7a8253f30mgpjw&code=v1,k1,C0DI7CwRFrg7gJP5aMC95FUmsNuwuKJboT62BqP5DAk=&_dt_no_comment=1&origin=11) for discussion.
+If you have any questions, please join our [discussion groups](README.md).
 
 ## Acknowledgement
 Data-Juicer is used across various LLM products and research initiatives,
diff --git a/README_ZH.md b/README_ZH.md
@@ -31,12 +31,20 @@ Data-Juicer 是一个一站式数据处理系统，旨在为大语言模型 (LLM
 
 如果Data-Juicer对您的研发有帮助，请引用我们的[工作](#参考文献) 。
 
+欢迎加入我们的[Slack频道](https://join.slack.com/t/data-juicer/shared_invite/zt-23zxltg9d-Z4d3EJuhZbCLGwtnLWWUDg?spm=a2c22.12281976.0.0.7a8275bc8g7ypp) ，[钉钉群](https://qr.dingtalk.com/action/joingroup?spm=a2c22.12281976.0.0.7a8275bc8g7ypp&code=v1,k1,C0DI7CwRFrg7gJP5aMC95FUmsNuwuKJboT62BqP5DAk=&_dt_no_comment=1&origin=11) ，或微信群（扫描下方二维码加入）进行讨论。
+
+ <img src="https://img.alicdn.com/imgextra/i3/O1CN01QbwHJa1EV5uZwmU9c_!!6000000000356-2-tps-400-400.png" width = "100" height = "100" alt="QR Code for WeChat group" align=center />
+
 
 ----
 
 ## 新消息
-- ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2023-10-13] 我们的第一届以数据为中心的 LLM 竞赛开始了！
-  请访问大赛官网，**FT-Data Ranker**（[1B赛道](https://tianchi.aliyun.com/competition/entrance/532157) 、[7B赛道](https://tianchi.aliyun.com/competition/entrance/532158) ) ，了解更多信息。
+- ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2024-01-05] 现在，我们发布了 **Data-Juicer v0.1.3** 版本！ 
+在这个新版本中，我们支持了**更多Python版本**（3.7-3.10），同时支持了**多模态**数据集的[转换](tools/multimodal/README_ZH.md)和[处理](docs/Operators_ZH.md)（包括文本、图像和音频。更多模态也将会在之后支持）。
+此外，我们的论文也更新到了[第三版](https://arxiv.org/abs/2309.02033) 。
+
+- [2023-10-13] 我们的第一届以数据为中心的 LLM 竞赛开始了！
+  请访问大赛官网，FT-Data Ranker（[1B赛道](https://tianchi.aliyun.com/competition/entrance/532157) 、[7B赛道](https://tianchi.aliyun.com/competition/entrance/532158) ) ，了解更多信息。
 
 - [2023-10-8] 我们的论文更新至第二版，并发布了对应的Data-Juicer v0.1.2版本！
 
@@ -86,7 +94,7 @@ Data-Juicer 是一个一站式数据处理系统，旨在为大语言模型 (LLM
 
 ## 前置条件
 
-* 推荐 Python==3.8
+* 推荐 Python>=3.7,<=3.10
 * gcc >= 5 (at least C++14 support)
 
 ## 安装
@@ -309,7 +317,7 @@ Data-Juicer 在 Apache License 2.0 协议下发布。
 
 大模型是一个高速发展的领域，我们非常欢迎贡献新功能、修复漏洞以及文档改善。请参考[开发者指南](docs/DeveloperGuide_ZH.md)。
 
-欢迎加入我们的[Slack channel](https://join.slack.com/t/data-juicer/shared_invite/zt-23zxltg9d-Z4d3EJuhZbCLGwtnLWWUDg?spm=a2c22.12281976.0.0.7a8275bc8g7ypp), 或[DingDing群](https://qr.dingtalk.com/action/joingroup?spm=a2c22.12281976.0.0.7a8275bc8g7ypp&code=v1,k1,C0DI7CwRFrg7gJP5aMC95FUmsNuwuKJboT62BqP5DAk=&_dt_no_comment=1&origin=11) 。
+如果您有任何问题，欢迎加入我们的[讨论群](README_ZH.md) 。
 
 ## 致谢
 
diff --git a/data_juicer/__init__.py b/data_juicer/__init__.py
@@ -1 +1 @@
-__version__ = '0.1.2'
+__version__ = '0.1.3'
diff --git a/environments/minimal_requires.txt b/environments/minimal_requires.txt
@@ -7,6 +7,7 @@ tabulate
 tqdm
 jsonargparse[signatures]
 matplotlib
+seaborn
 emoji==2.2.0
 regex
 requests
diff --git a/tools/multimodal/README.md b/tools/multimodal/README.md
@@ -5,8 +5,62 @@ This folder contains some scripts and tools for multimodal datasets before and a
 ## Dataset Format Conversion
 
 Due to large format diversity among different multimodal datasets and works, 
-Data-Juicer propose a novel intermediate format for multimodal dataset and 
-provided several dataset format conversion tools for some popular multimodal 
+Data-Juicer proposes a novel intermediate, text-based, interleaved data format for multimodal datasets, which 
+is based on chunk-wise formats such as the MMC4 dataset.
+
+In the Data-Juicer format, a multimodal sample or document is based on a text, 
+which consists of several text chunks. Each chunk is a semantic unit, and all the
+multimodal information in a chunk should talk about the same thing and be aligned
+with each other.
+
+Below is an example of a multimodal sample in Data-Juicer format.
+- It includes 4 chunks split by the special token `<|__dj__eoc|>`.
+- In addition to texts, there are 3 other modalities: images, audios, videos. 
+They are stored on the disk and their paths are
+listed in the corresponding first-level fields in the sample.
+- Other modalities are represented as special tokens in the text (e.g. image -- `<__dj__image>`). 
+The special tokens of each modality correspond to the paths in the order of appearance. 
+(e.g. the two image tokens in the third chunk are images of antarctica_map and europe_map respectively)
+- There could be multiple types of modalities and multiple modality special tokens in a single chunk, 
+and they are semantically aligned with each other and text in this chunk. 
+Special tokens can appear at any position in a chunk (in practice, they usually come before or after the text).
+- For multimodal samples, unlike text-only samples, the computed stats for other 
+modalities could be a list of stats for the list of multimodal data (e.g. image_widths in this sample).
+
+```python
+{
+  "text": "<__dj__image> Antarctica is Earth's southernmost and least-populated continent. <|__dj__eoc|> "
+          "<__dj__video> <__dj__audio> Situated almost entirely south of the Antarctic Circle and surrounded by the "
+          "Southern Ocean (also known as the Antarctic Ocean), it contains the geographic South Pole. <|__dj__eoc|> "
+          "Antarctica is the fifth-largest continent, being about 40% larger than Europe, "
+          "and has an area of 14,200,000 km2 (5,500,000 sq mi). <__dj__image> <__dj__image> <|__dj__eoc|> "
+          "Most of Antarctica is covered by the Antarctic ice sheet, "
+          "with an average thickness of 1.9 km (1.2 mi). <|__dj__eoc|>",
+  "images": [
+    "path/to/the/image/of/antarctica_snowfield",
+    "path/to/the/image/of/antarctica_map",
+    "path/to/the/image/of/europe_map"
+  ],
+  "audios": [
+    "path/to/the/audio/of/sound_of_waves_in_Antarctic_Ocean"
+  ],
+  "videos": [
+    "path/to/the/video/of/remote_sensing_view_of_antarctica"
+  ],
+  "meta": {
+    "src": "customized",
+    "version": "0.1",
+    "author": "xxx"
+  },
+  "stats": {
+    "lang": "en",
+    "image_widths": [224, 336, 512],
+    ...
+  }
+}
+```
+
+Based on this format, Data-Juicer provides several dataset format conversion tools for some popular multimodal 
 works.
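
The chunk and token layout described above can be sketched in a few lines of Python. This is only a minimal illustration, not part of the Data-Juicer API: the helper names `split_chunks` and `align_modalities` are hypothetical, the sample is a shortened version of the one above, and the audio and video token spellings are taken from that sample rather than from a documented constant.

```python
# Special tokens as they appear in the sample above.
EOC_TOKEN = "<|__dj__eoc|>"
MODALITY_TOKENS = {
    "images": "<__dj__image>",
    "audios": "<__dj__audio>",
    "videos": "<__dj__video>",
}


def split_chunks(text):
    """Split a sample's text into chunks, dropping the empty tail after the last EOC token."""
    return [chunk.strip() for chunk in text.split(EOC_TOKEN) if chunk.strip()]


def align_modalities(sample):
    """Map the special tokens of each chunk to paths, in order of appearance."""
    cursors = {field: 0 for field in MODALITY_TOKENS}
    aligned = []
    for chunk in split_chunks(sample["text"]):
        entry = {"text": chunk}
        for field, token in MODALITY_TOKENS.items():
            count = chunk.count(token)
            start = cursors[field]
            # The n-th token of a modality refers to the n-th path in its list.
            entry[field] = sample.get(field, [])[start:start + count]
            cursors[field] = start + count
        aligned.append(entry)
    return aligned


sample = {
    "text": "<__dj__image> Antarctica is Earth's southernmost continent. <|__dj__eoc|> "
            "<__dj__video> <__dj__audio> It contains the geographic South Pole. <|__dj__eoc|> "
            "It is about 40% larger than Europe. <__dj__image> <__dj__image> <|__dj__eoc|> "
            "Most of Antarctica is covered by the Antarctic ice sheet. <|__dj__eoc|>",
    "images": [
        "path/to/the/image/of/antarctica_snowfield",
        "path/to/the/image/of/antarctica_map",
        "path/to/the/image/of/europe_map",
    ],
    "audios": ["path/to/the/audio/of/sound_of_waves_in_Antarctic_Ocean"],
    "videos": ["path/to/the/video/of/remote_sensing_view_of_antarctica"],
}

chunks = align_modalities(sample)
print(len(chunks))          # 4
print(chunks[2]["images"])  # the antarctica_map and europe_map paths
```

Keeping one cursor per modality is what implements the "order of appearance" rule: the two image tokens in the third chunk pick up the second and third image paths because the first chunk already consumed the first one.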
 
 These tools consist of two types:
@@ -15,11 +69,11 @@ These tools consist of two types:
 
 For now, dataset formats that are supported by Data-Juicer are listed in the following table.
 
-| Format     | source_format_to_data_juicer_format | data_juicer_format_to_target_format | Ref.                                                                                                             |
-|------------|-------------------------------------|-------------------------------------|------------------------------------------------------------------------------------------------------------------|
-| LLaVA-like | `llava_to_dj.py`                    | `dj_to_llava.py`                    | [Format Description](https://github.com/haotian-liu/LLaVA/blob/main/docs/Finetune_Custom_Data.md#dataset-format) |
-| MMC4-like  | `mmc4_to_dj.py`                     | `dj_to_mmc4.py`                     | [Format Description](https://github.com/allenai/mmc4#documents)                                                  |
-| WavCaps-like  | `wavcaps_to_dj.py`                    | `dj_to_wavcaps.py`                    | [Format Description](https://github.com/XinhaoMei/WavCaps#table-of-contents) |
+| Format     | Type       | source_format_to_data_juicer_format | data_juicer_format_to_target_format | Ref.                                                                                                             |
+|------------|------------|-------------------------------------|-------------------------------------|------------------------------------------------------------------------------------------------------------------|
+| LLaVA-like | image-text | `llava_to_dj.py`                    | `dj_to_llava.py`                    | [Format Description](https://github.com/haotian-liu/LLaVA/blob/main/docs/Finetune_Custom_Data.md#dataset-format) |
+| MMC4-like  | image-text | `mmc4_to_dj.py`                     | `dj_to_mmc4.py`                     | [Format Description](https://github.com/allenai/mmc4#documents)                                                  |
+| WavCaps-like  | audio-text | `wavcaps_to_dj.py` | `dj_to_wavcaps.py`                  | [Format Description](https://github.com/XinhaoMei/WavCaps#table-of-contents) |
 
 For all tools, you can run the following command to find out the usage of them:
 
diff --git a/tools/multimodal/README_ZH.md b/tools/multimodal/README_ZH.md
@@ -4,19 +4,69 @@
 
 ## 数据集格式转换
 
-由于不同多模态数据集和工作之间的数据集格式差异较大，Data-Juicer 提出了一种新颖的多模态数据集中间格式，并为一些流行的多模态工作提供了若干数据集格式转换工具。
+由于不同多模态数据集和工作之间的数据集格式差异较大， Data-Juicer 提出了一种新颖的、中间的、
+基于文本的、交替的多模态数据格式，主要基于一些按块（chunk）组织的格式，如MMC4数据集格式。
+
+在 Data-Juicer 的格式中，一个多模态样本或者文档基于一段文本组织，其由若干个文本块组成。
+每个文本块是一个语义单元，单个文本块中包括的所有多模态信息都应该在谈论同样的事情，并且它们彼此语义上是对齐的。
+
+下面这里是一个 Data-Juicer 格式的多模态样本示例。
+- 它包括4个文本块，它们由特殊token `<|__dj__eoc|>` 分割开。
+- 除了文本，这个样本还包括3种其他模态：图像（images），音频（audios），视频（videos）。
+它们保存在硬盘上，而它们的硬盘路径列举在了样本中对应的一级字段的列表里。
+- 在文本中，其他模态被表示为了特殊token（例如，图像 -- `<__dj__image>`）。
+每种模态的特殊token所表示的数据按照它们在文本中出现的顺序对应到列表中的路径上。
+（例如，第3个文本块中的2个图像token分别对应了图像路径列表中的antarctica_map图像和europe_map图像）
+- 在单个文本块中，可以有多种模态的数据以及多个模态特殊token，它们彼此是语义上对齐的，而且它们与该文本块中的文本也是语义对齐的。
+这些模态特殊token在文本块中可以处于任意位置（通常处于文本前或者文本后）
+- 不同于纯文本样本，对于多模态样本来说，为其他模态计算的stats可能为针对多模态数据列表的一个stats列表（如例子中的image_widths）。
+
+```python
+{
+  "text": "<__dj__image> Antarctica is Earth's southernmost and least-populated continent. <|__dj__eoc|> "
+          "<__dj__video> <__dj__audio> Situated almost entirely south of the Antarctic Circle and surrounded by the "
+          "Southern Ocean (also known as the Antarctic Ocean), it contains the geographic South Pole. <|__dj__eoc|> "
+          "Antarctica is the fifth-largest continent, being about 40% larger than Europe, "
+          "and has an area of 14,200,000 km2 (5,500,000 sq mi). <__dj__image> <__dj__image> <|__dj__eoc|> "
+          "Most of Antarctica is covered by the Antarctic ice sheet, "
+          "with an average thickness of 1.9 km (1.2 mi). <|__dj__eoc|>",
+  "images": [
+    "path/to/the/image/of/antarctica_snowfield",
+    "path/to/the/image/of/antarctica_map",
+    "path/to/the/image/of/europe_map"
+  ],
+  "audios": [
+    "path/to/the/audio/of/sound_of_waves_in_Antarctic_Ocean"
+  ],
+  "videos": [
+    "path/to/the/video/of/remote_sensing_view_of_antarctica"
+  ],
+  "meta": {
+    "src": "customized",
+    "version": "0.1",
+    "author": "xxx"
+  },
+  "stats": {
+    "lang": "en",
+    "image_widths": [224, 336, 512],
+    ...
+  }
+}
+```
+
+根据这个格式，Data-Juicer 为一些流行的多模态工作提供了若干数据集格式转换工具。
 
 这些工具分为两种类型：
 - 其他格式到 Data-Juicer 格式的转换：这些工具在 `source_format_to_data_juicer_format` 目录中。它们可以帮助将其他格式的数据集转换为 Data-Juicer 格式的目标数据集。
 - Data-Juicer 格式到其他格式的转换：这些工具在 `data_juicer_format_to_target_format` 目录中。它们可以帮助将 Data-Juicer 格式的数据集转换为目标格式的数据集。
 
 目前，Data-Juicer 支持的数据集格式在下面表格中列出。
 
-| 格式       | source_format_to_data_juicer_format | data_juicer_format_to_target_format | 格式参考                                                                                               |
-|----------|-------------------------------------|-------------------------------------|----------------------------------------------------------------------------------------------------|
-| 类LLaVA格式 | `llava_to_dj.py`                    | `dj_to_llava.py`                    | [格式描述](https://github.com/haotian-liu/LLaVA/blob/main/docs/Finetune_Custom_Data.md#dataset-format) |
-| 类MMC4格式  | `mmc4_to_dj.py`                     | `dj_to_mmc4.py`                     | [格式描述](https://github.com/allenai/mmc4#documents) |
-| 类WavCaps格式  | `wavcaps_to_dj.py`                    | `dj_to_wavcaps.py`                    | [格式描述](https://github.com/XinhaoMei/WavCaps#table-of-contents) |
+| 格式       | 类型    | source_format_to_data_juicer_format | data_juicer_format_to_target_format | 格式参考                                                                                               |
+|----------|-------|-------------------------------------|-------------------------------------|----------------------------------------------------------------------------------------------------|
+| 类LLaVA格式 | 图像-文本 | `llava_to_dj.py`                    | `dj_to_llava.py`                    | [格式描述](https://github.com/haotian-liu/LLaVA/blob/main/docs/Finetune_Custom_Data.md#dataset-format) |
+| 类MMC4格式  | 图像-文本 | `mmc4_to_dj.py`                     | `dj_to_mmc4.py`                     | [格式描述](https://github.com/allenai/mmc4#documents) |
+| 类WavCaps格式  | 音频-文本 | `wavcaps_to_dj.py` | `dj_to_wavcaps.py`                  | [格式描述](https://github.com/XinhaoMei/WavCaps#table-of-contents) |
 
 对于所有工具，您可以运行以下命令来了解它们的详细用法：
 
