Commit a3c8310

Release/v0.1.3 (#171)
* change simhash-py to simhash-pybind + update docs for new version
* install pip for unit-test machine explicitly
* install pip for unit-test machine explicitly
* update wechat QR code
* update dynamic QR code for WeChat group
* update unittest
* add missing dependency
* update news list
* update version number
* update release date
* bold key content in README_ZH.md like the English version
* minor changes on ZH docs
* move infos about discussion groups to the front
1 parent ad445c9 commit a3c8310

6 files changed, 143 additions and 22 deletions

README.md

Lines changed: 12 additions & 4 deletions
@@ -33,12 +33,20 @@ This project is being actively updated and maintained, and we will periodically
 If you find Data-Juicer useful for your research or development, please kindly
 cite our [work](#references).
 
+Welcome to join our [Slack channel](https://join.slack.com/t/data-juicer/shared_invite/zt-23zxltg9d-Z4d3EJuhZbCLGwtnLWWUDg?spm=a2c22.12281976.0.0.7a8253f30mgpjw), [DingDing group](https://qr.dingtalk.com/action/joingroup?spm=a2c22.12281976.0.0.7a8253f30mgpjw&code=v1,k1,C0DI7CwRFrg7gJP5aMC95FUmsNuwuKJboT62BqP5DAk=&_dt_no_comment=1&origin=11), or WeChat group (scan the QR code below with WeChat) for discussion.
+
+<img src="https://img.alicdn.com/imgextra/i3/O1CN01QbwHJa1EV5uZwmU9c_!!6000000000356-2-tps-400-400.png" width = "100" height = "100" alt="QR Code for WeChat group" align=center />
+
 
 ----
 
 ## News
-- ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2023-10-13] Our first data-centric LLM competition begins! Please
-visit the competition's official websites, **FT-Data Ranker** ([1B Track](https://tianchi.aliyun.com/competition/entrance/532157), [7B Track](https://tianchi.aliyun.com/competition/entrance/532158)), for more information.
+- ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2024-01-05] We release **Data-Juicer v0.1.3** now!
+In this new version, we support **more Python versions** (3.7-3.10), and support **multimodal** dataset [converting](tools/multimodal/README.md)/[processing](docs/Operators.md) (Including texts, images, and audios. More modalities will be supported in the future).
+Besides, our paper is also updated to [v3](https://arxiv.org/abs/2309.02033).
+
+- [2023-10-13] Our first data-centric LLM competition begins! Please
+visit the competition's official websites, FT-Data Ranker ([1B Track](https://tianchi.aliyun.com/competition/entrance/532157), [7B Track](https://tianchi.aliyun.com/competition/entrance/532158)), for more information.
 
 - [2023-10-8] We update our paper to the 2nd version and release the corresponding version 0.1.2 of Data-Juicer!
 
@@ -98,7 +106,7 @@ Table of Contents
 
 ## Prerequisites
 
-- Recommend Python==3.8
+- Recommend Python>=3.7,<=3.10
 - gcc >= 5 (at least C++14 support)
 
 ## Installation
@@ -330,7 +338,7 @@ We are in a rapidly developing field and greatly welcome contributions of new
 features, bug fixes and better documentations. Please refer to
 [How-to Guide for Developers](docs/DeveloperGuide.md).
 
-Welcome to join our [Slack channel](https://join.slack.com/t/data-juicer/shared_invite/zt-23zxltg9d-Z4d3EJuhZbCLGwtnLWWUDg?spm=a2c22.12281976.0.0.7a8253f30mgpjw), or [DingDing group](https://qr.dingtalk.com/action/joingroup?spm=a2c22.12281976.0.0.7a8253f30mgpjw&code=v1,k1,C0DI7CwRFrg7gJP5aMC95FUmsNuwuKJboT62BqP5DAk=&_dt_no_comment=1&origin=11) for discussion.
+If you have any questions, please join our [discussion groups](README.md).
 
 ## Acknowledgement
 Data-Juicer is used across various LLM products and research initiatives,
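
Reviewer note on the Prerequisites change: the README now recommends `Python>=3.7,<=3.10`. The sketch below shows one way such a range could be checked at start-up. The names `MIN_VERSION`, `MAX_VERSION`, and `is_recommended_python` are hypothetical and are not part of Data-Juicer; only the version bounds come from the diff.

```python
import sys

# Bounds copied from the updated Prerequisites line (Python>=3.7,<=3.10).
# The constant and function names are illustrative assumptions.
MIN_VERSION = (3, 7)
MAX_VERSION = (3, 10)


def is_recommended_python(version=None):
    """Return True if the (major, minor) version is within the recommended range.

    `version` may be any tuple-like starting with (major, minor);
    when omitted, the running interpreter's version is used.
    """
    major, minor = tuple(version or sys.version_info)[:2]
    return MIN_VERSION <= (major, minor) <= MAX_VERSION


if __name__ == "__main__":
    if not is_recommended_python():
        print("Warning: this Python version is outside the recommended 3.7-3.10 range.")
```

Comparing `(major, minor)` tuples avoids the string-comparison pitfall where `"3.10" < "3.7"`.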

README_ZH.md

Lines changed: 12 additions & 4 deletions
@@ -31,12 +31,20 @@ Data-Juicer is a one-stop data processing system, intended for large language models (LLM
 
 If Data-Juicer helps your research or development, please cite our [work](#参考文献)
 
+Welcome to join our [Slack channel](https://join.slack.com/t/data-juicer/shared_invite/zt-23zxltg9d-Z4d3EJuhZbCLGwtnLWWUDg?spm=a2c22.12281976.0.0.7a8275bc8g7ypp), [DingDing group](https://qr.dingtalk.com/action/joingroup?spm=a2c22.12281976.0.0.7a8275bc8g7ypp&code=v1,k1,C0DI7CwRFrg7gJP5aMC95FUmsNuwuKJboT62BqP5DAk=&_dt_no_comment=1&origin=11), or WeChat group (scan the QR code below to join) for discussion.
+
+<img src="https://img.alicdn.com/imgextra/i3/O1CN01QbwHJa1EV5uZwmU9c_!!6000000000356-2-tps-400-400.png" width = "100" height = "100" alt="QR Code for WeChat group" align=center />
+
 
 ----
 
 ## News
-- ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2023-10-13] Our first data-centric LLM competition has begun!
-Please visit the competition's official website, **FT-Data Ranker** ([1B Track](https://tianchi.aliyun.com/competition/entrance/532157), [7B Track](https://tianchi.aliyun.com/competition/entrance/532158)), for more information.
+- ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2024-01-05] We have now released **Data-Juicer v0.1.3**!
+In this new version, we support **more Python versions** (3.7-3.10), and also support [converting](tools/multimodal/README_ZH.md)/[processing](docs/Operators_ZH.md) of **multimodal** datasets (including text, images, and audio; more modalities will be supported later).
+In addition, our paper has been updated to [v3](https://arxiv.org/abs/2309.02033)
+
+- [2023-10-13] Our first data-centric LLM competition has begun!
+Please visit the competition's official website, FT-Data Ranker ([1B Track](https://tianchi.aliyun.com/competition/entrance/532157), [7B Track](https://tianchi.aliyun.com/competition/entrance/532158)), for more information.
 
 - [2023-10-8] Our paper has been updated to the second version, and the corresponding Data-Juicer v0.1.2 has been released!
 
@@ -86,7 +94,7 @@ Data-Juicer is a one-stop data processing system, intended for large language models (LLM
 
 ## Prerequisites
 
-* Recommend Python==3.8
+* Recommend Python>=3.7,<=3.10
 * gcc >= 5 (at least C++14 support)
 
 ## Installation
@@ -309,7 +317,7 @@ Data-Juicer is released under the Apache License 2.0.
 
 Large models are a rapidly developing field, and we greatly welcome contributions of new features, bug fixes, and documentation improvements. Please refer to the [Developer Guide](docs/DeveloperGuide_ZH.md).
 
-Welcome to join our [Slack channel](https://join.slack.com/t/data-juicer/shared_invite/zt-23zxltg9d-Z4d3EJuhZbCLGwtnLWWUDg?spm=a2c22.12281976.0.0.7a8275bc8g7ypp) or [DingDing group](https://qr.dingtalk.com/action/joingroup?spm=a2c22.12281976.0.0.7a8275bc8g7ypp&code=v1,k1,C0DI7CwRFrg7gJP5aMC95FUmsNuwuKJboT62BqP5DAk=&_dt_no_comment=1&origin=11).
+If you have any questions, you are welcome to join our [discussion groups](README_ZH.md).
 
 ## Acknowledgement

data_juicer/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-__version__ = '0.1.2'
+__version__ = '0.1.3'

environments/minimal_requires.txt

Lines changed: 1 addition & 0 deletions
@@ -7,6 +7,7 @@ tabulate
 tqdm
 jsonargparse[signatures]
 matplotlib
+seaborn
 emoji==2.2.0
 regex
 requests

tools/multimodal/README.md

Lines changed: 61 additions & 7 deletions
@@ -5,8 +5,62 @@ This folder contains some scripts and tools for multimodal datasets before and a
 ## Dataset Format Conversion
 
 Due to large format diversity among different multimodal datasets and works,
-Data-Juicer propose a novel intermediate format for multimodal dataset and
-provided several dataset format conversion tools for some popular multimodal
+Data-Juicer proposes a novel intermediate text-based interleaved data format for multimodal datasets, which
+is based on chunk-wise formats such as the MMC4 dataset.
+
+In the Data-Juicer format, a multimodal sample or document is based on a text,
+which consists of several text chunks. Each chunk is a semantic unit, and all the
+multimodal information in a chunk should talk about the same thing and be aligned
+with each other.
+
+Here is an example of a multimodal sample in the Data-Juicer format.
+- It includes 4 chunks split by the special token `<|__dj__eoc|>`.
+- In addition to texts, there are 3 other modalities: images, audios, videos.
+They are stored on the disk and their paths are
+listed in the corresponding first-level fields in the sample.
+- Other modalities are represented as special tokens in the text (e.g. image -- `<__dj__image>`).
+The special tokens of each modality correspond to the paths in the order of appearance.
+(e.g. the two image tokens in the third chunk are images of antarctica_map and europe_map respectively)
+- There could be multiple types of modalities and multiple modality special tokens in a single chunk,
+and they are semantically aligned with each other and with the text in this chunk.
+The position of special tokens can be arbitrary in a chunk. (In general, they are usually before or after the text.)
+- For multimodal samples, unlike text-only samples, the computed stats for other
+modalities could be a list of stats for the list of multimodal data (e.g. image_widths in this sample).
+
+```python
+{
+  "text": "<__dj__image> Antarctica is Earth's southernmost and least-populated continent. <|__dj__eoc|> "
+          "<__dj__video> <__dj__audio> Situated almost entirely south of the Antarctic Circle and surrounded by the "
+          "Southern Ocean (also known as the Antarctic Ocean), it contains the geographic South Pole. <|__dj__eoc|> "
+          "Antarctica is the fifth-largest continent, being about 40% larger than Europe, "
+          "and has an area of 14,200,000 km2 (5,500,000 sq mi). <__dj__image> <__dj__image> <|__dj__eoc|> "
+          "Most of Antarctica is covered by the Antarctic ice sheet, "
+          "with an average thickness of 1.9 km (1.2 mi). <|__dj__eoc|>",
+  "images": [
+    "path/to/the/image/of/antarctica_snowfield",
+    "path/to/the/image/of/antarctica_map",
+    "path/to/the/image/of/europe_map"
+  ],
+  "audios": [
+    "path/to/the/audio/of/sound_of_waves_in_Antarctic_Ocean"
+  ],
+  "videos": [
+    "path/to/the/video/of/remote_sensing_view_of_antarctica"
+  ],
+  "meta": {
+    "src": "customized",
+    "version": "0.1",
+    "author": "xxx"
+  },
+  "stats": {
+    "lang": "en",
+    "image_widths": [224, 336, 512],
+    ...
+  }
+}
+```
+
+According to this format, Data-Juicer provides several dataset format conversion tools for some popular multimodal
 works.
 
 These tools consist of two types:
@@ -15,11 +69,11 @@ These tools consist of two types:
 
 For now, dataset formats that are supported by Data-Juicer are listed in the following table.
 
-| Format | source_format_to_data_juicer_format | data_juicer_format_to_target_format | Ref. |
-|------------|-------------------------------------|-------------------------------------|------------------------------------------------------------------------------------------------------------------|
-| LLaVA-like | `llava_to_dj.py` | `dj_to_llava.py` | [Format Description](https://github.com/haotian-liu/LLaVA/blob/main/docs/Finetune_Custom_Data.md#dataset-format) |
-| MMC4-like | `mmc4_to_dj.py` | `dj_to_mmc4.py` | [Format Description](https://github.com/allenai/mmc4#documents) |
-| WavCaps-like | `wavcaps_to_dj.py` | `dj_to_wavcaps.py` | [Format Description](https://github.com/XinhaoMei/WavCaps#table-of-contents) |
+| Format | Type | source_format_to_data_juicer_format | data_juicer_format_to_target_format | Ref. |
+|------------|------------|-------------------------------------|-------------------------------------|------------------------------------------------------------------------------------------------------------------|
+| LLaVA-like | image-text | `llava_to_dj.py` | `dj_to_llava.py` | [Format Description](https://github.com/haotian-liu/LLaVA/blob/main/docs/Finetune_Custom_Data.md#dataset-format) |
+| MMC4-like | image-text | `mmc4_to_dj.py` | `dj_to_mmc4.py` | [Format Description](https://github.com/allenai/mmc4#documents) |
+| WavCaps-like | audio-text | `wavcaps_to_dj.py` | `dj_to_wavcaps.py` | [Format Description](https://github.com/XinhaoMei/WavCaps#table-of-contents) |
 
 For all tools, you can run the following command to find out the usage of them:
 
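
Reviewer note on the format description in `tools/multimodal/README.md`: the rule that special tokens map to paths "in the order of appearance" may be easier to follow with a small sketch. The code below is an illustration written for this note, not Data-Juicer's own implementation. The function names `split_chunks` and `align_chunks` and the shortened sample are assumptions; the special tokens and the first-level field names (`images`, `audios`, `videos`) are taken from the README text.

```python
EOC_TOKEN = "<|__dj__eoc|>"

# First-level field -> special token, as they appear in the README sample.
MODALITY_TOKENS = {
    "images": "<__dj__image>",
    "audios": "<__dj__audio>",
    "videos": "<__dj__video>",
}


def split_chunks(text):
    """Split a sample's text into non-empty chunks on the end-of-chunk token."""
    return [chunk.strip() for chunk in text.split(EOC_TOKEN) if chunk.strip()]


def align_chunks(sample):
    """Pair each chunk with the media paths its special tokens refer to.

    Tokens of one modality consume that modality's path list in order of
    appearance, so a per-modality cursor is advanced across chunks.
    """
    cursors = {field: 0 for field in MODALITY_TOKENS}
    aligned = []
    for chunk in split_chunks(sample["text"]):
        media = {}
        for field, token in MODALITY_TOKENS.items():
            count = chunk.count(token)
            if count:
                start = cursors[field]
                media[field] = sample.get(field, [])[start:start + count]
                cursors[field] = start + count
        aligned.append({"chunk": chunk, "media": media})
    return aligned


# Hypothetical, shortened sample in the same shape as the README example.
sample = {
    "text": "<__dj__image> first chunk <|__dj__eoc|> "
            "second chunk <__dj__image> <__dj__image> <|__dj__eoc|>",
    "images": ["a.jpg", "b.jpg", "c.jpg"],
}
aligned = align_chunks(sample)
```

Here the first chunk is paired with `a.jpg` and the second with `b.jpg` and `c.jpg`, mirroring how the README example assigns antarctica_map and europe_map to the two image tokens of its third chunk.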

tools/multimodal/README_ZH.md

Lines changed: 56 additions & 6 deletions
@@ -4,19 +4,69 @@
 
 ## Dataset Format Conversion
 
-Because dataset formats differ greatly across multimodal datasets and works, Data-Juicer proposes a novel intermediate format for multimodal datasets and provides several dataset format conversion tools for some popular multimodal works.
+Because dataset formats differ greatly across multimodal datasets and works, Data-Juicer proposes a novel, intermediate,
+text-based, interleaved multimodal data format, mainly based on chunk-wise formats such as the MMC4 dataset format.
+
+In the Data-Juicer format, a multimodal sample or document is organized around a piece of text, which consists of several text chunks.
+Each text chunk is a semantic unit; all multimodal information in a single chunk should talk about the same thing and be semantically aligned with each other.
+
+Below is an example of a multimodal sample in the Data-Juicer format.
+- It includes 4 text chunks, separated by the special token `<|__dj__eoc|>`.
+- In addition to text, this sample includes 3 other modalities: images, audios, and videos.
+They are stored on disk, and their paths are listed in the corresponding first-level fields of the sample.
+- In the text, other modalities are represented as special tokens (e.g. image -- `<__dj__image>`).
+The data represented by each modality's special tokens corresponds to the paths in the list in order of appearance in the text.
+(e.g. the 2 image tokens in the 3rd chunk correspond to the antarctica_map and europe_map images in the image path list, respectively)
+- A single chunk can contain data of multiple modalities and multiple modality special tokens; they are semantically aligned with each other and with the text in that chunk.
+These special tokens can appear at any position in a chunk (usually before or after the text).
+- Unlike text-only samples, for multimodal samples the stats computed for other modalities may be a list of stats over the list of multimodal data (e.g. image_widths in this example).
+
+```python
+{
+  "text": "<__dj__image> Antarctica is Earth's southernmost and least-populated continent. <|__dj__eoc|> "
+          "<__dj__video> <__dj__audio> Situated almost entirely south of the Antarctic Circle and surrounded by the "
+          "Southern Ocean (also known as the Antarctic Ocean), it contains the geographic South Pole. <|__dj__eoc|> "
+          "Antarctica is the fifth-largest continent, being about 40% larger than Europe, "
+          "and has an area of 14,200,000 km2 (5,500,000 sq mi). <__dj__image> <__dj__image> <|__dj__eoc|> "
+          "Most of Antarctica is covered by the Antarctic ice sheet, "
+          "with an average thickness of 1.9 km (1.2 mi). <|__dj__eoc|>",
+  "images": [
+    "path/to/the/image/of/antarctica_snowfield",
+    "path/to/the/image/of/antarctica_map",
+    "path/to/the/image/of/europe_map"
+  ],
+  "audios": [
+    "path/to/the/audio/of/sound_of_waves_in_Antarctic_Ocean"
+  ],
+  "videos": [
+    "path/to/the/video/of/remote_sensing_view_of_antarctica"
+  ],
+  "meta": {
+    "src": "customized",
+    "version": "0.1",
+    "author": "xxx"
+  },
+  "stats": {
+    "lang": "en",
+    "image_widths": [224, 336, 512],
+    ...
+  }
+}
+```
+
+Based on this format, Data-Juicer provides several dataset format conversion tools for some popular multimodal works.
 
 These tools fall into two types:
 - Other formats to Data-Juicer format: these tools are in the `source_format_to_data_juicer_format` directory. They help convert datasets in other formats into target datasets in the Data-Juicer format.
 - Data-Juicer format to other formats: these tools are in the `data_juicer_format_to_target_format` directory. They help convert datasets in the Data-Juicer format into datasets in the target format.
 
 Currently, the dataset formats supported by Data-Juicer are listed in the table below.
 
-| Format | source_format_to_data_juicer_format | data_juicer_format_to_target_format | Ref. |
-|----------|-------------------------------------|-------------------------------------|----------------------------------------------------------------------------------------------------|
-| LLaVA-like | `llava_to_dj.py` | `dj_to_llava.py` | [Format Description](https://github.com/haotian-liu/LLaVA/blob/main/docs/Finetune_Custom_Data.md#dataset-format) |
-| MMC4-like | `mmc4_to_dj.py` | `dj_to_mmc4.py` | [Format Description](https://github.com/allenai/mmc4#documents) |
-| WavCaps-like | `wavcaps_to_dj.py` | `dj_to_wavcaps.py` | [Format Description](https://github.com/XinhaoMei/WavCaps#table-of-contents) |
+| Format | Type | source_format_to_data_juicer_format | data_juicer_format_to_target_format | Ref. |
+|----------|-------|-------------------------------------|-------------------------------------|----------------------------------------------------------------------------------------------------|
+| LLaVA-like | image-text | `llava_to_dj.py` | `dj_to_llava.py` | [Format Description](https://github.com/haotian-liu/LLaVA/blob/main/docs/Finetune_Custom_Data.md#dataset-format) |
+| MMC4-like | image-text | `mmc4_to_dj.py` | `dj_to_mmc4.py` | [Format Description](https://github.com/allenai/mmc4#documents) |
+| WavCaps-like | audio-text | `wavcaps_to_dj.py` | `dj_to_wavcaps.py` | [Format Description](https://github.com/XinhaoMei/WavCaps#table-of-contents) |
 
 For all tools, you can run the following command to see their detailed usage:
