Commit 2720113

release dj v0.2.0 (dj_video) (#227)
* release dj v0.2.0 (dj_video) * authored by data-juicer team
1 parent 475c52b commit 2720113

File tree

172 files changed: +11515 additions, -1040 deletions


README.md

Lines changed: 89 additions & 63 deletions

README_ZH.md

Lines changed: 82 additions & 62 deletions

configs/config_all.yaml

Lines changed: 116 additions & 23 deletions

configs/data_juicer_recipes/README.md

Lines changed: 15 additions & 1 deletion
@@ -4,7 +4,7 @@ We found that there are still some "bad" samples in existing processed datasets

 We use a simple 3-σ rule to set the hyperparameters for the ops in each recipe.

-## Before and after refining for Pretraining Dataset
+## Before and after refining for Pretraining Text Dataset

 | subset | #samples before | #samples after | keep ratio | config link | data link | source |
 |----------------------|:---------------------------:|:--------------:|:----------:|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------|
@@ -35,3 +35,17 @@ We use a simple 3-σ rule to set the hyperparameters for the ops in each recipe.
 |------------------|:-------------------------:|:--------------------------------------:|:----------:|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------|
 | Alpaca-Cot EN | 136,219,879 | 72,855,345 | 54.48% | [alpaca-cot-en-refine.yaml](alpaca_cot/alpaca-cot-en-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/CFT/alpaca-cot-en-refine_result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/alpaca-cot-en-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/alpaca-cot-en-refined-by-data-juicer) | [39 Subsets of Alpaca-CoT](alpaca_cot/README.md#refined-alpaca-cot-dataset-meta-info) |
 | Alpaca-Cot ZH | 21,197,246 | 9,873,214 | 46.58% | [alpaca-cot-zh-refine.yaml](alpaca_cot/alpaca-cot-zh-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/CFT/alpaca-cot-zh-refine_result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/alpaca-cot-zh-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/alpaca-cot-zh-refined-by-data-juicer) | [28 Subsets of Alpaca-CoT](alpaca_cot/README.md#refined-alpaca-cot-dataset-meta-info) |
+
+## Before and after refining for Multimodal Dataset
+
+| subset | #samples before | #samples after | keep ratio | config link | data link | source |
+|---------------------------|:---------------------------:|:--------------:|:----------:|--------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------|
+| LLaVA pretrain (LCS-558k) | 558,128 | 500,380 | 89.65% | [llava-pretrain-refine.yaml](llava-pretrain-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/MM_data/our_refined_data/LLaVA-1.5/public/llava-pretrain-refine-result.json) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/llava-pretrain-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/llava-pretrain-refined-by-data-juicer) | [LLaVA-1.5](https://github.com/haotian-liu/LLaVA) |
+
+### Evaluation Results
+- LLaVA pretrain (LCS-558k): models **pretrained with the refined dataset** and fine-tuned with the original instruct dataset outperform the baseline (LLaVA-1.5-13B) on 10 out of 12 benchmarks.
+
+| model | VQAv2 | GQA | VizWiz | SQA | TextVQA | POPE | MME | MM-Bench | MM-Bench-CN | SEED | LLaVA-Bench-Wild | MM-Vet |
+|-------------------------------|-------| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
+| LLaVA-1.5-13B <br> (baseline) | **80.0** | 63.3 | 53.6 | 71.6 | **61.3** | 85.9 | 1531.3 | 67.7 | 63.6 | 61.6 | 72.5 | 36.1 |
+| LLaVA-1.5-13B <br> (refined pretrain dataset) | 79.94 | **63.5** | **54.09** | **74.20** | 60.82 | **86.67** | **1565.53** | **68.2** | **63.9** | **61.8** | **75.9** | **37.4** |
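The 3-σ rule the recipes use to set hyperparameters can be sketched in a few lines. This is a hypothetical illustration of the idea only (function and variable names are assumptions, not Data-Juicer code): compute a stat's mean and standard deviation over the corpus, then keep samples whose stat lies within mean ± 3σ.

```python
import statistics

def three_sigma_bounds(values):
    """Keep-range [mean - 3*sigma, mean + 3*sigma] for a per-sample stat."""
    mean = statistics.mean(values)
    sigma = statistics.pstdev(values)  # population std dev over the corpus
    return mean - 3 * sigma, mean + 3 * sigma

# toy per-sample stats (e.g. character-repetition ratios) with one outlier
stats = [0.01, 0.02, 0.03] * 7 + [0.9]
lo, hi = three_sigma_bounds(stats)
kept = [v for v in stats if lo <= v <= hi]  # the 0.9 outlier is dropped
```

In a real recipe the derived bound (here `hi`) would become a filter op's `max_ratio`-style hyperparameter.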

configs/data_juicer_recipes/README_ZH.md

Lines changed: 14 additions & 0 deletions

(Chinese content translated below; link anchors kept as-is.)

@@ -35,3 +35,17 @@
 |-------------------|:------------------------:|:----------------------------------:|:---------:|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------|
 | Alpaca-Cot EN | 136,219,879 | 72,855,345 | 54.48% | [alpaca-cot-en-refine.yaml](alpaca_cot/alpaca-cot-en-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/CFT/alpaca-cot-en-refine_result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/alpaca-cot-en-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/alpaca-cot-en-refined-by-data-juicer) | [39 Subsets of Alpaca-CoT](alpaca_cot/README_ZH.md#完善的-alpaca-cot-数据集元信息) |
 | Alpaca-Cot ZH | 21,197,246 | 9,873,214 | 46.58% | [alpaca-cot-zh-refine.yaml](alpaca_cot/alpaca-cot-zh-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/CFT/alpaca-cot-zh-refine_result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/alpaca-cot-zh-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/alpaca-cot-zh-refined-by-data-juicer) | [28 Subsets of Alpaca-CoT](alpaca_cot/README_ZH.md#完善的-alpaca-cot-数据集元信息) |
+
+## Before and after refining for Multimodal Dataset
+
+| subset | #samples before | #samples after | keep ratio | config link | data link | source |
+|---------------------------|:---------------------------:|:--------------:|:----------:|--------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------|
+| LLaVA pretrain (LCS-558k) | 558,128 | 500,380 | 89.65% | [llava-pretrain-refine.yaml](llava-pretrain-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/MM_data/our_refined_data/LLaVA-1.5/public/llava-pretrain-refine-result.json) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/llava-pretrain-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/llava-pretrain-refined-by-data-juicer) | [LLaVA-1.5](https://github.com/haotian-liu/LLaVA) |
+
+### Evaluation Results
+- LLaVA pretrain (LCS-558k): models pretrained with the **refined pretrain dataset** and fine-tuned with the original instruct dataset outperform the baseline model LLaVA-1.5-13B on 10 out of 12 benchmarks.
+
+| model | VQAv2 | GQA | VizWiz | SQA | TextVQA | POPE | MME | MM-Bench | MM-Bench-CN | SEED | LLaVA-Bench-Wild | MM-Vet |
+|---------------------------------|-------| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
+| LLaVA-1.5-13B <br> (baseline) | **80.0** | 63.3 | 53.6 | 71.6 | **61.3** | 85.9 | 1531.3 | 67.7 | 63.6 | 61.6 | 72.5 | 36.1 |
+| LLaVA-1.5-13B <br> (refined pretrain dataset) | 79.94 | **63.5** | **54.09** | **74.20** | 60.82 | **86.67** | **1565.53** | **68.2** | **63.9** | **61.8** | **75.9** | **37.4** |
configs/data_juicer_recipes/llava-pretrain-refine.yaml

Lines changed: 60 additions & 0 deletions

@@ -0,0 +1,60 @@
project_name: 'llava-1.5-pretrain-dataset-refine-recipe'
dataset_path: 'blip_laion_cc_sbu_558k_dj_fmt_only_caption.jsonl'  # converted LLaVA pretrain dataset in Data-Juicer format, with only_keep_caption set to True. See tools/multimodal/source_format_to_data_juicer_format/llava_to_dj.py
export_path: 'blip_laion_cc_sbu_558k_dj_fmt_only_caption_refined.jsonl'

np: 42  # number of subprocesses used to process the dataset
text_keys: 'text'  # key name of the field holding the sample texts to process, e.g., `text`, `instruction`, `output`, ...

# for multimodal data processing
image_key: 'images'  # key name of the field storing the list of sample image paths
image_special_token: '<image>'  # the special token that represents an image in the text. For LLaVA it's "<image>". Should be aligned with the args used when running the conversion tools.
eoc_special_token: '<|__dj__eoc|>'  # the special token that represents the end of a chunk in the text. By default it's "<|__dj__eoc|>". You can specify your own special token; it should be aligned with the args used when running the conversion tools.

open_tracer: true

# process schedule: a list of process operators with their arguments
process:
  - fix_unicode_mapper:                  # fix unicode errors in text
  - punctuation_normalization_mapper:    # normalize unicode punctuation to English punctuation

  # 558128
  # Filter ops
  - alphanumeric_filter:          # 558087: filter text with alphabet/numeric ratio out of a specific range
      tokenization: false         # whether to count the ratio of alphanumeric tokens to the total number of tokens
      min_ratio: 0.60             # the min ratio of the filter range
  - character_repetition_filter:  # 546105: filter text with character repetition ratio out of a specific range
      rep_len: 10                 # repetition length for char-level n-grams
      max_ratio: 0.09373663       # the max ratio of the filter range
  - flagged_words_filter:         # 543960: filter text whose flagged-word ratio exceeds a specific max value
      lang: en                    # language whose flagged words are considered
      tokenization: false         # whether to use a model to tokenize documents
      max_ratio: 0.0              # the max ratio used to filter text
  - perplexity_filter:            # 532029: filter text with perplexity score out of a specific range
      lang: en                    # language in which to compute perplexity
      max_ppl: 14435.5806         # the max perplexity score used to filter text
  - special_characters_filter:    # 531968: filter text with special-character ratio out of a specific range
      min_ratio: 0.16534802       # the min ratio of the filter range
      max_ratio: 0.42023757       # the max ratio of the filter range
  - word_repetition_filter:       # 530773: filter text with word repetition ratio out of a specific range
      lang: en                    # language of the samples
      tokenization: false         # whether to use a model to tokenize documents
      rep_len: 10                 # repetition length for word-level n-grams
      max_ratio: 0.03085751       # the max ratio of the filter range

  - image_aspect_ratio_filter:    # 542389: filter samples by the aspect ratios of their images (a fraction of width by height, r = w/h)
      min_ratio: 0.333            # the min aspect ratio of the filter range
      max_ratio: 3.0              # the max aspect ratio of the filter range
      any_or_all: any             # keep the sample when any/all images meet the condition
  - image_shape_filter:           # 533966: filter samples by the widths and heights of their images
      max_width: 727.8798422276   # the max width of the width filter range
      max_height: 606.2421072264  # the max height of the height filter range
      any_or_all: any             # keep the sample when any/all images meet the condition
  - image_size_filter:            # 533966: filter samples by the size of their images (in bytes)
      max_size: "124KB"           # the max size of the filter range
      any_or_all: any             # keep the sample when any/all images meet the condition
  - image_text_similarity_filter: # 544202: filter samples by the similarity between text and images
      hf_clip: openai/clip-vit-base-patch32   # name of the Hugging Face CLIP model to use
      min_score: 0.20315419       # the min similarity of the filter range
  - image_text_matching_filter:   # filter samples by the matching score between image and text
      hf_blip: Salesforce/blip-itm-base-coco  # name of the Hugging Face BLIP model to use
      min_score: 0.44930778       # the min matching score of the filter range
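As a sanity check on how a filter op such as image_aspect_ratio_filter in the recipe above decides to keep or drop a sample, here is a minimal stand-alone sketch. The function name and sample layout are illustrative assumptions, not the Data-Juicer implementation:

```python
def keep_by_aspect_ratio(image_sizes, min_ratio=0.333, max_ratio=3.0,
                         any_or_all='any'):
    """Keep a sample whose images' aspect ratios r = w / h fall in range.

    With any_or_all='any', a single conforming image keeps the sample;
    with 'all', every image must conform (mirroring the recipe setting).
    """
    checks = [min_ratio <= w / h <= max_ratio for w, h in image_sizes]
    if not checks:  # no images: nothing to filter on, keep the sample
        return True
    return any(checks) if any_or_all == 'any' else all(checks)

# a 4:1 panorama (ratio 4.0, outside [0.333, 3.0]) plus a square image
sample_images = [(1200, 300), (512, 512)]
keep_any = keep_by_aspect_ratio(sample_images, any_or_all='any')
keep_all = keep_by_aspect_ratio(sample_images, any_or_all='all')
```

Here `keep_any` is true because the square image passes, while `keep_all` is false because the panorama fails, which is why the recipe's `any_or_all: any` is the more permissive choice.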

data_juicer/config/config.py

Lines changed: 12 additions & 0 deletions
@@ -149,6 +149,18 @@ def init_configs(args=None):
         help='The special token that represents an audio in the text. In '
         'default, it\'s "<__dj__audio>". You can specify your own special'
         ' token according to your input dataset.')
+    parser.add_argument(
+        '--video_key',
+        type=str,
+        default='videos',
+        help='Key name of field to store the list of sample video paths.')
+    parser.add_argument(
+        '--video_special_token',
+        type=str,
+        default=SpecialTokens.video,
+        help='The special token that represents a video in the text. In '
+        'default, it\'s "<__dj__video>". You can specify your own special'
+        ' token according to your input dataset.')
     parser.add_argument(
         '--eoc_special_token',
         type=str,
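The two new flags are plain argparse options. A self-contained sketch (substituting the literal string '<__dj__video>' for SpecialTokens.video, which lives inside Data-Juicer) shows the defaults this diff wires up:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--video_key', type=str, default='videos',
                    help='Key name of field to store the list of sample video paths.')
# stand-in for SpecialTokens.video, which is defined inside Data-Juicer
parser.add_argument('--video_special_token', type=str, default='<__dj__video>',
                    help='The special token that represents a video in the text.')

defaults = parser.parse_args([])                        # no flags: defaults apply
override = parser.parse_args(['--video_key', 'clips'])  # user-supplied override
```

With no flags, `defaults.video_key` is 'videos' and `defaults.video_special_token` is '<__dj__video>'; passing `--video_key clips` overrides only that field.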
