Hi Dolma maintainers,
Thank you very much for publishing and maintaining this repository. I followed the getting-started tutorial (https://github.com/allenai/dolma/blob/main/docs/getting-started.md). The log indicated that only 14 NumPy files would be generated, but the run actually produced an enormous number of files (over 7 million memmaps, per the log below) with a huge total size.
The command I used is as follows:

```shell
dolma tokens \
  --documents /tmp/train/documents \
  --dtype uint32 \
  --tokenizer.name_or_path "allenai/OLMo-2-0425-1B" \
  --tokenizer.bos_token_id 100257 \
  --tokenizer.eos_token_id 100257 \
  --tokenizer.pad_token_id 100277 \
  --destination /tmp/train/tokens/ \
  --processes 14 \
  --files_per_process 1
```
The /tmp/train/documents directory contains the following files:

```
slimpajama-0000.jsonl.gz
slimpajama-0001.jsonl.gz
...
slimpajama-0056.jsonl.gz
slimpajama-0057.json.gz
```
Below is the log:
```
batch_size: 10000
debug: false
destination: /tmp/train/tokens/
documents:
- /tmp/train/documents
dryrun: false
dtype: uint32
fields:
  id_field_name: id
  id_field_type: str
  text_field_name: text
  text_field_type: str
files_per_process: null
max_size: 1073741824
processes: 14
ring_size: 8
sample_ring_prop: false
seed: 3920
tokenizer:
  bos_token_id: 100257
  encode_special_tokens: false
  eos_token_id: 100257
  fast: true
  name_or_path: allenai/OLMo-2-0425-1B
  pad_token_id: 100277
  refresh: 0
  segment_before_tokenization: false
tokenizer_name_or_path: null
work_dir:
  input: null
  output: null
Tokenizing 58 source files into 14 numpy destinations.
files: 0.00f [00:00, ?f/s]
documents: 0.00d [00:00, ?d/s]
tokens: 0.00t [00:00, ?t/s]
memmaps: 0.00m [00:00, ?m/s]
files: 0.00f [00:00, ?f/s]
documents: 0.00d [00:00, ?d/s]
tokens: 0.00t [00:00, ?t/s]
memmaps: 1.00m [00:00, 4.22m/s]
files: 0.00f [3:10:32, ?f/s]
documents: 1.00d [3:10:33, 11.4ks/d]
tokens: 2.48Gt [3:10:33, 217kt/s]
memmaps: 14.0m [3:10:33, 835s/m]
tokens: 2.48Gt [3:10:50, 217kt/s]
files: 0.00f [6:59:13, ?f/s]
documents: 2.00d [6:59:13, 12.8ks/d]
tokens: 4.97Gt [6:59:13, 195kt/s]
tokens: 4.97Gt [6:59:30, 195kt/s]
files: 0.00f [10:50:54, ?f/s]
documents: 3.00d [10:50:55, 13.3ks/d]
tokens: 7.46Gt [10:50:55, 187kt/s]
tokens: 7.46Gt [10:51:10, 187kt/s]
files: 0.00f [14:33:32, ?f/s]
documents: 4.00d [14:33:32, 13.3ks/d]
tokens: 9.94Gt [14:33:32, 187kt/s]
files: 1.00f [14:33:35, 3.32s/f]
files: 2.00f [14:33:35, 1.59s/f]
files: 3.00f [14:33:35, 1.06f/s]
memmaps: 15.0m [14:33:36, 4.63ks/m]
files: 4.00f [14:33:36, 1.63f/s]
memmaps: 22.0m [14:33:36, 2.38ks/m]
memmaps: 29.0m [14:33:36, 1.40ks/m]
...
memmaps: 7.08Mm [47:25:37, 92.3m/s]
memmaps: 7.08Mm [47:25:37, 92.1m/s]
memmaps: 7.08Mm [47:25:54, 92.1m/s]
```
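For context, here is my own back-of-the-envelope estimate (not anything dolma computes, just arithmetic from the config and log figures above) of how many memmap files the run should have produced if each destination file is capped at `max_size` bytes:

```python
import math

# Figures taken from the config and progress log above.
total_tokens = 9_940_000_000  # ~9.94G tokens reported by the progress bar
bytes_per_token = 4           # dtype: uint32
max_size = 1_073_741_824      # 1 GiB cap per destination file (config)

tokens_per_file = max_size // bytes_per_token        # 268,435,456 tokens per file
expected_files = math.ceil(total_tokens / tokens_per_file)
print(expected_files)  # 38
```

So even if the 14 destinations each rolled over at the 1 GiB cap, I would expect on the order of tens of files, which makes the 7.08M memmaps in the log all the more surprising.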
I have also tried the arguments `--processes 1 --files_per_process 1`, but that did not resolve the issue. Do you have any idea why this happens? Thank you.