Unlimited Generation During Tokenization #272

@mrqorib

Hi Dolma maintainers,
Thank you very much for publishing and maintaining this repository. I followed the getting-started tutorial (https://github.com/allenai/dolma/blob/main/docs/getting-started.md). The log indicated that only 14 NumPy files would be generated, but the run actually produced thousands of files with a huge total size.
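To quantify "thousands of files with huge sizes", I used a quick diagnostic along these lines (my own helper, not part of dolma) to count the files and total bytes under the destination directory:

```python
from pathlib import Path

def summarize_output(dest: str) -> tuple[int, int]:
    """Return (file count, total bytes) for everything under `dest`."""
    files = [p for p in Path(dest).rglob("*") if p.is_file()]
    return len(files), sum(p.stat().st_size for p in files)

# Against the destination used above:
# n, total = summarize_output("/tmp/train/tokens/")
# print(f"{n} files, {total / 2**30:.1f} GiB")
```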

The command I used is as follows:

dolma tokens --documents /tmp/train/documents --dtype uint32 --tokenizer.name_or_path "allenai/OLMo-2-0425-1B" --tokenizer.bos_token_id 100257 --tokenizer.eos_token_id 100257 --tokenizer.pad_token_id 100277 --destination /tmp/train/tokens/ --processes 14 --files_per_process 1

The /tmp/train/documents directory contains the following files:

slimpajama-0000.jsonl.gz  
slimpajama-0001.jsonl.gz  
...  
slimpajama-0056.jsonl.gz  
slimpajama-0057.json.gz  

Below is the log:

batch_size: 10000
debug: false
destination: /tmp/train/tokens/
documents:
- /tmp/train/documents
dryrun: false
dtype: uint32
fields:
  id_field_name: id
  id_field_type: str
  text_field_name: text
  text_field_type: str
files_per_process: null
max_size: 1073741824
processes: 14
ring_size: 8
sample_ring_prop: false
seed: 3920
tokenizer:
  bos_token_id: 100257
  encode_special_tokens: false
  eos_token_id: 100257
  fast: true
  name_or_path: allenai/OLMo-2-0425-1B
  pad_token_id: 100277
  refresh: 0
  segment_before_tokenization: false
tokenizer_name_or_path: null
work_dir:
  input: null
  output: null
Tokenizing 58 source files into 14 numpy destinations.
files: 0.00f [00:00, ?f/s]
documents: 0.00d [00:00, ?d/s]
tokens: 0.00t [00:00, ?t/s]
memmaps: 0.00m [00:00, ?m/s]
memmaps: 1.00m [00:00, 4.22m/s]
documents: 1.00d [3:10:33, 11.4ks/d]
tokens: 2.48Gt [3:10:33, 217kt/s]
memmaps: 14.0m [3:10:33, 835s/m]
documents: 2.00d [6:59:13, 12.8ks/d]
tokens: 4.97Gt [6:59:13, 195kt/s]
documents: 3.00d [10:50:55, 13.3ks/d]
tokens: 7.46Gt [10:50:55, 187kt/s]
documents: 4.00d [14:33:32, 13.3ks/d]
tokens: 9.94Gt [14:33:32, 187kt/s]
files: 1.00f [14:33:35, 3.32s/f]
files: 2.00f [14:33:35, 1.59s/f]
files: 3.00f [14:33:35, 1.06f/s]
memmaps: 15.0m [14:33:36, 4.63ks/m]
files: 4.00f [14:33:36, 1.63f/s]
memmaps: 22.0m [14:33:36, 2.38ks/m]
memmaps: 29.0m [14:33:36, 1.40ks/m]
...
memmaps: 7.08Mm [47:25:37, 92.3m/s]
memmaps: 7.08Mm [47:25:54, 92.1m/s]
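For what it's worth, here is a back-of-the-envelope check of the logged numbers (my own arithmetic, not dolma's internals): with uint32 tokens (4 bytes each) and the configured max_size of 1 GiB per memmap, even the ~9.94 G tokens reported after four documents should fit in roughly 40 memmaps, nowhere near the 7.08 M memmaps the log shows:

```python
import math

# Assumptions taken from the logged config and progress counters:
bytes_per_token = 4            # dtype: uint32
max_size = 1_073_741_824       # max_size from the config (1 GiB)
tokens_logged = 9.94e9         # "tokens: 9.94Gt" after 4 documents

expected_memmaps = math.ceil(tokens_logged * bytes_per_token / max_size)
print(expected_memmaps)  # → 38
```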

I have also tried running with --processes 1 --files_per_process 1, but that did not resolve the issue. Do you have any idea why this is happening? Thank you.
