Unlimited Generation During Tokenization #272

@mrqorib

Hi Dolma maintainers,
Thank you very much for publishing and maintaining this repository. I followed the getting-started tutorial (https://github.com/allenai/dolma/blob/main/docs/getting-started.md). The log indicated that only 14 NumPy files would be generated, but the run actually produced thousands of files with a huge total size.
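To quantify "thousands of files with huge sizes", I used a quick diagnostic along these lines (my own helper, not part of dolma) to count the files and total bytes under the destination directory:

```python
from pathlib import Path

def summarize_output(dest: str) -> tuple[int, int]:
    """Return (file count, total bytes) for everything under `dest`."""
    files = [p for p in Path(dest).rglob("*") if p.is_file()]
    return len(files), sum(p.stat().st_size for p in files)

# Against the destination used above:
# n, total = summarize_output("/tmp/train/tokens/")
# print(f"{n} files, {total / 2**30:.1f} GiB")
```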

The command I used is as follows:

dolma tokens --documents /tmp/train/documents --dtype uint32 --tokenizer.name_or_path "allenai/OLMo-2-0425-1B" --tokenizer.bos_token_id 100257 --tokenizer.eos_token_id 100257 --tokenizer.pad_token_id 100277 --destination /tmp/train/tokens/ --processes 14 --files_per_process 1

The /tmp/train/documents directory contains the following files:

slimpajama-0000.jsonl.gz  
slimpajama-0001.jsonl.gz  
...  
slimpajama-0056.jsonl.gz  
slimpajama-0057.json.gz  

Below is the log:

batch_size: 10000
debug: false
destination: /tmp/train/tokens/
documents:
- /tmp/train/documents
dryrun: false
dtype: uint32
fields:
  id_field_name: id
  id_field_type: str
  text_field_name: text
  text_field_type: str
files_per_process: null
max_size: 1073741824
processes: 14
ring_size: 8
sample_ring_prop: false
seed: 3920
tokenizer:
  bos_token_id: 100257
  encode_special_tokens: false
  eos_token_id: 100257
  fast: true
  name_or_path: allenai/OLMo-2-0425-1B
  pad_token_id: 100277
  refresh: 0
  segment_before_tokenization: false
tokenizer_name_or_path: null
work_dir:
  input: null
  output: null
Tokenizing 58 source files into 14 numpy destinations.
files: 0.00f [00:00, ?f/s]
documents: 0.00d [00:00, ?d/s]
tokens: 0.00t [00:00, ?t/s]
memmaps: 0.00m [00:00, ?m/s]
memmaps: 1.00m [00:00, 4.22m/s]
documents: 1.00d [3:10:33, 11.4ks/d]
tokens: 2.48Gt [3:10:33, 217kt/s]
memmaps: 14.0m [3:10:33, 835s/m]
documents: 2.00d [6:59:13, 12.8ks/d]
tokens: 4.97Gt [6:59:13, 195kt/s]
documents: 3.00d [10:50:55, 13.3ks/d]
tokens: 7.46Gt [10:50:55, 187kt/s]
documents: 4.00d [14:33:32, 13.3ks/d]
tokens: 9.94Gt [14:33:32, 187kt/s]
files: 1.00f [14:33:35, 3.32s/f]
files: 2.00f [14:33:35, 1.59s/f]
files: 3.00f [14:33:35, 1.06f/s]
memmaps: 15.0m [14:33:36, 4.63ks/m]
files: 4.00f [14:33:36, 1.63f/s]
memmaps: 22.0m [14:33:36, 2.38ks/m]
memmaps: 29.0m [14:33:36, 1.40ks/m]
...
memmaps: 7.08Mm [47:25:37, 92.3m/s]
memmaps: 7.08Mm [47:25:54, 92.1m/s]
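For what it's worth, here is a back-of-the-envelope check of the logged numbers (my own arithmetic, not dolma's internals): with uint32 tokens (4 bytes each) and the configured max_size of 1 GiB per memmap, even the ~9.94 G tokens reported after four documents should fit in roughly 40 memmaps, nowhere near the 7.08 M memmaps the log shows:

```python
import math

# Assumptions taken from the logged config and progress counters:
bytes_per_token = 4            # dtype: uint32
max_size = 1_073_741_824       # max_size from the config (1 GiB)
tokens_logged = 9.94e9         # "tokens: 9.94Gt" after 4 documents

expected_memmaps = math.ceil(tokens_logged * bytes_per_token / max_size)
print(expected_memmaps)  # → 38
```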

I have also tried running with --processes 1 --files_per_process 1, but that did not resolve the issue. Do you have any idea why this is happening? Thank you.
