
Commit a77169f

Apply suggestions from code review
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
1 parent 8d2cea5 commit a77169f

3 files changed: +11 -11 lines changed

tools/dataset-converter/README.md

Lines changed: 9 additions & 9 deletions
@@ -20,7 +20,7 @@ GGUF Structure for Training Data
 The generated GGUF files follow a specific structure for training data:
 
-* **Metadata (KV-pairs)**: All metadata keys are prefixed with `training.` to avoid conflicts with model metadata.
+* **Metadata (KV pairs)**: All metadata keys are prefixed with `training.` to avoid conflicts with model metadata.
 
 * `training.format.version`: `string` (e.g., "1.0") - Specification version.

@@ -36,13 +36,13 @@ The generated GGUF files follow a specific structure for training data:
 * `training.tokenizer.gguf.merges`: `array[string]` (optional) - Tokenizer merges (for BPE).
 
-* `training.tokenizer.gguf.pre`: `string` (optional) - Pre-tokenization architecture.
+* `training.tokenizer.gguf.pre`: `string` (optional) - Architecture of the pre-tokenizer.
 
 * `training.sequence.count`: `uint64` - Total number of sequences in the file.
 
 * **Tensors**: Each training sequence is stored as a separate tensor.
 
-* **Naming**: `training.tensor.{index}` (e.g., `training.tensor.0`, `training.tensor.1`, ...).
+* **Naming**: `training.tensor.{index}` (e.g., `training.tensor.0`, `training.tensor.1`, ...). No leading zeros.
 
 * **Data type**: `GGML_TYPE_I32` (standard for tokens in `llama.cpp`).

@@ -52,7 +52,7 @@ The generated GGUF files follow a specific structure for training data:
 Building
 --------
 
-It is assumed that you have already set up the `llama.cpp` build environment (e.g., using CMake).
+It is assumed that you have already set up the `llama.cpp` build environment (e.g., using CMake) and installed Arrow and Parquet on your system.
 
 1. **Clone the `llama.cpp` repository**:

@@ -85,13 +85,13 @@ Usage
 * `-h`, `--help`: Show this help message and exit.
 
-* `-m <path>` : Path to the GGUF model used for the tokenizer (default: `models/7B/ggml-model-f16.gguf`).
+* `-m <path>, --model <path>` : Path to the GGUF model used for the tokenizer (default: `models/7B/ggml-model-f16.gguf`).
 
-* `--in-file <path>`: Path to the input dataset file. For text input, this is a single file. For Parquet, this is the path to the Parquet file (default: `input.txt`).
+* `--in-file <path>`: Path to the input dataset file, either a plain text file or a Parquet file (default: `input.txt`).
 
-* `-o <path>`, `--output <path>`: Path to save the output GGUF file (default: `output.gguf`).
+* `-o <path>`, `--output <path>`: Path to save the output GGUF file to (default: `output.gguf`).
 
-* ``--max-seq-len <length>`: Maximum sequence length in tokens (default: `2048`). Sequences exceeding this length will be truncated.
+* `--max-seq-len <length>`: Maximum sequence length in tokens (default: `2048`). Sequences exceeding this length will be truncated.
 
 * `--pre-tokenized`: Specifies that the input file contains pre-tokenized data (space-separated token IDs) rather than raw text.

@@ -110,7 +110,7 @@ Usage
 ### Usage Examples
 
-1. **Converting a regular text file**:
+1. **Converting a plain text file**:
 
     ./bin/convert-to-train-gguf -m models/7B/ggml-model-f16.gguf -i my_dataset.txt -o my_training_data.gguf -l 1024

tools/dataset-converter/dataset-to-gguf/llama-dataset-reader/llama-parquet-data-reader.h

Lines changed: 1 addition & 1 deletion
@@ -12,7 +12,7 @@
 #include <string>
 #include <vector>
 
-# include "llama-dataset-reader.h"
+#include "llama-dataset-reader.h"
 
 // Implementation of DatasetReader for reading Parquet files.
 // This class will handle reading tokenized sequences from a Parquet file.

tools/dataset-converter/dataset-to-gguf/llama-dataset-reader/llama-text-data-reader.cpp

Lines changed: 1 addition & 1 deletion
@@ -73,7 +73,7 @@ bool llama_text_dataset_reader::read_next_sequence(std::vector<llama_token> & to
     if (n_tokens < 0) {
         std::cerr << "Error: Tokenization failed for line: " << line << std::endl;
         // Return an empty sequence in case of tokenization error
-        return true;
+        return false;
     }
     tokens.assign(m_tokens_buffer.begin(), m_tokens_buffer.begin() + n_tokens);
 }
