* **Naming**: `training.tensor.{index}` (e.g., `training.tensor.0`, `training.tensor.1`, ...). No leading zeros.
* **Data type**: `GGML_TYPE_I32` (standard for tokens in `llama.cpp`).
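The two conventions above can be sketched as follows; the helper function here is illustrative only and is not part of `llama.cpp`:

```python
import array

# Illustrative helper (not part of llama.cpp): builds tensor names
# following the training.tensor.{index} convention -- plain decimal
# indices, no leading zeros.
def training_tensor_name(index: int) -> str:
    return f"training.tensor.{index}"

# Token IDs are stored as 32-bit signed integers (GGML_TYPE_I32);
# Python's array type code "i" is 4 bytes on mainstream platforms.
tokens = array.array("i", [1, 15043, 3186])

print(training_tensor_name(0))    # training.tensor.0
print(training_tensor_name(12))   # training.tensor.12
print(len(tokens.tobytes()))      # 12 (3 tokens * 4 bytes)
```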
Building
--------
It is assumed that you have already set up the `llama.cpp` build environment (e.g., using CMake) and installed Arrow and Parquet on your system.
1. **Clone the `llama.cpp` repository**:
Usage
-----

* `-h`, `--help`: Show this help message and exit.
* `-m <path>`, `--model <path>`: Path to the GGUF model used for the tokenizer (default: `models/7B/ggml-model-f16.gguf`).
* `--in-file <path>`: Path to the input dataset file, either a plain text file or a Parquet file (default: `input.txt`).
* `-o <path>`, `--output <path>`: Path to save the output GGUF file (default: `output.gguf`).
* `--max-seq-len <length>`: Maximum sequence length in tokens (default: `2048`). Sequences exceeding this length will be truncated.
* `--pre-tokenized`: Specifies that the input file contains pre-tokenized data (space-separated token IDs) rather than raw text.
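For illustration, a pre-tokenized input file holds one sequence per line as space-separated token IDs. The sketch below (file name and token values are arbitrary, not taken from the tool) writes and re-parses such a file, including the truncation implied by `--max-seq-len`:

```python
import os
import tempfile

MAX_SEQ_LEN = 2048  # mirrors the --max-seq-len default

# Arbitrary example token IDs; real IDs come from the model's tokenizer.
sequences = [[1, 15043, 3186], [1, 910, 338, 263, 1243]]

# One sequence per line, token IDs separated by spaces.
path = os.path.join(tempfile.mkdtemp(), "pretok.txt")
with open(path, "w") as f:
    for seq in sequences:
        f.write(" ".join(str(t) for t in seq) + "\n")

# Reading it back: parse the space-separated IDs, truncating any
# sequence longer than MAX_SEQ_LEN.
with open(path) as f:
    parsed = [[int(t) for t in line.split()][:MAX_SEQ_LEN] for line in f]

print(parsed == sequences)  # True (both sequences are shorter than 2048)
```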