Question about --train-data Format in llama.cpp Finetune #7817

saaraahfar · 2024-06-07T15:43:37Z


Hi,

I'm trying to understand how the --train-data option works in the finetune script for llama.cpp. The example shows using a single text file (shakespeare.txt), but I'm not sure how to format the data if I have multiple entries.

I have a list of formatted messages like:

[
  {
    "role": "system",
    "content": "..."
  },
  {
    "role": "user",
    "content": "..."
  },
  {
    "role": "assistant",
    "content": "..."
  }
]

Is there a way to provide a list of such text entries for fine-tuning? How should the training data be structured if it's not just one continuous text file?

saaraahfar · 2024-06-08T21:50:29Z


After a long investigation of the code, I think the options below are what I need, and I will have to format the data myself according to the model I am fine-tuning. Can anybody please confirm that I am on the right path?

$ finetune --help
  --sample-start STR         Sets the starting point for samples after the specified pattern. If empty use every token position as sample start. (default '')
  --include-sample-start     Include the sample start in the samples. (default off)
  --escape                   process sample start escapes sequences (\n, \r, \t, \', \", \\)
  --overlapping-samples      Samples may overlap, will include sample-start of second and following samples. When off, samples will end at begin of next sample. (default off)
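
One way to act on these options is to flatten every conversation into a single text file and mark the beginning of each one with a delimiter, then pass that same delimiter to --sample-start. The sketch below is only an illustration under assumptions: the `<s>` delimiter, the `role: content` layout, and the function names are hypothetical, and the real layout should follow the chat template of the model being fine-tuned.

```python
# Hypothetical delimiter: choose a string that never occurs inside the data
# and pass the same string to --sample-start.
SAMPLE_START = "<s>"


def render_conversation(messages):
    """Flatten one list of {"role", "content"} dicts into plain text.

    The "role: content" layout is a placeholder, not a real chat template;
    replace it with the template of the model you are fine-tuning.
    """
    return "\n".join(f"{m['role']}: {m['content']}" for m in messages)


def build_training_text(conversations, sample_start=SAMPLE_START):
    """Concatenate all conversations, each prefixed by the delimiter."""
    return "".join(
        f"{sample_start}{render_conversation(conv)}\n" for conv in conversations
    )


def write_training_file(path, conversations):
    """Write the combined text to a file intended for --train-data."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(build_training_text(conversations))
```

Calling `write_training_file("train.txt", conversations)` would produce a file that could then be passed as `--train-data train.txt --sample-start "<s>"`. Going by the help text above, samples should then start after each delimiter rather than at every token position, and the delimiter itself is left out of the samples unless --include-sample-start is set.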
