best practices for using llama-train-text-from-scratch #8654
Unanswered
ericleasemorgan asked this question in Q&A
Replies: 2 comments
-
Hi, I want to train text from scratch like you. Do you find some tricks to do that? Do you have some good results? Can we share some experience? Thank you.
-
On Dec 31, 2024, at 12:04 AM, lbarasc wrote:
Hi, I want to train text from scratch like you. Do you find some tricks to do that?
Do you have some good results? Can we share some experience? Thank you.
Alas, no. I have not been able to train anything from scratch, but I believe I have plenty of content -- a corpus of more than 3 billion words. I ran the llama.cpp toy training script on the Shakespeare content. It ran for about a month and output somewhat useful results. I'd like to try something bigger. --Eric Morgan
-
Can somebody here present me with some best practices for using llama-train-text-from-scratch?
I have a 44 MB plain text file made up of bunches o' etexts written in English. It is my training data. [1] I then submitted a training command on my 60-core Linux computer.
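The exact command was not captured in this copy. A representative invocation, patterned after the example in llama.cpp's train-text-from-scratch README (the vocab-model path, checkpoint and output file names, model dimensions, and batch size below are assumptions, not necessarily the settings used here), would look something like this:

    ./llama-train-text-from-scratch \
        --vocab-model models/ggml-vocab-llama.gguf \
        --ctx 64 --embd 256 --head 8 --layer 16 \
        --checkpoint-in chk-alex-LATEST.gguf \
        --checkpoint-out chk-alex-ITERATION.gguf \
        --model-out alex-ITERATION.gguf \
        --train-data "alex.txt" \
        -t 60 -b 16 --seed 1 --adam-iter 256

The strings LATEST and ITERATION in the file names are placeholders the program fills in as it saves, so a weeks-long run like this one can be resumed from the most recent checkpoint via --checkpoint-in.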
After running for more than a few weeks, the log file says training is about 10% complete:
train_opt_callback: iter=330644 sample=1142529/11276953 sched=0.100000 loss=2.425543 dt=00:00:09 eta=3d 14:15:08 |>
I can use llama-cli against a version of the model [2].
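That command is likewise missing from this copy; a typical llama-cli call against the model file from [2], with a made-up prompt, looks like this:

    ./llama-cli -m alex.gguf -p "Once upon a time" -n 128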
At the current rate, the modeling process will be complete by this time next year. Obviously, this won't work for me. Soon I will have access to a GPU, and I suspect processing will speed up.
That said, what can I do to best use llama-train-text-from-scratch? For example, maybe I ought to delimit all of the sentences in my training data with explicit delimiter characters? Maybe I should make sure my sample data uses line feeds (ASCII character 10) consistently? (See the sketch below.) Besides these formatting options, what are some of the ways I can: 1) optimize the model creation process, and 2) optimize the model itself?

Finally, even if I do all of this modeling, I don't expect results similar to other open source models, but I'd like to understand the process so I might model smaller things somewhat successfully.
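A minimal sketch of those formatting ideas, assuming GNU tr and sed and the file from [1]; whether either step actually helps training is exactly the open question:

    # strip carriage returns (ASCII 13) so the file uses line feeds (ASCII 10) only
    tr -d '\r' < alex.txt > alex-lf.txt

    # crude sentence delimiting: start a new line after ., !, or ? plus spaces
    sed -E 's/([.!?]) +/\1\n/g' alex-lf.txt > alex-sentences.txt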
[1] training data - https://distantreader.org/tmp/training/alex.txt
[2] model - https://distantreader.org/tmp/training/alex.gguf
--
Eric Morgan emorgan@nd.edu
Navari Family Center for Digital Scholarship
University of Notre Dame