best practices for using llama-train-text-from-scratch #8654
Unanswered
ericleasemorgan asked this question in Q&A
Replies: 2 comments
-
Hi, I want to train text from scratch like you. Do you find some tricks to do that? Do you have some good results? Can we share some experience? Thank you.
-
On Dec 31, 2024, at 12:04 AM, lbarasc wrote:
Hi, I want to train text from scratch like you. Do you find some tricks to do that?
Do you have some good results? Can we share some experience? Thank you.
Alas, no. I have not been able to train anything from scratch, but I believe I have plenty of content -- a corpus of more than 3 billion words. I ran the llama.cpp toy training script on the Shakespeare content. It ran for about a month and output somewhat useful results. I'd like to try something bigger. --Eric Morgan
-
Can somebody here present me with some best practices for using llama-train-text-from-scratch?
I have a 44 MB plain text file made up of bunches o' etexts written in English. It is my training data. [1] I then submitted a training command on my 60-core Linux computer.
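The exact command was not captured in this copy. A representative invocation, patterned after the example in llama.cpp's train-text-from-scratch README (the vocab-model path, checkpoint and output file names, model dimensions, and batch size below are assumptions, not necessarily the settings used here), would look something like this:

    ./llama-train-text-from-scratch \
        --vocab-model models/ggml-vocab-llama.gguf \
        --ctx 64 --embd 256 --head 8 --layer 16 \
        --checkpoint-in chk-alex-LATEST.gguf \
        --checkpoint-out chk-alex-ITERATION.gguf \
        --model-out alex-ITERATION.gguf \
        --train-data "alex.txt" \
        -t 60 -b 16 --seed 1 --adam-iter 256

The strings LATEST and ITERATION in the file names are placeholders the program fills in as it saves, so a weeks-long run like this one can be resumed from the most recent checkpoint via --checkpoint-in.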
After running for more than a few weeks, the log file says training is about 10% complete:
train_opt_callback: iter=330644 sample=1142529/11276953 sched=0.100000 loss=2.425543 dt=00:00:09 eta=3d 14:15:08 |>
I can use llama-cli against a version of the model [2].
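That command is likewise missing from this copy; a typical llama-cli call against the model file from [2], with a made-up prompt, looks like this:

    ./llama-cli -m alex.gguf -p "Once upon a time" -n 128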
At the current rate, the modeling process will be complete by this time next year. Obviously, this won't work for me. Soon I will have access to a GPU, and I suspect processing will speed up.
That said, what can I do to best use llama-train-text-from-scratch? For example, maybe I ought to delimit all of the sentences in my training data with explicit delimiter characters? Maybe I should make sure my sample data uses line feeds (ASCII character 10) consistently? (See the sketch below.) Besides these formatting options, what are some of the ways I can: 1) optimize the model creation process, and 2) optimize the model itself?

Finally, even if I do all of this modeling, I don't expect results similar to other open source models, but I'd like to understand the process so I might model smaller things somewhat successfully.
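A minimal sketch of those formatting ideas, assuming GNU tr and sed and the file from [1]; whether either step actually helps training is exactly the open question:

    # strip carriage returns (ASCII 13) so the file uses line feeds (ASCII 10) only
    tr -d '\r' < alex.txt > alex-lf.txt

    # crude sentence delimiting: start a new line after ., !, or ? plus spaces
    sed -E 's/([.!?]) +/\1\n/g' alex-lf.txt > alex-sentences.txt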
[1] training data - https://distantreader.org/tmp/training/alex.txt
[2] model - https://distantreader.org/tmp/training/alex.gguf
--
Eric Morgan emorgan@nd.edu
Navari Family Center for Digital Scholarship
University of Notre Dame