"Arthur Bench: The Most Robust Way to Evaluate LLMs" #2826
ianscrivener started this conversation in Ideas
Replies: 1 comment 1 reply
-
Yes, we currently have 5 nodes (one of them is a V100 GPU, the others are x64 and ARM CPUs). I'm fairly happy with it so far. There are lots of things to improve, but for now it does a good job.
-
Arthur Bench is an open-source evaluation tool for comparing LLMs, prompts, and hyperparameters for generative text models.
Arthur AI blog post
https://github.com/arthur-ai/bench
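For anyone who hasn't tried it, here is a minimal sketch of what a Bench test suite looks like, adapted from the project's quickstart. Treat it as illustrative rather than definitive: module paths, argument names, and available scorers may differ between versions, and the suite/run names are hypothetical placeholders.

```python
from arthur_bench.run.testsuite import TestSuite

# Create a suite: a name, a scoring method, and inputs with reference outputs.
suite = TestSuite(
    "llamacpp_quality_check",          # hypothetical suite name
    "exact_match",                     # other scorers are listed in the Bench docs
    input_text_list=[
        "What year was FDR first elected president?",
        "What is the opposite of down?",
    ],
    reference_output_list=["1932", "up"],
)

# Score one candidate model's outputs against the references.
suite.run(
    "q4_0_run",                        # hypothetical run name, e.g. a Q4_0 quantization
    candidate_output_list=["1932", "up is the opposite of down"],
)
```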
@ggerganov, did the Azure Cloud resources ever eventuate?
Perplexity scores, HellaSwag scores, Arthur Bench tests... while CI/CD for llama.cpp is ✅ done, inference quality and latency benchmarking for llama.cpp would still be great to see!
(With only a MacBook + 4G internet access, I'm personally quite constrained in what I can do without cloud resources.)
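For context, the perplexity number in question is the usual exp of the mean negative log-likelihood over the evaluation text, which is what llama.cpp's perplexity example reports over a test corpus. A tiny sketch of that calculation from per-token log-probabilities (the example values are made up):

```python
import math

def perplexity(token_logprobs):
    """exp(mean negative log-likelihood) over the evaluated tokens."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# Hypothetical log-probs a model assigned to four ground-truth tokens.
print(perplexity([-0.21, -1.35, -0.07, -2.40]))  # ~2.74 (lower is better)
```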