"Arthur Bench: The Most Robust Way to Evaluate LLMs" #2826
ianscrivener started this conversation in Ideas
Replies: 1 comment 1 reply
-
Yes, we currently have 5 nodes (one of them is a V100 GPU, the others are x64 and ARM CPUs). I'm fairly happy with it so far. There are lots of things to improve, but for now it does a good job.
-
Arthur Bench is an open-source evaluation tool for comparing LLMs, prompts, and hyperparameters for generative text models.
Arthur AI blog post
https://github.com/arthur-ai/bench
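For anyone who hasn't tried it, here is a minimal sketch of what a Bench test suite looks like, adapted from the project's quickstart. Treat it as illustrative rather than definitive: module paths, argument names, and available scorers may differ between versions, and the suite/run names are hypothetical placeholders.

```python
from arthur_bench.run.testsuite import TestSuite

# Create a suite: a name, a scoring method, and inputs with reference outputs.
suite = TestSuite(
    "llamacpp_quality_check",          # hypothetical suite name
    "exact_match",                     # other scorers are listed in the Bench docs
    input_text_list=[
        "What year was FDR first elected president?",
        "What is the opposite of down?",
    ],
    reference_output_list=["1932", "up"],
)

# Score one candidate model's outputs against the references.
suite.run(
    "q4_0_run",                        # hypothetical run name, e.g. a Q4_0 quantization
    candidate_output_list=["1932", "up is the opposite of down"],
)
```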
@ggerganov, did the Azure Cloud resources ever eventuate?
Perplexity scores, HellaSwag scores, Arthur Bench tests... while CI/CD for llama.cpp is ✅ done, inference quality and latency benchmarking for llama.cpp would still be great to see!
(With only a MacBook + 4G internet access, I'm personally quite constrained in what I can do without cloud resources.)
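For context, the perplexity number in question is the usual exp of the mean negative log-likelihood over the evaluation text, which is what llama.cpp's perplexity example reports over a test corpus. A tiny sketch of that calculation from per-token log-probabilities (the example values are made up):

```python
import math

def perplexity(token_logprobs):
    """exp(mean negative log-likelihood) over the evaluated tokens."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# Hypothetical log-probs a model assigned to four ground-truth tokens.
print(perplexity([-0.21, -1.35, -0.07, -2.40]))  # ~2.74 (lower is better)
```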