Test Results #10

Spiritdude · 2024-09-17T09:42:32Z

This is a thread for posting test results.

For example, to run the tests against another backend such as openrouter.ai:

OPENAI_API_KEY=${OPENROUTER_API_KEY} python3 test.py --base_url=https://openrouter.ai/api/v1/ --model=....

I updated test.py to support the --base_url option.
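
For anyone curious what such a switch amounts to, here is a minimal sketch of a command line that accepts the two flags used above. It is not the actual test.py code: the function name `parse_cli`, the defaults, and the returned dict layout are assumptions for illustration. Only `--base_url`, `--model`, and the `OPENAI_API_KEY` variable come from the command in this post.

```python
import argparse
import os


def parse_cli(argv=None, env=None):
    """Parse --base_url and --model into a plain settings dict.

    Hypothetical helper for illustration; not taken from test.py.
    """
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser(
        description="Run tests against an OpenAI-compatible backend"
    )
    # None stands for "use the client's default endpoint" (an assumption of this sketch).
    parser.add_argument("--base_url", default=None,
                        help="API base URL, e.g. https://openrouter.ai/api/v1/")
    parser.add_argument("--model", default=None,
                        help="model name to request from the backend")
    args = parser.parse_args(argv)
    return {
        "api_key": env.get("OPENAI_API_KEY"),  # key is read from the environment
        "base_url": args.base_url,
        "model": args.model,
    }
```

The resulting values would then be handed to whatever API client the script uses; how test.py does that is not shown in this thread.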

Spiritdude · 2024-09-17T09:43:19Z

Test Results: Llama-3.1 70b instruct (fireworks.ai)

=== Test Results Summary ===

Test Case: Arena Bench Hard
✅ bon: 25.65s
✅ leap: 27.94s
✅ z3: 66.48s
✅ cot_reflection: 111.95s
✅ rto: 157.93s
✅ moa: 236.86s
✅ self_consistency: 243.00s
✅ plansearch: 265.08s
✅ mcts: 473.38s
✅ rstar: 594.77s
✅ pvg: 1309.31s

Test Case: Big Code Bench
✅ bon: 14.89s
✅ z3: 32.49s
✅ cot_reflection: 59.09s
✅ moa: 72.93s
✅ rto: 120.13s
✅ self_consistency: 133.31s
✅ leap: 135.54s
✅ plansearch: 151.93s
✅ mcts: 198.14s
✅ rstar: 549.54s
✅ pvg: 923.09s

Test Case: Maths Problem
✅ bon: 118.39s
✅ cot_reflection: 161.69s
✅ z3: 197.59s
✅ rto: 235.34s
✅ moa: 247.35s
✅ leap: 256.80s
✅ plansearch: 296.14s
✅ self_consistency: 459.20s
✅ rstar: 500.56s
✅ mcts: 566.77s
✅ pvg: 884.29s

Test Case: GSM8K
✅ cot_reflection: 24.52s
✅ z3: 46.71s
✅ bon: 58.61s
✅ moa: 59.45s
✅ self_consistency: 95.29s
✅ rto: 126.24s
✅ plansearch: 126.88s
✅ mcts: 168.76s
✅ rstar: 181.80s
✅ leap: 306.91s
✅ pvg: 581.49s

1 reply

Spiritdude Sep 17, 2024
Author

Test Results: Llama-3.1 8b instruct (fireworks.ai)

=== Test Results Summary ===

Test Case: Arena Bench Hard
✅ cot_reflection: 3.02s
✅ z3: 4.10s
✅ leap: 5.01s
✅ bon: 6.09s
✅ moa: 9.01s
✅ plansearch: 9.22s
✅ rto: 10.20s
✅ rstar: 15.27s
✅ self_consistency: 16.87s
✅ mcts: 24.14s
✅ pvg: 38.27s

Test Case: Big Code Bench
✅ leap: 1.18s
✅ cot_reflection: 1.89s
✅ bon: 2.29s
✅ rto: 3.59s
✅ z3: 4.03s
✅ plansearch: 4.69s
✅ moa: 5.21s
✅ self_consistency: 6.07s
✅ mcts: 11.17s
✅ rstar: 12.58s
✅ pvg: 26.14s

Test Case: Maths Problem
✅ leap: 2.15s
✅ cot_reflection: 2.59s
✅ bon: 3.14s
✅ z3: 4.37s
✅ rto: 5.79s
✅ moa: 6.24s
✅ plansearch: 6.28s
✅ self_consistency: 9.84s
✅ rstar: 11.76s
✅ mcts: 13.97s
✅ pvg: 36.18s

Test Case: GSM8K
✅ cot_reflection: 0.88s
✅ bon: 1.42s
✅ z3: 1.76s
✅ self_consistency: 2.51s
✅ moa: 2.56s
✅ rto: 3.81s
✅ plansearch: 5.87s
✅ mcts: 5.99s
✅ rstar: 7.94s
✅ leap: 8.04s
✅ pvg: 14.96s

Spiritdude · 2024-09-17T09:45:14Z

Test Results: Gemma 2 2b it Q8_0 (local: llama_cpp_python)

=== Test Results Summary ===

Test Case: Arena Bench Hard
✅ cot_reflection: 24.58s
❌ rstar: 300.79s
Error:
❌ moa: 380.24s
Error: list index out of range
✅ bon: 1022.92s
✅ leap: 1218.86s
✅ rto: 1400.61s
✅ z3: 1790.11s
✅ plansearch: 1928.60s
✅ self_consistency: 2212.08s
✅ pvg: 2449.37s
✅ mcts: 2480.71s

Test Case: Big Code Bench
❌ moa: 69.11s
Error: list index out of range
✅ cot_reflection: 280.52s
❌ rstar: 301.02s
Error:
✅ bon: 436.59s
✅ z3: 489.64s
✅ leap: 537.49s
✅ rto: 859.71s
✅ plansearch: 871.66s
✅ self_consistency: 941.65s
✅ pvg: 1222.16s
✅ mcts: 1247.61s

Test Case: Maths Problem
❌ moa: 68.84s
Error: list index out of range
❌ rstar: 300.47s
Error:
✅ cot_reflection: 311.99s
✅ bon: 432.17s
✅ z3: 553.89s
✅ leap: 647.40s
✅ rto: 838.41s
✅ plansearch: 935.86s
✅ self_consistency: 984.79s
✅ pvg: 1181.73s
✅ mcts: 1192.61s

Test Case: GSM8K
❌ moa: 20.68s
Error: list index out of range
✅ cot_reflection: 81.93s
✅ bon: 137.23s
✅ leap: 214.53s
✅ rto: 451.97s
✅ plansearch: 507.66s
✅ self_consistency: 574.10s
✅ z3: 606.57s
✅ pvg: 747.68s
✅ rstar: 763.87s
✅ mcts: 770.36s

1 reply

Spiritdude Sep 17, 2024
Author

Test Results: Gemma 2 27B (openrouter.ai)

=== Test Results Summary ===

Test Case: Arena Bench Hard
✅ cot_reflection: 8.20s
✅ leap: 9.64s
✅ z3: 34.46s
✅ bon: 37.12s
❌ moa: 38.38s
Error: list index out of range
✅ plansearch: 51.59s
✅ rto: 73.63s
✅ mcts: 99.24s
✅ rstar: 109.37s
✅ self_consistency: 112.15s
✅ pvg: 119.65s

Test Case: Big Code Bench
✅ cot_reflection: 7.22s
✅ leap: 15.10s
✅ bon: 17.65s
❌ moa: 22.30s
Error: list index out of range
✅ z3: 26.84s
✅ plansearch: 38.92s
✅ rto: 40.99s
✅ mcts: 53.84s
✅ pvg: 70.87s
✅ self_consistency: 74.45s
❌ rstar: 108.85s
Error: 0, message='Attempt to decode JSON with unexpected mimetype: text/html; charset=utf-8', url=URL('https://openrouter.ai/api/v1/chat/completions')

Test Case: Maths Problem
❌ moa: 7.11s
Error: list index out of range
✅ cot_reflection: 9.48s
✅ bon: 12.95s
✅ leap: 19.08s
✅ z3: 28.28s
✅ plansearch: 29.35s
✅ rto: 33.19s
✅ rstar: 38.56s
✅ mcts: 41.27s
✅ self_consistency: 52.80s
✅ pvg: 171.69s

Test Case: GSM8K
❌ moa: 2.31s
Error: list index out of range
✅ bon: 2.98s
✅ cot_reflection: 3.56s
✅ z3: 8.19s
✅ mcts: 10.74s
✅ self_consistency: 12.88s
✅ plansearch: 17.13s
✅ rstar: 21.02s
✅ rto: 30.32s
✅ leap: 37.64s
✅ pvg: 40.40s

Spiritdude · 2024-09-17T10:13:22Z

Test Results: Phi-3.5 Mini 128K Instruct (openrouter.ai)

=== Test Results Summary ===

Test Case: Arena Bench Hard
❌ moa: 32.35s
Error: list index out of range
✅ bon: 36.01s
✅ z3: 50.59s
✅ rstar: 89.90s
✅ rto: 123.02s
✅ plansearch: 146.24s
✅ leap: 164.27s
✅ self_consistency: 194.03s
✅ mcts: 245.17s
✅ pvg: 249.50s
✅ cot_reflection: 640.94s

Test Case: Big Code Bench
✅ bon: 10.56s
❌ moa: 12.96s
Error: list index out of range
✅ cot_reflection: 16.71s
✅ leap: 44.05s
✅ rto: 48.42s
✅ z3: 48.58s
✅ self_consistency: 49.33s
✅ plansearch: 60.63s
✅ rstar: 87.37s
✅ mcts: 192.81s
✅ pvg: 739.12s

Test Case: Maths Problem
✅ cot_reflection: 7.13s
❌ moa: 11.43s
Error: list index out of range
✅ bon: 14.15s
✅ z3: 17.66s
✅ leap: 48.33s
✅ self_consistency: 50.56s
✅ plansearch: 65.71s
✅ pvg: 108.00s
❌ rstar: 114.84s
Error: 0, message='Attempt to decode JSON with unexpected mimetype: text/html; charset=utf-8', url=URL('https://openrouter.ai/api/v1/chat/completions')
✅ rto: 267.34s
✅ mcts: 312.35s

Test Case: GSM8K
❌ moa: 1.90s
Error: list index out of range
✅ bon: 4.00s
✅ cot_reflection: 4.40s
✅ self_consistency: 12.59s
✅ rto: 25.83s
✅ leap: 32.10s
✅ plansearch: 46.75s
✅ pvg: 62.61s
✅ mcts: 68.81s
❌ rstar: 109.70s
Error: 0, message='Attempt to decode JSON with unexpected mimetype: text/html; charset=utf-8', url=URL('https://openrouter.ai/api/v1/chat/completions')
✅ z3: 319.85s
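
To compare runs across models, the summaries in this thread can be parsed mechanically. Below is a rough sketch, assuming the layout shown above: a `Test Case:` line, then one `✅`/`❌ name: seconds` line per approach, optionally followed by an `Error:` line. The names `parse_summary` and `failures` are hypothetical and not part of test.py; the sample text is copied from the Gemma 2 2b GSM8K block.

```python
import re

# One result line, e.g. "✅ bon: 25.65s" or "❌ moa: 380.24s".
RESULT_RE = re.compile(r"^(✅|❌)\s+(\w+):\s+([\d.]+)s$")


def parse_summary(text):
    """Return {test_case: [{"approach", "ok", "seconds", "error"}, ...]}."""
    results = {}
    case = None   # name of the test case currently being read
    last = None   # most recent result row, so an Error: line can attach to it
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Test Case:"):
            case = line[len("Test Case:"):].strip()
            results[case] = []
            last = None
            continue
        m = RESULT_RE.match(line)
        if m and case is not None:
            last = {"approach": m.group(2), "ok": m.group(1) == "✅",
                    "seconds": float(m.group(3)), "error": None}
            results[case].append(last)
        elif line.startswith("Error:") and last is not None:
            # May be an empty string when the log printed no message.
            last["error"] = line[len("Error:"):].strip()
    return results


def failures(results):
    """List (test_case, approach, error) for every failed entry."""
    return [(case, r["approach"], r["error"])
            for case, rows in results.items() for r in rows if not r["ok"]]


sample = """Test Case: GSM8K
❌ moa: 20.68s
Error: list index out of range
✅ cot_reflection: 81.93s
"""
print(failures(parse_summary(sample)))
# prints [('GSM8K', 'moa', 'list index out of range')]
```

Feeding each model's full summary through the same functions gives a quick per-model list of failing approaches.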

0 replies
