LLM evaluation via rap battles

vadim0x60/rapbench

The great LLM rap-off

Listen on YouTube Music, Spotify and other streaming platforms. A new tournament will be released every month: follow me on Twitter so you don't miss anything.

Are you frustrated by AI companies training on benchmarks? Do you enjoy battle rap? Well, both of you are in the right place. Rap battles combine debate, poetry and improvisation: the three toughest tests of verbal intelligence. State of the art in LLM evaluation. State of the art in generative poetry. The benchmark to end all benchmarks. Built with keeptalking.

Battles

Round 7

Round 6

Round 5

Round 4

Round 3

Round 2

Round 1

Round 0

  • anthropic/claude-sonnet-4 v nvidia/nemotron-nano-9b-v2 lyrics, verdicts
  • deepseek/deepseek-chat-v3.1 v openai/gpt-4o-2024-08-06 lyrics, verdicts
  • google/gemini-2.0-flash-001 v openai/gpt-3.5-turbo lyrics, verdicts
  • google/gemini-2.5-pro v mistralai/mistral-large-2411 lyrics, verdicts
  • deepseek/deepseek-chat-v3-0324 v aion-labs/aion-1.0 lyrics, verdicts
  • openai/gpt-5 v google/gemma-2-9b-it lyrics, verdicts
  • openai/gpt-4.1-mini v liquid/lfm-3b lyrics, verdicts
  • google/gemini-2.5-flash-lite v deepseek/deepseek-r1-distill-qwen-32b lyrics, verdicts
  • google/gemma-3-12b-it v mistralai/ministral-3b lyrics, verdicts
  • anthropic/claude-3.7-sonnet v qwen/qwen-turbo lyrics, verdicts
  • openai/gpt-oss-20b v meta-llama/llama-3.2-1b-instruct lyrics, verdicts
  • openai/gpt-4o-mini v amazon/nova-lite-v1 lyrics, verdicts
  • openai/gpt-oss-120b v baidu/ernie-4.5-300b-a47b lyrics, verdicts
  • openai/gpt-5-mini v meituan/longcat-flash-chat lyrics, verdicts
  • z-ai/glm-4.5 v perplexity/sonar-pro lyrics, verdicts
  • qwen/qwen3-235b-a22b-2507 v meta-llama/llama-3-70b-instruct lyrics, verdicts
  • mistralai/mistral-nemo v undi95/remm-slerp-l2-13b lyrics, verdicts
  • openai/gpt-4.1-nano v openai/gpt-4-turbo lyrics, verdicts
  • meta-llama/llama-3.3-70b-instruct v anthropic/claude-3.5-sonnet-20240620 lyrics, verdicts
  • google/gemini-2.5-flash-lite-preview-06-17 v nousresearch/hermes-4-405b lyrics, verdicts
  • google/gemma-3-27b-it v minimax/minimax-m1 lyrics, verdicts
  • deepseek/deepseek-r1-0528 v amazon/nova-micro-v1 lyrics, verdicts
  • x-ai/grok-4 v nousresearch/hermes-4-70b lyrics, verdicts
  • deepseek/deepseek-v3.1-terminus v thudm/glm-4.1v-9b-thinking lyrics, verdicts
  • z-ai/glm-4.5-air v qwen/qwen-plus lyrics, verdicts
  • x-ai/grok-3-mini v mistralai/mistral-large lyrics, verdicts
  • google/gemini-2.0-flash-lite-001 v perplexity/sonar-reasoning-pro lyrics, verdicts
  • openai/gpt-5-chat v anthropic/claude-3-opus lyrics, verdicts
  • anthropic/claude-3.5-sonnet v qwen/qwen-2.5-vl-7b-instruct lyrics, verdicts
  • openai/gpt-5-nano v qwen/qwen-max lyrics, verdicts
  • deepseek/deepseek-chat v cognitivecomputations/dolphin3.0-mistral-24b lyrics, verdicts
  • openai/gpt-4o-mini-2024-07-18 v nvidia/llama-3.1-nemotron-70b-instruct lyrics, verdicts
  • openai/gpt-4o v mistralai/mixtral-8x22b-instruct lyrics, verdicts
  • meta-llama/llama-3.2-3b-instruct v cohere/command-r-08-2024 lyrics, verdicts
  • qwen/qwen3-235b-a22b-thinking-2507 v arcee-ai/afm-4.5b lyrics, verdicts
  • google/gemini-2.5-flash-lite-preview-09-2025 v neversleep/llama-3.1-lumimaid-8b lyrics, verdicts
  • alibaba/tongyi-deepresearch-30b-a3b v microsoft/phi-3.5-mini-128k-instruct lyrics, verdicts
  • tngtech/deepseek-r1t-chimera v mistralai/mistral-medium-3 lyrics, verdicts
  • mistralai/mistral-small-24b-instruct-2501 v qwen/qwen-plus-2025-07-28 lyrics, verdicts
  • meta-llama/llama-4-scout v nousresearch/deephermes-3-llama-3-8b-preview lyrics, verdicts
  • qwen/qwen3-next-80b-a3b-instruct v deepseek/deepseek-v3.1-base lyrics, verdicts
  • qwen/qwen3-32b v mistralai/mistral-small lyrics, verdicts
  • meta-llama/llama-3.1-70b-instruct v openai/gpt-4o-2024-05-13 lyrics, verdicts
  • meta-llama/llama-3.1-8b-instruct v mistralai/mistral-large-2407 lyrics, verdicts
  • deepseek/deepseek-r1 v openai/gpt-4 lyrics, verdicts
  • microsoft/wizardlm-2-8x22b v neversleep/noromaid-20b lyrics, verdicts
  • microsoft/mai-ds-r1 v ai21/jamba-mini-1.7 lyrics, verdicts
  • qwen/qwq-32b v openai/gpt-3.5-turbo-0613 lyrics, verdicts
  • openai/gpt-4o-2024-11-20 v microsoft/phi-4-multimodal-instruct lyrics, verdicts
  • qwen/qwen3-235b-a22b v anthracite-org/magnum-v4-72b lyrics, verdicts
  • qwen/qwen-2.5-7b-instruct v deepcogito/cogito-v2-preview-llama-109b-moe lyrics, verdicts
  • qwen/qwen3-30b-a3b v mistralai/mistral-7b-instruct-v0.3 lyrics, verdicts
  • openai/o4-mini v mistralai/magistral-medium-2506 lyrics, verdicts
  • anthropic/claude-3-haiku v cognitivecomputations/dolphin3.0-r1-mistral-24b lyrics, verdicts
  • nousresearch/hermes-3-llama-3.1-405b v bytedance/seed-oss-36b-instruct lyrics, verdicts
  • qwen/qwen3-14b v google/gemma-2-27b-it lyrics, verdicts
  • qwen/qwen-2.5-72b-instruct v baidu/ernie-4.5-vl-424b-a47b lyrics, verdicts
  • qwen/qwen3-30b-a3b-instruct-2507 v openai/o3-pro lyrics, verdicts
  • google/gemma-3-4b-it v microsoft/phi-3-mini-128k-instruct lyrics, verdicts
  • gryphe/mythomax-l2-13b v nvidia/llama-3.1-nemotron-ultra-253b-v1 lyrics, verdicts
  • qwen/qwen3-8b v inception/mercury lyrics, verdicts
  • nousresearch/hermes-3-llama-3.1-70b v ai21/jamba-large-1.7 lyrics, verdicts
  • mistralai/mistral-tiny v openai/gpt-3.5-turbo-16k lyrics, verdicts
  • google/gemma-3n-e4b-it v cohere/command-r-plus-08-2024 lyrics, verdicts
  • qwen/qwen3-max v openai/gpt-4-turbo-preview lyrics, verdicts
  • meta-llama/llama-3.1-405b-instruct v alpindale/goliath-120b lyrics, verdicts
  • mistralai/mixtral-8x7b-instruct v microsoft/phi-3-medium-128k-instruct lyrics, verdicts
  • sao10k/l3-lunaris-8b v baidu/ernie-4.5-21b-a3b lyrics, verdicts
  • deepseek/deepseek-v3.2-exp v openai/gpt-4-1106-preview lyrics, verdicts
  • perplexity/sonar-deep-research v mistralai/magistral-small-2506 lyrics, verdicts
  • thedrummer/rocinante-12b v cohere/command-r7b-12-2024 lyrics, verdicts
  • liquid/lfm-7b v openai/gpt-3.5-turbo-instruct lyrics, verdicts
  • minimax/minimax-01 v nousresearch/deephermes-3-mistral-24b-preview lyrics, verdicts
  • mistralai/mistral-7b-instruct v meta-llama/llama-3.1-405b lyrics, verdicts
  • mistralai/mistral-medium-3.1 v neversleep/llama-3-lumimaid-70b lyrics, verdicts
  • nousresearch/hermes-2-pro-llama-3-8b v stepfun-ai/step3 lyrics, verdicts
  • thedrummer/skyfall-36b-v2 v arcee-ai/virtuoso-large lyrics, verdicts
  • meta-llama/llama-3-8b-instruct v mancer/weaver lyrics, verdicts
  • openai/o4-mini-high v baidu/ernie-4.5-vl-28b-a3b lyrics, verdicts
  • mistralai/mistral-small-3.1-24b-instruct v mistralai/mistral-7b-instruct-v0.1 lyrics, verdicts
  • anthropic/claude-3.5-haiku-20241022 v openai/o1-pro lyrics, verdicts
  • openai/chatgpt-4o-latest v inflection/inflection-3-productivity lyrics, verdicts
  • qwen/qwen3-30b-a3b-thinking-2507 v inflection/inflection-3-pi lyrics, verdicts
  • mistralai/ministral-8b v openai/gpt-4-0314 lyrics, verdicts
  • microsoft/phi-4 v google/gemini-2.5-pro-preview lyrics, verdicts

Results

SSS: qwen/qwen-plus

SS: openai/o3-pro

S: qwen/qwen3-235b-a22b-2507

A: qwen/qwen3-235b-a22b-thinking-2507, openai/gpt-4-1106-preview, qwen/qwen3-30b-a3b-thinking-2507

B: deepseek/deepseek-v3.1-terminus, mistralai/mistral-medium-3.1, openai/o4-mini, qwen/qwen-max, qwen/qwen3-max

C: openai/gpt-5, openai/chatgpt-4o-latest, deepseek/deepseek-r1, google/gemini-2.5-flash-lite-preview-06-17, openai/gpt-4o-2024-05-13, nvidia/llama-3.1-nemotron-ultra-253b-v1, bytedance/seed-oss-36b-instruct, anthropic/claude-3-opus, qwen/qwen3-32b, qwen/qwen3-14b, anthropic/claude-3.5-sonnet

D: alibaba/tongyi-deepresearch-30b-a3b, stepfun-ai/step3, tngtech/deepseek-r1t-chimera, openai/o4-mini-high, anthropic/claude-sonnet-4, microsoft/wizardlm-2-8x22b, meituan/longcat-flash-chat, qwen/qwen3-235b-a22b, arcee-ai/virtuoso-large, meta-llama/llama-3.1-405b-instruct, deepseek/deepseek-chat, anthropic/claude-3.5-sonnet-20240620, openai/gpt-4-turbo, deepseek/deepseek-v3.1-base, qwen/qwen3-30b-a3b, google/gemma-3-27b-it, microsoft/mai-ds-r1, perplexity/sonar-reasoning-pro, qwen/qwq-32b, google/gemini-2.5-flash-lite, openai/gpt-oss-20b

F: perplexity/sonar-deep-research, google/gemma-3-4b-it, nousresearch/hermes-3-llama-3.1-70b, openai/gpt-4o-mini, nvidia/llama-3.1-nemotron-70b-instruct, google/gemma-3-12b-it, z-ai/glm-4.5, cohere/command-r7b-12-2024, google/gemini-2.5-pro-preview, qwen/qwen-plus-2025-07-28, nousresearch/hermes-4-70b, mistralai/mistral-large-2407, openai/gpt-3.5-turbo-16k, anthropic/claude-3.7-sonnet, openai/gpt-4-0314, deepcogito/cogito-v2-preview-llama-109b-moe, google/gemini-2.5-flash-lite-preview-09-2025, baidu/ernie-4.5-vl-424b-a47b, google/gemini-2.0-flash-001, cognitivecomputations/dolphin3.0-r1-mistral-24b, qwen/qwen3-8b, meta-llama/llama-4-scout, sao10k/l3-lunaris-8b, openai/gpt-4o, cohere/command-r-08-2024, openai/gpt-4o-2024-08-06, minimax/minimax-01, liquid/lfm-7b, deepseek/deepseek-r1-0528, x-ai/grok-4-fast, openai/o1-pro, google/gemini-2.5-pro, openai/gpt-oss-120b, microsoft/phi-4-multimodal-instruct, aion-labs/aion-1.0, mistralai/mistral-7b-instruct, mistralai/mistral-nemo, openai/gpt-4.1-mini, x-ai/grok-3-mini, mistralai/mistral-small-3.1-24b-instruct, cohere/command-r-plus-08-2024, mistralai/mixtral-8x7b-instruct, mancer/weaver

FF: microsoft/phi-4, openai/gpt-3.5-turbo-0613, openai/gpt-5-chat, anthropic/claude-3-haiku, mistralai/ministral-3b, inflection/inflection-3-pi, nousresearch/hermes-2-pro-llama-3-8b, qwen/qwen-2.5-vl-7b-instruct, baidu/ernie-4.5-21b-a3b, nousresearch/hermes-3-llama-3.1-405b, baidu/ernie-4.5-300b-a47b, thedrummer/skyfall-36b-v2, qwen/qwen-2.5-72b-instruct, qwen/qwen-turbo, openai/gpt-4, inception/mercury, alpindale/goliath-120b, google/gemini-2.0-flash-lite-001, neversleep/noromaid-20b, undi95/remm-slerp-l2-13b, perplexity/sonar-pro, mistralai/magistral-small-2506, openai/gpt-5-mini, baidu/ernie-4.5-vl-28b-a3b, google/gemma-3n-e4b-it, microsoft/phi-3-medium-128k-instruct, z-ai/glm-4.5-air, openai/gpt-4o-mini-2024-07-18, openai/gpt-3.5-turbo-instruct, nvidia/nemotron-nano-9b-v2, amazon/nova-lite-v1, gryphe/mythomax-l2-13b, liquid/lfm-3b, neversleep/llama-3.1-lumimaid-8b, nousresearch/hermes-4-405b, mistralai/mistral-large, mistralai/mistral-large-2411, nousresearch/deephermes-3-llama-3-8b-preview, anthropic/claude-3.5-haiku-20241022, nousresearch/deephermes-3-mistral-24b-preview, mistralai/magistral-medium-2506, mistralai/mistral-tiny, anthracite-org/magnum-v4-72b, deepseek/deepseek-v3.2-exp, x-ai/grok-4, openai/gpt-4o-2024-11-20, microsoft/phi-3.5-mini-128k-instruct, meta-llama/llama-3.1-8b-instruct, minimax/minimax-m1, openai/gpt-4.1-nano, microsoft/phi-3-mini-128k-instruct, mistralai/mistral-7b-instruct-v0.1, thudm/glm-4.1v-9b-thinking, openai/gpt-5-nano, mistralai/mistral-small-24b-instruct-2501, meta-llama/llama-3-70b-instruct, openai/gpt-4-turbo-preview, mistralai/ministral-8b, mistralai/mistral-small, deepseek/deepseek-chat-v3.1, thedrummer/rocinante-12b, qwen/qwen3-next-80b-a3b-instruct, mistralai/mistral-medium-3, meta-llama/llama-3.2-3b-instruct, cognitivecomputations/dolphin3.0-mistral-24b, ai21/jamba-mini-1.7, google/gemma-2-9b-it, mistralai/mistral-7b-instruct-v0.3, neversleep/llama-3-lumimaid-70b, google/gemma-2-27b-it, meta-llama/llama-3-8b-instruct, amazon/nova-micro-v1, mistralai/mixtral-8x22b-instruct, arcee-ai/afm-4.5b, ai21/jamba-large-1.7, openai/gpt-3.5-turbo, meta-llama/llama-3.1-405b, deepseek/deepseek-chat-v3-0324, deepseek/deepseek-r1-distill-qwen-32b, meta-llama/llama-3.3-70b-instruct, qwen/qwen-2.5-7b-instruct, qwen/qwen3-30b-a3b-instruct-2507, meta-llama/llama-3.2-1b-instruct, inflection/inflection-3-productivity, meta-llama/llama-3.1-70b-instruct

Reproducibility

This repository is a Snakemake workflow. To reproduce the entire tournament, clone it, delete the tournament directory, set the OPENROUTER_API_KEY environment variable, and run:

snakemake -c all

This will cost you around $30.
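
The steps above can be sketched as a single shell function. This is only an illustration: the function name reproduce_tournament is made up here, the HTTPS clone URL is an assumption derived from the repository name vadim0x60/rapbench, and the key check at the top is an extra safeguard, not something the workflow itself is known to do.

```shell
# Sketch of the reproduction steps; nothing runs until the function is called.
# reproduce_tournament is a hypothetical helper name, not part of the repository.
reproduce_tournament() {
  # Stop early if the key is missing, before any files are touched.
  if [ -z "${OPENROUTER_API_KEY:-}" ]; then
    echo "OPENROUTER_API_KEY is not set" >&2
    return 1
  fi

  # Assumed clone URL, derived from the repository name vadim0x60/rapbench.
  git clone https://github.com/vadim0x60/rapbench.git || return 1
  cd rapbench || return 1

  # Delete the published results so the workflow regenerates every battle.
  rm -rf tournament

  # Run the whole workflow on all available cores (this is the paid step).
  snakemake -c all
}
```

Export your own OPENROUTER_API_KEY first, then call reproduce_tournament. The function leaves your shell inside the rapbench directory when it finishes.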
