A calculator that estimates the memory footprint, capacity, and latency of your planned LLM application on different GPU architectures, based on your workload requirements. Blog: https://blogs.vmware.com/cloud-foundation/2024/09/25/llm-inference-sizing-and-performance-guidance/
Prerequisite: pip install -r requirements.txt
The script accepts the following flags (with their abbreviations):
- num_gpu ('-g'): Specify the number of GPUs you plan to use for your deployment.
- prompt_sz ('-p'): Define the average size, in tokens, of the input prompts you expect to process.
- response_sz ('-r'): Set the average size, in tokens, of the responses you expect to generate.
- n_concurrent_req ('-c'): Indicate the number of concurrent requests you anticipate handling.
By modifying these variables, you can easily estimate the performance characteristics of your LLM deployment and make informed decisions about your infrastructure requirements.
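These flags map naturally onto a small argparse front end. The sketch below is a minimal reconstruction of that CLI surface; the defaults and help strings are illustrative assumptions, not copied from LLM_size_pef_calculator.py.

```python
import argparse

# Minimal sketch of the CLI described above. Defaults and help text are
# assumptions; only the flag names/abbreviations come from this README.
parser = argparse.ArgumentParser(
    description="Estimate LLM memory footprint, capacity, and latency."
)
parser.add_argument("-g", "--num_gpu", type=int, default=1,
                    help="Number of GPUs planned for the deployment")
parser.add_argument("-p", "--prompt_sz", type=int, default=4096,
                    help="Average input prompt size, in tokens")
parser.add_argument("-r", "--response_sz", type=int, default=256,
                    help="Average generated response size, in tokens")
parser.add_argument("-c", "--n_concurrent_req", type=int, default=10,
                    help="Number of concurrent requests to serve")
args = parser.parse_args()

print(f"num_gpu = {args.num_gpu}, prompt_size = {args.prompt_sz} tokens, "
      f"response_size = {args.response_sz} tokens")
print(f"n_concurrent_request = {args.n_concurrent_req}")
```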
python LLM_size_pef_calculator.py -g 4 -p 4096 -r 256 -c 10
num_gpu = 4, prompt_size = 4096 tokens, response_size = 256 tokens
n_concurrent_request = 10
******************** Estimate LLM Memory Footprint ********************
| Model           | Input Size (tokens) | Output Size (tokens) | Concurrent Requests | KV Cache Size per Token | Memory Footprint |
|-----------------+---------------------+----------------------+---------------------+-------------------------+------------------|
| Llama-3.1-8B    | 4096                | 256                  | 10                  | 0.000122 GiB/token      | 21.31 GB         |
| Llama-3.1-70B   | 4096                | 256                  | 10                  | 0.000305 GiB/token      | 153.28 GB        |
| Mistral-7B-v0.3 | 4096                | 256                  | 10                  | 0.000122 GiB/token      | 19.31 GB         |
| Qwen2.5-14B     | 4096                | 256                  | 10                  | 0.000183 GiB/token      | 37.37 GB         |
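The footprint column can be sanity-checked by hand: the numbers above are consistent with FP16 weights (2 bytes per parameter) plus a KV cache of 2 (K and V) x n_layers x n_kv_heads x head_dim x 2 bytes per token, held for every prompt and response token across all concurrent requests. A minimal sketch for the Llama-3.1-8B row, using the model's public config shape (an assumption; the script's internal model table may differ):

```python
# Reproduces the Llama-3.1-8B footprint row. Model shape is the public
# Llama-3.1-8B config (an assumption; not read from the script).
n_layers, n_kv_heads, head_dim = 32, 8, 128
n_params   = 8e9
bytes_fp16 = 2

prompt_sz, response_sz, n_concurrent_req = 4096, 256, 10

# K and V vectors, per layer, per token
kv_bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_fp16
kv_gib_per_token   = kv_bytes_per_token / 1024**3   # ~0.000122 GiB/token

weights_gb = n_params * bytes_fp16 / 1e9             # 16 GB for FP16 weights
# The script appears to add GB and GiB directly; doing the same here
kv_gb = kv_gib_per_token * (prompt_sz + response_sz) * n_concurrent_req

print(f"{kv_gib_per_token:.6f} GiB/token, "
      f"footprint ~ {weights_gb + kv_gb:.2f} GB")    # -> 0.000122, 21.31
```

The same arithmetic covers the other rows; for Llama-3.1-70B, 80 layers give 0.000305 GiB/token, and 140 GB of weights plus 13.28 GB of KV cache yields the 153.28 GB shown.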
******************** Estimate LLM Capacity and Latency ********************
| Model           | GPU      | Input Size (tokens) | Output Size (tokens) | Concurrent Requests | Max # KV Cache Tokens | Prefill Time | TPOT (ms) | TTFT    | E2E Latency | Output Tokens Throughput |
|-----------------+----------+---------------------+----------------------+---------------------+-----------------------+--------------+-----------+---------+-------------+--------------------------|
| Llama-3.1-8B    | L40s     | 4096                | 256                  | 10                  | 1441792               | 0.011 s      | 4.630 ms  | 0.016 s | 1.2 s       | 208.05 tokens/sec        |
| Llama-3.1-8B    | H100 NVL | 4096                | 256                  | 10                  | 2949120               | 0.005 s      | 1.026 ms  | 0.006 s | 0.3 s       | 907.24 tokens/sec        |
| Llama-3.1-8B    | H200 NVL | 4096                | 256                  | 10                  | 4489216               | 0.005 s      | 0.833 ms  | 0.006 s | 0.2 s       | 1098.98 tokens/sec       |
| Llama-3.1-8B    | MI300X   | 4096                | 256                  | 10                  | 6160384               | 0.003 s      | 0.755 ms  | 0.004 s | 0.2 s       | 1244.27 tokens/sec       |
| Llama-3.1-70B   | L40s     | 4096                | 256                  | 10                  | 170393                | 0.097 s      | 40.509 ms | 0.137 s | 10.8 s      | 23.78 tokens/sec         |
| Llama-3.1-70B   | H100 NVL | 4096                | 256                  | 10                  | 773324                | 0.042 s      | 8.974 ms  | 0.051 s | 2.5 s       | 103.68 tokens/sec        |
| Llama-3.1-70B   | H200 NVL | 4096                | 256                  | 10                  | 1389363               | 0.042 s      | 7.292 ms  | 0.049 s | 2.0 s       | 125.60 tokens/sec        |
| Llama-3.1-70B   | MI300X   | 4096                | 256                  | 10                  | 2057830               | 0.027 s      | 6.604 ms  | 0.033 s | 1.8 s       | 142.20 tokens/sec        |
| Mistral-7B-v0.3 | L40s     | 4096                | 256                  | 10                  | 1458176               | 0.010 s      | 4.051 ms  | 0.014 s | 1.1 s       | 237.78 tokens/sec        |
| Mistral-7B-v0.3 | H100 NVL | 4096                | 256                  | 10                  | 2965504               | 0.004 s      | 0.897 ms  | 0.005 s | 0.2 s       | 1036.85 tokens/sec       |
| Mistral-7B-v0.3 | H200 NVL | 4096                | 256                  | 10                  | 4505600               | 0.004 s      | 0.729 ms  | 0.005 s | 0.2 s       | 1255.98 tokens/sec       |
| Mistral-7B-v0.3 | MI300X   | 4096                | 256                  | 10                  | 6176768               | 0.003 s      | 0.660 ms  | 0.003 s | 0.2 s       | 1422.02 tokens/sec       |
| Qwen2.5-14B     | L40s     | 4096                | 256                  | 10                  | 888012                | 0.020 s      | 8.507 ms  | 0.029 s | 2.3 s       | 113.23 tokens/sec        |
| Qwen2.5-14B     | H100 NVL | 4096                | 256                  | 10                  | 1892898               | 0.009 s      | 1.885 ms  | 0.011 s | 0.5 s       | 493.74 tokens/sec        |
| Qwen2.5-14B     | H200 NVL | 4096                | 256                  | 10                  | 2919628               | 0.009 s      | 1.531 ms  | 0.010 s | 0.4 s       | 598.08 tokens/sec        |
| Qwen2.5-14B     | MI300X   | 4096                | 256                  | 10                  | 4033740               | 0.006 s      | 1.387 ms  | 0.007 s | 0.4 s       | 677.15 tokens/sec        |
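These columns follow a standard roofline model: prefill is compute-bound (about 2 FLOPs per parameter per prompt token), decode is memory-bandwidth-bound (each output token re-reads the full weights), TTFT is the prefill time plus one decode step, and the KV-cache ceiling is whatever GPU memory the weights leave free. A minimal sketch that reproduces the Llama-3.1-8B / L40s row, using public L40S datasheet figures as assumptions (the script's internal GPU table may differ):

```python
# Roofline-style sketch behind the capacity/latency table, checked against
# the Llama-3.1-8B / L40s row. GPU figures are public datasheet values,
# assumed here rather than read from the script: 48 GB memory, 864 GB/s
# bandwidth, ~1466 TFLOPS peak FP8 Tensor throughput (with sparsity).
num_gpu, prompt_sz, response_sz = 4, 4096, 256
n_params       = 8e9                    # Llama-3.1-8B
weights_gb     = n_params * 2 / 1e9     # FP16 weights -> 16 GB
kv_gib_per_tok = 131072 / 1024**3       # from the footprint table
gpu_mem_gb, mem_bw_gbs, peak_tflops = 48.0, 864.0, 1466.0

# KV-cache token ceiling: whatever memory the weights leave free
# (GB and GiB are mixed here to match the script's output)
max_kv_tokens = (num_gpu * gpu_mem_gb - weights_gb) / kv_gib_per_tok

# Prefill is compute-bound: ~2 FLOPs per parameter per prompt token
prefill_s = prompt_sz * 2 * n_params / (num_gpu * peak_tflops * 1e12)

# Decode is memory-bandwidth-bound: each output token re-reads the weights
tpot_s = weights_gb / (num_gpu * mem_bw_gbs)

ttft_s = prefill_s + tpot_s
e2e_s  = ttft_s + tpot_s * response_sz

print(f"max KV tokens ~ {max_kv_tokens:,.0f}")    # -> 1,441,792
print(f"prefill {prefill_s:.3f} s, TPOT {tpot_s * 1e3:.3f} ms, "
      f"TTFT {ttft_s:.3f} s, E2E {e2e_s:.1f} s")
# -> prefill 0.011 s, TPOT 4.630 ms, TTFT 0.016 s, E2E 1.2 s
```

Output tokens throughput is then roughly response_sz / e2e_s, about 213 tokens/sec here against the 208.05 reported, so the script presumably plugs a slightly different FLOPS constant into that step.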