The report (report.pdf) is available in this repository.
TL;DR: To lower the latency of initial LLM generations on spot instances, do not compile the prefill phase. Furthermore, set `torch.compile(dynamic=True)` if you expect varying prefill shapes (prompt lengths), to avoid extra recompilations. Alternatively, pad prompts on the left to a fixed sequence length, but this fixes the prefill shape and may lead to suboptimal tensor optimisations.
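The recommendation above can be expressed as a short sketch (illustrative only, not code from this repo; the `generate` helper, greedy decoding, and the assumption that the model maps token ids of shape `(batch, seq_len)` to logits of shape `(batch, seq_len, vocab_size)` are mine): keep the prefill pass eager, compile only the decode loop, and pass `dynamic=True` so growing sequence lengths do not trigger recompilation.

```python
import torch

@torch.no_grad()
def generate(model, input_ids: torch.Tensor, max_new_tokens: int) -> torch.Tensor:
    # Prefill: run eagerly, so the first token is not delayed by compilation.
    logits = model(input_ids)
    next_token = logits[:, -1:, :].argmax(dim=-1)      # (batch, 1)
    tokens = torch.cat([input_ids, next_token], dim=1)

    # Decode: compile once; dynamic=True avoids recompiling when the
    # sequence length (and hence the input shape) changes between calls.
    compiled_model = torch.compile(model, dynamic=True)
    for _ in range(max_new_tokens - 1):
        logits = compiled_model(tokens)
        next_token = logits[:, -1:, :].argmax(dim=-1)
        tokens = torch.cat([tokens, next_token], dim=1)
    return tokens
```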
📚 This repository contains files used in the project Studying the Overhead of torch.compile()
in Large Language Model Inference.
The structure of the repo is as follows:
benchmarking/ -- code used for the prefill+decode experiments.
compare_llm.py -- profiling comparison between compiled and non-compiled models.
compile_multiple_inputs.py -- code for generating dynamically changing prefill sequence lengths.
example_llm.py -- an example compiled LLM.
profile_ops.py -- script for profiling the operations of PyTorch code (works in both compiled and eager modes); a rough sketch of this kind of profiling is shown after this list.
model.py -- GPT-2 implementation from https://github.com/karpathy/nanoGPT.
visualise_results.ipynb -- notebook containing the code for plotting the results in the report.
prompts2.txt -- file containing synthetic prompts for the experiments.
cupcake_ipsum.txt -- file containing a synthetically long context.
benchmarking_results_final/ -- results from the prefill+decode experiments.
resources/ -- resources for the report, e.g. images and plots.
report.pdf -- a report with some cool insights. Heads up -- it was written overnight to make the deadline, so if you spot any typos, that is likely the reason 😉
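For reference, here is a minimal, self-contained sketch of the kind of eager-vs-compiled operator profiling that profile_ops.py performs. The helper name `profile_model`, the example inputs, and the printed table are my illustration, not the repo's actual script.

```python
import torch
from torch.profiler import profile, ProfilerActivity

def profile_model(model, example_input, label):
    # Run one forward pass under the PyTorch profiler and print the top operators.
    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
        with torch.no_grad():
            model(example_input)
    print(f"=== {label} ===")
    print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))

# Usage (hypothetical model and inputs):
# ids = torch.randint(0, 50257, (1, 128))
# profile_model(gpt2, ids, "eager")
# profile_model(torch.compile(gpt2), ids, "compiled")
```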
This work was submitted as part of the module assignment for R244: Large Scale Data Optimisation and Processing.