
LLM Benchmark Framework

Objective

The objective of this benchmark is to evaluate the performance of language models across different scenarios. It is part of the AI/RUN TM Engineering Benchmark. See the AI/RUN TM Engineering Benchmark repo for the full picture of what the benchmark covers and which repositories are involved.

Evaluation Scenarios

We assess the models using various scenarios such as:

  1. Code transformation between different technologies
  2. Code generation
  3. Documentation generation
  4. Large-context instruction following (LCIF)

These scenarios allow us to comprehensively evaluate the capabilities and limitations of language models in handling diverse programming tasks and developer interactions.

The dataset for the testing scenarios was created from the codebases of several open-source repositories.

How to Set Up the Benchmark

Clone repositories

To complete the benchmark, you need to clone one additional repository:

  • AIRUN-LLM-Benchmark-Results - stores the criteria and results of the benchmark

Prepare Python Virtual Environment

  1. Install the prerequisites.
  2. Run:
poetry install
  3. Install the pre-commit hooks (one-time setup):
pre-commit install
  4. (Optional) Connect your Python virtual environment to your IDE.

Environment Variables Setup

Before running the scripts, create a .env file in the root directory of the project using .env.example as a template. Fill in all the necessary environment variables with values specific to your environment.

cp .env.example .env
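The variable names the scripts expect are listed in .env.example. As a minimal sketch of how such variables are typically consumed, assuming the project reads them through python-dotenv (the key name below is purely illustrative):

  import os
  from dotenv import load_dotenv  # assumes python-dotenv is available in the Poetry environment

  load_dotenv()  # loads key=value pairs from .env into the process environment
  api_key = os.getenv("OPENAI_API_KEY")  # illustrative key name; use the keys from .env.example
  if not api_key:
      raise RuntimeError("Missing environment variable - check your .env file")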

Prepare for the experiment

Add a new model

If you want to add a new model to the benchmark, follow these steps:

  1. Go to config.py and add your model to the Model class.
  2. Use your model in the run_tasks.ipynb notebook by selecting it from the Model class.
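The sketch below assumes Model in config.py is an enum-style class that maps a readable name to the provider's model identifier; the member names and values are illustrative, not the actual contents of config.py:

  from enum import Enum

  class Model(Enum):
      GPT_4O = "gpt-4o"                 # illustrative existing member
      MY_NEW_MODEL = "my-new-model-id"  # your addition: the value is the identifier the provider expects

Once the member exists, it can be selected in the first cell of run_tasks.ipynb as described above.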

Extend dataset

If you want to add a new language or repository to the benchmark, follow these steps:

  1. Create a new directory in the Dataset folder with the name of your language (e.g., "JS" or "Java").
  2. Add your repository to the new directory. The repository should contain the code files you want to use in the prompt.
  3. Add information about the repository to the Utils/constants.py file. For example:
    • 'ToDoApp_ReactJS': 'high_avg' means the repository "ToDoApp_ReactJS" has high complexity and average size.
    • 'ReactSelect': 'React' means the repository "ReactSelect" uses the React technology.
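A rough sketch of what such entries might look like in Utils/constants.py, assuming they live in plain dictionaries keyed by repository name (the dictionary names and the new entries are hypothetical; follow the structure already present in the file):

  # Utils/constants.py - illustrative structure only
  REPO_COMPLEXITY_AND_SIZE = {
      'ToDoApp_ReactJS': 'high_avg',  # high complexity, average size
      'MyNewRepo': 'low_low',         # hypothetical new repository
  }

  REPO_TECHNOLOGY = {
      'ReactSelect': 'React',
      'MyNewRepo': 'Angular',         # hypothetical new repository
  }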

Extend categories and scenarios

If you want to add a new scenario to the benchmark, follow these steps:

  1. Create a new directory Scenarios/Tasks/{language} if directory for your language does not exist.
  2. Add your category (e.g., "component_test") to the Scenarios/Tasks/{language} directory.
  3. Add your scenario (e.g., "WriteTestsForComponent_RepoName_complexity_size") to the Scenarios/Tasks/{language} directory.
  4. Don't forget to add <place_code_here repo="REPO_NAME"/> in your scenario file to enrich the template with the code from the repository during test run.
  5. Add criteria to {results-repo}/Criteria/{language}/{category} for evaluating the results.
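For illustration, a scenario file is a prompt template containing the placeholder tag from step 4; a hypothetical WriteTestsForComponent scenario might look like this (only the tag syntax comes from this guide, the prompt wording is made up):

  Write unit tests for the main component of the application below.
  <place_code_here repo="ToDoApp_ReactJS"/>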

How to complete the experiment

Run the benchmark with standard categories

  1. Open the run_tasks.ipynb notebook.
  2. Start from the first cell, where you can set the model and the scenarios to run or skip.
  3. The next cell generates a summary report in the AIRUN-LLM-Benchmark-Results repository.
  4. The last cell evaluates the results and generates a report in the AIRUN-LLM-Benchmark-Results repository.
  5. The results appear in the AIRUN-LLM-Benchmark-Results repository, in the Output/{model}/{language} directory.
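As a rough, non-authoritative sketch of what the first cell's configuration amounts to (every name here is an assumption; check the notebook itself for the real variables):

  from config import Model  # the Model class described above; the import path may differ

  model = Model.MY_NEW_MODEL             # hypothetical member added in config.py
  scenarios_to_run = ["component_test"]  # hypothetical filter: categories to include
  scenarios_to_skip = []                 # hypothetical filter: categories to leave out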

Run LCIF experiment

  1. Open the run_contextual_task.ipynb notebook.
  2. Set the model to use in the experiment.
  3. Run all cells.
  4. The results appear in the AIRUN-LLM-Benchmark-Results repository, in the Output/{model}/{language}/contextual_experiment directory.

Contributing

We appreciate all contributions to improve the AI/RUN TM Engineering Benchmark. Please see our Contribution Guidelines for more information on how to get involved.

If you have suggestions for new benchmark scenarios or improvements to existing ones, please open an issue or submit a pull request.

📄 License

This project is licensed under the Apache 2.0 License.

EPAM and EPAM AI/RUN TM are trademarks of EPAM Systems, Inc.
