The objective of this benchmark is to evaluate the performance of language models in different scenarios. It is part of the AI/RUN TM Engineering Benchmark. See the AI/RUN TM Engineering Benchmark repo for the full picture of what the benchmark is and which repositories are involved.
We assess the models using various scenarios such as:
- Code transformation between different technologies
- Code generation
- Documentation generation
- Large-context instruction following
These scenarios allow us to comprehensively evaluate the capabilities and limitations of language models in handling diverse programming tasks and developer interactions.
The dataset for the test scenarios was created from the codebases of the following open-source repositories:
- https://github.com/CosmoCMS/Cosmo
- https://github.com/danjac/podbaby
- https://github.com/tastejs/todomvc/tree/master/examples/typescript-react/js
- https://github.com/tastejs/todomvc/tree/master/examples/typescript-angular/js
- https://github.com/tastejs/todomvc/tree/master/examples/jquery
- https://github.com/algorithm-visualizer/algorithm-visualizer
To run the benchmark, you also need to clone an additional repository:
- AIRUN-LLM-Benchmark-Results - for storing the evaluation criteria and the benchmark results
- Install prerequisites:
  - Python (>= 3.12)
  - Poetry
- Run `poetry install`
- Install pre-commit hooks (one-time setup): `pre-commit install`
- (Optional) Connect your Python venv to your IDE
Before running the scripts, create a `.env` file in the root directory of the project, using `.env.example` as a template, and fill in all the necessary environment variables with values specific to your environment:

`cp .env.example .env`
If you want to add a new model to the benchmark, follow these steps:
- Go to `config.py` and add your model to the `Model` class (a minimal sketch follows below).
- Use your model in the `run_tasks.ipynb` notebook by selecting it from the `Model` class.
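For illustration, here is a minimal sketch of what such an addition might look like, assuming `Model` is an `Enum` that maps names to model identifiers; the existing entry and the string values below are placeholders, not the real contents of `config.py`:

```python
# Hypothetical sketch of the Model class in config.py -- the real entries and identifiers may differ.
from enum import Enum


class Model(Enum):
    GPT_4O = "gpt-4o"              # placeholder for an existing entry
    MY_NEW_MODEL = "my-new-model"  # your new model's identifier goes here


# In run_tasks.ipynb you would then select the new entry, e.g. Model.MY_NEW_MODEL.
```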
If you want to add a new language or repository to the benchmark, follow these steps:
- Create a new directory in the `Dataset` folder with the name of your language (e.g., "JS" or "Java").
- Add your repository to the new directory. The repository should contain the code files you want to use in the prompt.
- Add information about the repository to the `Utils/constants.py` file (see the sketch below). This includes entries such as:
  - `'ToDoApp_ReactJS': 'high_avg'`: the repository "ToDoApp_ReactJS" has high complexity and average size.
  - `'ReactSelect': 'React'`: the repository "ReactSelect" uses the React technology.
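A minimal sketch of what these entries might look like in `Utils/constants.py`; the dictionary names below are hypothetical, only the key/value format follows the examples above:

```python
# Hypothetical dictionaries in Utils/constants.py -- the real names may differ.
# Keys are repository names; values encode "<complexity>_<size>" or the technology used.
REPO_COMPLEXITY_AND_SIZE = {
    "ToDoApp_ReactJS": "high_avg",  # high complexity, average size (example from this README)
    "MyNewRepo": "low_avg",         # hypothetical new repository
}

REPO_TECHNOLOGY = {
    "ReactSelect": "React",  # example from this README
    "MyNewRepo": "jQuery",   # hypothetical new repository
}
```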
If you want to add a new scenario to the benchmark, follow these steps:
- Create a new directory `Scenarios/Tasks/{language}` if a directory for your language does not exist yet.
- Add your category (e.g., "component_test") to the `Scenarios/Tasks/{language}` directory.
- Add your scenario (e.g., "WriteTestsForComponent_RepoName_complexity_size") to the `Scenarios/Tasks/{language}` directory.
- Don't forget to add `<place_code_here repo="REPO_NAME"/>` in your scenario file so that the template is enriched with the code from the repository during the test run (see the example below).
- Add criteria to `{results-repo}/Criteria/{language}/{category}` for evaluating the results.
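For example, a scenario file is a prompt template with the placeholder inside it; a hypothetical, minimal version might look like this (only the `<place_code_here .../>` syntax is taken from the steps above, the surrounding wording is made up):

```text
Your task is to write unit tests for the application code below.

<place_code_here repo="ToDoApp_ReactJS"/>

Return the tests in a single code block.
```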
- Open the `run_tasks.ipynb` notebook.
- In the first cell, set the model and the scenarios to run or skip (a rough sketch is shown below).
- The next cell generates a summary report in the AIRUN-LLM-Benchmark-Results repository.
- The last cell evaluates the results and generates a report in the AIRUN-LLM-Benchmark-Results repository.
- Results will be placed in the AIRUN-LLM-Benchmark-Results repository, in the `Output/{model}/{language}` directory.
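As a rough sketch only (the variable names are hypothetical and may not match the notebook; see `run_tasks.ipynb` itself for the actual cell), the configuration in the first cell amounts to choosing a model and the scenarios to run or skip:

```python
# Hypothetical first cell of run_tasks.ipynb -- the actual variable names may differ.
from config import Model

model = Model.MY_NEW_MODEL            # model to benchmark, as defined in config.py
languages = ["JS"]                    # language datasets to include
skip_categories = ["component_test"]  # scenario categories to skip, if any
```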
- Open the `run_contextual_task.ipynb` notebook.
- Change the model to use in the experiment.
- Run all cells.
- Results will be placed in the AIRUN-LLM-Benchmark-Results repository, in the `Output/{model}/{language}/contextual_experiment` directory.
We appreciate all contributions to improve the AI/RUN TM Engineering Benchmark. Please see our Contribution Guidelines for more information on how to get involved.
If you have suggestions for new benchmark scenarios or improvements to existing ones, please open an issue or submit a pull request.
This project is licensed under the Apache 2.0 License.
EPAM and EPAM AI/RUN TM are trademarks of EPAM Systems, Inc.