Code Red! On the Harmfulness of Applying Off-the-shelf Large Language Models to Programming Tasks

Replication Package for the FSE'25 Paper Titled: Code Red! On the Harmfulness of Applying Off-the-shelf Large Language Models to Programming Tasks

Additional visualisations of our data are available at AISE-TUDelft/Code-Red-Benchmark.

Instructions

To run the project, follow these steps:

1. Create a virtual environment with conda: conda create --name myenv python=3.10.8 (Python versions newer than 3.10 are not supported by Autosklearn).
2. Activate the virtual environment: conda activate myenv
3. Install the required Python packages: pip install -r requirements.txt
4. Create two files containing the respective API keys: openai.key and openrouter.key. Running a single generation with the models we selected requires approximately €1 in OpenAI credits and €30 in OpenRouter credits.
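Before opening the notebooks, it can help to confirm that the environment matches the setup steps. The following is a minimal sketch, not part of the replication package: the file names openai.key and openrouter.key come from the instructions, while the helper names (python_version_ok, load_api_key, missing_key_files) are hypothetical, and the check for exactly Python 3.10 is an assumption that is stricter than the note about Autosklearn.

```python
import sys
from pathlib import Path

# File names come from the setup instructions; the helper names are hypothetical.
KEY_FILES = ("openai.key", "openrouter.key")


def python_version_ok(version_info=sys.version_info) -> bool:
    """Return True when the interpreter belongs to the Python 3.10 series."""
    return tuple(version_info[:2]) == (3, 10)


def load_api_key(path) -> str:
    """Read an API key from a text file and strip surrounding whitespace."""
    key = Path(path).read_text(encoding="utf-8").strip()
    if not key:
        raise ValueError(f"{path} is empty")
    return key


def missing_key_files(folder=".") -> list:
    """List the expected key files that are absent from `folder`."""
    return [name for name in KEY_FILES if not (Path(folder) / name).is_file()]


if __name__ == "__main__":
    if not python_version_ok():
        print("Warning: expected Python 3.10, found", sys.version.split()[0])
    for name in missing_key_files():
        print("Missing key file:", name)
```

Whether the notebooks read the key files from the repository root is an assumption here; adjust the folder argument if they expect a different location.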

The code was tested on an Ubuntu 20.04 LTS machine with 32GB of RAM and an Intel Core i9-12900HK processor.

Results

The results of the experiments are in the ./results folder. Each file contains the results of a single model for a single generation over the entire dataset. The trained classifier is in the ./classification_models folder; to save space, we provide only the best-performing model.
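To get an overview of which model and generation runs are present, the result files can be enumerated with the standard library. This is a small sketch under assumptions: the function name list_result_files is hypothetical, and because the file format is not stated here, the sketch only lists paths and does not attempt to parse them.

```python
from pathlib import Path


def list_result_files(results_dir="./results") -> list:
    """Return the files directly inside `results_dir`, sorted by path.

    Subfolders (for example a folder of tagged results) are skipped,
    so each returned path corresponds to one result file.
    """
    root = Path(results_dir)
    return sorted(p for p in root.iterdir() if p.is_file())


if __name__ == "__main__":
    for path in list_result_files():
        print(path.name)
```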

Replication steps

1. Classifier Training: The classifier.ipynb notebook runs the experiments on the labelled data to create the classifiers. The classifiers are saved in the ./classification_models folder.
2. Sample Generation: The generation.ipynb notebook runs generation with all the models. The results are saved in the ./results folder.
3. Sample Tagging: The tagging.ipynb notebook uses the classifier from step 1 to label the samples generated in step 2. The results are saved in the ./results/tagged folder.
4. Plotting: The plots.ipynb notebook compiles all the results into several figures used in the paper. Each figure is saved in the ./plots folder.
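The replication steps are order-dependent: tagging needs the classifier and the generated samples, and plotting needs the tagged results. One possible way to run them non-interactively is sketched below. It assumes that jupyter and nbconvert are installed in the environment (this is not confirmed by the instructions), and the function names are hypothetical. The sketch defaults to a dry run because the generation step spends API credits.

```python
import subprocess

# Notebook names and their order come from the replication steps.
NOTEBOOKS = ["classifier.ipynb", "generation.ipynb", "tagging.ipynb", "plots.ipynb"]


def nbconvert_command(notebook: str) -> list:
    """Build a `jupyter nbconvert` command that executes a notebook in place."""
    return ["jupyter", "nbconvert", "--to", "notebook", "--execute", "--inplace", notebook]


def run_pipeline(notebooks=NOTEBOOKS, dry_run=True) -> list:
    """Return the commands in pipeline order; execute them only if dry_run is False."""
    commands = [nbconvert_command(nb) for nb in notebooks]
    if not dry_run:
        for command in commands:
            # check=True stops the pipeline at the first failing notebook.
            subprocess.run(command, check=True)
    return commands


if __name__ == "__main__":
    for command in run_pipeline(dry_run=True):
        print(" ".join(command))
```

Calling run_pipeline(dry_run=False) would execute all four notebooks in sequence; running them by hand in the same order is equivalent.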

Citation

Please cite our paper if you find our work useful:

@misc{alkaswan2025codered,
      title={Code Red! On the Harmfulness of Applying Off-the-shelf Large Language Models to Programming Tasks}, 
      author={Ali Al-Kaswan and Sebastian Deatc and Begüm Koç and Arie van Deursen and Maliheh Izadi},
      year={2025},
      eprint={2504.01850},
      archivePrefix={arXiv},
      primaryClass={cs.SE},
      url={https://arxiv.org/abs/2504.01850}, 
}

