
sunbird 🐦‍🔥


Overview

This project translates x86-64 assembly back into C code using a machine learning model trained on a dataset of C code snippets. Each snippet is compiled at multiple optimization levels across different compilers, and the resulting assembly is tokenized for use in training.

Dataset

  • the model was trained on an augmented version of this dataset
  • each snippet of C code is compiled (by default) at the first four optimization levels (-O0 through -O3) of both GCC and Clang, yielding 8 unique assembly snippets for each element in the initial dataset (2.5 million snippets in total; see the sanity check below)
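
As a quick sanity check on the counts above, the default compiler/optimization matrix can be enumerated directly. The flag spellings below are the conventional GCC/Clang ones and are an assumption here, not copied from the repository:

```python
# Enumerate the default (compiler, optimization level) matrix described above.
from itertools import product

compilers = ["gcc", "clang"]
opt_levels = ["-O0", "-O1", "-O2", "-O3"]  # the "first four" levels

variants = list(product(compilers, opt_levels))
assert len(variants) == 8  # 8 assembly variants per C snippet

# 2.5 million total assembly snippets implies roughly
# 2_500_000 / 8 = 312_500 C snippets in the source dataset.
print(2_500_000 // len(variants))  # 312500
```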

If the kaggle CLI is on your PATH, the original dataset can be downloaded with:

kaggle datasets download -d shirshaka/c-code-snippets-and-their-labels && \
unzip -d dataset c-code-snippets-and-their-labels.zip

Generation

  • compilation is performed as needed when calling DatasetIterator.take(n)
    • compilation settings, including optimization levels and compiler choices, are specified in the arguments to this method call (a hypothetical usage sketch follows this list)
  • the exact flags passed to the compilation subprocesses are specified in the .compile() methods in compilation.py
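
The sketch below shows how such a call might look. DatasetIterator and .take() come from the repository, but the import path, constructor argument, and keyword names shown here are assumptions, not the actual signature; the real compilation flags live in the .compile() methods in compilation.py:

```python
# Hypothetical sketch only: the import path, constructor argument, and
# keyword names below are assumed, not taken from the repository.
from dataset import DatasetIterator  # assumed module name

iterator = DatasetIterator("dataset/")  # directory produced by the unzip step above

# Compilation happens lazily inside take(); settings such as optimization
# levels and compiler choices are passed as arguments to this call.
samples = iterator.take(
    1000,
    compilers=["gcc", "clang"],                        # assumed keyword
    optimization_levels=["-O0", "-O1", "-O2", "-O3"],  # assumed keyword
)

for c_source, assembly in samples:  # assumed (C source, assembly) pairing
    ...
```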

Tokenization

C and assembly code snippets are tokenized semantically using the tree-sitter library. Each token includes raw text paired with its symbolic identity, e.g., (variable, 42).
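
A minimal sketch of this style of tokenization, assuming a recent py-tree-sitter plus the tree-sitter-c grammar package (pip install tree-sitter tree-sitter-c). The (symbolic type, raw text) pairing below illustrates the idea but is not necessarily sunbird's exact token format:

```python
# Pair each leaf node's grammar type with its raw source text.
import tree_sitter_c
from tree_sitter import Language, Parser

# Parser(language) requires a recent py-tree-sitter; older versions
# used parser.set_language(...) instead.
parser = Parser(Language(tree_sitter_c.language()))

def tokenize(source: bytes) -> list[tuple[str, str]]:
    """Collect (node type, raw text) pairs for every leaf in the parse tree."""
    tokens: list[tuple[str, str]] = []

    def walk(node) -> None:
        if node.child_count == 0:  # leaf nodes carry the actual source text
            tokens.append((node.type, node.text.decode()))
        for child in node.children:
            walk(child)

    walk(parser.parse(source).root_node)
    return tokens

print(tokenize(b"int x = 42;"))
# e.g. [('primitive_type', 'int'), ('identifier', 'x'), ('=', '='),
#       ('number_literal', '42'), (';', ';')]
```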

About

Training pipeline for decompilation-oriented machine learning models
