
sunbird 🐦‍🔥


Overview

This project translates x86-64 assembly back into C code using a machine learning model trained on a dataset of C code snippets. Each snippet is compiled at multiple optimization levels across different compilers, and the resulting assembly is tokenized for use in training.

Dataset

  • the model was trained on an augmented version of this dataset
  • each snippet of C code is compiled (by default) at the first four optimization levels (-O0 through -O3) of both GCC and Clang, yielding 8 unique assembly snippets for each element in the initial dataset (2.5 million snippets in total; see the sanity check below)
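
As a quick sanity check on the counts above, the default compiler/optimization matrix can be enumerated directly. The flag spellings below are the conventional GCC/Clang ones and are an assumption here, not copied from the repository:

```python
# Enumerate the default (compiler, optimization level) matrix described above.
from itertools import product

compilers = ["gcc", "clang"]
opt_levels = ["-O0", "-O1", "-O2", "-O3"]  # the "first four" levels

variants = list(product(compilers, opt_levels))
assert len(variants) == 8  # 8 assembly variants per C snippet

# 2.5 million total assembly snippets implies roughly
# 2_500_000 / 8 = 312_500 C snippets in the source dataset.
print(2_500_000 // len(variants))  # 312500
```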

If the kaggle CLI is on your PATH, the original dataset can be downloaded with:

kaggle datasets download -d shirshaka/c-code-snippets-and-their-labels && \
unzip -d dataset c-code-snippets-and-their-labels.zip

Generation

  • compilation is performed as needed when calling DatasetIterator.take(n)
    • compilation settings, including optimization levels and compiler choices, are specified in the arguments to this method call (a hypothetical usage sketch follows this list)
  • the exact flags passed to the compilation subprocesses are specified in the .compile() methods in compilation.py
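
The sketch below shows how such a call might look. DatasetIterator and .take() come from the repository, but the import path, constructor argument, and keyword names shown here are assumptions, not the actual signature; the real compilation flags live in the .compile() methods in compilation.py:

```python
# Hypothetical sketch only: the import path, constructor argument, and
# keyword names below are assumed, not taken from the repository.
from dataset import DatasetIterator  # assumed module name

iterator = DatasetIterator("dataset/")  # directory produced by the unzip step above

# Compilation happens lazily inside take(); settings such as optimization
# levels and compiler choices are passed as arguments to this call.
samples = iterator.take(
    1000,
    compilers=["gcc", "clang"],                        # assumed keyword
    optimization_levels=["-O0", "-O1", "-O2", "-O3"],  # assumed keyword
)

for c_source, assembly in samples:  # assumed (C source, assembly) pairing
    ...
```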

Tokenization

C and assembly code snippets are tokenized semantically using the tree-sitter library. Each token includes raw text paired with its symbolic identity, e.g., (variable, 42).
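
A minimal sketch of this style of tokenization, assuming a recent py-tree-sitter plus the tree-sitter-c grammar package (pip install tree-sitter tree-sitter-c). The (symbolic type, raw text) pairing below illustrates the idea but is not necessarily sunbird's exact token format:

```python
# Pair each leaf node's grammar type with its raw source text.
import tree_sitter_c
from tree_sitter import Language, Parser

# Parser(language) requires a recent py-tree-sitter; older versions
# used parser.set_language(...) instead.
parser = Parser(Language(tree_sitter_c.language()))

def tokenize(source: bytes) -> list[tuple[str, str]]:
    """Collect (node type, raw text) pairs for every leaf in the parse tree."""
    tokens: list[tuple[str, str]] = []

    def walk(node) -> None:
        if node.child_count == 0:  # leaf nodes carry the actual source text
            tokens.append((node.type, node.text.decode()))
        for child in node.children:
            walk(child)

    walk(parser.parse(source).root_node)
    return tokens

print(tokenize(b"int x = 42;"))
# e.g. [('primitive_type', 'int'), ('identifier', 'x'), ('=', '='),
#       ('number_literal', '42'), (';', ';')]
```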

About

Training pipeline for decompilation-oriented machine learning models
