EnzymeDatasetBuilder

This repository contains code for a web scraper that builds a dataset of enzymes' amino acid sequences and the SMILES strings of their substrates and products.

1. EC and Amino acid getter

This file accesses the UniProt database to find enzymes and their names. It then searches the MetaCyc database for the same enzymes and saves the UniProt and MetaCyc links for use by the later steps.

2. Chem url getter

This file loops through the MetaCyc links produced by the EC and Amino acid getter and creates links to the chemical pages for each enzyme. Unfortunately, it relies on pyautogui to navigate the website.

3. Chem getter

This file loops through the chemical page links and retrieves the SMILES encodings from MetaCyc. It then saves them to CSV files that can be loaded with numpy.
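As a rough illustration of writing and reading SMILES strings with numpy, here is a minimal sketch. The helper names `save_smiles` and `load_smiles` are hypothetical, and the one-string-per-line layout is an assumption; the files written by Chem_Getter.py may be organized differently.

```python
import numpy as np


def save_smiles(path, smiles):
    """Write one SMILES string per line (assumed layout, not the scraper's)."""
    np.savetxt(path, np.asarray(smiles, dtype=str), fmt="%s", delimiter=",")


def load_smiles(path):
    """Read the strings back as a 1-D array of str."""
    # comments=None matters here: "#" denotes a triple bond in SMILES, and
    # np.loadtxt would otherwise treat it as the start of a comment.
    return np.loadtxt(path, dtype=str, delimiter=",", comments=None, ndmin=1)
```

Passing `ndmin=1` keeps the result one-dimensional even when a file holds a single string.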

Format of the generated dataset

When all of the files are run in sequence, they produce four directories. Each directory has one file for each enzyme found. One of these directories is significant: "chem" contains a numpy CSV for each enzyme, holding the SMILES encodings of its substrates and products and the enzyme's amino acid (aa) encoding. Each enzyme's aa sequence is encoded as numbers by assigning a number to each letter. This format makes later one-hot encoding easy.
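The letter-to-number encoding and the later one-hot step could look roughly like the sketch below. The alphabet order, and therefore the number given to each letter, is an assumption made for illustration; the scraper's own mapping may differ. The names `AMINO_ACIDS`, `encode_sequence`, and `one_hot` are hypothetical.

```python
import numpy as np

# Assumption: the 20 standard amino acids, numbered in alphabetical order
# of their one-letter codes. The scraper's real table may differ.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_TO_INT = {letter: index for index, letter in enumerate(AMINO_ACIDS)}


def encode_sequence(sequence):
    """Map each amino acid letter to its integer code."""
    return np.array([AA_TO_INT[letter] for letter in sequence.upper()], dtype=np.int64)


def one_hot(encoded, num_classes=len(AMINO_ACIDS)):
    """Turn integer codes into a (length, num_classes) one-hot matrix."""
    # Indexing the identity matrix by the codes selects one row per residue.
    return np.eye(num_classes, dtype=np.float32)[encoded]


encoded = encode_sequence("MKV")
matrix = one_hot(encoded)
print(encoded.tolist())  # [10, 8, 17]
print(matrix.shape)      # (3, 20)
```

Storing the integer codes instead of the one-hot matrix keeps the files small, and the expansion can be done at load time.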

Dataset

There is a preprocessed dataset file, Dataset.csv, that can be loaded with pandas.read_csv. This file was not produced by the web scraper; it is the data that MetaCyc makes available for download. The web scraper can be used to add to this dataset of ~1600 samples.
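A minimal sketch of loading the file for a first look follows. The helper names `load_dataset` and `summarize` are hypothetical, and no column names are assumed because the README does not list them.

```python
import pandas as pd


def load_dataset(path="Dataset.csv"):
    """Read the preprocessed dataset into a DataFrame."""
    return pd.read_csv(path)


def summarize(frame):
    """Return the row count and column names without assuming any schema."""
    return len(frame), list(frame.columns)
```

From the repository root, `rows, columns = summarize(load_dataset())` reports how many samples were read and which columns are present.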

