Dataset generation pipeline for Enigma2 using the NCBI database, along with additional helpers such as a Dataset class to retrieve and create batches for training ML models.
Before setting up EnigmaDB, ensure that you have the following prerequisites installed:
- Python 3.8 or higher
- pip (Python package installer)
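To confirm the interpreter meets the version requirement before installing, a quick check (a minimal sketch, not part of EnigmaDB itself):

```python
import sys

# EnigmaDB requires Python 3.8 or higher; fail fast on older interpreters
assert sys.version_info >= (3, 8), f"Python 3.8+ required, found {sys.version.split()[0]}"
print("Python version OK:", sys.version.split()[0])
```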
```sh
pip install enigmadatabase
```
```sh
git clone https://github.com/delveopers/EnigmaDataset.git
cd EnigmaDataset
```
```python
from EnigmaDB import Database, EntrezQueries

queries = EntrezQueries()  # get the built-in query list
db = Database(topics=queries(), out_dir="./data/raw", email=EMAIL, api_key=API_KEY, retmax=1500, max_rate=10)  # set parameters; EMAIL and API_KEY are your NCBI credentials
db.build(with_index=False)  # start building
```
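The `EMAIL` and `API_KEY` values used above can be supplied via environment variables rather than hard-coded. This is a common pattern, not something EnigmaDB requires; the `NCBI_EMAIL` and `NCBI_API_KEY` variable names are illustrative:

```python
import os

# Hypothetical convention: read NCBI credentials from the environment.
# NCBI_EMAIL / NCBI_API_KEY are illustrative names, not defined by EnigmaDB.
EMAIL = os.environ.get("NCBI_EMAIL", "you@example.com")
API_KEY = os.environ.get("NCBI_API_KEY")  # None is acceptable; requests are just rate-limited lower

print("email:", EMAIL, "| api key set:", API_KEY is not None)
```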
```python
from EnigmaDB import create_index

create_index("./data/raw")  # path to the downloaded data
```
```python
from EnigmaDB import convert_fasta

convert_fasta(input_dir="./data/raw", output_dir="./data/parquet", mode='parquet')  # for Parquet
convert_fasta(input_dir="./data/raw", output_dir="./data/csv", mode='csv')  # for CSV
```
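As a rough illustration of what this conversion step involves (a stdlib-only sketch, not EnigmaDB's actual implementation), FASTA records are flattened into tabular `id, description, sequence` rows:

```python
import csv
import io

def fasta_records(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, seq = None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line[1:], []
        elif line:
            seq.append(line)
    if header is not None:
        yield header, "".join(seq)

sample = ">seq1 demo record\nACGT\nACGT\n>seq2\nTTTT\n"
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["id", "description", "sequence"])
for header, seq in fasta_records(sample):
    ident, _, desc = header.partition(" ")  # split accession from free-text description
    writer.writerow([ident, desc, seq])
print(buf.getvalue())
```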
For more technical information, refer to the documentation:
```
├── docs/
│   ├── Database.md
│   └── Dataset.md
├── src/
│   ├── __init__.py
│   ├── _database.py       # ``Database`` class for downloading data from NCBI
│   ├── _dataset.py        # ``Dataset``, a dataloader class for enigma2
│   └── _queries.py        # queries for the DB pipeline
├── README.md
├── setup.py
├── pyproject.toml
└── requirements.txt       # List of Python dependencies
```
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.
- Fork the repository.
- Create a feature branch:

  ```sh
  git checkout -b feature-name
  ```

- Commit your changes:

  ```sh
  git commit -m "Add feature"
  ```

- Push to the branch:

  ```sh
  git push origin feature-name
  ```

- Create a pull request.
Please make sure to update tests as appropriate.
MIT License. See the LICENSE file for more info.