This is a machine learning based tool to predict the thermophilic nature of proteins

Random Forest Regression Model for Predicting Target Variable

This Python script implements a Random Forest regression model using the scikit-learn library to predict a target variable from a dataset. The script reads data from a CSV file, splits the data into training and testing sets, trains the model, and evaluates its performance. An online wrapper is also included to generate an HTML output for easy access and interaction.

Prerequisites

Ensure you have the following Python packages installed before running the script: pandas numpy scikit-learn You can install them using pip: bash

Copy code pip install pandas numpy scikit-learn

Usage

Input File: The script reads the dataset from a CSV file named 4glm.csv. Make sure the file is located in the same directory as the script, or provide the full path if it's elsewhere.

Target Variable: The target variable column should be named target in the CSV file. Ensure that the column exists and contains the values to be predicted.

Features: The script uses all columns except the target column as features for training the model.

Modifiable Parameters test_size: Specifies the proportion of the dataset to include in the test split (default: 0.25). random_state: Sets the seed for reproducibility (default: 42). n_estimators: Number of trees in the random forest (default: 1000). Script Details Data Loading: The script reads the dataset from 4glm.csv using pandas. Data Splitting: The features and target are split into training and testing sets using train_test_split. Baseline Model: Calculates a baseline error using a simple predictor (the ip column) for comparison. Model Training: A RandomForestRegressor is initialized and trained using the training dataset. Predictions and Evaluation: The script makes predictions on the test set, calculates the Mean Absolute Error (MAE), and computes the model's accuracy using the Mean Absolute Percentage Error (MAPE). Outputs Console Output:

The first 5 rows of the dataset are displayed as a preview. The shape of the training and testing datasets is printed. Baseline error, mean absolute error, and model accuracy are printed to the console. File Output:

An output file out.txt is created (though currently not used in the active code). An HTML file is generated when using the online wrapper, providing an interactive view of the model’s outputs and evaluation metrics. Online Wrapper An online wrapper has been added around the script to facilitate easy interaction with the model. The wrapper allows users to upload their data, adjust parameters, and view results through a web interface. The HTML file for this wrapper can be found in the project directory. Open this file in your browser to access the online interface.

Notes Baseline Predictor: The script uses a column named ip for the baseline prediction. Ensure this column is present in your dataset, or modify the code to use a valid feature column. Accuracy Calculation: The script calculates accuracy as 100 - mean(MAPE). Make sure that the target values are not zero to avoid division errors. Troubleshooting File Not Found Error: Ensure 4glm.csv is in the correct directory and the path is specified correctly. Missing Columns: Verify that the target column (target) and baseline predictor column (ip) are present in the dataset.

Name		Name	Last commit message	Last commit date
Latest commit History 15 Commits
README.md		README.md
Supplementary 3_a.pdb		Supplementary 3_a.pdb
Supplementary 3_b.phr		Supplementary 3_b.phr
Supplementary 3_c.pin		Supplementary 3_c.pin
Supplementary 3_d.pot		Supplementary 3_d.pot
Supplementary 3_e.psq		Supplementary 3_e.psq
Supplementary 3_f.ptf		Supplementary 3_f.ptf
Supplementary 3_g.pto		Supplementary 3_g.pto
blast.php		blast.php
blast.py		blast.py
random_forest.py		random_forest.py
spedit.php		spedit.php
submit.php		submit.php
submit2.php		submit2.php
trailrf.py		trailrf.py
txt2csv.py		txt2csv.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

This is a machine learning based tool to predict the thermophilic nature of proteins

Random Forest Regression Model for Predicting Target Variable

Prerequisites

Usage

About

Uh oh!

Releases

Packages

Uh oh!

Languages

Jithin-S-Sunny/SPEDIT-Web-Tool-for-Machine-Learning-based-Classification

Folders and files

Latest commit

History

Repository files navigation

This is a machine learning based tool to predict the thermophilic nature of proteins

Random Forest Regression Model for Predicting Target Variable

Prerequisites

Usage

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Languages

Packages