
Jithin-S-Sunny/SPEDIT-Web-Tool-for-Machine-Learning-based-Classification


This is a machine-learning-based tool for predicting the thermophilic nature of proteins.

The published work can be found here

Random Forest Regression Model for Predicting Target Variable

This Python script implements a Random Forest regression model using the scikit-learn library to predict a target variable from a dataset. The script reads data from a CSV file, splits the data into training and testing sets, trains the model, and evaluates its performance. An online wrapper is also included to generate an HTML output for easy access and interaction.

Prerequisites

Ensure you have the following Python packages installed before running the script:

pandas
numpy
scikit-learn

You can install them using pip:

    pip install pandas numpy scikit-learn

Usage

Input File: The script reads the dataset from a CSV file named 4glm.csv. Make sure the file is located in the same directory as the script, or provide the full path if it's elsewhere.

Target Variable: The target variable column should be named target in the CSV file. Ensure that the column exists and contains the values to be predicted.

Features: The script uses all columns except the target column as features for training the model.
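The input conventions above (a CSV file, a column named target, and every other column used as a feature) can be sketched as follows. This is a minimal illustration and not the repository's actual code: the function names load_features_and_target and split_data are hypothetical, and the default values for test_size and random_state are the ones this README lists under Modifiable Parameters.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def load_features_and_target(csv_path, target_col="target"):
    """Read the CSV and separate the target column from the feature columns.

    Hypothetical helper: the name and signature are assumptions for illustration.
    """
    data = pd.read_csv(csv_path)
    if target_col not in data.columns:
        raise KeyError(f"Column '{target_col}' not found in the input data")
    y = data[target_col]
    X = data.drop(columns=[target_col])  # every remaining column is a feature
    return X, y


def split_data(X, y, test_size=0.25, random_state=42):
    """Split features and target into training and testing sets."""
    return train_test_split(X, y, test_size=test_size, random_state=random_state)
```

Under these assumptions, calling load_features_and_target("4glm.csv") and passing the result to split_data(X, y) would yield X_train, X_test, y_train, and y_test in that order, which is the order train_test_split returns them.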

Modifiable Parameters

test_size: Specifies the proportion of the dataset to include in the test split (default: 0.25).

random_state: Sets the seed for reproducibility (default: 42).

n_estimators: Number of trees in the random forest (default: 1000).

Script Details

Data Loading: The script reads the dataset from 4glm.csv using pandas.

Data Splitting: The features and target are split into training and testing sets using train_test_split.

Baseline Model: Calculates a baseline error using a simple predictor (the ip column) for comparison.

Model Training: A RandomForestRegressor is initialized and trained using the training dataset.

Predictions and Evaluation: The script makes predictions on the test set, calculates the Mean Absolute Error (MAE), and computes the model's accuracy using the Mean Absolute Percentage Error (MAPE).

Outputs

Console Output:

The first 5 rows of the dataset are displayed as a preview.

The shape of the training and testing datasets is printed.

Baseline error, mean absolute error, and model accuracy are printed to the console.

File Output:

An output file out.txt is created (though currently not used in the active code).

An HTML file is generated when using the online wrapper, providing an interactive view of the model’s outputs and evaluation metrics.

Online Wrapper

An online wrapper has been added around the script to facilitate easy interaction with the model. The wrapper allows users to upload their data, adjust parameters, and view results through a web interface. The HTML file for this wrapper can be found in the project directory. Open this file in your browser to access the online interface.

Notes

Baseline Predictor: The script uses a column named ip for the baseline prediction. Ensure this column is present in your dataset, or modify the code to use a valid feature column.

Accuracy Calculation: The script calculates accuracy as 100 - mean(MAPE). Make sure that the target values are not zero to avoid division errors.

Troubleshooting

File Not Found Error: Ensure 4glm.csv is in the correct directory and the path is specified correctly.

Missing Columns: Verify that the target column (target) and baseline predictor column (ip) are present in the dataset.
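The column checks, the baseline comparison, and the accuracy calculation described in the Notes and Troubleshooting entries could look roughly like the sketch below. It is an illustration under stated assumptions, not the repository's code: the function names check_columns, mae_and_accuracy, and train_and_evaluate are hypothetical, the baseline error is assumed to be the mean absolute difference between the ip column and the target, and the percentage error is assumed to be computed relative to the absolute target value.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor


def check_columns(data, required=("target", "ip")):
    """Raise a clear error if a required column is missing (hypothetical helper)."""
    missing = [col for col in required if col not in data.columns]
    if missing:
        raise KeyError(f"Missing required column(s): {missing}")


def mae_and_accuracy(y_true, y_pred):
    """Return (MAE, accuracy), where accuracy = 100 - mean absolute percentage error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if np.any(y_true == 0):
        # A zero target would make the percentage error undefined.
        raise ValueError("Target contains zeros; percentage error is undefined")
    errors = np.abs(y_pred - y_true)
    percentage_errors = 100.0 * errors / np.abs(y_true)
    return errors.mean(), 100.0 - percentage_errors.mean()


def train_and_evaluate(X_train, X_test, y_train, y_test,
                       n_estimators=1000, random_state=42, baseline_col="ip"):
    """Fit a random forest and compare it with the baseline column used as a predictor."""
    # Assumption: the baseline simply treats the ip value as the prediction.
    baseline_mae, _ = mae_and_accuracy(y_test, X_test[baseline_col])
    model = RandomForestRegressor(n_estimators=n_estimators, random_state=random_state)
    model.fit(X_train, y_train)
    model_mae, accuracy = mae_and_accuracy(y_test, model.predict(X_test))
    return {"baseline_mae": baseline_mae, "mae": model_mae, "accuracy": accuracy}
```

Raising an explicit error on zero targets is one possible design choice; filtering those rows or using a different metric would be alternatives, depending on what the data represents.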
