Madhuarvind/Data-preprocessing

Data Preprocessing Application

Overview

This is a Flask web application for data preprocessing and visualization. Users upload a CSV or Excel dataset, apply preprocessing steps, view summary statistics, and compare the original and cleaned data in interactive charts.

Features

  • File Upload Support: Upload CSV and Excel (.xlsx) files (max 10MB).
  • Preprocessing Options:
    • Remove missing values (default).
    • Remove duplicates.
    • Remove outliers using IQR method for numerical columns.
    • Normalize numerical columns using MinMaxScaler.
    • One-hot encode categorical columns.
  • Summary Statistics: Generate and display descriptive statistics for original and cleaned datasets.
  • Interactive Visualizations: Choose from Histogram, Box Plot, Scatter Plot, Line Chart, Bar Chart, and Pie Chart. Charts are rendered with Plotly and compare the original and cleaned data.
  • Downloads: Download original and cleaned datasets as CSV.
  • User-Friendly UI: Bootstrap-based interface with custom styling.
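
The preprocessing options above can be sketched in pandas as follows. This is a minimal illustration, not the code in app.py: the function name `preprocess` and its keyword flags are hypothetical, the 1.5 × IQR fence is the conventional choice and is assumed here, and min-max scaling is written with plain pandas arithmetic in place of scikit-learn's MinMaxScaler (it performs the same 0-to-1 rescaling as that scaler's default range).

```python
import pandas as pd


def preprocess(df, drop_missing=True, drop_duplicates=False,
               remove_outliers=False, normalize=False, encode=False):
    """Apply the selected cleaning steps and return a new DataFrame.

    Hypothetical helper: names and defaults are assumptions that mirror
    the checkboxes described in the feature list.
    """
    cleaned = df.copy()
    if drop_missing:
        cleaned = cleaned.dropna()
    if drop_duplicates:
        cleaned = cleaned.drop_duplicates()

    numeric_cols = cleaned.select_dtypes(include="number").columns
    if remove_outliers:
        for col in numeric_cols:
            q1, q3 = cleaned[col].quantile([0.25, 0.75])
            iqr = q3 - q1
            # Keep rows inside the assumed 1.5 * IQR fences.
            in_range = cleaned[col].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
            cleaned = cleaned.loc[in_range].copy()
    if normalize:
        for col in numeric_cols:
            lo, hi = cleaned[col].min(), cleaned[col].max()
            span = (hi - lo) or 1  # constant column: avoid dividing by zero
            cleaned[col] = (cleaned[col] - lo) / span
    if encode:
        # Treat every remaining non-numeric column as categorical.
        cat_cols = cleaned.select_dtypes(
            exclude=["number", "bool", "datetime"]).columns
        cleaned = pd.get_dummies(cleaned, columns=list(cat_cols))
    return cleaned.reset_index(drop=True)
```

Calling `df.describe()` on the original frame and on the returned frame gives the kind of before-and-after summary statistics the feature list describes.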

Setup Instructions

  1. Clone or Download the project.
  2. Install Dependencies:
    pip install -r requirements.txt
    
    Required libraries: Flask, pandas, matplotlib, seaborn, plotly, openpyxl, scikit-learn.
  3. Run the Application:
    python app.py
    
    The app will start on http://127.0.0.1:5000.
  4. Access the App: Open your browser and go to http://127.0.0.1:5000.

Usage

  1. Upload Dataset: Select a CSV or Excel file.
  2. Select Preprocessing Options: Choose from the checkboxes (Remove Missing Values is enabled by default).
  3. Select Charts: Check the visualizations you want to generate.
  4. Submit: Click "Upload and Generate Charts".
  5. View Results:
    • Download original or cleaned datasets.
    • Review summary statistics tables.
    • Interact with generated charts (before and after preprocessing).
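
The upload step can be sketched in Flask as shown below. This is an assumption-laden illustration rather than the project's actual route: the `/upload` path, the `file` form field, the `read_dataset` helper, and the JSON responses are all hypothetical. Only the 10 MB limit and the CSV/.xlsx formats come from this README.

```python
import io

import pandas as pd
from flask import Flask, jsonify, request

app = Flask(__name__)
# Reject request bodies larger than 10 MB; Flask responds with HTTP 413.
app.config["MAX_CONTENT_LENGTH"] = 10 * 1024 * 1024

ALLOWED_EXTENSIONS = {"csv", "xlsx"}  # hypothetical constant


def read_dataset(filename, stream):
    """Load a CSV or .xlsx stream into a DataFrame (hypothetical helper)."""
    ext = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    if ext not in ALLOWED_EXTENSIONS:
        raise ValueError(f"Unsupported file type: {filename!r}")
    if ext == "csv":
        df = pd.read_csv(stream)
    else:
        df = pd.read_excel(stream)  # relies on openpyxl for .xlsx
    if df.empty:
        raise ValueError("The uploaded dataset is empty")
    return df


@app.route("/upload", methods=["POST"])  # hypothetical route name
def upload():
    file = request.files.get("file")  # hypothetical form field name
    if file is None or file.filename == "":
        return jsonify(error="No file selected"), 400
    try:
        # Buffer the upload in memory so pandas gets a seekable stream.
        df = read_dataset(file.filename, io.BytesIO(file.read()))
    except ValueError as exc:  # also covers pandas parsing errors
        return jsonify(error=str(exc)), 400
    return jsonify(rows=len(df), columns=list(df.columns))
```

A real results page would render a template instead of returning JSON; the JSON response only keeps the sketch short.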

Project Structure

  • app.py: Main Flask application with routes, preprocessing logic, and chart generation.
  • templates/index.html: Upload form with preprocessing and chart options.
  • templates/result.html: Results page with downloads, summaries, and charts.
  • static/css/styles.css: Custom styles for the UI.
  • static/charts/: Directory for chart files (auto-created).
  • requirements.txt: Python dependencies.
  • logo.jpg: App logo (optional).

Notes

  • Charts are generated from the first suitable columns in the dataset (numerical for most chart types, categorical for pie charts).
  • Scatter plots require at least two numerical columns.
  • The app reports errors for invalid files, empty datasets, and failures during processing.
  • Interactive charts need an internet connection because Plotly is loaded from its CDN.
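
One way to picture the "first suitable columns" rule from the notes above is the sketch below. The helper name `pick_chart_columns` and the chart-type strings are hypothetical, and the actual selection logic in app.py may differ.

```python
import pandas as pd


def pick_chart_columns(df, chart_type):
    """Return the first columns suitable for chart_type, or None.

    Hypothetical helper: pie charts take the first categorical column,
    scatter plots the first two numerical columns, and every other
    chart type the first numerical column.
    """
    numeric = list(df.select_dtypes(include="number").columns)
    categorical = list(
        df.select_dtypes(exclude=["number", "bool", "datetime"]).columns)
    if chart_type == "pie":
        return categorical[:1] or None
    if chart_type == "scatter":
        # A scatter plot needs an x and a y column.
        return numeric[:2] if len(numeric) >= 2 else None
    return numeric[:1] or None
```

Returning None lets the caller skip a chart that the dataset cannot support, for example a scatter plot on a dataset with a single numerical column.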

Deployment Instructions

To deploy the application to a cloud platform, follow these steps for Heroku. Note that Heroku discontinued its free plans in November 2022, so a paid plan is now required:

  1. Install Heroku CLI: Download and install from https://devcenter.heroku.com/articles/heroku-cli.

  2. Prepare the App for Production:

    • Add gunicorn to requirements.txt (already included).
    • Create a Procfile in the root directory with the content:
      web: gunicorn app:app
      
    • Ensure app.py has the following at the end:
      if __name__ == '__main__':
          app.run()
      (Already present.)
  3. Deploy to Heroku:

    • Login to Heroku: heroku login
    • Create a new app: heroku create your-app-name
    • Push the code: git add . && git commit -m "Initial commit" && git push heroku main
    • Open the app: heroku open

For other platforms like Render or Railway, create an account, connect your GitHub repo (push the code to GitHub first), and deploy as a web service.

Future Improvements

  • Support for more file formats (JSON, etc.).
  • Advanced preprocessing (feature selection, imputation methods).
  • Export charts as images/PDF.
  • User authentication and session management.

© 2024 Data Cleaner. All rights reserved.
