# DataFlow Assistant

## Introduction

DataFlow Assistant is a user-friendly application designed to streamline data ingestion and the preliminary data science steps necessary for model development. The frontend UI is built using JavaScript and managed with npm, while the backend is developed using the Flask Python framework. With an intuitive frontend interface, users can easily connect to data sources, including the ability to drag and drop CSV files. The application performs essential data validation tasks and allows users to interact and make decisions during the data processing workflow.
## Features

- **Intuitive User Interface:** Easy-to-use frontend for connecting data sources, with drag-and-drop functionality for CSV files.
- **Data Ingestion:** Supports importing data from CSV files and other data sources.
- **Data Validation:**
  - **Imputation:** Handles missing values with various imputation methods.
  - **Negative Value Detection:** Identifies and reports negative values in the dataset.
  - **Duplicate Detection:** Finds duplicate or distinct elements within the data.
- **Entity-Relationship Diagram (ERD):** Automatically generates ERDs to visualize data flow and relationships.
- **Correlation Analysis:** Analyzes and displays relationships and correlations between data elements.
- **User Interaction:** Enables users to provide input or directives when decisions are required during data processing.
- **Robust Logging:** Comprehensive logging for every function, command, and class, facilitating easy debugging and transparency.
- **Test-Driven Development:** Developed using TDD to ensure high code quality and reliability.

## Installation

### Prerequisites

- Python 3.8 or higher
- pip package manager

### Steps

1. **Clone the Repository**
   ```bash
   git clone https://github.com/yourusername/dataflow-assistant.git
   cd dataflow-assistant
   ```

2. **Create a Virtual Environment**
   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows use venv\Scripts\activate
   ```
3. **Install Dependencies**
   ```bash
   pip install -r requirements.txt
   ```

4. **Configure the Application**
   Edit the `config.yaml` file to set up any necessary configurations.

5. **Run the Application**
   ```bash
   python app.py
   ```

6. **Access the Frontend Interface**
   Open your web browser and navigate to http://localhost:8000.

## Usage

### Data Ingestion
- **Drag and Drop:** Simply drag your CSV file into the designated area on the homepage.
- **File Upload:** Click the "Upload" button to browse and select your CSV file.

### Data Validation Process
1. **Imputation of Missing Values:** The application detects missing values and suggests imputation methods.
2. **Negative Value Detection:** Highlights negative values and prompts for corrective actions.
3. **Duplicate Detection:** Identifies duplicates and allows you to decide whether to remove or retain them.
4. **ERD Generation:** Automatically creates an Entity-Relationship Diagram to visualize data relationships.
5. **Correlation Analysis:** Provides insights into the correlations between different data elements.

### User Interaction
- **Decision Points:** When the application encounters issues such as extensive missing data, it recommends solutions but lets you make the final decision.
- **Comments and Directives:** You can add comments or directives at any point to guide the data processing steps.

### Proceed to Model Development
After validation, export the processed data or proceed to model development if integrated within the application.

## Logging

- All operations are logged with detailed information.
- Logs are stored in the `logs/` directory.
- Logging levels (DEBUG, INFO, WARNING, ERROR) can be configured in the `config.yaml` file.

## Running Tests

The application is developed using Test-Driven Development.
Tests are located in the `tests/` directory.
To run tests:
```bash
python -m unittest discover tests
```

To check test coverage (requires the `coverage` package):
```bash
coverage run -m unittest discover tests
coverage report
```

## Contributing

Contributions are welcome! Please read the CONTRIBUTING.md file for guidelines on how to contribute to this project.
## License

This project is licensed under the MIT License - see the LICENSE file for details.
## Contact

- Email: support@dataflowassistant.com
- GitHub Issues: https://github.com/yourusername/dataflow-assistant/issues

## Initial Development Steps and Tests

To develop this application using Test-Driven Development (TDD), we'll start by writing tests for each component before implementing them. Below are the first few steps and corresponding tests to develop the application.
### 1. Project Setup

**Objective:** Set up the initial project structure and environment.

**Actions:**

- Initialize a Git repository.
- Create the basic directory structure.
- Set up a virtual environment.
- Create `requirements.txt` for dependencies.

**Directory Structure:**
```
dataflow-assistant/
├── app.py
├── config.yaml
├── requirements.txt
├── README.md
├── LICENSE
├── CONTRIBUTING.md
├── src/
│   ├── __init__.py
│   ├── ingestion.py
│   ├── validation.py
│   ├── erd_generator.py
│   ├── correlation.py
│   ├── ui.py
│   └── logger.py
├── tests/
│   ├── __init__.py
│   ├── test_ingestion.py
│   ├── test_validation.py
│   ├── test_erd_generator.py
│   ├── test_correlation.py
│   └── test_ui.py
├── logs/
└── data/
```

### 2. Data Ingestion Component

#### Test Case 1: CSV File Ingestion

**Objective:** Verify that the application can ingest CSV files correctly.
**Test Code** (`tests/test_ingestion.py`):
```python
import os
import unittest

from src.ingestion import DataIngestion


class TestDataIngestion(unittest.TestCase):
    def setUp(self):
        self.ingestion = DataIngestion()
        self.sample_csv = 'data/sample.csv'
        # Create a sample CSV file for testing
        with open(self.sample_csv, 'w') as f:
            f.write('id,name,value\n1,Alice,10\n2,Bob,20\n')

    def tearDown(self):
        # Remove the sample CSV file after tests
        os.remove(self.sample_csv)

    def test_load_csv(self):
        data = self.ingestion.load_csv(self.sample_csv)
        self.assertEqual(len(data), 2)
        self.assertIn('name', data.columns)
        self.assertIn('value', data.columns)


if __name__ == '__main__':
    unittest.main()
```

**Implementation Steps:**
- Create a `DataIngestion` class in `src/ingestion.py` (sketched below).
- Implement the `load_csv` method using `pandas.read_csv`.
- Ensure the method includes logging at the INFO level.
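A minimal sketch of these steps, assuming the `dataflow` logger name used by the logging tests later in this section; this is illustrative, not the definitive implementation:

```python
# src/ingestion.py -- minimal sketch
import logging

import pandas as pd

logger = logging.getLogger('dataflow')  # logger name assumed from the logging tests

class DataIngestion:
    """Loads raw data from supported sources into pandas DataFrames."""

    def load_csv(self, path: str) -> pd.DataFrame:
        # INFO-level logging before and after the load, per the steps above
        logger.info('Loading CSV file: %s', path)
        df = pd.read_csv(path)
        logger.info('Loaded %d rows, %d columns from %s', len(df), len(df.columns), path)
        return df
```

This is already enough to make the `test_load_csv` case above pass.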
#### Test Case 2: Drag-and-Drop Functionality (UI)

**Objective:** Ensure the drag-and-drop feature works in the frontend.

**Test Code** (`tests/test_ui.py`):
For UI testing, you might use a tool like Selenium.
```python
import unittest

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.action_chains import ActionChains  # for simulating drag and drop


class TestUI(unittest.TestCase):
    def setUp(self):
        self.driver = webdriver.Chrome()
        self.driver.get('http://localhost:8000')

    def tearDown(self):
        self.driver.quit()

    def test_drag_and_drop_csv(self):
        driver = self.driver
        upload_area = driver.find_element(By.ID, 'upload-area')
        file_path = '/path/to/tests/data/sample.csv'
        # Simulate dropping file_path onto upload_area, e.g. via ActionChains
        # (Implementation depends on the testing framework and may require additional setup)
        self.assertTrue(driver.find_element(By.ID, 'upload-success').is_displayed())


if __name__ == '__main__':
    unittest.main()
```

**Implementation Steps:**
- Develop the frontend using a framework like React or simple HTML/JavaScript.
- Implement drag-and-drop functionality.
- Ensure the backend can receive and process the uploaded file (see the sketch after this list).
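Since the introduction names Flask as the backend framework, the upload endpoint might look roughly like the sketch below; the `/upload` route, the `file` form field, and the `data/` upload directory are illustrative assumptions, not part of the specification above:

```python
# app.py -- hypothetical upload endpoint sketch
import os

from flask import Flask, jsonify, request
from werkzeug.utils import secure_filename

app = Flask(__name__)
UPLOAD_DIR = 'data'  # assumed upload directory


@app.route('/upload', methods=['POST'])
def upload_csv():
    # The frontend posts the dropped file as multipart/form-data under 'file'
    file = request.files.get('file')
    if file is None or not file.filename.endswith('.csv'):
        return jsonify({'error': 'a CSV file is required'}), 400
    path = os.path.join(UPLOAD_DIR, secure_filename(file.filename))
    file.save(path)
    return jsonify({'status': 'ok', 'path': path}), 200
```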
### 3. Data Validation Component

#### Test Case 1: Missing Values Detection

**Objective:** Check if missing values are correctly identified in the dataset.

**Test Code** (`tests/test_validation.py`):
```python
import unittest

import pandas as pd

from src.validation import DataValidation


class TestDataValidation(unittest.TestCase):
    def setUp(self):
        data = {'id': [1, 2, 3], 'value': [10, None, 30]}
        self.df = pd.DataFrame(data)
        self.validation = DataValidation(self.df)

    def test_find_missing_values(self):
        missing_values = self.validation.find_missing_values()
        self.assertEqual(missing_values['value'], 1)
        self.assertEqual(missing_values['id'], 0)


if __name__ == '__main__':
    unittest.main()
```

**Implementation Steps:**
- Create a `DataValidation` class in `src/validation.py` (sketched below).
- Implement the `find_missing_values` method using `pandas.isnull()` and `sum()`.
- Add appropriate logging statements.
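A minimal sketch of the class, again assuming the `dataflow` logger name:

```python
# src/validation.py -- minimal sketch
import logging

import pandas as pd

logger = logging.getLogger('dataflow')  # assumed logger name


class DataValidation:
    def __init__(self, df: pd.DataFrame):
        self.df = df

    def find_missing_values(self) -> dict:
        # isnull().sum() yields a per-column count of missing entries
        counts = self.df.isnull().sum().to_dict()
        logger.info('Missing values per column: %s', counts)
        return counts
```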
#### Test Case 2: Negative Values Detection

**Objective:** Detect negative values in numerical columns.

**Test Code** (`tests/test_validation.py`):
```python
# Add to the TestDataValidation class in tests/test_validation.py
def test_find_negative_values(self):
    data = {'id': [1, 2, 3], 'value': [10, -20, 30]}
    df = pd.DataFrame(data)
    validation = DataValidation(df)
    negative_values = validation.find_negative_values()
    self.assertEqual(negative_values['value'], [-20])
    self.assertNotIn('id', negative_values)
```

**Implementation Steps:**
- Implement the `find_negative_values` method in `DataValidation` (sketched below).
- Use `df[df < 0]` to find negative values.
- Include logging of negative value detection.
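A sketch that would satisfy the test above; note that columns containing no negatives are omitted from the result, since the test asserts `'id'` is absent:

```python
# Sketch of find_negative_values, added to the DataValidation class above
def find_negative_values(self) -> dict:
    negatives = {}
    # Only numeric columns can meaningfully contain negative values
    for col in self.df.select_dtypes(include='number').columns:
        values = self.df[col][self.df[col] < 0].tolist()
        if values:
            negatives[col] = values
    logger.info('Negative values found: %s', negatives)
    return negatives
```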
#### Test Case 3: Duplicate Detection

**Objective:** Identify duplicate rows in the dataset.

**Test Code** (`tests/test_validation.py`):
```python
# Add to the TestDataValidation class in tests/test_validation.py
def test_find_duplicates(self):
    data = {'id': [1, 2, 2], 'value': [10, 20, 20]}
    df = pd.DataFrame(data)
    validation = DataValidation(df)
    duplicates = validation.find_duplicates()
    self.assertEqual(len(duplicates), 1)
    self.assertEqual(duplicates.iloc[0]['id'], 2)
```

**Implementation Steps:**
- Implement the `find_duplicates` method using `df[df.duplicated()]` (sketched below).
- Add logging to record duplicate entries found.
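A minimal sketch built directly on the `df[df.duplicated()]` idiom named above:

```python
# Sketch of find_duplicates, added to the DataValidation class above
def find_duplicates(self) -> pd.DataFrame:
    # duplicated() marks every repeat of an earlier row as True
    duplicates = self.df[self.df.duplicated()]
    logger.info('Found %d duplicate rows', len(duplicates))
    return duplicates
```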
### 4. Entity-Relationship Diagram (ERD) Generation

#### Test Case: ERD Generation

**Objective:** Verify that the ERD is generated based on the data.

**Test Code** (`tests/test_erd_generator.py`):
```python
import os
import unittest

from src.erd_generator import ERDGenerator


class TestERDGenerator(unittest.TestCase):
    def setUp(self):
        self.sample_schema = {
            'tables': {
                'users': ['id', 'name', 'email'],
                'orders': ['id', 'user_id', 'product_id']
            },
            'relationships': [
                ('users.id', 'orders.user_id')
            ]
        }
        self.erd_generator = ERDGenerator(self.sample_schema)

    def test_generate_erd(self):
        result = self.erd_generator.generate()
        self.assertTrue(result)  # Assuming the method returns True on success
        self.assertTrue(os.path.exists('output/erd.png'))


if __name__ == '__main__':
    unittest.main()
```

**Implementation Steps:**
- Create an `ERDGenerator` class in `src/erd_generator.py` (sketched below).
- Implement the `generate` method using a library like pygraphviz or ERAlchemy.
- Ensure the ERD is saved to an output directory.
- Include logging for ERD generation steps.
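A sketch of `generate` using pygraphviz, one of the libraries suggested above (it requires Graphviz to be installed); the record-style node labels and edge labels are illustrative choices:

```python
# src/erd_generator.py -- illustrative pygraphviz sketch
import logging
import os

import pygraphviz as pgv

logger = logging.getLogger('dataflow')  # assumed logger name


class ERDGenerator:
    def __init__(self, schema: dict):
        self.schema = schema

    def generate(self, output_path: str = 'output/erd.png') -> bool:
        graph = pgv.AGraph(directed=True)
        # One record node per table, listing its columns
        for table, columns in self.schema['tables'].items():
            graph.add_node(table, shape='record',
                           label='|'.join([table] + list(columns)))
        # One edge per relationship, e.g. ('users.id', 'orders.user_id')
        for source, target in self.schema['relationships']:
            graph.add_edge(source.split('.')[0], target.split('.')[0],
                           label=f'{source} -> {target}')
        os.makedirs(os.path.dirname(output_path), exist_ok=True)
        graph.layout(prog='dot')
        graph.draw(output_path)
        logger.info('ERD written to %s', output_path)
        return True
```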
### 5. Correlation Analysis

#### Test Case: Correlation Calculation

**Objective:** Confirm that correlation coefficients are calculated correctly.

**Test Code** (`tests/test_correlation.py`):
```python
import unittest

import pandas as pd

from src.correlation import CorrelationAnalysis


class TestCorrelationAnalysis(unittest.TestCase):
    def setUp(self):
        data = {
            'A': [1, 2, 3, 4, 5],
            'B': [2, 4, 6, 8, 10],
            'C': [5, 3, 6, 2, 1]
        }
        self.df = pd.DataFrame(data)
        self.correlation = CorrelationAnalysis(self.df)

    def test_calculate_correlation(self):
        corr_matrix = self.correlation.calculate()
        self.assertEqual(corr_matrix.loc['A', 'B'], 1.0)
        self.assertTrue(-1 <= corr_matrix.loc['A', 'C'] <= 1)


if __name__ == '__main__':
    unittest.main()
```

**Implementation Steps:**
- Create a `CorrelationAnalysis` class in `src/correlation.py` (sketched below).
- Implement the `calculate` method using `df.corr()`.
- Add logging to capture the correlation results.
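A minimal sketch built on `df.corr()`, as named above:

```python
# src/correlation.py -- minimal sketch
import logging

import pandas as pd

logger = logging.getLogger('dataflow')  # assumed logger name


class CorrelationAnalysis:
    def __init__(self, df: pd.DataFrame):
        self.df = df

    def calculate(self) -> pd.DataFrame:
        # Pearson correlation by default; non-numeric columns are excluded
        corr_matrix = self.df.corr(numeric_only=True)
        logger.info('Correlation matrix:\n%s', corr_matrix)
        return corr_matrix
```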
### 6. User Interaction for Decision Points

#### Test Case: User Decision on Missing Data Handling

**Objective:** Ensure the application prompts the user and handles their input correctly.

**Test Code** (`tests/test_validation.py`):
```python
from unittest.mock import patch

# Add to the TestDataValidation class in tests/test_validation.py
def test_user_decision_on_missing_data(self):
    with patch('builtins.input', return_value='impute'):
        decision = self.validation.handle_missing_data_decision()
        self.assertEqual(decision, 'impute')
```

**Implementation Steps:**
- Implement the `handle_missing_data_decision` method in `DataValidation` (sketched below).
- Use the `input()` function to prompt the user.
- Add logging to record the user's decision.
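A sketch that satisfies the mocked-`input` test above; the option names `impute`, `drop`, and `ignore` are assumptions for illustration:

```python
# Sketch of handle_missing_data_decision, added to the DataValidation class above
def handle_missing_data_decision(self) -> str:
    options = ('impute', 'drop', 'ignore')  # assumed option names
    decision = input(f'Missing data detected. Choose one of {options}: ').strip().lower()
    # Re-prompt until the user picks a supported option
    while decision not in options:
        decision = input(f'Please choose one of {options}: ').strip().lower()
    logger.info("User chose '%s' for missing data handling", decision)
    return decision
```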
### 7. Logging Component

#### Test Case: Logging Functionality

**Objective:** Verify that logging occurs at each step and logs are correctly stored.

**Test Code** (`tests/test_logger.py`):
```python
import logging
import unittest

from src.logger import setup_logging


class TestLogging(unittest.TestCase):
    def setUp(self):
        setup_logging()
        self.logger = logging.getLogger('dataflow')

    def test_logging_output(self):
        with self.assertLogs('dataflow', level='INFO') as cm:
            self.logger.info('Test log message')
        self.assertIn('Test log message', cm.output[0])


if __name__ == '__main__':
    unittest.main()
```

**Implementation Steps:**
- Create a `setup_logging` function in `src/logger.py` (sketched below).
- Configure logging to output to both console and file.
- Ensure the logging format includes timestamps and log levels.
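A minimal sketch of `setup_logging`; the log file name and the format string are assumptions:

```python
# src/logger.py -- minimal sketch
import logging
import os


def setup_logging(level: int = logging.INFO) -> None:
    os.makedirs('logs', exist_ok=True)
    logger = logging.getLogger('dataflow')
    if logger.handlers:  # avoid adding duplicate handlers on repeated calls
        return
    logger.setLevel(level)
    # Timestamps and log levels in every record, per the steps above
    formatter = logging.Formatter('%(asctime)s %(levelname)s %(name)s: %(message)s')
    for handler in (logging.StreamHandler(),
                    logging.FileHandler('logs/dataflow.log')):
        handler.setFormatter(formatter)
        logger.addHandler(handler)
```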