# DataFlow Assistant

## Introduction

DataFlow Assistant is a user-friendly application designed to streamline data ingestion and the preliminary data science steps necessary for model development. The frontend UI is built using JavaScript and managed with npm, while the backend is developed using the Flask Python framework. With an intuitive frontend interface, users can easily connect to data sources, including the ability to drag and drop CSV files. The application performs essential data validation tasks and allows users to interact and make decisions during the data processing workflow.
## Features

- **Intuitive User Interface:** Easy-to-use frontend for connecting data sources, with drag-and-drop functionality for CSV files.
- **Data Ingestion:** Supports importing data from CSV files and other data sources.
- **Data Validation:**
  - **Imputation:** Handles missing values with various imputation methods.
  - **Negative Value Detection:** Identifies and reports negative values in the dataset.
  - **Duplicate Detection:** Finds duplicate or distinct elements within the data.
- **Entity-Relationship Diagram (ERD):** Automatically generates ERDs to visualize data flow and relationships.
- **Correlation Analysis:** Analyzes and displays relationships and correlations between data elements.
- **User Interaction:** Enables users to provide input or directives when decisions are required during data processing.
- **Robust Logging:** Comprehensive logging for every function, command, and class, facilitating easy debugging and transparency.
- **Test-Driven Development:** Developed using TDD to ensure high code quality and reliability.

## Installation

### Prerequisites

- Python 3.8 or higher
- pip package manager

### Steps

1. **Clone the Repository**
   ```bash
   git clone https://github.com/yourusername/dataflow-assistant.git
   cd dataflow-assistant
   ```

2. **Create a Virtual Environment**
   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows use venv\Scripts\activate
   ```
3. **Install Dependencies**
   ```bash
   pip install -r requirements.txt
   ```

4. **Configure the Application**
   Edit the `config.yaml` file to set up any necessary configurations.

5. **Run the Application**
   ```bash
   python app.py
   ```

6. **Access the Frontend Interface**
   Open your web browser and navigate to http://localhost:8000.

## Usage

### Data Ingestion
- **Drag and Drop:** Simply drag your CSV file into the designated area on the homepage.
- **File Upload:** Click the "Upload" button to browse and select your CSV file.

### Data Validation Process
1. **Imputation of Missing Values:** The application detects missing values and suggests imputation methods.
2. **Negative Value Detection:** Highlights negative values and prompts for corrective actions.
3. **Duplicate Detection:** Identifies duplicates and allows you to decide whether to remove or retain them.
4. **ERD Generation:** Automatically creates an Entity-Relationship Diagram to visualize data relationships.
5. **Correlation Analysis:** Provides insights into the correlations between different data elements.

### User Interaction
- **Decision Points:** When the application encounters issues such as extensive missing data, it recommends solutions but lets you make the final decision.
- **Comments and Directives:** You can add comments or directives at any point to guide the data processing steps.

### Proceed to Model Development
After validation, export the processed data or proceed to model development if integrated within the application.

## Logging

- All operations are logged with detailed information.
- Logs are stored in the `logs/` directory.
- Logging levels (DEBUG, INFO, WARNING, ERROR) can be configured in the `config.yaml` file.

## Running Tests

The application is developed using Test-Driven Development.
Tests are located in the `tests/` directory.
To run tests:
```bash
python -m unittest discover tests
```

To check test coverage (requires the `coverage` package):
```bash
coverage run -m unittest discover tests
coverage report
```

## Contributing

Contributions are welcome! Please read the CONTRIBUTING.md file for guidelines on how to contribute to this project.
## License

This project is licensed under the MIT License - see the LICENSE file for details.
## Contact

- Email: support@dataflowassistant.com
- GitHub Issues: https://github.com/yourusername/dataflow-assistant/issues

## Initial Development Steps and Tests

To develop this application using Test-Driven Development (TDD), we'll start by writing tests for each component before implementing them. Below are the first few steps and corresponding tests to develop the application.
### 1. Project Setup

**Objective:** Set up the initial project structure and environment.

**Actions:**

- Initialize a Git repository.
- Create the basic directory structure.
- Set up a virtual environment.
- Create `requirements.txt` for dependencies.

**Directory Structure:**
```
dataflow-assistant/
├── app.py
├── config.yaml
├── requirements.txt
├── README.md
├── LICENSE
├── CONTRIBUTING.md
├── src/
│   ├── __init__.py
│   ├── ingestion.py
│   ├── validation.py
│   ├── erd_generator.py
│   ├── correlation.py
│   ├── ui.py
│   └── logger.py
├── tests/
│   ├── __init__.py
│   ├── test_ingestion.py
│   ├── test_validation.py
│   ├── test_erd_generator.py
│   ├── test_correlation.py
│   └── test_ui.py
├── logs/
└── data/
```

### 2. Data Ingestion Component

#### Test Case 1: CSV File Ingestion

**Objective:** Verify that the application can ingest CSV files correctly.
**Test Code** (`tests/test_ingestion.py`):
```python
import os
import unittest

from src.ingestion import DataIngestion


class TestDataIngestion(unittest.TestCase):
    def setUp(self):
        self.ingestion = DataIngestion()
        self.sample_csv = 'data/sample.csv'
        # Create a sample CSV file for testing
        with open(self.sample_csv, 'w') as f:
            f.write('id,name,value\n1,Alice,10\n2,Bob,20\n')

    def tearDown(self):
        # Remove the sample CSV file after tests
        os.remove(self.sample_csv)

    def test_load_csv(self):
        data = self.ingestion.load_csv(self.sample_csv)
        self.assertEqual(len(data), 2)
        self.assertIn('name', data.columns)
        self.assertIn('value', data.columns)


if __name__ == '__main__':
    unittest.main()
```

**Implementation Steps:**
- Create a `DataIngestion` class in `src/ingestion.py` (sketched below).
- Implement the `load_csv` method using `pandas.read_csv`.
- Ensure the method includes logging at the INFO level.
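A minimal sketch of these steps, assuming the `dataflow` logger name used by the logging tests later in this section; this is illustrative, not the definitive implementation:

```python
# src/ingestion.py -- minimal sketch
import logging

import pandas as pd

logger = logging.getLogger('dataflow')  # logger name assumed from the logging tests

class DataIngestion:
    """Loads raw data from supported sources into pandas DataFrames."""

    def load_csv(self, path: str) -> pd.DataFrame:
        # INFO-level logging before and after the load, per the steps above
        logger.info('Loading CSV file: %s', path)
        df = pd.read_csv(path)
        logger.info('Loaded %d rows, %d columns from %s', len(df), len(df.columns), path)
        return df
```

This is already enough to make the `test_load_csv` case above pass.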
#### Test Case 2: Drag-and-Drop Functionality (UI)

**Objective:** Ensure the drag-and-drop feature works in the frontend.

**Test Code** (`tests/test_ui.py`):
For UI testing, you might use a tool like Selenium.
```python
import unittest

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.action_chains import ActionChains  # for simulating drag and drop


class TestUI(unittest.TestCase):
    def setUp(self):
        self.driver = webdriver.Chrome()
        self.driver.get('http://localhost:8000')

    def tearDown(self):
        self.driver.quit()

    def test_drag_and_drop_csv(self):
        driver = self.driver
        upload_area = driver.find_element(By.ID, 'upload-area')
        file_path = '/path/to/tests/data/sample.csv'
        # Simulate dropping file_path onto upload_area, e.g. via ActionChains
        # (Implementation depends on the testing framework and may require additional setup)
        self.assertTrue(driver.find_element(By.ID, 'upload-success').is_displayed())


if __name__ == '__main__':
    unittest.main()
```

**Implementation Steps:**
- Develop the frontend using a framework like React or simple HTML/JavaScript.
- Implement drag-and-drop functionality.
- Ensure the backend can receive and process the uploaded file (see the sketch after this list).
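Since the introduction names Flask as the backend framework, the upload endpoint might look roughly like the sketch below; the `/upload` route, the `file` form field, and the `data/` upload directory are illustrative assumptions, not part of the specification above:

```python
# app.py -- hypothetical upload endpoint sketch
import os

from flask import Flask, jsonify, request
from werkzeug.utils import secure_filename

app = Flask(__name__)
UPLOAD_DIR = 'data'  # assumed upload directory


@app.route('/upload', methods=['POST'])
def upload_csv():
    # The frontend posts the dropped file as multipart/form-data under 'file'
    file = request.files.get('file')
    if file is None or not file.filename.endswith('.csv'):
        return jsonify({'error': 'a CSV file is required'}), 400
    path = os.path.join(UPLOAD_DIR, secure_filename(file.filename))
    file.save(path)
    return jsonify({'status': 'ok', 'path': path}), 200
```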
### 3. Data Validation Component

#### Test Case 1: Missing Values Detection

**Objective:** Check if missing values are correctly identified in the dataset.

**Test Code** (`tests/test_validation.py`):
```python
import unittest

import pandas as pd

from src.validation import DataValidation


class TestDataValidation(unittest.TestCase):
    def setUp(self):
        data = {'id': [1, 2, 3], 'value': [10, None, 30]}
        self.df = pd.DataFrame(data)
        self.validation = DataValidation(self.df)

    def test_find_missing_values(self):
        missing_values = self.validation.find_missing_values()
        self.assertEqual(missing_values['value'], 1)
        self.assertEqual(missing_values['id'], 0)


if __name__ == '__main__':
    unittest.main()
```

**Implementation Steps:**
- Create a `DataValidation` class in `src/validation.py` (sketched below).
- Implement the `find_missing_values` method using `pandas.isnull()` and `sum()`.
- Add appropriate logging statements.
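A minimal sketch of the class, again assuming the `dataflow` logger name:

```python
# src/validation.py -- minimal sketch
import logging

import pandas as pd

logger = logging.getLogger('dataflow')  # assumed logger name


class DataValidation:
    def __init__(self, df: pd.DataFrame):
        self.df = df

    def find_missing_values(self) -> dict:
        # isnull().sum() yields a per-column count of missing entries
        counts = self.df.isnull().sum().to_dict()
        logger.info('Missing values per column: %s', counts)
        return counts
```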
#### Test Case 2: Negative Values Detection

**Objective:** Detect negative values in numerical columns.

**Test Code** (`tests/test_validation.py`):
```python
# Add to the TestDataValidation class in tests/test_validation.py
def test_find_negative_values(self):
    data = {'id': [1, 2, 3], 'value': [10, -20, 30]}
    df = pd.DataFrame(data)
    validation = DataValidation(df)
    negative_values = validation.find_negative_values()
    self.assertEqual(negative_values['value'], [-20])
    self.assertNotIn('id', negative_values)
```

**Implementation Steps:**
- Implement the `find_negative_values` method in `DataValidation` (sketched below).
- Use `df[df < 0]` to find negative values.
- Include logging of negative value detection.
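A sketch that would satisfy the test above; note that columns containing no negatives are omitted from the result, since the test asserts `'id'` is absent:

```python
# Sketch of find_negative_values, added to the DataValidation class above
def find_negative_values(self) -> dict:
    negatives = {}
    # Only numeric columns can meaningfully contain negative values
    for col in self.df.select_dtypes(include='number').columns:
        values = self.df[col][self.df[col] < 0].tolist()
        if values:
            negatives[col] = values
    logger.info('Negative values found: %s', negatives)
    return negatives
```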
#### Test Case 3: Duplicate Detection

**Objective:** Identify duplicate rows in the dataset.

**Test Code** (`tests/test_validation.py`):
```python
# Add to the TestDataValidation class in tests/test_validation.py
def test_find_duplicates(self):
    data = {'id': [1, 2, 2], 'value': [10, 20, 20]}
    df = pd.DataFrame(data)
    validation = DataValidation(df)
    duplicates = validation.find_duplicates()
    self.assertEqual(len(duplicates), 1)
    self.assertEqual(duplicates.iloc[0]['id'], 2)
```

**Implementation Steps:**
- Implement the `find_duplicates` method using `df[df.duplicated()]` (sketched below).
- Add logging to record duplicate entries found.
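A minimal sketch built directly on the `df[df.duplicated()]` idiom named above:

```python
# Sketch of find_duplicates, added to the DataValidation class above
def find_duplicates(self) -> pd.DataFrame:
    # duplicated() marks every repeat of an earlier row as True
    duplicates = self.df[self.df.duplicated()]
    logger.info('Found %d duplicate rows', len(duplicates))
    return duplicates
```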
### 4. Entity-Relationship Diagram (ERD) Generation

#### Test Case: ERD Generation

**Objective:** Verify that the ERD is generated based on the data.

**Test Code** (`tests/test_erd_generator.py`):
```python
import os
import unittest

from src.erd_generator import ERDGenerator


class TestERDGenerator(unittest.TestCase):
    def setUp(self):
        self.sample_schema = {
            'tables': {
                'users': ['id', 'name', 'email'],
                'orders': ['id', 'user_id', 'product_id']
            },
            'relationships': [
                ('users.id', 'orders.user_id')
            ]
        }
        self.erd_generator = ERDGenerator(self.sample_schema)

    def test_generate_erd(self):
        result = self.erd_generator.generate()
        self.assertTrue(result)  # Assuming the method returns True on success
        self.assertTrue(os.path.exists('output/erd.png'))


if __name__ == '__main__':
    unittest.main()
```

**Implementation Steps:**
- Create an `ERDGenerator` class in `src/erd_generator.py` (sketched below).
- Implement the `generate` method using a library like pygraphviz or ERAlchemy.
- Ensure the ERD is saved to an output directory.
- Include logging for ERD generation steps.
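A sketch of `generate` using pygraphviz, one of the libraries suggested above (it requires Graphviz to be installed); the record-style node labels and edge labels are illustrative choices:

```python
# src/erd_generator.py -- illustrative pygraphviz sketch
import logging
import os

import pygraphviz as pgv

logger = logging.getLogger('dataflow')  # assumed logger name


class ERDGenerator:
    def __init__(self, schema: dict):
        self.schema = schema

    def generate(self, output_path: str = 'output/erd.png') -> bool:
        graph = pgv.AGraph(directed=True)
        # One record node per table, listing its columns
        for table, columns in self.schema['tables'].items():
            graph.add_node(table, shape='record',
                           label='|'.join([table] + list(columns)))
        # One edge per relationship, e.g. ('users.id', 'orders.user_id')
        for source, target in self.schema['relationships']:
            graph.add_edge(source.split('.')[0], target.split('.')[0],
                           label=f'{source} -> {target}')
        os.makedirs(os.path.dirname(output_path), exist_ok=True)
        graph.layout(prog='dot')
        graph.draw(output_path)
        logger.info('ERD written to %s', output_path)
        return True
```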
### 5. Correlation Analysis

#### Test Case: Correlation Calculation

**Objective:** Confirm that correlation coefficients are calculated correctly.

**Test Code** (`tests/test_correlation.py`):
```python
import unittest

import pandas as pd

from src.correlation import CorrelationAnalysis


class TestCorrelationAnalysis(unittest.TestCase):
    def setUp(self):
        data = {
            'A': [1, 2, 3, 4, 5],
            'B': [2, 4, 6, 8, 10],
            'C': [5, 3, 6, 2, 1]
        }
        self.df = pd.DataFrame(data)
        self.correlation = CorrelationAnalysis(self.df)

    def test_calculate_correlation(self):
        corr_matrix = self.correlation.calculate()
        self.assertEqual(corr_matrix.loc['A', 'B'], 1.0)
        self.assertTrue(-1 <= corr_matrix.loc['A', 'C'] <= 1)


if __name__ == '__main__':
    unittest.main()
```

**Implementation Steps:**
- Create a `CorrelationAnalysis` class in `src/correlation.py` (sketched below).
- Implement the `calculate` method using `df.corr()`.
- Add logging to capture the correlation results.
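A minimal sketch built on `df.corr()`, as named above:

```python
# src/correlation.py -- minimal sketch
import logging

import pandas as pd

logger = logging.getLogger('dataflow')  # assumed logger name


class CorrelationAnalysis:
    def __init__(self, df: pd.DataFrame):
        self.df = df

    def calculate(self) -> pd.DataFrame:
        # Pearson correlation by default; non-numeric columns are excluded
        corr_matrix = self.df.corr(numeric_only=True)
        logger.info('Correlation matrix:\n%s', corr_matrix)
        return corr_matrix
```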
### 6. User Interaction for Decision Points

#### Test Case: User Decision on Missing Data Handling

**Objective:** Ensure the application prompts the user and handles their input correctly.

**Test Code** (`tests/test_validation.py`):
```python
from unittest.mock import patch

# Add to the TestDataValidation class in tests/test_validation.py
def test_user_decision_on_missing_data(self):
    with patch('builtins.input', return_value='impute'):
        decision = self.validation.handle_missing_data_decision()
        self.assertEqual(decision, 'impute')
```

**Implementation Steps:**
- Implement the `handle_missing_data_decision` method in `DataValidation` (sketched below).
- Use the `input()` function to prompt the user.
- Add logging to record the user's decision.
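A sketch that satisfies the mocked-`input` test above; the option names `impute`, `drop`, and `ignore` are assumptions for illustration:

```python
# Sketch of handle_missing_data_decision, added to the DataValidation class above
def handle_missing_data_decision(self) -> str:
    options = ('impute', 'drop', 'ignore')  # assumed option names
    decision = input(f'Missing data detected. Choose one of {options}: ').strip().lower()
    # Re-prompt until the user picks a supported option
    while decision not in options:
        decision = input(f'Please choose one of {options}: ').strip().lower()
    logger.info("User chose '%s' for missing data handling", decision)
    return decision
```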
### 7. Logging Component

#### Test Case: Logging Functionality

**Objective:** Verify that logging occurs at each step and logs are correctly stored.

**Test Code** (`tests/test_logger.py`):
```python
import logging
import unittest

from src.logger import setup_logging


class TestLogging(unittest.TestCase):
    def setUp(self):
        setup_logging()
        self.logger = logging.getLogger('dataflow')

    def test_logging_output(self):
        with self.assertLogs('dataflow', level='INFO') as cm:
            self.logger.info('Test log message')
        self.assertIn('Test log message', cm.output[0])


if __name__ == '__main__':
    unittest.main()
```

**Implementation Steps:**
- Create a `setup_logging` function in `src/logger.py` (sketched below).
- Configure logging to output to both console and file.
- Ensure the logging format includes timestamps and log levels.
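A minimal sketch of `setup_logging`; the log file name and the format string are assumptions:

```python
# src/logger.py -- minimal sketch
import logging
import os


def setup_logging(level: int = logging.INFO) -> None:
    os.makedirs('logs', exist_ok=True)
    logger = logging.getLogger('dataflow')
    if logger.handlers:  # avoid adding duplicate handlers on repeated calls
        return
    logger.setLevel(level)
    # Timestamps and log levels in every record, per the steps above
    formatter = logging.Formatter('%(asctime)s %(levelname)s %(name)s: %(message)s')
    for handler in (logging.StreamHandler(),
                    logging.FileHandler('logs/dataflow.log')):
        handler.setFormatter(formatter)
        logger.addHandler(handler)
```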