This project is a Python-based data profiling tool with a web interface built using Streamlit. It lets users connect to various data sources (PostgreSQL and Snowflake databases, as well as CSV files), select specific attributes (columns), and generate comprehensive profile reports.

The tool calculates a range of statistical metrics, identifies data types, detects common patterns (such as SSNs and dates of birth), and stores the resulting profiles in a designated database table. It also includes a similarity-analysis and clustering feature that groups profiled attributes by their metrics, helping users understand relationships and redundancy within their data.
- Multi-Source Connectivity: Connect to PostgreSQL, Snowflake databases, or upload CSV files.
- Attribute Selection: Interactively select schemas, tables, and specific columns for profiling across different sources.
- Comprehensive Profiling: Calculates a wide range of metrics (a pandas sketch follows this feature list):
  - Common: Record count, null count/percentage, distinct count/percentage, uniqueness.
  - Numeric: Min, max, mean, median, standard deviation, variance, quantiles, histogram data.
  - String/Text: Min/max/average length, top frequent values, pattern detection flags.
  - Date/Time: Min/max date, time range, histogram by year.
  - Boolean: True/false counts and percentages.
- Pattern Detection: Identifies potential SSN and Date of Birth columns using format and logical checks.
- Persistent Results: Stores profiling results in a user-configured database table (PostgreSQL or Snowflake). Handles overwriting previous profiles for the same attribute.
- Similarity Clustering: Performs hierarchical clustering on all stored profiles to group similar attributes based on their metrics. Cluster IDs are stored back in the results table.
- Web UI: Interactive Streamlit interface for configuration, selection, progress viewing, and results display.
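As a rough illustration of the kind of per-column checks involved, here is a minimal pandas sketch of a few of the "Common" metrics plus a simple SSN format test. The function and field names are hypothetical; the tool's actual logic lives in `src/profiling_engine.py` and `src/pattern_detector.py`.

```python
import pandas as pd

def profile_column(series: pd.Series) -> dict:
    """Hypothetical sketch of the 'Common' metrics described above."""
    total = len(series)
    nulls = int(series.isna().sum())
    distinct = int(series.nunique(dropna=True))
    metrics = {
        "record_count": total,
        "null_count": nulls,
        "null_pct": 100.0 * nulls / total if total else 0.0,
        "distinct_count": distinct,
        "distinct_pct": 100.0 * distinct / total if total else 0.0,
        "is_unique": distinct == total - nulls,
    }
    # Crude SSN flag: share of non-null values matching NNN-NN-NNNN.
    # The real detector also applies logical checks, not just format.
    values = series.dropna().astype(str)
    if not values.empty:
        metrics["ssn_match_pct"] = 100.0 * values.str.match(
            r"^\d{3}-\d{2}-\d{4}$"
        ).mean()
    return metrics

print(profile_column(pd.Series(["123-45-6789", "987-65-4321", None])))
```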
```
.
├── app.py                    # Main Streamlit application script
├── requirements.txt          # Project dependencies
├── PROJECT_PLAN.md           # Original architectural plan
├── LICENSE                   # Apache 2.0 License file
├── README.md                 # This file
└── src/                      # Source code directory
    ├── __init__.py
    ├── app_logic.py          # Core application workflow logic (profiling job)
    ├── clustering_engine.py  # Handles attribute clustering
    ├── config_manager.py     # Manages application configuration (connections, table names)
    ├── database_connector.py # Handles connections to DBs and CSV reading
    ├── pattern_detector.py   # Logic for detecting SSN, DOB, etc.
    ├── profiling_engine.py   # Core engine for calculating profile metrics
    ├── results_manager.py    # Handles saving/retrieving results from the database
    └── ui_components.py      # Reusable Streamlit UI elements (forms, displays)
```
- Clone the repository:
  ```bash
  git clone https://github.com/rejozacharia/dataprofiler
  cd dataprofiler
  ```
- Create a virtual environment (recommended):
  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows use `venv\Scripts\activate`
  ```
- Install dependencies:
  ```bash
  pip install -r requirements.txt
  ```
  Note: Depending on your system and database choices, you might need additional system libraries or drivers (e.g., for PostgreSQL).
- Run the Streamlit application:
  ```bash
  streamlit run app.py
  ```
- Open your web browser to the local URL provided by Streamlit (usually `http://localhost:8501`).
- Configure Connections:
  - Use the sidebar to select the Source Data type (Database or CSV).
  - Enter the required connection details for your source database, or upload a CSV file.
  - Configure the Results Storage database connection details and specify the desired table name (or use the default). You can opt to reuse the source connection if applicable. (Typical connection URLs are sketched after these steps.)
  - Click "Connect to Source DB" and "Connect to Results DB".
- Select Attributes:
  - Use the selectors in the main area to choose the schema, table, and columns (for databases) or just columns (for CSV).
  - Click "Add Columns to List" to add selected attributes (or all if none are selected) to the profiling queue.
  - View and manage the selected attributes in the "Attributes Selected for Profiling" expander.
- Run Profiling:
- Click the "Start Profiling Listed Attributes" button. Progress will be shown.
- View the generated profiles in the "Profiling Results" section. Results (including any errors) are saved to the configured results database table.
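Because the results live in an ordinary table, you can also inspect them outside the UI. A minimal sketch, assuming a PostgreSQL results store and a results table named `data_profiles` (both placeholders; match your own configuration):

```python
import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string and table name -- match your configuration.
engine = create_engine("postgresql://user:password@localhost:5432/mydb")
profiles = pd.read_sql("SELECT * FROM data_profiles", engine)
print(profiles.head())
```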
- Run Clustering:
- Adjust the "Distance Threshold" if desired.
- Click "Run Clustering on All Stored Profiles".
- View the cluster assignments and the updated profiles table in the "Clustering Analysis" section.
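Conceptually, the clustering step works like the SciPy sketch below: metric vectors are put on a common scale, linked hierarchically, and cut at a distance threshold. The values, linkage method, and threshold here are illustrative; the actual implementation is in `src/clustering_engine.py` and may differ in preprocessing and metric selection.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.stats import zscore

# Toy metric matrix: one row per stored attribute profile, one column
# per metric (e.g. null %, distinct %, mean length). Values are made up.
metrics = np.array([
    [0.0, 99.8, 9.0],   # looks like an ID column
    [0.1, 99.5, 9.1],   # very similar profile -> likely same cluster
    [35.0, 2.0, 4.2],   # sparse, low-cardinality column
])

scaled = zscore(metrics, axis=0)                   # put metrics on one scale
Z = linkage(scaled, method="ward")                 # hierarchical linkage
labels = fcluster(Z, t=1.5, criterion="distance")  # cut at the threshold
print(labels)  # e.g. [1 1 2]: the first two attributes group together
```

With `criterion="distance"`, a lower threshold yields more, tighter clusters, while a higher one merges attributes more aggressively, which is the intuition behind the "Distance Threshold" control.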
This project is licensed under the Apache License, Version 2.0. See the LICENSE file for details.