Modular data segmentation using PCA and K-Means in Python
kmflow is a lightweight Python package and CLI toolset for exploring, segmenting, and visualizing tabular data via principal-component analysis (PCA) and K-Means clustering. It also ships with a Streamlit dashboard (app/app.py) for interactive analysis.
Key features:
- CLI workflows for preprocessing, clustering, PCA, evaluation & plotting
- Streamlit dashboard for point-and-click analysis and download of results
- Modular utils for wrangling, scaling, PCA summary, cluster prep, benchmarks
- Cookiecutter-inspired layout that’s been customized for clarity and flexibility
- Clone the repo

git clone git@github.com:you/kmflow.git
cd kmflow
- Create & activate a virtual environment

# Create venv
python -m venv .venv

# Activate
source .venv/bin/activate      # macOS / Linux
source .venv/Scripts/activate  # Windows
- Install the kmflow package (editable mode)

pip install -e .
- Install dependencies

pip install -r requirements.txt

# If pip is out of date, upgrade it first and then install dependencies
pip install -U pip
- Run the dashboard if you want a more user-friendly experience (Optional)

cd app
streamlit run app.py
This will launch and host the dashboard on your local machine. Alternatively, you can skip hosting it locally and use the hosted version by following this link: K-Means Clustering Dashboard
If you see this message after clicking the link directly above:
This app has gone to sleep due to inactivity. Would you like to wake it back up?
It's completely fine to click:
Yes, get this app back up!
At this time, I'm not paying for a Streamlit account, and the free tier doesn't keep the app awake for long when no one is using it.
- Check out the Description below for a detailed breakdown and examples of how to use the CLI if you'd like to go beyond the dashboard experience.
├── app/
│ └── app.py # Streamlit dashboard entrypoint
├── docs/ # A default mkdocs project
│ └── mkdocs.md
├── kmflow/ # Python package
│ ├── cli/ # Command-line interface
│ ├── utils/ # Helper modules (wrangle, scale, PCA, plots, eval)
│ └── config.py # Global constants & defaults
├── main/
│ └── entrypoint.py # Console-script entry point: builds the `kmflow` Typer app by importing and registering all subcommands from kmflow/cli
├── tests/
│ └── test_data.py # Basic unit tests
├── LICENSE
├── Makefile # `make data`, `make train`, etc.
├── pyproject.toml # Packaging + tool config (black, isort, etc.)
├── README.md # You are here
└── requirements.txt # Pinned dependencies
- Three different `__init__.py` files:
  - Top-level
  - util
  - CLI
| File | Purpose |
|---|---|
| `kmflow/__init__.py` | Package initializer—defines `__version__`, exposes the top-level API, and makes kmflow importable. |
| `kmflow/cli/__init__.py` | CLI entry-point setup—imports and registers all subcommands into a single Typer app. |
| `kmflow/utils/__init__.py` | Utility namespace—collects and exposes core helper functions (`wrangle`, `kmeans`, `pca`, etc.) for easy import. |
- Purpose: basic data wrangling and IQR-based outlier handling as a prelude to scaling and clustering.
- CLI (`wrangle.py`): add flags to invoke any custom cleaning steps you write, and use the built-in flags to detect or remove interquartile-range outliers.
- Utils (`wrangle_utils.py`): core functions for:
  - simple cleaning helpers (e.g. drop columns, drop rows)
  - IQR outlier detection & optional removal (a rough sketch of the idea appears at the end of this section)
- Extensibility: these modules form a scaffold—you’re expected to add dataset-specific wrangling functions (e.g. custom parsers, imputation routines) and register them in the CLI so unique “dirty” cases can be handled.
- When to use: once your tabular data is in the general shape you need (e.g. no missing values), run the IQR outlier detection & removal steps.

Note: `--export-outliers` and `--remove-outliers` cannot be used in the same command.
- Export outliers into a CSV file using raw data:
cat data/raw/raw_data.csv \
| kmflow wrangle outlier \
--export-outliers \
> exported_outliers.csv
You should then see a new CSV file inside your project with the name you assigned. Upon opening the CSV file, you should see something like this (column names, row_index values, and outlier_value entries will differ depending on your data):
row_index,column,outlier_value
2,headsize,125.0
7,balance,-0.375
23,length,27.6
- Remove outliers and export reduced data into a predetermined destination for holding results:
cat data/raw/raw_data.csv \
| kmflow wrangle outlier \
--remove-outliers \
> data/interim/removed_outliers.csv
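For context, the IQR rule used here is the standard one: flag values below Q1 − 1.5×IQR or above Q3 + 1.5×IQR. Below is a minimal Python sketch of that idea; it is not kmflow's actual implementation, and the file path and 1.5 factor are placeholder assumptions.

```python
import pandas as pd

def iqr_outliers(df: pd.DataFrame, factor: float = 1.5) -> pd.DataFrame:
    """Return a tidy (row_index, column, outlier_value) table using the IQR rule."""
    records = []
    for col in df.select_dtypes("number").columns:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        lower, upper = q1 - factor * iqr, q3 + factor * iqr
        mask = (df[col] < lower) | (df[col] > upper)
        for idx, value in df.loc[mask, col].items():
            records.append({"row_index": idx, "column": col, "outlier_value": value})
    return pd.DataFrame(records, columns=["row_index", "column", "outlier_value"])

df = pd.read_csv("data/raw/raw_data.csv")                  # placeholder input path
outliers = iqr_outliers(df)                                # the "export outliers" idea
cleaned = df.drop(index=outliers["row_index"].unique())    # the "remove outliers" idea
```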
- Purpose: A suite of diagnostic and exploratory visualizations for clustering and PCA results, from 2D/3D biplots to inertia and silhouette charts.
- CLI (`plots.py`): Exposes commands like `biplot`, `3d-biplot`, `inertia`, `silhouette`, etc. Reads input via stdin, accepts plot-specific flags (e.g. `--hue`, `--skip-scores`, etc.), and writes output to disk as a properly named `.png`, or displays the plot interactively (when `--no-save` is used).
- Utils (`plots_utils.py`): Core plotting functions (using matplotlib, seaborn, and plotly under the hood) that handle data formatting, layout/styling, and figure saving. Includes shared helpers for colors, themes, and axis formatting.
- Create and display an interactive histogram (not saved to disk):
cat data/processed/processed_data.csv \
| kmflow plots histogram \
<COLUMN> \
--bins 50 \
--no-save
- Create and write a histogram to a predetermined destination:
cat data/processed/processed_data.csv \
| kmflow plots histogram \
<COLUMN> \
--bins 25 \
> reports/figures/histogram.png
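Under the hood, a histogram command like the one above boils down to a seaborn/matplotlib call. A rough sketch of the equivalent is shown below; the column name and output path are hypothetical, and this is not `plots_utils.py` verbatim.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("data/processed/processed_data.csv")      # placeholder path

fig, ax = plt.subplots(figsize=(8, 5))
sns.histplot(data=df, x="balance", bins=50, ax=ax)          # "balance" is a hypothetical column
ax.set_title("Distribution of balance")

# --no-save style: show the figure interactively instead of writing a file
plt.show()

# otherwise, write a named .png to the figures folder
fig.savefig("reports/figures/histogram.png", dpi=150, bbox_inches="tight")
```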
- Purpose: Apply configurable scaling and transformations (normalization, standardization, min–max, log1p, Yeo–Johnson) to prepare data for PCA and K-Means.
- CLI (`process.py`): Accepts flags for one or more transformations (e.g. `--standardize`, `--minmax`, `--log1p`, `--yeo-johnson`), and writes the transformed DataFrame to disk or stdout.
- Utils (`process_utils.py`): Implements core scaling functions, handles column selection and transformer fitting, and integrates with I/O helpers for seamless DataFrame processing.
- When to use: If your data isn't already on a similar scale, run this immediately after wrangling so features are comparable before PCA or K-Means clustering.
- Standardize data and write to disk:

cat data/interim/preprocessed_data.csv \
| kmflow process \
--std \
> processed_data.csv
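Each of these transformations corresponds to a standard scikit-learn/NumPy transformer. The sketch below shows roughly what they do; the paths and the "all numeric columns" selection are assumptions, not the package's exact behavior.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, PowerTransformer, StandardScaler

df = pd.read_csv("data/interim/preprocessed_data.csv")      # placeholder path
num_cols = df.select_dtypes("number").columns

standardized = df.copy()
standardized[num_cols] = StandardScaler().fit_transform(df[num_cols])               # standardization

minmaxed = df.copy()
minmaxed[num_cols] = MinMaxScaler().fit_transform(df[num_cols])                      # min-max scaling

logged = df.copy()
logged[num_cols] = np.log1p(df[num_cols])                                            # log1p (non-negative data)

yeo = df.copy()
yeo[num_cols] = PowerTransformer(method="yeo-johnson").fit_transform(df[num_cols])   # Yeo-Johnson

standardized.to_csv("processed_data.csv", index=False)
```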
- Purpose: Perform principal component analysis to reduce dimensionality and extract component scores, loadings, proportion of variance explained, and cumulative variance explained.
- CLI (`pca.py`): Accepts flags like `--n-components <int>`, and writes a folder containing the PCA summary CSV files.
- Utils (`pca_utils.py`): Implements PCA decomposition (using scikit-learn), computes scores, loadings, proportion of variance, and cumulative variance explained (a rough sketch follows the example below).
- When to use: After preprocessing (wrangling and scaling) — and optionally clustering — to explore principal components and feed them into downstream analyses.
- Run PCA with a seed, on 3 specific columns in a processed dataset, and write to disk:

cat data/processed/DIR/std.csv \
| kmflow pca \
--seed 3429 \
--numeric-cols "<COLUMN>, <COLUMN2>, <COLUMN3>" \
--n-components 3 \
> processed_pca.csv
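The PCA summary outputs map onto standard scikit-learn attributes: scores come from `fit_transform`, loadings from `components_`, and the variance tables from `explained_variance_ratio_`. A hedged sketch of that idea, with hypothetical column names and a placeholder path:

```python
import pandas as pd
from sklearn.decomposition import PCA

df = pd.read_csv("data/processed/DIR/std.csv")              # placeholder path
cols = ["col_a", "col_b", "col_c"]                          # hypothetical numeric columns

pca = PCA(n_components=3, random_state=3429)
scores = pd.DataFrame(
    pca.fit_transform(df[cols]),
    columns=[f"PC{i + 1}" for i in range(pca.n_components_)],
)
loadings = pd.DataFrame(pca.components_.T, index=cols, columns=scores.columns)
variance = pd.DataFrame(
    {
        "proportion_variance": pca.explained_variance_ratio_,
        "cumulative_variance": pca.explained_variance_ratio_.cumsum(),
    },
    index=scores.columns,
)
```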
- Purpose: Fit K-Means clustering to your data, assigning cluster labels and optionally running batch fits over multiple k values.
- CLI (`kmeans.py`): Accepts clustering flags (`--n-clusters`, `--init-method`, `--seed`, etc.), and writes the labeled DataFrame to disk or stdout.
- Utils (`kmeans_utils.py`): Core functions that wrap scikit-learn’s KMeans, handle single- or multi-k fits, manage random-state reproducibility, and append cluster labels to the DataFrame (a rough sketch follows the examples below).
- When to use: Run after data has been wrangled and (optionally) scaled, to segment it into groups before evaluation or plotting.
- Run k-means on a processed dataset and write to disk, using 8 clusters and a seed, with the default scikit-learn KMeans initialization method:

cat data/processed/std_outliers/std.csv \
| kmflow kmeans fit-km \
8 \
--seed 123 \
--init k-means++ \
> clustered_8.csv
- Run k-means on a processed dataset and write to a predetermined destination, using 8 clusters and a seed, with the algorithm set to `elkan`, the initialization method set to `random`, and the number of initializations (`--n-init`) set to 100:

cat data/processed/std_outliers/std.csv \
| kmflow kmeans fit-km \
8 \
--seed 9250 \
--algorithm elkan \
--init random \
--n-init 100 \
> data/processed/std_outliers/algo_elkan_init_random/clustered_8.csv
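Conceptually, this wraps `sklearn.cluster.KMeans`: fit on the numeric columns, then append the labels as a new column. A rough single-k sketch follows; the paths, the label column name, and the algorithm choice are illustrative assumptions, not the package's exact code.

```python
import pandas as pd
from sklearn.cluster import KMeans

df = pd.read_csv("data/processed/std_outliers/std.csv")     # placeholder path

km = KMeans(
    n_clusters=8,
    init="k-means++",    # or "random"
    n_init=10,           # number of re-initializations (not iterations)
    algorithm="lloyd",   # or "elkan"
    random_state=123,    # reproducibility, i.e. the seed
)
labels = km.fit_predict(df.select_dtypes("number"))

clustered = df.assign(cluster_8=labels)                      # hypothetical label column name
clustered.to_csv("clustered_8.csv", index=False)
```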
- Purpose: Compute/benchmark clustering evaluation metrics (inertia, silhouette score, Calinski–Harabasz index, Davies–Bouldin score).
- CLI (`evaluation.py`): Accepts flags similar to those in `kmeans.py`.
- Utils (`evaluation_utils.py`): Core functions that calculate each metric, assemble them into a tidy DataFrame, and hand off to I/O helpers.
- When to use: Immediately after all clustering is finished, to generate a quantitative report (see the "Benchmarking Multiple Metrics" section below).
- Run and write a silhouette score analysis using scikit-learn KMeans with a seed, the `elkan` algorithm, `random` initialization, and the number of initializations (`--n-init`) set to 50:

cat data/processed/std_outliers/std.csv \
| kmflow evaluate silhouette \
--seed 2985 \
--algorithm elkan \
--init random \
--n-init 50 \
> data/processed/std_outliers/algo_elkan_init_random/std_silhouette.csv
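The metrics themselves are the standard scikit-learn implementations. Below is a small sketch of how such a tidy per-k metrics table could be assembled; the k range and paths are assumptions, not the CLI's defaults.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)

X = pd.read_csv("data/processed/std_outliers/std.csv").select_dtypes("number")  # placeholder path

rows = []
for k in range(2, 11):                                       # assumed k range
    labels = KMeans(n_clusters=k, n_init=50, random_state=2985).fit_predict(X)
    rows.append(
        {
            "k": k,
            "silhouette": silhouette_score(X, labels),
            "calinski_harabasz": calinski_harabasz_score(X, labels),
            "davies_bouldin": davies_bouldin_score(X, labels),
        }
    )

metrics = pd.DataFrame(rows)                                 # tidy table, one row per k
```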
kmflow also provides a `benchmark` command inside `evaluation.py` that finds Calinski–Harabasz and Davies–Bouldin CSVs, merges them into one table, and writes to stdout or disk. It uses this regex to discover result folders:

`r"^algo_([^_]+)_init_(.+)$"`
data/processed_root/
└── <variant>/
    └── kmeans_<pipeline>/
        └── algo_<algorithm>_init_<init>/
            ├── *_calinski.csv
            └── *_davies.csv
- `<variant>` = your processing variant (e.g. `std`, etc.)
- `algo_<algorithm>_init_<init>` must match the regex above
- Clustering evaluation metric files must end in `_calinski.csv` or `_davies.csv`
kmflow evaluate benchmark <input_dir> [flag]
- `<input_dir>` = subfolder under `data/` (e.g. `processed`)
- `[flag]`: `--decimals` = round metric values (default 3 decimal places)
- Evaluate and write to stdout:

kmflow evaluate benchmark processed \
--decimals 4
- Write to disk:

kmflow evaluate benchmark processed \
> benchmark.csv
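The discovery/merge step amounts to walking the input directory, matching folder names against the regex above, and concatenating the metric CSVs. A loose sketch of that logic, with hypothetical column handling (not the actual `benchmark` code):

```python
import re
from pathlib import Path

import pandas as pd

FOLDER_RE = re.compile(r"^algo_([^_]+)_init_(.+)$")

frames = []
for folder in Path("data/processed").rglob("*"):             # <input_dir> = processed
    match = FOLDER_RE.match(folder.name)
    if not (folder.is_dir() and match):
        continue
    algorithm, init = match.groups()
    for csv in sorted(folder.glob("*_calinski.csv")) + sorted(folder.glob("*_davies.csv")):
        df = pd.read_csv(csv)
        df["algorithm"], df["init"], df["source"] = algorithm, init, csv.name
        frames.append(df)

benchmark = pd.concat(frames, ignore_index=True).round(3)    # --decimals, default 3
```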
- Purpose: Merge raw and clustered DataFrames to produce per-cluster summary statistics and label counts.
- CLI (`cluster_prep.py`):
  - `cluster-profiles`: reads a raw CSV and a clustered CSV, accepts `--cluster-col` (and an optional `--key-col` to join on a key column), and writes a per-cluster profile table.
  - `map-clusters`: reads a clustered CSV, prompts interactively to map integer IDs to human labels, counts each label, and writes the counts table.
- Utils (`cluster_prep_utils.py`):
  - `merge_cluster_labels()` aligns raw & clustered rows by index (or key column).
  - `get_cluster_profiles()` computes summary statistics (mean, median, min, max) per cluster.
  - `clusters_to_labels()` and `count_labels()` map IDs to labels and tally counts.
- When to use: After K-Means clustering, to turn raw + cluster outputs into interpretable summaries and human-readable labels (a rough sketch appears at the end of this section).
If your data is not on a similar scale, K-Means and PCA won't perform well, but scaled data loses its original units—making it hard to interpret clusters in context.
The `cluster_prep.py` tools let you reattach cluster assignments to the unscaled raw data so you can:
- Compute summary statistics in the original measurement units
- Label clusters meaningfully based on real-world values
- Count and compare cluster sizes accurately
- Preserve the benefits of scaling for analysis while keeping outputs interpretable
Note: While stdin -> stdout piping is supported, I highly recommend using explicit file arguments as shown below.
- Write to a file:

kmflow cluster-prep cluster-profiles \
data/raw/DIR/example.csv \
data/processed/DIR/variant/algo_<algorithm>_init_<init>/std_clustered_5.csv \
cluster_5 \
> data/external/cluster_profile.csv
- Write to stdout:

kmflow cluster-prep cluster-profiles \
data/raw/DIR/example.csv \
data/processed/DIR/variant/algo_<algorithm>_init_<init>/std_clustered_5.csv \
cluster_5
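The underlying idea is simply to reattach labels to the unscaled rows and aggregate per cluster. A minimal sketch of that idea, with placeholder paths, a hypothetical `cluster_5` column, and made-up label names:

```python
import pandas as pd

raw = pd.read_csv("data/raw/DIR/example.csv")                          # unscaled data, original units
clustered = pd.read_csv("data/processed/DIR/std_clustered_5.csv")      # placeholder clustered output

# Reattach labels to the raw rows by position/index; a --key-col style join would use merge() instead
labeled = raw.assign(cluster_5=clustered["cluster_5"].values)

# Per-cluster summary statistics, in the original measurement units
num_cols = raw.select_dtypes("number").columns
profiles = labeled.groupby("cluster_5")[num_cols].agg(["mean", "median", "min", "max"])

# Map integer IDs to human-readable labels and count cluster sizes
name_map = {0: "budget", 1: "mid-range", 2: "premium", 3: "pro", 4: "elite"}   # made-up labels
counts = labeled["cluster_5"].map(name_map).value_counts()
```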