Modular data segmentation using PCA and K-Means in Python
kmflow is a lightweight Python package and CLI toolset for exploring, segmenting, and visualizing tabular data via principal-component analysis (PCA) and K-Means clustering. It also ships with a Streamlit dashboard (app/app.py) for interactive analysis.
Key features:
- CLI workflows for preprocessing, clustering, PCA, evaluation & plotting
- Streamlit dashboard for point-and-click analysis and download of results
- Modular utils for wrangling, scaling, PCA summary, cluster prep, benchmarks
- Cookiecutter-inspired layout that’s been customized for clarity and flexibility
- Clone the repo

git clone git@github.com:you/kmflow.git
cd kmflow
- Create & activate a virtual environment

# Create venv
python -m venv .venv

# Activate
source .venv/bin/activate      # macOS / Linux
source .venv/Scripts/activate  # Windows
- Install the kmflow package (editable mode)

pip install -e .
- Install dependencies

pip install -r requirements.txt

# If pip is out of date, upgrade it first and then install dependencies
pip install -U pip
- Run the dashboard if you want a more user-friendly experience (Optional)

cd app
streamlit run app.py
This will launch and host the dashboard on your local machine. Alternatively, you can skip hosting it locally and use the hosted version by following this link: K-Means Clustering Dashboard
If you see this message after clicking the link directly above:
This app has gone to sleep due to inactivity. Would you like to wake it back up?
It's completely fine to click:
Yes, get this app back up!
At this time, I'm not paying for a Streamlit account, and the free tier doesn't keep the app awake for long when no one is using it.
- Check out the Description below for a detailed breakdown and examples of how to use the CLI if you'd like to go beyond the dashboard experience.
├── app/
│ └── app.py # Streamlit dashboard entrypoint
├── docs/ # A default mkdocs project
│ └── mkdocs.md
├── kmflow/ # Python package
│ ├── cli/ # Command-line interface
│ ├── utils/ # Helper modules (wrangle, scale, PCA, plots, eval)
│ └── config.py # Global constants & defaults
├── main/
│ └── entrypoint.py # Console-script entry point: builds the `kmflow` Typer app by importing and registering all subcommands from kmflow/cli
├── tests/
│ └── test_data.py # Basic unit tests
├── LICENSE
├── Makefile # `make data`, `make train`, etc.
├── pyproject.toml # Packaging + tool config (black, isort, etc.)
├── README.md # You are here
└── requirements.txt # Pinned dependencies
- Three different `__init__.py` files:
  - Top-level
  - util
  - CLI
| File | Purpose |
|---|---|
| `kmflow/__init__.py` | Package initializer—defines `__version__`, exposes the top-level API, and makes kmflow importable. |
| `kmflow/cli/__init__.py` | CLI entry-point setup—imports and registers all subcommands into a single Typer app. |
| `kmflow/utils/__init__.py` | Utility namespace—collects and exposes core helper functions (`wrangle`, `kmeans`, `pca`, etc.) for easy import. |
- Purpose: basic data wrangling and IQR-based outlier handling as a prelude to scaling and clustering.
- CLI (`wrangle.py`): add flags to invoke any custom cleaning steps you write, and use the built-in flags to detect or remove interquartile-range outliers.
- Utils (`wrangle_utils.py`): core functions for:
  - simple cleaning helpers (e.g. drop columns, drop rows)
  - IQR outlier detection & optional removal (a rough sketch of the idea appears at the end of this section)
- Extensibility: these modules form a scaffold—you’re expected to add dataset-specific wrangling functions (e.g. custom parsers, imputation routines) and register them in the CLI so unique “dirty” cases can be handled.
- When to use: once your tabular data is in the general shape you need (e.g. no missing values), run the IQR outlier detection & removal steps.

Note: `--export-outliers` and `--remove-outliers` cannot be used in the same command.
- Export outliers into a CSV file using raw data:
cat data/raw/raw_data.csv \
| kmflow wrangle outlier \
--export-outliers \
> exported_outliers.csv
You should then see a new CSV file inside your project with the name you assigned. Upon opening the CSV file, you should see something like this (column names, row_index values, and outlier_value entries will differ depending on your data):
row_index,column,outlier_value
2,headsize,125.0
7,balance,-0.375
23,length,27.6
- Remove outliers and export reduced data into a predetermined destination for holding results:
cat data/raw/raw_data.csv \
| kmflow wrangle outlier \
--remove-outliers \
> data/interim/removed_outliers.csv
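For context, the IQR rule used here is the standard one: flag values below Q1 − 1.5×IQR or above Q3 + 1.5×IQR. Below is a minimal Python sketch of that idea; it is not kmflow's actual implementation, and the file path and 1.5 factor are placeholder assumptions.

```python
import pandas as pd

def iqr_outliers(df: pd.DataFrame, factor: float = 1.5) -> pd.DataFrame:
    """Return a tidy (row_index, column, outlier_value) table using the IQR rule."""
    records = []
    for col in df.select_dtypes("number").columns:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        lower, upper = q1 - factor * iqr, q3 + factor * iqr
        mask = (df[col] < lower) | (df[col] > upper)
        for idx, value in df.loc[mask, col].items():
            records.append({"row_index": idx, "column": col, "outlier_value": value})
    return pd.DataFrame(records, columns=["row_index", "column", "outlier_value"])

df = pd.read_csv("data/raw/raw_data.csv")                  # placeholder input path
outliers = iqr_outliers(df)                                # the "export outliers" idea
cleaned = df.drop(index=outliers["row_index"].unique())    # the "remove outliers" idea
```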
- Purpose: A suite of diagnostic and exploratory visualizations for clustering and PCA results, from 2D/3D biplots to inertia and silhouette charts.
- CLI (`plots.py`): Exposes commands like `biplot`, `3d-biplot`, `inertia`, `silhouette`, etc. Reads input via stdin, accepts plot-specific flags (e.g. `--hue`, `--skip-scores`, etc.), and writes output to disk as a properly named `.png`, or displays the plot interactively (when `--no-save` is used).
- Utils (`plots_utils.py`): Core plotting functions (using matplotlib, seaborn, and plotly under the hood) that handle data formatting, layout/styling, and figure saving. Includes shared helpers for colors, themes, and axis formatting.
- Create and display an interactive histogram (not saved to disk):
cat data/processed/processed_data.csv \
| kmflow plots histogram \
<COLUMN> \
--bins 50 \
--no-save
- Create and write a histogram to a predetermined destination:
cat data/processed/processed_data.csv \
| kmflow plots histogram \
<COLUMN> \
--bins 25 \
> reports/figures/histogram.png
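Under the hood, a histogram command like the one above boils down to a seaborn/matplotlib call. A rough sketch of the equivalent is shown below; the column name and output path are hypothetical, and this is not `plots_utils.py` verbatim.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("data/processed/processed_data.csv")      # placeholder path

fig, ax = plt.subplots(figsize=(8, 5))
sns.histplot(data=df, x="balance", bins=50, ax=ax)          # "balance" is a hypothetical column
ax.set_title("Distribution of balance")

# --no-save style: show the figure interactively instead of writing a file
plt.show()

# otherwise, write a named .png to the figures folder
fig.savefig("reports/figures/histogram.png", dpi=150, bbox_inches="tight")
```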
- Purpose: Apply configurable scaling and transformations (normalization, standardization, min–max, log1p, Yeo–Johnson) to prepare data for PCA and K-Means.
- CLI (`process.py`): Accepts flags for one or more transformations (e.g. `--standardize`, `--minmax`, `--log1p`, `--yeo-johnson`), and writes the transformed DataFrame to disk or stdout.
- Utils (`process_utils.py`): Implements core scaling functions, handles column selection and transformer fitting, and integrates with I/O helpers for seamless DataFrame processing.
- When to use: If your data isn't already on a similar scale, run this immediately after wrangling so features are comparable before PCA or K-Means clustering.
- Standardize data and write to disk:

cat data/interim/preprocessed_data.csv \
| kmflow process \
--std \
> processed_data.csv
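Each of these transformations corresponds to a standard scikit-learn/NumPy transformer. The sketch below shows roughly what they do; the paths and the "all numeric columns" selection are assumptions, not the package's exact behavior.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, PowerTransformer, StandardScaler

df = pd.read_csv("data/interim/preprocessed_data.csv")      # placeholder path
num_cols = df.select_dtypes("number").columns

standardized = df.copy()
standardized[num_cols] = StandardScaler().fit_transform(df[num_cols])               # standardization

minmaxed = df.copy()
minmaxed[num_cols] = MinMaxScaler().fit_transform(df[num_cols])                      # min-max scaling

logged = df.copy()
logged[num_cols] = np.log1p(df[num_cols])                                            # log1p (non-negative data)

yeo = df.copy()
yeo[num_cols] = PowerTransformer(method="yeo-johnson").fit_transform(df[num_cols])   # Yeo-Johnson

standardized.to_csv("processed_data.csv", index=False)
```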
- Purpose: Perform principal component analysis to reduce dimensionality and extract component scores, loadings, proportion of variance explained, and cumulative variance explained.
- CLI (`pca.py`): Accepts flags like `--n-components <int>`, and writes a folder containing the PCA summary CSV files.
- Utils (`pca_utils.py`): Implements PCA decomposition (using scikit-learn), computes scores, loadings, proportion of variance, and cumulative variance explained (a rough sketch follows the example below).
- When to use: After preprocessing (wrangling and scaling) — and optionally clustering — to explore principal components and feed them into downstream analyses.
- Run PCA with a seed, on 3 specific columns in a processed dataset, and write to disk:

cat data/processed/DIR/std.csv \
| kmflow pca \
--seed 3429 \
--numeric-cols "<COLUMN>, <COLUMN2>, <COLUMN3>" \
--n-components 3 \
> processed_pca.csv
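The PCA summary outputs map onto standard scikit-learn attributes: scores come from `fit_transform`, loadings from `components_`, and the variance tables from `explained_variance_ratio_`. A hedged sketch of that idea, with hypothetical column names and a placeholder path:

```python
import pandas as pd
from sklearn.decomposition import PCA

df = pd.read_csv("data/processed/DIR/std.csv")              # placeholder path
cols = ["col_a", "col_b", "col_c"]                          # hypothetical numeric columns

pca = PCA(n_components=3, random_state=3429)
scores = pd.DataFrame(
    pca.fit_transform(df[cols]),
    columns=[f"PC{i + 1}" for i in range(pca.n_components_)],
)
loadings = pd.DataFrame(pca.components_.T, index=cols, columns=scores.columns)
variance = pd.DataFrame(
    {
        "proportion_variance": pca.explained_variance_ratio_,
        "cumulative_variance": pca.explained_variance_ratio_.cumsum(),
    },
    index=scores.columns,
)
```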
- Purpose: Fit K-Means clustering to your data, assigning cluster labels and optionally running batch fits over multiple k values.
- CLI (`kmeans.py`): Accepts clustering flags (`--n-clusters`, `--init-method`, `--seed`, etc.), and writes the labeled DataFrame to disk or stdout.
- Utils (`kmeans_utils.py`): Core functions that wrap scikit-learn’s KMeans, handle single- or multi-k fits, manage random-state reproducibility, and append cluster labels to the DataFrame (a rough sketch follows the examples below).
- When to use: Run after data has been wrangled and (optionally) scaled, to segment it into groups before evaluation or plotting.
- Run k-means on a processed dataset and write to disk, using 8 clusters and a seed, with the default scikit-learn KMeans initialization method:

cat data/processed/std_outliers/std.csv \
| kmflow kmeans fit-km \
8 \
--seed 123 \
--init k-means++ \
> clustered_8.csv
- Run k-means on a processed dataset and write to a predetermined destination, using 8 clusters and a seed, with the algorithm set to `elkan`, the initialization method set to `random`, and the number of initializations (`--n-init`) set to 100:

cat data/processed/std_outliers/std.csv \
| kmflow kmeans fit-km \
8 \
--seed 9250 \
--algorithm elkan \
--init random \
--n-init 100 \
> data/processed/std_outliers/algo_elkan_init_random/clustered_8.csv
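Conceptually, this wraps `sklearn.cluster.KMeans`: fit on the numeric columns, then append the labels as a new column. A rough single-k sketch follows; the paths, the label column name, and the algorithm choice are illustrative assumptions, not the package's exact code.

```python
import pandas as pd
from sklearn.cluster import KMeans

df = pd.read_csv("data/processed/std_outliers/std.csv")     # placeholder path

km = KMeans(
    n_clusters=8,
    init="k-means++",    # or "random"
    n_init=10,           # number of re-initializations (not iterations)
    algorithm="lloyd",   # or "elkan"
    random_state=123,    # reproducibility, i.e. the seed
)
labels = km.fit_predict(df.select_dtypes("number"))

clustered = df.assign(cluster_8=labels)                      # hypothetical label column name
clustered.to_csv("clustered_8.csv", index=False)
```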
- Purpose: Compute/benchmark clustering evaluation metrics (inertia, silhouette score, Calinski–Harabasz index, Davies–Bouldin score).
- CLI (`evaluation.py`): Accepts flags similar to those in `kmeans.py`.
- Utils (`evaluation_utils.py`): Core functions that calculate each metric, assemble them into a tidy DataFrame, and hand off to I/O helpers.
- When to use: Immediately after all clustering is finished, to generate a quantitative report (see the "Benchmarking Multiple Metrics" section below).
- Run and write a silhouette score analysis using scikit-learn KMeans with a seed, the `elkan` algorithm, `random` initialization, and the number of initializations (`--n-init`) set to 50:

cat data/processed/std_outliers/std.csv \
| kmflow evaluate silhouette \
--seed 2985 \
--algorithm elkan \
--init random \
--n-init 50 \
> data/processed/std_outliers/algo_elkan_init_random/std_silhouette.csv
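The metrics themselves are the standard scikit-learn implementations. Below is a small sketch of how such a tidy per-k metrics table could be assembled; the k range and paths are assumptions, not the CLI's defaults.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)

X = pd.read_csv("data/processed/std_outliers/std.csv").select_dtypes("number")  # placeholder path

rows = []
for k in range(2, 11):                                       # assumed k range
    labels = KMeans(n_clusters=k, n_init=50, random_state=2985).fit_predict(X)
    rows.append(
        {
            "k": k,
            "silhouette": silhouette_score(X, labels),
            "calinski_harabasz": calinski_harabasz_score(X, labels),
            "davies_bouldin": davies_bouldin_score(X, labels),
        }
    )

metrics = pd.DataFrame(rows)                                 # tidy table, one row per k
```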
kmflow also provides a `benchmark` command inside `evaluation.py` that finds Calinski–Harabasz and Davies–Bouldin CSVs, merges them into one table, and writes to stdout or disk. It uses this regex to discover result folders:

`r"^algo_([^_]+)_init_(.+)$"`
data/processed_root/
└── <variant>/
    └── kmeans_<pipeline>/
        └── algo_<algorithm>_init_<init>/
            ├── *_calinski.csv
            └── *_davies.csv
- `<variant>` = your processing variant (e.g. `std`, etc.)
- `algo_<algorithm>_init_<init>` must match the regex above
- Clustering evaluation metric files must end in `_calinski.csv` or `_davies.csv`
kmflow evaluate benchmark <input_dir> [flag]
- `<input_dir>` = subfolder under `data/` (e.g. `processed`)
- `[flag]`: `--decimals` = round metric values (default 3 decimal places)
- Evaluate and write to stdout:

kmflow evaluate benchmark processed \
--decimals 4
- Write to disk:

kmflow evaluate benchmark processed \
> benchmark.csv
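The discovery/merge step amounts to walking the input directory, matching folder names against the regex above, and concatenating the metric CSVs. A loose sketch of that logic, with hypothetical column handling (not the actual `benchmark` code):

```python
import re
from pathlib import Path

import pandas as pd

FOLDER_RE = re.compile(r"^algo_([^_]+)_init_(.+)$")

frames = []
for folder in Path("data/processed").rglob("*"):             # <input_dir> = processed
    match = FOLDER_RE.match(folder.name)
    if not (folder.is_dir() and match):
        continue
    algorithm, init = match.groups()
    for csv in sorted(folder.glob("*_calinski.csv")) + sorted(folder.glob("*_davies.csv")):
        df = pd.read_csv(csv)
        df["algorithm"], df["init"], df["source"] = algorithm, init, csv.name
        frames.append(df)

benchmark = pd.concat(frames, ignore_index=True).round(3)    # --decimals, default 3
```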
- Purpose: Merge raw and clustered DataFrames to produce per-cluster summary statistics and label counts.
- CLI (`cluster_prep.py`):
  - `cluster-profiles`: reads a raw CSV and a clustered CSV, accepts `--cluster-col` (and an optional `--key-col` to join on a key column), and writes a per-cluster profile table.
  - `map-clusters`: reads a clustered CSV, prompts interactively to map integer IDs to human labels, counts each label, and writes the counts table.
- Utils (`cluster_prep_utils.py`):
  - `merge_cluster_labels()` aligns raw & clustered rows by index (or key column).
  - `get_cluster_profiles()` computes summary statistics (mean, median, min, max) per cluster.
  - `clusters_to_labels()` and `count_labels()` map IDs to labels and tally counts.
- When to use: After K-Means clustering, to turn raw + cluster outputs into interpretable summaries and human-readable labels (a rough sketch appears at the end of this section).
If your data is not on a similar scale, K-Means and PCA won't perform well, but scaled data loses its original units—making it hard to interpret clusters in context.
The `cluster_prep.py` tools let you reattach cluster assignments to the unscaled raw data so you can:
- Compute summary statistics in the original measurement units
- Label clusters meaningfully based on real-world values
- Count and compare cluster sizes accurately
- Preserve the benefits of scaling for analysis while keeping outputs interpretable
Note: While stdin -> stdout piping is supported, I highly recommend using explicit file arguments as shown below.
- Write to a file:

kmflow cluster-prep cluster-profiles \
data/raw/DIR/example.csv \
data/processed/DIR/variant/algo_<algorithm>_init_<init>/std_clustered_5.csv \
cluster_5 \
> data/external/cluster_profile.csv
- Write to stdout:

kmflow cluster-prep cluster-profiles \
data/raw/DIR/example.csv \
data/processed/DIR/variant/algo_<algorithm>_init_<init>/std_clustered_5.csv \
cluster_5
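The underlying idea is simply to reattach labels to the unscaled rows and aggregate per cluster. A minimal sketch of that idea, with placeholder paths, a hypothetical `cluster_5` column, and made-up label names:

```python
import pandas as pd

raw = pd.read_csv("data/raw/DIR/example.csv")                          # unscaled data, original units
clustered = pd.read_csv("data/processed/DIR/std_clustered_5.csv")      # placeholder clustered output

# Reattach labels to the raw rows by position/index; a --key-col style join would use merge() instead
labeled = raw.assign(cluster_5=clustered["cluster_5"].values)

# Per-cluster summary statistics, in the original measurement units
num_cols = raw.select_dtypes("number").columns
profiles = labeled.groupby("cluster_5")[num_cols].agg(["mean", "median", "min", "max"])

# Map integer IDs to human-readable labels and count cluster sizes
name_map = {0: "budget", 1: "mid-range", 2: "premium", 3: "pro", 4: "elite"}   # made-up labels
counts = labeled["cluster_5"].map(name_map).value_counts()
```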