
🌊 TextWaves: simple NLP analysis on a tiny synthetic dataset

🖊️ Simon D. Angus, authored with AI-augmentation via SoDa Laboratory's Assistant AI tooling.

SoDa Laboratories, Monash Business School, Monash University

Please fork to begin your project -- if using these tools for academic work, please cite as below.

📝 For more context around the text-as-data problem and applying these NLP 'waves' in a social-science context (i.e. narratives of democratic resilience) see: Angus, SD: "Tracking Policy-relevant Narratives of Democratic Resilience at Scale: from experts and machines, to AI & the transformer revolution" - SoDa Laboratories Working Paper 2024-07

TL;DR

Lean, hands-on repo to demonstrate three representation “waves” for text-as-data in social science, plus optional LLM-labelling, accompanying workshops/talks at U Greenwich, LDN, Sep 1–2 2025 — Methods in Social Change and Inertia Workshops.

  • Wave 1: TF‑IDF

  • Wave 2: Average GloVe embeddings

  • Wave 3: Transformer/LLM embeddings (OpenAI or SBERT fallback)

  • Optional: LLM labelling (zero-shot or few-shot) to directly assign labels to texts

We compare methods on a tiny, labelled synthetic corpus (3 classes × ~100 texts each). You can:

  • Cluster and assess separation/quality in 2D.
  • Train a simple logistic regression baseline (temporal train/test split: train on “history”, predict the “future”).
  • Run an LLM labelling pass and compare its accuracy/F1 to classic ML.

All runs write to a unified results CSV you can plot in one line.

🤔 Assumptions

You’re comfortable running a few terminal commands, have basic Python familiarity, and want a clear, minimal path to try NLP/AI on texts for social science questions.

🚀 Quick start (fastest path)

  1. Install Python 3.10+

  2. Create/activate a virtual environment and install dependencies. (💡 Strictly speaking you don't need to do this -- you may already have the required packages on your system -- but isolating dependencies per project is best practice and helps reproducibility.)

    python -m venv myvenv
    source myvenv/bin/activate
    pip install -r requirements.txt
  3. (Optional) Add an OpenAI API key for OpenAI embeddings, LLM data synthesis and LLM labelling. (💡 If you are new to 'keys', search for "OpenAI API key"; and/or head to https://platform.openai.com to create/sign in to the API platform and generate a key: click the 'gear' icon for settings > API Keys (menu at left).)

    • Copy .env.example to .env and set OPENAI_API_KEY=...
    • No key? The pipeline still runs end-to-end:
      • OpenAI embeddings automatically fall back to SBERT (all-MiniLM-L6-v2).
      • LLM-based data synthesis falls back to the lexicon generator.
  4. Run everything in one go, writing a unified metrics file

    Classic ML only (TF‑IDF, GloVe, SBERT/OpenAI embeddings), lexicon synthetic data:

    python -m src.cli.build_all

    Also include LLM labelling (zero-shot by default):

    python -m src.cli.build_all --include-llm

    Few-shot LLM labelling instead (uses in-repo examples to guide the model):

    python -m src.cli.build_all --include-llm --llm-prompt-style extended

    Any of the above, but generating/using LLM-created synthetic data instead of lexicon:

    python -m src.cli.build_all --data-source llm ...
  5. Visualise the results

    This reads the unified CSV and creates a comparison bar chart of weighted F1 across LR (by representation) and LLM:

    python -m src.viz.results --data-source lexicon

    Use --data-source llm if you built the LLM-authored synthetic dataset.

  6. (Optional) Visuals on the fly

    You can add these flags to any single-method run to save figures immediately:

    python -m src.cli.run_pipeline --method tfidf --data-source lexicon \
        --plot-scatter --plot-timeseries --plot-future

That’s it. You now have:

  • A single CSV at outputs/tables/metrics_summary_{lexicon|llm|byod}.csv (see the pandas sketch below)
  • A figure at outputs/figures/f1_comparison_{lexicon|llm}.png
  • Optional figures (if you used the flags): scatter/timeseries/future under outputs/figures/
  • A cache of run products you can plot later without recomputing under outputs/cache/
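
If you want a quick look at that CSV, here is a minimal pandas sketch. The path is the one above; the exact columns are listed under "Outputs" further below.

import pandas as pd

# Peek at the unified metrics CSV produced by build_all
df = pd.read_csv("outputs/tables/metrics_summary_lexicon.csv")
print(df.head())  # one row per run: representation, clustering and LR metrics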

⚙ Pipeline approach (what’s happening under the hood)

The pipeline has six components (one of them optional) you can mix and match:

  1. Synthesise — tiny dataset (default 300 texts, 3 topics) using a lexicon or an LLM. Texts are timestamped so class prevalence ebbs/flows over time.

  2. Vectorise — TF‑IDF, GloVe, or transformer-based embeddings (OpenAI with SBERT fallback).

  3. Cluster and reduce — cluster (KMeans) and reduce to 2D (UMAP/PCA) for intuition and visualisation.

  4. Train / predict — logistic regression trained on the vectors of the earliest slice of data (temporal split) and evaluated on the vectors of "future" texts.

  5. Direct LLM labelling (optional) — zero-shot or few-shot labelling of texts by an LLM, compared to ground-truth labels.

  6. Compare — a unified results CSV with clustering metrics and classification metrics, plus optional LLM labelling metrics.

Together these mirror common text-as-data workflows for computational social science and are easy to adapt to your own data.

💡 The idea is to allow you to see the strengths/weaknesses of different semantic representations (TF‑IDF, GloVe, LLM embeddings) of ecological texts on a simple, labelled dataset where you can control the data-generating process. You can also optionally compare to a direct LLM labelling process, either in 'zero-shot' (no examples, rely on model's world knowledge) or 'few-shot' (small in-repo examples per class) modes. These ideas are discussed in more detail in the paper referenced above.
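
For intuition, here is a self-contained sketch of the three waves on toy sentences. It is illustrative only: the model names follow the defaults noted in this README (glove-wiki-gigaword-50, all-MiniLM-L6-v2), and the repo's actual vectorisers in src/ differ in detail.

import numpy as np
import gensim.downloader as api
from sklearn.feature_extraction.text import TfidfVectorizer
from sentence_transformers import SentenceTransformer

texts = ["citizens deliberate on policy", "parliament passes a reform bill"]

# Wave 1: TF-IDF -- sparse counts reweighted by inverse document frequency
X_tfidf = TfidfVectorizer().fit_transform(texts)

# Wave 2: average GloVe -- mean of the per-token 50-d word vectors
glove = api.load("glove-wiki-gigaword-50")
def avg_glove(text):
    vecs = [glove[w] for w in text.split() if w in glove]
    return np.mean(vecs, axis=0) if vecs else np.zeros(50)
X_glove = np.vstack([avg_glove(t) for t in texts])

# Wave 3: transformer embeddings -- contextual sentence vectors (SBERT fallback)
X_sbert = SentenceTransformer("all-MiniLM-L6-v2").encode(texts)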

Generate data explicitly (optional)

You can regenerate the synthetic data without running a full method:

Lexicon dataset (default):

python -m src.cli.run_pipeline --regen-data --data-source lexicon

LLM-authored dataset (requires an API key; defaults to gpt-4.1-nano-2025-04-14):

python -m src.cli.run_pipeline --regen-data --data-source llm

Each writes a CSV to synthetic_data/ (or data/ where noted) with columns text,label_name,timestamp,label.

Run a single method by hand (optional)

Choose your representation and run end-to-end on either dataset source:

# TF‑IDF example on the lexicon dataset
python -m src.cli.run_pipeline --method tfidf --data-source lexicon

# SBERT/OpenAI embeddings (auto-falls back to SBERT if no key)
python -m src.cli.run_pipeline --method openai --data-source llm

Use --reduce pca to switch from UMAP, or --clusterer kmeans to be explicit. See all options:

python -m src.cli.run_pipeline --help
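
Under the hood, the cluster-and-reduce step amounts to roughly this sketch (assuming scikit-learn and umap-learn, with random stand-ins for the vectors and labels):

import numpy as np
import umap
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, silhouette_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 50))         # stand-in for any wave's vectors
labels = rng.integers(0, 3, size=300)  # stand-in for ground-truth labels

coords2d = umap.UMAP(n_components=2).fit_transform(X)  # 2-D for plotting
clabels = KMeans(n_clusters=3, n_init=10).fit_predict(X)
print(adjusted_rand_score(labels, clabels))  # ARI: agreement with truth
print(silhouette_score(X, clabels))          # geometric cluster quality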

Optional visualisations (from cache or on the fly)

You have two simple ways to get the nice visuals.

  1. On the fly during a run (no extra commands):
python -m src.cli.run_pipeline --method tfidf --data-source lexicon \
    --plot-scatter --plot-timeseries --plot-future
  2. Later, without recomputing, from cached outputs:
# After any prior run for a given (data_source, method)
# Uses outputs/cache/last_{data_source}_{method}.npz
python -m src.cli.plot_cached --what timeseries --data-source lexicon   # no method required; plots the synthetic data timeseries
python -m src.cli.plot_cached --what scatter --data-source lexicon --method tfidf   # scatter of low-D vectors
python -m src.cli.plot_cached --what future --data-source lexicon --method tfidf    # future label predictions vs. ground truth

# Or point to a specific cache (e.g., run_<hash>.npz)
python -m src.cli.plot_cached --what scatter --data-source llm --method openai \
    --cache outputs/cache/run_aaaaaaaaaaaaaaaa.npz

Generated figures go to outputs/figures/ with sensible names.

BYOD: Bring Your Own Data

You can plug in your own CSV with almost no friction.

The minimum you need is a text column. If you also provide label and/or timestamp, we’ll use them; otherwise we create a synthetic timestamp and, if there are no labels, assign everything to a single class (unlabeled) so clustering and visuals still work.

Quick examples:

# Labeled data with custom column names
python -m src.cli.run_pipeline \
  --data-source byod \
  --byod-path my_texts.csv \
  --byod-text-col body \
  --byod-label-col stance \
  --byod-timestamp-col posted_at \
  --method tfidf \
  --plot-scatter --plot-timeseries

# Unlabeled data (clustering + visuals still work)
python -m src.cli.run_pipeline \
  --data-source byod \
  --byod-path my_texts.csv \
  --byod-text-col text \
  --method glove \
  --plot-scatter --plot-timeseries

# Later, plot from cache without recomputing
python -m src.cli.plot_cached --what scatter --data-source byod --method glove

Under the hood this uses src/data/byod.py to normalise your CSV to the standard schema: text, label_name, label (int-coded), timestamp.
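
As a rough illustration of that normalisation (the function name and defaults here are hypothetical; src/data/byod.py is the real implementation):

import pandas as pd

def normalise_byod(path, text_col="text", label_col=None, ts_col=None):
    df = pd.read_csv(path)
    out = pd.DataFrame({"text": df[text_col].astype(str)})
    # Labels: use the supplied column, else a single "unlabeled" class
    out["label_name"] = df[label_col] if label_col else "unlabeled"
    out["label"] = pd.factorize(out["label_name"])[0]  # int-coded
    # Timestamps: use the supplied column, else synthesise an even daily grid
    if ts_col:
        out["timestamp"] = pd.to_datetime(df[ts_col])
    else:
        out["timestamp"] = pd.date_range("2024-01-01", periods=len(out), freq="D")
    return out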

Outputs (unified and simple)

  • Unified metrics CSV (always produced):

    • outputs/tables/metrics_summary_{lexicon|llm|byod}.csv
    • Columns include: representation and settings, clustering metrics (ARI, NMI, Silhouette), logistic regression metrics (Accuracy, weighted F1), and (if run) LLM labelling metrics.
  • LLM labelling cache (if run):

    • outputs/tables/llm_labels_{data_source}_{prompt_style}_{model}.csv
    • Stores the LLM’s chosen label and short “thought” per text.
  • Visualisations (optional, either on-the-fly or from cache):

    • outputs/figures/scatter_{method}_{data}.png
    • outputs/figures/timeseries_{method}_{data}.png
    • outputs/figures/future_{method}_{data}.png
  • Cache of run products (enables instant plotting later, no recompute):

    • outputs/cache/run_<hash>.npz (contains coords2d, clabels, labels, label_names, timestamps, test_idx, y_pred, cutoff_ts; see the sketch below)
    • Convenience link: outputs/cache/last_{data_source}_{method}.npz
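
To see what a cache holds, a quick sketch (key names as listed above; the shapes in the comments are assumptions):

import numpy as np

cache = np.load("outputs/cache/last_lexicon_tfidf.npz", allow_pickle=True)
print(cache.files)            # ['coords2d', 'clabels', 'labels', 'label_names', ...]
coords2d = cache["coords2d"]  # assumed (n_texts, 2) reduced coordinates
y_pred = cache["y_pred"]      # predicted labels for the "future" slice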

Inference train/test controls (temporal split)

Logistic regression is trained on the earliest fraction of the dataset (by timestamp) and evaluated on the future portion — a realistic setting where future distributions are not leaked into training.

  • Default train fraction: 0.3 (30% history; 70% future).
  • Change it with, e.g.:
python -m src.cli.run_pipeline --method tfidf --train-frac 0.8
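
Conceptually, the split is just an argsort by timestamp -- a sketch, not the repo's exact code:

import numpy as np

def temporal_split(timestamps, train_frac=0.3):
    # Earliest train_frac of rows become "history"; the rest is the "future"
    order = np.argsort(np.asarray(timestamps))
    cutoff = int(len(order) * train_frac)
    return order[:cutoff], order[cutoff:]  # (train indices, test indices)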

LLM labelling (zero-shot and few-shot)

You can ask an LLM to assign labels directly and compare that to classic ML baselines.

  • Zero-shot (default):
python -m src.cli.run_llm_label --data-source lexicon --prompt-style zero
  • Few-shot ("extended"): draws small in-repo examples per class as scaffolding
python -m src.cli.run_llm_label --data-source lexicon --prompt-style extended

All LLM-labelling runs write metrics into the same unified CSV used by classic ML, so your src.viz.results figure will include both.

Tips:

  • Add --regen-labels to force relabelling even if a cache exists.
  • Use --concurrency to control parallel API calls.
  • Select labelling model with --llm-model (defaults to gpt-4.1-nano-2025-04-14).
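
The zero-shot call boils down to something like the following (prompt wording and response handling are illustrative; the real prompt lives in the src.cli.run_llm_label module):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
LABELS = ["deep_democracy", "mainstream_democracy", "anti_democratic"]

def llm_label(text, model="gpt-4.1-nano-2025-04-14"):
    resp = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": f"Classify this text as one of {LABELS}. "
                       f"Reply with the label only.\n\nText: {text}",
        }],
    )
    return resp.choices[0].message.content.strip()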

Dataset sources (lexicon vs LLM)

  • Lexicon generator (default): short texts created from clause-level templates with shared nouns to keep lexical overlap across classes.

  • LLM generator: short, social-media-like posts authored by an LLM. Defaults to gpt-4.1-nano-2025-04-14; override with --llm-model. If OPENAI_API_KEY is missing, it falls back to the lexicon generator.

Classes are the same in both datasets: deep_democracy, mainstream_democracy, anti_democratic.
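
A toy version of the lexicon idea (the templates and nouns below are invented for illustration; the repo's generator is richer and adds timestamps):

import random

NOUNS = ["assembly", "council", "vote"]  # shared nouns keep lexical overlap across classes
TEMPLATES = {
    "deep_democracy": "citizens shape the {n} directly",
    "mainstream_democracy": "the elected {n} debates reform",
    "anti_democratic": "leaders bypass the {n} entirely",
}

def make_text(label):
    return TEMPLATES[label].format(n=random.choice(NOUNS))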

Configuration

Edit configs/config.yaml (sensible defaults apply if missing):

  • Vectorizers

    • TF‑IDF params
    • GloVe model
    • OpenAI embedding model and batch size
    • SBERT fallback model (used automatically if no OpenAI key)
  • Data

    • Counts, time span, optional label noise
  • Reduce/Cluster

    • reduce.method (umap or pca), reduce.n_components, reduce.n_neighbors (UMAP)
    • cluster.k for KMeans
  • LLM labelling

    • llm_label.model and llm_label.prompt_style (zero or extended)

Notes:

  • GloVe is loaded via gensim.downloader (glove-wiki-gigaword-50) — requires internet on first use.
  • SBERT fallback (all-MiniLM-L6-v2) downloads on first use.
  • OpenAI models require an API key set in .env (or your shell env). Without a key, the pipeline degrades gracefully.
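
The graceful degradation is essentially a key check at start-up; a minimal sketch, assuming python-dotenv:

import os
from dotenv import load_dotenv

load_dotenv()  # pulls OPENAI_API_KEY from .env into the environment, if present
embed_backend = "openai" if os.getenv("OPENAI_API_KEY") else "sbert"  # SBERT fallback per the Notes above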

Bring your own data (BYOD)

Once you’re comfortable, adapt this as a Swiss-army-knife starter:

  • Minimal expected columns (if you mirror the synthetic format): text, label_name (or label), and ideally a timestamp for temporal splits and time‑series visuals.
  • Start by replacing the data loader in src/data/load.py to point at your CSV; or add a new loader.
  • Keep the same CLI shape so you can still run build_all and produce unified results/plots without changing your workflow.

For convenience, the built-in BYOD loader (src/data/byod.py) already handles these cases via the CLI flags in the section above.

What this repo shows

  • How representation choices affect cluster structure and downstream performance (TF‑IDF → GloVe → LLM embeddings).
  • How different dataset sources (lexicon vs LLM‑authored) influence separability while keeping labels identical.
  • A single, unified results file combining classic ML and LLM labelling so you can compare quickly and communicate clearly.
  • Simple, modular code you can reuse (data generation, vectorisation, clustering, reduction, metrics, visualisation).
  • Graceful key handling via .env and caching to avoid repeated API calls.

License

MIT.

Acknowledgement

This repo was developed in AI-augmented collaboration with SoDa Laboratory's incredible Assistant AI platform, spearheaded and engineered by SoDa's AI Research Lead, Lachlan O'Neill.

To cite

If you fork/apply this repository for your research, please cite as:

Bibtex:

@misc{angus2025textwaves,
  author = {Angus, SD},
  title = {TextWaves: simple NLP analysis on a tiny synthetic dataset},
  year = {2025},
  url = {https://github.com/sodalabsio/textwaves},
  note = {GitHub repository}
}

APA:

Angus, S. (2025). TextWaves: simple NLP analysis on a tiny synthetic dataset [Computer software]. GitHub. https://github.com/sodalabsio/textwaves
