🖊️ Simon D. Angus, authored with AI-augmentation via SoDa Laboratory's Assistant AI tooling.
SoDa Laboratories, Monash Business School, Monash University
Please fork to begin your project -- if using these tools for academic work, please cite as below.
📝 For more context around the text-as-data problem and applying these NLP 'waves' in a social-science context (i.e. narratives of democratic resilience) see: Angus, SD: "Tracking Policy-relevant Narratives of Democratic Resilience at Scale: from experts and machines, to AI & the transformer revolution" - SoDa Laboratories Working Paper 2024-07
A lean, hands-on repo demonstrating three representation “waves” for text-as-data in social science, plus optional LLM labelling. It accompanies workshops/talks at the Methods in Social Change and Inertia Workshops, U Greenwich, London, 1–2 Sep 2025.
- Wave 1: TF‑IDF
- Wave 2: Average GloVe embeddings
- Wave 3: Transformer/LLM embeddings (OpenAI or SBERT fallback)
- Optional: LLM labelling (zero-shot or few-shot) to directly assign labels to texts
We compare methods on a tiny, labelled synthetic corpus (3 classes × ~100 each). You can:
- Cluster and assess separation/quality in 2D.
- Train a simple logistic regression baseline (temporal train/test split: train on “history”, predict the “future”).
- Run an LLM labelling pass and compare its accuracy/F1 to classic ML.
All runs write to a unified results CSV you can plot in one line.
This repo assumes you’re comfortable running a few terminal commands, have basic Python familiarity, and want a clear, minimal path to trying NLP/AI on texts for social science questions.
- Install Python 3.10+
- Create/activate a virtual environment and install dependencies. (💡 You don't need to do this -- you may have the required packages already available on your system -- but it's best practice to isolate dependencies per project, and will assist with reproducibility.)

python -m venv myvenv
source myvenv/bin/activate
pip install -r requirements.txt
- (Optional) Add an OpenAI API key for OpenAI embeddings, LLM data synthesis and LLM labelling. (💡 If you are new to 'keys', Google "open ai key about"; and/or head to https://platform.openai.com to create/sign-in to the API platform and generate a key: click the 'gear' icon for settings > API Keys (menu at left).)
  - Copy .env.example to .env and set OPENAI_API_KEY=...
  - No key? The pipeline still runs end-to-end:
    - OpenAI embeddings automatically fall back to SBERT (all-MiniLM-L6-v2).
    - LLM-based data synthesis falls back to the lexicon generator.
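If you're curious what this fallback amounts to, here is a rough sketch of the pattern, not the repo's code: the OpenAI embedding model name (text-embedding-3-small) is an illustrative assumption, while the SBERT model is the fallback named above.

```python
# Embed a list of texts with OpenAI if a key is available, otherwise fall back to SBERT.
# The OpenAI embedding model name below is an assumption, not necessarily the repo default.
import os
import numpy as np

def embed_texts(texts):
    if os.getenv("OPENAI_API_KEY"):
        from openai import OpenAI
        resp = OpenAI().embeddings.create(model="text-embedding-3-small", input=texts)
        return np.array([item.embedding for item in resp.data])
    from sentence_transformers import SentenceTransformer
    return SentenceTransformer("all-MiniLM-L6-v2").encode(texts)

vectors = embed_texts(["a first short post", "a second short post"])
print(vectors.shape)
```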
- Run everything in one go, writing a unified metrics file
Classic ML only (TF‑IDF, GloVe, SBERT/OpenAI embeddings), lexicon synthetic data:
python -m src.cli.build_all
Also include LLM labelling (zero-shot by default):
python -m src.cli.build_all --include-llm
Few-shot LLM labelling instead (uses in-repo examples to guide the model):
python -m src.cli.build_all --include-llm --llm-prompt-style extended
Any of the above, but generating/using LLM-created synthetic data instead of the lexicon:
python -m src.cli.build_all --data-source llm ...
- Visualise the results
This reads the unified CSV and creates a comparison bar chart of weighted F1 across LR (by representation) and LLM:
python -m src.viz.results --data-source lexicon
Use --data-source llm if you built the LLM-authored synthetic dataset.
- (Optional) Visuals on the fly
You can add these flags to any single-method run to save figures immediately:
python -m src.cli.run_pipeline --method tfidf --data-source lexicon \
--plot-scatter --plot-timeseries --plot-future
That’s it. You now have:
- A single CSV at outputs/tables/metrics_summary_{lexicon|llm|byod}.csv
- A figure at outputs/figures/f1_comparison_{lexicon|llm}.png
- Optional figures (if you used the flags): scatter/timeseries/future under outputs/figures/
- A cache of run products you can plot later without recomputing under outputs/cache/
The pipeline has six components (one optional) that you can mix and match:
- Synthesise — tiny dataset (default 300 texts, 3 topics) using a lexicon or an LLM. Texts are timestamped so class prevalence ebbs/flows over time.
- Vectorise — TF‑IDF, GloVe, or transformer-based embeddings (OpenAI with SBERT fallback).
- Cluster and reduce — cluster (KMeans) and reduce to 2D (UMAP/PCA) for intuition and visualisation.
- Train / predict — logistic regression trained on the vectors of the earliest slice of data (temporal split) and evaluated on the vectors of "future" texts.
- Direct LLM labelling (optional) — zero-shot or few-shot labelling of texts by an LLM, compared to ground-truth labels.
- Compare — a unified results CSV with clustering metrics and classification metrics, plus optional LLM labelling metrics.
Together these mirror common text-as-data workflows for computational social science and are easy to adapt to your own data.
💡 The idea is to allow you to see the strengths/weaknesses of different semantic representations (TF‑IDF, GloVe, LLM embeddings) of texts on a simple, labelled dataset where you can control the data-generating process. You can also optionally compare to a direct LLM labelling process, either in 'zero-shot' (no examples, relying on the model's world knowledge) or 'few-shot' (small in-repo examples per class) modes. These ideas are discussed in more detail in the paper referenced above.
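To make the moving parts concrete, here is a minimal, self-contained sketch of the Wave‑1 workflow using scikit-learn directly rather than the repo's modules. The CSV filename and TF‑IDF settings are illustrative assumptions, PCA stands in for the repo's UMAP default, and the columns follow the synthetic-data schema (text, label, timestamp) described in this README.

```python
# Wave-1 sketch: TF-IDF -> KMeans clustering -> 2D reduction -> temporal logistic regression.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import adjusted_rand_score, f1_score

df = pd.read_csv("synthetic_data/lexicon.csv", parse_dates=["timestamp"])  # hypothetical filename
df = df.sort_values("timestamp").reset_index(drop=True)

# Sparse TF-IDF representation
X = TfidfVectorizer(max_features=5000, stop_words="english").fit_transform(df["text"])

# Unsupervised view: cluster and compare against ground-truth labels
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print("ARI:", adjusted_rand_score(df["label"], clusters))

# 2D reduction for plotting (PCA here; the repo defaults to UMAP)
coords2d = PCA(n_components=2).fit_transform(X.toarray())

# Temporal split: train on the earliest 30% by timestamp, predict the "future"
cut = int(0.3 * len(df))
clf = LogisticRegression(max_iter=1000).fit(X[:cut], df["label"].iloc[:cut])
y_pred = clf.predict(X[cut:])
print("Weighted F1:", f1_score(df["label"].iloc[cut:], y_pred, average="weighted"))
```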
You can regenerate the synthetic data without running a full method:
Lexicon dataset (default):
python -m src.cli.run_pipeline --regen-data --data-source lexicon
LLM-authored dataset (requires key; defaults to gpt-4.1-nano-2025-04-14):
python -m src.cli.run_pipeline --regen-data --data-source llm
Each writes a CSV to synthetic_data/ (or data/ where noted) with columns text, label_name, timestamp, label.
Choose your representation and run end-to-end on either dataset source:
# TF‑IDF example on the lexicon dataset
python -m src.cli.run_pipeline --method tfidf --data-source lexicon
# SBERT/OpenAI embeddings (auto-falls back to SBERT if no key)
python -m src.cli.run_pipeline --method openai --data-source llm
Use --reduce pca to switch from UMAP, or --clusterer kmeans to be explicit. See all options:
python -m src.cli.run_pipeline --help
You have two simple ways to get the nice visuals.
- On the fly during a run (no extra commands):
python -m src.cli.run_pipeline --method tfidf --data-source lexicon \
--plot-scatter --plot-timeseries --plot-future
- Later, without recomputing, from cached outputs:
# After any prior run for a given (data_source, method)
# Uses outputs/cache/last_{data_source}_{method}.npz
python -m src.cli.plot_cached --what timeseries --data-source lexicon # - no method required, just plots synthetic data timeseries
python -m src.cli.plot_cached --what scatter --data-source lexicon --method tfidf # -- scatter of lowD vectors
python -m src.cli.plot_cached --what future --data-source lexicon --method tfidf # -- future label predictions vs. groundtruth
# Or point to a specific cache (e.g., run_<hash>.npz)
python -m src.cli.plot_cached --what scatter --data-source llm --method openai \
--cache outputs/cache/run_aaaaaaaaaaaaaaaa.npz
Generated figures go to outputs/figures/ with sensible names.
You can plug in your own CSV with almost no friction.
The minimum you need is a text column. If you also provide label and/or timestamp, we’ll use them; otherwise we create a synthetic timestamp and, if no labels, put everything into a single class (unlabeled) so clustering and visuals still work.
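If you want to dry-run the BYOD path before pointing at real data, a tiny CSV like the following is enough. This is purely an illustration (toy texts, column names matching the first CLI example below), not repo code.

```python
# Write a tiny illustrative BYOD CSV with custom column names (body / stance / posted_at).
import pandas as pd

pd.DataFrame(
    {
        "body": [
            "Citizens' assemblies should shape the national budget.",
            "Parliament passed the reform after a routine committee review.",
            "The court was dissolved and critics were silenced overnight.",
        ],
        "stance": ["deep_democracy", "mainstream_democracy", "anti_democratic"],
        "posted_at": ["2024-01-03", "2024-06-15", "2024-11-20"],
    }
).to_csv("my_texts.csv", index=False)
```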
Quick examples:
# Labeled data with custom column names
python -m src.cli.run_pipeline \
--data-source byod \
--byod-path my_texts.csv \
--byod-text-col body \
--byod-label-col stance \
--byod-timestamp-col posted_at \
--method tfidf \
--plot-scatter --plot-timeseries
# Unlabeled data (clustering + visuals still work)
python -m src.cli.run_pipeline \
--data-source byod \
--byod-path my_texts.csv \
--byod-text-col text \
--method glove \
--plot-scatter --plot-timeseries
# Later, plot from cache without recomputing
python -m src.cli.plot_cached --what scatter --data-source byod --method glove
Under the hood this uses src/data/byod.py to normalize your CSV to the standard schema: text, label_name, label (int-coded), timestamp.
- Unified metrics CSV (always produced): outputs/tables/metrics_summary_{lexicon|llm|byod}.csv
  - Columns include: representation and settings, clustering metrics (ARI, NMI, Silhouette), logistic regression metrics (Accuracy, weighted F1), and (if run) LLM labelling metrics.
- LLM labelling cache (if run): outputs/tables/llm_labels_{data_source}_{prompt_style}_{model}.csv
  - Stores the LLM’s chosen label and short “thought” per text.
- Visualisations (optional, either on-the-fly or from cache):
  - outputs/figures/scatter_{method}_{data}.png
  - outputs/figures/timeseries_{method}_{data}.png
  - outputs/figures/future_{method}_{data}.png
- Cache of run products (enables instant plotting later, no recompute): outputs/cache/run_<hash>.npz (contains coords2d, clabels, labels, label_names, timestamps, test_idx, y_pred, cutoff_ts)
  - Convenience link: outputs/cache/last_{data_source}_{method}.npz
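If you prefer to inspect these artefacts outside the repo's own plotting tools, a quick sketch with pandas/numpy is enough; the file paths below simply follow the patterns listed above, and exact column names may differ slightly.

```python
# Load the unified metrics CSV and the cached run products for ad-hoc inspection.
import numpy as np
import pandas as pd

metrics = pd.read_csv("outputs/tables/metrics_summary_lexicon.csv")
print(metrics.head())  # one row per (representation / LLM-labelling) run

cache = np.load("outputs/cache/last_lexicon_tfidf.npz", allow_pickle=True)
print(cache.files)            # coords2d, clabels, labels, label_names, timestamps, ...
coords2d = cache["coords2d"]  # 2D coordinates used for the scatter plot
```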
Logistic regression is trained on the earliest fraction of the dataset (by timestamp) and evaluated on the future portion — a realistic setting where future distributions are not leaked into training.
- Default train fraction: 0.3 (30% history; 70% future).
- Change it with, e.g.:
python -m src.cli.run_pipeline --method tfidf --train-frac 0.8
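If you want to replicate the split logic on your own data, a minimal sketch (not the repo's implementation) looks like this:

```python
# Minimal temporal split: sort by timestamp, train on the earliest fraction,
# evaluate on everything after the cutoff. Illustrative only.
import pandas as pd

def temporal_split(df: pd.DataFrame, train_frac: float = 0.3):
    df = df.sort_values("timestamp").reset_index(drop=True)
    cut = int(train_frac * len(df))
    cutoff_ts = df["timestamp"].iloc[cut]  # first "future" timestamp
    return df.iloc[:cut], df.iloc[cut:], cutoff_ts

# train_df, test_df, cutoff_ts = temporal_split(pd.read_csv("my_texts.csv", parse_dates=["timestamp"]))
```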
You can ask an LLM to assign labels directly and compare that to classic ML baselines.
- Zero-shot (default):
python -m src.cli.run_llm_label --data-source lexicon --prompt-style zero
- Few-shot ("extended"): draws small in-repo examples per class as scaffolding
python -m src.cli.run_llm_label --data-source lexicon --prompt-style extended
All LLM-labelling runs write metrics into the same unified CSV used by classic ML, so your src.viz.results figure will include both.
Tips:
- Add --regen-labels to force relabelling even if a cache exists.
- Use --concurrency to control parallel API calls.
- Select the labelling model with --llm-model (defaults to gpt-4.1-nano-2025-04-14).
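For intuition about what a zero-shot labelling call boils down to, here is a stripped-down sketch. It is not the repo's src.cli.run_llm_label implementation (which also caches results and records a short "thought" per text); the prompt wording is an assumption.

```python
# Zero-shot labelling sketch: ask the model to pick one of the three class names.
# Requires OPENAI_API_KEY in the environment; model name as per the repo default.
from openai import OpenAI

LABELS = ["deep_democracy", "mainstream_democracy", "anti_democratic"]
client = OpenAI()

def label_text(text: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4.1-nano-2025-04-14",
        messages=[
            {"role": "system",
             "content": f"Classify the post into exactly one of: {', '.join(LABELS)}. "
                        "Reply with the label only."},
            {"role": "user", "content": text},
        ],
    )
    return resp.choices[0].message.content.strip()

# print(label_text("The new law lets the executive bypass parliament entirely."))
```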
- Lexicon generator (default): short texts created from clause-level templates with shared nouns to keep lexical overlap across classes.
- LLM generator: short, social-media-like posts authored by an LLM. Defaults to gpt-4.1-nano-2025-04-14; override with --llm-model. If OPENAI_API_KEY is missing, it falls back to the lexicon generator.

Classes are the same in both datasets: deep_democracy, mainstream_democracy, anti_democratic.
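To give a flavour of what "clause-level templates with shared nouns" means in practice, here is a toy sketch. It is not the repo's generator; the templates and vocabulary are invented for illustration.

```python
# Toy clause-template generator: each class has its own clauses, but all classes draw on a
# shared pool of nouns, so the classes overlap lexically (the point of the lexicon design).
import random

SHARED_NOUNS = ["parliament", "citizens", "the courts", "the media", "elections"]
TEMPLATES = {
    "deep_democracy": ["{noun} should be opened to direct citizen deliberation."],
    "mainstream_democracy": ["{noun} followed the usual constitutional process this week."],
    "anti_democratic": ["{noun} was sidelined by decree without any oversight."],
}

def make_text(label: str, rng: random.Random) -> str:
    template = rng.choice(TEMPLATES[label])
    return template.format(noun=rng.choice(SHARED_NOUNS).capitalize())

rng = random.Random(0)
for label in TEMPLATES:
    print(label, "->", make_text(label, rng))
```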
Edit configs/config.yaml (sensible defaults apply if missing):
- Vectorizers
  - TF‑IDF params
  - GloVe model
  - OpenAI embedding model and batch size
  - SBERT fallback model (used automatically if no OpenAI key)
- Data
  - Counts, time span, optional label noise
- Reduce/Cluster
  - reduce.method (umap or pca), reduce.n_components, reduce.n_neighbors (UMAP)
  - cluster.k for KMeans
- LLM labelling
  - llm_label.model and llm_label.prompt_style (zero or extended)
Notes:
- GloVe is loaded via gensim.downloader (glove-wiki-gigaword-50) — requires internet on first use.
- SBERT fallback (all-MiniLM-L6-v2) downloads on first use.
- OpenAI models require an API key set in .env (or your shell env). Without a key, the pipeline degrades gracefully.
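As an aside, the "average GloVe embeddings" representation (Wave 2) boils down to something like the following sketch; this is not the repo's vectoriser, just the underlying idea using gensim, with naive whitespace tokenisation as a simplifying assumption.

```python
# Wave-2 sketch: represent each text as the mean of its GloVe word vectors.
# Downloads glove-wiki-gigaword-50 on first use (as noted above); out-of-vocabulary words are skipped.
import numpy as np
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-50")  # 50-dimensional vectors

def embed(text: str) -> np.ndarray:
    words = [w for w in text.lower().split() if w in glove]
    if not words:
        return np.zeros(glove.vector_size)
    return np.mean([glove[w] for w in words], axis=0)

print(embed("Citizens deliberate on the budget").shape)  # (50,)
```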
Once you’re comfortable, adapt this as a swiss‑army‑knife starter:
- Minimal expected columns (if you mirror the synthetic format): text, label_name (or label), and ideally a timestamp for temporal splits and time‑series visuals.
- Start by replacing the data loader in src/data/load.py to point at your CSV; or add a new loader.
- Keep the same CLI shape so you can still run build_all and produce unified results/plots without changing your workflow.
For convenience, the built-in BYOD loader (src/data/byod.py) already handles these cases via the CLI flags in the section above.
- How representation choices affect cluster structure and downstream performance (TF‑IDF → GloVe → LLM embeddings).
- How different dataset sources (lexicon vs LLM‑authored) influence separability while keeping labels identical.
- A single, unified results file combining classic ML and LLM labelling so you can compare quickly and communicate clearly.
- Simple, modular code you can reuse (data generation, vectorisation, clustering, reduction, metrics, visualisation).
- Graceful key handling via .env and caching to avoid repeated API calls.
MIT.
This repo was developed in AI-augmented collaboration with SoDa Laboratory's incredible Assistant AI platform, spearheaded and engineered by SoDa's AI Research Lead, Lachlan O'Neill.
If you fork/apply this repository for your research, please cite as:
Bibtex:
@misc{angus2025textwaves,
author = {Angus, SD},
title = {TextWaves: simple NLP analysis on a tiny synthetic dataset},
year = {2025},
url = {https://github.com/sodalabsio/textwaves},
note = {GitHub repository}
}
APA:
Angus, S. (2025). TextWaves: simple NLP analysis on a tiny synthetic dataset [Computer software]. GitHub. https://github.com/sodalabsio/textwaves