Note: this repository is intended for educational purposes only. For production use, I'd recommend https://github.com/bab2min/tomotopy instead, which is far more production-ready.
Hierarchical Latent Dirichlet Allocation (hLDA) addresses the problem of learning topic hierarchies from data. The model relies on a non‑parametric prior called the nested Chinese restaurant process, which allows for arbitrarily large branching factors and easily accommodates growing data collections. The hLDA model combines this prior with a likelihood based on a hierarchical variant of Latent Dirichlet Allocation.
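To make the prior concrete, here is a minimal, self-contained sketch (illustration only, not the package's internal code) of how a nested CRP draws a root-to-leaf path for a document: at each level the document sits at an existing child with probability proportional to the number of documents already there, or opens a new child with probability proportional to `gamma`.

```python
import random

def ncrp_path(tree, num_levels, gamma, rng=random):
    """Sample one root-to-leaf path under a nested CRP.

    `tree` maps a node id to {child_id: customer_count}; this flat
    dict-of-dicts layout is purely for illustration.
    """
    path = ["root"]
    node = "root"
    for _ in range(num_levels - 1):
        children = tree.setdefault(node, {})
        # existing child i is chosen w.p. count_i / (n + gamma),
        # a brand-new child w.p. gamma / (n + gamma)
        r = rng.random() * (sum(children.values()) + gamma)
        for child, count in children.items():
            r -= count
            if r <= 0:
                break
        else:
            child = f"{node}/{len(children)}"  # open a new table
        children[child] = children.get(child, 0) + 1
        path.append(child)
        node = child
    return path

tree = {}
for _ in range(5):
    print(ncrp_path(tree, num_levels=3, gamma=1.0))
```

Because counts accumulate as documents are seated, popular subtrees attract more documents while `gamma` controls how readily new branches appear; this is what lets the hierarchy grow with the data.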
The original papers describing the algorithm are:
- Blei, Griffiths, Jordan and Tenenbaum. *Hierarchical Topic Models and the Nested Chinese Restaurant Process* (NIPS 2003)
- Blei, Griffiths and Jordan. *The Nested Chinese Restaurant Process and Bayesian Nonparametric Inference of Topic Hierarchies* (JACM 2010)
This repository contains a pure Python implementation of the Gibbs sampler for hLDA. It is intended for experimentation and as a reference implementation. The code follows the approach used in the original Mallet implementation but with a simplified interface and a fixed depth for the tree.
Key features include:
- Python 3.11+ support with minimal third‑party dependencies.
- A small set of example scripts demonstrating how to run the sampler.
- Utilities for visualising the resulting topic hierarchy.
- A test suite verifying the sampler on synthetic data and a small BBC corpus.
The package can be installed directly from PyPI:
```bash
pip install hlda
```
Alternatively, to develop locally, clone this repository and install it in editable mode:
```bash
git clone https://github.com/joewandy/hlda.git
cd hlda
pip install -e .
pre-commit install
```
The easiest way to get started is with the sample BBC dataset provided in the `data/` directory. You can run the full demonstration from the command line:

```bash
python examples/bbc_demo.py --data-dir data/bbc/tech --iterations 20
```
If you installed the package from PyPI, you can run the same demo via the `hlda-run` command:

```bash
hlda-run --data-dir data/bbc/tech --iterations 20
```
To write the learned hierarchy to disk in JSON format, pass `--export-tree <file>` when running the script:

```bash
python scripts/run_hlda.py --data-dir data/bbc/tech --export-tree tree.json
```
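The JSON schema is whatever `run_hlda.py` emits; as an illustration only, assuming a nested layout where each node carries hypothetical `words` and `children` keys, the exported tree could be pretty-printed like this:

```python
import json

def print_tree(node, indent=0):
    # `words` and `children` are assumed field names for illustration;
    # check the actual keys written by --export-tree before relying on them.
    print(" " * indent + ", ".join(node.get("words", [])[:5]))
    for child in node.get("children", []):
        print_tree(child, indent + 2)

with open("tree.json") as fh:
    print_tree(json.load(fh))
```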
If you make use of the BBC dataset, please cite the publication by Greene and Cunningham (2006) as detailed in `CITATION.cff`.
Example scripts for the BBC dataset and synthetic data are available in the `examples/` directory.
Within Python you can also construct the sampler directly:
```python
from hlda.sampler import HierarchicalLDA

corpus = [["word", "word", ...], ...]  # list of tokenised documents
vocab = sorted({w for doc in corpus for w in doc})

hlda = HierarchicalLDA(corpus, vocab, alpha=1.0, gamma=1.0, eta=0.1,
                       num_levels=3, seed=0)
hlda.estimate(iterations=50, display_topics=10)
```
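For a concrete input, here is one simple way (an illustrative sketch, not preprocessing the package requires) to turn raw strings into the `corpus`/`vocab` pair the sampler expects:

```python
import re

docs = [
    "the quick brown fox jumps over the lazy dog",
    "topic models learn latent structure from text collections",
]

# deliberately naive tokenisation: lowercase, alphabetic tokens only;
# swap in your preferred tokeniser and stop-word handling
corpus = [re.findall(r"[a-z]+", doc.lower()) for doc in docs]
vocab = sorted({w for doc in corpus for w in doc})
```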
The package also provides a `HierarchicalLDAEstimator` that follows the scikit-learn API. This allows using the sampler inside a standard `Pipeline`:
```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

from hlda.sklearn_wrapper import HierarchicalLDAEstimator

vectorizer = CountVectorizer()

# rebuild per-token word-index lists from the count matrix
prep = FunctionTransformer(
    lambda X: (
        [[i for i, c in enumerate(row) for _ in range(int(c))] for row in X.toarray()],
        list(vectorizer.get_feature_names_out()),
    ),
    validate=False,
)

pipeline = Pipeline([
    ("vect", vectorizer),
    ("prep", prep),
    ("hlda", HierarchicalLDAEstimator(num_levels=3, iterations=10, seed=0)),
])

pipeline.fit(documents)  # `documents` is a list of raw text strings
assignments = pipeline.transform(documents)
```
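The `prep` step bridges a representational gap: `CountVectorizer` outputs a document-term count matrix, whereas the Gibbs sampler works on per-token word lists. The `FunctionTransformer` therefore expands each row of counts back into a list of word indices (repeating index `i` once per count) and pairs it with the fitted vocabulary; note that the lambda closes over `vectorizer`, which is already fitted by the time this step runs inside the pipeline.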
The repository includes a small test suite that checks the sampler on both the BBC corpus and synthetic data. After installing the development dependencies you can run:
```bash
pytest -q
```
All tests should pass in a few seconds.
This project is licensed under the terms of the MIT license. See `LICENSE.txt` for details.