Note: this repository is intended for educational purposes only. For production use, I'd recommend https://github.com/bab2min/tomotopy instead, which is far more production-ready.
Hierarchical Latent Dirichlet Allocation (hLDA) addresses the problem of learning topic hierarchies from data. The model relies on a non‑parametric prior called the nested Chinese restaurant process, which allows for arbitrarily large branching factors and easily accommodates growing data collections. The hLDA model combines this prior with a likelihood based on a hierarchical variant of Latent Dirichlet Allocation.
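To make the prior concrete, here is a minimal, self-contained sketch (illustration only, not the package's internal code) of how a nested CRP draws a root-to-leaf path for a document: at each level the document sits at an existing child with probability proportional to the number of documents already there, or opens a new child with probability proportional to `gamma`.

```python
import random

def ncrp_path(tree, num_levels, gamma, rng=random):
    """Sample one root-to-leaf path under a nested CRP.

    `tree` maps a node id to {child_id: customer_count}; this flat
    dict-of-dicts layout is purely for illustration.
    """
    path = ["root"]
    node = "root"
    for _ in range(num_levels - 1):
        children = tree.setdefault(node, {})
        # existing child i is chosen w.p. count_i / (n + gamma),
        # a brand-new child w.p. gamma / (n + gamma)
        r = rng.random() * (sum(children.values()) + gamma)
        for child, count in children.items():
            r -= count
            if r <= 0:
                break
        else:
            child = f"{node}/{len(children)}"  # open a new table
        children[child] = children.get(child, 0) + 1
        path.append(child)
        node = child
    return path

tree = {}
for _ in range(5):
    print(ncrp_path(tree, num_levels=3, gamma=1.0))
```

Because counts accumulate as documents are seated, popular subtrees attract more documents while `gamma` controls how readily new branches appear; this is what lets the hierarchy grow with the data.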
The original papers describing the algorithm are:
- Blei, Griffiths, Jordan and Tenenbaum. *Hierarchical Topic Models and the Nested Chinese Restaurant Process* (NIPS 2003)
- Blei, Griffiths and Jordan. *The Nested Chinese Restaurant Process and Bayesian Nonparametric Inference of Topic Hierarchies* (JACM 2010)
This repository contains a pure Python implementation of the Gibbs sampler for hLDA. It is intended for experimentation and as a reference implementation. The code follows the approach used in the original Mallet implementation but with a simplified interface and a fixed depth for the tree.
Key features include:
- Python 3.11+ support with minimal third‑party dependencies.
- A small set of example scripts demonstrating how to run the sampler.
- Utilities for visualising the resulting topic hierarchy.
- A test suite verifying the sampler on synthetic data and a small BBC corpus.
The package can be installed directly from PyPI:
```bash
pip install hlda
```
Alternatively, to develop locally, clone this repository and install it in editable mode:
```bash
git clone https://github.com/joewandy/hlda.git
cd hlda
pip install -e .
pre-commit install
```
The easiest way to get started is with the sample BBC dataset provided in the `data/` directory. You can run the full demonstration from the command line:

```bash
python examples/bbc_demo.py --data-dir data/bbc/tech --iterations 20
```
If you installed the package from PyPI, you can run the same demo via the `hlda-run` command:

```bash
hlda-run --data-dir data/bbc/tech --iterations 20
```
To write the learned hierarchy to disk in JSON format, pass `--export-tree <file>` when running the script:

```bash
python scripts/run_hlda.py --data-dir data/bbc/tech --export-tree tree.json
```
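The JSON schema is whatever `run_hlda.py` emits; as an illustration only, assuming a nested layout where each node carries hypothetical `words` and `children` keys, the exported tree could be pretty-printed like this:

```python
import json

def print_tree(node, indent=0):
    # `words` and `children` are assumed field names for illustration;
    # check the actual keys written by --export-tree before relying on them.
    print(" " * indent + ", ".join(node.get("words", [])[:5]))
    for child in node.get("children", []):
        print_tree(child, indent + 2)

with open("tree.json") as fh:
    print_tree(json.load(fh))
```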
If you make use of the BBC dataset, please cite the publication by Greene and Cunningham (2006) as detailed in `CITATION.cff`.
Example scripts for the BBC dataset and synthetic data are available in the `examples/` directory.
Within Python you can also construct the sampler directly:
```python
from hlda.sampler import HierarchicalLDA

corpus = [["word", "word", ...], ...]  # list of tokenised documents
vocab = sorted({w for doc in corpus for w in doc})

hlda = HierarchicalLDA(corpus, vocab, alpha=1.0, gamma=1.0, eta=0.1,
                       num_levels=3, seed=0)
hlda.estimate(iterations=50, display_topics=10)
```
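For a concrete input, here is one simple way (an illustrative sketch, not preprocessing the package requires) to turn raw strings into the `corpus`/`vocab` pair the sampler expects:

```python
import re

docs = [
    "the quick brown fox jumps over the lazy dog",
    "topic models learn latent structure from text collections",
]

# deliberately naive tokenisation: lowercase, alphabetic tokens only;
# swap in your preferred tokeniser and stop-word handling
corpus = [re.findall(r"[a-z]+", doc.lower()) for doc in docs]
vocab = sorted({w for doc in corpus for w in doc})
```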
The package also provides a `HierarchicalLDAEstimator` that follows the scikit-learn API. This allows using the sampler inside a standard `Pipeline`:
```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

from hlda.sklearn_wrapper import HierarchicalLDAEstimator

vectorizer = CountVectorizer()

# rebuild per-token word-index lists from the count matrix
prep = FunctionTransformer(
    lambda X: (
        [[i for i, c in enumerate(row) for _ in range(int(c))] for row in X.toarray()],
        list(vectorizer.get_feature_names_out()),
    ),
    validate=False,
)

pipeline = Pipeline([
    ("vect", vectorizer),
    ("prep", prep),
    ("hlda", HierarchicalLDAEstimator(num_levels=3, iterations=10, seed=0)),
])

pipeline.fit(documents)  # `documents` is a list of raw text strings
assignments = pipeline.transform(documents)
```
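The `prep` step bridges a representational gap: `CountVectorizer` outputs a document-term count matrix, whereas the Gibbs sampler works on per-token word lists. The `FunctionTransformer` therefore expands each row of counts back into a list of word indices (repeating index `i` once per count) and pairs it with the fitted vocabulary; note that the lambda closes over `vectorizer`, which is already fitted by the time this step runs inside the pipeline.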
The repository includes a small test suite that checks the sampler on both the BBC corpus and synthetic data. After installing the development dependencies you can run:
```bash
pytest -q
```
All tests should pass in a few seconds.
This project is licensed under the terms of the MIT license. See `LICENSE.txt` for details.