Chroma Workshop: Topic Modeling for RAG Systems

Prerequisites: Before running the notebooks, download the 560 conversations and save them to ./data/conversations.json
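A quick way to confirm the prerequisite is met is to load the file before opening the notebooks. The sketch below is not part of the repository; `load_conversations` is a hypothetical helper, and it assumes the file holds a top-level JSON array, which the README does not state.

```python
import json
from pathlib import Path


def load_conversations(path="./data/conversations.json"):
    """Load the conversations file and return its contents as a list.

    Assumption: the file is a JSON array of conversation records.
    """
    file = Path(path)
    if not file.is_file():
        raise FileNotFoundError(f"Expected conversations at {file}")
    with file.open(encoding="utf-8") as f:
        data = json.load(f)
    if not isinstance(data, list):
        raise ValueError("Expected a top-level JSON array of conversations")
    return data
```

Calling `len(load_conversations())` from the repository root should report 560 if the download is complete and the array assumption holds.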

Installation

Clone this repository:

git clone https://github.com/567-labs/how-to-look-at-data.git
cd how-to-look-at-data

Install dependencies (using uv):
```
uv pip install -r pyproject.toml
```
Set up your environment variables:
- You will need an OPENAI_API_KEY to run this topic modeling process. We use OpenAI's text-embedding-3-small embeddings for clustering and OpenAI models for summarization (used by Kura).
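
A missing key tends to surface as an authentication error deep inside a notebook cell, so it can help to check for it up front. This is a minimal sketch; `require_env` is a hypothetical helper, not something the repository provides.

```python
import os


def require_env(name):
    """Return the value of an environment variable, or raise a clear error."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; export it before running the notebooks.")
    return value
```

Running `require_env("OPENAI_API_KEY")` in the first notebook cell fails fast with a readable message if the variable is absent or empty.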

Notebooks

1. Cluster Conversations

Discover patterns in 560 real user queries from Weights & Biases documentation using Kura's LLM-enhanced topic modeling.

You'll learn:

- How topic modeling reveals query patterns invisible to keyword analysis
- Converting raw queries into clustered insights using embeddings
- Why default summaries miss critical domain-specific details
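
To make the idea of clustering queries by their embeddings concrete, here is a small sketch in plain Python. It is not Kura's algorithm: it uses a simple greedy threshold clustering over cosine similarity as a stand-in, and the three-dimensional vectors are made-up placeholders for real embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def greedy_cluster(vectors, threshold=0.8):
    """Assign each vector to the most similar existing cluster whose
    centroid similarity is at least `threshold`; otherwise start a new one.
    Returns a list of clusters, each a list of input indices."""
    clusters = []
    for i, vec in enumerate(vectors):
        best, best_sim = None, threshold
        for members in clusters:
            sim = cosine(vec, centroid([vectors[j] for j in members]))
            if sim >= best_sim:
                best, best_sim = members, sim
        if best is None:
            clusters.append([i])
        else:
            best.append(i)
    return clusters


# Toy vectors standing in for embeddings of four hypothetical queries.
toy_vectors = [
    [0.9, 0.1, 0.0],  # "how do I log a metric"
    [0.8, 0.2, 0.1],  # "log metrics in a loop"
    [0.1, 0.9, 0.2],  # "reset my api key"
    [0.0, 0.8, 0.3],  # "where is my api key"
]
groups = greedy_cluster(toy_vectors)
```

With these toy vectors the two logging queries end up in one group and the two API-key queries in another, even though the queries within each pair share few exact keywords.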

2. Better Summaries

Transform generic clustering results into precise, actionable insights with custom summarization models.

You'll learn:

- Building domain-specific summary models for your use case
- How better summaries dramatically improve clustering quality
- Reducing noise to focus on what matters for your users
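
One way to think about a domain-specific summary is as a prompt that tells the model which details to preserve. The sketch below only builds such a prompt string; `build_summary_prompt` and the example hints are hypothetical and do not reflect Kura's API or the prompts used in the notebook.

```python
def build_summary_prompt(query, domain_hints):
    """Build a summarization prompt that names the details to keep.

    `domain_hints` is a list of short strings describing what a summary
    must mention (for example, the product feature involved).
    """
    hint_lines = "\n".join(f"- {hint}" for hint in domain_hints)
    return (
        "Summarize the user query below in one sentence.\n"
        "Always mention the following when present:\n"
        f"{hint_lines}\n\n"
        f"Query: {query}"
    )


# Hypothetical hints for a documentation-support use case.
example_prompt = build_summary_prompt(
    "How do I version my dataset between runs?",
    ["the specific product feature", "the user's goal"],
)
```

The resulting string can be passed to whichever LLM client the notebook uses; the point is that the hints steer summaries toward the details that later separate clusters.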

3. Classifiers

Build production-ready classifiers that achieve 90%+ accuracy through systematic prompt engineering.

You'll learn:

- Creating weak labels for rapid dataset creation
- Iterative prompt improvement techniques
- Deploying classifiers to monitor query patterns in real time
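
The weak-labeling and evaluation loop can be sketched without any LLM calls. The example below uses keyword rules as the weak labeler, which is a deliberately simple stand-in for the prompt-based approach in the notebook; the category names, keywords, and queries are made up for illustration.

```python
def weak_label(query, rules, default="other"):
    """Return the first label whose keywords appear in the query."""
    text = query.lower()
    for label, keywords in rules.items():
        if any(keyword in text for keyword in keywords):
            return label
    return default


def accuracy(predicted, expected):
    """Fraction of predictions that match the expected labels."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        return 0.0
    return sum(p == e for p, e in zip(predicted, expected)) / len(expected)


# Hypothetical rules and a tiny hand-labeled evaluation set.
rules = {"artifacts": ["artifact"], "sweeps": ["sweep"]}
queries = [
    "How do I version an Artifact?",
    "Can I run a sweep over learning rates?",
    "Where do I find my billing page?",
    "How do I download files from a run?",
]
hand_labels = ["artifacts", "sweeps", "other", "artifacts"]

weak_labels = [weak_label(q, rules) for q in queries]
score = accuracy(weak_labels, hand_labels)
```

Here the last query is mislabeled because it never mentions the keyword, which is the kind of miss that iterating on the labeling rules or prompt is meant to catch.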

About

This repository was created for the AI Engineering Summit. It demonstrates practical techniques for analyzing and improving Retrieval-Augmented Generation (RAG) systems using real-world data and modern topic modeling tools.
