Prerequisites: Make sure that you've downloaded and saved the 560 conversations to a file at the path ./data/conversations.json
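Before running the notebooks, it can help to confirm that the file is where they expect it. The following is a minimal sketch, not part of the repository: `load_conversations` and `count_conversations` are hypothetical helper names, and the idea that the file holds a top-level JSON array is an assumption about its layout.

```python
import json
from pathlib import Path


def load_conversations(path="./data/conversations.json"):
    """Parse the conversations file, failing early with a clear message if it is missing."""
    file_path = Path(path)
    if not file_path.is_file():
        raise FileNotFoundError(
            f"No conversations file at {file_path}; download and save it there first."
        )
    with file_path.open(encoding="utf-8") as f:
        return json.load(f)


def count_conversations(data):
    """Return the number of items if the top level is a JSON array, else None.

    Assumption: the conversations are stored as a top-level array.
    """
    return len(data) if isinstance(data, list) else None
```

Calling `count_conversations(load_conversations())` from the repository root should return 560 if the download is complete and the layout assumption holds; a `None` result means the file uses a different top-level structure.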
- Clone this repository:
  git clone https://github.com/567-labs/how-to-look-at-data.git
  cd how-to-look-at-data
- Install dependencies (using uv):
  uv pip install -r pyproject.toml
- Set up your environment variables:
  - You will need a GOOGLE_API_KEY and an OPENAI_API_KEY to run this topic modelling process. We're using the OpenAI Text-Embedding-3-Small embeddings for clustering and the Gemini-2.0-flash model for summarisation (used by kura).
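A quick check that both keys are visible to Python can save a failed notebook run. This is a small sketch under stated assumptions: `missing_api_keys` is a hypothetical helper, not something provided by the repository or by kura, and it only checks that the variables are non-empty, not that the keys are valid.

```python
import os

# The two variables named in the setup step above.
REQUIRED_KEYS = ("GOOGLE_API_KEY", "OPENAI_API_KEY")


def missing_api_keys(environ=os.environ):
    """Return the required variable names that are unset or empty in `environ`."""
    return [key for key in REQUIRED_KEYS if not environ.get(key)]
```

An empty list from `missing_api_keys()` means both variables are set in the current process; otherwise the list names the ones still to export.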
1. Cluster Conversations: Understand query patterns in large RAG applications using topic modeling and Kura, with real user queries from the Weights & Biases documentation. This notebook covers data preparation, clustering, and analysis of user query themes.
2. Better Summaries: Learn how to create domain-specific, concise summaries for Weights & Biases queries to produce more meaningful and actionable topic clusters.
3. Classifiers: Learn how to create classifiers that explicitly detect and monitor, in production, the topics you've identified.
This repository was created for the AI Engineering Summit. It demonstrates practical techniques for analyzing and improving Retrieval-Augmented Generation (RAG) systems using real-world data and modern topic modeling tools.