This project implements the first part of a trend detection system inspired by the BERTrend paper. It focuses on extracting and storing topics from time-sliced text data using BERTopic and Elasticsearch.
- ✅ Time slicing of input documents by day/week/month
- ✅ Text embeddings using Sentence-BERT (`all-MiniLM-L6-v2`)
- ✅ Dimensionality reduction (UMAP, via BERTopic)
- ✅ Clustering using HDBSCAN (default in BERTopic)
- ✅ Topic modeling using BERTopic with configurable parameters
- ✅ Topic naming using class-based TF-IDF (c-TF-IDF)
- ✅ Filtering of outlier/noisy topics
- ✅ Storing results to Elasticsearch (`bertrend_results_*`)
- ✅ Modular pipeline design for easy extension and UI integration
- ✅ Inspect results via script or UI-ready JSON
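The time-slicing step listed above can be sketched with pandas. This is a minimal illustration only; the function name, column names, and the frequency mapping are assumptions, not the project's actual code:

```python
import pandas as pd

def slice_documents(df, slice_unit="week", ts_col="timestamp"):
    """Group documents into time slices; each slice is later modeled independently.
    `slice_unit` is mapped to a pandas frequency alias (assumed: day/week/month)."""
    freq = {"day": "D", "week": "W", "month": "MS"}[slice_unit]
    df = df.copy()
    df[ts_col] = pd.to_datetime(df[ts_col])
    # groupby with a Grouper bins rows into calendar intervals of the chosen size
    return {period: group for period, group in df.groupby(pd.Grouper(key=ts_col, freq=freq))}

docs = pd.DataFrame({
    "timestamp": ["2025-01-01", "2025-01-02", "2025-01-10"],
    "text": ["doc a", "doc b", "doc c"],
})
slices = slice_documents(docs, "week")
```

Each value in `slices` is a DataFrame holding the documents of one time slice, ready to be passed to a per-slice topic model.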
- **Ingest Input Data**
  - Load CSV or raw data with timestamp and text fields
  - Save documents to the `bertrend_data` index in Elasticsearch
- **Slice & Model Topics**
  - Data is grouped into time slices (e.g., daily)
  - BERTopic is applied to each slice independently
  - Each topic includes representative docs and keywords
  - Results are stored to `bertrend_results_<job_id>`
- **Inspect Results**
  - Run the CLI script to preview top topics and metadata
  - Results can be visualized or exported to a UI later
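To make the storage step above concrete, here is a sketch of what one topic record written to `bertrend_results_<job_id>` could look like. The field names are hypothetical; the actual Elasticsearch mapping lives in the pipeline code:

```python
import json
from datetime import date

def topic_doc(job_id, slice_start, topic_id, keywords, rep_docs):
    """Shape one topic result for indexing (illustrative field names only)."""
    return {
        "_index": f"bertrend_results_{job_id}",   # per-job results index
        "job_id": job_id,
        "slice_start": slice_start.isoformat(),    # start of the time slice
        "topic_id": topic_id,
        "keywords": keywords,                      # top c-TF-IDF terms
        "representative_docs": rep_docs,
    }

doc = topic_doc("test_01", date(2025, 1, 6), 3,
                ["battery", "ev", "charging"], ["Example representative text"])
print(json.dumps(doc, indent=2))
```

Keeping one flat record per (slice, topic) pair makes later UI queries a simple filter on `job_id` and `slice_start`.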
Modify `configs/config.yaml` to adjust:
- Time slice granularity
- BERTopic parameters (top_n_words, embedding model, etc.)
- Elasticsearch index names
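A hypothetical `configs/config.yaml` illustrating the options above (the key names are assumptions; check the actual file in the repo):

```yaml
time_slice:
  unit: week                 # day | week | month
bertopic:
  embedding_model: all-MiniLM-L6-v2
  top_n_words: 10
elasticsearch:
  data_index: bertrend_data
  results_index_prefix: bertrend_results_
```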
```bash
python -m venv .env
source .env/bin/activate
pip install -r requirements.txt
```
```bash
cd elastic/
docker-compose up -d
```
```bash
python src/ingest_data.py
```
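A minimal sketch of what the ingest step could do: turning CSV rows with timestamp and text fields into Elasticsearch bulk-API action lines. This uses only the standard library; the real `src/ingest_data.py` may use the official client, and the field names are assumptions:

```python
import csv
import json
from io import StringIO

def to_bulk_actions(csv_text, index="bertrend_data"):
    """Convert CSV rows (timestamp,text) into newline-delimited bulk-API JSON.
    Each document gets an `index` action line followed by its source line."""
    lines = []
    for row in csv.DictReader(StringIO(csv_text)):
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps({"timestamp": row["timestamp"], "text": row["text"]}))
    return "\n".join(lines) + "\n"

sample = "timestamp,text\n2025-01-01,hello world\n"
payload = to_bulk_actions(sample)
print(payload)
```

The resulting payload can be POSTed to the `_bulk` endpoint or fed to a client's bulk helper.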
Run it with parameters:

```bash
python -m src.topic_extraction --index bertrend_data --from_date 2025-01-01 --to_date 2025-03-30 --job_id test_01 --slice_unit week
```
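The CLI surface shown above can be sketched with `argparse`. The flag names come from the example invocation; the defaults and `choices` are assumptions:

```python
import argparse

# Minimal sketch of the topic-extraction CLI (flags taken from the example call).
parser = argparse.ArgumentParser(prog="topic_extraction")
parser.add_argument("--index", default="bertrend_data")     # source ES index
parser.add_argument("--from_date")                          # inclusive start (YYYY-MM-DD)
parser.add_argument("--to_date")                            # inclusive end (YYYY-MM-DD)
parser.add_argument("--job_id", required=True)              # names the results index
parser.add_argument("--slice_unit", choices=["day", "week", "month"], default="week")

args = parser.parse_args(
    "--index bertrend_data --from_date 2025-01-01 --to_date 2025-03-30 "
    "--job_id test_01 --slice_unit week".split()
)
print(args.job_id, args.slice_unit)
```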