# Quora Dataset — Determine if two questions ask the same thing
This repository presents a modular and reproducible pipeline for detecting semantically equivalent questions in the Quora Question Pairs (QQP) dataset. It implements a Mixture-of-Experts (MoE) architecture that blends predictions from both classical machine learning models and pretrained transformer-based models through a softmax-trained gating mechanism. Extensive preprocessing, diverse feature engineering, and dimensionality reduction (IPCA & UMAP) contribute to robust and generalizable performance.
- Multi-resolution sentence embeddings (MiniLM, MPNet)
- Lexical, semantic, structural, and fuzzy features (~3,600 features)
- Dimensionality reduction using IPCA (k95) and UMAP
- Multiple expert types: LR, SVM, XGB, LGBM, BERT, RoBERTa, CrossEnc, etc.
- Learnable soft gate over expert logits (trained on validation log-loss)
- Cached embeddings, predictions, models, and gate weights
- Evaluation logged via a lightweight CSV logger for reproducibility
```
Natural-Language-Processing-and-Text-Mining/
├── data/
│   ├── quora.csv
│   ├── splits/                    # train/valid/test (no question leakage)
│   └── processed/                 # clean text, embeddings, features
│       ├── question_meta.csv
│       ├── clean_questions.npy
│       ├── question_embeddings_384.npy / 768.npy
│       └── X_*_{ipca, umap}.npy
│
├── models/
│   ├── custom/                    # Pickled classical experts (LR, SVM, etc.)
│   ├── pretrained/                # Logistic head weights for DistilBERT
│   ├── pred_cache/                # Per-expert .npy prediction logs
│   ├── features*/                 # TF-IDF, SVD, PCA caches
│   └── gates/                     # MoE gate weights and expert subset indices
│
├── notebooks/
│   ├── 0_split.ipynb
│   ├── 1_eda.ipynb
│   ├── 2_preprocessing.ipynb
│   ├── 3_feature_engineering.ipynb
│   ├── 4_models.ipynb
│   └── 5_benchmarks_lr{X}_ep{Y}.ipynb
│
├── src/
│   ├── preprocessing.py           # Cleaners, SBERT cache, len/word stats
│   ├── features.py                # Feature blocks, IPCA/UMAP
│   ├── custom_models.py           # LR/XGB/LGBM/etc. experts
│   ├── pretrained_models.py       # BERT, RoBERTa, Distil, CrossEnc, MoE
│   └── logs.py                    # CSV logger by event type
│
└── metric_logs/
    ├── splits.csv
    ├── eda.csv / eda_summary.csv
    ├── preprocessing.csv
    ├── features.csv
    ├── models.csv
    ├── gates.csv
    └── benchmarks_lrX_epY.csv
```
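The per-event CSV logger in `src/logs.py` can be sketched as an append-only writer that stamps each row (a sketch only; `log_event` and its signature are assumptions, not the repo's actual API):

```python
import csv
import os
import time

def log_event(path: str, **fields) -> None:
    """Append one event row to a metric_logs CSV, writing the header
    on first use so each log file stays self-describing."""
    row = {"timestamp": time.strftime("%Y-%m-%d %H:%M:%S"), **fields}
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    write_header = not os.path.exists(path)
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(row))
        if write_header:
            writer.writeheader()
        writer.writerow(row)
```

Keeping one file per event type (`models.csv`, `gates.csv`, ...) lets each stage append independently without schema conflicts.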
**0_split.ipynb**
- Creates fixed, reproducible train/valid/test splits (no question leakage)
- Drops rows with nulls
- Saves `train.csv`, `valid.csv`, `test.csv` to `data/splits/`
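A leakage-free split can be sketched by grouping linked questions (via union-find over the pair graph) and assigning whole groups to one split, so no question ever appears in two splits. A minimal sketch, assuming the standard QQP columns `qid1`/`qid2`; the function name is illustrative:

```python
import random
import pandas as pd

def leakage_free_split(df: pd.DataFrame, train=0.8, valid=0.1, seed=42):
    """Assign whole connected groups of questions to one split,
    so no question (and hence no pair) spans two splits."""
    parent = {}  # union-find over question ids

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in zip(df["qid1"], df["qid2"]):
        parent[find(a)] = find(b)

    groups = {}
    for q in list(parent):
        groups.setdefault(find(q), []).append(q)

    comps = list(groups.values())
    random.Random(seed).shuffle(comps)

    # Hand out components in order until each split's quota is filled.
    split_of, seen, total = {}, 0, len(parent)
    for comp in comps:
        frac = seen / total
        name = "train" if frac < train else "valid" if frac < train + valid else "test"
        for q in comp:
            split_of[q] = name
        seen += len(comp)
    return df.assign(split=df["qid1"].map(split_of))
```

Because `qid1` and `qid2` of a pair always share a component, every pair lands wholly inside one split.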
**1_eda.ipynb**
- Explores class balance, length distributions, top tokens
- Logs EDA statistics to `metric_logs/eda.csv`
**2_preprocessing.ipynb**
- Creates per-question artifacts:
  - Raw, cleaned, lowercased text
  - Lengths (chars/words)
  - Sentence-transformer embeddings (384D and 768D)
- Saves `question_meta.csv`, `clean_questions.npy`, `question_embeddings_384.npy` / `question_embeddings_768.npy`
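The embedding cache can be sketched as "encode once, load thereafter". A sketch under assumptions: the 384-D branch is taken to be a MiniLM checkpoint (`all-MiniLM-L6-v2`), and the function name and defaults are illustrative, not the repo's API:

```python
import os
import numpy as np

def embed_questions(questions, model_name="all-MiniLM-L6-v2",
                    cache_path="data/processed/question_embeddings_384.npy"):
    """Encode each unique question once and cache the matrix as .npy,
    so later notebooks load embeddings without re-running the model."""
    if os.path.exists(cache_path):
        return np.load(cache_path)
    # Heavy dependency imported lazily, only on a cache miss.
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer(model_name)
    emb = model.encode(list(questions), batch_size=256,
                       show_progress_bar=True, normalize_embeddings=True)
    os.makedirs(os.path.dirname(cache_path), exist_ok=True)
    np.save(cache_path, emb)
    return emb
```

Normalizing embeddings up front makes later cosine-style distance features a plain dot product.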
**3_feature_engineering.ipynb**
- Computes 3,598-dimensional pair features:
  - TF-IDF (word/char)
  - SVD (word/char)
  - Fuzzy-matching scores
  - Embedding distances
  - Graph-based stats
- Dimensionality reduction:
  - IPCA (retaining 95% variance)
  - UMAP (n_neighbors=15, min_dist=0.1)
  - Combo: IPCA → UMAP (for a smooth manifold)
- Saves three versions for both `dim=384` and `dim=768`:
  - IPCA only (`X_*_ipca.npy`)
  - UMAP only (`X_*_umap.npy`)
  - IPCA + UMAP (`X_*_ipca_umap.npy`)
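The IPCA "k95" step can be sketched as: fit incrementally, then keep the smallest number of components whose cumulative explained variance reaches 95%. A minimal sketch (function name illustrative); the UMAP stage is shown only as a comment since it just mirrors the settings listed above:

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

def reduce_ipca_k95(X: np.ndarray, var_target: float = 0.95,
                    batch_size: int = 512) -> np.ndarray:
    """Fit IncrementalPCA, then keep the smallest k components whose
    cumulative explained variance reaches the target (the 'k95' cut)."""
    ipca = IncrementalPCA(batch_size=batch_size)
    ipca.fit(X)  # fit() streams X through partial_fit in batches
    cum = np.cumsum(ipca.explained_variance_ratio_)
    k = int(np.searchsorted(cum, var_target)) + 1
    return ipca.transform(X)[:, :k]

# The UMAP stage (umap-learn) would follow with the stated settings:
#   import umap
#   X_umap = umap.UMAP(n_neighbors=15, min_dist=0.1).fit_transform(X_k95)
```

Slicing the transformed matrix works because `IncrementalPCA` orders components by explained variance.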
**4_models.ipynb**
- Two branches:
  - Pretrained experts use the full (768D or 384D) sentence embeddings
  - Custom experts use the reduced (IPCA/UMAP) engineered-feature variants
- Experts trained:
  - `BertExpert`, `RobertaExpert`, `XLNetExpert` (optional), `CrossEncExpert`
  - `QuoraDistilExpert` with LR on [|u−v|, u·v]
  - `LRFeatureExpert`, `XGB`, `LGBM`, `KNN`, `RF`, `SVM`
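The [|u−v|, u·v] representation fed to the LR head can be built directly from the two question embeddings: elementwise difference magnitude concatenated with the elementwise product (a common SBERT-style pair encoding; reading u·v as the elementwise product is an interpretation here, and the function name is illustrative):

```python
import numpy as np

def pair_features(u: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Order-invariant pair encoding: concat(|u - v|, u * v).
    Both blocks are symmetric in (u, v), so swapping the two
    questions yields the identical feature vector."""
    return np.concatenate([np.abs(u - v), u * v], axis=-1)
```

The symmetry matters: the classifier cannot learn anything from the arbitrary order of question 1 vs. question 2.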
- MoE gate:
  - Learnable softmax gate over expert outputs
  - Trained on all expert combinations
  - All results logged with validation log-loss
  - The top-10 subsets are retrained on Train+Valid
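The gate itself can be sketched in torch as a single learnable weight vector, softmaxed into mixture weights over the experts' positive-class probabilities and trained on validation log-loss (a minimal sketch; class and function names are illustrative, not the repo's API):

```python
import torch

class SoftGate(torch.nn.Module):
    """Learnable softmax gate: the mixture p = sum_e softmax(w)_e * p_e
    blends the experts' positive-class probabilities."""
    def __init__(self, n_experts: int):
        super().__init__()
        self.w = torch.nn.Parameter(torch.zeros(n_experts))

    def forward(self, expert_probs):          # (N, E) expert probabilities
        g = torch.softmax(self.w, dim=0)      # (E,) mixture weights
        return expert_probs @ g               # (N,) gated probability

def fit_gate(expert_probs, y, lr=0.01, epochs=100):
    """Minimise binary log-loss of the gated mixture on a validation set."""
    gate = SoftGate(expert_probs.shape[1])
    opt = torch.optim.Adam(gate.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        p = gate(expert_probs).clamp(1e-7, 1 - 1e-7)
        loss = torch.nn.functional.binary_cross_entropy(p, y)
        loss.backward()
        opt.step()
    return gate
```

Because the softmax weights sum to one, the gated output stays a valid probability, and training simply shifts mass toward the better-calibrated experts.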
**5_benchmarks_lr{X}_ep{Y}.ipynb**
- Loads the top-10 MoE gates from `models/gates/` and the corresponding `moe_*_idxs.npy` expert subsets
- Runs `.predict_prob()` on `test.csv`
- Outputs: `test_LL`, `test_ACC`, `test_F1`, `test_PREC`, `test_REC`, `test_AUC`, `seconds`
- Logs to `metric_logs/benchmarks_lrX_epY.csv`
**6_summary.ipynb**
- Result tables for the benchmarks
- Correlation between metrics and hyperparameters
- Evaluation metrics: Log-Loss, Accuracy, F1, Precision, Recall, ROC-AUC, inference time
- Gated combinations are evaluated on the test set under 3 configurations:
  - lr=0.001, epochs=1
  - lr=0.01, epochs=2
  - lr=0.05, epochs=10
- Results are stored in `metric_logs/benchmarks_lrX_epY.csv`
- Heatmap visualizations and correlation analyses assess metric interdependence
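The per-run metric columns can be computed with scikit-learn in one place (a sketch; the 0.5 decision threshold and the `benchmark` function name are assumptions, while the dict keys follow the benchmark CSV columns listed above):

```python
import numpy as np
from sklearn.metrics import (log_loss, accuracy_score, f1_score,
                             precision_score, recall_score, roc_auc_score)

def benchmark(probs, y_true, elapsed):
    """Compute one row of the benchmark CSV from positive-class
    probabilities, true labels, and the measured inference time."""
    probs = np.asarray(probs)
    pred = (probs >= 0.5).astype(int)  # hard labels for ACC/F1/PREC/REC
    return {
        "test_LL": log_loss(y_true, probs),
        "test_ACC": accuracy_score(y_true, pred),
        "test_F1": f1_score(y_true, pred),
        "test_PREC": precision_score(y_true, pred),
        "test_REC": recall_score(y_true, pred),
        "test_AUC": roc_auc_score(y_true, probs),
        "seconds": elapsed,
    }
```

Log-loss and AUC use the raw probabilities, so they stay sensitive to calibration even when the thresholded metrics saturate.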
Save the following to `requirements.txt`:

```
numpy
pandas
scikit-learn
matplotlib
seaborn
sentence-transformers
transformers
xgboost
lightgbm
torch
umap-learn
rapidfuzz
networkx
```

Optional:

```
sentencepiece  # required for XLNetExpert
```
- Download the `quora.csv` dataset into `data/`
- Run `main.py` or the notebooks in order: `0_split.ipynb`, `1_eda.ipynb`, `2_preprocessing.ipynb`, `3_feature_engineering.ipynb`, `4_models.ipynb`, `5_benchmarks*.ipynb`, `6_summary.ipynb`
- Inspect results in `metric_logs/` or plot from the CSVs
- Run ablations by altering the feature reduction or the expert list
This repository was developed as part of a course assignment. It includes:

- Modular architecture with clear separation of concerns
- Synergy between feature-based and transformer-based modeling
- Log-based tracking for reproducibility
- Validation-driven MoE tuning
- Full support for ablation and metric-correlation analysis