
Natural-Language-Processing-and-Text-Mining

Quora Dataset — Determine if two questions ask the same thing

Overview

This repository presents a modular and reproducible pipeline for detecting semantically equivalent questions in the Quora Question Pairs (QQP) dataset. It implements a Mixture-of-Experts (MoE) architecture that blends predictions from both classical machine learning models and pretrained transformer-based models through a softmax-trained gating mechanism. Extensive preprocessing, diverse feature engineering, and dimensionality reduction (IPCA & UMAP) contribute to robust and generalizable performance.


Key Highlights

  • Multi-resolution sentence embeddings (MiniLM, MPNet)

  • Lexical, semantic, structural, and fuzzy features (~3600 features)

  • Dimensionality reduction using IPCA (truncated at 95% explained variance, "k95") and UMAP

  • Multiple expert types: LR, SVM, XGB, LGBM, BERT, RoBERTa, CrossEnc, etc.

  • Learnable soft-gate over expert logits (trained on validation log-loss)

  • Cached embeddings, predictions, models, and gate weights

  • Evaluation logged via lightweight .csv logger for reproducibility


Project Structure

Natural-Language-Processing-and-Text-Mining/
├── data/
│   ├── quora.csv
│   ├── splits/                 # train/valid/test (no question leakage)
│   └── processed/              # clean text, embeddings, features
│       ├── question_meta.csv
│       ├── clean_questions.npy
│       ├── question_embeddings_384.npy / 768.npy
│       └── X_*_{ipca, umap}.npy
│
├── models/
│   ├── custom/                 # Pickled classical experts (LR, SVM, etc.)
│   ├── pretrained/             # Logistic head weights for DistilBERT
│   ├── pred_cache/             # Per-expert .npy prediction logs
│   ├── features*/              # TF-IDF, SVD, PCA caches
│   └── gates/                  # MoE gate weights and expert subset indices
│
├── notebooks/
│   ├── 0_split.ipynb
│   ├── 1_eda.ipynb
│   ├── 2_preprocessing.ipynb
│   ├── 3_feature_engineering.ipynb
│   ├── 4_models.ipynb
│   ├── 5_benchmarks_lr{X}_ep{Y}.ipynb
│   └── 6_summary.ipynb
│
├── src/
│   ├── preprocessing.py        # Cleaners, SBERT cache, len/word stats
│   ├── features.py             # Feature blocks, IPCA/UMAP
│   ├── custom_models.py        # LR/XGB/LGBM/etc. experts
│   ├── pretrained_models.py    # BERT, RoBERTa, Distil, CrossEnc, MoE
│   └── logs.py                 # CSV logger by event type
│
└── metric_logs/
    ├── splits.csv
    ├── eda.csv / eda_summary.csv
    ├── preprocessing.csv
    ├── features.csv
    ├── models.csv
    ├── gates.csv
    └── benchmarks_lrX_epY.csv


Notebook-by-Notebook Overview

0_split.ipynb

  • Creates fixed, reproducible train/valid/test splits with no question leakage across splits (sketched below)
  • Drops rows with null questions
  • Saves train.csv, valid.csv, test.csv to data/splits/
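
A leakage-free split keeps every occurrence of a question in a single partition. The snippet below is a minimal sketch of one way to do this, not the notebook's exact logic; the 80/10/10 ratios and the standard QQP column names are assumptions.

# Minimal sketch of a leakage-aware split; the repo's exact logic lives in
# 0_split.ipynb. The 80/10/10 ratios and the standard QQP schema
# (qid1, qid2, question1, question2) are assumptions.
import numpy as np
import pandas as pd

df = pd.read_csv("data/quora.csv").dropna(subset=["question1", "question2"])

# Assign each *question id* to a split, so both members of a pair must land
# in the same split for the pair to be kept (no question leakage).
rng = np.random.RandomState(42)
qids = pd.unique(df[["qid1", "qid2"]].values.ravel())
split_of = pd.Series(rng.choice(["train", "valid", "test"], size=len(qids),
                                p=[0.8, 0.1, 0.1]), index=qids)

s1, s2 = split_of[df["qid1"]].values, split_of[df["qid2"]].values
for name in ("train", "valid", "test"):
    df[(s1 == name) & (s2 == name)].to_csv(f"data/splits/{name}.csv", index=False)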

1_eda.ipynb

  • Explores class balance, length distributions, top tokens
  • Logs EDA statistics to metric_logs/eda.csv

2_preprocessing.ipynb

  • Creates per-question artifacts:
    • Raw, cleaned, lowercased
    • Lengths (chars/words)
  • Sentence-transformer embeddings (384D MiniLM and 768D MPNet; caching sketched below)
  • Saves:
    • question_meta.csv
    • clean_questions.npy
    • question_embeddings_384.npy / question_embeddings_768.npy
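
The embedding cache lets later notebooks reuse SBERT vectors instead of recomputing them. A minimal sketch, assuming the stock all-MiniLM-L6-v2 (384D) and all-mpnet-base-v2 (768D) checkpoints; the exact model names are an assumption, as the README only says MiniLM and MPNet.

# Sketch of the embed-and-cache step; checkpoint names are assumptions.
import numpy as np
from pathlib import Path
from sentence_transformers import SentenceTransformer

questions = np.load("data/processed/clean_questions.npy", allow_pickle=True)

for name, dim in [("all-MiniLM-L6-v2", 384), ("all-mpnet-base-v2", 768)]:
    out = Path(f"data/processed/question_embeddings_{dim}.npy")
    if out.exists():                                  # cache hit: nothing to do
        continue
    model = SentenceTransformer(name)
    emb = model.encode(list(questions), batch_size=256,
                       show_progress_bar=True, convert_to_numpy=True)
    np.save(out, emb.astype(np.float32))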

3_feature_engineering.ipynb

  • Computes 3,598-dimensional features:
    • TF-IDF word/char
    • SVD word/char
    • Fuzzy scores
    • Embedding distances
    • Graph-based stats
  • Dimensionality reduction:
    • IPCA (retaining 95% of the variance)
    • UMAP (n_neighbors=15, min_dist=0.1)
    • Combo: IPCA → UMAP (for a smoother manifold; sketched below)
  • Saves 3 versions for both dim=384 and dim=768:
    • IPCA only (X_*_ipca.npy)
    • UMAP only (X_*_umap.npy)
    • IPCA + UMAP (X_*_ipca_umap.npy)
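
To make these steps concrete, here is a minimal sketch of two representative feature blocks (fuzzy scores and embedding distances) followed by the IPCA → UMAP chain. The pair-index columns q1_idx/q2_idx are hypothetical names, and the 256-component cap is an assumption; the full ~3,600-dimensional pipeline lives in src/features.py.

# Sketch of two feature blocks plus the IPCA -> UMAP chain.
# Column names q1_idx/q2_idx are hypothetical.
import numpy as np
import pandas as pd
import umap
from rapidfuzz import fuzz
from sklearn.decomposition import IncrementalPCA

pairs = pd.read_csv("data/splits/train.csv")
texts = np.load("data/processed/clean_questions.npy", allow_pickle=True)
emb = np.load("data/processed/question_embeddings_768.npy")
i, j = pairs["q1_idx"].values, pairs["q2_idx"].values

# Fuzzy lexical block: RapidFuzz similarities in [0, 100] per pair.
fuzzy = np.array([[fuzz.ratio(texts[a], texts[b]),
                   fuzz.token_sort_ratio(texts[a], texts[b])]
                  for a, b in zip(i, j)])

# Embedding-distance block: cosine similarity plus elementwise |u - v|.
u, v = emb[i], emb[j]
cos = (u * v).sum(1) / (np.linalg.norm(u, axis=1) * np.linalg.norm(v, axis=1) + 1e-9)
X = np.hstack([fuzzy, cos[:, None], np.abs(u - v)]).astype(np.float32)

# IPCA with an upper bound on components, truncated at 95% cumulative
# variance ("k95"), then UMAP with the stated hyperparameters.
ipca = IncrementalPCA(n_components=256, batch_size=4096).fit(X)
k95 = int(np.searchsorted(np.cumsum(ipca.explained_variance_ratio_), 0.95)) + 1
X_red = umap.UMAP(n_neighbors=15, min_dist=0.1).fit_transform(
    ipca.transform(X)[:, :k95])
np.save("data/processed/X_train_ipca_umap.npy", X_red)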

4_models.ipynb

  • Two branches:

    • Pretrained experts use the full (768D or 384D) sentence embeddings
    • Custom experts use the dimensionality-reduced (IPCA/UMAP) engineered features
  • Experts trained:

    • BertExpert
    • RobertaExpert
    • XLNetExpert (optional)
    • CrossEncExpert
    • QuoraDistilExpert with LR on [|u−v|, u·v] (first sketch below)
    • LRFeatureExpert, XGB, LGBM, KNN, RF, SVM
  • MoE Gate (second sketch below):

    • Gated softmax over expert outputs
    • Trained on all expert combinations
    • All results logged with their validation log-loss
    • Top-10 subsets are retrained on Train+Valid
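
The Distil expert's head is a plain logistic regression over pair features built from the two sentence embeddings. A minimal sketch, assuming u·v denotes the elementwise product and using hypothetical variable names:

# Sketch of the QuoraDistilExpert head: LR over [|u - v|, u * v].
# u_train/v_train/y_train etc. are hypothetical names for the DistilBERT
# embeddings of each question in a pair and the duplicate labels.
import numpy as np
from sklearn.linear_model import LogisticRegression

def pair_features(u, v):
    # u, v: (n_pairs, dim) sentence embeddings of question 1 and question 2
    return np.hstack([np.abs(u - v), u * v])

head = LogisticRegression(max_iter=1000)
head.fit(pair_features(u_train, v_train), y_train)
p_valid = head.predict_proba(pair_features(u_valid, v_valid))[:, 1]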
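And a minimal sketch of the soft gate itself: softmax-normalized weights over the cached expert probabilities, fit by minimizing validation log-loss with Adam. The repo's version lives in src/pretrained_models.py; the number of gradient steps per epoch here is an assumption.

import torch

def fit_gate(P_valid, y_valid, lr=0.01, epochs=2, steps_per_epoch=100):
    # P_valid: (n_samples, n_experts) positive-class probabilities per expert
    P = torch.as_tensor(P_valid, dtype=torch.float32).clamp(1e-7, 1 - 1e-7)
    y = torch.as_tensor(y_valid, dtype=torch.float32)
    w = torch.zeros(P.shape[1], requires_grad=True)   # one gate logit per expert
    opt = torch.optim.Adam([w], lr=lr)
    for _ in range(epochs * steps_per_epoch):         # full-batch gradient steps
        opt.zero_grad()
        p = P @ torch.softmax(w, dim=0)               # gated mixture probability
        loss = torch.nn.functional.binary_cross_entropy(p, y)
        loss.backward()
        opt.step()
    return torch.softmax(w, dim=0).detach().numpy()   # final expert weights

Because the gate only mixes cached probabilities, every expert subset can be scored cheaply; the top-10 subsets by validation log-loss are then retrained on Train+Valid.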

5_benchmarks*.ipynb

  • Loads top-10 MoE gates from models/gates/
  • Loads corresponding moe_*_idxs.npy subsets
  • Runs .predict_prob() on test.csv
  • Outputs:
    • test_LL, test_ACC, test_F1, test_PREC, test_REC, test_AUC, seconds
  • Logs to metric_logs/benchmarks_lr{X}_ep{Y}.csv (metric computation sketched below)
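
A sketch of the computation behind one benchmark row; moe, X_test, and y_test are hypothetical names, while .predict_prob() is the method name this README gives:

import time
import numpy as np
from sklearn.metrics import (log_loss, accuracy_score, f1_score,
                             precision_score, recall_score, roc_auc_score)

t0 = time.perf_counter()
p = moe.predict_prob(X_test)              # gated probabilities for the test split
seconds = time.perf_counter() - t0
yhat = (p >= 0.5).astype(int)             # 0.5 threshold assumed for hard labels
row = dict(test_LL=log_loss(y_test, p),
           test_ACC=accuracy_score(y_test, yhat),
           test_F1=f1_score(y_test, yhat),
           test_PREC=precision_score(y_test, yhat),
           test_REC=recall_score(y_test, yhat),
           test_AUC=roc_auc_score(y_test, p),
           seconds=seconds)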

6_summary.ipynb

  • Tabulates the benchmark results
  • Analyzes correlations between metrics and gate hyperparameters

Evaluation Pipeline

  • Evaluation metrics: Log-Loss, Accuracy, F1, Precision, Recall, ROC-AUC, Inference Time
  • Gated combinations evaluated on test set under 3 configurations:
    • lr=0.001, epochs=1
    • lr=0.01, epochs=2
    • lr=0.05, epochs=10
  • Results are stored in metric_logs/benchmarks_lrX_epY.csv via the lightweight CSV logger (sketched below)
  • Heatmap visualizations and correlation analyses assess metric interdependence
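
The CSV logger in src/logs.py is what makes these runs traceable. Its actual API is not shown in this README; one plausible shape, as an assumed sketch:

import csv, os, time

def log_event(path, **fields):
    # Append one row per event; write the header only when the file is new.
    fields = {"timestamp": time.strftime("%Y-%m-%d %H:%M:%S"), **fields}
    is_new = not os.path.exists(path)
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(fields))
        if is_new:
            writer.writeheader()
        writer.writerow(fields)

# Hypothetical usage:
log_event("metric_logs/models.csv", event="fit", model="LRFeatureExpert", seconds=12.3)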

Requirements (minimal)

Save this to requirements.txt:

numpy
pandas
scikit-learn
matplotlib
seaborn
sentence-transformers
transformers
xgboost
lightgbm
torch
umap-learn
rapidfuzz
networkx

Optional:

sentencepiece # required for XLNetExpert

How to Reproduce

  1. Download the quora.csv dataset into data/
  2. Run main.py or the notebooks in order:
    • 0_split.ipynb
    • 1_eda.ipynb
    • 2_preprocessing.ipynb
    • 3_feature_engineering.ipynb
    • 4_models.ipynb
    • 5_benchmarks*.ipynb
    • 6_summary.ipynb
  3. Inspect results in metric_logs/ or plot them from the CSVs
  4. Run ablations by altering the feature-reduction method or the expert list

Academic Use

This repository was developed as part of a course assignment. It includes:

  • Modular architecture with clear separation of concerns

  • Combined feature-based and transformer-based modeling

  • Log-based tracking for reproducibility

  • Validation-driven MoE tuning

  • Full support for ablation and metric correlation analysis