An end-to-end GPU-accelerated pipeline to scrape, classify, and summarise Indian legal case documents. Built for scale, tuned for precision.
- Web Scraping – Extracts case data from official court sources into CSV (~250MB).
- Preprocessing – Lowercasing, stop word removal, lemmatization, and n-gram generation.
- Classification – Heuristic rule-based prediction using manually vectorized keywords.
- Summarization – Concise, 100-word summaries using transformer models (T5 / BART) with professional legal tone.
- 📄 Scrapes and structures raw legal text from court portals.
- 🧹 NLP preprocessing using SpaCy and NLTK.
- 🧠 Heuristic classification of content (issues, petitions, conclusions, arguments).
- 📝 T5/BART-based summarization via HuggingFace or local inference.
- ⚡ CUDA-enabled for fast training and inference.
- 🧩 Modular pipeline design.
- 🧠 Build a custom summarization model replicating T5 architecture.
- 📦 Scale dataset up to 1TB with broader legal domain coverage.
- ⚙️ Integrate parallel computing for large-scale training.
- 🌐 Deploy complete webapp with case upload, live classification, and summarization.
- 🧪 Fine-tune on domain-specific legal jargon for Indian courts.