📚 Course: Natural Language Processing (CSE556)
👨‍🏫 Instructor: Dr. Shad Akhtar, IIIT-Delhi
🧠 Semester: Winter 2025
🏫 Institute: IIIT Delhi
🛠️ Repo Overview: This repository contains all the assignments submitted as part of the graduate-level NLP course, covering core and advanced topics in language modeling, sentiment analysis, question answering, and multimodal understanding.
- Understand the fundamentals of statistical and neural NLP methods.
- Apply linguistic and syntactic analysis for deeper language understanding.
- Design, implement, and evaluate NLP models for real-world tasks like language modeling, sentiment analysis, and claim normalization.
- Explore state-of-the-art transformer-based and multimodal architectures.
- Text Preprocessing
- Language Modeling
- Word Embeddings (Word2Vec, GloVe, FastText)
- PoS Tagging & Hidden Markov Models
- Sequence Learning
- Neural Language Models (MLP, GRU, LSTM)
- Transformers and Attention Mechanisms
- Fine-tuning of Pretrained Models (BERT, BART, RoBERTa, SpanBERT)
- Sequence Labeling and Aspect-Based Sentiment Analysis
- Text Classification (Fake News, Hate Speech, Deception Detection)
- Conversational Dialogue
- Summarization
- Question Answering (SQuAD v2)
- Multimodal NLP (Sarcasm Explanation via MuSE architecture)
- Syntax Parsing
- Implemented a custom WordPiece tokenizer from scratch using only standard Python libraries.
- Created a vocabulary from the corpus and tokenized sentences from a test dataset.
- Output includes a vocabulary file and tokenized JSON output.
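The core of WordPiece tokenization is greedy longest-match segmentation: repeatedly take the longest vocabulary piece that matches from the current position, marking continuation pieces with `##`. A minimal sketch, using a toy hand-written vocabulary rather than one learned from the corpus:

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Split a word into the longest matching vocabulary pieces, left to right."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces get the ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matched: emit a single unknown token
        tokens.append(piece)
        start = end
    return tokens

vocab = {"un", "##aff", "##able", "play", "##ing"}
print(wordpiece_tokenize("unaffable", vocab))  # ['un', '##aff', '##able']
print(wordpiece_tokenize("playing", vocab))    # ['play', '##ing']
```

The assignment additionally learns the vocabulary itself from corpus statistics; this sketch covers only the segmentation step applied at tokenization time.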
- Built a CBOW-based Word2Vec model from scratch using PyTorch.
- Used the tokenizer from Task 1 to prepare the dataset.
- Trained embeddings and computed cosine similarities to validate the model.
- Included training/validation loss plots and similarity analysis.
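The similarity analysis boils down to cosine similarity between learned embedding vectors. A small sketch with illustrative toy vectors (not the trained embeddings):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors: semantically close words should score near 1.
king, queen, apple = [0.9, 0.8, 0.1], [0.85, 0.75, 0.2], [0.1, 0.2, 0.9]
print(round(cosine_similarity(king, queen), 3))  # high: similar words
print(round(cosine_similarity(king, apple), 3))  # low: dissimilar words
```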
- Developed and trained three variants of a multi-layer perceptron (MLP) for next-word prediction.
- Integrated custom Word2Vec embeddings.
- Compared model architectures based on accuracy and perplexity.
- Included a prediction pipeline for next-token generation.
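Perplexity, one of the comparison metrics above, is the exponential of the average negative log-likelihood the model assigns to the true next tokens. A minimal sketch with illustrative probabilities (not actual model outputs):

```python
import math

def perplexity(token_probs):
    """token_probs: the probability the model gave each true next token."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# A model that always spreads mass over k equally likely tokens has perplexity k.
print(round(perplexity([0.5, 0.5, 0.5]), 6))     # 2.0
print(round(perplexity([0.25, 0.25, 0.25]), 6))  # 4.0
```

Lower perplexity means the model is, on average, less "surprised" by the held-out text.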
- Sequence labeling using RNNs/GRUs with GloVe/FastText
- BIO tagging and F1-score evaluation
- Sentiment classification for aspect terms
- Models: RNN/GRU/LSTM and fine-tuning BERT, BART, RoBERTa
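Span-level F1 for BIO tagging first requires decoding the tag sequence into chunks. A minimal sketch of that decoding step, using simplified `B`/`I`/`O` tags without entity types:

```python
def bio_to_spans(tags):
    """Return (start, end_exclusive) index spans for each B-/I- chunk."""
    spans, start = [], None
    for i, tag in enumerate(tags):
        if tag == "B":                 # a new chunk begins here
            if start is not None:
                spans.append((start, i))
            start = i
        elif tag == "O":               # outside any chunk: close the open one
            if start is not None:
                spans.append((start, i))
            start = None
        # "I" simply continues the currently open chunk
    if start is not None:
        spans.append((start, len(tags)))
    return spans

print(bio_to_spans(["O", "B", "I", "O", "B", "O"]))  # [(1, 3), (4, 5)]
```

Precision, recall, and F1 are then computed by comparing predicted spans against gold spans as sets.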
- Fine-tuning SpanBERT and SpanBERT-CRF for SQuAD v2
- Evaluation using Exact Match (EM) score
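Exact Match in the style of the official SQuAD evaluation compares answers only after normalization: lowercasing, stripping punctuation and the articles a/an/the, and collapsing whitespace. A hedged sketch of that comparison:

```python
import re
import string

def normalize(text):
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, gold):
    """1 if the normalized strings are identical, else 0."""
    return int(normalize(prediction) == normalize(gold))

print(exact_match("The Eiffel Tower", "eiffel tower"))  # 1
print(exact_match("Paris", "London"))                   # 0
```

For SQuAD v2's unanswerable questions, an empty prediction must match an empty gold answer to score 1.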
- Implementation of transformer components (positional encoding, self-attention, etc.)
- Language modeling using the Shakespeare dataset
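Among the transformer components, the sinusoidal positional encoding from "Attention Is All You Need" can be sketched directly from its formula: PE[pos, 2i] = sin(pos / 10000^(2i/d)) and PE[pos, 2i+1] = cos(pos / 10000^(2i/d)). A tiny pure-Python version (the assignment uses tensor operations):

```python
import math

def positional_encoding(max_len, d_model):
    """Build the max_len x d_model sinusoidal position matrix."""
    pe = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)      # even dimensions: sine
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)  # odd dimensions: cosine
    return pe

pe = positional_encoding(max_len=4, d_model=8)
print(pe[0])  # position 0: sin terms are 0.0, cos terms are 1.0
```

Because each dimension oscillates at a different wavelength, every position gets a distinct pattern, and relative offsets correspond to linear transformations of the encoding.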
- Fine-tuning BART and T5 for social media claim rewriting
- Evaluation using ROUGE-L, BLEU-4, and BERTScore
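ROUGE-L, one of the metrics above, is an F-score over the longest common subsequence (LCS) of the generated and reference token sequences. A minimal sketch using whitespace tokenization (library implementations also apply stemming and other preprocessing):

```python
def lcs_length(a, b):
    """Classic dynamic-programming longest common subsequence."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if x == y else max(dp[i-1][j], dp[i][j-1])
    return dp[len(a)][len(b)]

def rouge_l(candidate, reference):
    """LCS-based F1 between candidate and reference token sequences."""
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

print(rouge_l("the claim is false", "this claim is false"))  # 0.75
```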
- Vision + Text fusion using ViT and BART
- Implementation of Shared Fusion Mechanism
- Evaluation using ROUGE, BLEU, METEOR, BERTScore
This work was completed under the mentorship of Dr. Shad Akhtar, whose lectures and assignments deeply strengthened my understanding of modern NLP techniques. The assignments were completed in a group of three collaborators: Akshat Chaw Parmar, Rishi Pendyala, and Vimal Jayant Subburaj, and this repo has been forked from this repository.
Feel free to reach out for collaborations or questions related to the code or topics:
📧 rishi22403@iiitd.ac.in