This repository collects Natural Language Processing (NLP) assignments, experiments, and projects. The tasks are practical exercises in core NLP techniques, written in Python with libraries such as NLTK, spaCy, and Scikit-learn.
Task No. | Topic | Description |
---|---|---|
1 | Text Preprocessing | Tokenization, stopword removal, stemming, and lemmatization |
2 | POS Tagging | Part-of-speech tagging using NLTK and evaluation with the Penn Treebank |
3 | Named Entity Recognition (NER) | Entity detection using spaCy with CoNLL-2003 dataset |
4 | Ambiguity Analysis | Lexical, syntactic, and semantic ambiguities using Brown Corpus |
5 | Sentiment Analysis | ML-based sentiment model on IMDB movie reviews |
6 | Text Classification | News article classification using 20 Newsgroups dataset |
7 | Language Modeling | N-gram language model evaluated with WikiText-2 |
8 | Machine Translation | English-to-French translation using seq2seq model on WMT14 |
9 | Text Generation | RNN-based text generator trained on literary data from Project Gutenberg |
10 | Rule-Based Chatbot | Simple chatbot with predefined rules and dialogue corpus |
➡️ See branch: assignment-i-2024
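
As an illustration of Task 7 above, the following is a minimal sketch of a bigram language model with add-one (Laplace) smoothing and a perplexity score. It is not the code from the branch: the function names, the toy corpus, and the choice of add-one smoothing are assumptions made for this example, and it uses only the standard library instead of WikiText-2.

```python
import math
from collections import Counter

def train_bigram(sentences):
    """Count unigrams and bigrams over tokenized sentences.

    Each sentence is padded with <s> and </s> boundary markers.
    """
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        padded = ["<s>"] + list(tokens) + ["</s>"]
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams

def bigram_prob(prev, word, unigrams, bigrams):
    """Add-one smoothed estimate of P(word | prev)."""
    # <s> is only ever a context, never a predicted word, so it is
    # excluded from the vocabulary size used for smoothing.
    vocab_size = len(set(unigrams) - {"<s>"})
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)

def perplexity(tokens, unigrams, bigrams):
    """Per-token perplexity of one tokenized sentence under the model."""
    padded = ["<s>"] + list(tokens) + ["</s>"]
    log_prob = sum(
        math.log(bigram_prob(prev, word, unigrams, bigrams))
        for prev, word in zip(padded, padded[1:])
    )
    return math.exp(-log_prob / (len(padded) - 1))

# Toy corpus (assumption: stands in for a real training set).
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
unigrams, bigrams = train_bigram(corpus)

seen = perplexity(["the", "cat", "sat"], unigrams, bigrams)
scrambled = perplexity(["sat", "cat", "the"], unigrams, bigrams)
print(f"seen order: {seen:.2f}, scrambled order: {scrambled:.2f}")
```

A sentence in a word order the model has seen should score a lower perplexity than the same words scrambled. The sketch does not handle out-of-vocabulary words, which a model evaluated on a real corpus would need.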
Task No. | Topic | Description |
---|---|---|
1 | Tokenization | Sentence and word tokenizer using Reuters-21578 dataset |
2 | Stemming | Porter Stemmer applied on Brown Corpus |
3 | Lemmatization | WordNet lemmatizer with comparison to stemming using Gutenberg Corpus |
4 | Bag of Words (BoW) | Convert documents into numerical vectors using 20 Newsgroups dataset |
5 | TF-IDF | Feature extraction from IMDB Movie Reviews |
6 | Morphological Analysis | Root form detection using Universal Dependencies |
7 | Regex Pattern Extraction | Extract dates, emails, etc. from Enron Email Dataset |
8 | Levenshtein Edit Distance | Compare word pairs using edit distance (WordNet or custom dataset) |
9 | Preprocessing Pipeline | Includes tokenization, normalization, and vectorization (Amazon Reviews) |
10 | Spell Checker | Suggest spelling corrections using edit distance and Birkbeck corpus |
➡️ See branch: assignment-ii-2024
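
Tasks 8 and 10 above both rest on edit distance. The sketch below shows one common way to compute Levenshtein distance with dynamic programming and to rank spelling suggestions by it. The function names, the `max_distance` and `limit` parameters, and the three-word vocabulary are hypothetical; the branch itself may use a different implementation and draws its word list from the Birkbeck corpus.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions
    needed to turn string a into string b."""
    # previous[j] holds the distance between the prefix of a processed
    # so far and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]

def suggest(word, vocabulary, max_distance=2, limit=3):
    """Return up to `limit` vocabulary words within `max_distance`
    edits of `word`, closest first (ties broken alphabetically)."""
    scored = ((levenshtein(word, candidate), candidate) for candidate in vocabulary)
    ranked = sorted(pair for pair in scored if pair[0] <= max_distance)
    return [candidate for _, candidate in ranked[:limit]]

# Hypothetical vocabulary, for illustration only.
VOCABULARY = ["tokenizer", "token", "stemmer"]
print(levenshtein("kitten", "sitting"))
print(suggest("tokenizr", VOCABULARY))
```

Scanning the whole vocabulary for every query is fine for a small word list. For a large dictionary, a more selective candidate search would likely be needed.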
A new folder titled Learning Task has been added to the repository. It currently includes:
- 📝 Natural Language Preprocessing.ipynb: a notebook demonstrating core text preprocessing techniques
- 🧪 Small Task.ipynb: a mini NLP task or experiment (details inside the notebook)
This section will grow as more ad-hoc or exploratory tasks are added.
git clone https://github.com/yourusername/nlp-task.git
cd nlp-task
Switch to the relevant branch:
git checkout assignment-i-2024
# or
git checkout assignment-ii-2024
- Python 3.8+
- NLTK
- spaCy
- Scikit-learn
- Pandas & NumPy
- TensorFlow / PyTorch (as required)
- Hugging Face Transformers (optional)
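
To see which of the packages listed above are already installed, a small check like the following may help. It is a sketch only: the import names (`sklearn` for Scikit-learn, `torch` for PyTorch) are the usual ones for these projects, and the split into required and optional groups is an assumption based on the list above.

```python
import sys
from importlib.util import find_spec

# Display name -> import name. The grouping is an assumption.
REQUIRED = {
    "NLTK": "nltk",
    "spaCy": "spacy",
    "Scikit-learn": "sklearn",
    "Pandas": "pandas",
    "NumPy": "numpy",
}
OPTIONAL = {
    "TensorFlow": "tensorflow",
    "PyTorch": "torch",
    "Hugging Face Transformers": "transformers",
}

def check_packages(packages):
    """Map each display name to True if its module can be located."""
    status = {}
    for label, module in packages.items():
        try:
            status[label] = find_spec(module) is not None
        except (ImportError, ValueError):
            # find_spec can raise for partially initialised modules.
            status[label] = False
    return status

def report():
    """Print the Python version check and the status of each package."""
    print("Python 3.8+:", sys.version_info >= (3, 8))
    for group_name, group in (("required", REQUIRED), ("optional", OPTIONAL)):
        for label, found in check_packages(group).items():
            print(f"{label} ({group_name}):", "found" if found else "missing")

report()
```

The check only confirms that a module can be located. It does not verify versions or that required corpora and models have been downloaded.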
- NLTK corpora: https://www.nltk.org/nltk_data/
- IMDB reviews: https://ai.stanford.edu/~amaas/data/sentiment/
- 20 Newsgroups: http://qwone.com/~jason/20Newsgroups/
- CoNLL-2003: https://www.clips.uantwerpen.be/conll2003/ner/
- WikiText-2: https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/
- WMT14: http://www.statmt.org/wmt14/translation-task.html
- Project Gutenberg: https://www.gutenberg.org/
- Cornell Movie Dialogues: https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html
Datasets and tools are sourced from:
- NLTK
- Stanford AI
- UCI ML Repository
- Hugging Face Datasets
- Kaggle
- Universal Dependencies