Document Classification-NLP

Classification of botany books in PDF format as Animal Species, Plant Species, or Mixed type, using the NLP toolkit nltk.

The training and test data contain 3 columns: 'File_ID', 'Is_Mixed', and 'Topic'.
Each File_ID is unique and holds the details of one topic across a varying number of pages.
Each PDF is read to identify its topic and to determine whether it focuses on animals, plants, or both.
For each PDF file, we find the words with the highest probability of occurring and use them to identify the topic.
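The word-probability step can be sketched as follows. This is a minimal illustration using only the Python standard library in place of nltk. The sample text, the small stop-word set, and the function name `top_words` are assumptions made for the example, and PDF text extraction is left out.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word set; nltk ships a much fuller list.
STOP_WORDS = {"the", "a", "an", "and", "of", "in", "is", "are", "to", "on"}

def top_words(text, n=3):
    """Return the n most frequent non-stop-words with their relative frequencies."""
    tokens = [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]
    counts = Counter(tokens)
    total = sum(counts.values())
    # Relative frequency = count of the word / number of kept tokens.
    return [(word, count / total) for word, count in counts.most_common(n)]

# Made-up text standing in for the extracted content of one PDF.
sample = "The leaf of the plant is green. The plant grows roots and a leaf in soil."
print(top_words(sample))
```

In this sample the most probable words ("leaf", "plant") point to a plant topic, which is the kind of signal the classification relies on.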

Training Data Topic Distribution

This is a supervised multi-class classification problem in which we determine the topic of each document from its high-probability word occurrences.

From the training data, we have the content of each File_ID and a label indicating whether it belongs to the animal, plant, or mixed type.

If the word occurrence probability is high for both plant and animal species, classify it as 'Not Applicable'.
If the word occurrence probability is high for plant species, classify it as 'Plant Species'.
If the word occurrence probability is high for animal species, classify it as 'Animal Species'.
Once the classification is done, we fill in the 'Is_Mixed' column based on the resulting class.
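The rules above can be sketched as a small function. The README does not say what counts as a "high" probability, how 'Is_Mixed' is encoded, or what happens when neither score is high, so the `threshold` value, the boolean 'Is_Mixed' flag, and the fallback branch are all assumptions made for illustration.

```python
def classify(plant_score, animal_score, threshold=0.4):
    """Map plant/animal word-occurrence scores to (topic, is_mixed).

    threshold is a hypothetical cut-off for "high"; it is not taken from the README.
    """
    plant_high = plant_score >= threshold
    animal_high = animal_score >= threshold
    if plant_high and animal_high:
        topic = "Not Applicable"
    elif plant_high:
        topic = "Plant Species"
    elif animal_high:
        topic = "Animal Species"
    else:
        # Case not covered by the README; assumption: fall back to the larger score.
        topic = "Plant Species" if plant_score >= animal_score else "Animal Species"
    # Assumption: a document counts as mixed exactly when both scores were high.
    is_mixed = topic == "Not Applicable"
    return topic, is_mixed

print(classify(0.8, 0.1))
print(classify(0.6, 0.7))
```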

A comparative study of the output after TF-IDF and CountVectorizer vectorization is provided, where CountVectorizer outperforms TF-IDF on the AUC_ROC score (0.9864 vs. 0.9288).
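A comparison of this kind could be set up roughly as below with scikit-learn. The README does not name the classifier, so `MultinomialNB` is an assumption, and the toy corpus is made up purely to keep the sketch runnable. It is not the project's data and will not reproduce the reported scores.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import confusion_matrix
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up toy corpus standing in for the extracted PDF text.
train_texts = [
    "leaf root stem flower seed",
    "flower petal pollen seed leaf",
    "mammal fur predator prey",
    "bird feather nest egg predator",
    "leaf insect pollen bird flower",
    "seed bird nest fruit mammal",
]
train_labels = ["plant", "plant", "animal", "animal", "mixed", "mixed"]
test_texts = ["root stem leaf", "fur predator egg", "flower bird pollen nest"]
test_labels = ["plant", "animal", "mixed"]

labels = ["animal", "mixed", "plant"]  # fixed row/column order for the matrices
results = {}
for name, vectorizer in [("TF-IDF", TfidfVectorizer()),
                         ("CountVectorizer", CountVectorizer())]:
    # Same classifier for both runs, so only the vectorizer differs.
    model = make_pipeline(vectorizer, MultinomialNB())
    model.fit(train_texts, train_labels)
    predicted = model.predict(test_texts)
    results[name] = confusion_matrix(test_labels, predicted, labels=labels)
    print(name)
    print(results[name])
```

Keeping the classifier and the train/test split identical across both runs means any difference in the confusion matrices can be attributed to the vectorizer.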

Confusion matrix on TF-IDF vectorized inputs

Confusion matrix on CountVectorizer vectorized inputs

AUC_ROC Score for TF-IDF = 0.9288

AUC_ROC Score for CountVectorizer = 0.9864
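The README does not state how the multi-class AUC_ROC score was computed. One common approach is to average one-vs-rest AUCs over the classes, which the sketch below illustrates in plain Python. The function names, labels, and scores are made up for the example and are unrelated to the figures reported above.

```python
def binary_auc(y_true, scores):
    """AUC as the chance that a random positive outranks a random negative (ties count half)."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

def macro_ovr_auc(labels, class_scores):
    """Average the one-vs-rest AUC over all classes.

    class_scores maps each class name to its per-document scores.
    """
    aucs = []
    for cls, scores in class_scores.items():
        y_true = [1 if label == cls else 0 for label in labels]
        aucs.append(binary_auc(y_true, scores))
    return sum(aucs) / len(aucs)

# Made-up true labels and per-class scores for five documents.
labels = ["plant", "animal", "mixed", "plant", "animal"]
class_scores = {
    "plant":  [0.7, 0.2, 0.3, 0.6, 0.1],
    "animal": [0.2, 0.7, 0.3, 0.1, 0.4],
    "mixed":  [0.1, 0.1, 0.4, 0.3, 0.5],
}
print(round(macro_ovr_auc(labels, class_scores), 4))
```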
