Classification of botany books in PDF format into Animal Species, Plant Species, or Mixed type using the NLP toolkit NLTK.
- The training data and test data contain 3 columns: 'File_ID', 'Is_Mixed', 'Topic'
- Each File_ID is unique and refers to a PDF that covers one topic, with a varying number of pages
- Each PDF needs to be read to identify its topic and determine whether it focuses on animals, plants, or both
- For each PDF file, we check which words occur with the highest probability and use them to identify the topic
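The word-probability step can be sketched as follows. This is a minimal illustration, not the project's actual code: it assumes the PDF text has already been extracted into a string, uses a plain regular expression instead of an NLTK tokenizer, and the keyword sets `ANIMAL_WORDS` and `PLANT_WORDS` are hypothetical placeholders.

```python
import re
from collections import Counter

# Hypothetical keyword lists; the real vocabulary is not given in this README.
ANIMAL_WORDS = {"animal", "mammal", "bird", "insect", "predator"}
PLANT_WORDS = {"plant", "leaf", "flower", "root", "seed"}


def word_probabilities(text):
    """Return each word's relative frequency (count / total words) in the text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: n / total for word, n in counts.items()}


def topic_scores(text):
    """Sum the word probabilities that fall in each hypothetical keyword set."""
    probs = word_probabilities(text)
    animal = sum(p for w, p in probs.items() if w in ANIMAL_WORDS)
    plant = sum(p for w, p in probs.items() if w in PLANT_WORDS)
    return {"animal": animal, "plant": plant}


# Text assumed to have been extracted from one PDF beforehand.
sample = "The flower has a root and a seed. The bird eats the seed."
scores = topic_scores(sample)
print(scores)
```

Here the plant keywords account for a larger share of the words than the animal keywords, so the sample would lean towards a plant topic.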
Training Data Topic Distribution
This is a supervised multi-class classification problem in which we determine the topic of each document from its high-probability words.
From the training data, we have the content of each File_ID and a label indicating whether it belongs to the animal, plant, or mixed type.
- If the word occurrence probability is high for both plant and animal species, classify it as 'Not Applicable'
- If the word occurrence probability is high for plant species, classify it as 'Plant Species'
- If the word occurrence probability is high for animal species, classify it as 'Animal Species'
- Once the classification is done, the 'Is_Mixed' column value can be set based on the resulting topic
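The rules above can be sketched as a small decision function. This is only an illustration under stated assumptions: the `threshold` that defines a "high" probability, the boolean encoding of 'Is_Mixed', and the handling of the case where neither score is high are all hypothetical, since the README does not specify them.

```python
def classify(animal_score, plant_score, threshold=0.05):
    """Map two word-probability scores to a 'Topic' label.

    The threshold value is an assumption made for this sketch.
    """
    animal_high = animal_score >= threshold
    plant_high = plant_score >= threshold
    if animal_high and plant_high:
        return "Not Applicable"
    if plant_high:
        return "Plant Species"
    if animal_high:
        return "Animal Species"
    # Neither score is high; the rules above do not cover this case.
    return None


def is_mixed(topic):
    """Derive 'Is_Mixed' from the topic (boolean encoding is an assumption)."""
    return topic == "Not Applicable"


for animal, plant in [(0.10, 0.20), (0.00, 0.20), (0.20, 0.00)]:
    topic = classify(animal, plant)
    print(animal, plant, topic, is_mixed(topic))
```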
A comparative study of the results with TF-IDF and CountVectorizer inputs is provided; CountVectorizer outperforms TF-IDF.
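To make the difference between the two vectorizations concrete, here is a minimal pure-Python sketch. It is not the code used for the study: the three toy documents are invented, and the TF-IDF variant shown (smoothed IDF followed by L2 normalization of each row) is an assumption about the weighting, chosen because it is a commonly used form.

```python
import math
from collections import Counter

# Toy documents, invented for illustration only.
docs = [
    "plant leaf plant root",
    "animal bird animal plant",
    "bird insect animal",
]


def count_vectors(documents):
    """Return the sorted vocabulary and one raw term-count row per document."""
    counters = [Counter(doc.split()) for doc in documents]
    vocab = sorted({word for counter in counters for word in counter})
    rows = [[counter[word] for word in vocab] for counter in counters]
    return vocab, rows


def tfidf_vectors(documents):
    """Weight counts by a smoothed IDF, then L2-normalize each row."""
    vocab, counts = count_vectors(documents)
    n_docs = len(documents)
    # Document frequency: number of documents containing each term.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    rows = []
    for row in counts:
        weighted = [c * w for c, w in zip(row, idf)]
        norm = math.sqrt(sum(x * x for x in weighted))
        rows.append([x / norm if norm else 0.0 for x in weighted])
    return vocab, rows


vocab, counts = count_vectors(docs)
_, tfidf = tfidf_vectors(docs)
print(vocab)
print(counts[0])
print([round(x, 3) for x in tfidf[0]])
```

Raw counts keep the absolute word frequencies, while TF-IDF down-weights words that appear in many documents, so the two representations can lead a classifier to different results on the same corpus.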
Confusion matrix on TF-IDF vectorized inputs
Confusion matrix on CountVectorizer vectorized inputs
AUC_ROC Score for TF-IDF = 0.9288
AUC_ROC Score for CountVectorizer = 0.9864