Automatic Chapter Classification

Goal: Automatically tag questions (Qs) to their respective chapters on CMS

Problem Statement: Currently, to add any Q on CMS, the creator needs to tag its syllabus, chapter, topic, etc. This causes a significant delay, since tagging a Q to one of the 30 chapters in JEE Chemistry, say, is a non-trivial process and needs the time of a curriculum creator. This process needs to be done 90 times to get a single JEE Main mock paper typed.

To speed up our question creation process on CMS, we should be able to automatically classify a Question into any of the given chapters for a particular syllabus, using its text.

Great reference: This paper is a great reference for this problem; it goes over a lot of the choices that we have to make.

Running the code

R Code

You can work directly on the RStudio file and everything will run except the Python code. There is an issue with RStudio that somehow doesn't let data be shared between R and Python.

R and Python code

Make all the changes you want, and then:

"Source" the runRmd.R from within RStudio
Run from the command line:

Rscript runRmd.R

This will generate a file called report.html with a nice output and all that.

Parameters

These are the choices we have to make while using this:

How to parse the questions: The questions come from CMS and include:
1. Question Text
2. Options
3. Solutions
Which combination of the above performs best? Current option: All of the above. However, maybe we should omit solutions, because later, when questions are given as input, they will mostly be entered without solutions (True?)
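The choice of fields can be made switchable so that the combinations are easy to compare. The following is a minimal sketch, assuming each question arrives as a dictionary; the keys `text`, `options` and `solution`, the function `build_document`, and the sample question are all hypothetical and not taken from the CMS.

```python
def build_document(question, use_options=True, use_solution=True):
    """Join the selected fields of one question into a single string."""
    # Hypothetical field names; the real CMS export may use different keys.
    parts = [question.get("text", "")]
    if use_options:
        parts.extend(question.get("options", []))
    if use_solution:
        parts.append(question.get("solution", ""))
    # Skip empty fields so a missing solution does not add stray spaces.
    return " ".join(p for p in parts if p)


# Made-up sample question, for illustration only.
sample = {
    "text": "Which gas is evolved when zinc reacts with dilute HCl?",
    "options": ["Hydrogen", "Oxygen", "Chlorine", "Nitrogen"],
    "solution": "Zinc displaces hydrogen from the acid.",
}

full_doc = build_document(sample)
no_solution_doc = build_document(sample, use_solution=False)
```

Training once with `use_solution=True` and once with `use_solution=False` would show how much the classifier depends on solutions being present.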
How to clean the question data: These are the options we have and decisions need to be made for each of them:
1. Removing non-ASCII characters (\u0080-\uffff in the Unicode character set)
2. Trimming all extra whitespace
3. Removing numbers (Is this a good idea?)
4. Stop words ("is", "of", etc.) (Is this a good idea?)
5. Removing non-alphanumeric characters (This is probably a good idea)
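The cleaning steps above can be sketched as one function with a flag for each step that is still undecided. This is only an illustration using Python's `re` module: the name `clean_text`, its flags, and the short stop-word list are assumptions, not the project's actual code.

```python
import re

# Illustrative subset only; a real run would use a full stop-word list.
STOP_WORDS = {"is", "of", "the", "a", "an", "in", "to", "and"}


def clean_text(text, drop_numbers=False, drop_stop_words=False):
    # Step 1: replace characters in the \u0080-\uffff range with a space.
    text = re.sub(r"[\u0080-\uffff]", " ", text)
    # Step 5: replace anything that is not a letter, digit or whitespace.
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)
    # Step 3 (optional): remove digits.
    if drop_numbers:
        text = re.sub(r"\d+", " ", text)
    # Step 2: splitting and re-joining trims all extra whitespace.
    tokens = text.split()
    # Step 4 (optional): drop stop words, compared case-insensitively.
    if drop_stop_words:
        tokens = [t for t in tokens if t.lower() not in STOP_WORDS]
    return " ".join(tokens)


basic = clean_text("The pH of  H\u2082O is 7!")
aggressive = clean_text("The pH of  H\u2082O is 7!",
                        drop_numbers=True, drop_stop_words=True)
```

In this example the subscript in the water formula is stripped by step 1, so the formula is split into separate letters. That is one concrete case of the domain-specific information (chemical symbols, equations) that aggressive cleaning can destroy.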
Vectorizations: Options are:
1. Bag of words with or without n-grams
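A bag of words with optional n-grams can be sketched with `collections.Counter`. The helper names `ngrams` and `bag_of_words` and the `max_n` parameter are assumptions made for this example; libraries such as `tm` or sklearn provide their own equivalents.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bag_of_words(text, max_n=1):
    """Count every 1-gram up to max_n-gram in a whitespace-tokenized text."""
    tokens = text.split()
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(ngrams(tokens, n))
    return counts


unigrams_only = bag_of_words("rate of reaction depends on rate constant")
with_bigrams = bag_of_words("rate of reaction depends on rate constant", max_n=2)
```

With `max_n=2` the phrase "rate constant" becomes a feature of its own, which a unigram-only bag of words cannot represent.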
Features: These are our current choices:
1. Term Frequency
2. TF-IDF
3. Latent Semantic Analysis (LSA): Nice PDF link
4. Other questions we have to ask
  1. How do we include domain-specific knowledge (equations, chemical symbols, etc.)
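To make the difference between term frequency and TF-IDF concrete, here is a small sketch that computes both from scratch. It assumes the plain `log(N / df)` form of the inverse document frequency; libraries often use smoothed variants, so their numbers will differ. The function names and the three toy documents are made up for the example.

```python
import math
from collections import Counter


def term_frequencies(docs):
    """Raw term counts per document."""
    return [Counter(doc.split()) for doc in docs]


def tf_idf(docs):
    """Weight each term count by log(N / document frequency)."""
    tfs = term_frequencies(docs)
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for tf in tfs:
        df.update(tf.keys())
    return [
        {term: count * math.log(n_docs / df[term]) for term, count in tf.items()}
        for tf in tfs
    ]


# Toy documents, for illustration only.
docs = ["mole concept mole", "atomic structure", "mole fraction"]
counts = term_frequencies(docs)
weights = tf_idf(docs)
```

In the first document "mole" has the highest raw count, but because it also appears in another document its TF-IDF weight ends up below that of "concept", which is unique to that document.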
Algorithm used: These are some popular choices
1. Naive Bayes
2. SVM
3. Random Forests
4. kNNs
5. ...
6. ...
7. Some form of Deep Learning using TensorFlow. Gluon: Deep Learning for NLP. Also, see references [7] and [8] for GloVe.
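As a reference point for the first option, the following is a minimal sketch of a multinomial Naive Bayes classifier with add-one (Laplace) smoothing, written with the standard library only. It is not the project's implementation (references [1] to [3] cover the library versions); the class name, the training sentences and the two chapter labels are invented for illustration.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes over whitespace tokens, add-one smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(doc.split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, doc):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            total_words = sum(counts.values())
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / total_docs)
            for word in doc.split():
                if word not in self.vocab:
                    continue  # words never seen in training are ignored
                score += math.log(
                    (counts[word] + 1) / (total_words + len(self.vocab))
                )
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data, for illustration only.
train_docs = [
    "mole concept molar mass",
    "limiting reagent mole",
    "electron orbital quantum number",
    "orbital shape electron spin",
]
train_labels = ["Mole Concept", "Mole Concept",
                "Atomic Structure", "Atomic Structure"]

model = NaiveBayes().fit(train_docs, train_labels)
prediction = model.predict("quantum number of an electron")
```

A baseline like this is mainly useful for checking the data pipeline end to end before comparing the other algorithms in the list.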
Measure of accuracy: Some choices that we have:
1. Confusion Matrix: Good because it tells us about the cross category performance. A little bad because it does not deal well with dataset imbalance, which we have a bit of.
2. Specificity, Sensitivity, Precision, Recall: All the fancy things
3. F-Score: Macro F-score and Micro F-score
The problem with 2 and 3 above is that they deal only with binary data. In fact, Macro and Micro F-scores are solutions to this problem, as far as I understand.
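The multi-class F-score averages mentioned above are usually called macro-averaged and micro-averaged F-scores (see reference [5]). The sketch below computes a confusion matrix and both averages on a small, deliberately imbalanced toy example; the function names and the labels are assumptions made for illustration.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred):
    """Count (true label, predicted label) pairs."""
    return Counter(zip(y_true, y_pred))


def f1(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_micro_f1(y_true, y_pred):
    labels = sorted(set(y_true) | set(y_pred))
    cm = confusion_matrix(y_true, y_pred)
    per_class = []
    tp_sum = fp_sum = fn_sum = 0
    for label in labels:
        tp = cm[(label, label)]
        fp = sum(cm[(other, label)] for other in labels if other != label)
        fn = sum(cm[(label, other)] for other in labels if other != label)
        per_class.append(f1(tp, fp, fn))
        tp_sum, fp_sum, fn_sum = tp_sum + tp, fp_sum + fp, fn_sum + fn
    # Macro: average the per-class scores, so every class counts equally.
    macro = sum(per_class) / len(per_class)
    # Micro: pool the counts first, so every question counts equally.
    micro = f1(tp_sum, fp_sum, fn_sum)
    return macro, micro


# Toy labels: class "A" dominates, class "C" is rare and always missed.
y_true = ["A", "A", "A", "A", "B", "C"]
y_pred = ["A", "A", "A", "B", "B", "A"]

cm = confusion_matrix(y_true, y_pred)
macro, micro = macro_micro_f1(y_true, y_pred)
```

Here the micro average equals plain accuracy (4 of 6 correct), while the macro average is lower because the missed rare class counts as much as the common one. That makes the macro score the more sensitive of the two to dataset imbalance.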
Hyperparameter Tuning : https://docs.google.com/spreadsheets/d/1wWhhZF8cgr7RbfdTaplosuqtOJAOpwzG-Y_LtEbWgro/edit?usp=sharing

Useful links:

[1] tm, e1071

[2] tm, e1071, wordcloud

[3] Python sklearn Naive Bayes

[4] Confusion Matrix

[5] Evaluation of text classification - https://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-text-classification-1.html

[6] Nice tutorial on text classification in Python. Includes a lot of methods and the libraries to use - https://www.analyticsvidhya.com/blog/2018/04/a-comprehensive-guide-to-understand-and-implement-text-classification-in-python/

[7] Text Features - https://towardsdatascience.com/understanding-feature-engineering-part-4-deep-learning-methods-for-text-data-96c44370bbfa

[8] Glove on Github - https://github.com/stanfordnlp/GloVe#train-word-vectors-on-a-new-corpus
