Code examples for open source NLP analytics and lists of resources
- Java
- Python
- scikit-learn
- nltk
- spacy.io - part of speech and entity extraction (see the sketch after this list)
- Gensim - TF-IDF, word2vec, and others
- C
- DataSets
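The spaCy entry above mentions part-of-speech tagging and entity extraction. A minimal sketch of both is below; it assumes the small English model has been downloaded with `python3 -m spacy download en_core_web_sm`, and the sample sentence is made up.

```python
# Part-of-speech tags and named entities with spaCy (assumes en_core_web_sm is installed).
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying a U.K. startup for $1 billion.")

# Part-of-speech tag for each token
for token in doc:
    print(token.text, token.pos_)

# Named entities and their labels (ORG, GPE, MONEY, ...)
for ent in doc.ents:
    print(ent.text, ent.label_)
```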
- Clean noise from the text (Ex. email headers and signatures, non-text documents, bad HTML tags)
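A minimal sketch of this cleaning step is below; the regular expressions and the conventional "-- " email signature delimiter are illustrative assumptions, not a complete cleaner.

```python
# Strip a trailing email signature and leftover HTML tags before tokenizing.
# The "-- " signature delimiter and the tag regex are simplifying assumptions.
import re

def clean_text(raw):
    # Drop everything after the conventional "-- " signature separator
    body = raw.split("\n-- \n")[0]
    # Remove HTML tags that survived extraction
    body = re.sub(r"<[^>]+>", " ", body)
    # Collapse the whitespace left behind
    return re.sub(r"\s+", " ", body).strip()

print(clean_text("<p>Hello <b>world</b></p>\n-- \nSent from my phone"))  # "Hello world"
```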
- Tokenize each document into a list of features and feature counts. Features can be:
- sequences of non-whitespace characters (most common)
- character n-grams (Ex. "Hello world" -> "he" "el" "ll" "lo" "o " " w" "wo" ...). Note: character n-grams are resistant to OCR errors and misspellings, and can match on partial root words.
- word n-grams (Ex. "The brown fox" -> "the" "the brown" "brown" "brown fox" "fox"); both n-gram types are shown in the sketch after this list
- grammar-parsed noun and verb phrases plus words
- language-specific word parsing (Ex. Chinese word segmentation, German compound splitting)
- word hashes modulo N
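A minimal sketch of three of these feature types, using scikit-learn's CountVectorizer on a made-up two-document corpus:

```python
# Whitespace tokens, character n-grams, and word n-grams with CountVectorizer.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["The brown fox", "Hello world"]

# Sequences of non-whitespace characters (the most common feature type)
words = CountVectorizer(token_pattern=r"\S+")
words.fit(docs)
print(sorted(words.vocabulary_))        # ['brown', 'fox', 'hello', 'the', 'world']

# Character bigrams, tolerant of OCR errors and misspellings
char_ngrams = CountVectorizer(analyzer="char", ngram_range=(2, 2))
char_ngrams.fit(docs)
print(sorted(char_ngrams.vocabulary_))  # character pairs such as 'he', 'el', 'll', 'lo', ...

# Word unigrams and bigrams
word_ngrams = CountVectorizer(ngram_range=(1, 2))
word_ngrams.fit(docs)
print(sorted(word_ngrams.vocabulary_))  # includes 'the brown', 'brown fox', 'hello world'
```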
- Remove stopwords like "a", "an", "the", "for", etc.
- Optional word transforms
- stem ("walks" -> "walk", "walking" -> "walk")
- lemmatization ("are" -> "be", "is" -> "be"); both transforms are shown in the sketch after this list
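A minimal sketch of both transforms with NLTK; the lemmatizer needs the WordNet data, fetched once with `nltk.download("wordnet")`.

```python
# Stemming and lemmatization with NLTK (run nltk.download("wordnet") once for the lemmatizer data).
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
print(stemmer.stem("walks"), stemmer.stem("walking"))   # walk walk

lemmatizer = WordNetLemmatizer()
# Lemmatization is part-of-speech aware; pos="v" marks these as verbs
print(lemmatizer.lemmatize("are", pos="v"), lemmatizer.lemmatize("is", pos="v"))  # be be
```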
- Assemble a global document count (document frequency) for each word
- Form the vocabulary from words with document frequencies between 2 documents and 90% of documents.
- Build TF-IDF weights from the document counts
- Multiply document counts by the TF-IDF weights and optionally normalize each document vector to Euclidean length 1; these steps are shown in the sketch below.
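A minimal sketch of the document-frequency cutoffs, stopword removal, TF-IDF weighting, and length normalization above, using scikit-learn's TfidfVectorizer on a made-up corpus:

```python
# Vocabulary between 2 documents and 90% of documents, English stopwords removed,
# TF-IDF weighting, and each document vector L2-normalized (Euclidean length 1).
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox is quick",
    "a lazy dog sleeps all day",
]

vectorizer = TfidfVectorizer(stop_words="english", min_df=2, max_df=0.9, norm="l2")
X = vectorizer.fit_transform(docs)           # sparse matrix: documents x vocabulary
print(vectorizer.get_feature_names_out())    # terms that survived the frequency cutoffs
print(X.shape)
```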
- Plug weighted counts into your favorite machine learning algorithm
- K-means Clustering
- Classification
- Word Vectors - converts sparse word counts into a dense vector representing the "context" of each word
- Topic Modeling - identifies multiple "topic vectors" that sum in different amounts to form each document in the corpus. Sketches for K-means clustering and word2vec follow this list.
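Two sketches for the list above. First, K-means clustering of TF-IDF document vectors with scikit-learn; the tiny corpus and k=2 are illustrative assumptions.

```python
# Cluster TF-IDF document vectors with K-means.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox is quick",
    "a lazy dog sleeps all day",
]

X = TfidfVectorizer(stop_words="english").fit_transform(docs)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)   # cluster id assigned to each document
```

Second, dense word vectors with Gensim's word2vec (Gensim 4.x API); the tokenized toy corpus is made up and far too small to produce meaningful vectors.

```python
# Train word2vec on an already-tokenized corpus and look up dense vectors.
from gensim.models import Word2Vec

tokenized = [
    ["the", "quick", "brown", "fox"],
    ["the", "lazy", "dog"],
    ["the", "quick", "dog"],
]

model = Word2Vec(sentences=tokenized, vector_size=50, window=2, min_count=1, epochs=20)
print(model.wv["fox"].shape)           # (50,) dense "context" vector for "fox"
print(model.wv.most_similar("quick"))  # nearest words by cosine similarity
```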
- install python (recommend python 3)
- on mac:
brew install python3
- install jupyter:
pip3 install jupyter
- install any prerequisite packages (Ex. scikit-learn):
pip3 install scikit-learn
- cd to directory where *.ipynb files are
jupyter notebook
- this should open a web browser where you can launch each notebook
- "shift-return" runs each command
- ctrl-c from the command line exits the notebook server
- K-means clustering of movie subtitles with scikit-learn. (link)