Code examples for open source NLP analytics and lists of resources
- Java
- Python
- scikit-learn
- nltk
- spacy.io - part of speech and entity extraction (see the sketch after this list)
- Gensim - TF-IDF, word2vec, and others
- C
- DataSets
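The spaCy entry above mentions part-of-speech tagging and entity extraction. A minimal sketch of both is below; it assumes the small English model has been downloaded with `python3 -m spacy download en_core_web_sm`, and the sample sentence is made up.

```python
# Part-of-speech tags and named entities with spaCy (assumes en_core_web_sm is installed).
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying a U.K. startup for $1 billion.")

# Part-of-speech tag for each token
for token in doc:
    print(token.text, token.pos_)

# Named entities and their labels (ORG, GPE, MONEY, ...)
for ent in doc.ents:
    print(ent.text, ent.label_)
```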
- Clean noise from the text (Ex. email headers and signatures, non-text documents, bad HTML tags)
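A minimal sketch of this cleaning step is below; the regular expressions and the conventional "-- " email signature delimiter are illustrative assumptions, not a complete cleaner.

```python
# Strip a trailing email signature and leftover HTML tags before tokenizing.
# The "-- " signature delimiter and the tag regex are simplifying assumptions.
import re

def clean_text(raw):
    # Drop everything after the conventional "-- " signature separator
    body = raw.split("\n-- \n")[0]
    # Remove HTML tags that survived extraction
    body = re.sub(r"<[^>]+>", " ", body)
    # Collapse the whitespace left behind
    return re.sub(r"\s+", " ", body).strip()

print(clean_text("<p>Hello <b>world</b></p>\n-- \nSent from my phone"))  # "Hello world"
```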
- Tokenize each document into a list of features and feature counts. Features can be:
- sequences of non-whitespace characters (most common)
- character n-grams (Ex. "Hello world" -> "he" "el" "ll" "lo" "o " " w" "wo" ...). Note: character n-grams are resistant to OCR errors and misspellings, and can match on partial root words.
- word n-grams (Ex. "The brown fox" -> "the" "the brown" "brown" "brown fox" "fox"); both n-gram types are shown in the sketch after this list
- grammar-parsed noun and verb phrases plus words
- language-specific word parsing (Ex. Chinese word segmentation, German compound splitting)
- word hashes modulo N
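A minimal sketch of three of these feature types, using scikit-learn's CountVectorizer on a made-up two-document corpus:

```python
# Whitespace tokens, character n-grams, and word n-grams with CountVectorizer.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["The brown fox", "Hello world"]

# Sequences of non-whitespace characters (the most common feature type)
words = CountVectorizer(token_pattern=r"\S+")
words.fit(docs)
print(sorted(words.vocabulary_))        # ['brown', 'fox', 'hello', 'the', 'world']

# Character bigrams, tolerant of OCR errors and misspellings
char_ngrams = CountVectorizer(analyzer="char", ngram_range=(2, 2))
char_ngrams.fit(docs)
print(sorted(char_ngrams.vocabulary_))  # character pairs such as 'he', 'el', 'll', 'lo', ...

# Word unigrams and bigrams
word_ngrams = CountVectorizer(ngram_range=(1, 2))
word_ngrams.fit(docs)
print(sorted(word_ngrams.vocabulary_))  # includes 'the brown', 'brown fox', 'hello world'
```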
- Remove stopwords like "a", "an", "the", "for", etc.
- Optional word transforms
- stem ("walks" -> "walk", "walking" -> "walk")
- lemmatization ("are" -> "be", "is" -> "be"); both transforms are shown in the sketch after this list
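A minimal sketch of both transforms with NLTK; the lemmatizer needs the WordNet data, fetched once with `nltk.download("wordnet")`.

```python
# Stemming and lemmatization with NLTK (run nltk.download("wordnet") once for the lemmatizer data).
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
print(stemmer.stem("walks"), stemmer.stem("walking"))   # walk walk

lemmatizer = WordNetLemmatizer()
# Lemmatization is part-of-speech aware; pos="v" marks these as verbs
print(lemmatizer.lemmatize("are", pos="v"), lemmatizer.lemmatize("is", pos="v"))  # be be
```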
- Assemble a global document count (document frequency) for each word
- Form the vocabulary from words with document frequencies between 2 documents and 90% of documents.
- Build TF-IDF weights from the document counts
- Multiply document counts by the TF-IDF weights and optionally normalize each document vector to Euclidean length 1; these steps are shown in the sketch below.
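A minimal sketch of the document-frequency cutoffs, stopword removal, TF-IDF weighting, and length normalization above, using scikit-learn's TfidfVectorizer on a made-up corpus:

```python
# Vocabulary between 2 documents and 90% of documents, English stopwords removed,
# TF-IDF weighting, and each document vector L2-normalized (Euclidean length 1).
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox is quick",
    "a lazy dog sleeps all day",
]

vectorizer = TfidfVectorizer(stop_words="english", min_df=2, max_df=0.9, norm="l2")
X = vectorizer.fit_transform(docs)           # sparse matrix: documents x vocabulary
print(vectorizer.get_feature_names_out())    # terms that survived the frequency cutoffs
print(X.shape)
```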
- Plug weighted counts into your favorite machine learning algorithm
- K-means Clustering
- Classification
- Word Vectors - converts sparse word counts into a dense vector representing the "context" of each word
- Topic Modeling - identifies multiple "topic vectors" that sum in different amounts to form each document in the corpus. Sketches for K-means clustering and word2vec follow this list.
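Two sketches for the list above. First, K-means clustering of TF-IDF document vectors with scikit-learn; the tiny corpus and k=2 are illustrative assumptions.

```python
# Cluster TF-IDF document vectors with K-means.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox is quick",
    "a lazy dog sleeps all day",
]

X = TfidfVectorizer(stop_words="english").fit_transform(docs)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)   # cluster id assigned to each document
```

Second, dense word vectors with Gensim's word2vec (Gensim 4.x API); the tokenized toy corpus is made up and far too small to produce meaningful vectors.

```python
# Train word2vec on an already-tokenized corpus and look up dense vectors.
from gensim.models import Word2Vec

tokenized = [
    ["the", "quick", "brown", "fox"],
    ["the", "lazy", "dog"],
    ["the", "quick", "dog"],
]

model = Word2Vec(sentences=tokenized, vector_size=50, window=2, min_count=1, epochs=20)
print(model.wv["fox"].shape)           # (50,) dense "context" vector for "fox"
print(model.wv.most_similar("quick"))  # nearest words by cosine similarity
```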
- install python (recommend python 3)
- on mac:
brew install python3
- install jupyter:
pip3 install jupyter
- install any prerequisite packages (Ex. scikit-learn):
pip3 install scikit-learn
- cd to directory where *.ipynb files are
jupyter notebook
- this should open a web browser where you can launch each notebook
- "shift-return" runs each command
- ctrl-c from the command line exits the notebook server
- K-means clustering of movie subtitles with scikit-learn. (link)