
NLP-Analytics

Code examples for open source NLP analytics and lists of resources

Presentations

  1. NLP feature engineering

Resources

  1. Java
    1. OpenNLP
  2. Python
    1. scikit-learn
      1. TF-IDF
    2. nltk
    3. spacy.io - part-of-speech tagging and entity extraction
    4. Gensim - TF-IDF, word2vec, and others
  3. C
    1. Senna
  4. DataSets
    1. nltk datasets

Bag Of Words Feature Extraction

  1. Clean the text of noise content (Ex. email headers and signatures, non-text documents, bad HTML tags)
  2. Tokenize each document into a list of features and feature counts (see the tokenization sketch after this list). Features can be:
    1. sequences of non-whitespace characters (most common)
    2. character n-grams (Ex. "Hello world" -> "he" "el" "ll" "lo" "o " " w" "wo" ...). Note that character n-grams are resistant to OCR errors and misspellings, and can match on partial root words.
    3. word n-grams (Ex. "The brown fox" -> "the" "the brown" "brown" "brown fox" "fox")
    4. grammar-parsed noun and verb phrases plus words
    5. language-specific word segmentation (Ex. Chinese, German)
    6. word hashes modulo N (the "hashing trick")
  3. Remove stopwords like "a", "an", "the", "for", etc.
  4. Optional word transforms (see the stemming/lemmatization sketch after this list)
    1. stemming ("walks" -> "walk", "walking" -> "walk")
    2. lemmatization ("are" -> "be", "is" -> "be")
  5. Assemble a global document count (document frequency) for each word
  6. Form the vocabulary from words whose document frequency falls between 2 documents and 90% of all documents, dropping both rare and near-ubiquitous terms
  7. Build TF-IDF weights from the document counts
  8. Multiply each document's term counts by the TF-IDF weights and optionally normalize each vector to Euclidean length 1
  9. Plug the weighted counts into your favorite machine learning algorithm (a full pipeline sketch follows this list)
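
The tokenization options in step 2 map onto scikit-learn vectorizers. A minimal sketch, assuming a recent scikit-learn (get_feature_names_out requires version 1.0+); the parameter values are illustrative only:

```python
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer

docs = ["Hello world", "The brown fox"]

# Plain word tokens (roughly: sequences of non-whitespace characters)
words = CountVectorizer()
print(words.fit(docs).get_feature_names_out())

# Character bigrams: resistant to OCR errors and misspellings
char_ngrams = CountVectorizer(analyzer="char", ngram_range=(2, 2))
print(char_ngrams.fit(docs).get_feature_names_out())

# Word unigrams and bigrams
word_ngrams = CountVectorizer(ngram_range=(1, 2))
print(word_ngrams.fit(docs).get_feature_names_out())

# Word hashes modulo N (the "hashing trick"), here N = 2**10 feature slots
hashed = HashingVectorizer(n_features=2**10)
print(hashed.transform(docs).shape)  # (2, 1024)
```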
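
A minimal stemming/lemmatization sketch for step 4 using nltk (the WordNet lemmatizer needs a one-time data download and a part-of-speech hint to map "are"/"is" to "be"):

```python
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet")  # one-time download of the lemmatizer's data

stemmer = PorterStemmer()
print(stemmer.stem("walks"), stemmer.stem("walking"))  # walk walk

lemmatizer = WordNetLemmatizer()
# pos="v" tells WordNet to treat the token as a verb
print(lemmatizer.lemmatize("are", pos="v"), lemmatizer.lemmatize("is", pos="v"))  # be be
```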
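
Steps 3 and 5-8 collapse into a single scikit-learn TfidfVectorizer call. A minimal sketch; min_df/max_df mirror the 2-document and 90% cutoffs above, and the example documents are placeholders:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the brown fox jumped over the lazy dog",
    "the quick brown fox",
    "stock markets fell sharply",
    "markets rallied after the news",
]

vectorizer = TfidfVectorizer(
    stop_words="english",  # step 3: drop common stopwords
    min_df=2,              # step 6: keep words in at least 2 documents
    max_df=0.9,            # step 6: drop words in more than 90% of documents
    norm="l2",             # step 8: normalize each row to Euclidean length 1
)
X = vectorizer.fit_transform(docs)  # sparse TF-IDF matrix, ready for step 9
print(vectorizer.get_feature_names_out(), X.shape)
```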

Common Bag of Words Analytics

  1. K-means Clustering
  2. Classification
    1. Logistic regression
    2. SVM
    3. Naive Bayes
  3. Word Vectors - convert sparse word counts into a dense vector representing the "context" of each word (see the word2vec sketch below)
    1. Latent Semantic Analysis
    2. Word2Vec
    3. GloVe
  4. Topic Modeling - identifies multiple "topic vectors" that combine in different proportions to form each document in the corpus (see the LDA sketch below).
    1. Latent Dirichlet allocation
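
K-means and the listed classifiers run directly on the TF-IDF matrix from the previous section. A minimal sketch; the documents and class labels are hypothetical placeholders:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.naive_bayes import MultinomialNB

docs = [
    "the brown fox jumped",
    "a quick brown dog",
    "stock markets fell",
    "markets rallied today",
]
labels = [0, 0, 1, 1]  # hypothetical class labels for the classification example

X = TfidfVectorizer().fit_transform(docs)

# K-means clustering into 2 groups (works on the sparse matrix directly)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)

# Naive Bayes classification on the same features
clf = MultinomialNB().fit(X, labels)
print(clf.predict(X))
```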
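
A minimal word2vec sketch with Gensim (gensim 4.x API, where the dimensionality parameter is vector_size; a corpus this small will not produce meaningful vectors):

```python
from gensim.models import Word2Vec

# Each document is pre-tokenized into a list of words
sentences = [["the", "brown", "fox"], ["a", "quick", "brown", "dog"]]

# vector_size = dimensionality of the dense word vectors
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1)

print(model.wv["brown"])                       # dense vector for "brown"
print(model.wv.most_similar("brown", topn=2))  # nearest words by cosine similarity
```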
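
A minimal LDA sketch with scikit-learn. Note that LDA is usually fit on raw term counts rather than TF-IDF weights; the documents are placeholders:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the brown fox jumped",
    "a quick brown dog",
    "stock markets fell",
    "markets rallied today",
]

X = CountVectorizer().fit_transform(docs)  # raw term counts, not TF-IDF

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)  # per-document topic proportions
print(doc_topics)                  # each row sums (approximately) to 1
```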

Running Jupyter Notebook Examples

  1. Install Python (Python 3 recommended)
    1. on mac: brew install python3
  2. pip3 install jupyter
  3. Install any prerequisite packages (Ex. pip3 install scikit-learn)
  4. cd to the directory containing the *.ipynb files
  5. jupyter notebook
  6. This should open a web browser where you can launch each notebook
  7. Shift-Return runs each cell
  8. Ctrl-C from the command line shuts down the notebook server

Examples in this repo:

  1. K-means clustering of movie subtitles with scikit-learn. (link)
