jisaacso/learnit

To run, make sure you have Python (2.7), virtualenv, and pip
in your PATH:

./run.sh <port>

NOTE: the first time it is run, this will create a virtualenv and
install dependencies. This may take a few minutes and requires an
internet connection.

Brief discussion:

I have built a web app for uploading a TSV dataset of the format
<label>\t<textData>
and training a machine learning model on it. Once the model is built, a user may enter test sentences to try the model on. The model supports arbitrary multiclass classification but is limited to datasets that fit into the server's memory.
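
For illustration, a minimal sketch of how a dataset in this format could be read (the function and file names here are hypothetical, not part of the app's API):

```python
import csv

def load_tsv(path):
    """Read a label<TAB>text file into parallel label/text lists."""
    labels, texts = [], []
    with open(path) as f:
        for row in csv.reader(f, delimiter='\t'):
            if len(row) < 2:
                continue  # skip malformed lines
            labels.append(row[0])
            texts.append(row[1])
    return labels, texts

labels, texts = load_tsv('reviews.tsv')  # hypothetical file name
```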

The chosen model feeds TF-IDF vectors into a shallow neural network. The network has one dense, fully connected hidden layer between the TF-IDF input and a softmax output layer.
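
A minimal Keras sketch of that architecture (hidden-layer size, dropout, and optimizer are illustrative assumptions, not the exact values used here):

```python
from keras.models import Sequential
from keras.layers import Dense, Dropout

def build_model(input_dim, num_classes, hidden_units=512):
    """One dense, fully connected hidden layer feeding a softmax output."""
    model = Sequential()
    model.add(Dense(hidden_units, input_dim=input_dim, activation='relu'))
    model.add(Dropout(0.5))  # assumed regularization; actual value may differ
    model.add(Dense(num_classes, activation='softmax'))
    model.compile(loss='categorical_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])
    return model
```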

The model was built to perform well on a movie review dataset. It was validated in two ways. First, evaluation on a held-out set, described below. Second, validation against MetaMind's API: I made an account on metamind.io, signed up for an API key, trained a model on the same dataset, and compared performance on a (very small) hand-generated dataset. The two performed comparably, and the comparison code is provided.

I initially looked into L2-regularized logistic regression on TF-IDF features. Unfortunately, cross-validated performance was poor (average F1 around .75). Also, due to the small training set, features such as `and` had strong predictive power for the positive class. Since this is nonsense, I looked into ways of reducing the input feature space.
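
A sketch of that baseline with scikit-learn (the regularization strength C is an assumed default; `texts` and `labels` are as loaded above):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.cross_validation import cross_val_score  # sklearn.model_selection in newer versions

baseline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', LogisticRegression(penalty='l2', C=1.0)),  # C=1.0 is an assumption
])

# weighted F1 averaged over 5 folds
scores = cross_val_score(baseline, texts, labels, cv=5, scoring='f1_weighted')
print(scores.mean())
```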

First, I evaluated TruncatedSVD for fast dimensionality reduction of TF-IDF. Predictive performance stayed approximately constant until the number of SVD dimensions dropped below 200.
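
A sketch of that reduction step, assuming scikit-learn's TruncatedSVD (200 components echoes the threshold mentioned above):

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(texts)    # sparse (n_docs, n_terms)

svd = TruncatedSVD(n_components=200)    # performance degraded below ~200 dims
X_reduced = svd.fit_transform(X_tfidf)  # dense (n_docs, 200)
```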

Second, I looked at using Word2Vec embeddings as a feature representation. A Word2Vec model was built, bootstrapping off the pretrained Google News vectors. Unfortunately, while these embeddings are dense and small (300 dimensions), it is difficult to represent a whole document as an operation over single-word embeddings. I tried both max and average pooling across each Word2Vec dimension without much difference in accuracy; F1 remained around .75.
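
A sketch of the pooling idea, assuming gensim and the Google News binary (the file path and whitespace tokenization are assumptions):

```python
import numpy as np
from gensim.models import KeyedVectors  # older gensim: gensim.models.Word2Vec

w2v = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin',
                                        binary=True)

def doc_vector(text, pool='mean'):
    """Pool per-word 300-d embeddings into a single document vector."""
    vecs = [w2v[w] for w in text.lower().split() if w in w2v]
    if not vecs:
        return np.zeros(300)
    stacked = np.vstack(vecs)
    return stacked.max(axis=0) if pool == 'max' else stacked.mean(axis=0)
```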

Finally, I moved to the Keras library and fed the raw TF-IDF vectors directly into a shallow neural network. This gave the highest accuracy, with cross-validation misclassification around 5% and held-out validation misclassification around 10% (80/20 split). This convinced me to stick with Keras.
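
A sketch of the 80/20 evaluation, reusing the build_model and TF-IDF sketches above (epoch and batch-size settings are assumptions):

```python
from sklearn.cross_validation import train_test_split  # sklearn.model_selection in newer versions
from keras.utils.np_utils import to_categorical

X = tfidf.fit_transform(texts).toarray()  # dense TF-IDF matrix
class_names = sorted(set(labels))
y = to_categorical([class_names.index(l) for l in labels])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = build_model(input_dim=X.shape[1], num_classes=y.shape[1])
model.fit(X_train, y_train, nb_epoch=10, batch_size=32)  # `epochs` in Keras 2
loss, acc = model.evaluate(X_test, y_test)
print('misclassification: %.3f' % (1.0 - acc))
```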

There are many areas for improvement here. To name a few:
* numerics
Let's get smarter about the choice of both model and model parameters. It would be ideal to choose a quick-to-train, stupid model initially and only move to more sophisticated models if cross-validated F1 performance is weak. It is also worthwhile to grid-search the model parameters (dropout percentage, number of network units, etc.); this would likely require larger hardware and parallelized model training, which is not difficult but is time-consuming. Finally, it would be cool to revisit Word2Vec features but use a CNN input layer to represent documents, which should capture more granularity than simple averaging.

* code quality
This library deserves better testing (à la tox).

* backend support (hbase model persistence)
Right now, every time the server shuts down the single model is destroyed. This is sad and unnecessary. With a little more time, it would not be difficult to save every uploaded model into a persistent data store. When the current Keras model saves its parameters, they amount to about 10MB on disk. This is not large, but big enough that some thought should be put into storage and access. I would start with HBase, with keys comprising a userid and model hash and values containing the model parameters; see the sketch after this list.

* streaming datasets (large data)
This model can only support datasets that fit in memory on one machine. This is pretty insane for a generalized learner. It would be better to stream the dataset into persistent storage (again HBase) and build a feature extraction method that uses Spark to extract feature vectors via map/reduce (sketched after this list). Depending on the problem, these features will either be sparse (likely able to fit in RAM on one machine) or dense (requiring more sophisticated parallelized matrix operations).
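
For the HBase persistence idea above, a minimal sketch assuming the happybase client and a pre-created `models` table (host, table name, column family, and key scheme are all assumptions):

```python
import hashlib
import happybase

def save_model(connection, user_id, model):
    """Persist Keras architecture + weights under a userid:modelhash row key."""
    arch_json = model.to_json()
    model.save_weights('/tmp/weights.h5', overwrite=True)
    with open('/tmp/weights.h5', 'rb') as f:
        weights_bytes = f.read()
    model_hash = hashlib.sha1(weights_bytes).hexdigest()
    row_key = '%s:%s' % (user_id, model_hash)
    connection.table('models').put(row_key, {'model:architecture': arch_json,
                                             'model:weights': weights_bytes})
    return row_key

connection = happybase.Connection('hbase-host')  # hypothetical host
```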
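
And for the streaming/large-data idea, a sketch of distributed TF-IDF extraction with PySpark (input path and pipeline choices are assumptions, replacing the in-memory vectorizer):

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer, HashingTF, IDF

spark = SparkSession.builder.appName('learnit-features').getOrCreate()

# label \t text rows, e.g. exported from HBase; the path is hypothetical
df = spark.read.csv('hdfs:///datasets/reviews.tsv', sep='\t').toDF('label', 'text')

tokens = Tokenizer(inputCol='text', outputCol='words').transform(df)
tf = HashingTF(inputCol='words', outputCol='tf').transform(tokens)
features = IDF(inputCol='tf', outputCol='features').fit(tf).transform(tf)
```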

About

Data science in the cloud. DSaaS.