
jefferyYu/Learning-Sentence-Embeddings-for-cross-domain-sentiment-classification


Author

Jianfei YU jfyu.2014@phdis.smu.edu.sg

Feb 22, 2017

Data and Code for:

Learning Sentence Embeddings with Auxiliary Tasks for Cross-Domain Sentiment Classification EMNLP 2016 https://aclweb.org/anthology/D/D16/D16-1023.pdf

Here is the link to the PyTorch implementation of our CNN-Aux model (evaluated on the Amazon product review datasets):

https://github.com/NUSTM/CNN-aux

Requirements

  1. Python 2.7
  2. hdf5 (h5py)
  3. Torch7

I. Data

  1. The data covers five domains: the first two are movie review datasets, and the remaining three are customer reviews of digital devices, laptops, and restaurants, respectively.

  2. The following is a summary of all the data:

Datasets           Sentences  Words
rt-polarity (MV1)      10662  18765
stsa.binary (MV2)       9613  16186
custrev (CR)            3770   5340
laptop (LT)             1907   2837
restaurant (RT)         1572   2930

  3. Each domain's data is under the folder "preprocess codes/data", and the data with part-of-speech tags is under the folder "preprocess codes/tag_data".

II. Evaluation

  • For evaluation, we treat every source-target domain pair as a task, which gives 18 tasks (the two pairs between MV1 and MV2 are excluded, since both are movie domains).

  • In our experiments, we use a transductive learning setting. That is, we use the unlabeled data from both the development set and the test set in the target domain to train our model, and we tune hyperparameters only on the target-domain development set.

  • Example: assume the task is MV2->LT. During the training stage, we use labeled data from MV2 (i.e., stsa.binary.txt) and unlabeled data from LT (i.e., laptop.txtdev and laptop.txttest). Note that we cannot use any labels from LT during training. During the test stage, we tune the trained model on the development set (i.e., laptop.txtdev) and apply the best-tuned model to the test set (i.e., laptop.txttest).
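The task setup above can be sketched in a few lines: enumerate all ordered source-target pairs over the five domains and drop the two pairs where both domains are movie datasets. The domain abbreviations below mirror the table in Section I; this is an illustration of the counting, not code from the repository.

```python
from itertools import permutations

# Domain labels matching the five datasets listed in Section I.
DOMAINS = ["MV1", "MV2", "CR", "LT", "RT"]
MOVIE = {"MV1", "MV2"}

# All ordered source->target pairs, excluding the two pairs where
# both source and target are movie domains (MV1<->MV2).
tasks = [(s, t) for s, t in permutations(DOMAINS, 2)
         if not (s in MOVIE and t in MOVIE)]

print(len(tasks))  # 18
```

Five domains give 5 x 4 = 20 ordered pairs; removing MV1->MV2 and MV2->MV1 leaves the 18 evaluation tasks.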

III. Code

Part1: Pre-process code (under the folder "preprocess codes"):

Step 1. Obtain Pivot Word List
python IG2.py
Step 2. The pivot word lists of the 18 source-target pairs will be written to the folder "pivotlist"; then you can run:
python preprocess_dasa_mt_2_2.py
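The script name IG2.py suggests that pivot words are scored by information gain with respect to the sentiment label. As a hedged illustration of that idea (not the repository's actual implementation), the sketch below computes the information gain of a word's presence/absence over a toy labeled corpus:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (bits) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(docs, labels, word):
    """IG of observing `word` (present vs. absent) w.r.t. the labels."""
    n = len(docs)
    with_w = [y for d, y in zip(docs, labels) if word in d]
    without = [y for d, y in zip(docs, labels) if word not in d]
    conditional = sum(len(part) / n * entropy(part)
                      for part in (with_w, without) if part)
    return entropy(labels) - conditional

# Toy tokenized sentences; labels: 1 = positive, 0 = negative.
docs = [{"great", "movie"}, {"great", "plot"},
        {"boring", "movie"}, {"boring", "plot"}]
labels = [1, 1, 0, 0]

print(information_gain(docs, labels, "great"))  # 1.0 (perfectly separates labels)
print(information_gain(docs, labels, "movie"))  # 0.0 (uninformative)
```

Ranking all words by this score and keeping the top ones per source-target pair is one standard way to build a pivot word list; the repository's exact scoring may differ.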

Part2: Model Code:

  • To run the baseline (NaiveNN in our paper), you can just run:
sh run1.sh
  • To run our Joint model (Joint in our paper), you can just run:
sh run2.sh

Part3: Results:

  • Our running results are shown in the two files "NaiveNNresult.txt" and "Jointresult.txt"; they differ slightly from the results reported in our paper.
  • The reason is that in our earlier experiments we used random seeds for the NaiveNN and Joint models. For a fair comparison, we now fix the seed in both models to the same value (1), so that all parameters shared between the two models are initialized identically.

Training details

  • Note that we employ an alternating optimization approach to train our Joint model. In each epoch, we first use labeled source-domain data to optimize all model parameters (one sentence, one true label, two auxiliary labels), and then switch to unlabeled target-domain data to optimize only the parameters corresponding to the auxiliary tasks (one sentence, two auxiliary labels).
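The alternating schedule above can be sketched as follows. The update functions are stand-ins that just record which step ran; the actual model is the repository's Torch7 CNN and is not reproduced here.

```python
def train_joint(source_batches, target_batches, epochs):
    """Sketch of the alternating optimization schedule: per epoch,
    a full-parameter pass over labeled source data, then an
    auxiliary-only pass over unlabeled target data."""
    schedule = []
    for epoch in range(epochs):
        # Step 1: labeled source data updates ALL parameters
        # (sentence + true sentiment label + two auxiliary labels).
        for batch in source_batches:
            schedule.append(("source_full_update", epoch, batch))
        # Step 2: unlabeled target data updates ONLY the parameters
        # of the auxiliary tasks (sentence + two auxiliary labels).
        for batch in target_batches:
            schedule.append(("target_aux_update", epoch, batch))
    return schedule

schedule = train_joint(["s0", "s1"], ["t0"], epochs=2)
print(schedule)
```

The point of the alternation is that the auxiliary tasks need no sentiment labels, so target-domain data can shape the shared sentence representation without ever exposing target labels to the model.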

Acknowledgements

Licence:

Singapore Management University
