Tutorial: Using DAML to Label the Sentiment of VMware Reddit and Twitter Comments
In this tutorial, we create a data annotation project to label the movie reviews from the sample IMDB dataset into two sentiment categories (positive, negative). Specifically:
- Download sample dataset
- Create a DAML project
- Assign the labeling task to annotators
- Label 50 data points
- Track and monitor labeling progress
- Export and share the labeled dataset
- [Optional] Train a text classification model to predict sentiment for new comments
Our dataset consists of 500 text strings with the following label distribution:
- positive: 250
- negative: 250
Each piece of text is longer than 50 characters and has had basic preprocessing applied to remove ".com" URLs.
Whether the labels we collect match this distribution will depend on how annotators interpret the samples, but it gives us a good baseline to evaluate against.
You can download the imdb-500.csv dataset from the docs folder.
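Once downloaded, the dataset can be sanity-checked in a few lines. Below is a minimal sketch in pandas; the "Review" text column matches the project-setup step later in this tutorial, while the "Label" column and the exact URL-stripping regex are assumptions about the sample file:

```python
import re
import pandas as pd

# Regex for the basic ".com" URL preprocessing described above (an assumption
# about how the sample file was cleaned).
URL_RE = re.compile(r"\S*\.com\S*")

def clean_reviews(df: pd.DataFrame) -> pd.DataFrame:
    # Strip ".com" URLs from the text column and trim surrounding whitespace.
    out = df.copy()
    out["Review"] = (
        out["Review"].str.replace(URL_RE, "", regex=True).str.strip()
    )
    return out

def check_dataset(df: pd.DataFrame) -> None:
    # 500 rows, balanced 250/250, every review longer than 50 characters.
    assert len(df) == 500
    assert (df["Review"].str.len() > 50).all()
    if "Label" in df.columns:  # a ground-truth column is an assumption
        counts = df["Label"].value_counts().to_dict()
        assert counts == {"positive": 250, "negative": 250}

# Usage:
# df = clean_reviews(pd.read_csv("imdb-500.csv"))
# check_dataset(df)
```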
Now that we have the dataset, let's create a new project in DAML. Log in at yourservice.com and navigate to the Projects page. Click Create New Annotation Project.
Complete the following:
- Project Name: My Sentiment Data
- Task Instructions:
- "Review each comment carefully and determine the overall sentiment in one of two categories:
- Positive: the comment indicates a positive opinion of the movie
- Negative: the comment indicates a negative opinion of the movie
- Skip tickets if you're unsure"
- CSV Upload:
- Click Upload CSV and choose imdb-500.csv
- Enter a name, select Has Header and click OK
- Upload may take up to 30 seconds
- The Step 1: Preview CSV tab appears, where you can see some example texts
- Step 2: Set Data:
- Select "No Labels" from the Labels drop-down
- Select "Review" from the checkbox
- Click "SET DATA"
- Categories:
- Positive
- Negative
- Question: "Choose the sentiment of the text string"
- Assignment logic: Choose Sequential
- Annotators:
- Add your own email *@vmware.com
- Add anyone on your team
- Add someperson@vmware.com
Now you (and anyone you assigned to the project) are ready to start annotating! Navigate to the Annotate tab and click Start for your project.
The annotation interface presents the following:
- Annotation details for each project
- A toggle to switch between projects
- Progress bar for annotators to track their own progress
- Full history of labelled examples (you can click on them to return to a previous ticket)

A single click will record the label. You have the option to skip and return to a previous ticket (you can also do this by clicking on a ticket under Progress). Label 50 tickets before going to the next step.
On the Projects tab, you can perform these actions:
- Edit the project details:
- Rename the project
- Add more annotators
- Change the assignment logic
- Download the project data:
- This will generate a CSV with the current labelled data
- Share the project data:
- A key objective of DAML is to promote the sharing of high-quality datasets. Using this option, you allow other teams to download your annotation project from the Community Datasets tab
- Delete the project:
- This is not reversible and requires confirmation
DAML aggregates all labels in real time. If you have multiple assigned annotators, you can view project progress by clicking on the project name under the Projects tab.

You'll see project details, # annotations per user, # annotations per category as well as a table view of all labels.
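These roll-ups amount to simple group-bys over the raw annotation records. A sketch of the same aggregation in pandas, using hypothetical column names (DAML's actual schema may differ):

```python
import pandas as pd

# Hypothetical raw annotation records: one row per (annotator, ticket) label.
annotations = pd.DataFrame({
    "annotator": ["a@vmware.com", "b@vmware.com", "a@vmware.com", "b@vmware.com"],
    "ticket_id": [1, 1, 2, 2],
    "label":     ["positive", "positive", "negative", "positive"],
})

# Annotations per user and per category, as shown on the progress page.
per_user = annotations.groupby("annotator").size()
per_category = annotations.groupby("label").size()

# Table view of all labels: one row per ticket, one count column per category.
label_table = annotations.pivot_table(
    index="ticket_id", columns="label",
    values="annotator", aggfunc="count", fill_value=0,
)
```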
When your annotation project is complete or when you're ready to export labelled data for experimentation, you can generate a file for download in the Projects tab. A notification will be sent to your email for large datasets.
- The exported CSV includes the project columns and adds a new column for each category containing its tabulated counts
- These tabulated counts can be used to measure annotator disagreement or as probabilistic labels
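As a sketch of that second use, the per-category count columns can be normalized into probabilistic (soft) labels and a simple disagreement score. The "Positive" and "Negative" column names below mirror the categories in this tutorial but are an assumption about the export schema:

```python
import pandas as pd

# Hypothetical exported rows with per-category annotation counts.
export = pd.DataFrame({
    "Review":   ["loved it", "awful", "mixed feelings"],
    "Positive": [3, 0, 2],
    "Negative": [0, 3, 1],
})

counts = export[["Positive", "Negative"]]
total = counts.sum(axis=1)

# Probabilistic labels: normalize each row's counts to a distribution.
export["p_positive"] = export["Positive"] / total
export["p_negative"] = export["Negative"] / total

# Disagreement: 0 when annotators are unanimous, 0.5 at an even split.
export["disagreement"] = 1 - counts.max(axis=1) / total
```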