Tutorial: Using DAML to Label the Sentiment of VMware Reddit and Twitter Comments
In this tutorial, we create a data annotation project to label the movie reviews from the sample IMDB dataset into two sentiment categories (positive, negative). Specifically:
- Download sample dataset
- Create a DAML project
- Assign the labeling task to annotators
- Label 50 data points
- Track and monitor labeling progress
- Export and share the labeled dataset
- [Optional] Train a text classification model to predict sentiment for new comments
Our dataset consists of 500 text strings with the following label distribution:
- positive: 250
- negative: 250
Each piece of text is longer than 50 characters and has had basic preprocessing applied to remove ".com" URLs.
Whether the labels we collect match this distribution will depend on how annotators interpret the samples, but it gives us a good baseline to evaluate against.
You can download the imdb-500.csv dataset from the docs folder.
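Once downloaded, the dataset can be sanity-checked in a few lines. Below is a minimal sketch in pandas; the "Review" text column matches the project-setup step later in this tutorial, while the "Label" column and the exact URL-stripping regex are assumptions about the sample file:

```python
import re
import pandas as pd

# Regex for the basic ".com" URL preprocessing described above (an assumption
# about how the sample file was cleaned).
URL_RE = re.compile(r"\S*\.com\S*")

def clean_reviews(df: pd.DataFrame) -> pd.DataFrame:
    # Strip ".com" URLs from the text column and trim surrounding whitespace.
    out = df.copy()
    out["Review"] = (
        out["Review"].str.replace(URL_RE, "", regex=True).str.strip()
    )
    return out

def check_dataset(df: pd.DataFrame) -> None:
    # 500 rows, balanced 250/250, every review longer than 50 characters.
    assert len(df) == 500
    assert (df["Review"].str.len() > 50).all()
    if "Label" in df.columns:  # a ground-truth column is an assumption
        counts = df["Label"].value_counts().to_dict()
        assert counts == {"positive": 250, "negative": 250}

# Usage:
# df = clean_reviews(pd.read_csv("imdb-500.csv"))
# check_dataset(df)
```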
Now that we have the dataset, let's create a new project in DAML. Log in at yourservice.com and navigate to the Projects page. Click Create New Annotation Project.
Complete the following:
- Project Name: My Sentiment Data
- Task Instructions:
- "Review each comment carefully and determine the overall sentiment in one of two categories:
- Positive: the comment indicates a positive opinion of the movie
- Negative: the comment indicates a negative opinion of the movie
- Skip tickets if you're unsure"
- CSV Upload:
- Click Upload CSV and choose imdb-500.csv
- Enter a name, select Has Header and click OK
- Upload may take up to 30 seconds
- The Step 1: Preview CSV tab appears, where you can see some example texts
- Step 2: Set Data:
- Select "No Labels" from the Labels drop-down
- Select "Review" from the checkbox
- Click "SET DATA"
- Categories:
- Positive
- Negative
- Question: "Choose the sentiment of the text string"
- Assignment logic: Choose Sequential
- Annotators:
- Add your own email *@vmware.com
- Add anyone on your team
- Add someperson@vmware.com
Now you (and anyone you assigned to the project) are ready to start annotating! Navigate to the Annotate tab and click Start for your project.
The annotation interface presents the following:
- Annotation details for each project
- A toggle to switch between projects
- Progress bar for annotators to track their own progress
- Full history of labelled examples (you can click on them to return to a previous ticket)

A single click will record the label. You have the option to skip and return to a previous ticket (you can also do this by clicking on a ticket under Progress). Label 50 tickets before going to the next step.
On the Projects tab, you can perform these actions:
- Edit the project details:
- Rename the project
- Add more annotators
- Change the assignment logic
- Download the project data:
- This will generate a CSV with the current labelled data
- Share the project data:
- A key objective of DAML is to promote the sharing of high-quality datasets. Using this option, you allow other teams to download your annotation project from the Community Datasets tab
- Delete the project:
- This is not reversible and requires confirmation
DAML aggregates all labels in real time. If you have multiple assigned annotators, you can view project progress by clicking on the project name under the Projects tab.

You'll see project details, # annotations per user, # annotations per category as well as a table view of all labels.
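These roll-ups amount to simple group-bys over the raw annotation records. A sketch of the same aggregation in pandas, using hypothetical column names (DAML's actual schema may differ):

```python
import pandas as pd

# Hypothetical raw annotation records: one row per (annotator, ticket) label.
annotations = pd.DataFrame({
    "annotator": ["a@vmware.com", "b@vmware.com", "a@vmware.com", "b@vmware.com"],
    "ticket_id": [1, 1, 2, 2],
    "label":     ["positive", "positive", "negative", "positive"],
})

# Annotations per user and per category, as shown on the progress page.
per_user = annotations.groupby("annotator").size()
per_category = annotations.groupby("label").size()

# Table view of all labels: one row per ticket, one count column per category.
label_table = annotations.pivot_table(
    index="ticket_id", columns="label",
    values="annotator", aggfunc="count", fill_value=0,
)
```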
When your annotation project is complete or when you're ready to export labelled data for experimentation, you can generate a file for download in the Projects tab. A notification will be sent to your email for large datasets.
- The exported CSV includes the project columns and adds a new column for each category containing its tabulated counts
- These tabulated counts can be used to measure annotator disagreement or as probabilistic labels
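As a sketch of that second use, the per-category count columns can be normalized into probabilistic (soft) labels and a simple disagreement score. The "Positive" and "Negative" column names below mirror the categories in this tutorial but are an assumption about the export schema:

```python
import pandas as pd

# Hypothetical exported rows with per-category annotation counts.
export = pd.DataFrame({
    "Review":   ["loved it", "awful", "mixed feelings"],
    "Positive": [3, 0, 2],
    "Negative": [0, 3, 1],
})

counts = export[["Positive", "Negative"]]
total = counts.sum(axis=1)

# Probabilistic labels: normalize each row's counts to a distribution.
export["p_positive"] = export["Positive"] / total
export["p_negative"] = export["Negative"] / total

# Disagreement: 0 when annotators are unanimous, 0.5 at an even split.
export["disagreement"] = 1 - counts.max(axis=1) / total
```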