Database Design

The database has 6 collections: users, labels, articles, heatmaps, jobs, results.

Articles

Stores the articles used for labelling and for training the AI model. The collection remains read-only for most of its lifetime. The field descriptions below are only guidelines; it is best to check Reddit's documentation for this particular object.

_id: Mongo's unique ID for this article, used for most of the labeling-side API calls.
title: Title of the article as it is seen on Reddit.
subreddit: The subreddit this article belongs to.
name: The type of the object followed by Reddit's unique ID for it, i.e. the article type prefix plus the Reddit ID (for example 76rjtv).
upvotes: The number of upvotes of this article.
downvotes: This one is a mystery to me.
score: Usually same as upvotes.
locked: Whether a moderator has locked the article from more comments.
num_comments: Total comments of the article.
url: If the post is a link post, it has a URL. This is usually the news article the post refers to.
created_utc: UTC time when the article was created.
last_modified: Date last modified, format unknown.
article_id: The ID Reddit assigned to the article. Used for AI-side API calls.
archived: Whether the post is old enough to be archived, and therefore locked in time.
author_fullname: Same as name field, but for the author
author: Human readable name of the author.
comments: Array of comment subtypes.
targets: Array of strings representing the possible targets of interest in the article. This is a required field and is added in by the reddit.py script.

Comment Subtype

comment: The actual text of the comment.
created_utc: UTC time when the comment was created.
distinguished: A mystery to me.
edited: Whether the comment has been edited during its lifetime.
comment_id: Reddit ID of the comment.
score: Upvotes minus downvotes.
sticked: Whether a moderator has stickied this comment to the top of the thread.
author_fullname: Similar to author_fullname in article.
author: Author of the comment as seen on Reddit.
replies: Recursive array of more comment subtypes.
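Because replies nests further comment subtypes, the comments of an article form a forest. A minimal sketch of walking that forest depth-first is shown here; the helper name walk_comments and the sample data are hypothetical, and a missing replies field is assumed to mean "no replies".

```python
from typing import Any, Dict, Iterator, List, Tuple

Comment = Dict[str, Any]


def walk_comments(comments: List[Comment], depth: int = 0) -> Iterator[Tuple[int, Comment]]:
    """Yield (depth, comment) pairs depth-first over a forest of comments."""
    for comment in comments:
        yield depth, comment
        # "replies" holds further comment subtypes; treat a missing field as empty.
        yield from walk_comments(comment.get("replies", []), depth + 1)


# Hypothetical sample data using the field names listed above.
sample_comments = [
    {"comment": "Top-level A", "score": 10, "replies": [
        {"comment": "Reply to A", "score": 3, "replies": []},
    ]},
    {"comment": "Top-level B", "score": 7, "replies": []},
]

flattened = [(depth, c["comment"]) for depth, c in walk_comments(sample_comments)]
```

The depth value is only there to make the nesting visible; callers that just need every comment can ignore it.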

Heatmaps

A cache that helps pick the comments with the highest upvotes as the ones to be labelled by volunteers. Heatmaps can be generated by reddit.py using the 'heatmap' command.

_id: Mongo's unique id for this heatmap.
article: Mongo id of the article this heatmap caches.
heatmap: A tuple of the comment_address and the score of that comment. Note: comment_address is a comma-separated list of array indices that gives the 'address' of that comment within the forest of comments.
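The address scheme can be sketched as follows. This is not the code in reddit.py; it is an illustration that assumes the first index selects a top-level comment, each later index selects an entry in the previous comment's replies array, indices are joined with a bare comma, and the cache is ordered by descending score. The function names are hypothetical.

```python
from typing import Any, Dict, List, Tuple

Comment = Dict[str, Any]


def build_heatmap(comments: List[Comment], prefix: Tuple[int, ...] = ()) -> List[Tuple[str, int]]:
    """Collect (comment_address, score) pairs for every comment in the forest."""
    entries: List[Tuple[str, int]] = []
    for index, comment in enumerate(comments):
        path = prefix + (index,)
        entries.append((",".join(str(i) for i in path), comment["score"]))
        entries.extend(build_heatmap(comment.get("replies", []), path))
    return entries


def sorted_heatmap(comments: List[Comment]) -> List[Tuple[str, int]]:
    """Order entries by descending score (an assumed ordering)."""
    return sorted(build_heatmap(comments), key=lambda entry: entry[1], reverse=True)


def resolve_address(comments: List[Comment], address: str) -> Comment:
    """Follow a comma-separated address down through the replies arrays."""
    indices = [int(part) for part in address.split(",")]
    node = comments[indices[0]]
    for index in indices[1:]:
        node = node["replies"][index]
    return node


# Hypothetical sample forest.
forest = [
    {"comment": "A", "score": 5, "replies": [
        {"comment": "A1", "score": 12, "replies": []},
    ]},
    {"comment": "B", "score": 8, "replies": []},
]

heatmap = sorted_heatmap(forest)
top_comment = resolve_address(forest, heatmap[0][0])
```

Storing only the address and score keeps the cache small, since the comment text can always be looked up again from the article document.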

Users

Volunteers who help label the comments in the system.

_id: Mongo's unique id for this user.
username: The username of this user.
password: The bcrypt hash of this user's password.
role: Either admin or user. Admin-created accounts have endorsed set to true.
endorsed: A flag that makes it easy to find and delete bogus labels.
assignedArticles: Array of the mongo id of articles assigned to this user for labeling.
checkpoint: Represents the save point of the user. The first number is the index into assignedArticles, and the second number is the index into that article's heatmap.
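One way to picture the checkpoint is as a cursor over (assigned article, heatmap entry) pairs. The sketch below is an assumption about how such a cursor could advance, not the project's actual logic; advance_checkpoint and heatmap_lengths are hypothetical names, and both indices are assumed to be zero-based.

```python
from typing import List, Optional, Tuple


def advance_checkpoint(
    checkpoint: Tuple[int, int],
    heatmap_lengths: List[int],
) -> Optional[Tuple[int, int]]:
    """Move to the next heatmap entry, rolling over to the next assigned article.

    heatmap_lengths[i] is the number of heatmap entries for assignedArticles[i].
    Returns None once every assigned article has been exhausted.
    """
    article_index, heatmap_index = checkpoint
    heatmap_index += 1
    # Skip past finished (or empty) heatmaps.
    while article_index < len(heatmap_lengths) and heatmap_index >= heatmap_lengths[article_index]:
        article_index += 1
        heatmap_index = 0
    if article_index >= len(heatmap_lengths):
        return None
    return (article_index, heatmap_index)


# Two assigned articles with 2 and 1 heatmap entries respectively.
step_one = advance_checkpoint((0, 0), [2, 1])
step_two = advance_checkpoint((0, 1), [2, 1])
step_three = advance_checkpoint((1, 0), [2, 1])
```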

Labels

The labels made by the volunteers.

_id: Mongo's unique id for this label.
labeller: The name of the user who made this label. Useful for inter-rater reliability tests.
article_id: Article that this label refers to.
comment_address: Address of the comment in the forest of the article the label is referring to.
label: The stance of the comment toward the target, according to the labeller.
target: The target of interest of this label.
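Since each label records its labeller, labels for the same comment and target can be compared across volunteers. The sketch below computes raw pairwise percent agreement, which is a simpler measure than chance-corrected statistics such as Cohen's kappa. The function name and the label values ("favor", "against") are hypothetical; the document does not list the allowed label values.

```python
from collections import defaultdict
from itertools import combinations
from typing import Any, Dict, List, Optional


def percent_agreement(labels: List[Dict[str, Any]]) -> Optional[float]:
    """Fraction of labeller pairs that gave the same label to the same item.

    An item is identified by (article_id, comment_address, target).
    Returns None when no item has been labelled by at least two users.
    """
    by_item: Dict[tuple, Dict[str, str]] = defaultdict(dict)
    for doc in labels:
        key = (doc["article_id"], doc["comment_address"], doc["target"])
        by_item[key][doc["labeller"]] = doc["label"]

    agreeing = 0
    total = 0
    for votes in by_item.values():
        for first, second in combinations(sorted(votes), 2):
            total += 1
            if votes[first] == votes[second]:
                agreeing += 1
    return agreeing / total if total else None


# Hypothetical label documents using the fields listed above.
sample_labels = [
    {"labeller": "alice", "article_id": "76rjtv", "comment_address": "0", "target": "X", "label": "favor"},
    {"labeller": "bob", "article_id": "76rjtv", "comment_address": "0", "target": "X", "label": "favor"},
    {"labeller": "alice", "article_id": "76rjtv", "comment_address": "0,1", "target": "X", "label": "against"},
    {"labeller": "bob", "article_id": "76rjtv", "comment_address": "0,1", "target": "X", "label": "favor"},
]

agreement = percent_agreement(sample_labels)
```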

Jobs

A job queue that worker.py polls for analysis requests.

article: The reddit id of the article requested for analysis.
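A polling loop over this collection might look like the sketch below. It is not taken from worker.py: it assumes a job is claimed by removing it from the collection (pymongo's Collection.find_one_and_delete does this atomically), and the names claim_next_job, poll_jobs, and InMemoryJobs are hypothetical. InMemoryJobs is only a stand-in so the example runs without a database.

```python
import time
from typing import Any, Callable, Dict, List, Optional


class InMemoryJobs:
    """Stand-in for a pymongo collection, exposing only find_one_and_delete."""

    def __init__(self, docs: List[Dict[str, Any]]) -> None:
        self._docs = list(docs)

    def find_one_and_delete(self, filter: Dict[str, Any]) -> Optional[Dict[str, Any]]:
        # The filter is ignored here; a real collection would apply it.
        return self._docs.pop(0) if self._docs else None


def claim_next_job(jobs: Any) -> Optional[Dict[str, Any]]:
    """Remove and return one pending job, or None if the queue is empty."""
    return jobs.find_one_and_delete({})


def poll_jobs(
    jobs: Any,
    handle: Callable[[str], None],
    interval_seconds: float = 5.0,
    max_polls: Optional[int] = None,
) -> int:
    """Poll the queue, passing each job's Reddit article id to handle."""
    processed = 0
    polls = 0
    while max_polls is None or polls < max_polls:
        polls += 1
        job = claim_next_job(jobs)
        if job is None:
            time.sleep(interval_seconds)  # Queue is empty; wait before retrying.
            continue
        handle(job["article"])
        processed += 1
    return processed


seen: List[str] = []
processed = poll_jobs(InMemoryJobs([{"article": "76rjtv"}]), seen.append,
                      interval_seconds=0, max_polls=3)
```

Claiming by deletion means a job is lost if the worker crashes mid-analysis; a status field on the job document is a common alternative when that matters.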

Results

Results produced by worker.py for the requests found in the jobs collection.
