
Database Design

GinkREAL edited this page Mar 17, 2019 · 2 revisions

The database has 6 collections: users, labels, articles, heatmaps, jobs, results.


Articles

Stores the articles used for labelling and for training the AI model. The collection remains read-only for most of its lifetime. The field descriptions below are only guidelines; Reddit's documentation for this object is the authoritative reference.

  • _id: Mongo's unique ID for this article, used for most of the labeling-side API calls.
  • title: Title of the article as it is seen on reddit.
  • subreddit: The subreddit this article belongs to.
  • name: The type of the 'object' (article) followed by Reddit's unique ID for it, e.g. 76rjtv.
  • upvotes: The number of upvotes of this article.
  • downvotes: This one is a mystery to me.
  • score: Usually same as upvotes.
  • locked: Whether a moderator has locked the article from more comments.
  • num_comments: Total comments of the article.
  • url: If the post is a link post, the URL it links to. Usually the news article the post is referring to.
  • created_utc: UTC time at which the article was created.
  • last_modified: Date last modified, format unknown.
  • article_id: The ID Reddit assigned to the article. Used for AI-side API calls.
  • archived: Whether the post is old enough to be archived, and therefore locked in time.
  • author_fullname: Same as the name field, but for the author.
  • author: Human readable name of the author.
  • comments: Array of comment subtypes.
  • targets: Array of strings representing the possible targets of interest in the article. This is a required field and is added in by the reddit.py script.

Comment Subtype

  • comment: The actual text of the comment.
  • created_utc: The UTC time at which the comment was created.
  • distinguished: A mystery to me.
  • edited: Whether the comment has been edited during its lifetime.
  • comment_id: Reddit ID of the comment.
  • score: Upvotes minus downvotes.
  • sticked: Whether a moderator has stickied this comment to the top of the thread.
  • author_fullname: Similar to author_fullname in article.
  • author: Author of comment as seen on reddit.
  • replies: Recursive array of more comment subtypes.
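
To make the nesting concrete, here is a minimal sketch of a trimmed-down article document as a Python dict, with a recursive walk over its comments. The sample values and the count_comments helper are invented for illustration; only the field names come from the lists above.

```python
# Hypothetical, trimmed-down article document. The values are invented;
# only the field names follow the schema described above.
article = {
    "title": "Example headline",
    "article_id": "76rjtv",
    "targets": ["example target"],
    "comments": [
        {
            "comment": "Top-level comment",
            "comment_id": "c1",
            "score": 12,
            "replies": [
                {"comment": "A reply", "comment_id": "c2", "score": 3, "replies": []},
            ],
        },
        {"comment": "Another top-level comment", "comment_id": "c3", "score": 7, "replies": []},
    ],
}


def count_comments(comments):
    """Count every comment in the forest, including nested replies."""
    return sum(1 + count_comments(c.get("replies", [])) for c in comments)


print(count_comments(article["comments"]))  # 3
```

Because replies holds the same subtype as comments, any traversal of an article's comments has to recurse (or keep an explicit stack) to reach the nested ones.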

Heatmaps

These act as a cache that helps pick the comments with the most upvotes as the ones to be labelled by volunteers. A heatmap can be generated by reddit.py using the 'heatmap' command.

  • _id: Mongo's unique id for this heatmap.
  • article: Mongo id of the article this heatmap caches.
  • heatmap: A tuple of the comment_address and the score of that comment. Note: comment_address is a sequence of array indices separated by commas that represents the 'address' of that comment within the forest of comments.
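
The addressing scheme can be sketched as follows. This is not the code in reddit.py, which is not shown here; it assumes the address is stored as a string such as "0,1", where the first index selects a top-level comment and each later index selects an entry in the previous comment's replies. The helper names resolve_address and build_heatmap are hypothetical.

```python
def resolve_address(comments, comment_address):
    """Follow a comma-separated index path such as "0,1" into the comment forest."""
    indices = [int(part) for part in comment_address.split(",")]
    node = comments[indices[0]]
    for index in indices[1:]:
        node = node["replies"][index]
    return node


def build_heatmap(comments):
    """Return (comment_address, score) tuples for every comment, highest score first."""
    entries = []

    def walk(nodes, prefix):
        for i, node in enumerate(nodes):
            path = prefix + (i,)
            entries.append((",".join(map(str, path)), node["score"]))
            walk(node.get("replies", []), path)

    walk(comments, ())
    return sorted(entries, key=lambda entry: entry[1], reverse=True)


# Invented sample forest: one nested reply outranks its parent.
comments = [
    {"comment": "first", "score": 12, "replies": [
        {"comment": "nested", "score": 30, "replies": []},
    ]},
    {"comment": "second", "score": 7, "replies": []},
]

heatmap = build_heatmap(comments)
print(heatmap)  # [('0,0', 30), ('0', 12), ('1', 7)]
print(resolve_address(comments, "0,0")["comment"])  # nested
```

Storing the address rather than a copy of the comment keeps the cache small and lets the labeling side look the comment up in the article document when it is needed.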

Users

Volunteers who help label the comments in the system.

  • _id: Mongo's unique id for this user.
  • username: The username of this user.
  • password: The bcrypt hash of this user's password.
  • role: Either admin or user. Admin-created accounts have endorsed set to true.
  • endorsed: Just an identifier to easily delete bogus labels.
  • assignedArticles: Array of the Mongo IDs of the articles assigned to this user for labeling.
  • checkpoint: Represents the save point of the user. The first number is the index into assignedArticles; the second number is the index into that article's heatmap.
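
As an illustration of how a two-number checkpoint might be advanced, here is a minimal sketch. It assumes the checkpoint is a two-element list and that labeling moves through one article's heatmap before rolling over to the next assigned article; the advance_checkpoint helper and the heatmap_sizes argument are hypothetical and not taken from the application code.

```python
def advance_checkpoint(checkpoint, heatmap_sizes):
    """Move to the next heatmap entry, rolling over to the next assigned article.

    checkpoint    -- [assigned_article_index, heatmap_index]
    heatmap_sizes -- number of heatmap entries for each assigned article, in order
    Returns the new checkpoint, or None when nothing is left to label.
    """
    article_index, heatmap_index = checkpoint
    heatmap_index += 1
    # Skip past any article whose heatmap is exhausted (or empty).
    while article_index < len(heatmap_sizes) and heatmap_index >= heatmap_sizes[article_index]:
        article_index += 1
        heatmap_index = 0
    if article_index >= len(heatmap_sizes):
        return None
    return [article_index, heatmap_index]


# Two assigned articles with 2 and 1 heatmap entries respectively.
print(advance_checkpoint([0, 0], [2, 1]))  # [0, 1]
print(advance_checkpoint([0, 1], [2, 1]))  # [1, 0]
print(advance_checkpoint([1, 0], [2, 1]))  # None
```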

Labels

The labels made by the volunteers.

  • _id: Mongo's unique id for this label.
  • labeller: The name of the user who made this label. Useful for inter-rater reliability tests.
  • article_id: Article that this label refers to.
  • comment_address: Address, within the article's comment forest, of the comment this label refers to.
  • label: The stance of the comment against the target according to the labeller.
  • target: The target of interest of this label.
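
Since the labeller field exists to support inter-rater reliability checks, the simplest such check, raw percent agreement between two labellers, could be sketched as below. The label values ("favor", "against"), the usernames and the percent_agreement helper are invented for illustration; a chance-corrected statistic such as Cohen's kappa would be the usual next step.

```python
from collections import defaultdict


def percent_agreement(labels, labeller_a, labeller_b):
    """Fraction of items labelled by both users on which their labels match."""
    by_item = defaultdict(dict)
    for doc in labels:
        # One item is one comment judged against one target.
        key = (doc["article_id"], doc["comment_address"], doc["target"])
        by_item[key][doc["labeller"]] = doc["label"]

    shared = [v for v in by_item.values() if labeller_a in v and labeller_b in v]
    if not shared:
        return None  # no overlap, so agreement is undefined
    matches = sum(1 for v in shared if v[labeller_a] == v[labeller_b])
    return matches / len(shared)


# Invented sample label documents using the fields listed above.
labels = [
    {"labeller": "alice", "article_id": "a1", "comment_address": "0", "target": "t", "label": "favor"},
    {"labeller": "bob", "article_id": "a1", "comment_address": "0", "target": "t", "label": "favor"},
    {"labeller": "alice", "article_id": "a1", "comment_address": "0,1", "target": "t", "label": "against"},
    {"labeller": "bob", "article_id": "a1", "comment_address": "0,1", "target": "t", "label": "favor"},
    {"labeller": "alice", "article_id": "a1", "comment_address": "1", "target": "t", "label": "against"},
]

print(percent_agreement(labels, "alice", "bob"))  # 0.5
```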

Jobs

Job queue system polled by worker.py for analysis.

  • article: The reddit id of the article requested for analysis.

Results

Results produced by worker.py for the requests found in the jobs collection.
