
Core Data Store

Nuno Nunes edited this page Sep 26, 2010 · 4 revisions

This refers to the data store (database) we use in the core part of the system (not necessarily the data store that all the collectors or data analyzers will use).

Basic design principles

The most likely choice is a NoSQL solution. SQL-based data stores may be needed for specific components further "out" from the core system, but given the speed we need and the ability to scale well and quickly, NoSQL will probably prove to be the best fit for the core.

Data structures

Note: This is very much a work in progress; please discuss and edit as you see fit.

Items

'Items' refers to all of the items the system collects and needs to analyze on behalf of the users. An item may be an email message, an RSS item, a tweet, etc.

Main items data structure

The elements that should be stored in the data store for each item collected by the system are as follows:

  • item_id - A UUID assigned to each item as it is received, which identifies it during its full lifecycle;

  • item_type - Tweet, email, RSS item, IM, ... TODO: Should there be a global registry of item types? In any event the system should support "unknown" data types and use just the fields defined in this section to analyze them;

  • raw_data - The raw item data as received by the collector. This may be useful for specialized analyzers that know how to deal in depth with items of this type, but it may also potentially take up lots of space (in such cases as emails with big attachments). Should we keep these 'as is'? Should we keep a compressed copy? Should we just dump it and rely solely on the other data that we store?

  • text_data - A smaller text representation of the item. For some item types (tweets, IM messages) this could possibly be enough for most analyzers to work with. It will also be useful for displaying a "simplified" version of the item on some interfaces (think SMS, IM, mobile web);

  • location - Location where the item was created (if available);

  • timestamp_created - Timestamp of the item's creation (if it is available);

  • timestamp_received - Timestamp of the time when the item first entered the exocortex system;

  • specific_data - A structure (hash-like) of data specific to this item type. This must be specified in the section pertaining to each item type that is supported or known to the system.
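As a rough illustration, the item fields listed above could be sketched as a Python dataclass. This is only a sketch under assumptions: the class name `Item`, the `to_record` helper, the choice of a string for `location`, and the Python types in general are hypothetical and not decided anywhere on this page.

```python
import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Any, Dict, Optional


def _now_utc() -> datetime:
    """Current time as a timezone-aware UTC timestamp."""
    return datetime.now(timezone.utc)


@dataclass
class Item:
    # "unknown" is the fallback type the system should still be able to analyze.
    item_type: str = "unknown"
    # Raw payload as received by the collector; whether to keep it is an open question.
    raw_data: Optional[bytes] = None
    # Smaller text representation, usable by simple analyzers and small displays.
    text_data: str = ""
    # Assumption: location kept as a free-form string; the page does not define a format.
    location: Optional[str] = None
    # Only set when the source provides a creation time.
    timestamp_created: Optional[datetime] = None
    # Set when the item first enters the system.
    timestamp_received: datetime = field(default_factory=_now_utc)
    # Hash-like structure with data specific to this item type.
    specific_data: Dict[str, Any] = field(default_factory=dict)
    # UUID assigned on receipt; identifies the item for its full lifecycle.
    item_id: str = field(default_factory=lambda: str(uuid.uuid4()))

    def to_record(self) -> Dict[str, Any]:
        """Flatten the item into a plain dict, e.g. for a key-value store."""
        return asdict(self)


tweet = Item(
    item_type="tweet",
    text_data="Hello from the collector",
    specific_data={"retweets": 0},
)
record = tweet.to_record()
```

An analyzer that does not know a given `item_type` can still work with `text_data` and the timestamps alone, which is the behaviour suggested for "unknown" types.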

Items' evaluation data structure

Each analyzer that subscribes to items of a given type will get a chance to rate them. All of these ratings must be stored and made available to the butlers so they can base their decisions on them.

The ratings will consist of:

  • item_id - The unique ID of the item in question;

  • user_id - This rating was made for this item as pertaining to this specific user (some items may be available to multiple users--think of an RSS feed that is followed by several users);

  • urgency_rating - See the Basic Concepts section;

  • importance_rating - See the Basic Concepts section;

  • rating_entity - The identification of the analyzer that provided this rating;

  • timestamp - When this rating was computed.
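The rating record above could be sketched along the same lines. Again, this is only an illustration under assumptions: the names `Rating` and `RatingStore`, the in-memory dictionary standing in for the real data store, and the use of floats for the two ratings are all hypothetical (the actual rating scale is defined in the Basic Concepts section, not here).

```python
from collections import defaultdict
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List, Tuple


def _now_utc() -> datetime:
    """Current time as a timezone-aware UTC timestamp."""
    return datetime.now(timezone.utc)


@dataclass(frozen=True)
class Rating:
    item_id: str             # the item being rated
    user_id: str             # the user this rating applies to
    urgency_rating: float    # assumption: numeric scale, see Basic Concepts
    importance_rating: float # assumption: numeric scale, see Basic Concepts
    rating_entity: str       # the analyzer that produced the rating
    timestamp: datetime = field(default_factory=_now_utc)


class RatingStore:
    """In-memory stand-in for the core data store (hypothetical)."""

    def __init__(self) -> None:
        # Ratings are grouped per (item, user) pair, since the same item
        # may be rated separately for each user who receives it.
        self._ratings: Dict[Tuple[str, str], List[Rating]] = defaultdict(list)

    def add(self, rating: Rating) -> None:
        self._ratings[(rating.item_id, rating.user_id)].append(rating)

    def ratings_for(self, item_id: str, user_id: str) -> List[Rating]:
        """Return every analyzer's rating of one item for one user."""
        return list(self._ratings.get((item_id, user_id), []))


store = RatingStore()
store.add(Rating("item-1", "alice", 0.9, 0.4, "analyzer-a"))
store.add(Rating("item-1", "alice", 0.2, 0.8, "analyzer-b"))
store.add(Rating("item-1", "bob", 0.1, 0.1, "analyzer-a"))
alice_ratings = store.ratings_for("item-1", "alice")
```

Keying on the (item_id, user_id) pair keeps every analyzer's rating available, so a butler can look at all of them for one user without seeing ratings made on behalf of other users.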

Open issues

Software choice

  • Cassandra - I'm using this in production in a work-related project and I like it for its incredible write speed and simplicity. On the other hand it is simply a dumb store, with no structure associated with the data it stores, so it will be an extra burden on the applications to use it;

  • Redis - The all-singing, all-dancing, Swiss Army knife-like data store. I like it very much for its features and the way it handles data (things like lists and sets and their associated native functions are amazingly useful). It also provides a messaging solution that might eventually be useful for the core messaging service. I still need to read up on its scalability, though.
