Takeoutthetrash

nunonunes edited this page Dec 19, 2011 · 2 revisions

Take Out The Trash (or: I'd like my feeds clean, if you please)

Brief description of the problem: Brownian has a number of ways to collect data (RSS feeds, Twitter searches, etc.) and everyone likes to add their own stuff to the spider, but in the middle of the stuff we want to see, we also get some trash.

A prime example of this is the Twitter search for "softwood", which often comes up with news about the woodworking industry, when what we really care about are tweets about our meme and our skateboards.

Solving this as an ML problem

If we tackle this as a machine learning problem, then we can try to implement an online-learning algorithm that gets trained by explicit feedback from the users. For example, whenever Brownian relays a tweet about the wrong kind of softwood, someone could give it a command indicating that this was a bad piece of information, and Brownian would update its internal scoring mechanism accordingly.
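One possible shape for such a scoring mechanism is a small naive Bayes filter that is updated one item at a time. The sketch below is only an illustration under assumptions: the class name `FeedbackFilter`, the labels `"good"` and `"bad"`, and the choice of naive Bayes itself are all hypothetical and not part of Brownian.

```python
import math
import re
from collections import Counter


class FeedbackFilter:
    """Hypothetical online naive Bayes filter, trained one item at a time."""

    def __init__(self):
        # Per-label word frequencies and per-label item counts.
        self.word_counts = {"good": Counter(), "bad": Counter()}
        self.item_counts = {"good": 0, "bad": 0}

    @staticmethod
    def tokenize(text):
        # Lowercase and split into simple alphanumeric tokens.
        return re.findall(r"[a-z0-9']+", text.lower())

    def train(self, text, label):
        """Update the counts from one piece of feedback ('good' or 'bad')."""
        self.item_counts[label] += 1
        self.word_counts[label].update(self.tokenize(text))

    def bad_score(self, text):
        """Return log-odds that the item is trash; positive means 'probably bad'."""
        vocab = len(set(self.word_counts["good"]) | set(self.word_counts["bad"])) or 1
        totals = {label: sum(c.values()) for label, c in self.word_counts.items()}
        # Prior from item counts, with add-one smoothing.
        score = math.log((self.item_counts["bad"] + 1) / (self.item_counts["good"] + 1))
        for word in self.tokenize(text):
            p_bad = (self.word_counts["bad"][word] + 1) / (totals["bad"] + vocab)
            p_good = (self.word_counts["good"][word] + 1) / (totals["good"] + vocab)
            score += math.log(p_bad) - math.log(p_good)
        return score
```

With no feedback at all the score stays at zero, so nothing would be filtered until someone starts training. A threshold on `bad_score` would then decide whether an item gets relayed.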

Details that we need to think about are:

  • Is there a single database of "good stuff", regardless of the source (Twitter, RSS, ...), or do we keep separate engines for different sources?
  • No explicit feedback means no update to the algorithm; negative feedback marks the item as a bad news bit. Should there also be an explicit positive feedback command?
  • What are the command(s) that we should use to train Brownian? (brownian: [baditem|gooditem])
  • Should anyone be allowed to train Brownian, or just a group of admins? Only authenticated people? Only "regulars"? (requires the emotion chip developments to recognize regulars at the Pub) ;-)
  • Should we share the database for the whole bot, or should we use a separate database for these wacky projects?
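To make the command and permission questions above more concrete, here is a minimal sketch of how a training command could be recognised. It assumes the `brownian: baditem` / `brownian: gooditem` syntax floated in the list; the function name `parse_feedback` and the `allowed_nicks` parameter are hypothetical, and who ends up in that set is exactly the open question.

```python
import re

# Matches "brownian: baditem" or "brownian: gooditem" (case-insensitive).
COMMAND_RE = re.compile(r"^\s*brownian:\s*(baditem|gooditem)\s*$", re.IGNORECASE)


def parse_feedback(message, nick, allowed_nicks=None):
    """Return 'bad', 'good', or None if this is not a usable training command.

    allowed_nicks=None means anyone may train; otherwise only the listed
    nicks (admins, regulars, ...) are accepted.
    """
    match = COMMAND_RE.match(message)
    if match is None:
        return None
    if allowed_nicks is not None and nick not in allowed_nicks:
        return None
    return "bad" if match.group(1).lower() == "baditem" else "good"
```

Returning `None` for anything else matches the "no explicit feedback, no update" rule: ordinary chatter never touches the training data.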