Kinesis

Daniel Blazevski edited this page Sep 1, 2016 · 4 revisions

Introduction

Kinesis is a popular data-ingestion tool developed by Amazon. It is a fully managed AWS service, so unlike other tools (e.g. Apache Kafka), Kinesis does not require you to set up and configure software on individual servers. The key concepts for using Kinesis for stream processing are:

  • A stream: A queue for incoming data to reside in. Streams are identified by a name. For example, Amazon might have an "Orders" stream, a "Customer-Review" stream, and so on.

  • A shard: A stream is composed of one or more shards. One shard can support reads of up to 2 MB/sec and writes of up to 1,000 records/sec, up to a maximum of 1 MB/sec. A user should choose a number of shards that matches the amount of data expected to flow through their system.

  • Producer: A producer is a source of data, typically generated external to your system in real-world applications (e.g. user click data).

  • Consumer: Once the data is placed in a stream, it can be processed and stored somewhere (e.g. on HDFS or a database). Anything that reads in data from a stream is said to be a consumer.
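The per-shard limits above determine how many shards a stream needs. As a rough sketch (the function name and example numbers below are illustrative, not part of any AWS API), the shard count is driven by whichever limit is hit first:

```python
import math

def shards_needed(write_mb_per_sec, records_per_sec, read_mb_per_sec):
    """Estimate a shard count from expected throughput.

    Each shard supports writes up to 1 MB/sec (or 1,000 records/sec)
    and reads up to 2 MB/sec; the binding constraint wins.
    """
    by_write_volume = math.ceil(write_mb_per_sec / 1.0)
    by_write_records = math.ceil(records_per_sec / 1000.0)
    by_read_volume = math.ceil(read_mb_per_sec / 2.0)
    return max(by_write_volume, by_write_records, by_read_volume, 1)

# e.g. 3 MB/sec of writes at 2,500 records/sec, read once downstream
print(shards_needed(3, 2500, 3))  # 3
```

Here the write volume (3 MB/sec needs 3 shards) dominates the read volume (3 MB/sec needs only 2 shards at 2 MB/sec each).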

Example of creating a stream

The boto library is a convenient way to write Python scripts that initialize and use AWS tools. Using the boto library, one can create a Kinesis stream in a few lines of code:

import json

import boto.kinesis

# Connect to Kinesis in the us-east-1 region
# (assumes AWS credentials are configured, e.g. in ~/.boto)
kinesis = boto.kinesis.connect_to_region("us-east-1")

streamName = "myStream"
kinesis.create_stream(streamName, 1)  # stream name, number of shards

# The stream takes a short time to become ACTIVE; once it is,
# put an unbounded sequence of JSON-encoded integers onto it.
i = 0
while True:
    kinesis.put_record(streamName, json.dumps(i), "partitionkey")
    i += 1

Example of placing data into a stream
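A minimal sketch of a producer, again using boto. It assumes a stream named "myStream" already exists and is ACTIVE, and that AWS credentials are configured; the record contents and the `put_orders`/`serialize` helpers are illustrative:

```python
import json


def serialize(record):
    """Encode a record as a JSON string before putting it on the stream."""
    return json.dumps(record)


def put_orders(kinesis, stream_name, orders):
    # The partition key determines which shard a record lands on,
    # so varying it (here, the order id) spreads records across shards.
    for order in orders:
        kinesis.put_record(stream_name, serialize(order), str(order["order_id"]))


if __name__ == "__main__":
    import boto.kinesis  # AWS-only dependency; assumes credentials are configured

    conn = boto.kinesis.connect_to_region("us-east-1")
    put_orders(conn, "myStream", [{"order_id": 1, "item": "book"},
                                  {"order_id": 2, "item": "pen"}])
```

Note that a record's data is just bytes; JSON is simply a common, convenient encoding.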

Example of consuming data from a stream
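A consumer reads a stream shard by shard: describe the stream to find a shard, request a shard iterator, then poll `get_records` in a loop. The sketch below uses boto's Kinesis API and assumes the same "myStream" stream; the `parse` helper is illustrative:

```python
import json
import time


def parse(records):
    """Decode the JSON payloads of a batch of Kinesis records."""
    return [json.loads(r["Data"]) for r in records]


if __name__ == "__main__":
    import boto.kinesis  # AWS-only dependency; assumes credentials are configured

    kinesis = boto.kinesis.connect_to_region("us-east-1")
    streamName = "myStream"

    # Pick the first shard and start from the oldest available record
    # (TRIM_HORIZON); use LATEST instead to read only new records.
    desc = kinesis.describe_stream(streamName)
    shard_id = desc["StreamDescription"]["Shards"][0]["ShardId"]
    it = kinesis.get_shard_iterator(streamName, shard_id, "TRIM_HORIZON")["ShardIterator"]

    while True:
        out = kinesis.get_records(it, limit=100)
        for rec in parse(out["Records"]):
            print(rec)  # in practice: process and store, e.g. on HDFS or a database
        it = out["NextShardIterator"]
        time.sleep(1)  # stay under the per-shard read rate
```

A production consumer would iterate over all shards and checkpoint its position; libraries such as the Kinesis Client Library handle that bookkeeping.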
