Kinesis
Kinesis is a popular ingestion tool developed by Amazon. It is a service managed by AWS, so unlike other tools (e.g. Apache Kafka), Kinesis does not require you to set up and configure individual servers. The key concepts for using Kinesis for stream processing are:
- A stream: a queue for incoming data to reside in. Streams are labeled by a string. For example, Amazon might have an "Orders" stream, a "Customer-Review" stream, and so on.
- A shard: a stream is composed of one or more shards. Each shard can support reads at up to 2 MB/sec and writes at up to 1,000 records/sec, to a maximum of 1 MB/sec. A user should specify a number of shards that matches the amount of data expected in their system.
- Producer: a producer is a source of data, typically generated outside your system in real-world applications (e.g. user click data).
- Consumer: once data is placed in a stream, it can be processed and stored somewhere (e.g. on HDFS or a database). Anything that reads data from a stream is said to be a consumer.
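As a rough illustration of the shard sizing advice above, the smallest shard count is the one that covers both your expected write and read throughput, using the per-shard limits just quoted (1 MB/sec writes, 2 MB/sec reads). The function name and throughput figures below are illustrative, not part of any Kinesis API:

```python
import math

# Per-shard limits quoted above (illustrative constants)
WRITE_MB_PER_SEC = 1.0   # writes: up to 1 MB/sec per shard
READ_MB_PER_SEC = 2.0    # reads: up to 2 MB/sec per shard

def shards_needed(write_mb_per_sec, read_mb_per_sec):
    """Smallest shard count covering both write and read throughput."""
    return max(
        math.ceil(write_mb_per_sec / WRITE_MB_PER_SEC),
        math.ceil(read_mb_per_sec / READ_MB_PER_SEC),
        1,  # a stream always has at least one shard
    )

# e.g. 5 MB/sec of incoming data, read back at the same rate:
# the write limit dominates, ceil(5 / 1) = 5 shards
print(shards_needed(5, 5))  # 5
```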
The boto library is a convenient way to write Python scripts that initialize and use AWS tools. Using the boto library, you can create a Kinesis stream and write to it in a few lines of code:
import json
import boto.kinesis

# Connect to the Kinesis service in the us-east-1 region
kinesis = boto.kinesis.connect_to_region("us-east-1")

# Create a stream named "myStream" with a single shard
# (in practice, wait for the stream to become ACTIVE before writing)
stream_name = "myStream"
kinesis.create_stream(stream_name, 1)

# Continuously put records (here, JSON-encoded integers) into the stream
i = 0
while True:
    kinesis.put_record(stream_name, json.dumps(i), "partitionkey")
    i += 1
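For the other side of the stream, a consumer can poll the same stream with boto's get_shard_iterator and get_records calls. This is a minimal sketch assuming the single-shard "myStream" stream created above; the decode_record helper is a hypothetical name for illustration:

```python
import json


def decode_record(record):
    """Extract the JSON payload from a Kinesis record dict."""
    return json.loads(record["Data"])


def consume(stream_name="myStream", region="us-east-1"):
    # Imported here so the pure helper above is usable without AWS access
    import time
    import boto.kinesis

    kinesis = boto.kinesis.connect_to_region(region)

    # A single-shard stream has exactly one shard id to iterate over
    description = kinesis.describe_stream(stream_name)
    shard_id = description["StreamDescription"]["Shards"][0]["ShardId"]

    # TRIM_HORIZON starts reading from the oldest available record
    iterator = kinesis.get_shard_iterator(
        stream_name, shard_id, "TRIM_HORIZON")["ShardIterator"]

    while True:
        out = kinesis.get_records(iterator, limit=100)
        for record in out["Records"]:
            print(decode_record(record))
        iterator = out["NextShardIterator"]
        time.sleep(1)  # avoid exhausting the per-shard read rate
```

In a real application the consumer would store or process each record (e.g. write it to HDFS or a database) rather than print it.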
Find out more about the Insight Data Engineering Fellows Program in New York and Silicon Valley, apply today, or sign up for program updates.
You can also read our engineering blog here.