
Short Name

Clickstream analysis using Apache Spark and Apache Kafka

Short Description

Use Apache Spark and Apache Kafka to detect real-time trending topics on the Wikipedia website. Apache Kafka acts as a message queue, and the Apache Spark Structured Streaming engine performs the analytics.

Offering Type

Cognitive

Introduction

Built for anyone using data to create Jupyter notebooks and other artifacts, this journey will show the power of the open-source helper library PixieDust. With PixieDust hosted on IBM Data Science Experience, a developer or other user can quickly create charts, graphs, and tables without complex code, in an interactive and dynamic manner. In addition, PixieApps are used to embed UI elements directly in the Jupyter Notebook. Given an open-source data provider like the City of San Francisco DataSF Open Data, PixieDust and IBM's Data Science Experience can empower the user to analyze and share data visualizations and notebooks.

Author

by Prashant Sharma

Code

https://github.com/IBM/kafka-streaming

Demo

N/A

Video

TBD

Overview

When the reader has completed this journey, they will understand how to:

Use Jupyter Notebooks to load, visualize, and analyze data
Run Notebooks in IBM Data Science Experience
Perform clickstream analysis using Apache Spark Structured Streaming
Build a low-latency processing stream using Apache Kafka

Flow

The user connects to the Apache Kafka service and sets up a running instance of a clickstream.
The user runs a Jupyter Notebook in IBM Data Science Experience that interacts with the underlying Apache Spark service. Alternatively, this step can be run locally with the Spark shell.
The Spark service reads and processes data from the Kafka service.
The processed Kafka data is relayed back to the user through the Jupyter Notebook (or the console sink if running locally).
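
The flow above can be sketched in PySpark roughly as follows. This is a minimal illustration, not the code from the repository. It assumes a broker at `localhost:9092`, a topic named `clicks`, and tab-separated records of the form `prev<TAB>curr<TAB>type<TAB>n`. The function names `parse_click` and `start_trending_query` are hypothetical as well. Reading from Kafka also requires the Spark Kafka connector package (`spark-sql-kafka-0-10`) to be available to the Spark session.

```python
"""Sketch: running click totals per page, read from Kafka with Structured Streaming."""

# Assumed (hypothetical) record layout: "prev<TAB>curr<TAB>type<TAB>n"
FIELDS = ["prev", "curr", "type", "n"]


def parse_click(line):
    """Split one tab-separated record into a dict; return None if malformed.

    Pure-Python helper for checking sample records locally. It mirrors the
    column split that the streaming query below performs inside Spark.
    """
    parts = line.rstrip("\n").split("\t")
    if len(parts) != len(FIELDS):
        return None
    try:
        count = int(parts[3])
    except ValueError:
        return None
    return {"prev": parts[0], "curr": parts[1], "type": parts[2], "n": count}


def start_trending_query(spark, bootstrap_servers="localhost:9092", topic="clicks"):
    """Start a streaming query that prints running click totals per page."""
    # Imported here so parse_click stays usable without Spark installed.
    from pyspark.sql import functions as F

    # Subscribe to the Kafka topic; each message arrives as a binary `value`.
    raw = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", bootstrap_servers)
        .option("subscribe", topic)
        .load()
    )

    # Decode the message and keep the target page and its click count.
    cols = F.split(F.col("value").cast("string"), "\t")
    clicks = raw.select(
        cols.getItem(1).alias("curr"),
        cols.getItem(3).cast("long").alias("n"),
    )

    # Sum clicks per page; sorting a stream is only allowed on an
    # aggregation written in "complete" output mode.
    totals = (
        clicks.groupBy("curr")
        .agg(F.sum("n").alias("clicks"))
        .orderBy(F.desc("clicks"))
    )

    # Relay the results to the console sink, as in the local variant of the flow.
    return (
        totals.writeStream.outputMode("complete")
        .format("console")
        .option("truncate", "false")
        .start()
    )
```

To try it, create a session with `SparkSession.builder.getOrCreate()`, call `start_trending_query(spark)`, and then call `awaitTermination()` on the returned query. The console sink reprints the full sorted table after each micro-batch, which is convenient for a demo but not suited to large result sets.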

Included Components

Apache Spark: An open-source distributed computing framework that allows you to perform large-scale data processing.
Apache Kafka: A platform for building real-time data pipelines and streaming apps, designed to be horizontally scalable, fault-tolerant, and fast.
IBM Data Science Experience: Analyze data using RStudio, Jupyter, and Python in a configured, collaborative environment that includes IBM value-adds, such as managed Spark.

Featured Technologies

Jupyter Notebooks: An open-source web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text.
