Objective: build a scalable and efficient data analytics pipeline on Google Cloud Platform for the Amazon product reviews dataset, meeting specific analytics goals for end users such as the Marketing & Sales department and sellers.

AneshaaK/Amazon-ProdReview-Analysis-GCP

Data Analytics Pipeline in Google Cloud Platform
for Consumer Reviews of Amazon Products

The objective of this project is to build a scalable and efficient data analytics pipeline on Google Cloud Platform for the Amazon product reviews dataset. It targets the following goals for end users such as the Marketing & Sales department and sellers:

  • Most reviewed products
  • Most rated products
  • Popular category based on gender
  • The popularity of free shipping products
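
As a rough illustration of the first two goals, the sketch below counts reviews per product and averages ratings over a few made-up records. The field names (`product_id`, `rating`) and the sample rows are assumptions for illustration, not the dataset's actual schema, and the project itself computes its metrics with SQL views in BigQuery rather than in Python. "Most rated" is read here as highest average rating, which is also an assumption.

```python
from collections import Counter, defaultdict

# Hypothetical sample records; the real dataset's schema may differ.
reviews = [
    {"product_id": "A1", "rating": 5},
    {"product_id": "A1", "rating": 4},
    {"product_id": "B2", "rating": 3},
    {"product_id": "A1", "rating": 3},
    {"product_id": "B2", "rating": 5},
    {"product_id": "C3", "rating": 5},
]


def most_reviewed(records, top_n=3):
    """Return (product_id, review_count) pairs, most reviewed first."""
    counts = Counter(r["product_id"] for r in records)
    return counts.most_common(top_n)


def average_ratings(records):
    """Return a mapping of product_id to mean rating."""
    totals = defaultdict(lambda: [0, 0])  # product_id -> [rating sum, count]
    for r in records:
        totals[r["product_id"]][0] += r["rating"]
        totals[r["product_id"]][1] += 1
    return {pid: total / count for pid, (total, count) in totals.items()}


top = most_reviewed(reviews)
ratings = average_ratings(reviews)
```

The same aggregations map naturally onto `GROUP BY` queries, which is closer to how the BigQuery views described later would express them.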

Project Architecture

[Image: project architecture diagram]

Big Data Lifecycle

  • Data Ingestion
    • Create a Pub/Sub topic that sends a notification whenever a batch file is uploaded to a specific bucket
    • Create a Pub/Sub topic that publishes streaming data every 0.5 seconds on a specific GCS bucket
  • Data Preparation: cleaning, filtering and formatting
    • Dataflow subscribes to the topic that publishes the metadata of the batch file
    • Dataflow subscribes to the streaming topic and processes it in fixed time windows
  • Data Analytics
    • Data is stored in BigQuery, where SQL queries create views
  • Data Visualization
    • A dashboard is built in Looker Studio, connected to BigQuery
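
The ingestion and preparation steps above can be sketched in plain Python, without the Pub/Sub or Dataflow client libraries, to show the two ideas involved: reading the bucket and object name out of a Cloud Storage notification, and assigning streamed records to fixed time windows. The helper names, the bucket and file names, and the 10-second window size are all hypothetical; the project's own Dataflow code is not shown in this README. Cloud Storage notifications carry `bucketId`, `objectId` and `eventType` as message attributes, which is what the first helper reads.

```python
from collections import defaultdict


def batch_file_uri(attributes):
    """Build a gs:// URI from Cloud Storage notification attributes.

    Returns None for events other than a completed upload.
    """
    if attributes.get("eventType") != "OBJECT_FINALIZE":
        return None
    return f"gs://{attributes['bucketId']}/{attributes['objectId']}"


def window_start(timestamp, size_seconds):
    """Start of the fixed window that contains `timestamp` (in seconds)."""
    return timestamp - (timestamp % size_seconds)


def group_by_fixed_window(events, size_seconds):
    """Group (timestamp, payload) pairs by the start of their window."""
    windows = defaultdict(list)
    for timestamp, payload in events:
        windows[window_start(timestamp, size_seconds)].append(payload)
    return dict(windows)


# Hypothetical notification for an uploaded batch file.
uri = batch_file_uri({
    "eventType": "OBJECT_FINALIZE",
    "bucketId": "example-reviews-bucket",
    "objectId": "batch/reviews.csv",
})

# Hypothetical stream: one record every 0.5 seconds, grouped into 10-second windows.
stream = [(i * 0.5, f"review-{i}") for i in range(50)]
windows = group_by_fixed_window(stream, size_seconds=10)
```

In an Apache Beam pipeline running on Dataflow, the windowing step would more likely be expressed with `beam.WindowInto(window.FixedWindows(size))` than written by hand; the manual version here only makes the bucketing rule visible.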

Contributors

Aneshaa Kasula, Viritha Vanama and Mohini Patil
