
Speech-to-text-data-collection

Introduction

In today’s data-driven world of cut-throat competition, creating, executing and monitoring different tasks over large volumes of data is no small feat. Hence, most companies need an automated solution that helps them manage their daily tasks.

Apache Kafka and Apache Airflow are open-source platforms that help companies create seamlessly functioning workflows to organise, execute and monitor their tasks. Although the two seem to perform related tasks, some crucial differences set them apart: Kafka is a distributed event-streaming platform, while Airflow is a workflow orchestrator. Spark's Structured Streaming API enables scalable, high-throughput, fault-tolerant stream processing of live data streams, and data can be ingested from many sources, including Kafka.
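
As a minimal sketch of how these pieces fit together, the snippet below uses PySpark's Structured Streaming to read events from a Kafka topic and write the text payloads to S3. The broker address, topic name, and bucket path are placeholders for illustration, not the project's actual configuration.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("speech-text-stream").getOrCreate()

# Read the raw event stream from Kafka; broker and topic names are assumptions.
raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "speech-text")  # assumed topic name
    .option("startingOffsets", "latest")
    .load()
)

# Kafka delivers key/value as binary; cast the payload to a single string column.
text = raw.select(col("value").cast("string"))

# Write the stream to an S3 bucket as text files, with checkpointing for fault tolerance.
query = (
    text.writeStream
    .format("text")
    .option("path", "s3a://my-bucket/corpus/")  # assumed bucket path
    .option("checkpointLocation", "s3a://my-bucket/checkpoints/")
    .start()
)
query.awaitTermination()
```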

Our responsibility was to build a tool that can be deployed to post and receive text and audio files to and from a data lake, apply transformations in a distributed manner, and load the results into a warehouse in a format suitable for training a speech-to-text model.
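
To make the "posting" side concrete, here is a minimal sketch using the kafka-python client to publish an audio file to an ingestion topic; the broker address, topic name, and file name are assumptions for illustration.

```python
from kafka import KafkaProducer

# Hypothetical broker address; adjust to your cluster.
producer = KafkaProducer(bootstrap_servers="localhost:9092")

# Post a raw audio file to the ingestion topic as bytes; a text transcript
# could be sent to a companion topic the same way.
with open("sample.wav", "rb") as f:
    producer.send("audio-ingest", value=f.read())  # assumed topic name

producer.flush()  # block until the broker acknowledges the message
```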

Key Topics


Learning Objectives

Skills/Tasks:

  • Create and maintain an Apache Kafka cluster (see the topic-creation sketch after this list)
  • Work with Apache Airflow and Apache Spark
  • Apply Structured Streaming to process streaming data
  • Build data pipelines and orchestration workflows
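
For the first objective, a small example of cluster maintenance is creating a topic programmatically. The sketch below uses kafka-python's admin client; the broker address, topic name, and partition settings are assumptions suited to a single-broker development setup.

```python
from kafka.admin import KafkaAdminClient, NewTopic

# Hypothetical broker address and topic name.
admin = KafkaAdminClient(bootstrap_servers="localhost:9092")

# Create a topic with 3 partitions and a replication factor of 1,
# which is appropriate only for a single-broker development cluster.
admin.create_topics([
    NewTopic(name="audio-ingest", num_partitions=3, replication_factor=1)
])
admin.close()
```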

Knowledge: enterprise-grade data engineering using Apache and Databricks tools

Helpful Links


Technologies used

  • Apache Kafka: to sequentially log streaming data into specific topics
  • Apache Airflow: to create, orchestrate and monitor data workflows; it will also be used to create and update the model and to schedule those tasks (a minimal DAG sketch follows this list)
  • S3 Buckets: to store transformed streaming data
  • Apache Spark: for data preprocessing, validating the data and transforming it into corpus text
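
As an illustration of the orchestration role Airflow plays here, the sketch below wires three placeholder pipeline steps into a daily DAG. The DAG id, task names, and callables are hypothetical stand-ins, not the repository's actual DAG.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical callables standing in for the project's real pipeline steps.
def validate_data():
    ...

def transform_to_corpus():
    ...

def load_to_warehouse():
    ...

with DAG(
    dag_id="speech_to_text_pipeline",  # assumed DAG id
    start_date=datetime(2021, 9, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    validate = PythonOperator(task_id="validate_data", python_callable=validate_data)
    transform = PythonOperator(task_id="transform_to_corpus", python_callable=transform_to_corpus)
    load = PythonOperator(task_id="load_to_warehouse", python_callable=load_to_warehouse)

    # Run validation, then transformation, then loading.
    validate >> transform >> load
```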

DAG

Contributors
