
Sunil Kuruba

👨🏻‍💻 Data Engineer

I'm currently pursuing an MS in Computer Science at the University of Illinois Chicago, graduating in Dec 2025, with a focus on advanced data engineering, cloud computing, distributed systems, and big data technologies. My academic work aligns with my passion for solving real-world data challenges through scalable, cloud-native backend architectures.

I’m an AWS Certified Data Engineer – Associate, skilled in building secure, cost-efficient, and high-performance data pipelines using services such as Amazon S3, Glue, EMR, Kinesis, Redshift, Athena, and DynamoDB. My expertise spans ETL/ELT, data lake architectures, governance, and real-time analytics at scale.

Previously, I worked as a Senior Software Engineer at Fivetran, contributing to core pipeline infrastructure used by enterprise customers. I re-architected the BigQuery data writer to support partitioning, clustering, and JSON — reducing customer costs by ~90%. I also built a high-performance DynamoDB connector (15× faster) and improved MongoDB syncs using Change Streams (5× speedup).

Beyond connectors, I authored the Isolated Endpoint Sync (IES) framework — now adopted in 500+ connectors across 10+ teams — and led the design of the Data Preview feature for smoother onboarding. I also onboarded the 2024 intern batch through a structured training program and conducted technical interviews for various engineering roles.

I'm passionate about building scalable, reliable, and innovative data solutions that drive real business value, and I'm actively exploring new opportunities in data engineering, backend systems, and cloud infrastructure.

Core Strengths

  • Data Engineering & Pipelines: Designing and implementing scalable ETL/ELT pipelines, schema evolution, data modeling, connector development, orchestration, and real-time data processing.
  • Distributed Systems & Processing: Apache Spark, Hadoop MapReduce, Apache Flink, Kafka, AWS Kinesis, and gRPC for processing large-scale datasets in both batch and real-time systems.
  • Cloud Platforms & Services:
    • AWS: EC2, Lambda, S3, EMR, Glue, Step Functions, RDS, DynamoDB, Redshift, Athena, EventBridge, IAM, KMS, CloudWatch, CloudTrail.
    • GCP: BigQuery, Compute Engine, Pub/Sub, Cloud Storage.
    • Azure: Azure VM, Azure Blob Storage.
  • Databases & Warehousing: Experience with modern data warehouses and databases like Snowflake, BigQuery, Redshift, DynamoDB, MySQL, PostgreSQL, MongoDB, and SQL Server.
  • Programming & Backend Development: Proficient in Java (Advanced), SQL (Advanced), Scala, Python, C++, Shell Scripting; expertise in backend architecture, REST APIs, and service frameworks.
  • DevOps & Infrastructure: Skilled in Docker, Kubernetes, Terraform, GitHub Actions, and CI/CD, plus observability and code-quality tooling such as New Relic, CloudWatch, and SonarQube for maintaining robust production systems.
  • Data Quality & Governance: Ensuring data integrity and reliability through validation frameworks, governance practices, and monitoring across the data lifecycle.
  • Collaboration & Leadership: Onboarded interns with structured training, led design efforts, and contributed to engineering hiring processes.

Certification

AWS Certified Data Engineer – Associate

Verified on Credly
Demonstrates ability to design, build, secure, and maintain data analytics solutions on AWS that are efficient, scalable, and cost-optimized. Proficient in:

  • Data lake and lakehouse architecture
  • Real-time and batch data ingestion
  • Data transformation using Glue, EMR
  • Querying with Athena, Redshift
  • Secure access via IAM, encryption, and governance

Education

University of Illinois Chicago

Master of Science in Computer Science | Aug 2024 – Dec 2025

RV College of Engineering, Bengaluru

Bachelor of Engineering in Computer Science | Aug 2016 – May 2020

🧑‍💻 Experience

Senior Software Engineer

Fivetran · Bengaluru, India
Mar 2023 – Aug 2024 · 1 yr 6 mos

  • Redesigned and developed a new BigQuery data writer aligned with SQL-based writers, eliminating 90% of maintenance overhead.
  • Enhanced Warehouse Data Writer throughput by 30% by implementing multithreaded concurrent processing for split files.
  • Added support for JSON data types in BigQuery, ensuring seamless schema evolution and data compatibility.
  • Introduced partitioning and clustering in BigQuery writer to reduce customer costs by ~90% — a hackathon-winning optimization.
  • Led infrastructure improvements across distributed data pipelines and contributed to system-level performance gains.
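The cost savings from partitioning and clustering come mainly from pruning: when a query filters on the partition column, BigQuery scans only the matching partitions instead of the whole table, and billing follows bytes scanned. A minimal, self-contained sketch of that effect (plain Python with hypothetical row counts and sizes, not Fivetran's actual writer code):

```python
# Sketch: why date-partitioning slashes query costs.
# BigQuery bills by bytes scanned; a filter on the partition column
# lets the engine skip every non-matching partition entirely.
# All numbers below are hypothetical, for illustration only.

from collections import defaultdict

def build_partitions(rows):
    """Group rows into per-day partitions, as a date-partitioned table would."""
    parts = defaultdict(list)
    for row in rows:
        parts[row["date"]].append(row)
    return parts

def bytes_scanned(partitions, date_filter=None, bytes_per_row=100):
    """Bytes a query must scan: every partition, or just the matching one."""
    if date_filter is None:  # unpartitioned / no pruning: full table scan
        return sum(len(rows) for rows in partitions.values()) * bytes_per_row
    return len(partitions.get(date_filter, [])) * bytes_per_row

# 30 days of data, 1000 rows per day; the query touches a single day.
rows = [{"date": f"2024-06-{d:02d}", "id": i}
        for d in range(1, 31) for i in range(1000)]
parts = build_partitions(rows)

full = bytes_scanned(parts)                  # no pruning: scans all 30 days
pruned = bytes_scanned(parts, "2024-06-15")  # pruning: scans 1 day
print(f"full={full} pruned={pruned} savings={1 - pruned / full:.0%}")
# → full=3000000 pruned=100000 savings=97%
```

Clustering compounds this by sorting rows inside each partition so block-level metadata can skip non-matching blocks, which is why the combined savings on selective queries can approach the ~90% figure above.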

Software Engineer 2

Fivetran · Bengaluru, India
Sep 2021 – Mar 2023 · 1 yr 7 mos

  • Engineered a high-performance DynamoDB connector with 15× speedup in incremental syncs.
  • Improved MongoDB connector using Change Streams to achieve 5× faster data ingestion with reduced latency.
  • Designed support for Azure CosmosDB for MongoDB API, expanding Fivetran’s connector catalog.
  • Built Data Preview functionality using the IES framework to simplify customer onboarding and demo experiences.
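The Change Streams speedup comes from switching the sync model from "re-scan the collection and diff" to "replay only the ordered change events since the last checkpoint". A toy model of that idea (plain Python standing in for MongoDB's oplog and resume token; not Fivetran's code):

```python
# Sketch: incremental sync via a change stream vs. periodic full re-scans.
# MongoDB Change Streams expose an ordered feed of inserts/updates/deletes
# plus a resume token, so a connector processes only what changed since its
# last checkpoint. Here a plain list stands in for the change feed and an
# integer offset stands in for the resume token.

oplog = []  # ordered change events

def apply_change(op, doc_id, value=None):
    """Simulate a write against the source database."""
    oplog.append({"op": op, "id": doc_id, "value": value})

def sync_from(resume_token):
    """Process only events after the checkpoint; return events + new token."""
    new_events = oplog[resume_token:]
    return new_events, len(oplog)

# Initial load: three documents inserted, all three synced.
for i in range(3):
    apply_change("insert", i, value=i * 10)
events, token = sync_from(0)
assert len(events) == 3

# Later, a single document changes; the next sync reads just that one
# event instead of re-scanning the whole collection.
apply_change("update", 1, value=99)
events, token = sync_from(token)
print(len(events))  # → 1
```

In the real connector the resume token is persisted durably between sync runs, so a restart picks up exactly where the previous sync stopped.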

Software Engineer

Fivetran · Bengaluru, India
Jun 2020 – Aug 2021 · 1 yr 3 mos

  • Authored Isolated Endpoint Sync (IES) — a hackathon-winning framework now adopted by 500+ connectors and 10+ teams.
  • Built a public Shopify connector app with OAuth-based merchant onboarding, GraphQL extraction, and failover capabilities.
  • Enhanced Stripe connector with multithreading and connected accounts support for scale and fault-tolerance.
  • Developed an ETL connector for ADP REST APIs with complete ERD-based schema documentation.

Software Engineering Intern

Fivetran · Bengaluru, India
Jan 2020 – May 2020 · 5 mos

  • Built webhook-based incremental sync mechanism for Recharge connector, achieving a 10× increase in extract performance.
  • Benchmarked performance of full ETL pipelines using Snowflake, delivering optimization insights for production rollouts.
  • Contributed to multiple API-based connectors and gained hands-on experience with Fivetran’s connector lifecycle.
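A webhook-based sync inverts the extraction model: instead of polling the API and re-fetching records, the source pushes each change as it happens, and the handler applies it exactly once. Since webhook delivery is typically at-least-once, deduplication by event id is the key detail. A toy sketch (event ids and payload shape are made up for illustration):

```python
# Sketch: webhook-driven incremental extraction (push) vs. polling (pull).
# The source POSTs each change as it happens; the handler applies it once,
# keyed by event id, so redelivered webhooks stay idempotent.

seen_ids = set()  # event ids already applied (persisted durably in practice)
store = {}        # destination state, keyed by order id

def handle_webhook(event):
    """Apply a change event exactly once, even if delivered twice."""
    if event["event_id"] in seen_ids:
        return False  # duplicate delivery: skip
    seen_ids.add(event["event_id"])
    store[event["order_id"]] = event["payload"]
    return True

assert handle_webhook(
    {"event_id": "e1", "order_id": 7, "payload": {"status": "paid"}})
# The provider retries e1; the duplicate is ignored, state is unchanged.
assert not handle_webhook(
    {"event_id": "e1", "order_id": 7, "payload": {"status": "paid"}})
print(store)  # → {7: {'status': 'paid'}}
```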

Projects

AWS vs GCP Data Pipeline Benchmarking

GitHub: View Project
Benchmarks real-time data pipelines on AWS and GCP using a common IoT workload. Evaluates performance, cost, and sustainability.
Tech: AWS Kinesis, GCP Pub/Sub, Lambda, Dataflow, Python


AWS Bedrock LLM Conversation API with Ollama

GitHub: View Project
Built a cloud-native conversational API using AWS Bedrock and Ollama for multi-turn LLM-based dialogue.
Tech: Scala, Akka HTTP, gRPC, AWS Lambda, Docker


Distributed Neural Network Training & Sentence Generation

GitHub: View Project
Built a Spark-based deep learning pipeline to train and generate text using DL4J and AWS EMR.
Tech: Scala, Apache Spark, DL4J, AWS EMR


Social-Aware Movie Revenue Prediction

GitHub: View Project
A machine learning pipeline that predicts movie box office revenue by combining traditional metadata (e.g., budget, genre, cast) with sentiment and emotion signals extracted from Reddit and YouTube.
Tech: Python, scikit-learn, NLP, Reddit & YouTube API, Data Visualization, EDA


Hadoop-based LLM Tokenization & Embeddings

GitHub: View Project
Created a distributed NLP pipeline using custom tokenizers and Hadoop MapReduce to generate text embeddings.
Tech: Scala, Hadoop, AWS EMR


Help Session Activity Management System

GitHub: View Project
Designed the backend data model for scheduling and managing academic help sessions between TAs and students.
Tech: SQL, Database Design, ER Diagram
