Cloud Analytics for Data Science Applications Notes (Premaster's Course)
- Outline for Lecture 1: Cloud Introduction
- Answers to Potential Questions from Lecture 1
- Outline for Lecture 2: AWS Resources
- Answers to Potential Questions from Lecture 2
- Outline for Lecture 3: Data Warehousing
- Answers to Potential Questions from Lecture 3
- Outline for Lecture 4: AWS Redshift
- Answers to Potential Questions from Lecture 4
- Outline for Lecture 5: AWS Glue and Athena
- Answers to Potential Questions from Lecture 5
- Outline for Lecture 6: AWS QuickSight
- Answers to Potential Questions from Lecture 6
- Diagrams: AWS Architecture and Services Overview
- Delivery of computing services over the internet.
- Services include: storage, databases, networking, software.
- On-demand self-service: Provision resources without human intervention.
- Broad network access: Available via the internet from any device.
- Resource pooling: Multiple customers share resources dynamically.
- Rapid elasticity: Scale up or down based on demand.
- Measured service: Pay-as-you-go pricing model.
- IaaS (Infrastructure as a Service): Provides virtualized computing resources.
- PaaS (Platform as a Service): Offers platforms for application development.
- SaaS (Software as a Service): Delivers applications over the internet.
- Public Cloud: Shared by multiple users; accessible via the internet.
- Private Cloud: Dedicated infrastructure for a single organization.
- Hybrid Cloud: Combines public and private clouds.
- Community Cloud: Shared among organizations with similar requirements.
- Cost efficiency (no upfront hardware investment).
- Scalability and elasticity.
- Disaster recovery and business continuity.
- Accessibility and collaboration.
- Data security and privacy.
- Downtime risks.
- Compliance with regulations.
- Vendor lock-in concerns.
Answer: Cloud computing is the delivery of various computing services (e.g., servers, storage, databases) over the internet, allowing users to access and pay for only what they use.
Answer:
- On-demand self-service: Provision resources automatically.
- Broad network access: Access services from anywhere with internet.
- Resource pooling: Shared resources for efficiency.
- Rapid elasticity: Adjust resources to match workload changes.
- Measured service: Billing based on actual usage.
Answer:
- IaaS (e.g., AWS EC2): Provides virtual servers and storage for users to manage.
- PaaS (e.g., AWS Elastic Beanstalk): Offers platforms for developers to deploy applications.
- SaaS (e.g., Google Workspace): Delivers ready-to-use software applications.
Answer:
- Public Cloud: Accessible to anyone; lower cost but less control.
- Private Cloud: Exclusive use for one organization; more secure.
- Hybrid Cloud: Combines public and private, offering flexibility.
Answer: Resource pooling ensures efficient use of infrastructure by sharing resources among multiple users, reducing costs and enabling scalability.
Answer:
- Data Security: Risk of breaches and unauthorized access.
- Downtime: Dependence on internet availability and provider uptime.
- Compliance: Meeting regulations like GDPR or HIPAA.
- Vendor Lock-in: Difficulty switching providers due to compatibility issues.
- Leading cloud service provider offering scalable and flexible solutions.
- Covers computing, storage, databases, networking, and more.
- Compute: EC2 (Elastic Compute Cloud), Lambda.
- Storage: S3 (Simple Storage Service), EBS (Elastic Block Store).
- Databases: RDS, DynamoDB, Aurora.
- Networking: VPC (Virtual Private Cloud), Route 53.
- Provides resizable compute capacity in the cloud.
- Types of instances: General purpose, Compute optimized, Storage optimized, etc.
- Features: Auto-scaling, Elastic Load Balancer, Key Pairs.
- Object storage service with virtually unlimited capacity.
- Use cases: Backup and restore, data archiving, static website hosting.
- Features: Buckets, versioning, access control, lifecycle policies.
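The snippet below is a minimal boto3 sketch of these S3 features; the bucket name, region, and object key are placeholder assumptions. It creates a bucket, enables versioning, and uploads one object.

```python
import boto3

# Placeholder bucket name and region; adjust to your own account.
s3 = boto3.client("s3", region_name="us-east-1")

# Create a bucket (bucket names are globally unique).
s3.create_bucket(Bucket="example-analytics-bucket")

# Enable versioning so overwritten or deleted objects can be recovered.
s3.put_bucket_versioning(
    Bucket="example-analytics-bucket",
    VersioningConfiguration={"Status": "Enabled"},
)

# Objects are addressed by bucket + key, not by file-system paths.
s3.put_object(
    Bucket="example-analytics-bucket",
    Key="raw/2024/sales.csv",
    Body=b"order_id,amount\n1,9.99\n",
)
```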
- Manages access to AWS resources securely.
- Components: Users, Groups, Policies, Roles.
- Provides fine-grained access control.
- Allows users to create an isolated virtual network.
- Components: Subnets, Route tables, Gateways, Security groups.
- Provides full control over networking.
- CloudWatch: Monitors AWS resources and applications.
- CloudTrail: Logs AWS API calls for auditing and compliance.
- AWS is responsible for "security of the cloud" (infrastructure).
- Customers are responsible for "security in the cloud" (data, permissions).
Answer:
Amazon EC2 provides scalable, resizable compute capacity in the cloud.
Use Cases:
- Hosting web applications.
- Running batch processing jobs.
- Supporting distributed applications.
Answer:
- S3: Object storage for scalable, unstructured data storage (e.g., backups, data lakes).
- EBS: Block storage for persistent data attached to EC2 instances (e.g., OS storage, databases).
Answer:
- Users: Individuals with specific credentials.
- Groups: Collections of users sharing permissions.
- Policies: JSON documents defining access permissions.
- Roles: Temporary access for AWS services or external entities.
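To illustrate policies as JSON documents, here is a minimal boto3 sketch; the policy name, group name, and bucket ARN are hypothetical. It creates a least-privilege read-only policy and attaches it to a group.

```python
import json
import boto3

iam = boto3.client("iam")

# A policy is a JSON document; this one grants read-only access to a
# single hypothetical bucket, an example of least privilege.
policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::example-analytics-bucket",
                "arn:aws:s3:::example-analytics-bucket/*",
            ],
        }
    ],
}

response = iam.create_policy(
    PolicyName="AnalyticsReadOnly",
    PolicyDocument=json.dumps(policy_document),
)

# Attach the policy to a group so every user in it inherits the permission.
iam.attach_group_policy(
    GroupName="analysts",
    PolicyArn=response["Policy"]["Arn"],
)
```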
Answer:
AWS VPC allows users to create isolated virtual networks, providing control over IP ranges, subnets, routing tables, and security groups. It ensures secure networking for AWS resources.
Answer:
- CloudWatch: Monitors operational metrics (e.g., CPU usage, disk I/O).
- CloudTrail: Logs API activity for auditing and compliance.
Answer:
AWS is responsible for infrastructure security (e.g., hardware, network). Customers are responsible for managing data, user access, and application security.
Answer: Lifecycle policies automate the transition of data between S3 storage classes or delete objects after a set period. This reduces storage costs.
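A sketch of such a lifecycle policy with boto3; the bucket name, prefix, and day counts are illustrative assumptions.

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical rule: move objects under the "logs/" prefix to Glacier
# after 90 days, then delete them after 365 days.
s3.put_bucket_lifecycle_configuration(
    Bucket="example-analytics-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-then-expire-logs",
                "Status": "Enabled",
                "Filter": {"Prefix": "logs/"},
                "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
                "Expiration": {"Days": 365},
            }
        ]
    },
)
```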
Answer: Auto-scaling automatically adjusts the number of EC2 instances based on demand, ensuring optimal performance during traffic spikes while reducing costs during low activity.
Answer: IAM provides fine-grained access control, enabling organizations to grant the least privilege necessary for resources, reducing risks of unauthorized access.
Answer:
- General Purpose: Balanced performance (e.g., t2.micro).
- Compute Optimized: For high-performance computing (e.g., c5.large).
- Memory Optimized: For in-memory databases (e.g., r5.large).
- Storage Optimized: For intensive read/write workloads (e.g., i3.large).
- Definition: A data warehouse is a centralized repository for structured, historical data used for analysis and decision-making.
- Features: Subject-oriented, integrated, time-variant, and non-volatile.
- Purpose: Support business intelligence, reporting, and data mining.
- Data Sources: Operational databases, flat files, external systems.
- Staging Area: Prepares raw data through ETL processes.
- Data Warehouse: Central repository for cleansed, consolidated data.
- Data Marts: Subsets of data tailored for specific departments or functions.
- Extract: Collect data from multiple sources.
- Transform: Cleanse, validate, and convert data into a consistent format.
- Load: Store transformed data into the warehouse.
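A toy pure-Python sketch of the three ETL stages; the source data and field names are invented for illustration, and the "load" step just prints instead of writing to a warehouse.

```python
import csv
import io

# Toy "source" standing in for an operational export (assumed format).
raw = "order_id,amount,country\n1, 9.99 ,nl\n2,19.50,DE\n"

# Extract: read rows from the source.
rows = list(csv.DictReader(io.StringIO(raw)))

# Transform: cleanse and standardize (strip whitespace, normalize casing,
# convert amounts to a consistent numeric type).
cleaned = [
    {
        "order_id": int(r["order_id"]),
        "amount": float(r["amount"].strip()),
        "country": r["country"].strip().upper(),
    }
    for r in rows
]

# Load: a real pipeline would insert into the warehouse here.
for record in cleaned:
    print(record)
```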
- Top-Down Approach: Centralized warehouse created first, followed by data marts.
- Bottom-Up Approach: Independent data marts created first, integrated later.
- OLAP (Online Analytical Processing): Optimized for data analysis, uses multidimensional models (e.g., star schema).
- OLTP (Online Transaction Processing): Handles daily transactional operations, uses relational models.
- Star Schema: Central fact table connected to denormalized dimension tables.
- Snowflake Schema: Dimension tables further normalized.
- Centralized data for consistent reporting.
- Historical data storage for trend analysis.
- Faster query performance for complex analytics.
- High implementation cost.
- Complex ETL processes.
- Data quality and integration issues.
Answer:
A data warehouse is a centralized repository that integrates data from multiple sources for analysis. It supports business intelligence, trend analysis, and decision-making by providing consistent, historical data.
Answer:
- Subject-Oriented: Organized around key business subjects like sales or customers.
- Integrated: Consolidates data from various sources into a unified format.
- Time-Variant: Stores historical data, enabling trend analysis over time.
- Non-Volatile: Data is stable and not subject to frequent changes.
Answer:
- OLAP:
  - Purpose: Data analysis and reporting.
  - Schema: Star or snowflake.
  - Example: Sales trend analysis.
- OLTP:
  - Purpose: Daily transaction processing.
  - Schema: Normalized relational models.
  - Example: Online order processing.
Answer:
- Extract: Retrieve data from source systems.
- Transform: Clean, standardize, and validate data.
- Load: Insert the processed data into the data warehouse.
Answer:
- Top-Down Approach:
  - Centralized warehouse created first, followed by departmental data marts.
  - Advantage: Strong consistency across the organization.
  - Disadvantage: High cost and longer implementation time.
- Bottom-Up Approach:
  - Data marts are built first and later integrated into a central warehouse.
  - Advantage: Faster and cost-effective for small organizations.
  - Disadvantage: Risk of inconsistent data.
Answer:
- Star Schema: A denormalized structure with:
  - Fact Table: Contains metrics like sales amount or quantities.
  - Dimension Tables: Describe dimensions like time, location, or product.
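To make the structure concrete, here is a sketch of star-schema DDL submitted via the Redshift Data API with boto3; the cluster, database, user, and table names are placeholders.

```python
import boto3

# One central fact table whose foreign keys reference denormalized
# dimension tables: the defining shape of a star schema.
statements = [
    "CREATE TABLE dim_date (date_id INT PRIMARY KEY, full_date DATE, year INT);",
    "CREATE TABLE dim_product (product_id INT PRIMARY KEY, name VARCHAR(100), category VARCHAR(50));",
    """CREATE TABLE fact_sales (
        sale_id    BIGINT,
        date_id    INT REFERENCES dim_date(date_id),
        product_id INT REFERENCES dim_product(product_id),
        quantity   INT,
        amount     DECIMAL(10, 2)
    );""",
]

boto3.client("redshift-data").batch_execute_statement(
    ClusterIdentifier="example-cluster",  # placeholder cluster name
    Database="analytics",
    DbUser="admin",
    Sqls=statements,
)
```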
Answer:
- Improved decision-making through unified, historical data.
- Faster analytics with optimized query performance.
- Simplifies reporting by providing a single source of truth.
Answer:
- Complex ETL processes.
- Maintaining data quality and consistency.
- High implementation and maintenance costs.
Answer:
Data marts store subsets of data tailored for specific departments or functions, such as finance or marketing, enabling faster and more focused analysis.
Answer:
The staging area temporarily holds raw data from source systems for cleansing and transformation before loading it into the data warehouse.
- Fully managed, scalable data warehouse service.
- Optimized for analytics and business intelligence workloads.
- Supports petabyte-scale data storage and querying.
- Columnar Storage: Stores data in columns for faster analytical queries.
- Massively Parallel Processing (MPP): Distributes data and queries across multiple nodes.
- Compression: Reduces storage costs by compressing columnar data.
- Concurrency Scaling: Handles spikes in query loads.
- Integration with AWS Services: Seamlessly connects to S3, Glue, and QuickSight.
- Distribution Styles:
  - KEY: Based on the values of a specified column.
  - EVEN: Data evenly distributed across nodes.
  - ALL: Replicates the entire table on all nodes (for small tables).
- Optimizes data placement for query performance.
- Sort Keys: Determines the order of data storage for faster queries.
- Distribution Keys: Ensures efficient data placement across nodes.
- Vacuuming and Analyzing: Keeps data organized and up-to-date.
- Sources: S3, DynamoDB, Kinesis, or on-premises databases.
- COPY Command: Efficiently loads data from S3 or other sources.
- Best Practices: Use compression and split data into smaller files.
- Node Types:
  - Leader Node: Manages query execution and result aggregation.
  - Compute Nodes: Store data and execute queries.
- Scaling clusters to handle growing workloads.
- Encryption at rest and in transit.
- Role-based access control via IAM.
- VPC for network isolation.
- BI and reporting.
- Big data analytics.
- Data consolidation from multiple sources.
Answer:
Amazon Redshift is a fully managed data warehouse service optimized for large-scale data analytics.
Use Cases:
- Business intelligence and reporting.
- Analyzing historical data for trends.
- Consolidating data from multiple sources for a unified view.
Answer:
- Columnar Storage: Redshift stores data by columns instead of rows, enabling faster read performance for analytical queries.
- Massively Parallel Processing (MPP): Queries are distributed across multiple nodes, allowing parallel execution and faster results.
Answer:
- KEY: Data is distributed based on the values in a specified column, ensuring related rows are stored on the same node.
- EVEN: Data is evenly distributed across all nodes for uniform workloads.
- ALL: Replicates entire tables across all nodes, useful for small tables used in many joins.
Answer:
Sort keys determine the order of data storage. Queries filtering or aggregating data based on the sort key can skip unnecessary rows, improving performance.
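For illustration, a sketch of a table definition combining a distribution key and a sort key, submitted through the Redshift Data API; all identifiers (cluster, database, user, table, columns) are placeholders.

```python
import boto3

# Table design sketch: DISTKEY controls placement across nodes,
# SORTKEY controls on-disk ordering within each node.
ddl = """
CREATE TABLE sales (
    sale_id     BIGINT,
    customer_id BIGINT,
    sale_date   DATE,
    amount      DECIMAL(10, 2)
)
DISTKEY (customer_id)   -- co-locate a customer's rows on one node for joins
SORTKEY (sale_date);    -- range filters on date can skip whole blocks
"""

boto3.client("redshift-data").execute_statement(
    ClusterIdentifier="example-cluster",  # placeholder cluster name
    Database="analytics",
    DbUser="admin",
    Sql=ddl,
)
```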
Answer:
- Upload the data to an S3 bucket.
- Grant Redshift access to the S3 bucket using an IAM role.
- Use the COPY command to load the data:

```sql
COPY table_name
FROM 's3://bucket-name/file'
IAM_ROLE 'arn:aws:iam::account-id:role/role-name'
FORMAT AS CSV;
```
Answer:
- Leader Node: Manages query parsing, optimization, and result aggregation.
- Compute Nodes: Execute queries and store data distributed across them.
Answer:
- Encryption: Encrypts data at rest (e.g., with AWS KMS) and in transit (e.g., TLS).
- IAM Integration: Role-based access control.
- VPC: Provides network isolation.
Answer:
- Choose appropriate distribution keys and sort keys.
- Regularly VACUUM and ANALYZE tables.
- Use compression to reduce I/O and storage costs.
- Split data into smaller files for faster ingestion.
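A small sketch of running the VACUUM/ANALYZE maintenance through the Redshift Data API with boto3; the cluster, database, user, and table names are placeholders.

```python
import boto3

client = boto3.client("redshift-data")

# VACUUM re-sorts rows and reclaims space; ANALYZE refreshes the
# statistics the query planner relies on.
for sql in ["VACUUM sales;", "ANALYZE sales;"]:
    client.execute_statement(
        ClusterIdentifier="example-cluster",  # placeholder cluster name
        Database="analytics",
        DbUser="admin",
        Sql=sql,
    )
```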
Answer:
Concurrency scaling allows Redshift to add temporary clusters during high query loads, ensuring consistent performance without delays.
Answer:
- Redshift: Cloud-based, scalable, and fully managed. Supports columnar storage and MPP.
- On-Premises: Requires hardware investment and manual management. Typically row-based storage.
- Fully managed Extract, Transform, Load (ETL) service.
- Automates schema discovery, data transformation, and job execution.
- Key components: Glue Data Catalog, Crawlers, Jobs, and Triggers.
- Data Catalog: Centralized metadata repository.
- Crawlers: Automatically detect and infer schema for datasets.
- Jobs: Perform data transformations and loading.
- Triggers: Automate workflows based on schedules or events.
- ETL workflows for cleaning and transforming data.
- Data integration from various sources to S3, Redshift, or RDS.
- Schema discovery and metadata management.
- Serverless query service for analyzing data stored in S3.
- Uses standard SQL for querying.
- No infrastructure management; pay-per-query pricing.
- Supports various data formats (e.g., CSV, Parquet, JSON).
- Integrates with AWS Glue Data Catalog for schema management.
- Supports partitioning for query optimization.
- Glue: Focuses on ETL and data preparation.
- Athena: Focuses on querying and analyzing prepared data.
- Integration: Glue prepares data and registers schemas; Athena queries it.
- Building data lakes with S3.
- Ad-hoc querying for data exploration and reporting.
- Data preprocessing for machine learning or analytics.
- Use Glue Crawlers for schema automation.
- Partition data in S3 for faster queries in Athena.
- Optimize file formats (e.g., Parquet) to reduce query costs.
Answer: AWS Glue is a fully managed ETL service for automating data preparation.
Main Components:
- Data Catalog: Metadata repository for datasets.
- Crawlers: Detect and infer schema automatically.
- Jobs: Execute ETL workflows.
- Triggers: Schedule or automate workflows.
Answer: Crawlers scan datasets (e.g., in S3) to infer schema and create or update tables in the Glue Data Catalog. This reduces manual effort in defining data structures.
Answer: The Glue Data Catalog is a centralized repository for metadata, including table definitions and schema. It ensures consistent data management and integrates seamlessly with Athena.
Answer: Athena queries data stored in S3 using SQL. It directly interacts with the Glue Data Catalog for schema information, enabling serverless and scalable analytics.
Answer:
- Glue: Focused on ETL for data preparation and integration.
- Athena: Designed for querying and analyzing prepared data.
Glue is typically used to preprocess data, while Athena provides analysis capabilities.
Answer:
- Use Glue Crawlers to discover schemas and populate the Data Catalog.
- Run Glue Jobs to clean, transform, and store data in S3.
- Query the prepared data in S3 using Athena with SQL.
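A minimal boto3 sketch of the first step, creating and starting a Glue crawler; the crawler name, role ARN, catalog database, and S3 path are placeholders.

```python
import boto3

glue = boto3.client("glue")

# Create a crawler that scans an S3 prefix and registers the inferred
# schema in the Glue Data Catalog.
glue.create_crawler(
    Name="sales-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # placeholder ARN
    DatabaseName="analytics_catalog",
    Targets={"S3Targets": [{"Path": "s3://example-analytics-bucket/raw/"}]},
)

# Run it; when it finishes, the tables it creates are immediately
# queryable from Athena via the shared Data Catalog.
glue.start_crawler(Name="sales-crawler")
```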
Answer:
- Serverless architecture; no infrastructure management.
- Pay-per-query model reduces costs for infrequent usage.
- Supports multiple file formats and integrates with Glue for schema management.
Answer: Partitioning organizes data by specific keys (e.g., date), enabling Athena to scan only relevant data partitions instead of the entire dataset.
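For example, a boto3 sketch that runs a query against a partitioned table; the database, table, partition column, and output location are assumptions.

```python
import boto3

athena = boto3.client("athena")

# Filtering on the partition column (here assumed to be "dt") means
# Athena scans only the matching S3 prefixes, not the whole dataset.
response = athena.start_query_execution(
    QueryString=(
        "SELECT country, SUM(amount) AS revenue "
        "FROM sales WHERE dt = '2024-01-15' "
        "GROUP BY country"
    ),
    QueryExecutionContext={"Database": "analytics_catalog"},
    ResultConfiguration={
        "OutputLocation": "s3://example-analytics-bucket/athena-results/"
    },
)
print(response["QueryExecutionId"])
```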
Answer: These columnar file formats reduce storage requirements and improve query performance by minimizing data scanned.
Answer:
- Building data lakes for scalable storage and querying.
- Ad-hoc analytics on raw or processed data in S3.
- ETL workflows for preprocessing data for reporting or machine learning.
- Cloud-based Business Intelligence (BI) service.
- Creates interactive dashboards and data visualizations.
- Accessible from any device with a web browser.
- Interactive Dashboards: Real-time, visually rich analytics.
- SPICE Engine: Super-fast, Parallel, In-memory Calculation Engine for faster query execution.
- Natural Language Querying: Amazon Q allows users to ask questions in plain language.
- Integration: Works with AWS services like Redshift, Athena, S3, RDS, and other data sources.
- Machine Learning Insights: Offers anomaly detection and forecasting.
- Sharing & Collaboration: Share dashboards securely with team members.
- Connect Data Sources (e.g., S3, Redshift, Athena).
- Import and Prepare Data: Use datasets and SPICE for analysis.
- Create Visualizations: Charts, tables, and graphs.
- Publish Dashboards: Monitor metrics and aid decision-making.
- Stores data in-memory for faster querying.
- Enables fast performance for multiple users querying the same data.
- Scales efficiently to handle large datasets.
- Redshift: Direct connection for large-scale analytics.
- Athena: Queries raw data stored in S3.
- S3: Stores data for visualization.
- Glue: Prepares and catalogs data with the Data Catalog.
- Scalability: Adapts to growing data and user requirements.
- Cost-Effective: Pay-per-session pricing.
- Ease of Use: Intuitive interface for non-technical users.
- Security: Role-based access control with IAM integration.
- Business Reporting and Dashboards.
- Real-Time Operational Monitoring.
- Ad-Hoc Analysis and Predictive Modeling.
- Use SPICE for high-performance dashboards.
- Optimize datasets for efficient querying.
- Leverage machine learning features for advanced analytics.
Answer:
- Amazon QuickSight is a cloud-based BI service for creating dashboards and visualizations.
- Interactive dashboards with real-time insights.
- SPICE engine for fast, in-memory data processing.
- Natural language querying with Amazon Q.
- Seamless integration with AWS services like Redshift and Athena.
Answer:
- SPICE (Super-fast, Parallel, In-memory Calculation Engine) stores data in-memory, enabling:
  - Faster queries.
  - Support for multiple users querying the same dataset simultaneously.
Answer:
- Connect to a data source (e.g., S3, Redshift).
- Prepare and import data into SPICE.
- Create visualizations using charts, graphs, or tables.
- Publish and share the dashboard with stakeholders.
Answer:
- Redshift: Provides direct connectivity for querying large-scale data.
- Athena: Analyzes raw data stored in S3 using SQL.
- Glue: Prepares and catalogs data for analysis.
- S3: Stores data files for visualization.
Answer:
- Amazon Q allows users to interact with data by asking questions in natural language (e.g., "What were last month's sales?").
- Simplifies analytics for non-technical users.
Answer:
- QuickSight uses a pay-per-session pricing model:
- Users are charged only for active sessions.
- Cost-effective for organizations with varying usage levels.
Answer:
- Bar charts, line graphs, and pie charts.
- Heat maps and scatter plots.
- Tables and pivot tables.
- Geographic maps for spatial data.
Answer:
- Integrates with CloudWatch and other AWS services.
- Creates dashboards that display metrics and alerts in real-time.
- Enables effective operational monitoring.
Answer:
| SPICE | Live Querying |
|---|---|
| Stores data in-memory for faster access. | Queries data directly from the source. |
| Ideal for frequently queried datasets. | Ensures up-to-date results. |
| High performance for multiple users. | Dependent on the source system's performance. |
Answer:
- Business Intelligence Dashboards for executives.
- Operational Monitoring of real-time metrics.
- Analyzing historical trends and forecasting future outcomes.
Diagrams include EC2 Overview, AMI Workflow, EC2 Lifecycle, VPC Network - VPC Architecture Diagram, VPC Architecture with Subnets, VPC Region and IP Range, Subnet Configuration, ACL (Access Control List) in VPC, Security Groups, Storage Types Overview, Block vs. Object Storage, S3 URL Breakdown, CloudFormation Workflow, Load Balancer Target Groups, Autoscaling & Load Balancing, and Advanced Autoscaling & Load Balancing.
The image provides an overview of Amazon EC2 (Elastic Compute Cloud), focusing on instance types and sizes.
Amazon EC2 offers several instance types tailored for different workloads:
- General Purpose:
  - Balanced workloads for computing, memory, and networking.
  - Example Use Cases: Web servers, development/test environments.
- Compute Optimized:
  - For compute-heavy applications requiring high performance.
  - Example Use Cases: Machine learning, high-performance web servers.
- Memory Optimized:
  - Ideal for memory-intensive applications.
  - Example Use Cases: In-memory databases, real-time big data processing.
- Accelerated Computing:
  - Designed for tasks requiring specialized hardware like GPUs.
  - Example Use Cases: Machine learning training, graphic rendering.
- Storage Optimized:
  - Suited for high I/O workloads needing fast, large-scale storage.
  - Example Use Cases: Data warehouses, distributed file systems.
- HPC Optimized:
  - High-performance computing for scientific modeling and simulations.
  - Example Use Cases: Genomics, financial modeling.
Instances are categorized by size, represented by a combination of a type prefix and a size suffix (e.g., `t3.medium`, `a1.large`):
- Prefix (e.g., "t3"): Indicates the instance type.
- Suffix (e.g., "medium"): Determines the CPU and memory configuration.
The table provides examples of virtual CPUs (vCPUs) and their corresponding capabilities for different instance types:
| Instance Type | vCPU | Memory (GiB) | Special Notes |
|---|---|---|---|
| t3.micro | 2 | 1 | General-purpose, burstable CPU. |
| c5.large | 2 | 4 | Compute-optimized. |
| r5.xlarge | 4 | 32 | Memory-optimized. |
| g4dn.xlarge | 4 | 16 | Accelerated computing with GPUs. |
| i3.large | 2 | 15.25 | Storage-optimized with NVMe SSD. |
| hpc6a.48xlarge | 96 | 384 | HPC-optimized for high scalability. |
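Figures like these can be looked up programmatically; below is a small boto3 sketch (assuming configured AWS credentials) that queries the EC2 API for published instance specifications.

```python
import boto3

ec2 = boto3.client("ec2")

# Fetch the published specs for a few instance types to compare the
# vCPU and memory figures shown in the table above.
response = ec2.describe_instance_types(
    InstanceTypes=["t3.micro", "c5.large", "r5.xlarge"]
)
for it in response["InstanceTypes"]:
    print(
        it["InstanceType"],
        it["VCpuInfo"]["DefaultVCpus"],
        it["MemoryInfo"]["SizeInMiB"],  # reported in MiB
    )
```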
- An Amazon Machine Image (AMI) is a pre-configured template used to launch EC2 instances.
- Includes essential components:
- Operating system.
- Application server.
- Pre-installed applications.
- Eliminates the need for manual operating system installation and server setup, enabling faster deployments.
- Demonstrates how an AMI is used to create an EC2 instance.
- The instance includes:
- Root Volume: Attached storage for the instance.
- Highlights the reusability of AMIs:
- AMI #1: Used to create Instance #1.
- New AMI: After modifications to Instance #1, a new AMI is created.
- Reusability: The new AMI is then used to launch additional instances.
- Streamlined Instance Launching: Pre-configured templates save time and reduce manual setup.
- Consistency: Reusable AMIs ensure uniformity across instances.
- Version Control: Easily update and save new versions of AMIs after modifications.
- Scalability: Launch multiple instances from the same AMI for large-scale applications.
Amazon EC2 instances go through several lifecycle states, each with distinct characteristics and implications for billing and resource management:
- Pending:
  - The instance is in the process of launching.
  - Billing has not yet started.
- Running:
  - The instance is active and operational.
  - Billing begins in this state.
- Stopping/Stopped:
  - The instance transitions to a stopped state.
  - For EBS-backed instances, data on the root volume is retained.
  - Billing for compute stops, but charges for storage continue.
- Rebooting:
  - A temporary restart of the instance.
  - No data is lost during this phase.
  - Billing continues without interruption.
- Terminated:
  - The instance is permanently deleted.
  - Associated resources, such as storage volumes, are released (unless marked to persist).
Understanding the lifecycle states of EC2 instances is essential for:
- Cost Management: Ensure unused instances are stopped or terminated to minimize costs.
- Resource Optimization: Manage instance states efficiently to match application requirements.
- Data Retention: Recognize when data persists (e.g., stopped instances with EBS) versus when it is permanently deleted (e.g., terminated instances).
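A minimal boto3 sketch of these state transitions; the instance ID is a placeholder.

```python
import boto3

ec2 = boto3.client("ec2")
instance_ids = ["i-0123456789abcdef0"]  # placeholder instance ID

# Stop: compute billing ends, but the EBS root volume (and its storage
# charges) persists, so the instance can be started again later.
ec2.stop_instances(InstanceIds=instance_ids)

# Start: the instance returns to the running (billable) state.
ec2.start_instances(InstanceIds=instance_ids)

# Terminate: permanent deletion; attached volumes are released unless
# their DeleteOnTermination flag was set to False.
ec2.terminate_instances(InstanceIds=instance_ids)
```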
- A Virtual Private Cloud (VPC) is a logically isolated network within AWS.
- Enables complete control over networking, including:
- Subnets: Logical subdivisions of the VPC.
- Security Groups: Act as virtual firewalls for instance-level access control.
- Network Access Configurations: Define routing and internet access.
- Public Subnet:
  - Hosts resources accessible from the internet, such as EC2 instances.
  - Uses an Internet Gateway (IGW) for connectivity.
- Private Subnet:
  - Isolated from direct internet access.
  - Ideal for sensitive resources like databases (e.g., Amazon RDS).
  - Internet access for private resources is managed via a NAT Gateway or NAT Instance.
- Availability Zones (AZs):
  - Two zones (e.g., AZ A and AZ B) are shown for redundancy and fault tolerance.
  - Resources are distributed across AZs to improve high availability.
- Access Control:
  - Security Groups: Define inbound and outbound traffic rules for instances.
  - Network ACLs (NACLs): Optional, subnet-level rules for additional traffic control.
- Corporate Connectivity:
  - Integration with on-premises data centers is shown via:
    - VPN: Secure connections over the internet.
    - AWS Direct Connect: Dedicated, high-speed private connections.
- Customizable Networking:
  - Configure IP ranges, subnets, route tables, and gateways.
- Secure Resource Isolation:
  - Resources are isolated at the network level for enhanced security.
- Seamless Hybrid Connectivity:
  - Combine on-premises data centers with AWS resources.
- Default route table for the VPC.
- Handles local traffic:
- Example: Traffic within the VPC (e.g., 10.1.0.0/16) is routed locally.
- Does not allow internet access.
- A user-defined route table with additional configurations.
- Allows internet access:
- Example: Routes traffic to an Internet Gateway (IGW) for external communication.
- Destination: `0.0.0.0/0` (represents all IP addresses outside the VPC).
- Connected to the Internet Gateway for external access.
- Resources in this subnet (e.g., EC2 instances) are accessible from the internet.
- Isolated from direct internet access.
- Traffic remains within the VPC or uses a NAT Gateway (if configured) for outbound internet access.
- Red X: Represents a blocked pathway for internet traffic directly from the private subnet.
- Main Route Table: Restricts traffic to remain within the VPC.
- Custom Route Table: Enables internet access via an Internet Gateway.
- Subnet Differentiation:
- Public Subnet: Internet-enabled.
- Private Subnet: Internet-isolated.
- Public Subnet:
  - CIDR Block: `10.1.1.0/24`.
  - Supports up to 256 IP addresses (AWS reserves 5 addresses per subnet, so 251 are usable).
  - Allows internet access via an Internet Gateway.
- Private Subnet:
  - CIDR Block: `10.1.3.0/24`.
  - Supports up to 256 IP addresses.
  - Isolated from the internet for hosting secure resources like databases.
- Both subnets are distinct subsets of the VPC's IP range (`10.1.0.0/16`).
- Ensures there is no overlap, avoiding conflicts in resource allocation.
- Both subnets are deployed in Zone A of the selected AWS region.
- Regional zones improve redundancy and fault tolerance for resources.
| Subnet Type | Internet Access | Use Case |
|---|---|---|
| Public | Yes (via Internet Gateway) | Hosting public-facing resources like EC2 instances or web servers. |
| Private | No (isolated) | Securely hosting sensitive resources like databases or backend applications. |
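As a quick check of the CIDR arithmetic above, the following standard-library Python snippet verifies the subnet sizes, containment, and the no-overlap property; the addresses are taken from the diagram.

```python
import ipaddress

vpc = ipaddress.ip_network("10.1.0.0/16")
public_subnet = ipaddress.ip_network("10.1.1.0/24")
private_subnet = ipaddress.ip_network("10.1.3.0/24")

print(vpc.num_addresses)                       # 65536 addresses in the VPC
print(public_subnet.num_addresses)             # 256 addresses per /24
print(public_subnet.subnet_of(vpc))            # True: subnet lies inside the VPC
print(public_subnet.overlaps(private_subnet))  # False: no allocation conflicts
```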
- The VPC is deployed in the AWS region N. Virginia (`us-east-1`).
- Regional placement ensures optimized latency and redundancy for resources.
- The VPC is assigned the CIDR block `10.1.0.0/16`.
  - Supports up to 65,536 IP addresses within this range.
  - Provides flexibility to define smaller subnets within the VPC.
- Demonstrates the scope of available IP addresses for:
- Subnets.
- EC2 instances.
- Other resources within the VPC.
- Ensures sufficient IP allocation for scaling applications and supporting complex architectures.
- Public Subnet:
- Allow inbound HTTP/HTTPS.
- Allow inbound SSH.
- Private Subnet:
- Allow inbound from public subnet for application traffic.
- Deny all other inbound traffic.
- A Network Access Control List (ACL) is a network-level security mechanism in AWS.
- Operates at the subnet boundary, providing an additional layer of security.
- Configured to allow or deny specific traffic based on defined rules.
- Stateless: Inbound and outbound rules are evaluated separately.
- Subnet-level Protection: ACLs apply to all resources within a subnet.
- Rule Evaluation: Rules are evaluated in numerical order (from lowest to highest).
- Public Subnet:
- Used for internet-facing resources.
- ACL rules permit inbound HTTP/HTTPS traffic and SSH access.
- Private Subnet:
- Hosts sensitive resources like databases.
- ACL rules restrict inbound traffic, allowing only necessary communication from the public subnet.
- Two zones (A and B) provide redundancy:
- Each zone contains both public and private subnets.
- ACLs are applied consistently across all subnets for uniform security.
- Enhanced Security: Adds an extra layer of protection alongside Security Groups.
- Granular Traffic Control: Allows fine-tuning of allowed and denied traffic at the subnet level.
- Support for Multi-Subnet Architectures: Ensures secure communication between public and private subnets in multi-AZ deployments.
- A Security Group acts as a virtual firewall for EC2 instances.
- Controls inbound and outbound traffic for individual instances.
- Operates at the instance level, offering fine-grained control over traffic.
- Stateful: Automatically allows return traffic for permitted inbound or outbound requests.
- Instance-Level Protection: Unlike ACLs, security groups are applied directly to EC2 instances.
- Customizable Rules: Each rule specifies:
  - Allowed ports (e.g., 22 for SSH, 80 for HTTP).
  - Allowed protocols (e.g., TCP, UDP).
  - Allowed IP ranges or specific IPs.
- EC2 instances hosted in public subnets are protected by security groups.
- Typical security group rules for public subnets:
- Allow inbound HTTP/HTTPS traffic (ports 80 and 443) from the internet.
- Allow inbound SSH access (port 22) from a specific IP range for management purposes.
- Deny all other inbound traffic by default.
- EC2 instances in private subnets also have security groups.
- Typical security group rules for private subnets:
- Allow inbound traffic only from instances in the public subnet (e.g., application servers).
- Deny all other inbound traffic.
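An illustrative boto3 sketch of the public-subnet rules above; the security group ID and the admin CIDR range are placeholders.

```python
import boto3

ec2 = boto3.client("ec2")

# Hypothetical web-tier security group: HTTP/HTTPS open to the internet,
# SSH restricted to an admin network.
ec2.authorize_security_group_ingress(
    GroupId="sg-0123456789abcdef0",  # placeholder group ID
    IpPermissions=[
        {"IpProtocol": "tcp", "FromPort": 80, "ToPort": 80,
         "IpRanges": [{"CidrIp": "0.0.0.0/0"}]},
        {"IpProtocol": "tcp", "FromPort": 443, "ToPort": 443,
         "IpRanges": [{"CidrIp": "0.0.0.0/0"}]},
        {"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22,
         "IpRanges": [{"CidrIp": "203.0.113.0/24"}]},  # admin range only
    ],
)
# Because security groups are stateful, no outbound rule is needed for
# responses to this permitted inbound traffic.
```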
- Enhanced Instance Security: Ensures that only trusted traffic reaches EC2 instances.
- Customizable Traffic Control: Tailor rules to specific application requirements.
- Simplified Management: Rules can be updated dynamically without disrupting instance operations.
- Cloud Networking Models:
  - The VPC and subnet diagrams illustrate networking within a cloud environment, tying back to deployment models:
    - Public Clouds: Represented by public subnets with internet access.
    - Private Clouds: Represented by private subnets, isolated for secure resources.
- EC2 Lifecycle:
  - The EC2 lifecycle diagram connects to foundational cloud computing characteristics:
    - On-Demand Self-Service: Users can start, stop, or terminate instances as needed.
    - Resource Pooling: EC2 instances utilize shared infrastructure dynamically.
- AWS EC2 and AMI:
  - Focused diagrams and workflows emphasize core AWS compute services:
    - EC2: Launching and managing virtual servers.
    - AMI: Streamlining instance creation with pre-configured templates.
- VPC and Networking:
  - Key AWS networking services are highlighted:
    - Routing: Custom route tables enable secure and efficient traffic management.
    - ACL: Subnet-level control adds an additional security layer.
    - Security Groups: Instance-level firewalls for fine-grained traffic control.
- Subnet Configuration:
  - Subnet configurations align with AWS regions and availability zones, demonstrating the importance of distributing resources across multiple zones for redundancy and fault tolerance.
- These diagrams and explanations bridge theoretical cloud computing concepts from Lecture 1 with practical AWS implementations covered in Lecture 2.
- They emphasize key cloud principles like network isolation, scalability, and security, while showcasing AWS's capabilities in enabling these features.
- These diagrams and explanations bridge theoretical cloud computing concepts from Lecture 1 with practical AWS implementations covered in Lecture 2.
- They emphasize key cloud principles like network isolation, scalability, and security, while showcasing AWS's capabilities in enabling these features.
- ELB: Distributes traffic across instances in multiple subnets.
- Autoscaling: Automatically adjusts instance count based on load.
- Subnets: Ensure high availability by using multiple AZs.
- DynamoDB: NoSQL database for scalable storage.
- S3: Object storage for data backup and archival.
- CloudWatch: Monitoring and alerting service.
| Storage Type | Use Case | Access Type | Durability |
|---|---|---|---|
| S3 | Backup, archival | Object | 99.999999999% |
| Glacier | Long-term archival | Object | High |
| EBS | Persistent instance storage | Block | High |
| EFS | Shared file systems | File | High |
- Block Storage: EBS, allows incremental updates to files.
- Object Storage: S3, stores data as objects with metadata.
- Example: `https://bucket-name.s3.amazonaws.com/object-key`
- Bucket Name: Unique identifier for your S3 storage.
- Object Key: Path to the stored object.
- Write a CloudFormation template.
- Upload template to AWS.
- Deploy infrastructure as defined in the template.
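A minimal sketch of this workflow with boto3 and an inline template; the stack name and the bucket it declares are placeholders.

```python
import boto3

# Minimal template (YAML as a string) declaring a single S3 bucket.
template = """
AWSTemplateFormatVersion: '2010-09-09'
Resources:
  AnalyticsBucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: example-analytics-bucket
"""

cfn = boto3.client("cloudformation")

# CloudFormation provisions (and can later update or delete) everything
# the template declares, as a single unit called a stack.
cfn.create_stack(StackName="analytics-demo", TemplateBody=template)
```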
- ELB sends requests to target groups.
- Target groups distribute traffic to specific EC2 instances.
- Multi-zone EC2 instances.
- ELB distributes traffic evenly across all zones.