Guidance for a Media Lake on AWS

Table of Contents

Overview

Cost

Base Services Cost Table

Usage Based Cost Example Table

Quick Deployment (CloudFormation)

Development Environment Setup and CDK Deployment

Clone the repository

Prepare the environment

Configure AWS account and region

Configuration Setup

Deploy using AWS CDK

Development

Prerequisites

Operating System

Deployment Validation

Running the Guidance

Login

Connect Storage

Enable Semantic Search and Integrations

Ingest Media

Process and Retrieve Assets

Next Steps

Project Structure

Key Components

Storage Connectors

Processing Pipelines

AWS Services Used

Security Features

Supported Media Types

Audio Files

Video Files

Image Files

Cleanup

Manual Cleanup (AWS Console)

FAQ, Known Issues, and Additional Considerations

Revisions

Notices

Authors

Overview

Guidance for a Media Lake on AWS provides a comprehensive, serverless, and scalable platform for media ingestion, processing, management, metadata management, and workflow orchestration on AWS. Media lake enables you to connect multiple storage sources, known as connectors, and to ingest and organize media at scale, creating a unified search space for your media. Workflows, known as pipelines, run customizable processing steps (such as proxy/thumbnail generation and AI enrichment) and integrate with both AWS native and partner services.

High-Level Overview

Diagram: Media Lake provides a comprehensive serverless platform connecting storage sources, processing pipelines, and enrichment services with secure user interfaces and API endpoints for scalable media management workflows.

Application Architecture

Diagram: Media lake application layer shows the React UI, API Gateway endpoints, Lambda functions, and data flow between Cognito authentication, DynamoDB storage, and OpenSearch indexing for user interactions and asset management.

Pipeline Execution and Deployment

Diagram: Media lake processes media through S3 ingestion, EventBridge routing, Lambda orchestration, Step Functions, and enrichment with metadata, search, and integration endpoints.

Cost

You are responsible for the cost of the AWS services used while running this Guidance.

Base Infrastructure Cost (without variable workloads): As of July 2025, the cost for running this Guidance with the small deployment configuration in the US East (N. Virginia) region is approximately $423.62 per month for the baseline services only.

Variable Workload Costs: Additional costs will be incurred based on actual usage:

Media processing and enrichment services (Lambda, Step Functions, MediaConvert, TwelveLabs, Transcription)
Media and Metadata storage (OpenSearch, DynamoDB, S3)
Interactions with the user interface and viewing media (CloudFront, Data Transfer Out, Step Functions, Lambda, OpenSearch and DynamoDB queries)

The total monthly cost will vary based on the volume of media processed, storage requirements, and usage patterns.

We recommend creating a Budget through AWS Cost Explorer to help manage costs. Prices are subject to change. For full details, refer to the pricing webpage for each AWS service used in this guidance.

Base Services Cost Table - OpenSearch Deployment

Service & Usage	How It Relates to Your Team's Usage	Estimated Monthly Cost (USD)
Cognito (Users)	50 active users signing in and using the system each month	$2.00
OpenSearch Service (Search)	Search and index storage and compute	t3.small: $28.72 (1 instance); Storage: $2.44 (gp3)
OpenSearch Ingestion (OSI)	Data ingestion processing units	$350.40 (2 OCUs)
NAT Gateway (VPC)	Outbound internet access from VPC	$33.30
WAF (Web Application Firewall)	API & web protection (rules + ACLs + requests)	WebACL: $5.00; Rules: $2.00
TOTAL	Monthly cost estimate for small deployment	$423.60

Base Services Cost Table - S3 Vectors Deployment

Service & Usage	How It Relates to Your Team's Usage	Estimated Monthly Cost (USD)
Cognito (Users)	50 active users signing in and using the system each month	$2.00
WAF (Web Application Firewall)	API & web protection (rules + ACLs + requests)	WebACL: $5.00; Rules: $2.00
TOTAL	Monthly cost estimate for small deployment	$9.00

Usage Based Cost Example Table

Service & Usage	How It Relates to Your Team’s Usage	Estimated Monthly Cost (USD)
S3 + Step Functions (Uploads)	1,000 new media files uploaded/month, each triggering a workflow	S3 storage: $23.55; Step Functions: $2.40
S3/CloudFront (Images, Audio, Video)	All users viewing/downloading images, audio, and video each month (aggregate, served via S3 and CloudFront)	S3 requests: $0.05 + $0.40; CloudFront data: $29.75; CloudFront requests: $0.03
Total Media Downloaded (S3/CloudFront)	About 350GB of media files viewed/downloaded per month	S3 data transfer out: $45.00
Web/App Requests (CloudFront)	25,000 clicks or page loads/month through CDN	$0.03
API Gateway (API Requests)	500,000 system actions/searches/uploads per month	$1.75
Lambda (All Automated and API Processing)	All automated backend tasks, API logic, file processing, and event handling; includes 1,000,000 invocations/month	~$13.00
Database Usage (DynamoDB)	200,000 new or updated records per month (write/read/storage)	Writes: $18.75; Reads: $7.50; Storage: $25.00
Message Queues (SQS)	10,000 standard and 1,000 FIFO auto-messages per month	Standard: $0.002; FIFO: $0.0005
Workflow Automations (Step Functions)	1,000 automated workflows (pipelines), 20 steps each every month	$2.40
WAF (Web Application Firewall)	API & web protection (rules + ACLs + requests)	Requests: $0.30
Encryption (KMS)	311,000 encryption/decryption actions per month	$15.00
Monitoring/Logging (CloudWatch)	Storage, metrics, logs for all services	Data: $7.50; Storage: $0.07
EventBridge	Event-driven triggers	$0.01
X-Ray (Tracing)	Distributed trace monitoring	$5.00
TOTAL	Monthly cost estimate for usage-based services	$197.50
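
As a quick sanity check on the table above, the following sketch adds up the listed line items. The figures are copied from the table; the item labels and the helper name are illustrative only. Summed this way, the items come to roughly $197.49, in line with the rounded total shown.

```python
from decimal import Decimal, ROUND_HALF_UP

# Line items copied from the usage-based cost table (USD per month).
# The labels are illustrative; only the amounts come from the table.
USAGE_LINE_ITEMS = {
    "S3 storage (uploads)": "23.55",
    "Step Functions (uploads)": "2.40",
    "S3 requests (first figure)": "0.05",
    "S3 requests (second figure)": "0.40",
    "CloudFront data": "29.75",
    "CloudFront requests (media)": "0.03",
    "S3 data transfer out": "45.00",
    "CloudFront requests (web/app)": "0.03",
    "API Gateway": "1.75",
    "Lambda": "13.00",
    "DynamoDB writes": "18.75",
    "DynamoDB reads": "7.50",
    "DynamoDB storage": "25.00",
    "SQS standard": "0.002",
    "SQS FIFO": "0.0005",
    "Step Functions (pipelines)": "2.40",
    "WAF requests": "0.30",
    "KMS": "15.00",
    "CloudWatch data": "7.50",
    "CloudWatch storage": "0.07",
    "EventBridge": "0.01",
    "X-Ray": "5.00",
}


def monthly_total(line_items):
    """Sum the line items exactly and round to whole cents."""
    total = sum(Decimal(amount) for amount in line_items.values())
    return total.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


usage_total = monthly_total(USAGE_LINE_ITEMS)
print(f"Usage-based estimate: ${usage_total}")
```

Replacing individual amounts with your own estimates gives a rough figure for a different workload; actual charges depend on current AWS pricing.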

Quick Deployment (CloudFormation)

Recommended for most users - Deploy media lake quickly using the pre-built CloudFormation template without setting up a development environment.

Prerequisites

An AWS account with appropriate permissions to create and manage resources
Access to the AWS Console

Deployment Steps

Download the CloudFormation template
- Download medialake.template from the GitHub repository
Deploy using AWS Console
- Go to the AWS Console > CloudFormation > "Create Stack" > "With new resources (standard)"
- Choose Upload a template file, select medialake.template
- Set stack name to medialake-cf
Configure template parameters:

Initial Media Lake User
- InitialUserEmail: Email address for the initial administrator account (required)
- InitialUserFirstName: First name of the initial administrator (1-50 characters, letters/spaces/hyphens/periods only)
- InitialUserLastName: Last name of the initial administrator (1-50 characters, letters/spaces/hyphens/periods only)
Media Lake Configuration
- MediaLakeEnvironmentName: Environment identifier (1-4 alphanumeric characters, default: dev)
- OpenSearchDeploymentSize: Controls the size of your OpenSearch cluster
  - small: Suitable for development and testing environments
  - medium: Recommended for moderate production workloads
  - large: Designed for high-volume production environments
Media Lake Deployment Configuration
- SourceType: Deployment source method
  - Git: Deploy directly from a public Git repository
  - S3PresignedURL: Deploy from a ZIP file via presigned URL
- GitRepositoryUrl: Public Git repository URL (default: AWS Solutions Library media lake repository)
- S3PresignedURL: Presigned URL for ZIP file download (required when using S3PresignedURL source type)
Note: You can use the default deployment configuration settings without making any changes. The defaults are configured to deploy from the official AWS Solutions Library repository.
Create the stack
- Accept the required IAM capabilities
- Click "Create stack" to begin the deployment process
- The initial CloudFormation stack will be created first; monitor its creation progress in the CloudFormation console
Monitor CodePipeline deployment
- A CodePipeline will be automatically created to deploy the CDK code
- This deployment process will take approximately 1 hour to complete
- You will receive a welcome email at the address you provided once deployment is finished
- To monitor deployment progress:
  - Go to the CloudFormation console
  - Navigate to your stack's "Outputs" tab
  - Click on the CodePipeline link to view the deployment status

See the MediaLake-Installation-Guide.md for a complete CloudFormation deployment guide.
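
If you prefer to script the stack creation instead of using the console, the template parameters described above can be assembled and checked up front. The sketch below is a minimal example using only the Python standard library. The parameter keys and the length/character constraints come from the list above; the function name, the simple email check, and the choice of "small" as the default size are assumptions made for illustration. Parameters that are omitted (such as SourceType) fall back to the template defaults.

```python
import re

# Constraints as described in the parameter list above.
NAME_PATTERN = re.compile(r"[A-Za-z .\-]{1,50}")
ENV_PATTERN = re.compile(r"[A-Za-z0-9]{1,4}")
DEPLOYMENT_SIZES = {"small", "medium", "large"}


def build_stack_parameters(email, first_name, last_name,
                           environment="dev", deployment_size="small"):
    """Validate inputs and return them in CloudFormation parameter format.

    Hypothetical helper: the "small" default and the minimal email check
    are assumptions, not behavior taken from the template.
    """
    if "@" not in email:
        raise ValueError("InitialUserEmail must be an email address")
    for label, value in (("first", first_name), ("last", last_name)):
        if not NAME_PATTERN.fullmatch(value):
            raise ValueError(f"Invalid {label} name: {value!r}")
    if not ENV_PATTERN.fullmatch(environment):
        raise ValueError("Environment name must be 1-4 alphanumeric characters")
    if deployment_size not in DEPLOYMENT_SIZES:
        raise ValueError(f"Deployment size must be one of {sorted(DEPLOYMENT_SIZES)}")

    values = {
        "InitialUserEmail": email,
        "InitialUserFirstName": first_name,
        "InitialUserLastName": last_name,
        "MediaLakeEnvironmentName": environment,
        "OpenSearchDeploymentSize": deployment_size,
    }
    return [{"ParameterKey": k, "ParameterValue": v} for k, v in values.items()]


params = build_stack_parameters("admin@example.com", "Ada", "Lovelace")
for p in params:
    print(p["ParameterKey"], "=", p["ParameterValue"])
```

The returned list has the shape that the boto3 CloudFormation client accepts for the Parameters argument of create_stack, so it could be passed along with the template body and the capabilities the template requires.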

Development Environment Setup and CDK Deployment

For developers who want to customize the solution, contribute to the project, or deploy using AWS CDK.

Prerequisites

An AWS account with appropriate permissions to create and manage resources
AWS CLI configured with your account credentials
AWS CDK CLI (npm install -g aws-cdk)
Node.js (v20.x or later)
Python (3.12)
Docker (for local development)
Git for cloning the repository
Optional: Third-party services (such as TwelveLabs) require separate setup and API credentials for integration

1. Clone the repository

git clone https://github.com/aws-solutions-library-samples/guidance-for-medialake-on-aws.git
cd guidance-for-medialake-on-aws

2. Prepare the environment

(a) Python virtual environment (recommended):

python3 -m venv .venv
source .venv/bin/activate      # macOS/Linux
# OR for Windows
.venv\Scripts\activate.bat     # Windows (Command Prompt)

(b) Install dependencies:

pip install -r requirements.txt
npm install

For development:

pip install -r requirements-dev.txt

3. Configure AWS account and region

Ensure AWS credentials are configured (aws configure), and bootstrap your account for CDK:

cdk bootstrap --profile <profile> --region <region>

4. Configuration Setup

Create a config.json file in the project root with your deployment settings:

touch config.json

Key configuration parameters include:

environment: Environment name of 1-4 alphanumeric characters
deployment_size: OpenSearch deployment size ("small", "medium", "large")
resource_prefix: Prefix for all AWS resources created
account_id: AWS Account ID for deployment
primary_region: Primary region for deployment
initial_user: Initial user configuration with email and name
vpc: VPC configuration for using existing or creating new VPC
authZ: Identity provider configuration (Cognito, SAML)

See the config-example.json for a complete configuration example.
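
Before deploying, it can help to check config.json for obvious mistakes. The sketch below is one way to do that under stated assumptions: the top-level keys are the ones listed above, but treating all of them as required, the nested shape of initial_user, and the empty vpc and authZ placeholders are hypothetical. Use config-example.json as the authoritative reference for the real structure.

```python
import json
import re

# Top-level keys named in this README. Treating all as required is an assumption.
REQUIRED_KEYS = (
    "environment", "deployment_size", "resource_prefix", "account_id",
    "primary_region", "initial_user", "vpc", "authZ",
)
DEPLOYMENT_SIZES = {"small", "medium", "large"}


def validate_config(config):
    """Return a list of human-readable problems; an empty list means no issues found."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in config]
    environment = str(config.get("environment", ""))
    if not re.fullmatch(r"[A-Za-z0-9]{1,4}", environment):
        problems.append("environment must be 1-4 alphanumeric characters")
    if config.get("deployment_size") not in DEPLOYMENT_SIZES:
        problems.append("deployment_size must be small, medium, or large")
    return problems


# Hypothetical values for illustration only.
example_config = {
    "environment": "dev",
    "deployment_size": "small",
    "resource_prefix": "medialake",
    "account_id": "123456789012",
    "primary_region": "us-east-1",
    "initial_user": {"email": "admin@example.com"},  # nested shape is assumed
    "vpc": {},    # placeholder; see config-example.json
    "authZ": {},  # placeholder; see config-example.json
}

print(json.dumps(example_config, indent=2))
print(validate_config(example_config) or "no problems found")
```

To check a real file, load it with json.load and pass the resulting dictionary to validate_config.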

5. Deploy using AWS CDK

Deploy all stacks using CDK:

cdk deploy --all --profile <profile> --region <region>

Deployment Validation

In the AWS CloudFormation console, check that the related media lake stacks are in CREATE_COMPLETE status.
After deployment, you will receive a welcome email at the address you provided, containing:
- The media lake application URL
- Username (your email)
- Temporary password
Log in at the URL provided. You should see the media lake user interface and be able to add storage connectors and media.
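
The stack status check can also be scripted. The helper below is a small sketch that takes stack descriptions (dictionaries with StackName and StackStatus, the field names returned by the CloudFormation DescribeStacks API) and reports any matching stack that is not in CREATE_COMPLETE. Matching stacks by the name fragment "medialake" is an assumption; adjust it to the names you see in your account.

```python
def incomplete_stacks(stacks, name_fragment="medialake"):
    """Return (name, status) pairs for matching stacks not in CREATE_COMPLETE.

    name_fragment is a hypothetical filter; stack names in a real
    deployment may follow a different convention.
    """
    return [
        (stack["StackName"], stack["StackStatus"])
        for stack in stacks
        if name_fragment.lower() in stack["StackName"].lower()
        and stack["StackStatus"] != "CREATE_COMPLETE"
    ]


# Sample data for illustration. With boto3 installed and credentials
# configured, real data could come from:
#   import boto3
#   stacks = boto3.client("cloudformation").describe_stacks()["Stacks"]
sample_stacks = [
    {"StackName": "medialake-cf", "StackStatus": "CREATE_COMPLETE"},
    {"StackName": "medialake-cf-api", "StackStatus": "CREATE_IN_PROGRESS"},
    {"StackName": "unrelated-stack", "StackStatus": "ROLLBACK_COMPLETE"},
]

pending = incomplete_stacks(sample_stacks)
print(pending if pending else "All matching stacks are in CREATE_COMPLETE")
```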

Running the Guidance

1. Login

Use the emailed credentials to log in to the media lake UI.

2. Connect Storage

Navigate to Settings > Connectors in the UI.
Add a connector, choosing Amazon S3 and providing your bucket details.
Note: If you create new S3 buckets through media lake, remember that these will need to be manually emptied and deleted during cleanup as they are not automatically removed when the media lake stack is deleted.

3. Enable Semantic Search and Integrations

Enable and configure semantic search providers (e.g., TwelveLabs) as described in the UI and MediaLake-Instructions.md.
Import pipelines for enrichment and transcription.

4. Ingest Media

Upload media to your configured S3 bucket.

5. Process and Retrieve Assets

Monitor pipeline executions, view extracted metadata, and use search/discovery features in the UI.

Next Steps

Customize pipeline configurations for your use case.
Scale up OpenSearch or DynamoDB for higher performance.

Project Structure

guidance-for-medialake-on-aws/
├── assets/                          # Documentation, images, and scripts
│   ├── docs/                        # Installation and configuration guides
│   └── images/                      # Architecture diagrams and screenshots
├── medialake_constructs/            # CDK construct definitions
│   ├── api_gateway/                 # API Gateway constructs
│   ├── auth/                        # Authentication constructs
│   └── shared_constructs/           # Shared AWS constructs
├── medialake_stacks/                # CDK stack definitions
│   ├── api_gateway_core_stack.py    # Core API Gateway stack
│   ├── api_gateway_deployment_stack.py # API deployment stack
│   ├── api_gateway_stack.py         # Main API Gateway stack
│   └── [additional stack files]     # Infrastructure and service stacks
├── medialake_user_interface/        # React TypeScript frontend
│   ├── src/                         # Source code
│   │   ├── api/                     # API service layer
│   │   ├── features/                # Feature-based modules
│   │   ├── pages/                   # Page components
│   │   ├── shared/                  # Common utilities and types
│   │   └── [additional folders]     # Components, hooks, contexts
│   ├── tests/                       # End-to-end tests
│   ├── package.json                 # Node.js dependencies
│   └── playwright.config.ts         # Testing configuration
├── lambdas/                         # Lambda function source code
│   ├── api/                         # API endpoint handlers
│   ├── auth/                        # Authentication functions
│   ├── back_end/                    # Backend processing functions
│   ├── nodes/                       # Pipeline processing nodes
│   ├── pipelines/                   # Pipeline orchestration
│   └── common_libraries/            # Shared Lambda utilities
├── pipeline_library/               # Default pipeline templates
├── s3_bucket_assets/               # S3 deployment assets
│   ├── pipeline_library/           # Pipeline definitions
│   └── pipeline_nodes/             # Node templates and specs
├── app.py                          # Main CDK application entry point
├── cdk.json                        # CDK configuration and settings
├── config_utils.py                 # Configuration utilities
├── config-dev.json                 # Development configuration example
├── requirements.txt                # Python dependencies
├── requirements-dev.txt            # Development Python dependencies
├── package.json                    # Node.js dependencies for CDK
└── README.md                       # This documentation file

Key Components

Storage Connectors

S3 Connector with EventBridge/S3 event integration
Automatic resource provisioning (SQS, Lambda, IAM roles)
Bucket exploration and management capabilities

Processing Pipelines

FIFO queue-based media processing
Step Functions workflow orchestration
Customizable processing steps
Event-driven architecture
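
To illustrate the general pattern of feeding storage events into a FIFO queue, the sketch below converts a standard S3 event notification into message fields suitable for an SQS FIFO queue. This is not media lake's actual implementation, which lives in the repository. The function name, the message body layout, grouping by bucket, and the deduplication scheme are all assumptions chosen for the example.

```python
import hashlib
import json
from urllib.parse import unquote_plus


def fifo_messages_from_s3_event(event):
    """Build one FIFO message per S3 record (illustrative sketch only)."""
    messages = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        sequencer = record["s3"]["object"].get("sequencer", "")
        dedup_source = f"{bucket}/{key}/{sequencer}"
        messages.append({
            "MessageBody": json.dumps({"bucket": bucket, "key": key}),
            # Assumption: one message group per bucket keeps per-bucket ordering.
            "MessageGroupId": bucket,
            # Assumption: hash of bucket, key, and sequencer suppresses duplicate deliveries.
            "MessageDeduplicationId": hashlib.sha256(
                dedup_source.encode("utf-8")
            ).hexdigest(),
        })
    return messages


sample_event = {
    "Records": [{
        "s3": {
            "bucket": {"name": "example-bucket"},
            "object": {"key": "videos/my+clip%281%29.mp4", "sequencer": "0A1B2C"},
        }
    }]
}

messages = fifo_messages_from_s3_event(sample_event)
print(json.dumps(messages, indent=2))
```

Each dictionary uses the field names that the SQS SendMessage API expects for FIFO queues, so it could be sent with a queue URL added.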

AWS Services Used

Core Services:

AWS Lambda - Serverless compute for API handlers and media processing
Amazon S3 - Object storage for media assets and metadata
AWS Step Functions - Orchestration of media processing workflows
Amazon SQS - Queues for ordered media processing
Amazon EventBridge - Event routing and pipeline triggers
Amazon API Gateway - REST API endpoint management
Amazon DynamoDB - Asset metadata and configuration storage
AWS Elemental MediaConvert - Media transcoding and format conversion service
Amazon CloudWatch - Metrics, logging, and alerting
Amazon OpenSearch Service - Search and analytics engine

Security & Authentication:

Amazon Cognito - User authentication and authorization
AWS KMS - Encryption key management
AWS IAM - Resource access control

Security Features

Amazon Cognito authentication and authorization, including support for local username/password and federated authentication via SAML
KMS encryption for sensitive data
CORS-enabled API endpoints
VPC deployment options for network isolation

Supported Media Types

Media lake supports processing of the following file types through its default pipelines:

Audio Files

WAV - Waveform Audio File Format
AIFF/AIF - Audio Interchange File Format
MP3 - MPEG Audio Layer III
PCM - Pulse Code Modulation
M4A - MPEG-4 Audio

Video Files

FLV - Flash Video
MP4 - MPEG-4 Part 14
MOV - QuickTime Movie
AVI - Audio Video Interleave
MKV - Matroska Video
WEBM - WebM Video
MXF - Material Exchange Format

Image Files

PSD - Adobe Photoshop Document
TIF - Tagged Image File Format
JPG/JPEG - Joint Photographic Experts Group
PNG - Portable Network Graphics
WEBP - WebP Image Format
GIF - Graphics Interchange Format
SVG - Scalable Vector Graphics

Each media type is automatically processed through dedicated pipelines that handle metadata extraction, proxy/thumbnail generation, and integration with AI services for enhanced search and analysis capabilities.
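
When preparing files for ingestion, it may be useful to check which of them fall into the supported types listed above. The sketch below classifies an object key by its file extension. The extension lists mirror this section; the function name and the choice to return None for anything unlisted are illustrative. Only the extensions named here are included, so variants not listed (for example .tiff) are treated as unsupported.

```python
from pathlib import PurePosixPath

# Extensions taken from the Supported Media Types lists in this README.
SUPPORTED_EXTENSIONS = {
    "audio": {"wav", "aiff", "aif", "mp3", "pcm", "m4a"},
    "video": {"flv", "mp4", "mov", "avi", "mkv", "webm", "mxf"},
    "image": {"psd", "tif", "jpg", "jpeg", "png", "webp", "gif", "svg"},
}


def media_type_for(key):
    """Return "audio", "video", or "image" for a supported key, else None."""
    suffix = PurePosixPath(key).suffix.lower().lstrip(".")
    for media_type, extensions in SUPPORTED_EXTENSIONS.items():
        if suffix in extensions:
            return media_type
    return None


for name in ("raw/interview.WAV", "clips/promo.final.mxf", "notes.txt"):
    print(name, "->", media_type_for(name))
```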

Cleanup

To remove all media lake resources:

Manual Cleanup (AWS Console):

Go to CloudFormation console
Delete all stacks with the prefix "Media Lake", as well as the medialake-cf stack
Important for S3 Buckets: For new buckets created via media lake, you must manually empty and delete them as they are not automatically cleaned up during stack deletion
Delete any other associated S3 buckets, DynamoDB tables, or resources as needed

Warning: This will permanently remove all media lake data and resources. Use with caution.

FAQ, Known Issues, and Additional Considerations

For feedback, questions, or suggestions, please use the GitHub Issues page.
Known issues and deployment tips will be tracked in the Issues section.
Service quotas: media lake relies on OpenSearch, DynamoDB, Lambda, and S3 limits; monitor and request increases if needed for large-scale deployments.
For SAML integration and advanced identity provider setup, refer to the SAML instructions in MediaLake-Installation-Guide.md.

Revisions

July 2025: Initial release and commit of repository.
See repository commit history for further changes.

Notices

Customers are responsible for making their own independent assessment of the information in this Guidance. This Guidance: (a) is for informational purposes only, (b) represents AWS current product offerings and practices, which are subject to change without notice, and (c) does not create any commitments or assurances from AWS and its affiliates, suppliers or licensors. AWS products or services are provided "as is" without warranties, representations, or conditions of any kind, whether express or implied. AWS responsibilities and liabilities to its customers are controlled by AWS agreements, and this Guidance is not part of, nor does it modify, any agreement between AWS and its customers.

Authors

Joao Seike
Lior Berezinski
Robert Raver
