Table of Contents
- Overview
- Cost
- Quick Deployment (CloudFormation)
- Development Environment Setup and CDK Deployment
- Development
- Deployment Validation
- Running the Guidance
- Next Steps
- Project Structure
- Key Components
- Security Features
- Supported Media Types
- Cleanup
- FAQ, Known Issues, and Additional Considerations
- Revisions
- Notices
- Authors
Guidance for a Media Lake on AWS provides a comprehensive, serverless, and scalable platform for media ingestion, processing, management, metadata management, and workflow orchestration on AWS. Media lake enables you to connect multiple storage sources, known as connectors, ingest and organize media at scale, creating a unified search space for your media. Workflows, knows as pipelines, run customizable processing workflows (such as proxy/thumbnail generation and AI enrichment), and integrate with both AWS native and partner services.
Diagram: Media Lake provides a comprehensive serverless platform connecting storage sources, processing pipelines, and enrichment services with secure user interfaces and API endpoints for scalable media management workflows.
Diagram: Media lake application layer shows the React UI, API Gateway endpoints, Lambda functions, and data flow between Cognito authentication, DynamoDB storage, and OpenSearch indexing for user interactions and asset management.
Diagram: Media lake processes media through S3 ingestion, EventBridge routing, Lambda orchestration, Step Functions, and enrichment with metadata, search, and integration endpoints.
You are responsible for the cost of the AWS services used while running this Guidance.
Base Infrastructure Cost (without variable workloads): As of July 2025, the cost for running this Guidance with the small deployment configuration in the US East (N. Virginia) region is approximately $423.62 per month for the baseline services only.
Variable Workload Costs: Additional costs will be incurred based on actual usage:
- Media processing and enrichment services (Lambda, Step Functions, MediaConvert, TwelveLabs, Transcription)
- Media and Metadata storage (OpenSearch, DynamoDB, S3)
- Interfactions with the user interface and viewing media(CloudFront, Data Transfer Out, Step Functions, Lambda, OpenSearch and DynamoDB queries)
The total monthly cost will vary based on the volume of media processed, storage requirements, and usage patterns.
We recommend creating a Budget through AWS Cost Explorer to help manage costs. Prices are subject to change. For full details, refer to the pricing webpage for each AWS service used in this guidance.
Service & Usage | How It Relates to Your Team's Usage | Estimated Monthly Cost (USD) |
---|---|---|
Cognito (Users) | 50 active users signing in and using the system each month | $2.00 |
OpenSearch Service (Search) | Search and index storage and compute | t3.small: $28.72 (1 instance) Storage: $2.44 (gp3) |
OpenSearch Ingestion (OSI) | Data ingestion processing units | $350.40 (2 OCUs) |
NAT Gateway (VPC) | Outbound internet access from VPC | $33.30 |
WAF (Web Application Firewall) | API & web protection (rules + ACLs + requests) | WebACL: $5.00 Rules: $2.00 |
TOTAL | Monthly cost estimate for small deployment | $423.60 |
Service & Usage | How It Relates to Your Team's Usage | Estimated Monthly Cost (USD) |
---|---|---|
Cognito (Users) | 50 active users signing in and using the system each month | $2.00 |
WAF (Web Application Firewall) | API & web protection (rules + ACLs + requests) | WebACL: $5.00 Rules: $2.00 |
TOTAL | Monthly cost estimate for small deployment | $9.00 |
Service & Usage | How It Relates to Your Team’s Usage | Estimated Monthly Cost (USD) |
---|---|---|
S3 + Step Functions (Uploads) | 1,000 new media files uploaded/month, each triggering a workflow | S3 storage: $23.55 Step Functions: $2.40 |
S3/CloudFront (Images, Audio, Video) | All users viewing/downloading images, audio, and video each month (aggregate, served via S3 and CloudFront) | S3 requests: $0.05 + $0.40 CloudFront data: $29.75 CloudFront requests: $0.03 |
Total Media Downloaded (S3/CloudFront) | About 350GB of media files viewed/downloaded per month | S3 data transfer out: $45.00 |
Web/App Requests (CloudFront) | 25,000 clicks or page loads/month through CDN | $0.03 |
API Gateway (API Requests) | 500,000 system actions/searches/uploads per month | $1.75 |
Lambda (All Automated and API Processing) | All automated backend tasks, API logic, file processing, and event handling; includes 1,000,000 invocations/month | ~$13.00 |
Database Usage (DynamoDB) | 200,000 new or updated records per month (write/read/storage) | Writes: $18.75 Reads: $7.50 Storage: $25.00 |
Message Queues (SQS) | 10,000 standard and 1,000 FIFO auto-messages per month | Standard: $0.002 FIFO: $0.0005 |
Workflow Automations (Step Functions) | 1,000 automated workflows (pipelines), 20 steps each every month | $2.40 |
WAF (Web Application Firewall) | API & web protection (rules + ACLs + requests) | Requests: $0.30 |
Encryption (KMS) | 311,000 encryption/decryption actions per month | $15.00.00 |
Monitoring/Logging (CloudWatch) | Storage, metrics, logs for all services | Data: $7.50 Storage: $0.07 |
EventBridge | Event-driven triggers | $0.01 |
X-Ray (Tracing) | Distributed trace monitoring | $5.00 |
TOTAL | Monthly cost estimate for usage-based services | $197.50 |
Recommended for most users - Deploy media lake quickly using the pre-built CloudFormation template without setting up a development environment.
- An AWS account with appropriate permissions to create and manage resources
- Access to the AWS Console
-
Download the CloudFormation template
- Download
medialake.template
from the GitHub repository
- Download
-
Deploy using AWS Console
- Go to the AWS Console > CloudFormation > "Create Stack" > "With new resources (standard)"
- Choose Upload a template file, select
medialake.template
- Set stack name to
medialake-cf
-
Configure template parameters:
- InitialUserEmail: Email address for the initial administrator account (required)
- InitialUserFirstName: First name of the initial administrator (1-50 characters, letters/spaces/hyphens/periods only)
- InitialUserLastName: Last name of the initial administrator (1-50 characters, letters/spaces/hyphens/periods only)
- MediaLakeEnvironmentName: Environment identifier (1-4 alphanumeric characters, default:
dev
) - OpenSearchDeploymentSize: Controls the size of your OpenSearch cluster
small
: Suitable for development and testing environmentsmedium
: Recommended for moderate production workloadslarge
: Designed for high-volume production environments
- SourceType: Deployment source method
Git
: Deploy directly from a public Git repositoryS3PresignedURL
: Deploy from a ZIP file via presigned URL
- GitRepositoryUrl: Public Git repository URL (default: AWS Solutions Library media lake repository)
- S3PresignedURL: Presigned URL for ZIP file download (required when using S3PresignedURL source type)
Note: You can use the default deployment configuration settings without making any changes. The defaults are configured to deploy from the official AWS Solutions Library repository.
-
Complete deployment
- Accept the required IAM capabilities and deploy
- Monitor the stack creation progress in the CloudFormation console
-
Initiate deployment
- Click "Create stack" to begin the deployment process
- The initial CloudFormation stack will be created first
-
Monitor CodePipeline deployment
- A CodePipeline will be automatically created to deploy the CDK code
- This deployment process will take approximately 1 hour to complete
- You will receive a welcome email at the address you provided once deployment is finished
- To monitor deployment progress:
- Go to the CloudFormation console
- Navigate to your stack's "Outputs" tab
- Click on the CodePipeline link to view the deployment status
See the MediaLake-Installation-Guide.md
for a complete CloudFormation deployment guide.
For developers who want to customize the solution, contribute to the project, or deploy using AWS CDK.
- An AWS account with appropriate permissions to create and manage resources
- AWS CLI configured with your account credentials
- AWS CDK CLI (
npm install -g aws-cdk
) - Node.js (v20.x or later)
- Python (3.12)
- Docker (for local development)
- Git for cloning the repository
- Optional: Third-party services (such as Twelve Labs) require separate setup and API credentials for integration
git clone https://github.com/aws-solutions-library-samples/guidance-for-medialake-on-aws.git
cd guidance-for-medialake-on-aws
python3 -m venv .venv
source .venv/bin/activate # Mac
# OR for Windows
.venv\Scripts\activate.bat # Windows
pip install -r requirements.txt
npm install
For development:
pip install -r requirements-dev.txt
Ensure AWS credentials are configured (aws configure
), and bootstrap your account for CDK:
cdk bootstrap --profile <profile> --region <region>
Create a config.json
file in the project root with your deployment settings:
touch config.json
Key configuration parameters include:
- environment: Choose between 1-4 letters that are alphanumeric that represent an environment name
- deployment_size: OpenSearch deployment size ("small", "medium", "large")
- resource_prefix: Prefix for all AWS resources created
- account_id: AWS Account ID for deployment
- primary_region: Primary region for deployment
- initial_user: Initial user configuration with email and name
- vpc: VPC configuration for using existing or creating new VPC
- authZ: Identity provider configuration (Cognito, SAML)
See the config-example.json
for a complete configuration example.
Deploy all stacks using CDK:
cdk deploy --all --profile <profile> --region <region>
- In the AWS CloudFormation console, check that the related media lake stacks are in CREATE_COMPLETE status.
- After deployment, you will receive a welcome email at the address you provided, containing:
- The media lake application URL
- Username (your email)
- Temporary password
- Log in at the URL provided. You should see the media lake user interface and be able to add storage connectors and media.
Use the emailed credentials to log in to the media lake UI.
- Navigate to Settings > Connectors in the UI.
- Add a connector, choosing Amazon S3 and providing your bucket details.
- Note: If you create new S3 buckets through media lake, remember that these will need to be manually emptied and deleted during cleanup as they are not automatically removed when the media lake stack is deleted.
- Enable and configure semantic search providers (e.g., TwelveLabs) as described in the UI and MediaLake-Instructions.md.
- Import pipelines for enrichment and transcription.
- Upload media to your configured S3 bucket.
- Monitor pipeline executions, view extracted metadata, and use search/discovery features in the UI.
- Customize pipeline configurations for your use case.
- Scale up OpenSearch or DynamoDB for higher performance.
guidance-for-medialake-on-aws/
├── assets/ # Documentation, images, and scripts
│ ├── docs/ # Installation and configuration guides
│ └── images/ # Architecture diagrams and screenshots
├── medialake_constructs/ # CDK construct definitions
│ ├── api_gateway/ # API Gateway constructs
│ ├── auth/ # Authentication constructs
│ └── shared_constructs/ # Shared AWS constructs
├── medialake_stacks/ # CDK stack definitions
│ ├── api_gateway_core_stack.py # Core API Gateway stack
│ ├── api_gateway_deployment_stack.py # API deployment stack
│ ├── api_gateway_stack.py # Main API Gateway stack
│ └── [additional stack files] # Infrastructure and service stacks
├── medialake_user_interface/ # React TypeScript frontend
│ ├── src/ # Source code
│ │ ├── api/ # API service layer
│ │ ├── features/ # Feature-based modules
│ │ ├── pages/ # Page components
│ │ ├── shared/ # Common utilities and types
│ │ └── [additional folders] # Components, hooks, contexts
│ ├── tests/ # End-to-end tests
│ ├── package.json # Node.js dependencies
│ └── playwright.config.ts # Testing configuration
├── lambdas/ # Lambda function source code
│ ├── api/ # API endpoint handlers
│ ├── auth/ # Authentication functions
│ ├── back_end/ # Backend processing functions
│ ├── nodes/ # Pipeline processing nodes
│ ├── pipelines/ # Pipeline orchestration
│ └── common_libraries/ # Shared Lambda utilities
├── pipeline_library/ # Default pipeline templates
├── s3_bucket_assets/ # S3 deployment assets
│ ├── pipeline_library/ # Pipeline definitions
│ └── pipeline_nodes/ # Node templates and specs
├── app.py # Main CDK application entry point
├── cdk.json # CDK configuration and settings
├── config_utils.py # Configuration utilities
├── config-dev.json # Development configuration example
├── requirements.txt # Python dependencies
├── requirements-dev.txt # Development Python dependencies
├── package.json # Node.js dependencies for CDK
└── README.md # This documentation file
- S3 Connector with EventBridge/S3 event integration
- Automatic resource provisioning (SQS, Lambda, IAM roles)
- Bucket exploration and management capabilities
- FIFO queue-based media processing
- Step Functions workflow orchestration
- Customizable processing steps
- Event-driven architecture
Core Services:
- AWS Lambda - Serverless compute for API handlers and media processing
- Amazon S3 - Object storage for media assets and metadata
- AWS Step Functions - Orchestration of media processing workflows
- Amazon SQS - Queues for ordered media processing
- Amazon EventBridge - Event routing and pipeline triggers
- Amazon API Gateway - REST API endpoint management
- Amazon DynamoDB - Asset metadata and configuration storage
- AWS MediaConvert - Media transcoding and format conversion service
- Amazon CloudWatch - Metrics, logging, and alerting
- Amazon OpenSearch - Search and analytics engine
Security & Authentication:
- AWS Cognito - User authentication and authorization
- AWS KMS - Encryption key management
- AWS IAM - Resource access control
- AWS Cognito authentication and authorization including support for local username/password and federated authentication via SAML
- KMS encryption for sensitive data
- CORS-enabled API endpoints
- VPC deployment options for network isolation
Media lake supports processing of the following file types through its default pipelines:
- WAV - Waveform Audio File Format
- AIFF/AIF - Audio Interchange File Format
- MP3 - MPEG Audio Layer III
- PCM - Pulse Code Modulation
- M4A - MPEG-4 Audio
- FLV - Flash Video
- MP4 - MPEG-4 Part 14
- MOV - QuickTime Movie
- AVI - Audio Video Interleave
- MKV - Matroska Video
- WEBM - WebM Video
- MXF - Material Exchange Format
- PSD - Adobe Photoshop Document
- TIF - Tagged Image File Format
- JPG/JPEG - Joint Photographic Experts Group
- PNG - Portable Network Graphics
- WEBP - WebP Image Format
- GIF - Graphics Interchange Format
- SVG - Scalable Vector Graphics
Each media type is automatically processed through dedicated pipelines that handle metadata extraction, proxy/thumbnail generation, and integration with AI services for enhanced search and analysis capabilities.
To remove all media lake resources:
- Go to CloudFormation console
- Delete all stacks with prefix "Media Lake" and
medialake-cf
- Important for S3 Buckets: For new buckets created via media lake, you must manually empty and delete them as they are not automatically cleaned up during stack deletion
- Delete any other associated S3 buckets, DynamoDB tables, or resources as needed
Warning: This will permanently remove all media lake data and resources. Use with caution.
- For feedback, questions, or suggestions, please use the GitHub Issues page.
- Known issues and deployment tips will be tracked in the Issues section.
- Service quotas: media lake relies on OpenSearch, DynamoDB, Lambda, and S3 limits; monitor and request increases if needed for large-scale deployments.
- For SAML integration and advanced identity provider setup, refer to the SAML instructions in MediaLake-Installation-Guide.md.
- July 2025: Initial release and commit of repository.
- See repository commit history for further changes.
Customers are responsible for making their own independent assessment of the information in this Guidance. This Guidance: (a) is for informational purposes only, (b) represents AWS current product offerings and practices, which are subject to change without notice, and (c) does not create any commitments or assurances from AWS and its affiliates, suppliers or licensors. AWS products or services are provided "as is" without warranties, representations, or conditions of any kind, whether express or implied. AWS responsibilities and liabilities to its customers are controlled by AWS agreements, and this Guidance is not part of, nor does it modify, any agreement between AWS and its customers.
- Joao Seike
- Lior Berezinski
- Robert Raver