| 1 | +# RDS Monitoring Architecture |
| 2 | + |
| 3 | +This document describes the separated monitoring architecture for the RDS module. Monitoring and alerting code now lives in its own module, apart from cluster creation, which makes both easier to organize and maintain. |
| 4 | + |
| 5 | +## Architecture Overview |
| 6 | + |
| 7 | +Monitoring is split across two source files, with the monitoring logic deployed as a nested stack: |
| 8 | + |
| 9 | +1. **`src/monitoring.ts`** - Contains alert threshold definitions, interfaces, and the main monitoring logic as a nested stack |
| 10 | +2. **`src/rds.ts`** - Contains the RDS cluster creation logic (now cleaner without monitoring code) |
| 11 | + |
| 12 | +## Components |
| 13 | + |
| 14 | +### 1. Monitoring Module (`src/monitoring.ts`) |
| 15 | + |
| 16 | +The monitoring module contains both the interfaces and the main monitoring logic: |
| 17 | + |
| 18 | +#### Alert Thresholds Interface |
| 19 | +```typescript |
| 20 | +export interface AlertThresholds { |
| 21 | + cpu?: number; |
| 22 | + memory?: number; // in bytes |
| 23 | + readIops?: number; |
| 24 | + writeIops?: number; |
| 25 | + dbConnections?: number; // percentage |
| 26 | + diskQueueDepth?: number; |
| 27 | + freeStorage?: number; // in bytes |
| 28 | + networkThroughput?: number; // in bytes per second |
| 29 | + replicationLag?: number; // in milliseconds |
| 30 | +} |
| 31 | +``` |
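Several of these fields take raw byte or millisecond counts, which are easy to mistype. A minimal sketch of unit helpers follows; `gib`, `mib`, and `seconds` are hypothetical names that are not part of the module, and the helpers assume binary units (1 GB = 1024³ bytes), which matches the literal values used in the examples in this document.

```typescript
// Hypothetical helpers (not part of src/monitoring.ts) for writing
// threshold values in readable units instead of raw byte counts.
const BYTES_PER_MIB = 1024 * 1024;
const BYTES_PER_GIB = 1024 * BYTES_PER_MIB;

/** Converts gibibytes to bytes, e.g. for `memory` or `freeStorage`. */
function gib(n: number): number {
  return n * BYTES_PER_GIB;
}

/** Converts mebibytes to bytes, e.g. for `networkThroughput` (bytes per second). */
function mib(n: number): number {
  return n * BYTES_PER_MIB;
}

/** Converts seconds to milliseconds, e.g. for `replicationLag`. */
function seconds(n: number): number {
  return n * 1000;
}

// The same values that appear as literals in the usage examples.
const readableThresholds = {
  memory: gib(3),
  freeStorage: gib(15),
  networkThroughput: mib(2),
  replicationLag: seconds(45),
};
```

With helpers like these, `memory: gib(3)` replaces `memory: 3221225472` and the unit is visible at the call site.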
| 32 | + |
| 33 | +#### Monitoring Logic |
| 34 | + |
| 35 | +The monitoring module contains: |
| 36 | + |
| 37 | +- **RDSMonitoringStack class** - Main monitoring nested stack |
| 38 | +- **SNS Topic creation** - For alert notifications |
| 39 | +- **Alert creation methods** - For all RDS metrics |
| 40 | +- **Dynamic threshold calculation** - Based on instance types |
| 41 | +- **Support for both primary and replica instances** |
| 42 | + |
| 43 | +#### Key Features |
| 44 | + |
| 45 | +- **Dynamic Thresholds**: Automatically calculates appropriate alert thresholds based on instance type |
| 46 | +- **Custom Overrides**: Allows custom threshold values for specific instances |
| 47 | +- **Multi-Instance Support**: Handles both primary and read replica instances |
| 48 | +- **Zenduty Integration**: Supports Zenduty webhook integration |
| 49 | +- **Flexible Configuration**: Supports various monitoring configurations |
| 50 | + |
| 51 | +### 2. RDS Module (`src/rds.ts`) |
| 52 | + |
| 53 | +With the monitoring code moved out, the RDS module focuses on: |
| 54 | + |
| 55 | +- RDS cluster creation |
| 56 | +- Security group configuration |
| 57 | +- Parameter group setup |
| 58 | +- Instance configuration |
| 59 | +- Integration with the monitoring module |
| 60 | + |
| 61 | +## Usage Examples |
| 62 | + |
| 63 | +### Basic Usage |
| 64 | + |
| 65 | +```typescript |
| 66 | +import { PostgresRDSCluster } from './src/rds'; |
| 67 | +import { RDSMonitoringStack, AlertThresholds } from './src/monitoring'; |
| 68 | + |
| 69 | +// Define custom thresholds |
| 70 | +const customThresholds: AlertThresholds = { |
| 71 | + cpu: 75, |
| 72 | + memory: 3221225472, // 3GB |
| 73 | + readIops: 1500, |
| 74 | + writeIops: 3000, |
| 75 | + dbConnections: 85, |
| 76 | + diskQueueDepth: 8, |
| 77 | + freeStorage: 16106127360, // 15GB |
| 78 | + networkThroughput: 2097152, // 2MB/s |
| 79 | + replicationLag: 45000, // 45 seconds |
| 80 | +}; |
| 81 | + |
| 82 | +// Create RDS cluster with monitoring |
| 83 | +const rdsCluster = new PostgresRDSCluster(this, 'MyRDSCluster', { |
| 84 | + // ... RDS configuration |
| 85 | + primaryAlertThresholds: customThresholds, |
| 86 | + alertSubcriptionWebhooks: [ |
| 87 | + 'https://hooks.slack.com/services/YOUR/WEBHOOK', |
| 88 | + 'https://www.zenduty.com/api/v1/integrations/aws-cloudwatch/YOUR_KEY/', |
| 89 | + 'https://discord.com/api/webhooks/YOUR/DISCORD/WEBHOOK', |
| 90 | + ], |
| 91 | +}); |
| 92 | + |
| 93 | +// Create separate monitoring nested stack |
| 94 | +const monitoring = new RDSMonitoringStack(this, 'RDSMonitoringStack', { |
| 95 | + clusterName: 'my-cluster', |
| 96 | + instanceType: ec2.InstanceType.of(ec2.InstanceClass.M5, ec2.InstanceSize.LARGE), |
| 97 | + primaryAlertThresholds: customThresholds, |
| 98 | + alertSubcriptionWebhooks: [ |
| 99 | + 'https://hooks.slack.com/services/YOUR/WEBHOOK', |
| 100 | + 'https://www.zenduty.com/api/v1/integrations/aws-cloudwatch/YOUR_KEY/', |
| 101 | + 'https://discord.com/api/webhooks/YOUR/DISCORD/WEBHOOK', |
| 102 | + ], |
| 103 | +}); |
| 104 | +``` |
| 105 | + |
| 106 | +### Advanced Usage with Read Replicas |
| 107 | + |
| 108 | +```typescript |
| 109 | +const primaryThresholds: AlertThresholds = { |
| 110 | + cpu: 75, |
| 111 | + memory: 3221225472, // 3GB |
| 112 | + // ... other thresholds |
| 113 | +}; |
| 114 | + |
| 115 | +const replicaThresholds: AlertThresholds = { |
| 116 | + cpu: 80, |
| 117 | + memory: 2147483648, // 2GB |
| 118 | + // ... other thresholds |
| 119 | +}; |
| 120 | + |
| 121 | +const rdsCluster = new PostgresRDSCluster(this, 'MyRDSCluster', { |
| 122 | + // ... RDS configuration |
| 123 | + readReplicas: { |
| 124 | + replicas: 2, |
| 125 | + instanceType: ec2.InstanceType.of(ec2.InstanceClass.M5, ec2.InstanceSize.MEDIUM), |
| 126 | + alertThresholds: replicaThresholds, |
| 127 | + }, |
| 128 | + primaryAlertThresholds: primaryThresholds, |
| 129 | + replicaAlertThresholds: replicaThresholds, |
| 130 | +}); |
| 131 | +``` |
| 132 | + |
| 133 | +## Available Alerts |
| 134 | + |
| 135 | +The monitoring module creates the following alerts: |
| 136 | + |
| 137 | +### Primary Instance Alerts |
| 138 | +- **CPU Utilization** - Monitors CPU usage with dynamic thresholds |
| 139 | +- **Free Memory** - Monitors available memory |
| 140 | +- **Free Storage** - Monitors available storage space |
| 141 | +- **Read IOPS** - Monitors read operations per second |
| 142 | +- **Write IOPS** - Monitors write operations per second |
| 143 | +- **Disk Queue Depth** - Monitors disk I/O queue depth |
| 144 | +- **Database Connections** - Monitors active database connections |
| 145 | +- **Network Throughput** - Monitors network I/O |
| 146 | +- **Replication Lag** - Monitors replication delay (Multi-AZ only) |
| 147 | +- **Backup Storage** - Monitors backup storage usage (if backups enabled) |
| 148 | + |
| 149 | +### Read Replica Alerts |
| 150 | +- The same alerts as the primary instance |
| 151 | +- Separate thresholds for replica-specific requirements |
| 152 | +- Individual monitoring for each replica instance |
| 153 | + |
| 154 | +## Dynamic Threshold Calculation |
| 155 | + |
| 156 | +The monitoring module automatically calculates appropriate thresholds based on instance type: |
| 157 | + |
| 158 | +### Instance Class Thresholds |
| 159 | + |
| 160 | +| Instance Class | CPU (%) | Memory (GB) | Read IOPS | Write IOPS | DB Connections (%) | |
| 161 | +|----------------|---------|-------------|-----------|------------|-------------------| |
| 162 | +| t3/t2 (Burstable) | 85 | 1 | 500 | 1000 | 60 | |
| 163 | +| m5/m6 (General) | 80 | 2 | 1000 | 2000 | 80 | |
| 164 | +| r5/r6 (Memory) | 75 | 4 | 1500 | 3000 | 85 | |
| 165 | +| c5/c6 (Compute) | 70 | 2 | 2000 | 4000 | 90 | |
| 166 | +| x1/x2 (Large) | 70 | 8 | 3000 | 6000 | 95 | |
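The table above can be read as a lookup keyed by instance family. The sketch below shows one way such a lookup might work. It is not the module's actual implementation: `defaultsForInstanceClass`, `DEFAULTS_BY_FAMILY`, the first-letter matching rule, and the fallback to the general-purpose row are all assumptions made for illustration. Memory is stored in bytes, assuming 1 GB = 1024³ bytes.

```typescript
// Hypothetical sketch of an instance-class lookup; names and matching
// rules are assumptions, only the numbers come from the table above.
interface ClassDefaults {
  cpu: number;           // percent
  memory: number;        // bytes
  readIops: number;
  writeIops: number;
  dbConnections: number; // percent
}

const GIB = 1024 * 1024 * 1024;

const GENERAL_PURPOSE: ClassDefaults = {
  cpu: 80, memory: 2 * GIB, readIops: 1000, writeIops: 2000, dbConnections: 80,
};

// One entry per row of the table, keyed by the family letter.
const DEFAULTS_BY_FAMILY: Record<string, ClassDefaults> = {
  t: { cpu: 85, memory: 1 * GIB, readIops: 500, writeIops: 1000, dbConnections: 60 },
  m: GENERAL_PURPOSE,
  r: { cpu: 75, memory: 4 * GIB, readIops: 1500, writeIops: 3000, dbConnections: 85 },
  c: { cpu: 70, memory: 2 * GIB, readIops: 2000, writeIops: 4000, dbConnections: 90 },
  x: { cpu: 70, memory: 8 * GIB, readIops: 3000, writeIops: 6000, dbConnections: 95 },
};

/**
 * Accepts strings such as "db.r5.large", "r5.large", or "r5" and returns
 * the defaults for that family. Unknown families fall back to the
 * general-purpose row (an assumption of this sketch).
 */
function defaultsForInstanceClass(instanceClass: string): ClassDefaults {
  const family = instanceClass.replace(/^db\./, '').charAt(0).toLowerCase();
  return DEFAULTS_BY_FAMILY[family] ?? GENERAL_PURPOSE;
}
```

Matching on the first letter keeps the sketch short, but it also groups families the table does not list under the same row, so a real implementation may want an explicit list of supported classes.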
| 167 | + |
| 168 | +### Custom Overrides |
| 169 | + |
| 170 | +You can override any threshold with custom values: |
| 171 | + |
| 172 | +```typescript |
| 173 | +const customThresholds: AlertThresholds = { |
| 174 | + cpu: 70, // Override CPU threshold to 70% |
| 175 | + memory: 4294967296, // Override memory threshold to 4GB |
| 176 | + // Other thresholds will use instance-type defaults |
| 177 | +}; |
| 178 | +``` |
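One way to picture the override behavior is as a merge of the custom values over the instance-type defaults. The sketch below assumes that model; `resolveThresholds` and `EXAMPLE_DEFAULTS` are hypothetical names, and the default values for fields not covered by the instance class table are placeholders, not the module's real defaults.

```typescript
// Hypothetical sketch of merging overrides onto defaults; not the
// module's actual implementation.
interface AlertThresholds {
  cpu?: number;
  memory?: number;            // in bytes
  readIops?: number;
  writeIops?: number;
  dbConnections?: number;     // percentage
  diskQueueDepth?: number;
  freeStorage?: number;       // in bytes
  networkThroughput?: number; // in bytes per second
  replicationLag?: number;    // in milliseconds
}

// The first five values follow the m5/m6 row of the table; the last four
// are illustrative placeholders only.
const EXAMPLE_DEFAULTS: Required<AlertThresholds> = {
  cpu: 80,
  memory: 2147483648,
  readIops: 1000,
  writeIops: 2000,
  dbConnections: 80,
  diskQueueDepth: 10,
  freeStorage: 10737418240,
  networkThroughput: 1048576,
  replicationLag: 30000,
};

/** Returns the defaults with every explicitly set override applied. */
function resolveThresholds(
  defaults: Required<AlertThresholds>,
  overrides: AlertThresholds = {},
): Required<AlertThresholds> {
  const resolved = { ...defaults };
  for (const key of Object.keys(overrides) as (keyof AlertThresholds)[]) {
    const value = overrides[key];
    // Skip keys that are present but undefined so they cannot erase a default.
    if (value !== undefined) {
      resolved[key] = value;
    }
  }
  return resolved;
}

const resolved = resolveThresholds(EXAMPLE_DEFAULTS, { cpu: 70, memory: 4294967296 });
```

A plain object spread (`{ ...defaults, ...overrides }`) would be shorter, but it copies a key that is explicitly set to `undefined`, which would silently drop that default.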
| 179 | + |
| 180 | +## Integration with Zenduty |
| 181 | + |
| 182 | +The monitoring module supports Zenduty integration for incident management: |
| 183 | + |
| 184 | +```typescript |
| 185 | +const monitoring = new RDSMonitoringStack(this, 'RDSMonitoring', { |
| 186 | + // ... other configuration |
| 187 | + zendutyWebhookUrl: 'https://www.zenduty.com/api/v1/integrations/aws-cloudwatch/YOUR_INTEGRATION_KEY/', |
| 188 | +}); |
| 189 | +``` |
| 190 | + |
| 191 | +## Benefits of Separated Architecture |
| 192 | + |
| 193 | +1. **Better Organization**: Monitoring logic is separated from RDS creation logic |
| 194 | +2. **Reusability**: Monitoring module can be used independently |
| 195 | +3. **Maintainability**: Easier to update and maintain monitoring code |
| 196 | +4. **Flexibility**: Can create monitoring for existing RDS instances |
| 197 | +5. **Testing**: Easier to test monitoring logic in isolation |
| 198 | +6. **Customization**: More flexible configuration options |
| 199 | + |
| 200 | +## Migration from Old Architecture |
| 201 | + |
| 202 | +If you're migrating from the old architecture: |
| 203 | + |
| 204 | +1. **Remove old alert code** from your RDS module |
| 205 | +2. **Import the new modules**: |
| 206 | + ```typescript |
| 207 | + import { AlertThresholds, RDSMonitoringStack } from './src/monitoring'; |
| 208 | + ``` |
| 209 | +3. **Update your configuration** to use the new interfaces |
| 210 | +4. **Create separate monitoring instances** as needed |
| 211 | + |
| 212 | +## Best Practices |
| 213 | + |
| 214 | +1. **Use Dynamic Thresholds**: Let the module calculate appropriate thresholds based on instance type |
| 215 | +2. **Customize When Needed**: Override thresholds only when you have specific requirements |
| 216 | +3. **Monitor Both Primary and Replicas**: Set up monitoring for all instances |
| 217 | +4. **Use Zenduty Integration**: For better incident management |
| 218 | +5. **Test Your Alerts**: Verify that alerts are triggered appropriately |
| 219 | +6. **Document Your Thresholds**: Keep track of why you chose specific threshold values |