---
title: Quality of Service (QoS) Integration
---

# Quality of Service (QoS) Integration Guide

This guide explains how to integrate and use the Blueprint SDK's Quality of Service (QoS) system to add comprehensive observability, monitoring, and dashboard capabilities to any Blueprint. QoS provides unified metrics collection, log aggregation, heartbeat monitoring, and visualization through a cohesive interface.

## Prerequisites

- Understanding of Blueprint concepts and execution model
- Familiarity with Tangle Network architecture
- Basic knowledge of observability concepts (metrics, logging, monitoring)

## QoS Overview

The Blueprint QoS system provides a complete observability stack:

- **Heartbeat Service**: Sends periodic heartbeats to Tangle to prevent slashing
- **Metrics Collection**: Captures system and application metrics
- **Logging**: Aggregates logs via Loki for centralized querying
- **Dashboards**: Creates Grafana visualizations automatically
- **Server Management**: Optionally runs containerized instances of Prometheus, Loki, and Grafana

The QoS system is designed to be added to any Blueprint type (Tangle, Eigenlayer, P2P, or Cron) as a background service.

## Integrating QoS into a Blueprint

The integration process involves setting up the QoS configuration and implementing the `HeartbeatConsumer` trait. Here's a step-by-step guide.

### Main Blueprint Setup

```rust
#[tokio::main]
async fn main() -> Result<(), blueprint_sdk::Error> {
    let env = BlueprintEnvironment::load()?;

    // Create your Blueprint's primary context
    let context = MyContext::new(env.clone()).await?;

    // Configure the QoS system
    let qos_config = blueprint_qos::default_qos_config();
    let heartbeat_consumer = Arc::new(MyHeartbeatConsumer::new());

    // Standard Blueprint runner setup with QoS attached.
    // `handler`, `producer`, `consumer`, and `JOB_ID` are your Blueprint's
    // usual job handler, producer, consumer, and job identifier, defined elsewhere.
    BlueprintRunner::builder(TangleConfig::default(), env)
        .router(Router::new()
            .route(JOB_ID, handler.layer(TangleLayer))
            .with_context(context))
        .producer(producer)
        .consumer(consumer)
        .qos_service(qos_config, Some(heartbeat_consumer))
        .run()
        .await
}
```

### Implementing HeartbeatConsumer

To enable the heartbeat service, you must implement the `HeartbeatConsumer` trait, which is responsible for sending heartbeat signals to the Tangle Network:

```rust
#[derive(Clone)]
struct MyHeartbeatConsumer {
    // Add any fields required for heartbeat submission
}

impl HeartbeatConsumer for MyHeartbeatConsumer {
    fn consume_heartbeat(
        &self,
        service_id: u64,
        blueprint_id: u64,
        metrics_data: String,
    ) -> Result<(), Box<dyn std::error::Error + Send + Sync>> {
        // Implement your Blueprint-specific heartbeat logic here
        Ok(())
    }
}
```

## QoS Configuration Options

### Using Default Configuration

The simplest way to get started is with the default configuration:

```rust
let qos_config = blueprint_qos::default_qos_config();
```

This initializes a configuration with:

- Heartbeat service (disabled until configured)
- Metrics collection
- Loki logging
- Grafana integration
- Automatic server management set to `false`

### Custom Configuration

Customize the configuration for your specific needs:

```rust
let qos_config = QoSConfig {
    heartbeat: Some(HeartbeatConfig {
        service_id: Some(42),
        blueprint_id: Some(7),
        interval_seconds: 60,
        jitter_seconds: 5,
    }),
    metrics: Some(MetricsConfig::default()),
    loki: Some(LokiConfig::default()),
    grafana: Some(GrafanaConfig {
        endpoint: "http://localhost:3000".into(),
        admin_user: Some("admin".into()),
        admin_password: Some("admin".into()),
        folder: None,
    }),
    grafana_server: Some(GrafanaServerConfig::default()),
    loki_server: Some(LokiServerConfig::default()),
    prometheus_server: Some(PrometheusServerConfig::default()),
    docker_network: Some("blueprint-network".into()),
    manage_servers: true,
    service_id: Some(42),
    blueprint_id: Some(7),
    docker_bind_ip: Some("0.0.0.0".into()),
};
```
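
The `interval_seconds` and `jitter_seconds` fields control how often heartbeats fire and how much randomness is added so that many operators don't submit at exactly the same moment. The following is an illustrative sketch of deriving a jittered delay — not the SDK's internal scheduler, just the idea:

```rust
use std::time::Duration;

// Illustrative only: derive the next heartbeat delay from a base interval
// plus a bounded jitter. `seed` stands in for a random source.
fn next_heartbeat_delay(interval_seconds: u64, jitter_seconds: u64, seed: u64) -> Duration {
    let jitter = if jitter_seconds == 0 {
        0
    } else {
        seed % (jitter_seconds + 1) // jitter in [0, jitter_seconds]
    };
    Duration::from_secs(interval_seconds + jitter)
}
```

With `interval_seconds: 60` and `jitter_seconds: 5`, each delay lands somewhere in the 60–65 second range.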

### Using the Builder Pattern

The builder pattern provides a fluent API for configuration:

```rust
let qos_service = QoSServiceBuilder::new()
    .with_heartbeat_config(HeartbeatConfig {
        service_id: Some(service_id),
        blueprint_id: Some(blueprint_id),
        interval_seconds: 60,
        jitter_seconds: 5,
    })
    .with_heartbeat_consumer(Arc::new(consumer))
    .with_metrics_config(MetricsConfig::default())
    .with_loki_config(LokiConfig::default())
    .with_grafana_config(GrafanaConfig::default())
    .with_prometheus_server_config(PrometheusServerConfig {
        host: "0.0.0.0".into(),
        port: 9090,
        ..Default::default()
    })
    .manage_servers(true)
    .with_ws_rpc_endpoint(ws_endpoint)
    .with_keystore_uri(keystore_uri)
    .build()?;
```

## Recording Blueprint Metrics and Events

### Job Performance Tracking

Tracking job execution and performance in your job handlers is essential for monitoring and optimization:

```rust
pub async fn process_job(
    Context(ctx): Context<MyContext>,
    TangleArg(data): TangleArg<String>,
) -> Result<TangleResult<u64>> {
    let start_time = std::time::Instant::now();

    // Process the job
    let result = perform_processing(&data)?;

    // Record job execution metrics
    if let Some(qos) = &ctx.qos_service {
        qos.record_job_execution(
            JOB_ID,
            start_time.elapsed().as_secs_f64(),
            ctx.service_id,
            ctx.blueprint_id
        );
    }

    Ok(TangleResult::Success(result))
}
```

### Error Tracking

Tracking job errors is crucial for monitoring and alerting:

```rust
match perform_complex_operation() {
    Ok(value) => Ok(TangleResult::Success(value)),
    Err(e) => {
        if let Some(qos) = &ctx.qos_service {
            qos.record_job_error(JOB_ID, "complex_operation_failure");
        }
        Err(e.into())
    }
}
```

## Automatic Dashboard Creation

QoS can automatically create Grafana dashboards that display your Blueprint's metrics:

```rust
// Create a custom dashboard for your Blueprint
if let Some(mut qos) = qos_service {
    if let Err(e) = qos.create_dashboard("My Blueprint") {
        error!("Failed to create dashboard: {}", e);
    } else {
        info!("Created Grafana dashboard for My Blueprint");
    }
}
```

The dashboard includes:

- System resource usage (CPU, memory, disk, network)
- Job execution metrics (frequency, duration, error rates)
- Log visualization panels (when Loki is configured)
- Service status and uptime information

## Accessing QoS in Context

Typically, you'll want to store the QoS service in your Blueprint context:

```rust
#[derive(Clone)]
pub struct MyContext {
    #[config]
    pub env: BlueprintEnvironment,
    pub data_dir: PathBuf,
    pub qos_service: Option<Arc<QoSService<MyHeartbeatConsumer>>>,
    pub service_id: u64,
    pub blueprint_id: u64,
}

impl MyContext {
    pub async fn new(env: BlueprintEnvironment) -> Result<Self, Error> {
        // Initialize the QoS service (`initialize_qos` is your own setup helper)
        let qos_service = initialize_qos(&env)?;

        Ok(Self {
            data_dir: env.data_dir.clone().unwrap_or_else(default_data_dir),
            qos_service: Some(Arc::new(qos_service)),
            service_id: 42,
            blueprint_id: 7,
            env,
        })
    }
}
```

You can then access the QoS service in your job handlers:

```rust
pub async fn my_job(
    Context(ctx): Context<MyContext>,
    TangleArg(data): TangleArg<String>,
) -> Result<TangleResult<()>> {
    // Access the QoS metrics provider
    if let Some(qos) = &ctx.qos_service {
        if let Some(provider) = qos.provider() {
            let cpu_usage = provider.get_cpu_usage()?;
            info!("Current CPU usage: {}%", cpu_usage);
        }
    }

    // Job implementation
    Ok(TangleResult::Success(()))
}
```

## Server Management

QoS can automatically manage Grafana, Prometheus, and Loki servers:

```rust
// Configure server management
let qos_config = QoSConfig {
    grafana_server: Some(GrafanaServerConfig {
        port: 3000,
        container_name: "blueprint-grafana".into(),
        image: "grafana/grafana:latest".into(),
        ..Default::default()
    }),
    loki_server: Some(LokiServerConfig {
        port: 3100,
        container_name: "blueprint-loki".into(),
        image: "grafana/loki:latest".into(),
        ..Default::default()
    }),
    prometheus_server: Some(PrometheusServerConfig {
        port: 9090,
        container_name: "blueprint-prometheus".into(),
        image: "prom/prometheus:latest".into(),
        host: "0.0.0.0".into(),
        ..Default::default()
    }),
    docker_network: Some("blueprint-network".into()),
    manage_servers: true,
    ..Default::default()
};
```

For proper operation with Docker containers, ensure:

1. Your application binds metrics endpoints to `0.0.0.0` (not `127.0.0.1`)
2. Prometheus configuration uses `host.docker.internal` to access host metrics
3. Docker is installed and the user has the necessary permissions
4. A common Docker network is used for all containers
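
For point 2, a scrape job in `prometheus.yml` might look like the following sketch. The job name and port `9615` are illustrative assumptions, not values mandated by the SDK; substitute whatever port your metrics endpoint actually binds:

```yaml
# Sketch: scrape a metrics endpoint running on the Docker host.
# "blueprint" and port 9615 are placeholder assumptions.
scrape_configs:
  - job_name: "blueprint"
    static_configs:
      - targets: ["host.docker.internal:9615"]
```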

## Best Practices

✅ DO:

- Initialize QoS early in your Blueprint's startup sequence
- Add QoS as a background service (e.g. via `BlueprintRunner::background_service()`)
- Record job execution metrics for all important jobs
- Use `#[derive(Clone)]` on your `HeartbeatConsumer` implementation
- Access QoS APIs through your Blueprint's context

❌ DON'T:

- Don't create separate QoS instances for different components
- Don't hardcode admin credentials in production code
- Don't pass the QoS service directly between jobs; use the context pattern
- Don't forget to bind the Prometheus metrics server to `0.0.0.0` for Docker accessibility
- Don't ignore QoS creation or shutdown errors; they may indicate more serious issues

## QoS Components Reference

| Component         | Primary Struct     | Config                 | Purpose                                           |
| ----------------- | ------------------ | ---------------------- | ------------------------------------------------- |
| Unified Service   | `QoSService`       | `QoSConfig`            | Main entry point for QoS integration              |
| Heartbeat         | `HeartbeatService` | `HeartbeatConfig`      | Sends periodic liveness signals to chain          |
| Metrics           | `MetricsService`   | `MetricsConfig`        | Collects system and application metrics           |
| Logging           | N/A                | `LokiConfig`           | Configures log aggregation to Loki                |
| Dashboards        | `GrafanaClient`    | `GrafanaConfig`        | Creates and manages Grafana dashboards            |
| Server Management | `ServerManager`    | Various server configs | Manages Docker containers for observability stack |