Commit 55af1d1

feat: allow instrumentation of Termination lambda (#1255)
## Description

This PR adds the possibility to instrument the internal Lambda function, e.g. with APM tools. Use the following variables:

- `runner_terminate_ec2_lambda_handler` to replace the `handler` with your function
- `runner_terminate_ec2_environment_variables` to add environment variables. The special value `{HANDLER}` is automatically replaced by the internal handler name so that your function can call the "real" handler
- `runner_terminate_ec2_lambda_layer_arns` to add additional layers to the Lambda function
- `runner_terminate_ec2_lambda_egress_rules` to allow traffic to external systems. The default allows IPv4/IPv6 egress on port 443

1 parent d644987 commit 55af1d1

8 files changed: 199 additions & 15 deletions

docker_autoscaler_security_group.tf

Lines changed: 6 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -27,6 +27,8 @@ resource "aws_vpc_security_group_ingress_rule" "docker_autoscaler_ingress" {
2727
referenced_security_group_id = each.value.security_group
2828
cidr_ipv4 = each.value.cidr_block
2929
cidr_ipv6 = each.value.ipv6_cidr_block
30+
31+
tags = local.tags
3032
}
3133

3234
resource "aws_vpc_security_group_ingress_rule" "docker_autoscaler_internal_traffic" {
@@ -38,6 +40,8 @@ resource "aws_vpc_security_group_ingress_rule" "docker_autoscaler_internal_traff
3840
ip_protocol = "-1"
3941
description = "Allow ALL Ingress traffic between Runner Manager and Docker-autoscaler workers security group"
4042
referenced_security_group_id = aws_security_group.runner.id
43+
44+
tags = local.tags
4145
}
4246

4347
# Egress rules
@@ -55,4 +59,6 @@ resource "aws_vpc_security_group_egress_rule" "docker_autoscaler_egress" {
5559
referenced_security_group_id = each.value.security_group
5660
cidr_ipv4 = each.value.cidr_block
5761
cidr_ipv6 = each.value.ipv6_cidr_block
62+
63+
tags = local.tags
5864
}

docs/usage.md

Lines changed: 28 additions & 8 deletions
@@ -111,7 +111,7 @@ We have seen that the [fork](https://gitlab.com/cki-project/docker-machine/-/tre
111111
module is using consumes more RAM when using spot fleets. For comparison, if you launch 50 machines at the same time, it consumes
112112
~1.2GB of RAM. In our case, we had to change the `instance_type` of the runner from `t3.micro` to `t3.small`.
113113

114-
#### Configuration example
114+
#### Spot Fleet Configuration
115115

116116
```hcl
117117
module "runner" {
@@ -146,9 +146,11 @@ module "runner" {
146146

147147
### Scenario: Use of Docker autoscaler
148148

149-
As docker machine is no longer maintained by docker, gitlab recently developed docker autoscaler to replace docker machine (still in beta). An option is available to test it out.
149+
As docker machine is no longer maintained by docker, gitlab recently developed docker autoscaler to replace docker machine
150+
(still in beta). An option is available to test it out.
150151

151-
Tested with amazon-linux-2-x86 as runner manager and ubuntu-server-22-lts-x86 for runner worker. The following commands have been added to the original AMI for the runner worker for the docker-autoscaler to work correctly:
152+
Tested with amazon-linux-2-x86 as runner manager and ubuntu-server-22-lts-x86 for runner worker. The following commands have been
153+
added to the original AMI for the runner worker for the docker-autoscaler to work correctly:
152154

153155
```bash
154156
# Install docker
@@ -170,7 +172,7 @@ apt-get install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin do
170172
usermod -aG docker ubuntu
171173
```
172174

173-
#### Configuration example
175+
#### Docker Autoscaler Configuration
174176

175177
```hcl
176178
module "runner" {
@@ -253,9 +255,7 @@ If a KMS key is set via `kms_key_id`, make sure that you also give proper access
253255
get errors, e.g. the build cache can't be decrypted or logging via CloudWatch is not possible. For a CloudWatch
254256
example checkout [kms-policy.json](https://github.com/cattle-ops/terraform-aws-gitlab-runner/blob/main/policies/kms-policy.json)
255257

256-
### Auto Scaling Group
257-
258-
#### Scheduled scaling
258+
### Auto Scaling Group - Scheduled scaling
259259

260260
When `runner_schedule_enable=true`, the `runner_schedule_config` block can be used to scale the Auto Scaling group.
261261

@@ -281,7 +281,7 @@ module "runner" {
281281
}
282282
```
283283

284-
#### Graceful termination / Zero Downtime deployment
284+
### Graceful termination / Zero Downtime deployment
285285

286286
This module supports zero-downtime deployments by following a structured process:
287287

@@ -315,6 +315,26 @@ that executes a provided Lambda function when the runner is terminated to termin
315315
provisioned by the Docker Machine executor. A `builds/` directory relative to the root module persists that
316316
contains the packaged Lambda function.
317317

318+
### Instrumenting the Graceful termination Lambda
319+
320+
To instrument the Lambda function, set the following variables:
321+
322+
```hcl
323+
module "runner" {
324+
# ...
325+
runner_terminate_ec2_environment_variables = {
326+
variable1 = "here"
327+
variable2 = "are"
328+
old_handler = "{HANDLER}" # automatically replaced by the correct value
329+
}
330+
runner_terminate_ec2_lambda_egress_rules = {
331+
# ... whatever you need, IPv4/IPv6 port 443 is the default
332+
}
333+
runner_terminate_ec2_lambda_handler = "instrumented_handler.from.a.layer"
334+
runner_terminate_ec2_lambda_layer_arns = ["arn:aws:lambda:us-east-1:123456789012:layer:instrumented_handler:1"]
335+
}
336+
```
337+
318338
### Access the Runner instance
319339

320340
A few options are provided to access the runner instance:

main.tf

Lines changed: 6 additions & 0 deletions
@@ -379,6 +379,12 @@ module "terminate_agent_hook" {
379379
role_permissions_boundary = var.iam_permissions_boundary == "" ? null : "arn:${data.aws_partition.current.partition}:iam::${data.aws_caller_identity.current.account_id}:policy/${var.iam_permissions_boundary}"
380380
kms_key_id = local.kms_key_arn
381381
asg_hook_terminating_heartbeat_timeout = local.runner_worker_graceful_terminate_heartbeat_timeout
382+
environment_variables = var.runner_terminate_ec2_environment_variables
383+
lambda_handler = var.runner_terminate_ec2_lambda_handler
384+
layer_arns = var.runner_terminate_ec2_lambda_layer_arns
385+
egress_rules = var.runner_terminate_ec2_lambda_egress_rules
386+
vpc_id = var.vpc_id
387+
subnet_id = var.subnet_id
382388

383389
tags = local.tags
384390
}

modules/terminate-agent-hook/iam.tf

Lines changed: 6 additions & 1 deletion
@@ -37,7 +37,7 @@ resource "aws_iam_role" "lambda" {
3737
data "aws_iam_policy_document" "lambda" {
3838
# checkov:skip=CKV_AWS_111:Write access is limited to the resources needed
3939
statement {
40-
sid = "allow kms access"
40+
sid = "AllowKmsAccess"
4141
actions = [
4242
"kms:Decrypt", # to decrypt the Lambda environment variables
4343
]
@@ -167,3 +167,8 @@ resource "aws_iam_role_policy_attachment" "spot_request_housekeeping" {
167167
role = aws_iam_role.lambda.name
168168
policy_arn = aws_iam_policy.spot_request_housekeeping.arn
169169
}
170+
171+
resource "aws_iam_role_policy_attachment" "aws_lambda_vpc_access_execution_role" {
172+
role = aws_iam_role.lambda.name
173+
policy_arn = "arn:aws:iam::aws:policy/service-role/AWSLambdaVPCAccessExecutionRole"
174+
}
Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
1+
locals {
2+
original_lambda_handler = "lambda_function.handler"
3+
lambda_handler = var.lambda_handler != null ? var.lambda_handler : local.original_lambda_handler
4+
5+
replaced_environment_variables = { for key, value in var.environment_variables : key => replace(value, "{HANDLER}", local.original_lambda_handler) }
6+
}
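
The `{HANDLER}` substitution in the locals above can be tried in isolation. The following is a minimal sketch, assuming a scratch Terraform configuration; the input map `example_environment_variables` and its keys (`APM_SERVICE_NAME`, `WRAPPED_HANDLER`) are hypothetical and not part of the commit.

```hcl
# Scratch configuration reproducing the replacement logic from the locals above.
# All names except `original_lambda_handler` and the `{HANDLER}` placeholder
# are made up for illustration.
locals {
  # The module's built-in entry point, as defined in the commit.
  original_lambda_handler = "lambda_function.handler"

  # Hypothetical stand-in for var.environment_variables.
  example_environment_variables = {
    APM_SERVICE_NAME = "terminate-hook" # no placeholder, passed through unchanged
    WRAPPED_HANDLER  = "{HANDLER}"      # placeholder, replaced below
  }

  # Same for-expression as in the module: every value is scanned for the
  # literal substring "{HANDLER}" and it is swapped for the original handler.
  replaced_environment_variables = {
    for key, value in local.example_environment_variables :
    key => replace(value, "{HANDLER}", local.original_lambda_handler)
  }
}

# After `terraform apply`, WRAPPED_HANDLER is "lambda_function.handler"
# and APM_SERVICE_NAME is still "terminate-hook".
output "replaced_environment_variables" {
  value = local.replaced_environment_variables
}
```

Because the search string is not wrapped in forward slashes, `replace` treats it as a plain substring rather than a regular expression, so the placeholder also works when embedded in a longer value.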

modules/terminate-agent-hook/main.tf

Lines changed: 38 additions & 6 deletions
@@ -15,23 +15,48 @@ data "archive_file" "terminate_runner_instances_lambda" {
1515
output_file_mode = "0666"
1616
}
1717

18+
resource "aws_security_group" "terminate_runner_instances" {
19+
name = "${var.environment}-${var.name}"
20+
description = "Allowing access to external services for the terminate runner instances lambda"
21+
22+
vpc_id = var.vpc_id
23+
24+
tags = var.tags
25+
}
26+
27+
resource "aws_vpc_security_group_egress_rule" "docker_autoscaler_egress" {
28+
for_each = var.egress_rules
29+
30+
security_group_id = aws_security_group.terminate_runner_instances.id
31+
32+
from_port = each.value.from_port
33+
to_port = each.value.to_port
34+
ip_protocol = each.value.protocol
35+
36+
description = each.value.description
37+
prefix_list_id = each.value.prefix_list_id
38+
referenced_security_group_id = each.value.security_group
39+
cidr_ipv4 = each.value.cidr_block
40+
cidr_ipv6 = each.value.ipv6_cidr_block
41+
42+
tags = var.tags
43+
}
44+
1845
# tracing functions can be activated by the user
1946
# tfsec:ignore:aws-lambda-enable-tracing
2047
# kics-scan ignore-line
2148
resource "aws_lambda_function" "terminate_runner_instances" {
2249
#ts:skip=AC_AWS_0485:Tracing functions can be activated by the user
23-
#ts:skip=AC_AWS_0486 There is no need to run this lambda in our VPC
2450
# checkov:skip=CKV_AWS_50:Tracing functions can be activated by the user
2551
# checkov:skip=CKV_AWS_115:We do not assign a reserved concurrency as this function can't be called by users
2652
# checkov:skip=CKV_AWS_116:We should think about having a dead letter queue for this lambda
27-
# checkov:skip=CKV_AWS_117:There is no need to run this lambda in our VPC
2853
# checkov:skip=CKV_AWS_272:Code signing would be a nice enhancement, but I guess we can live without it here
2954
architectures = ["x86_64"]
3055
description = "Lifecycle hook for terminating GitLab runner agent instances"
3156
filename = data.archive_file.terminate_runner_instances_lambda.output_path
3257
source_code_hash = data.archive_file.terminate_runner_instances_lambda.output_base64sha256
3358
function_name = "${var.environment}-${var.name}"
34-
handler = "lambda_function.handler"
59+
handler = local.lambda_handler
3560
memory_size = 128
3661
package_type = "Zip"
3762
publish = true
@@ -40,12 +65,17 @@ resource "aws_lambda_function" "terminate_runner_instances" {
4065
timeout = var.timeout
4166
kms_key_arn = var.kms_key_id
4267

43-
tags = var.tags
68+
layers = [for layer_arn in var.layer_arns : layer_arn]
4469

4570
environment {
46-
variables = {
71+
variables = merge({
4772
NAME_EXECUTOR_INSTANCE = var.name_docker_machine_runners
48-
}
73+
}, local.replaced_environment_variables)
74+
}
75+
76+
vpc_config {
77+
security_group_ids = [aws_security_group.terminate_runner_instances.id]
78+
subnet_ids = [var.subnet_id]
4979
}
5080

5181
dynamic "tracing_config" {
@@ -55,6 +85,8 @@ resource "aws_lambda_function" "terminate_runner_instances" {
5585
mode = "Passthrough"
5686
}
5787
}
88+
89+
tags = var.tags
5890
}
5991

6092
resource "aws_lambda_permission" "current_version_triggers" {

modules/terminate-agent-hook/variables.tf

Lines changed: 42 additions & 0 deletions
@@ -77,3 +77,45 @@ variable "asg_hook_terminating_heartbeat_timeout" {
7777
error_message = "AWS only supports heartbeat timeout in the range of 30 to 7200."
7878
}
7979
}
80+
81+
variable "environment_variables" {
82+
description = "Environment variables to set for the Lambda function. A value of `{HANDLER}` is replaced with the handler value of the Lambda function."
83+
type = map(string)
84+
default = {}
85+
}
86+
87+
variable "layer_arns" {
88+
description = "A list of ARNs of Lambda layers to attach to the Lambda function."
89+
type = list(string)
90+
default = []
91+
}
92+
93+
variable "lambda_handler" {
94+
description = "The entry point for the Lambda function."
95+
type = string
96+
default = null
97+
}
98+
99+
variable "vpc_id" {
100+
description = "The VPC used for the runner and runner workers."
101+
type = string
102+
}
103+
104+
variable "subnet_id" {
105+
type = string
106+
description = "The subnet for the lambda function."
107+
}
108+
109+
variable "egress_rules" {
110+
description = "Map of egress rules for the Lambda function."
111+
type = map(object({
112+
from_port = optional(number, null)
113+
to_port = optional(number, null)
114+
protocol = string
115+
description = string
116+
cidr_block = optional(string, null)
117+
ipv6_cidr_block = optional(string, null)
118+
prefix_list_id = optional(string, null)
119+
security_group = optional(string, null)
120+
}))
121+
}

variables.tf

Lines changed: 67 additions & 0 deletions
@@ -447,6 +447,73 @@ variable "runner_terminate_ec2_timeout_duration" {
447447
default = 90
448448
}
449449

450+
variable "runner_terminate_ec2_environment_variables" {
451+
description = "Environment variables to set for the Lambda function. A value of `{HANDLER}` is replaced with the handler value of the Lambda function."
452+
type = map(string)
453+
default = {}
454+
}
455+
456+
variable "runner_terminate_ec2_lambda_handler" {
457+
description = "The handler for the terminate Lambda function."
458+
type = string
459+
default = null
460+
}
461+
462+
variable "runner_terminate_ec2_lambda_layer_arns" {
463+
description = "A list of ARNs of Lambda layers to attach to the Lambda function."
464+
type = list(string)
465+
default = []
466+
}
467+
468+
variable "runner_terminate_ec2_lambda_egress_rules" {
469+
description = "Map of egress rules for the Lambda function."
470+
type = map(object({
471+
from_port = optional(number, null)
472+
to_port = optional(number, null)
473+
protocol = string
474+
description = string
475+
cidr_block = optional(string, null)
476+
ipv6_cidr_block = optional(string, null)
477+
prefix_list_id = optional(string, null)
478+
security_group = optional(string, null)
479+
}))
480+
default = {
481+
allow_https_ipv4 = {
482+
cidr_block = "0.0.0.0/0"
483+
from_port = 443
484+
to_port = 443
485+
protocol = "tcp"
486+
description = "Allow HTTPS egress traffic to all destinations (IPv4)"
487+
},
488+
allow_https_ipv6 = {
489+
ipv6_cidr_block = "::/0"
490+
from_port = 443
491+
to_port = 443
492+
protocol = "tcp"
493+
description = "Allow HTTPS egress traffic to all destinations (IPv6)"
494+
}
495+
}
496+
497+
validation {
498+
condition = alltrue([
499+
for rule in values(var.runner_terminate_ec2_lambda_egress_rules) :
500+
contains(["-1", "tcp", "udp", "icmp", "icmpv6"], rule.protocol)
501+
])
502+
error_message = "Protocol must be '-1', 'tcp', 'udp', 'icmp', or 'icmpv6'."
503+
}
504+
505+
validation {
506+
condition = alltrue([
507+
for rule in values(var.runner_terminate_ec2_lambda_egress_rules) :
508+
(rule.cidr_block != null) ||
509+
(rule.ipv6_cidr_block != null) ||
510+
(rule.prefix_list_id != null) ||
511+
(rule.security_group != null)
512+
])
513+
error_message = "At least one destination must be specified."
514+
}
515+
}
516+
450517
/*
451518
* Runner Worker: The process created by the Runner on the host computing platform to run jobs.
452519
*/
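
As an illustration of the `runner_terminate_ec2_lambda_egress_rules` variable added above, the following sketch narrows the Lambda's egress to a single destination. It is an assumption-laden example, not a recommended setup: the rule key `allow_https_to_apm`, the CIDR `203.0.113.0/24` (a documentation range used as a placeholder), and the description text are all hypothetical.

```hcl
module "runner" {
  # ... source and all other required settings omitted

  # Setting this variable replaces the default map entirely, so the two
  # default HTTPS rules (IPv4 and IPv6 to all destinations) no longer apply.
  runner_terminate_ec2_lambda_egress_rules = {
    allow_https_to_apm = {
      cidr_block  = "203.0.113.0/24" # placeholder destination, adjust to your collector
      from_port   = 443
      to_port     = 443
      protocol    = "tcp" # must be one of "-1", "tcp", "udp", "icmp", "icmpv6"
      description = "Allow HTTPS egress to the APM collector (example)"
    }
  }
}
```

Each rule has to pass both validation blocks: the protocol must be in the allowed list, and at least one of `cidr_block`, `ipv6_cidr_block`, `prefix_list_id`, or `security_group` must be set. A rule that only sets ports and a protocol is rejected with "At least one destination must be specified."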
