[CI] Add Terraform resources for daily CronJob that processes LLVM commits #495

Open · wants to merge 1 commit into main

8 changes: 8 additions & 0 deletions premerge/gke_cluster/main.tf
@@ -12,6 +12,10 @@ resource "google_container_cluster" "llvm_premerge" {
  # for adding windows nodes to the cluster.
  networking_mode = "VPC_NATIVE"
  ip_allocation_policy {}

  workload_identity_config {
Contributor

At least according to the non-TF docs, changing this affects new node pools. Does it change any of the defaults for node pools created through TF?

Contributor Author

Existing node pools created through TF should keep their original default values.

Based on the workload identity federation docs, new node pools created through TF will have workload identity enabled since the cluster has it enabled. However, it seems we can explicitly add workload_metadata_config { mode = "GCE_METADATA" } to opt specific node pools out of it.
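
A minimal sketch of that opt-out, assuming a hypothetical node pool (the name, machine type, and node count here are illustrative):

```hcl
# Hypothetical node pool that opts out of Workload Identity Federation even
# though the cluster has a workload_pool configured. Pods on these nodes keep
# using the node's service account via the Compute Engine metadata server.
resource "google_container_node_pool" "example_opt_out" {
  name       = "example-opt-out"
  cluster    = google_container_cluster.llvm_premerge.id
  node_count = 1

  node_config {
    machine_type = "e2-highcpu-4"

    workload_metadata_config {
      mode = "GCE_METADATA"
    }
  }
}
```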

Although, looking back through the docs now, there appears to be some risk with updating the existing service node pools:

Caution: Modifying the node pool immediately enables Workload Identity Federation for GKE for any workloads running in the node pool. This prevents the workloads from using the service account that your nodes use and might result in disruptions.

I'm not too familiar with what existing workloads are running on these nodes, but they may break if they're using the node's service account. Perhaps we want a separate node pool for this after all?

Contributor

None of our workloads access any GCP services, so I think this should be fine. If the update is supported in place, I would imagine the underlying GKE services will transition over fine, or won't even use this.

I think we can keep things as is for the existing node pools. I'll have to work through https://cloud.google.com/kubernetes-engine/docs/concepts/workload-identity later today, but it doesn't look like there's significant security risk either way given we aren't using any GCP services anywhere else.

Contributor Author

It may be worth mentioning that modifying the Compute Engine default service account is also an alternative here; we could probably avoid creating new service accounts by just granting the BigQuery permissions to the existing default service account (a rough Terraform sketch of this is below).

I still think workload identity is a better solution though, since using the default service account would mean violating least privilege by granting these permissions to every single attached workload in the project instead of just the CronJob that actually needs them. Just thought it'd be worth bringing up for visibility.

The docs listing the alternatives are at https://cloud.google.com/kubernetes-engine/docs/concepts/workload-identity#alternatives_to
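
A rough sketch of that alternative (not what this PR does); how the default compute service account is looked up here is an assumption:

```hcl
# Sketch only: grant the BigQuery role directly to the Compute Engine default
# service account instead of creating a dedicated GSA with Workload Identity.
data "google_compute_default_service_account" "default" {}

resource "google_project_iam_member" "default_sa_bigquery_jobuser" {
  project = data.google_compute_default_service_account.default.project
  role    = "roles/bigquery.jobUser"
  member  = "serviceAccount:${data.google_compute_default_service_account.default.email}"
}
```

This is broader than a dedicated service account, since every workload running as the default service account would inherit the role.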

    workload_pool = "llvm-premerge-checks.svc.id.goog"
  }
}

resource "google_container_node_pool" "llvm_premerge_linux_service" {
@@ -23,6 +27,10 @@ resource "google_container_node_pool" "llvm_premerge_linux_service" {

  node_config {
    machine_type = "e2-highcpu-4"

    workload_metadata_config {
      mode = "GKE_METADATA"
    }
    # Terraform wants to recreate the node pool every time when running
    # terraform apply unless we explicitly set this.
    # TODO(boomanaiden154): Look into why terraform is doing this so we do
103 changes: 103 additions & 0 deletions premerge/main.tf
@@ -190,3 +190,106 @@ resource "kubernetes_manifest" "metrics_deployment" {

  depends_on = [kubernetes_namespace.metrics, kubernetes_secret.metrics_secrets]
}

# Resources for collecting LLVM operational metrics data

# Service accounts and bindings to grant access to the
# BigQuery API for our cronjob
resource "google_service_account" "operational_metrics_gsa" {
account_id = "operational-metrics-gsa"
display_name = "Operational Metrics GSA"
}

resource "google_project_iam_binding" "bigquery_jobuser_binding" {
project = google_service_account.operational_metrics_gsa.project
role = "roles/bigquery.jobUser"

members = [
"serviceAccount:${google_service_account.operational_metrics_gsa.email}",
]

depends_on = [google_service_account.operational_metrics_gsa]
}

resource "kubernetes_namespace" "operational_metrics" {
metadata {
name = "operational-metrics"
}
provider = kubernetes.llvm-premerge-us-central
}

resource "kubernetes_service_account" "operational_metrics_ksa" {
metadata {
name = "operational-metrics-ksa"
namespace = "operational-metrics"
annotations = {
"iam.gke.io/gcp-service-account" = google_service_account.operational_metrics_gsa.email
}
}

depends_on = [kubernetes_namespace.operational_metrics]
}

resource "google_service_account_iam_binding" "workload_identity_binding" {
service_account_id = google_service_account.operational_metrics_gsa.name
role = "roles/iam.workloadIdentityUser"

members = [
"serviceAccount:${google_service_account.operational_metrics_gsa.project}.svc.id.goog[operational-metrics/operational-metrics-ksa]",
]

depends_on = [
google_service_account.operational_metrics_gsa,
kubernetes_service_account.operational_metrics_ksa,
]
}

# The container for scraping LLVM commits needs persistent storage
# for a local check-out of llvm/llvm-project
Contributor

Why does this need to be stored persistently? It's pretty cheap to clone LLVM, and I think a PVC adds unnecessary complexity and makes things stateful.

Contributor Author

I neglected to mention this, but there's also a persistent file that keeps track of the last commits we've seen. Originally, the script was going to run at a more frequent cadence, so we wanted to track the commits we'd already seen to avoid reprocessing them.

Now that the script only scrapes a day's worth of data at a time, maybe we don't need persistent state to track commits we've seen, although it might still be valuable for ensuring the quality of the commit data between iterations.

Contributor

Using a PVC for a persistent file would make more sense.

I still think it's a bit of an antipattern though. If you want to ensure you're only looking at new commits and it's a cron job, you can just look at the last 24 hours of commits (which it seems like you're already doing?). Making this stateless keeps things quite a bit simpler and aligns more with how k8s expects workloads to work.

Contributor Author

Addressed removal of dependency on persistent storage in #501

resource "kubernetes_persistent_volume_claim" "operational_metrics_pvc" {
metadata {
name = "operational-metrics-pvc"
namespace = "operational-metrics"
}

spec {
access_modes = ["ReadWriteOnce"]
resources {
requests = {
storage = "20Gi"
}
}
storage_class_name = "standard-rwo"
}

depends_on = [kubernetes_namespace.operational_metrics]
}

resource "kubernetes_secret" "operational_metrics_secrets" {
Contributor

Why does this need a separate GitHub token instead of reusing one of the existing ones?

Contributor Author

It's the same GitHub token, just under a separate secrets object to keep the premerge metrics and the operational metrics separate.

That said, I'm not opposed to scrapping this and just reusing the metrics secrets if that's more appropriate.

Contributor

I don't think this creates any tangible separation if they're the same token. You should reuse the metrics container secret, but probably rename the kubernetes_secret object and maybe the underlying GCP object. You'll need to use a terraform moved block (https://developer.hashicorp.com/terraform/language/modules/develop/refactoring#moved-block-syntax) so that TF doesn't try to delete and recreate everything.
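
A minimal sketch of that moved block, assuming the secret resource is renamed (the new resource name here is illustrative):

```hcl
# Tells Terraform the existing secret was renamed in configuration, so it
# updates state instead of destroying and recreating the object.
moved {
  from = kubernetes_secret.metrics_secrets
  to   = kubernetes_secret.ci_metrics_secrets
}
```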

  metadata {
    name      = "operational-metrics-secrets"
    namespace = "operational-metrics"
  }

  data = {
    "github-token"           = data.google_secret_manager_secret_version.metrics_github_pat.secret_data
    "grafana-api-key"        = data.google_secret_manager_secret_version.metrics_grafana_api_key.secret_data
    "grafana-metrics-userid" = data.google_secret_manager_secret_version.metrics_grafana_metrics_userid.secret_data
  }

  type       = "Opaque"
  provider   = kubernetes.llvm-premerge-us-central
  depends_on = [kubernetes_namespace.operational_metrics]
}

resource "kubernetes_manifest" "operational_metrics_cronjob" {
manifest = yamldecode(file("operational_metrics_cronjob.yaml"))
provider = kubernetes.llvm-premerge-us-central

depends_on = [
kubernetes_namespace.operational_metrics,
kubernetes_persistent_volume_claim.operational_metrics_pvc,
kubernetes_secret.operational_metrics_secrets,
kubernetes_service_account.operational_metrics_ksa,
]
}
52 changes: 52 additions & 0 deletions premerge/operational_metrics_cronjob.yaml
@@ -0,0 +1,52 @@
# operational_metrics_cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: operational-metrics-cronjob
  namespace: operational-metrics
spec:
  # Midnight PDT
  schedule: "0 7 * * *"
  timeZone: "Etc/UTC"
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: operational-metrics-ksa
          nodeSelector:
            iam.gke.io/gke-metadata-server-enabled: "true"
          volumes:
            - name: metrics-volume
              persistentVolumeClaim:
                claimName: operational-metrics-pvc
          containers:
            - name: process-llvm-commits
              image: ghcr.io/llvm/operations-metrics:latest
              env:
                - name: GITHUB_TOKEN
                  valueFrom:
                    secretKeyRef:
                      name: operational-metrics-secrets
                      key: github-token
                - name: GRAFANA_API_KEY
                  valueFrom:
                    secretKeyRef:
                      name: operational-metrics-secrets
                      key: grafana-api-key
                - name: GRAFANA_METRICS_USERID
                  valueFrom:
                    secretKeyRef:
                      name: operational-metrics-secrets
                      key: grafana-metrics-userid
              volumeMounts:
                - name: metrics-volume
                  mountPath: "/data"
              resources:
                requests:
                  cpu: "250m"
                  memory: "256Mi"
                limits:
                  cpu: "1"
                  memory: "512Mi"
          restartPolicy: OnFailure