[CI] Add Terraform resources for daily CronJob that processes LLVM commits #495
@@ -190,3 +190,106 @@ resource "kubernetes_manifest" "metrics_deployment" {

  depends_on = [kubernetes_namespace.metrics, kubernetes_secret.metrics_secrets]
}

# Resources for collecting LLVM operational metrics data

# Service accounts and bindings to grant access to the
# BigQuery API for our cronjob
resource "google_service_account" "operational_metrics_gsa" {
  account_id   = "operational-metrics-gsa"
  display_name = "Operational Metrics GSA"
}

resource "google_project_iam_binding" "bigquery_jobuser_binding" {
  project = google_service_account.operational_metrics_gsa.project
  role    = "roles/bigquery.jobUser"

  members = [
    "serviceAccount:${google_service_account.operational_metrics_gsa.email}",
  ]

  depends_on = [google_service_account.operational_metrics_gsa]
}
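One side note on the IAM resource choice, not part of the diff: google_project_iam_binding is authoritative for its role at the project level, so Terraform will remove any roles/bigquery.jobUser members granted outside this configuration. If other principals hold that role, the non-authoritative google_project_iam_member may be the safer choice; a minimal sketch of the equivalent grant:

resource "google_project_iam_member" "bigquery_jobuser_member" {
  # Non-authoritative: manages only this one membership and leaves
  # other holders of roles/bigquery.jobUser untouched.
  project = google_service_account.operational_metrics_gsa.project
  role    = "roles/bigquery.jobUser"
  member  = "serviceAccount:${google_service_account.operational_metrics_gsa.email}"
}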
resource "kubernetes_namespace" "operational_metrics" {
  metadata {
    name = "operational-metrics"
  }
  provider = kubernetes.llvm-premerge-us-central
}

resource "kubernetes_service_account" "operational_metrics_ksa" {
  metadata {
    name      = "operational-metrics-ksa"
    namespace = "operational-metrics"
    annotations = {
      "iam.gke.io/gcp-service-account" = google_service_account.operational_metrics_gsa.email
    }
  }

  depends_on = [kubernetes_namespace.operational_metrics]
}

resource "google_service_account_iam_binding" "workload_identity_binding" {
  service_account_id = google_service_account.operational_metrics_gsa.name
  role               = "roles/iam.workloadIdentityUser"

  members = [
    "serviceAccount:${google_service_account.operational_metrics_gsa.project}.svc.id.goog[operational-metrics/operational-metrics-ksa]",
  ]

  depends_on = [
    google_service_account.operational_metrics_gsa,
    kubernetes_service_account.operational_metrics_ksa,
  ]
}

# The container for scraping LLVM commits needs persistent storage
# for a local check-out of llvm/llvm-project
Review comment: Why does this need to be stored persistently? It's pretty cheap to clone LLVM, and I think a PVC adds unnecessary complexity, on top of making things more complicated because they are now stateful.

Reply: I neglected to mention this, but there's also a persistent file that keeps track of the last commits we've seen. Originally the script was to run at a more frequent cadence, so we wanted to track commits we'd already seen to avoid reprocessing them. Now that the script only scrapes a day's worth of data at a time, maybe we don't need persistent state for that, although it might still be valuable for ensuring the quality of the commit data between iterations.
resource "kubernetes_persistent_volume_claim" "operational_metrics_pvc" {
  metadata {
    name      = "operational-metrics-pvc"
    namespace = "operational-metrics"
  }

  spec {
    access_modes = ["ReadWriteOnce"]
    resources {
      requests = {
        storage = "20Gi"
      }
    }
    storage_class_name = "standard-rwo"
  }

  depends_on = [kubernetes_namespace.operational_metrics]
}

resource "kubernetes_secret" "operational_metrics_secrets" {
Review comment: Why does this need a separate GitHub token instead of reusing one of the existing ones?

Reply: It's the same GitHub token, just under a separate secrets object to keep the premerge metrics and the operational metrics separate. I'm not opposed to scrapping this and just reusing the metrics secrets, though, if that's more appropriate.
  metadata {
    name      = "operational-metrics-secrets"
    namespace = "operational-metrics"
  }

  data = {
    "github-token"           = data.google_secret_manager_secret_version.metrics_github_pat.secret_data
    "grafana-api-key"        = data.google_secret_manager_secret_version.metrics_grafana_api_key.secret_data
    "grafana-metrics-userid" = data.google_secret_manager_secret_version.metrics_grafana_metrics_userid.secret_data
  }

  type       = "Opaque"
  provider   = kubernetes.llvm-premerge-us-central
  depends_on = [kubernetes_namespace.operational_metrics]
}

resource "kubernetes_manifest" "operational_metrics_cronjob" {
  manifest = yamldecode(file("operational_metrics_cronjob.yaml"))
  provider = kubernetes.llvm-premerge-us-central

  depends_on = [
    kubernetes_namespace.operational_metrics,
    kubernetes_persistent_volume_claim.operational_metrics_pvc,
    kubernetes_secret.operational_metrics_secrets,
    kubernetes_service_account.operational_metrics_ksa,
  ]
}
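A small robustness note on the manifest path, offered as a suggestion rather than part of the diff: Terraform's file() resolves bare relative paths against the working directory Terraform is invoked from, so anchoring the path to the module directory with path.module avoids surprises when Terraform runs from elsewhere. A minimal sketch of the same resource with that change:

resource "kubernetes_manifest" "operational_metrics_cronjob" {
  # Resolve the manifest relative to this module rather than the CWD.
  manifest = yamldecode(file("${path.module}/operational_metrics_cronjob.yaml"))
  provider = kubernetes.llvm-premerge-us-central
}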
@@ -0,0 +1,52 @@
# operational_metrics_cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: operational-metrics-cronjob
  namespace: operational-metrics
spec:
  # Midnight PDT
  schedule: "0 7 * * *"
  timeZone: "Etc/UTC"
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: operational-metrics-ksa
          nodeSelector:
            iam.gke.io/gke-metadata-server-enabled: "true"
          volumes:
            - name: metrics-volume
              persistentVolumeClaim:
                claimName: operational-metrics-pvc
          containers:
            - name: process-llvm-commits
              image: ghcr.io/llvm/operations-metrics:latest
              env:
                - name: GITHUB_TOKEN
                  valueFrom:
                    secretKeyRef:
                      name: operational-metrics-secrets
                      key: github-token
                - name: GRAFANA_API_KEY
                  valueFrom:
                    secretKeyRef:
                      name: operational-metrics-secrets
                      key: grafana-api-key
                - name: GRAFANA_METRICS_USERID
                  valueFrom:
                    secretKeyRef:
                      name: operational-metrics-secrets
                      key: grafana-metrics-userid
              volumeMounts:
                - name: metrics-volume
                  mountPath: "/data"
              resources:
                requests:
                  cpu: "250m"
                  memory: "256Mi"
                limits:
                  cpu: "1"
                  memory: "512Mi"
          restartPolicy: OnFailure
Review comment: At least for the non-TF docs, changing this would cause changes in new node pools. Does this change any of the defaults for node pools created through TF?

Reply: Existing node pools created through TF should keep their original default values. Based on the workload identity federation docs, new node pools created through TF will have workload identity enabled, since the cluster has it enabled. It seems we can explicitly add workload_metadata_config { mode = "GCE_METADATA" } to disable it on unwanted nodes, however. That said, looking back through the docs now, there appears to be some risk in updating the existing service node pools: I'm not too familiar with what existing workloads are running on these nodes, but they may break if they're using the node's service account. Perhaps we want a separate node pool for this after all?
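If a dedicated pool does turn out to be the safer option, a minimal sketch of what it might look like in Terraform follows. This is illustrative only: the resource name, cluster reference, and machine type are hypothetical placeholders, not part of this PR.

resource "google_container_node_pool" "operational_metrics_pool" {
  # Hypothetical dedicated pool for the operational-metrics CronJob.
  name       = "operational-metrics-pool"
  cluster    = google_container_cluster.llvm_premerge.id # assumed name of the existing cluster resource
  node_count = 1

  node_config {
    machine_type = "e2-standard-2" # placeholder

    # Run the GKE metadata server on these nodes so Workload Identity
    # works for the CronJob's service account; this is also what makes
    # nodes match the iam.gke.io/gke-metadata-server-enabled nodeSelector
    # used in the CronJob manifest. Existing pools could instead set
    # mode = "GCE_METADATA" to keep the legacy behavior.
    workload_metadata_config {
      mode = "GKE_METADATA"
    }
  }
}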