Add "User Group Diagnostics" Grafana dashboard #6065

Merged
jnywong merged 9 commits into 2i2c-org:main on May 19, 2025

Conversation

jnywong
Member

@jnywong jnywong commented May 16, 2025

This PR adds a new "User Group Diagnostics" Grafana dashboard that complements the "User Diagnostics Dashboard" to show resource usage aggregated by user group.

The dashboard requires jupyterhub-groups-exporter to be set up on the hub; see below for what happens on hubs without it. I have created a fork of the upstream dashboards to use while we validate the implementation across 2i2c-hosted hubs.

For this to work across all of our infrastructure, I decided to keep the user group dashboard entirely separate from the user-name-based aggregation. If I combined user name and user group into a single dashboard or PromQL query, the dashboard would break on hubs without jupyterhub-groups-exporter, because the jupyterhub_user_group_info metric would be unavailable to the query. As a result, on a hub without jupyterhub-groups-exporter the "User Diagnostics" dashboard works as normal, while the "User Group Diagnostics" dashboard shows no data.
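To illustrate the reasoning, here is a minimal Python sketch of an inner join between per-user usage and group membership, which is roughly how a PromQL vector match against jupyterhub_user_group_info behaves. The function name, the sample usernames, and the group names are all hypothetical; this is not the query used in the dashboards.

```python
from collections import defaultdict


def join_usage_with_groups(usage_by_user, group_info):
    """Inner-join per-user usage with (user, group) membership pairs.

    A user with no matching group-info sample contributes nothing,
    similar in spirit to a PromQL vector match with no right-hand side.
    """
    groups_of = defaultdict(list)
    for user, group in group_info:
        groups_of[user].append(group)

    usage_by_group = defaultdict(float)
    for user, value in usage_by_user.items():
        for group in groups_of.get(user, []):
            usage_by_group[group] += value
    return dict(usage_by_group)


# Hypothetical sample data: memory usage in GiB per user.
usage = {"alice@example.org": 2.0, "bob@example.org": 1.5}

# Hub with the exporter: group membership samples exist.
with_exporter = join_usage_with_groups(
    usage,
    [
        ("alice@example.org", "research"),
        ("bob@example.org", "research"),
        ("bob@example.org", "teaching"),
    ],
)

# Hub without the exporter: the group-info metric has no samples at all.
without_exporter = join_usage_with_groups(usage, [])

print(with_exporter)
print(without_exporter)
```

With no group-info samples the join returns nothing, even though per-user usage exists. A combined dashboard built on such a join would therefore lose its per-user panels too, which is why the two dashboards are kept separate.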

The "User Diagnostics" dashboard included in this PR differs from the upstream version of user.jsonnet, because the upstream version is technically a "Pod Diagnostics" dashboard. This PR aggregates pod-level data on a per-user basis and uses unescaped usernames from the kube_pod_annotations metric, rather than the limited character sets available in kube_pod_labels.
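A rough Python sketch of the two ideas, under stated assumptions: `escape_for_label` is a hypothetical escaping scheme that only illustrates how a restricted character set mangles a username (it is not the scheme used by any particular spawner), and the pod samples are made up.

```python
import string
from collections import defaultdict

# Assumption: only lowercase letters and digits pass through unescaped.
SAFE_CHARS = set(string.ascii_lowercase + string.digits)


def escape_for_label(username):
    """Hypothetical escaping: replace each unsafe character with '-' plus its hex code."""
    return "".join(
        ch if ch in SAFE_CHARS else f"-{ord(ch):02x}" for ch in username
    )


def usage_per_user(samples):
    """Sum pod-level values per username taken from the pod annotation."""
    totals = defaultdict(float)
    for _pod, username, value in samples:
        totals[username] += value
    return dict(totals)


# Hypothetical pod-level samples: (pod name, annotated username, memory in GiB).
pod_samples = [
    ("jupyter-pod-a", "alice@example.org", 1.0),
    ("jupyter-pod-b", "alice@example.org", 0.5),
    ("jupyter-pod-c", "bob.smith@example.org", 2.0),
]

per_user = usage_per_user(pod_samples)
escaped = escape_for_label("alice@example.org")

print(per_user)  # keyed by the readable, unescaped username
print(escaped)   # what a label-safe version of the same name could look like
```

Keying on the annotation value keeps the username readable in the dashboard, and summing over pods turns a per-pod view into a per-user one.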

Note

The metrics are only available as time series from the date the jupyterhub-groups-exporter service was first manually deployed, so some PromQL in this PR is invalid for periods before deployment (say, before 18 May 2025). If you see an execution error in the dashboard, try selecting a more recent time window.
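As a small sketch of that workaround, the helper below moves the start of a query window forward to the point where the metric first exists. The function name is hypothetical and the deployment date is only the example date from the note, not a confirmed value for any hub.

```python
from datetime import datetime, timezone


def clamp_window(start, end, metrics_available_from):
    """Return (start, end) with start moved forward to when the metric first exists.

    Returns None if the whole window predates the metric.
    """
    if end <= metrics_available_from:
        return None
    return max(start, metrics_available_from), end


# Example date only; the real deployment date differs per hub.
deployed = datetime(2025, 5, 18, tzinfo=timezone.utc)

window = clamp_window(
    datetime(2025, 5, 1, tzinfo=timezone.utc),
    datetime(2025, 5, 25, tzinfo=timezone.utc),
    deployed,
)
too_early = clamp_window(
    datetime(2025, 5, 1, tzinfo=timezone.utc),
    datetime(2025, 5, 10, tzinfo=timezone.utc),
    deployed,
)

print(window)
print(too_early)
```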

Ref: #5983

@jnywong jnywong self-assigned this May 16, 2025
@jnywong jnywong requested a review from GeorgianaElena May 16, 2025 11:06
Member Author

jnywong commented May 16, 2025

Hey @GeorgianaElena! I just requested a review from you to check whether you think this is okay to roll out across all of our clusters.

I think it should be fine, but I would value your infra eng experience here to give a quick green/red light :)

@jnywong jnywong removed the request for review from GeorgianaElena May 16, 2025 11:27
@jnywong jnywong marked this pull request as draft May 16, 2025 11:27
Member Author

jnywong commented May 16, 2025

Actually, not ready for review yet, just found a bug!

@jnywong jnywong marked this pull request as ready for review May 16, 2025 12:55
@jnywong jnywong requested a review from GeorgianaElena May 16, 2025 12:55
Member

@GeorgianaElena GeorgianaElena left a comment

Great work @jnywong <3 ! I didn't pay too much attention to the queries, so this review is more from an infra perspective.

If you look at

if cluster_provider == "aws":
    print_colour("Deploying cloud cost dashboards to an AWS cluster...")
    subprocess.check_call(
        [
            "./deploy.py",
            grafana_url,
            "--dashboards-dir=../grafana-dashboards",
            "--folder-name=Cloud cost dashboards",
            "--folder-uid=cloud-cost",
        ],
        env=deploy_script_env,
        cwd="jupyterhub-grafana-dashboards",
    )
print_colour(f"Done! Dashboards deployed to {grafana_url}.")

The dashboards here are deployed on AWS only and are categorized as cost-related dashboards.

In my opinion, this work should ideally live upstream, in jupyterhub/grafana-dashboards. I believe 2i2c uses that repo more intensively than anyone else, so we shouldn't be impacting lots of people if we were to merge this ourselves.

If you want to be on the safe side, we could also host them under 2i2c for, let's say, a month, validate that they are correct, and then upstream them. In the meantime, we would also open an upstream issue about this intention.

If we are to go with option two, then we need to update the deployer command and the general structure of this directory because it looks like it's growing to be more than cost-related dashboards.
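One possible shape for that change, as a sketch only: describe each dashboard folder as data and build the `./deploy.py` arguments from it, so that cost dashboards stay AWS-only while other folders deploy everywhere. The flags are the ones used in the snippet above; the `DASHBOARD_FOLDERS` layout, the subdirectory names, the second folder's name and UID, and both function names are hypothetical.

```python
def build_deploy_command(grafana_url, dashboards_dir, folder_name, folder_uid):
    """Build the argument list for ./deploy.py without running it."""
    return [
        "./deploy.py",
        grafana_url,
        f"--dashboards-dir={dashboards_dir}",
        f"--folder-name={folder_name}",
        f"--folder-uid={folder_uid}",
    ]


# Hypothetical folder layout; "providers": None means "deploy on every provider".
DASHBOARD_FOLDERS = [
    {
        "dir": "../grafana-dashboards/cost",
        "name": "Cloud cost dashboards",
        "uid": "cloud-cost",
        "providers": {"aws"},
    },
    {
        "dir": "../grafana-dashboards/usage",
        "name": "User diagnostics dashboards",
        "uid": "user-diagnostics",
        "providers": None,
    },
]


def commands_for_cluster(grafana_url, cluster_provider):
    """Return the deploy commands that apply to a cluster on the given provider."""
    commands = []
    for folder in DASHBOARD_FOLDERS:
        providers = folder["providers"]
        if providers is None or cluster_provider in providers:
            commands.append(
                build_deploy_command(
                    grafana_url, folder["dir"], folder["name"], folder["uid"]
                )
            )
    return commands


aws_commands = commands_for_cluster("https://grafana.example.org", "aws")
gcp_commands = commands_for_cluster("https://grafana.example.org", "gcp")

print(len(aws_commands), len(gcp_commands))
```

Each returned list could then be passed to `subprocess.check_call` with the same `env` and `cwd` as in the existing code.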

What do you think?

Member Author

jnywong commented May 19, 2025

Your infra perspective is exactly what I needed :)

Yeah, eventually I would like to upstream this, but like you said, to be on the safe side we can validate and host this on the 2i2c side while we wait.

I'll make changes to this PR to update the deployer command then. Thanks!

Merging this PR will trigger the following deployment actions.

Support deployments

No support upgrades will be triggered

Staging deployments

| Cloud Provider | Cluster Name | Hub Name | Reason for Redeploy |
| --- | --- | --- | --- |
| aws | nmfs-openscapes | staging | Core infrastructure has been modified |
| aws | openscapes | staging | Core infrastructure has been modified |
| gcp | cloudbank | staging | Core infrastructure has been modified |
| gcp | 2i2c | staging | Core infrastructure has been modified |
| gcp | 2i2c | dask-staging | Core infrastructure has been modified |
| gcp | 2i2c | ucmerced-staging | Core infrastructure has been modified |
| aws | nasa-ghg | staging | Core infrastructure has been modified |
| aws | maap | staging | Core infrastructure has been modified |
| gcp | hhmi | staging | Core infrastructure has been modified |
| aws | reflective | staging | Core infrastructure has been modified |
| aws | disasters | staging | Core infrastructure has been modified |
| aws | smithsonian | staging | Core infrastructure has been modified |
| gcp | awi-ciroh | staging | Core infrastructure has been modified |
| kubeconfig | utoronto | staging | Core infrastructure has been modified |
| kubeconfig | utoronto | r-staging | Core infrastructure has been modified |
| aws | 2i2c-aws-us | staging | Core infrastructure has been modified |
| aws | 2i2c-aws-us | dask-staging | Core infrastructure has been modified |
| aws | jupyter-health | staging | Core infrastructure has been modified |
| aws | nasa-veda | staging | Core infrastructure has been modified |
| aws | nasa-cryo | staging | Core infrastructure has been modified |
| gcp | leap | staging | Core infrastructure has been modified |
| gcp | 2i2c-uk | staging | Core infrastructure has been modified |
| aws | projectpythia | staging | Core infrastructure has been modified |
| aws | strudel | staging | Core infrastructure has been modified |
| gcp | catalystproject-latam | staging | Core infrastructure has been modified |
| aws | catalystproject-africa | staging | Core infrastructure has been modified |
| kubeconfig | 2i2c-jetstream2 | staging | Core infrastructure has been modified |
| aws | opensci | staging | Core infrastructure has been modified |
| aws | victor | staging | Core infrastructure has been modified |
| aws | earthscope | staging | Core infrastructure has been modified |
| aws | ubc-eoas | staging | Core infrastructure has been modified |
| gcp | climatematch | staging | Core infrastructure has been modified |

Production deployments

| Cloud Provider | Cluster Name | Hub Name | Reason for Redeploy |
| --- | --- | --- | --- |
| aws | nmfs-openscapes | prod | Core infrastructure has been modified |
| aws | nmfs-openscapes | workshop | Core infrastructure has been modified |
| aws | nmfs-openscapes | noaa-only | Core infrastructure has been modified |
| aws | openscapes | prod | Core infrastructure has been modified |
| aws | openscapes | workshop | Core infrastructure has been modified |
| gcp | cloudbank | authoring | Core infrastructure has been modified |
| gcp | cloudbank | bcc | Core infrastructure has been modified |
| gcp | cloudbank | ccc | Core infrastructure has been modified |
| gcp | cloudbank | ccsf | Core infrastructure has been modified |
| gcp | cloudbank | chabot | Core infrastructure has been modified |
| gcp | cloudbank | csm | Core infrastructure has been modified |
| gcp | cloudbank | csueb | Core infrastructure has been modified |
| gcp | cloudbank | csuf | Core infrastructure has been modified |
| gcp | cloudbank | csula | Core infrastructure has been modified |
| gcp | cloudbank | csulb | Core infrastructure has been modified |
| gcp | cloudbank | csun | Core infrastructure has been modified |
| gcp | cloudbank | csum | Core infrastructure has been modified |
| gcp | cloudbank | csumb | Core infrastructure has been modified |
| gcp | cloudbank | csus | Core infrastructure has been modified |
| gcp | cloudbank | demo | Core infrastructure has been modified |
| gcp | cloudbank | dvc | Core infrastructure has been modified |
| gcp | cloudbank | elac | Core infrastructure has been modified |
| gcp | cloudbank | elcamino | Core infrastructure has been modified |
| gcp | cloudbank | evc | Core infrastructure has been modified |
| gcp | cloudbank | fresno | Core infrastructure has been modified |
| gcp | cloudbank | foothill | Core infrastructure has been modified |
| gcp | cloudbank | glendale | Core infrastructure has been modified |
| gcp | cloudbank | high | Core infrastructure has been modified |
| gcp | cloudbank | howard | Core infrastructure has been modified |
| gcp | cloudbank | humboldt | Core infrastructure has been modified |
| gcp | cloudbank | lacc | Core infrastructure has been modified |
| gcp | cloudbank | lamission | Core infrastructure has been modified |
| gcp | cloudbank | laney | Core infrastructure has been modified |
| gcp | cloudbank | lavc | Core infrastructure has been modified |
| gcp | cloudbank | lbcc | Core infrastructure has been modified |
| gcp | cloudbank | mendocino | Core infrastructure has been modified |
| gcp | cloudbank | merced | Core infrastructure has been modified |
| gcp | cloudbank | mills | Core infrastructure has been modified |
| gcp | cloudbank | miracosta | Core infrastructure has been modified |
| gcp | cloudbank | mission | Core infrastructure has been modified |
| gcp | cloudbank | moreno | Core infrastructure has been modified |
| gcp | cloudbank | norco | Core infrastructure has been modified |
| gcp | cloudbank | palomar | Core infrastructure has been modified |
| gcp | cloudbank | pasadena | Core infrastructure has been modified |
| gcp | cloudbank | reedley | Core infrastructure has been modified |
| gcp | cloudbank | riohondo | Core infrastructure has been modified |
| gcp | cloudbank | sacramento | Core infrastructure has been modified |
| gcp | cloudbank | saddleback | Core infrastructure has been modified |
| gcp | cloudbank | santiago | Core infrastructure has been modified |
| gcp | cloudbank | sbcc | Core infrastructure has been modified |
| gcp | cloudbank | sbcc-dev | Core infrastructure has been modified |
| gcp | cloudbank | sierra | Core infrastructure has been modified |
| gcp | cloudbank | sjcc | Core infrastructure has been modified |
| gcp | cloudbank | sjsu | Core infrastructure has been modified |
| gcp | cloudbank | skyline | Core infrastructure has been modified |
| gcp | cloudbank | srjc | Core infrastructure has been modified |
| gcp | cloudbank | tuskegee | Core infrastructure has been modified |
| gcp | cloudbank | ucsc | Core infrastructure has been modified |
| gcp | cloudbank | wlac | Core infrastructure has been modified |
| gcp | dubois | ephemeral | Core infrastructure has been modified |
| gcp | 2i2c | imagebuilding-demo | Core infrastructure has been modified |
| gcp | 2i2c | binderhub-ui-demo | Core infrastructure has been modified |
| gcp | 2i2c | demo | Core infrastructure has been modified |
| gcp | 2i2c | temple | Core infrastructure has been modified |
| gcp | 2i2c | ucmerced | Core infrastructure has been modified |
| gcp | 2i2c | mtu | Core infrastructure has been modified |
| aws | nasa-ghg | prod | Core infrastructure has been modified |
| aws | nasa-ghg | binder | Core infrastructure has been modified |
| aws | maap | prod | Core infrastructure has been modified |
| gcp | hhmi | prod | Core infrastructure has been modified |
| gcp | hhmi | spyglass | Core infrastructure has been modified |
| gcp | hhmi | binder | Core infrastructure has been modified |
| aws | reflective | prod | Core infrastructure has been modified |
| aws | reflective | workshop | Core infrastructure has been modified |
| aws | disasters | prod | Core infrastructure has been modified |
| aws | smithsonian | prod | Core infrastructure has been modified |
| gcp | awi-ciroh | prod | Core infrastructure has been modified |
| gcp | awi-ciroh | workshop | Core infrastructure has been modified |
| kubeconfig | utoronto | prod | Core infrastructure has been modified |
| kubeconfig | utoronto | r-prod | Core infrastructure has been modified |
| kubeconfig | utoronto | highmem | Core infrastructure has been modified |
| aws | 2i2c-aws-us | showcase | Core infrastructure has been modified |
| aws | jupyter-health | prod | Core infrastructure has been modified |
| aws | nasa-veda | prod | Core infrastructure has been modified |
| aws | nasa-veda | binder | Core infrastructure has been modified |
| kubeconfig | projectpythia-binder | binderhub | Core infrastructure has been modified |
| aws | nasa-cryo | prod | Core infrastructure has been modified |
| gcp | leap | prod | Core infrastructure has been modified |
| gcp | leap | public | Core infrastructure has been modified |
| gcp | 2i2c-uk | lis | Core infrastructure has been modified |
| aws | projectpythia | prod | Core infrastructure has been modified |
| aws | projectpythia | pythia-binder | Core infrastructure has been modified |
| aws | strudel | prod | Core infrastructure has been modified |
| gcp | catalystproject-latam | unitefa-conicet | Core infrastructure has been modified |
| gcp | catalystproject-latam | cicada | Core infrastructure has been modified |
| gcp | catalystproject-latam | gita | Core infrastructure has been modified |
| gcp | catalystproject-latam | iner | Core infrastructure has been modified |
| gcp | catalystproject-latam | plnc | Core infrastructure has been modified |
| gcp | catalystproject-latam | unam | Core infrastructure has been modified |
| gcp | catalystproject-latam | cabana | Core infrastructure has been modified |
| gcp | catalystproject-latam | nnb-ccg | Core infrastructure has been modified |
| gcp | catalystproject-latam | labi | Core infrastructure has been modified |
| gcp | catalystproject-latam | areciboc3 | Core infrastructure has been modified |
| gcp | catalystproject-latam | valledellili | Core infrastructure has been modified |
| aws | catalystproject-africa | nm-aist | Core infrastructure has been modified |
| aws | catalystproject-africa | must | Core infrastructure has been modified |
| aws | catalystproject-africa | uvri | Core infrastructure has been modified |
| aws | catalystproject-africa | wits | Core infrastructure has been modified |
| aws | catalystproject-africa | kush | Core infrastructure has been modified |
| aws | catalystproject-africa | molerhealth | Core infrastructure has been modified |
| aws | catalystproject-africa | aibst | Core infrastructure has been modified |
| aws | catalystproject-africa | bhki | Core infrastructure has been modified |
| aws | catalystproject-africa | bon | Core infrastructure has been modified |
| aws | opensci | sciencecore | Core infrastructure has been modified |
| aws | opensci | climaterisk | Core infrastructure has been modified |
| aws | opensci | small-binder | Core infrastructure has been modified |
| aws | opensci | big-binder | Core infrastructure has been modified |
| aws | victor | prod | Core infrastructure has been modified |
| aws | earthscope | prod | Core infrastructure has been modified |
| aws | earthscope | binder | Core infrastructure has been modified |
| aws | ubc-eoas | prod | Core infrastructure has been modified |
| gcp | climatematch | prod | Core infrastructure has been modified |

Member Author

jnywong commented May 19, 2025

Deployer updated, dashboards moved to 2i2c hosted repo and upstream PR waiting in draft 👍

@jnywong jnywong marked this pull request as ready for review May 19, 2025 10:59
@jnywong jnywong requested a review from GeorgianaElena May 19, 2025 10:59
@GeorgianaElena
Member

This is perfect @jnywong ❤️ I feel comfortable merging your PR upstream as well (I have the rights) if you feel confident too. Just let me know. Otherwise, feel free to merge this one and maybe also open an internal 2i2c tracking issue to review and merge the upstream PR after some testing on our infra, to make sure we don't lose track of it.

I believe the biggest challenge is having the 2i2c repo diverge and having to maintain and contribute in two places.

Member Author

jnywong commented May 19, 2025

Great, thank you @GeorgianaElena!

I agree, we do not want these two repos to diverge. I will let you know once I am happy for #5315 to be merged and upstreamed.

To not lose track, I have added this task to the DoD in the parent initiative: #5315

@jnywong jnywong merged commit 6162e97 into 2i2c-org:main May 19, 2025
41 checks passed
sunu pushed a commit to sunu/infrastructure that referenced this pull request May 27, 2025
* Add new user group diagnostics dashboard and update user diagnostics

* Use unescaped $user_name from pod annotation instead of $user_pod from pod label

* Add home dir usage panel for groups

* Use unescaped username for home dir usage panel

* Move dashboards to https://github.com/2i2c-org/grafana-dashboards

* Revert variables for cloud cost dashboard

* Point deployer to 2i2c hosted dashboards