Add "User Group Diagnostics" Grafana dashboard #6065
Conversation
Hey @GeorgianaElena! Just requested a review from you to check whether you think this is okay to implement across all of our clusters. I think it should be fine, but I would value your infra eng experience here for a quick green/red light :)

Actually, not ready for review yet, just found a bug!
Great work @jnywong <3 ! I didn't pay too much attention to the queries, so this review is more from an infra perspective.
If you look at `infrastructure/deployer/commands/grafana/deploy_dashboards.py`, lines 58 to 72 in a526d62:

```python
if cluster_provider == "aws":
    print_colour("Deploying cloud cost dashboards to an AWS cluster...")
    subprocess.check_call(
        [
            "./deploy.py",
            grafana_url,
            "--dashboards-dir=../grafana-dashboards",
            "--folder-name=Cloud cost dashboards",
            "--folder-uid=cloud-cost",
        ],
        env=deploy_script_env,
        cwd="jupyterhub-grafana-dashboards",
    )
    print_colour(f"Done! Dashboards deployed to {grafana_url}.")
```
The dashboards here are deployed on aws only and are categorized as cost-related dashboards.
In my opinion, ideally this work should instead live upstream, in jupyterhub/grafana-dashboards. I believe 2i2c uses that repo most intensively, so we shouldn't be impacting lots of people if we were to merge this ourselves.
If you want to be on the safe side, we could also host them under 2i2c for a month, let's say, validate that they are correct, and then upstream them. In the meantime, we would also open an upstream issue about this intention.
If we are to go with option two, then we need to update the deployer command and the general structure of this directory because it looks like it's growing to be more than cost-related dashboards.
What do you think?
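As a rough illustration of the deployer-command update discussed here, the provider-gated call above could be refactored so a second, provider-independent call site deploys the 2i2c-hosted diagnostics dashboards. This is a hedged sketch only: `build_deploy_command`, the repo directory name, and the folder name/uid below are illustrative assumptions, not the merged implementation.

```python
# Hypothetical refactor sketch: extract the ./deploy.py argument list so it
# can be reused for a dashboards folder that is deployed on every cluster,
# not just AWS. All names here are illustrative assumptions.
from typing import List


def build_deploy_command(
    grafana_url: str,
    folder_name: str,
    folder_uid: str,
    dashboards_dir: str = "dashboards",
) -> List[str]:
    """Assemble the ./deploy.py invocation for one Grafana dashboards folder."""
    return [
        "./deploy.py",
        grafana_url,
        f"--dashboards-dir={dashboards_dir}",
        f"--folder-name={folder_name}",
        f"--folder-uid={folder_uid}",
    ]


# Hypothetical second call site, not gated on cluster_provider == "aws":
cmd = build_deploy_command(
    "https://grafana.example.2i2c.cloud",
    folder_name="Diagnostic dashboards",
    folder_uid="diagnostic-dashboards",
)
# subprocess.check_call(cmd, env=deploy_script_env, cwd="2i2c-grafana-dashboards")
```

Keeping the argument assembly in one helper would let the cloud-cost branch and a new diagnostics branch share the same invocation shape.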
Your infra perspective is exactly what I needed :) Yeah, eventually I would like to upstream this, but like you said, to be on the safe side we can validate and host this on the 2i2c side while we wait. I'll make changes to this PR to update the deployer command then. Thanks!
Merging this PR will trigger the following deployment actions.
- Support deployments: no support upgrades will be triggered
- Staging deployments
- Production deployments
Deployer updated, dashboards moved to the 2i2c-hosted repo, and upstream PR waiting in draft 👍
This is perfect @jnywong ❤️ I feel comfortable merging your PR upstream as well (I have the rights) if you feel confident too. Just let me know; otherwise, feel free to merge this one and maybe also open an internal 2i2c tracking issue to review and merge the upstream PR after some testing on our infra, so we don't lose track of it. I believe the biggest challenge is having the 2i2c repo diverge and having to maintain and contribute in two places.
Great, thank you @GeorgianaElena! I agree, we do not want these two repos to diverge, and I will let you know when #5315 is ready to be merged and upstreamed once I am happy with it. To not lose track, I have added this task to the DoD in the parent initiative: #5315
* Add new user group diagnostics dashboard and update user diagnostics
* Use unescaped $user_name from pod annotation instead of $user_pod from pod label
* Add home dir usage panel for groups
* Use unescaped username for home dir usage panel
* Move dashboards to https://github.com/2i2c-org/grafana-dashboards
* Revert variables for cloud cost dashboard
* Point deployer to 2i2c hosted dashboards
This PR adds a new "User Group Diagnostics" Grafana dashboard that complements the "User Diagnostics Dashboard" to show resource usage aggregated by user group.
Requires jupyterhub-groups-exporter to be set up on the hub for the dashboard to work; if it is not, see below. I have created a fork of the upstream dashboards while we validate the implementation across 2i2c-hosted hubs.
For this to work universally across all our infrastructure, I decided to separate the user group dashboard entirely from user-name-based aggregation. If I combined both user name and user group into one dashboard/PromQL query, it would break the dashboard for hubs that do not have jupyterhub-groups-exporter set up, because the `jupyterhub_user_group_info` metric would be unavailable to the PromQL. Therefore, if a hub does not have jupyterhub-groups-exporter set up, the "User Diagnostics" dashboard will work as normal but the "User Group Diagnostics" dashboard will show no data.

The "User Diagnostics" dashboard included in this PR differs from the upstream version of user.jsonnet, because the upstream version is technically a "Pod Diagnostics" dashboard. This PR aggregates pod-level data on a per-user basis and uses unescaped usernames from the `kube_pod_annotations` metric, rather than the limited character sets from `kube_pod_labels`.

Note
Metrics are available as a time series only from the date the `jupyterhub-groups-exporter` service was first manually deployed (therefore some PromQL in this PR is invalid prior to deployment, say, before 18 May 2025). If you see an execution error in the dashboard, try selecting a more recent time window.

Ref: #5983
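To illustrate why the dashboard reads usernames from pod annotations rather than pod labels: Kubernetes label values are restricted to at most 63 characters that start and end with an alphanumeric, with only `-`, `_`, and `.` allowed in between, so a username like `jane.doe@example.org` can only appear in `kube_pod_labels` in an escaped form, while annotations can hold it verbatim. A minimal sketch (the helper name and example usernames are illustrative, not from the PR):

```python
import re

# Kubernetes label values must be empty, or be at most 63 characters that
# begin and end with an alphanumeric, with '-', '_', '.' allowed in between.
LABEL_VALUE_RE = re.compile(r"^[A-Za-z0-9]([A-Za-z0-9._-]{0,61}[A-Za-z0-9])?$")


def is_valid_label_value(value: str) -> bool:
    """Return True if `value` could be stored verbatim in a pod label."""
    return value == "" or bool(LABEL_VALUE_RE.match(value))


# A plain username survives as a label value unchanged...
print(is_valid_label_value("janedoe"))               # True
# ...but one with '@' cannot be a label value, so kube_pod_labels would only
# ever carry an escaped form of it, whereas kube_pod_annotations can hold it
# as-is.
print(is_valid_label_value("jane.doe@example.org"))  # False
```

This is the character-set limitation the PR works around by switching the `$user_name` variable to the annotation-backed metric.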