Introducing the [perfmon] Extension for Cloudberry Database Monitoring #1087
Replies: 12 comments 10 replies
-
@fanfuxiaoran - Thanks for the detailed proposal—this is exciting functionality. That said, the architecture and component naming ( Could you clarify a few points? 🔄 Is this a revival or a full rewrite?
⚙️ Build & Runtime Dependencies
🛠️
-
I’ve now reviewed the related PR (#1085), and a few important things came up that I want to bring back to this proposal thread: It’s now evident that the perfmon extension is derived from the Greenplum gpperfmon codebase. That’s valuable context, but it needs to be made explicit in this proposal for transparency and clarity—especially for those unfamiliar with Greenplum internals. We should clarify:
🧭 Proposal → PR Flow
Also, I want to respectfully point out that the PR was submitted before the proposal received any discussion or feedback. This limits the community’s ability to weigh in on design, naming, and direction, especially for inherited or complex features. Can we encourage a clearer process going forward?
This would help ensure stronger community alignment and make it easier for new contributors to follow along.
Let me know if you’d like help updating the proposal language to reflect this background. I believe this feature has real potential, but it deserves a more transparent and community-driven rollout.
-
+1 for this, very useful for customers, nice work.
-
Thanks for confirming that ❌ Why
-
🛠️ Build Environment Support for
-
📚 User Documentation Plans?
This is a feature that will directly impact end users: it introduces an extension, SQL tables, background workers, and runtime hooks that people will need to understand to use effectively. Could you clarify what the plans are for end-user documentation? Some important areas that would benefit from formal documentation:
Ideally this would be integrated into the Cloudberry user docs ( Without documentation, this feature risks being misunderstood or underutilized, and it's a missed opportunity for user engagement. Happy to help review or contribute to the initial docs if that would help.
-
Building
-
📌 Note: Replacing libsigar in gpsmon
Hi @fanfuxiaoran,
Following up on the effort to replace libsigar within gpsmon: while SIGAR is Apache-licensed and technically ASF-compatible, it's effectively unmaintained and fragile on modern Linux systems. We’ve looked into possible alternatives, specifically C-based, Linux-only libraries with a comparable feature set and compatible licensing.
🔍 Summary of Findings
Despite a broad survey, there does not appear to be a single, actively maintained, C-based library that:
✅ Recommendation
Given these constraints, the cleanest and most sustainable path forward is to incrementally replace SIGAR usage with direct /proc, /sys, and statvfs() parsing, organized as:
This approach ensures:
This would allow you to phase out SIGAR while maintaining equivalent functionality and aligning with ASF licensing policy. This is simply a proposed direction to consider as you evaluate how best to support the SIGAR functionality in some form going forward.
-=e
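For illustration only, the direct /proc and statvfs() parsing suggested above could look roughly like the sketch below. It is written in Python to keep it short; gpsmon itself is C, so treat this as a description of the parsing logic rather than proposed code. The function names (`parse_meminfo`, `parse_cpu_totals`, `disk_usage`) are hypothetical and do not come from the perfmon PR or from SIGAR.

```python
import os


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: int}.

    Most fields are reported in kB; a few (e.g. HugePages_Total) are counts.
    """
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[name.strip()] = int(fields[0])
    return values


def parse_cpu_totals(text):
    """Return (busy, total) jiffies from the aggregate 'cpu' line of /proc/stat.

    Only the first eight counters are summed (user, nice, system, idle,
    iowait, irq, softirq, steal); guest time is already included in user/nice.
    """
    for line in text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            ticks = [int(v) for v in fields[1:9]]
            idle = ticks[3] + (ticks[4] if len(ticks) > 4 else 0)  # idle + iowait
            total = sum(ticks)
            return total - idle, total
    raise ValueError("no aggregate cpu line found")


def disk_usage(path):
    """Return (used_bytes, total_bytes) for the filesystem containing path."""
    st = os.statvfs(path)
    total = st.f_blocks * st.f_frsize
    free = st.f_bfree * st.f_frsize
    return total - free, total
```

On a Linux host, the parsers would be fed the contents of /proc/meminfo and /proc/stat. CPU utilization over an interval is then the difference in busy jiffies divided by the difference in total jiffies between two samples.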
-
Given the amount of work involved in preparing the SIGAR fork and ensuring it’s ready for inclusion under ASF policy, I suggest we exclude perfmon from the upcoming Apache Cloudberry (Incubating) 2.0 release. This will give us space to:
Also, it might make sense to close the current perfmon PR for now while we align on the overall approach. Once the SIGAR fork is ready and integrated in a maintainable way, we can revisit the PR with a clean path forward.
-=e
-
Hi @fanfuxiaoran, thank you very much for your work! I have a thought on how to properly review and merge such complex functionality. Since we have agreed not to merge it into 2.0.0, what about creating a separate branch? In this branch, we can merge the initial set of changes, check them, discuss them, create additional PRs, and then merge the final PR into the main branch. I was reviewing #1085, but realized that if we started discussing the implementation details, the review process would be delayed. Plus, we might postpone some ideas and come back to them later, but they might get lost in the general discussion flow. It would be better to write them down and return to them at a later date.
-
We have a similar solution for Open-GPDB: https://github.com/open-gpdb/yagp_hooks_collector/tree/YAGP-0.0.2-WIP. Right now, we have an open-sourced component that is quite similar to GPMON. It would be great if we could reuse solutions. Is it possible to formalize the segment -> GPSMON and GPSMON -> GPMMON interface? If so, I believe we could merge the components and replace them somehow. We use protobuf for data exchange: https://github.com/open-gpdb/yagp_hooks_collector/blob/YAGP-0.0.2-WIP/protos/yagpcc_set_service.proto
-
Hi, @fanfuxiaoran I have a number of ideas about what can be done to improve diagnosis. May I ask you a few quick questions below? You can answer them in detail or just yes/no. It will allow me to understand what you consider important and what isn't.
We do not use it in https://github.com/powa-team/pg_stat_kcache and in https://github.com/open-gpdb/yagp_hooks_collector/blob/YAGP-0.0.2-WIP/src/ProcStats.cpp
-
Proposers
Xiaoran Wang
Proposal Status
Under Discussion
Abstract
We're excited to introduce the new perfmon extension, a comprehensive monitoring solution for Cloudberry Database clusters. This discussion covers its capabilities, architecture, and potential use cases.
We welcome your feedback and questions!
Motivation
Why perfmon?
Traditional monitoring tools often lack:
perfmon addresses these gaps with deep database-aware instrumentation.
Key Features
Hardware metrics (CPU, memory, disk, network) from all nodes.
- General query information, such as query text, user, and status
- Resource attribution (CPU/memory/spills by query/segment)
- Progress tracking per plan node (like EXPLAIN ANALYZE for live queries)
- Query history with execution statistics
- Performance baseline comparison
Implementation
Technical Deep Dive
The system persists all monitoring data in the gpperfmon database, which supports standard SQL querying for data retrieval and analysis.
The pg_query_state function is adapted from PostgresPro's pg_query_state https://github.com/postgrespro/pg_query_state to operate on Cloudberry Database (an MPP database system). It provides real-time query execution state monitoring capabilities.
While a query is running, you can check its execution state with pg_query_state.
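As a rough sketch of how a client might call it: the snippet below assumes the Cloudberry port keeps the upstream PostgresPro call shape, `pg_query_state(pid)`, and the upstream result columns `pid`, `frame_number`, `query_text`, and `plan`. That is an assumption, not something confirmed in this proposal. The helper `fetch_query_state` is hypothetical, and the `%s` placeholder assumes a psycopg-style DB-API cursor.

```python
# Assumed to match upstream PostgresPro pg_query_state; the Cloudberry port may differ.
QUERY_STATE_SQL = (
    "SELECT pid, frame_number, query_text, plan FROM pg_query_state(%s)"
)
QUERY_STATE_COLUMNS = ("pid", "frame_number", "query_text", "plan")


def fetch_query_state(cursor, backend_pid):
    """Return the current execution state of the query running in backend_pid.

    `cursor` is any DB-API cursor that accepts %s placeholders. Each row
    becomes a dict keyed by the assumed column names.
    """
    cursor.execute(QUERY_STATE_SQL, (backend_pid,))
    return [dict(zip(QUERY_STATE_COLUMNS, row)) for row in cursor.fetchall()]
```

The backend PID of the running query can be looked up in `pg_stat_activity` from a second session and then passed as `backend_pid`.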
Rollout/Adoption Plan
No response
Are you willing to submit a PR?