Skip to content

fix(cpu_monitor): collect cpu information on timer thread #10545

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 17 commits into
base: main
Choose a base branch
from

Conversation

nishikawa-masaki
Copy link
Contributor

@nishikawa-masaki nishikawa-masaki commented Apr 23, 2025

Description

As reported on #10554 , the "cpu_monitor" node collects the following CPU statistics in the context of publishing the "/diagnostics" topic.

  • Load of each core
    • "mpstat -P ALL 1 1 -o JSON" command line is executed.
      • This will be fixed by another PR in near future.
  • Frequency of each core
  • Temperature of each core
  • Overall load average
  • Thermal Throttling status about each core (optional)

In addition to "sleeping 1 second in mpstat", other operations take a considerable amount of time.
As a result, publishing intervals from the "cpu_monitor" node are longer than expected (1 second).

This PR makes the following changes to the way the "cpu_monitor" node works:

  • "cpu_monitor" creates a timer with 1 second intervals.
  • In the context of the timer-callback, the "cpu_monitor" node collects statistics about CPU cores and stores them internally.
  • On "update" for publishing, the "cpu_monitor" node publishes the stored statistics information.

Handling of thermal throttling is intentionally left unchanged because:

  • The "cpu_monitor" node supports multiple CPU platforms.
  • The behaviors of thermal throttling are different among CPU types.
  • The structures of thermal throttling information and the procedures for collecting the information are tightly coupled and are different among CPU types.
  • Therefore, handling of thermal throttling can't be generalized.
  • The load and time for collecting thermal throttling alone won't be a problem as other works are moved to timer-callback context.

Related links

Parent Issue:

Private Links:

How was this PR tested?

  • All PR CI tests passed.
    • The source code can be built for different CPU platforms.
    • Not tested on other CPU platforms than Intel, but it should work.
      • Only the context in which statistics are collected has changed.
  • The number of entries and their values in "/diagnostics" topic published by the "cpu_monitor" node are same as before.
  • Modified publish interval to 0.5 second (2 Hz) temporarily and verified that the "cpu_monitor" node can publish its data with 0.5 second intervals.
    • The version before this PR can't publish with 0.5 second intervals.

Notes for reviewers

None.

Interface changes

None.

The parameters of the "cpu_monitor" node are now explicitly declared as "read-only".
They had been "read-only" (i.e. they can't be modified after initialization) implicitly.

Effects on system behavior

None.

* Fixed existing code to meet Autoware coding style guide.

Signed-off-by: Masaki NISHIKAWA <masaki.nishikawa@tier4.jp>
* CPU temperatures are checked by timer thread and published by publisher thread.
* Other CPU statics will follow the frame work.

Signed-off-by: Masaki NISHIKAWA <masaki.nishikawa@tier4.jp>
* Moved CPU usage measurement to onTimer() thread from the publishing thread

Signed-off-by: Masaki NISHIKAWA <masaki.nishikawa@tier4.jp>
* Moved load-average measurement to onTimer() thread from the publishing thread

Signed-off-by: Masaki NISHIKAWA <masaki.nishikawa@tier4.jp>
* Moved CPU frequency measurement to onTimer() thread from the publishing thread

Signed-off-by: Masaki NISHIKAWA <masaki.nishikawa@tier4.jp>
* Gave up to move checking of thermal throttling to timer callback context.
* Implemented reporting of additional error information.
* Implemented exception handling for boost::process::child().

Signed-off-by: Masaki NISHIKAWA <masaki.nishikawa@tier4.jp>
* Modernized "for loops".
* Minor fixes befor submitting a PR.

Signed-off-by: Masaki NISHIKAWA <masaki.nishikawa@tier4.jp>
@nishikawa-masaki nishikawa-masaki added component:system System design and integration. (auto-assigned) run:build-and-test-differential Mark to enable build-and-test-differential workflow. (used-by-ci) labels Apr 23, 2025
@nishikawa-masaki nishikawa-masaki self-assigned this Apr 23, 2025
Copy link

github-actions bot commented Apr 23, 2025

Thank you for contributing to the Autoware project!

🚧 If your pull request is in progress, switch it to draft mode.

Please ensure:

@nishikawa-masaki nishikawa-masaki changed the title Fix/cpu monitor collect info on timer thread fix(cpu_monitor): collect cpu information on timer thread Apr 23, 2025
pre-commit-ci bot and others added 3 commits April 23, 2025 13:18
* Fixed problems reported by Draft PR CI tools.

Signed-off-by: Masaki NISHIKAWA <masaki.nishikawa@tier4.jp>
…timer-thread' into fix/cpu_monitor-collect-info-on-timer-thread

Signed-off-by: Masaki NISHIKAWA <masaki.nishikawa@tier4.jp>
Copy link

codecov bot commented Apr 24, 2025

Codecov Report

Attention: Patch coverage is 0% with 275 lines in your changes missing coverage. Please review.

Project coverage is 24.91%. Comparing base (175ca5c) to head (79bd63a).
Report is 6 commits behind head on main.

Files with missing lines Patch % Lines
...ystem_monitor/src/cpu_monitor/cpu_monitor_base.cpp 0.00% 237 Missing ⚠️
...ude/system_monitor/cpu_monitor/cpu_information.hpp 0.00% 19 Missing ⚠️
...stem_monitor/src/cpu_monitor/intel_cpu_monitor.cpp 0.00% 19 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #10545      +/-   ##
==========================================
- Coverage   24.96%   24.91%   -0.05%     
==========================================
  Files        1351     1351              
  Lines      105116   105229     +113     
  Branches    39752    39757       +5     
==========================================
- Hits        26238    26219      -19     
- Misses      76386    76519     +133     
+ Partials     2492     2491       -1     
Flag Coverage Δ *Carryforward flag
differential 9.36% <0.00%> (?)
total 24.98% <ø> (+0.02%) ⬆️ Carriedforward from fe6d274

*This pull request uses carry forward flags. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

* Fixed problems reported by Draft PR CI tools.

Signed-off-by: Masaki NISHIKAWA <masaki.nishikawa@tier4.jp>
nishikawa-masaki and others added 5 commits April 25, 2025 19:19
* Minimized critical sections locked by mutex_.

Signed-off-by: Masaki NISHIKAWA <masaki.nishikawa@tier4.jp>
…timer-thread' into fix/cpu_monitor-collect-info-on-timer-thread
* Fixed a minor problem reported by PR CI workflow.

Signed-off-by: Masaki NISHIKAWA <masaki.nishikawa@tier4.jp>
…timer-thread' into fix/cpu_monitor-collect-info-on-timer-thread
@nishikawa-masaki nishikawa-masaki marked this pull request as ready for review April 25, 2025 11:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
component:system System design and integration. (auto-assigned) run:build-and-test-differential Mark to enable build-and-test-differential workflow. (used-by-ci)
Projects
Status: To Triage
Development

Successfully merging this pull request may close these issues.

1 participant