
feat(data): Added automated CPU batch updater #841


Merged
merged 11 commits into from
May 25, 2025

Conversation

IamLRBA
Contributor

@IamLRBA IamLRBA commented May 18, 2025

cc
@benoit-cty

After looking at the CPU_Create_Dataset.ipynb notebook in the CodeCarbon repository, I wrote this script to provide a complete, automated way to update CPU power consumption data for both Intel and AMD processors in one run. It addresses what you requested by:

1. Unified processing: handling both Intel and AMD CPUs in a single script execution while maintaining the existing file structure.
2. Automated data collection:

  • Intel: Scrapes ark.intel.com (15 pages × 100 CPUs/page = ~1500 processors)
  • AMD: Scrapes TechPowerUp's database (all listed AMD processors)
  • Processes server, desktop, and laptop CPUs separately

3. Single-run updates: updating all relevant files in one run.
4. Validation: checking that all files exist and contain sufficient data.
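
The validation step in the last item could look roughly like the sketch below. It is not the code from this PR: the function name `validate_datasets` and the `min_rows` threshold are hypothetical, and it assumes each dataset is a CSV file with a single header row.

```python
import csv
from pathlib import Path


def validate_datasets(paths, min_rows=1):
    """Return {path: problem} for every file that fails a check.

    Hypothetical helper: an empty dict means every file exists and
    holds at least `min_rows` data rows (header row excluded).
    """
    problems = {}
    for path in map(Path, paths):
        if not path.is_file():
            problems[str(path)] = "missing"
            continue
        with path.open(newline="", encoding="utf-8") as handle:
            # Count all CSV rows, then drop the assumed header row.
            data_rows = max(sum(1 for _ in csv.reader(handle)) - 1, 0)
        if data_rows < min_rows:
            problems[str(path)] = f"only {data_rows} data rows, expected at least {min_rows}"
    return problems
```

Returning the problems instead of raising on the first one lets a single run report every dataset that needs attention.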

Maintenance

- The script uses a single command, python -m codecarbon.data.hardware.cpu_batch_updater, to update everything.

If there are any changes or suggestions, I am more than willing to make them to the best of my knowledge.
Thank you!

Addresses issue #840

@benoit-cty
Contributor

Thank you very much, that's a great improvement! I will review it in detail this Wednesday.

If you want to fix the pre-commit error, you could look at https://github.com/mlco2/codecarbon/blob/master/CONTRIBUTING.md#coding-style--linting to install it locally.

@IamLRBA IamLRBA changed the title feat(data): add automated CPU batch updater feat(data): Added automated CPU batch updater May 19, 2025
@IamLRBA
Contributor Author

IamLRBA commented May 19, 2025

> Thank you very much, that's a great improvement! I will review it in detail this Wednesday.
>
> If you want to fix the pre-commit error, you could look at https://github.com/mlco2/codecarbon/blob/master/CONTRIBUTING.md#coding-style--linting to install it locally.

Thank you @benoit-cty, it now passes!

Contributor

@Copilot Copilot AI left a comment


Pull Request Overview

This PR introduces a new automated CPU batch updater that scrapes and processes CPU power consumption data for Intel and AMD processors, then aggregates the results.

  • Implements a new script to fetch Intel CPU data via web scraping and process AMD CPU datasets from TechPowerUp.
  • Aggregates data into a unified CSV (cpu_power.csv) and validates file existence and size.

@benoit-cty
Contributor

When I ran the script, it did not find any pages. Does it still work for you?

I'm trying to get Manus.im to write a script for us that scrapes https://www.intel.com/content/www/us/en/ark/featurefilter.html?productType=873&3_MaxTDP-Min=0.03&3_MaxTDP-Max=500 and clicks on "Show more".

@benoit-cty
Contributor

The ARK database at https://www.intel.com/libs/apps/intel/support/ark/advancedFilterSearch?productType=873&3_MaxTDP-Min=0.03&3_MaxTDP-Max=500&forwardPath=/content/www/us/en/ark/featurefilter.html&pageNo=1&sort=&sortType= does not have all Intel CPUs. For example, it is missing "Intel Xeon Gold 6133".
There are 2,024 CPUs in it.
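
For reference, the paged URL quoted above can be rebuilt for any page number with the standard library. This is only a sketch: the helper name `ark_page_url` is hypothetical, and it assumes that `pageNo` is the only parameter that changes between pages. It performs no network request.

```python
from urllib.parse import urlencode

# Endpoint and query parameters copied from the URL quoted in this comment.
ARK_SEARCH_URL = "https://www.intel.com/libs/apps/intel/support/ark/advancedFilterSearch"


def ark_page_url(page_no):
    """Build the search URL for one result page (hypothetical helper)."""
    params = {
        "productType": "873",
        "3_MaxTDP-Min": "0.03",
        "3_MaxTDP-Max": "500",
        "forwardPath": "/content/www/us/en/ark/featurefilter.html",
        "pageNo": str(page_no),  # assumed to be the only per-page value
        "sort": "",
        "sortType": "",
    }
    # safe="/" keeps the slashes in forwardPath unescaped, as in the quoted URL.
    return f"{ARK_SEARCH_URL}?{urlencode(params, safe='/')}"
```

Calling `ark_page_url(1)` reproduces the URL above, and a scraper could iterate over page numbers until a page comes back empty.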

I pushed a script to do the scraping and another one to do the merge.

@benoit-cty
Contributor

Hello,

I've changed the scripts and added documentation. Can you give it a try and let me know if everything works on your side?

@benoit-cty benoit-cty requested a review from Copilot May 25, 2025 09:18
Contributor

@Copilot Copilot AI left a comment


Pull Request Overview

This PR introduces a unified, automated pipeline to scrape and merge Intel and AMD CPU TDP data into the existing cpu_power.csv.

  • Adds two scrapers (intel_cpu_scrapper.py, amd_cpu_scrapper.py) to fetch CPU specs from Intel ARK and AMD product pages.
  • Adds merge_scrapped_cpu_power.py to clean, merge, and update the master CPU power CSV.
  • Bulk-updates AMD server and desktop CSV datasets, and documents the workflow in a new README.
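
As an illustration of the merge step summarized above, here is a minimal sketch. It is not the logic of merge_scrapped_cpu_power.py: the function name `merge_tdp` is hypothetical, rows are assumed to be dicts with `Name` and `TDP` keys, and the rule that existing master entries win over scraped ones is an assumption.

```python
def merge_tdp(master_rows, scraped_rows):
    """Merge scraped CPU rows into the master rows (hypothetical helper).

    Rows are matched on a case-insensitive, whitespace-trimmed name.
    """
    merged = {row["Name"].strip().lower(): dict(row) for row in master_rows}
    for row in scraped_rows:
        key = row["Name"].strip().lower()
        # Assumption: an existing master entry is kept; only unseen CPUs are added.
        merged.setdefault(key, dict(row))
    # Sort by name so repeated runs produce a stable, diff-friendly file.
    return sorted(merged.values(), key=lambda row: row["Name"].lower())
```

Keying on a normalized name avoids duplicate rows that differ only in case or stray spaces.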

Reviewed Changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.

Show a summary per file
File | Description
merge_scrapped_cpu_power.py | New script to clean names, merge Intel/AMD TDP data, and update cpu_power.csv
intel_cpu_scrapper.py | Adds an Intel CPU scraper using requests & BeautifulSoup
amd_cpu_scrapper.py | Adds an AMD CPU scraper using Playwright
amd_cpu_server_dataset.csv | Bulk updates AMD server CPU dataset entries
amd_cpu_desktop_dataset.csv | Bulk updates AMD desktop CPU dataset entries
README.md | Instructions for running scrapers and merge script
Comments suppressed due to low confidence (2)

codecarbon/data/hardware/cpu_dataset_builder/intel_cpu_scrapper.py:1

  • [nitpick] The file is named 'intel_cpu_scrapper.py' but the class is 'IntelCpuScraper'. Consider renaming the file to 'intel_cpu_scraper.py' to align spelling and conventions.
#!/usr/bin/env python3

codecarbon/data/hardware/cpu_dataset_builder/merge_scrapped_cpu_power.py:1

  • There are no tests covering this new merging script. Consider adding unit or integration tests to validate name cleaning, TDP extraction, and merge logic.
This script updates the CPU power data by reading from Intel and AMD CPU data file,
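
A test along the lines suggested in this review comment might start as in the sketch below. The helpers `clean_cpu_name` and `extract_tdp_watts` are hypothetical stand-ins written for illustration, not functions taken from the merge script, and the sample strings are made up.

```python
import re


def clean_cpu_name(raw):
    """Drop trademark marks and collapse whitespace (hypothetical helper)."""
    name = re.sub(r"\((?:R|TM)\)|[®™]", "", raw, flags=re.IGNORECASE)
    return re.sub(r"\s+", " ", name).strip()


def extract_tdp_watts(raw):
    """Return the first number followed by 'W' as a float, or None (hypothetical helper)."""
    match = re.search(r"(\d+(?:\.\d+)?)\s*W\b", raw)
    return float(match.group(1)) if match else None


def test_clean_and_extract():
    # Name cleaning: trademark marks removed, whitespace collapsed.
    assert clean_cpu_name("Intel(R) Xeon(R)  Gold 6133 ") == "Intel Xeon Gold 6133"
    assert clean_cpu_name("Intel® Core™ i7") == "Intel Core i7"
    # TDP extraction: with or without a space before the unit.
    assert extract_tdp_watts("150 W") == 150.0
    assert extract_tdp_watts("65W") == 65.0
    assert extract_tdp_watts("n/a") is None
```

Under pytest, `test_clean_and_extract` would be collected automatically; it can also be called directly.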

IamLRBA and others added 7 commits May 25, 2025 11:51
>>
>> - This script provides a complete automated solution to update CPU power consumption data for both Intel and AMD processors in one run
>>
>> Addresses issue #840
benoit-cty and others added 4 commits May 25, 2025 12:10
…u_power.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@benoit-cty benoit-cty merged commit 06082b2 into mlco2:master May 25, 2025
4 checks passed
@IamLRBA
Copy link
Contributor Author

IamLRBA commented May 25, 2025

Thanks @benoit-cty for refining the script and adding the necessary changes.
Always a pleasure working with you!
If there is anything else you need me to do, I am available.
Cheers!


2 participants