
Fetch data for 24 days to stay within quota #49


Merged
merged 1 commit into main from 24-days on Apr 29, 2025

Conversation

hugovk
Owner

@hugovk hugovk commented Apr 29, 2025

Dry runs:

pypinfo --all --indent 0 --limit 15000 --days 27 --dry-run "" project
Served from cache: False
Data processed: 1.10 TiB
Data billed: 0.00 B
Estimated cost: $0.00

pypinfo --all --indent 0 --limit 15000 --days 26 --dry-run "" project
Served from cache: False
Data processed: 1.06 TiB
Data billed: 0.00 B
Estimated cost: $0.00

pypinfo --all --indent 0 --limit 15000 --days 25 --dry-run "" project
Served from cache: False
Data processed: 1.01 TiB
Data billed: 0.00 B
Estimated cost: $0.00

pypinfo --all --indent 0 --limit 15000 --days 24 --dry-run "" project
Served from cache: False
Data processed: 984.76 GiB
Data billed: 0.00 B
Estimated cost: $0.00
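
For anyone repeating this search, here's a minimal sketch that automates the dry runs above, assuming pypinfo is installed and BigQuery credentials are configured; the output parsing and the 1 TiB free-tier threshold are assumptions here, not part of this PR:

# Hypothetical helper: find the largest --days window whose dry-run
# estimate stays under BigQuery's 1 TiB monthly free tier.
import re
import subprocess

FREE_TIER_BYTES = 2**40  # 1 TiB

UNITS = {"B": 1, "KiB": 2**10, "MiB": 2**20, "GiB": 2**30, "TiB": 2**40}

def data_processed(days: int) -> float:
    """Dry-run pypinfo and parse the 'Data processed' line into bytes."""
    out = subprocess.run(
        ["pypinfo", "--all", "--indent", "0", "--limit", "15000",
         "--days", str(days), "--dry-run", "", "project"],
        capture_output=True, text=True, check=True,
    ).stdout
    value, unit = re.search(r"Data processed: ([\d.]+) (\S+)", out).groups()
    return float(value) * UNITS[unit]

for days in range(27, 20, -1):
    processed = data_processed(days)
    print(f"--days {days}: {processed / 2**40:.2f} TiB")
    if processed < FREE_TIER_BYTES:
        print(f"Largest window under the free tier: {days} days")
        break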

@hugovk hugovk merged commit b7d80ec into main Apr 29, 2025
3 checks passed
@hugovk hugovk deleted the 24-days branch April 29, 2025 13:32
@reneleonhardt

@hugovk Does "Data processed" account for the actually fetched values too, or could more columns be fetched?

Would it be possible to enable a GitHub Actions cron job to generate the data every month automatically?

https://docs.github.com/en/actions/writing-workflows/choosing-when-your-workflow-runs/events-that-trigger-workflows#schedule

on:
  schedule:
    # * is a special character in YAML so you have to quote this string
    - cron:  '0 0 1 * *'

@hugovk
Owner Author

hugovk commented Jun 13, 2025

@hugovk Does "Data processed" account for the actually fetched values too, or could more columns be fetched?

It should account for the actually fetched values, because it's the result of the query sent. You can see a snippet in #42 that does more or less the same thing.
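
As a rough illustration of why the estimate tracks the query itself: BigQuery's dry-run mode reports the bytes a query would scan, without billing anything. A minimal sketch, assuming the google-cloud-bigquery package; the query below is only shaped like pypinfo's, not the exact snippet from #42:

# Minimal sketch, assuming google-cloud-bigquery is installed and
# credentials are configured. The query is a hypothetical stand-in
# for the kind of query pypinfo builds.
from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

query = """
SELECT file.project AS project, COUNT(*) AS download_count
FROM `bigquery-public-data.pypi.file_downloads`
WHERE DATE(timestamp) BETWEEN
      DATE_SUB(CURRENT_DATE(), INTERVAL 24 DAY)
      AND DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
GROUP BY project
ORDER BY download_count DESC
LIMIT 15000
"""

job = client.query(query, job_config=job_config)
# Only the columns the query touches count towards "Data processed".
print(f"Data processed: {job.total_bytes_processed / 2**30:.2f} GiB")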

Would it be possible to enable a GitHub Actions cron job to generate the data every month automatically?

docs.github.com/en/actions/writing-workflows/choosing-when-your-workflow-runs/events-that-trigger-workflows#schedule

on:
  schedule:
    # * is a special character in YAML so you have to quote this string
    - cron:  '0 0 1 * *'

Yes, but there's already a cron running on Digital Ocean (see the README for details) that's meant to fetch the data each month automatically. Unfortunately, the free quota is becoming too small, and I need to adjust the amount fetched.

About the 2025.06 data: that also used up too much quota and so didn't complete. I'd meant to merge #50 before 1 June, but I was travelling. I'll have to do a manual run instead.

@hugovk
Owner Author

hugovk commented Jun 13, 2025

About the 2025.06 data: that also used up too much quota and so didn't complete. I'd meant to merge #50 before 1 June, but I was travelling. I'll have to do a manual run instead.

Done re: #50 (comment)

-> https://github.com/hugovk/top-pypi-packages/releases/tag/2025.06

@reneleonhardt

reneleonhardt commented Jun 13, 2025

Thank you!
Hmm, it looks like 1 TiB isn't enough anymore for the exploding Python ecosystem; every month there's one day less of free quota 😅
No wonder, with free-threading and AI getting adopted more and more every day.

Maybe it's time to start thinking about whether the magic number 15000 should be reduced so that a full month can be shown.
I mean, the name is "top", not "top-15000"... 😉

@hugovk
Owner Author

hugovk commented Jun 13, 2025

5k or 8k or 15k or 600k doesn't make a difference!

https://hugovk.dev/blog/2024/a-surprising-thing-about-pypis-bigquery-data/#finding-the-number-of-packages-doesnt-affect-the-cost
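
The short version: BigQuery bills for the columns a query scans, and LIMIT is applied after the scan, so the row limit barely moves the estimate. A quick, hypothetical way to check for yourself, assuming pypinfo is installed: dry-run the same query with different --limit values and compare the "Data processed" lines.

# Hypothetical check: the dry-run estimates should be essentially
# identical regardless of the --limit value.
import subprocess

for limit in ("5000", "8000", "15000"):
    out = subprocess.run(
        ["pypinfo", "--all", "--days", "24", "--limit", limit,
         "--dry-run", "", "project"],
        capture_output=True, text=True, check=True,
    ).stdout
    for line in out.splitlines():
        if line.startswith("Data processed:"):
            print(f"--limit {limit}: {line}")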

@reneleonhardt

There are so many "big data" tools; it's sad that no solution other than BigQuery has been chosen for counting PyPI downloads. Now that Microsoft is helping CPython development, why don't they store the download data / metadata?

So many packages are abandoned, incompatible with Python 3.13, or missing binary wheels for some platforms or architectures; identifying those problems would be more important than counting downloads, and should be provided to the community for free.

@hugovk
Owner Author

hugovk commented Jun 14, 2025

Is compatibility with 3.13 really so bad?

https://pyreadiness.org/3.13/ shows that 55% of the top 360 packages have declared compatibility by adding the 3.13 Trove classifier, but many more are compatible anyway and either don't use classifiers or haven't added/released one yet.

Are there any in particular you're missing?
