feat: store repo data in sorted JSON files, closes #55 #57

Merged: m-jahn merged 2 commits into snakemake:main from sorted_db on Jul 24, 2025

m-jahn (Member) commented Jul 22, 2025

  • slight change to data.js to become data.json, with same structure as skips.json
  • json files are now sorted alphabetically, which should prevent constant rewriting of existing repo entries; closes #55 (storage of repo data in json file creates large overhead of changes lines)
  • removed unnecessary jinja template
  • moved functions from generate_catalog.py to common.py
  • updated docs with 1 extra env variable

Summary by CodeRabbit

  • Documentation
    • Updated setup instructions to include a new environment variable for limiting the number of repositories fetched.
  • Refactor
    • Switched data storage and processing from JavaScript-based files to pure JSON files for improved simplicity and consistency.
    • Updated scripts and data handling to use dictionaries and direct JSON serialization.
  • Chores
    • Removed obsolete JavaScript template and related code, streamlining data import and export processes.

coderabbitai bot (Contributor) commented Jul 22, 2025

## Walkthrough

The changes transition the workflow catalog's data handling from JavaScript-based (`data.js` with Jinja2 templating) to pure JSON files (`data.json`). Scripts and source files are refactored to read, write, and process JSON directly. Additional utility functions are introduced, and data structures are updated from lists to dictionaries for improved management.

## Changes

| File(s)                                         | Change Summary                                                                                  |
|------------------------------------------------|------------------------------------------------------------------------------------------------|
| README.md                                       | Expanded local setup instructions to include `N_REPOS` environment variable.                    |
| data.json                                       | Converted from JS variable assignment to pure JSON array.                                       |
| scripts/cleanup-catalog.py                      | Reformatted imports and loops; changed data storage to use full dictionaries and filenames.     |
| scripts/common.py                               | Removed Jinja2, switched to JSON serialization; added `register_skip` and `get_tarfile` helpers.|
| scripts/generate-catalog.py                     | Refactored data structures from lists to dicts; updated skip handling and tarball extraction.   |
| source/build_wf_pages.py, source/build_wf_tables.py | Changed data import from `data.js` to `data.json`; removed line skipping; simplified JSON loading.|
| templates/data.js                               | Deleted Jinja2-based JavaScript template file.                                                  |

## Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant generate-catalog.py
    participant common.py
    participant data.json

    User->>generate-catalog.py: Run script
    generate-catalog.py->>common.py: get_tarfile(url)
    common.py-->>generate-catalog.py: Tarfile object
    generate-catalog.py->>common.py: register_skip(repo, skips)
    common.py-->>generate-catalog.py: Updated skips dict
    generate-catalog.py->>common.py: store_data(repos, "data.json")
    common.py->>data.json: Write JSON
    generate-catalog.py->>common.py: store_data(skips, "skips.json")
    common.py->>skips.json: Write JSON
```

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~40 minutes

Possibly related PRs

Suggested reviewers

  • tedil
  • johanneskoester
  • cademirch


coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
README.md (1)

93-97: Use “environment” instead of “environmental”.

Minor wording nit: we usually say “environment variables”, not “environmental variables”.

-2. Set required environmental variables.
+2. Set required environment variables.
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c5c8ef9 and 7254440.

📒 Files selected for processing (8)
  • README.md (1 hunks)
  • data.json (0 hunks)
  • scripts/cleanup-catalog.py (2 hunks)
  • scripts/common.py (4 hunks)
  • scripts/generate-catalog.py (7 hunks)
  • source/build_wf_pages.py (1 hunks)
  • source/build_wf_tables.py (1 hunks)
  • templates/data.js (0 hunks)
📓 Path-based instructions (1)
**/*.py

⚙️ CodeRabbit Configuration File

**/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self argument of methods.
Do not suggest type annotation of the cls argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return statement.

Files:

  • source/build_wf_tables.py
  • source/build_wf_pages.py
  • scripts/common.py
  • scripts/cleanup-catalog.py
  • scripts/generate-catalog.py
🧠 Learnings (5)
📓 Common learnings
Learnt from: m-jahn
PR: snakemake/snakemake-workflow-catalog#50
File: topics.json:0-0
Timestamp: 2025-06-10T12:10:20.483Z
Learning: In the snakemake-workflow-catalog project, m-jahn prefers to allow cross-topic keyword mappings in topics.json, where keywords can appear in multiple topic categories. This design decision enables workflows to be discoverable under all relevant topic areas, improving user experience even though it means some workflows will be categorized under multiple topics.
README.md (2)

Learnt from: johanneskoester
PR: #45
File: .github/workflows/test.yml:0-0
Timestamp: 2025-05-09T08:11:42.764Z
Learning: In test workflows (like .github/workflows/test.yml), hardcoded environment variables like OFFSET and LATEST_COMMIT are intentional and appropriate as they provide controlled, reproducible test parameters. These values should not be replaced with dynamic values derived from GitHub contexts.

Learnt from: m-jahn
PR: #33
File: source/build_wf_tables.py:51-51
Timestamp: 2025-03-14T15:04:47.868Z
Learning: In the build_wf_tables.py file, the repository list is expected to never be empty during normal operation, and an empty list would indicate an upstream problem with repository data collection that should be addressed separately.

source/build_wf_tables.py (1)

Learnt from: m-jahn
PR: #33
File: source/build_wf_tables.py:51-51
Timestamp: 2025-03-14T15:04:47.868Z
Learning: In the build_wf_tables.py file, the repository list is expected to never be empty during normal operation, and an empty list would indicate an upstream problem with repository data collection that should be addressed separately.

scripts/common.py (1)

Learnt from: johanneskoester
PR: #45
File: .github/workflows/test.yml:0-0
Timestamp: 2025-05-09T08:11:42.764Z
Learning: In test workflows (like .github/workflows/test.yml), hardcoded environment variables like OFFSET and LATEST_COMMIT are intentional and appropriate as they provide controlled, reproducible test parameters. These values should not be replaced with dynamic values derived from GitHub contexts.

scripts/generate-catalog.py (1)

Learnt from: m-jahn
PR: #33
File: source/build_wf_tables.py:51-51
Timestamp: 2025-03-14T15:04:47.868Z
Learning: In the build_wf_tables.py file, the repository list is expected to never be empty during normal operation, and an empty list would indicate an upstream problem with repository data collection that should be addressed separately.

🧬 Code Graph Analysis (2)
scripts/cleanup-catalog.py (1)
scripts/common.py (2)
  • store_data (67-70)
  • check_repo_exists (73-84)
scripts/generate-catalog.py (1)
scripts/common.py (4)
  • register_skip (59-64)
  • store_data (67-70)
  • call_rate_limit_aware (51-56)
  • get_tarfile (87-88)
💤 Files with no reviewable changes (2)
  • data.json
  • templates/data.js
🔇 Additional comments (18)
README.md (1)

108-110: Double-check the script name (generate-catalog.py vs generate_catalog.py).

Elsewhere in the PR description the script is referred to as generate_catalog.py (underscore), but the README uses a hyphen. If the filename was renamed, this command will break.

-python scripts/generate-catalog.py
+python scripts/generate_catalog.py
source/build_wf_pages.py (1)

60-62: Clean transition from JavaScript to JSON data format.

The changes correctly update the data loading to read from pure JSON (data.json) instead of JavaScript-wrapped JSON (data.js). The switch from json.loads(f.read()) to json.load(f) is appropriate since there's no longer a need to skip the JavaScript variable declaration line.
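To make the difference concrete, here is a minimal sketch of the two loading styles. The exact header line of the old `data.js` (`var data =`) and the helper names `load_js_wrapped` and `load_pure_json` are assumptions for illustration, not code taken from the PR.

```python
import io
import json


def load_js_wrapped(f):
    """Old style (assumed): drop the leading JS declaration line, parse the rest."""
    f.readline()  # discard a line such as `var data =`
    return json.loads(f.read())


def load_pure_json(f):
    """New style: the file is plain JSON, so json.load can consume it directly."""
    return json.load(f)


# In-memory stand-ins for the two file formats (hypothetical content).
js_file = io.StringIO('var data =\n{"owner/repo": {"full_name": "owner/repo"}}')
json_file = io.StringIO('{"owner/repo": {"full_name": "owner/repo"}}')

# Both loaders yield the same Python object; only the new one needs no line skipping.
assert load_js_wrapped(js_file) == load_pure_json(json_file)
```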

source/build_wf_tables.py (1)

83-85: Consistent transition to JSON data format.

The changes mirror those in source/build_wf_pages.py, correctly updating the data loading to read from pure JSON format. The use of json.load(f) instead of json.loads(f.read()) is appropriate for the new JSON-only format.

scripts/cleanup-catalog.py (3)

6-13: Improved import formatting for readability.

The multi-line import statement improves code readability without changing functionality.


20-22: Enhanced loop formatting for clarity.

The multi-line loop formatting improves code readability while maintaining the same logic.


39-40: Updated to use new store_data signature.

The changes correctly align with the updated store_data function in scripts/common.py (lines 67-70), now passing the full dictionaries and explicit JSON filenames instead of just the dictionary values.

scripts/common.py (5)

7-8: Added necessary imports for new utility functions.

The tarfile and urllib.request imports support the new get_tarfile function introduced later in the file.


24-25: Simplified data loading with direct JSON parsing.

The transition from JavaScript-wrapped JSON to pure JSON eliminates the need to skip the first line and parse manually. Using json.load(f) directly is cleaner and more efficient.


59-64: Well-designed register_skip function.

The function properly registers a repository as skipped by storing both the full name and timestamp, returning the updated dictionary for functional-style updates.


67-70: Improved store_data function with better design.

The new signature taking a dictionary and explicit filename is more flexible and clear. Sorting by full_name and using sort_keys=True with indentation ensures consistent, readable JSON output.
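The behaviour described here can be sketched as follows. This is not the function from `scripts/common.py`; the body, the indent width, and the record layout are assumptions based only on the description above.

```python
import json
import os
import tempfile


def store_data(data, filename):
    """Write a dict of repo records to `filename` as deterministic JSON (sketch)."""
    # Order entries by each record's "full_name" (assumed to be present).
    ordered = dict(sorted(data.items(), key=lambda item: item[1]["full_name"]))
    with open(filename, "w") as out:
        # sort_keys=True additionally sorts the keys at every nesting level.
        json.dump(ordered, out, sort_keys=True, indent=2)


# Two dicts with the same content but different insertion order
# should produce byte-identical files, which keeps diffs small.
with tempfile.TemporaryDirectory() as tmp:
    path_a = os.path.join(tmp, "a.json")
    path_b = os.path.join(tmp, "b.json")
    store_data({"z/wf": {"full_name": "z/wf"}, "a/wf": {"full_name": "a/wf"}}, path_a)
    store_data({"a/wf": {"full_name": "a/wf"}, "z/wf": {"full_name": "z/wf"}}, path_b)
    with open(path_a) as fa, open(path_b) as fb:
        assert fa.read() == fb.read()
```

Note that `sort_keys=True` orders keys at the top level too, so when the dictionary keys are the full names the explicit sort and `sort_keys` agree on the final order.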


87-88: Useful utility function for tarball handling.

The get_tarfile function provides a clean abstraction for opening compressed archives from URLs, which is used in the catalog generation process.

scripts/generate-catalog.py (7)

16-16: Updated imports align with refactored common utilities.

The imports correctly reference the register_skip and get_tarfile functions now provided by scripts/common.py.

Also applies to: 19-19


35-36: Improved data structures using dictionaries.

The transition from lists to dictionaries enables direct access by repository full name, improving efficiency and simplifying data management throughout the script.
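As a rough illustration of why keying by full name helps, the sketch below contrasts a list of records with a dictionary of the same records. The record fields and repository names are made up for the example.

```python
repos_list = [
    {"full_name": "owner/wf-a", "stargazers_count": 3},
    {"full_name": "owner/wf-b", "stargazers_count": 7},
]

# List: finding one repository means scanning the records.
found = next(r for r in repos_list if r["full_name"] == "owner/wf-b")

# Dict keyed by full name: direct lookup, and storing a repo again
# overwrites its old entry instead of appending a duplicate.
repos = {r["full_name"]: r for r in repos_list}
repos["owner/wf-b"] = {"full_name": "owner/wf-b", "stargazers_count": 8}

assert found["stargazers_count"] == 7
assert len(repos) == 2
assert repos["owner/wf-b"]["stargazers_count"] == 8
```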


168-168: Consistent dictionary-based data storage.

Using dictionary keys for direct assignment aligns with the new data structure design and improves code clarity.

Also applies to: 175-175


188-190: Simplified tarball handling with utility function.

The use of the imported get_tarfile function eliminates code duplication and provides consistent tarball handling across the codebase.


197-197: Proper skip registration with dictionary updates.

The calls to register_skip correctly reassign the returned dictionary, maintaining functional programming patterns while updating the skips collection.

Also applies to: 209-209, 219-219


297-297: Dictionary-based repository storage.

Using the repository's full name as the dictionary key provides direct access and aligns with the overall refactoring to dictionary-based data structures.


304-304: Efficient dictionary merging and storage.

The use of dictionary union (|) operator provides clean merging of old and new data, while the updated store_data calls use explicit filenames for clarity.

Also applies to: 307-307
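For readers unfamiliar with the operator, a small sketch of dictionary union (available since Python 3.9) with made-up records. On duplicate keys the right-hand operand wins, so freshly fetched entries replace stale ones.

```python
old_repos = {
    "a/wf": {"full_name": "a/wf", "stars": 1},
    "b/wf": {"full_name": "b/wf", "stars": 2},
}
new_repos = {
    "b/wf": {"full_name": "b/wf", "stars": 5},
    "c/wf": {"full_name": "c/wf", "stars": 9},
}

# Union returns a new dict and leaves both inputs untouched.
merged = old_repos | new_repos

assert merged["b/wf"]["stars"] == 5      # value from new_repos replaces the old one
assert old_repos["b/wf"]["stars"] == 2   # the original dict is not modified
assert list(merged) == ["a/wf", "b/wf", "c/wf"]
```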

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

🧹 Nitpick comments (2)
scripts/common.py (2)

68-71: LGTM: Improved generic interface with consistent sorting.

The refactored function provides a more flexible interface and ensures deterministic output through sorting. The alphabetical sorting aligns with the PR objective of reducing unnecessary data rewrites.

Consider adding input validation or documentation to clarify that input should be a dictionary where values contain a "full_name" key.
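One possible shape for such a check, as a sketch only: `validate_records` is a hypothetical helper name, and the choice of exception types is an assumption rather than anything proposed in the PR.

```python
def validate_records(data):
    """Raise if `data` is not a dict whose values are dicts with a "full_name" key."""
    if not isinstance(data, dict):
        raise TypeError(f"expected a dict of records, got {type(data).__name__}")
    # Collect every offending key so one error message reports all of them.
    missing = [
        key
        for key, record in data.items()
        if not isinstance(record, dict) or "full_name" not in record
    ]
    if missing:
        raise ValueError(f"records without 'full_name': {missing}")


# A well-formed input passes silently.
validate_records({"owner/wf": {"full_name": "owner/wf"}})
```

Calling a check like this at the top of `store_data` would turn a `KeyError` raised deep inside the sort into an error that names the offending entries.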


108-109: LGTM: Efficient streaming tarfile implementation.

The function correctly uses streaming mode for network-based tar files, which is memory-efficient for large archives.

Consider adding error handling for network failures and URL validation if the URL parameter comes from external sources, though this may be acceptable for internal tooling.
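A sketch of what a streaming helper of this kind could look like. The body below is an assumption inferred from the description; gzip compression is assumed as well. The demonstration uses a local `file://` URL so that it runs without network access.

```python
import io
import pathlib
import tarfile
import tempfile
import urllib.request


def get_tarfile(url):
    """Open a gzip-compressed tarball from `url` as a forward-only stream (sketch)."""
    # Mode "r|gz" reads the archive sequentially without seeking, so the
    # response does not have to be downloaded completely before reading starts.
    return tarfile.open(fileobj=urllib.request.urlopen(url), mode="r|gz")


# Build a tiny archive locally and read it back through a file:// URL.
with tempfile.TemporaryDirectory() as tmp:
    archive = pathlib.Path(tmp) / "repo.tar.gz"
    payload = b"rule all:\n"
    with tarfile.open(archive, "w:gz") as tar:
        info = tarfile.TarInfo("repo/Snakefile")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    with get_tarfile(archive.as_uri()) as tar:
        names = [member.name for member in tar]

assert names == ["repo/Snakefile"]
```

For the error handling mentioned above, `urllib.request.urlopen` raises `urllib.error.URLError` (or its subclass `HTTPError`) on failed requests, so a caller could catch that around the `get_tarfile` call.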

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b45644f and 27e175e.

📒 Files selected for processing (8)
  • README.md (1 hunks)
  • data.json (0 hunks)
  • scripts/cleanup-catalog.py (2 hunks)
  • scripts/common.py (4 hunks)
  • scripts/generate-catalog.py (7 hunks)
  • source/build_wf_pages.py (1 hunks)
  • source/build_wf_tables.py (1 hunks)
  • templates/data.js (0 hunks)
💤 Files with no reviewable changes (2)
  • data.json
  • templates/data.js
🚧 Files skipped from review as they are similar to previous changes (5)
  • scripts/cleanup-catalog.py
  • README.md
  • source/build_wf_pages.py
  • source/build_wf_tables.py
  • scripts/generate-catalog.py

🔇 Additional comments (3)
scripts/common.py (3)

8-9: LGTM: Import additions support new functionality.

The tarfile and urllib.request imports are appropriately added to support the new get_tarfile function for handling tarball downloads.


25-26: LGTM: Correct transition from JavaScript to JSON data loading.

The change properly updates the data loading logic to read pure JSON instead of JavaScript with Jinja2 templating. Using json.load(f) directly is more efficient than the previous approach.


60-65: LGTM: Clean utility function for registering skipped repositories.

The function correctly extracts repository metadata and converts the datetime to a timestamp for JSON serialization. The implementation assumes the repo parameter has full_name and updated_at attributes, which is consistent with GitHub API objects.
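The described behaviour can be sketched like this. The dictionary layout and the `"updated_at"` key are assumptions, and `SimpleNamespace` stands in for a GitHub API repository object.

```python
import json
from datetime import datetime, timezone
from types import SimpleNamespace


def register_skip(repo, skips):
    """Record `repo` as skipped and return the updated dict (sketch)."""
    skips[repo.full_name] = {
        "full_name": repo.full_name,
        # datetime objects are not JSON serializable; a POSIX timestamp is.
        "updated_at": repo.updated_at.timestamp(),
    }
    return skips


# Stand-in for a repository object with the two attributes the function relies on.
repo = SimpleNamespace(
    full_name="owner/workflow",
    updated_at=datetime(2025, 7, 22, tzinfo=timezone.utc),
)
skips = register_skip(repo, {})

assert skips["owner/workflow"]["updated_at"] == repo.updated_at.timestamp()
json.dumps(skips)  # serializes without a custom encoder
```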

@m-jahn m-jahn merged commit e979da6 into snakemake:main Jul 24, 2025
3 checks passed
@m-jahn m-jahn deleted the sorted_db branch July 24, 2025 07:44
Development

Successfully merging this pull request may close these issues.

storage of repo data in json file creates large overhead of changes lines
2 participants