feat: store repo data in sorted JSON files, closes #55 #57
## Walkthrough
The changes transition the workflow catalog's data handling from JavaScript-based (`data.js` with Jinja2 templating) to pure JSON files (`data.json`). Scripts and source files are refactored to read, write, and process JSON directly. Additional utility functions are introduced, and data structures are updated from lists to dictionaries for improved management.
## Changes
| File(s) | Change Summary |
|------------------------------------------------|------------------------------------------------------------------------------------------------|
| README.md | Expanded local setup instructions to include `N_REPOS` environment variable. |
| data.json | Converted from JS variable assignment to pure JSON array. |
| scripts/cleanup-catalog.py | Reformatted imports and loops; changed data storage to use full dictionaries and filenames. |
| scripts/common.py | Removed Jinja2, switched to JSON serialization; added `register_skip` and `get_tarfile` helpers.|
| scripts/generate-catalog.py | Refactored data structures from lists to dicts; updated skip handling and tarball extraction. |
| source/build_wf_pages.py, source/build_wf_tables.py | Changed data import from `data.js` to `data.json`; removed line skipping; simplified JSON loading.|
| templates/data.js | Deleted Jinja2-based JavaScript template file. |
## Sequence Diagram(s)
```mermaid
sequenceDiagram
participant User
participant generate-catalog.py
participant common.py
participant data.json
User->>generate-catalog.py: Run script
generate-catalog.py->>common.py: get_tarfile(url)
common.py-->>generate-catalog.py: Tarfile object
generate-catalog.py->>common.py: register_skip(repo, skips)
common.py-->>generate-catalog.py: Updated skips dict
generate-catalog.py->>common.py: store_data(repos, "data.json")
common.py->>data.json: Write JSON
generate-catalog.py->>common.py: store_data(skips, "skips.json")
common.py->>skips.json: Write JSON
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~40 minutes
Actionable comments posted: 0
🧹 Nitpick comments (1)
README.md (1)
93-97: Use “environment” instead of “environmental”.

Minor wording nit: we usually say “environment variables”, not “environmental variables”.

    -2. Set required environmental variables.
    +2. Set required environment variables.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
- README.md (1 hunks)
- data.json (0 hunks)
- scripts/cleanup-catalog.py (2 hunks)
- scripts/common.py (4 hunks)
- scripts/generate-catalog.py (7 hunks)
- source/build_wf_pages.py (1 hunks)
- source/build_wf_tables.py (1 hunks)
- templates/data.js (0 hunks)
📓 Path-based instructions (1)
⚙️ CodeRabbit Configuration File, `**/*.py`:
- Do not try to improve formatting.
- Do not suggest type annotations for functions that are defined inside of functions or methods.
- Do not suggest type annotation of the `self` argument of methods.
- Do not suggest type annotation of the `cls` argument of classmethods.
- Do not suggest return type annotation if a function or method does not contain a `return` statement.
Files:
source/build_wf_tables.py
source/build_wf_pages.py
scripts/common.py
scripts/cleanup-catalog.py
scripts/generate-catalog.py
🧠 Learnings (5)
📓 Common learnings
Learnt from: m-jahn
PR: snakemake/snakemake-workflow-catalog#50
File: topics.json:0-0
Timestamp: 2025-06-10T12:10:20.483Z
Learning: In the snakemake-workflow-catalog project, m-jahn prefers to allow cross-topic keyword mappings in topics.json, where keywords can appear in multiple topic categories. This design decision enables workflows to be discoverable under all relevant topic areas, improving user experience even though it means some workflows will be categorized under multiple topics.
README.md (2), scripts/common.py (1)
Learnt from: johanneskoester
PR: #45
File: .github/workflows/test.yml:0-0
Timestamp: 2025-05-09T08:11:42.764Z
Learning: In test workflows (like `.github/workflows/test.yml`), hardcoded environment variables like `OFFSET` and `LATEST_COMMIT` are intentional and appropriate as they provide controlled, reproducible test parameters. These values should not be replaced with dynamic values derived from GitHub contexts.
README.md (2), source/build_wf_tables.py (1), scripts/generate-catalog.py (1)
Learnt from: m-jahn
PR: #33
File: source/build_wf_tables.py:51-51
Timestamp: 2025-03-14T15:04:47.868Z
Learning: In the build_wf_tables.py file, the repository list is expected to never be empty during normal operation, and an empty list would indicate an upstream problem with repository data collection that should be addressed separately.
🧬 Code Graph Analysis (2)
scripts/cleanup-catalog.py (1)
- scripts/common.py (2): `store_data` (67-70), `check_repo_exists` (73-84)
scripts/generate-catalog.py (1)
- scripts/common.py (4): `register_skip` (59-64), `store_data` (67-70), `call_rate_limit_aware` (51-56), `get_tarfile` (87-88)
💤 Files with no reviewable changes (2)
- data.json
- templates/data.js
🔇 Additional comments (18)
README.md (1)
108-110: Double-check the script name (`generate-catalog.py` vs `generate_catalog.py`).

Elsewhere in the PR description the script is referred to as `generate_catalog.py` (underscore), but the README uses a hyphen. If the filename was renamed, this command will break.

    -python scripts/generate-catalog.py
    +python scripts/generate_catalog.py

source/build_wf_pages.py (1)

60-62: Clean transition from JavaScript to JSON data format.

The changes correctly update the data loading to read from pure JSON (`data.json`) instead of JavaScript-wrapped JSON (`data.js`). The switch from `json.loads(f.read())` to `json.load(f)` is appropriate since there's no longer a need to skip the JavaScript variable declaration line.

source/build_wf_tables.py (1)

83-85: Consistent transition to JSON data format.

The changes mirror those in `source/build_wf_pages.py`, correctly updating the data loading to read from pure JSON format. The use of `json.load(f)` instead of `json.loads(f.read())` is appropriate for the new JSON-only format.

scripts/cleanup-catalog.py (3)

6-13: Improved import formatting for readability.

The multi-line import statement improves code readability without changing functionality.

20-22: Enhanced loop formatting for clarity.

The multi-line loop formatting improves code readability while maintaining the same logic.

39-40: Updated to use new store_data signature.

The changes correctly align with the updated `store_data` function in `scripts/common.py` (lines 67-70), now passing the full dictionaries and explicit JSON filenames instead of just the dictionary values.

scripts/common.py (5)
7-8: Added necessary imports for new utility functions.

The `tarfile` and `urllib.request` imports support the new `get_tarfile` function introduced later in the file.

24-25: Simplified data loading with direct JSON parsing.

The transition from JavaScript-wrapped JSON to pure JSON eliminates the need to skip the first line and parse manually. Using `json.load(f)` directly is cleaner and more efficient.
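
To make the difference concrete, here is a minimal sketch of the two loading styles. The layout of the old `data.js` (a single JavaScript assignment line followed by the JSON payload) and the helper names `load_js_wrapped` and `load_pure_json` are assumptions for illustration, not code from this PR.

```python
import io
import json


def load_js_wrapped(f):
    """Old style (assumed layout): drop the first line, parse the rest as JSON."""
    f.readline()  # discard the JavaScript assignment line, e.g. "var data ="
    return json.loads(f.read())


def load_pure_json(f):
    """New style: the whole file is valid JSON, so json.load() is enough."""
    return json.load(f)


# In-memory stand-ins for the two file formats (hypothetical content).
legacy = io.StringIO('var data =\n[{"full_name": "owner/repo"}]')
current = io.StringIO('[{"full_name": "owner/repo"}]')

legacy_records = load_js_wrapped(legacy)
current_records = load_pure_json(current)
```

Both calls yield the same Python objects; the pure JSON file simply no longer needs format-specific preprocessing.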
59-64: Well-designed register_skip function.

The function properly registers a repository as skipped by storing both the full name and timestamp, returning the updated dictionary for functional-style updates.

67-70: Improved store_data function with better design.

The new signature taking a dictionary and explicit filename is more flexible and clear. Sorting by `full_name` and using `sort_keys=True` with indentation ensures consistent, readable JSON output.
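
For readers without the diff at hand, a sketch of what such a function could look like. This is an assumption based on the description above (dictionary in, sorted JSON array out), not the actual body of `store_data` in `scripts/common.py`; the indentation width and the sample records are made up.

```python
import json
import os
import tempfile


def store_data(data, filename):
    """Write repo records sorted by full_name, with sorted keys and indentation."""
    # Assumption: `data` maps full names to record dicts holding "full_name".
    records = sorted(data.values(), key=lambda item: item["full_name"])
    with open(filename, "w") as out:
        json.dump(records, out, sort_keys=True, indent=2)


# Hypothetical records, deliberately inserted out of order.
repos = {
    "b/beta": {"full_name": "b/beta", "stargazers_count": 1},
    "a/alpha": {"stargazers_count": 5, "full_name": "a/alpha"},
}
path = os.path.join(tempfile.mkdtemp(), "data.json")
store_data(repos, path)

with open(path) as f:
    stored = json.load(f)
```

Because both the record order and the key order are fixed, two runs over the same data produce byte-identical files, which keeps diffs of `data.json` small.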
87-88: Useful utility function for tarball handling.

The `get_tarfile` function provides a clean abstraction for opening compressed archives from URLs, which is used in the catalog generation process.

scripts/generate-catalog.py (7)

16-16: Updated imports align with refactored common utilities.

The imports correctly reference the `register_skip` and `get_tarfile` functions now provided by `scripts/common.py`.

Also applies to: 19-19

35-36: Improved data structures using dictionaries.

The transition from lists to dictionaries enables direct access by repository full name, improving efficiency and simplifying data management throughout the script.
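
A small illustration of why keying by full name helps. The records and the helper `find_in_list` are hypothetical and only contrast the two access patterns; they are not taken from `generate-catalog.py`.

```python
# List-based storage: finding one repo means scanning the records.
repo_list = [
    {"full_name": "a/alpha", "stars": 5},
    {"full_name": "b/beta", "stars": 1},
]


def find_in_list(records, full_name):
    """Linear search for a record by its full name."""
    for record in records:
        if record["full_name"] == full_name:
            return record
    return None


# Dictionary keyed by full name: a single hash lookup, and re-adding
# a repository overwrites the old entry instead of duplicating it.
repo_dict = {record["full_name"]: record for record in repo_list}
repo_dict["b/beta"] = {"full_name": "b/beta", "stars": 2}
```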
168-168: Consistent dictionary-based data storage.

Using dictionary keys for direct assignment aligns with the new data structure design and improves code clarity.

Also applies to: 175-175

188-190: Simplified tarball handling with utility function.

The use of the imported `get_tarfile` function eliminates code duplication and provides consistent tarball handling across the codebase.

197-197: Proper skip registration with dictionary updates.

The calls to `register_skip` correctly reassign the returned dictionary, maintaining functional programming patterns while updating the skips collection.

Also applies to: 209-209, 219-219

297-297: Dictionary-based repository storage.

Using the repository's full name as the dictionary key provides direct access and aligns with the overall refactoring to dictionary-based data structures.

304-304: Efficient dictionary merging and storage.

The use of dictionary union (`|`) operator provides clean merging of old and new data, while the updated `store_data` calls use explicit filenames for clarity.

Also applies to: 307-307
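
As a reminder of how the union operator behaves (it requires Python 3.9 or newer), a short sketch with made-up records. The variable names are illustrative; which operand holds the previously stored data in the actual script is not shown in this thread.

```python
# Hypothetical previously stored records and freshly collected ones.
previous = {"a/alpha": {"stars": 5}, "b/beta": {"stars": 1}}
fresh = {"b/beta": {"stars": 2}, "c/gamma": {"stars": 0}}

# `|` builds a new dict; on key clashes the right-hand operand wins,
# so fresh records replace stale ones and neither input is modified.
merged = previous | fresh
```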
Actionable comments posted: 0
🧹 Nitpick comments (2)
scripts/common.py (2)
68-71: LGTM: Improved generic interface with consistent sorting.

The refactored function provides a more flexible interface and ensures deterministic output through sorting. The alphabetical sorting aligns with the PR objective of reducing unnecessary data rewrites.

Consider adding input validation or documentation to clarify that `input` should be a dictionary where values contain a "full_name" key.
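
One possible shape for such a check, as a sketch only. The helper name `validate_records` and its error messages are hypothetical; nothing like it exists in the PR as described.

```python
def validate_records(data):
    """Check that `data` is a dict whose values are dicts with a "full_name" key."""
    if not isinstance(data, dict):
        raise TypeError(f"expected a dict of records, got {type(data).__name__}")
    for key, value in data.items():
        if not isinstance(value, dict) or "full_name" not in value:
            raise ValueError(f"entry {key!r} has no 'full_name' key")
    return data
```

Calling it at the top of `store_data` would turn a confusing `KeyError` raised during sorting into an error that names the offending entry.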
108-109: LGTM: Efficient streaming tarfile implementation.

The function correctly uses streaming mode for network-based tar files, which is memory-efficient for large archives.

Consider adding error handling for network failures and URL validation if the URL parameter comes from external sources, though this may be acceptable for internal tooling.
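
A sketch of how the suggested error handling could be layered on top of a streaming open. The body below is an assumption about how `get_tarfile` might work (the diff is not reproduced here); the `timeout` parameter and the choice of `RuntimeError` are illustrative.

```python
import tarfile
import urllib.error
import urllib.request


def get_tarfile(url, timeout=30):
    """Open a gzip-compressed tarball from a URL as a forward-only stream."""
    try:
        response = urllib.request.urlopen(url, timeout=timeout)
    except (urllib.error.URLError, ValueError) as exc:
        # URLError covers network failures and HTTP errors;
        # ValueError is raised for malformed URLs.
        raise RuntimeError(f"could not fetch tarball from {url!r}") from exc
    # "r|gz" reads the archive sequentially without seeking, so the
    # download never has to be buffered in full.
    return tarfile.open(fileobj=response, mode="r|gz")
```

In stream mode, members have to be consumed in order, for example with `for member in tar:`; random access to earlier members is not available.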
🚧 Files skipped from review as they are similar to previous changes (5)
- scripts/cleanup-catalog.py
- README.md
- source/build_wf_pages.py
- source/build_wf_tables.py
- scripts/generate-catalog.py
🔇 Additional comments (3)
scripts/common.py (3)
8-9: LGTM: Import additions support new functionality.

The `tarfile` and `urllib.request` imports are appropriately added to support the new `get_tarfile` function for handling tarball downloads.

25-26: LGTM: Correct transition from JavaScript to JSON data loading.

The change properly updates the data loading logic to read pure JSON instead of JavaScript with Jinja2 templating. Using `json.load(f)` directly is more efficient than the previous approach.

60-65: LGTM: Clean utility function for registering skipped repositories.

The function correctly extracts repository metadata and converts the datetime to a timestamp for JSON serialization. The implementation assumes the `repo` parameter has `full_name` and `updated_at` attributes, which is consistent with GitHub API objects.
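
A sketch of the behaviour described in this comment, assuming the skip entry stores exactly the full name and a numeric timestamp. The dictionary layout and the stand-in `repo` object are assumptions; only the `register_skip(repo, skips)` call shape comes from the walkthrough.

```python
import json
from datetime import datetime, timezone
from types import SimpleNamespace


def register_skip(repo, skips):
    """Record a repository as skipped, keyed by its full name."""
    skips[repo.full_name] = {
        "full_name": repo.full_name,
        # datetime objects are not JSON serializable, a float timestamp is.
        "updated_at": repo.updated_at.timestamp(),
    }
    return skips


# Stand-in for a GitHub API repository object (hypothetical values).
repo = SimpleNamespace(
    full_name="owner/workflow",
    updated_at=datetime(2025, 1, 1, tzinfo=timezone.utc),
)
skips = register_skip(repo, {})
serialized = json.dumps(skips, sort_keys=True)
```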
`data.js` to become `data.json`, with same structure as `skips.json`