feat(integration-tests): Add infrastructure for running integration tests; Add basic integration tests for `clp` & `clp-s` compression. #1100

Bill-hbrhbr · 2025-07-14T14:11:48Z

Description

As the CLP codebase grows, C++ unit tests alone are no longer sufficient. We now need integration tests that validate the correctness of the entire codebase and to ensure components work correctly together. This not only helps developers catch issues earlier but also reduces review overhead.

This PR establishes a foundation for future, more comprehensive integration testing. Specifically, it introduces:

Integration test setup with pytest
Linting via black, ruff, and mypy
Environment management with uv
Taskfile workflows specific to integration tests

Within the pytest setup, this PR adds:

A pytest.ini with default options, logging, and markers
Facilities for downloading and storing test logs
Configuration classes for CLP packages and binaries
Example tests for CLP core binaries
Utility functions to support future tests

Checklist

The PR satisfies the contribution guidelines.
This is a breaking change and that has been indicated in the PR title, OR this isn't a
breaking change.
Necessary docs have been updated, OR no docs need to be updated.

Validation performed

Integration tests pass with the current main branch codebase and selected test logs.

Summary by CodeRabbit

New Features
- Added an integration-tests project with end-to-end identity‑transformation tests, fixtures, frozen config objects and test utilities.
Tests
- Added pytest configuration, pinned Python 3.9 for the test workspace, test tasks/entries and markers, plus helpers to run commands, assert results, and compare directories/JSON.
Documentation
- Added testing docs and an integration‑tests README; documented uv as a dependency.
Chores
- Enhanced lint workflow with per‑project linting and task additions; updated .gitignore to exclude pycache/.

coderabbitai · 2025-07-14T14:11:55Z

Walkthrough

Adds a new integration-tests Python project (pytest config, fixtures, utils, and tests), taskfile and lint-task updates for per-project linting and integration test orchestration, documentation for testing, and a .gitignore entry for __pycache__/. All changes are non-API except new test helpers and frozen dataclasses used by tests.

Changes

Cohort / File(s)	Summary of changes
VCS ignore update `\.gitignore`	Added `__pycache__/` ignore entry.
Integration project metadata & config `integration-tests/pyproject.toml`, `integration-tests/.pytest.ini`, `integration-tests/.python-version`, `integration-tests/README.md`	New Python project config (hatchling), pinned Python `3.9`, dev deps (mypy/ruff/pytest/pytest-env), mypy/ruff settings, pytest logging/markers/env, and README.
Package & test package inits `integration-tests/src/integration_tests/__init__.py`, `integration-tests/tests/__init__.py`, `integration-tests/tests/fixtures/__init__.py`, `integration-tests/tests/utils/__init__.py`	Added module/package docstrings to establish namespaces.
Pytest wiring `integration-tests/tests/conftest.py`	Registered pytest plugins via `pytest_plugins`.
Fixtures: config & datasets `integration-tests/tests/fixtures/integration_test_config.py`, `integration-tests/tests/fixtures/integration_test_logs.py`	Added session-scoped `integration_test_config` fixture (core/package/build paths) and dataset fixtures (`hive_24hr`, `postgresql`) that download/extract tarballs and use pytest cache.
Test utilities — configs `integration-tests/tests/utils/config.py`	Added frozen dataclasses: `CoreConfig`, `PackageConfig`, `IntegrationTestConfig`, `IntegrationTestLogs`, `CompressionTestConfig` with validation, computed paths, and `clear_test_outputs`.
Test utilities — helpers `integration-tests/tests/utils/utils.py`, `integration-tests/tests/utils/asserting_utils.py`	Added helpers: `get_env_var`, `validate_dir_exists`, `unlink`, `is_dir_tree_content_equal`, `is_json_file_structurally_equal` (uses `jq` + `sort`), and `run_and_assert` (subprocess wrapper that fails tests on error).
Tests `integration-tests/tests/test_identity_transformation.py`	Added identity-transformation tests exercising `clp` (text) and `clp-s` (JSON) compress/decompress flows, including consolidation and structural JSON comparison.
Taskfile includes & test tasks `taskfile.yaml`, `taskfiles/tests/main.yaml`, `taskfiles/tests/integration.yaml`	Included test taskfile and added integration test tasks (`default`, `core`, `cache-clear`) and `G_INTEGRATION_TESTS_DIR`; `clean` now calls `tests:integration:cache-clear`.
Lint task updates `taskfiles/lint.yaml`	Split `RUFF_FLAGS` into `RUFF_CHECK_FLAGS` and `RUFF_FORMAT_FLAGS`; added `UV_PYTHON_PROJECTS` (includes `integration-tests`); per-project ruff format/check and mypy runs; adjusted fix-py invocation.
Documentation & navigation `docs/src/dev-docs/building-package.md`, `docs/src/dev-docs/components-core/index.md`, `docs/src/dev-docs/index.md`, `docs/src/dev-docs/testing/index.md`, `docs/src/dev-docs/testing/integration-tests.md`, `docs/src/dev-docs/testing/unit-tests.md`	Added testing docs, navigation card and toctree, noted `uv` as a dependency, and documented running integration tests and related environment variables.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Dev as Developer
  participant UV as uv (task runner)
  participant Pytest as pytest
  participant Fixtures as Fixtures
  participant Utils as TestUtils
  participant Bins as CLP binaries

  Dev->>UV: task tests:integration
  UV->>Pytest: run pytest in integration-tests (env vars set)
  Pytest->>Fixtures: load integration_test_config (session)
  Fixtures->>Utils: get_env_var(), validate_dir_exists()
  Pytest->>Fixtures: load dataset fixture (download & extract if needed)
  Fixtures->>Utils: unlink / prepare test dirs
  Pytest->>Bins: invoke clp / clp-s (compress/decompress)
  Bins-->>Pytest: outputs (archives, decompressed dirs)
  Pytest->>Utils: is_dir_tree_content_equal / is_json_file_structurally_equal
  Utils-->>Pytest: comparison result
  Pytest-->>Dev: test results
  note right of Pytest: clp-s flow includes consolidation before structural JSON compare

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related issues

Optimize integration test dependencies: core tests should only require core binaries #1281 — This PR adds PackageConfig and performs package-directory validation in the session-scoped integration_test_config fixture, which relates to the issue about unconditional package validation for core-only tests.

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 66.67% which is insufficient. The required threshold is 80.00%.	You can run `@coderabbitai generate docstrings` to improve docstring coverage.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check	✅ Passed	The title accurately and succinctly summarizes the primary change: adding integration-test infrastructure and basic tests for the clp and clp-s compressors, and it follows Conventional Commits style; it is specific rather than generic so a reviewer scanning history will understand the main intent.

✨ Finishing touches

📝 Generate Docstrings

🧪 Generate unit tests

Create PR with unit tests
Post copyable unit tests in a comment

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 31e69e3 and f424f21.

📒 Files selected for processing (1)

integration-tests/tests/fixtures/integration_test_logs.py (1 hunks)

🧰 Additional context used

🧠 Learnings (4)

📓 Common learnings

Learnt from: Bill-hbrhbr
PR: y-scope/clp#1100
File: integration-tests/tests/fixtures/integration_test_logs.py:54-56
Timestamp: 2025-08-17T16:10:38.722Z
Learning: For PR #1100 (feat(integration-tests): Add CLP package integration tests boilerplate), do not raise cache weakness problems related to the pytest cache implementation in the integration test logs fixtures.

Learnt from: Bill-hbrhbr
PR: y-scope/clp#1100
File: integration-tests/pyproject.toml:11-13
Timestamp: 2025-08-16T10:24:29.316Z
Learning: In the CLP project's integration-tests directory, the integration tests package is not intended to be shipped or distributed - only the tests are leveraged for local testing purposes, so console script entry points should not be included in pyproject.toml.

📚 Learning: 2025-08-17T16:10:38.722Z

Learnt from: Bill-hbrhbr
PR: y-scope/clp#1100
File: integration-tests/tests/fixtures/integration_test_logs.py:54-56
Timestamp: 2025-08-17T16:10:38.722Z
Learning: For PR #1100 (feat(integration-tests): Add CLP package integration tests boilerplate), do not raise cache weakness problems related to the pytest cache implementation in the integration test logs fixtures.

Applied to files:

integration-tests/tests/fixtures/integration_test_logs.py

📚 Learning: 2025-08-08T21:15:10.905Z

Learnt from: haiqi96
PR: y-scope/clp#1100
File: integration-tests/tests/test_identity_transformation.py:87-95
Timestamp: 2025-08-08T21:15:10.905Z
Learning: In the CLP project's integration tests (Python code), variable names should use "logs" (plural) rather than "log" (singular) when referring to test logs or log-related entities, as this aligns with the naming conventions used throughout the codebase.

Applied to files:

integration-tests/tests/fixtures/integration_test_logs.py

📚 Learning: 2025-08-20T09:01:25.825Z

Learnt from: Bill-hbrhbr
PR: y-scope/clp#1100
File: integration-tests/tests/fixtures/integration_test_logs.py:60-71
Timestamp: 2025-08-20T09:01:25.825Z
Learning: For integration tests using curl downloads in PR #1100, curl's built-in timeout and retry mechanisms (with --fail and --show-error flags) are sufficient, and additional subprocess timeout parameters are not needed.

Applied to files:

integration-tests/tests/fixtures/integration_test_logs.py

🧬 Code graph analysis (1)

integration-tests/tests/fixtures/integration_test_logs.py (3)

integration-tests/tests/utils/config.py (2)

IntegrationTestConfig (74-95)

IntegrationTestLogs (99-123)

integration-tests/tests/utils/utils.py (1)

unlink (54-75)

integration-tests/tests/fixtures/integration_test_config.py (1)

integration_test_config (16-29)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (12)

GitHub Check: musllinux_1_2-x86_64-static-linked-bins
GitHub Check: ubuntu-jammy-static-linked-bins
GitHub Check: ubuntu-jammy-dynamic-linked-bins
GitHub Check: musllinux_1_2-x86_64-dynamic-linked-bins
GitHub Check: package-image
GitHub Check: ubuntu-jammy-lint
GitHub Check: centos-stream-9-static-linked-bins
GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
GitHub Check: manylinux_2_28-x86_64-static-linked-bins
GitHub Check: centos-stream-9-dynamic-linked-bins
GitHub Check: build-macos (macos-15, false)
GitHub Check: build-macos (macos-15, true)

🔇 Additional comments (4)

integration-tests/tests/fixtures/integration_test_logs.py (4)

18-29: Fixture looks good (session-scoped, typed, consistent naming).

No issues spotted; this cleanly delegates to the helper and follows the “logs” naming convention.

32-43: Second dataset fixture is consistent and correct.

Matches the first fixture’s pattern; scoped and typed appropriately.

61-65: Good fail‑fast check for curl availability.

Resolving the binary with shutil.which and raising a clear error is the right approach.

79-79: Confirm unlink’s sudo fallback won’t break CI/Windows.

tests.utils.utils.unlink escalates to “sudo rm -rf” on PermissionError. If Windows runners or non‑sudo CI are in scope, this can fail.

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 Ruff (0.12.2)

integration-tests/tests/fixtures/integration_test_logs.py

�[1;31mruff failed�[0m
�[1mCause:�[0m Failed to load configuration /ruff.toml
�[1mCause:�[0m Failed to parse /ruff.toml
�[1mCause:�[0m TOML parse error at line 26, column 3
|
26 | "RSE100", # Use of assert detected
| ^^^^^^^^
Unknown rule selector: RSE100

Tip

👮 Agentic pre-merge checks are now available in preview!

Pro plan users can now enable pre-merge checks in their settings to enforce checklists before merging PRs.

Built-in checks – Quickly apply ready-made checks to enforce title conventions, require pull request descriptions that follow templates, validate linked issues for compliance, and more.
Custom agentic checks – Define your own rules using CodeRabbit’s advanced agentic capabilities to enforce organization-specific policies and workflows. For example, you can instruct CodeRabbit’s agent to verify that API documentation is updated whenever API schema files are modified in a PR. Note: Upto 5 custom checks are currently allowed during the preview period. Pricing for this feature will be announced in a few weeks.

Please see the documentation for more information.

Example:

reviews:
  pre_merge_checks:
    custom_checks:
      - name: "Undocumented Breaking Changes"
        mode: "warning"
        instructions: |
          Pass/fail criteria: All breaking changes to public APIs, CLI flags, environment variables, configuration keys, database schemas, or HTTP/GraphQL endpoints must be documented in the "Breaking Change" section of the PR description and in CHANGELOG.md. Exclude purely internal or private changes (e.g., code not exported from package entry points or explicitly marked as internal).

Please share your feedback with us on this Discord post.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 4

📜 Review details

Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fe78f4f and 6880a0f.

⛔ Files ignored due to path filters (1)

integration-tests/uv.lock is excluded by !**/*.lock

📒 Files selected for processing (14)

.gitignore (1 hunks)
docs/src/dev-guide/building-package.md (3 hunks)
integration-tests/.pytest.ini (1 hunks)
integration-tests/.python-version (1 hunks)
integration-tests/pyproject.toml (1 hunks)
integration-tests/src/integration_tests/__init__.py (1 hunks)
integration-tests/tests/conftest.py (1 hunks)
integration-tests/tests/fixtures/base_config.py (1 hunks)
integration-tests/tests/fixtures/dataset_logs.py (1 hunks)
integration-tests/tests/test_identity_transformation.py (1 hunks)
integration-tests/tests/utils/config.py (1 hunks)
integration-tests/tests/utils/utils.py (1 hunks)
taskfile.yaml (2 hunks)
taskfiles/lint.yaml (1 hunks)

🧰 Additional context used

🧠 Learnings (4)

📓 Common learnings

Learnt from: kirkrodrigues
PR: y-scope/clp#881
File: components/core/tools/scripts/lib_install/ubuntu-jammy/install-prebuilt-packages.sh:35-41
Timestamp: 2025-05-06T09:48:55.408Z
Learning: For installation scripts in the CLP project, prefer explicit error handling over automatic dependency resolution (like `apt-get install -f`) when installing packages to give users more control over their system.

Learnt from: gibber9809
PR: y-scope/clp#955
File: components/core/src/clp_s/search/sql/CMakeLists.txt:8-26
Timestamp: 2025-06-02T18:22:24.060Z
Learning: In the clp project, ANTLR code generation at build time is being removed by another PR. When reviewing CMake files, be aware that some temporary suboptimal configurations may exist to reduce merge conflicts between concurrent PRs, especially around ANTLR_TARGET calls.

Learnt from: quinntaylormitchell
PR: y-scope/clp#968
File: docs/src/user-guide/quick-start/overview.md:53-54
Timestamp: 2025-06-18T20:48:48.990Z
Learning: CLP is designed to run on Linux systems where Python is typically pre-installed, so Python installation links are generally not needed in CLP documentation.

docs/src/dev-guide/building-package.md (6)

Learnt from: jackluo923
PR: y-scope/clp#718
File: components/core/tools/scripts/utils/create-debian-package.py:41-41
Timestamp: 2025-02-12T22:24:17.723Z
Learning: For the clp-core Debian package creation script, strict version format validation is considered unnecessary complexity and should be avoided.

Learnt from: kirkrodrigues
PR: y-scope/clp#881
File: components/core/tools/scripts/lib_install/ubuntu-jammy/install-prebuilt-packages.sh:35-41
Timestamp: 2025-05-06T09:48:55.408Z
Learning: For installation scripts in the CLP project, prefer explicit error handling over automatic dependency resolution (like `apt-get install -f`) when installing packages to give users more control over their system.

Learnt from: quinntaylormitchell
PR: y-scope/clp#1069
File: docs/src/user-guide/quick-start/clp-json.md:135-152
Timestamp: 2025-07-05T03:38:16.779Z
Learning: The CLP project team has decided to refrain from using include directives in their documentation at present, preferring to maintain duplicated content rather than using shared includes or partials for de-duplication.

Learnt from: quinntaylormitchell
PR: y-scope/clp#968
File: docs/src/user-guide/quick-start/overview.md:73-109
Timestamp: 2025-06-18T20:39:05.899Z
Learning: The CLP project team prefers to use video content to demonstrate detailed procedural steps (like tarball extraction) rather than including every step in the written documentation, keeping the docs focused on conceptual guidance.

Learnt from: quinntaylormitchell
PR: y-scope/clp#968
File: docs/src/user-guide/quick-start/overview.md:53-54
Timestamp: 2025-06-18T20:48:48.990Z
Learning: CLP is designed to run on Linux systems where Python is typically pre-installed, so Python installation links are generally not needed in CLP documentation.

Learnt from: jackluo923
PR: y-scope/clp#1054
File: components/core/tools/scripts/lib_install/musllinux_1_2/install-prebuilt-packages.sh:27-32
Timestamp: 2025-07-07T17:41:15.655Z
Learning: In CLP installation scripts, consistency across platform scripts is prioritized over defensive programming improvements. For example, when extracting Task binaries with tar in `install-prebuilt-packages.sh`, the extraction pattern should remain consistent with other platform scripts rather than adding defensive flags like `--strip-components=1` to handle potential tarball layout changes.

integration-tests/.pytest.ini (2)

Learnt from: AVMatthews
PR: y-scope/clp#595
File: components/core/tests/test-end_to_end.cpp:59-65
Timestamp: 2024-11-19T17:30:04.970Z
Learning: In 'components/core/tests/test-end_to_end.cpp', during the 'clp-s_compression_and_extraction_no_floats' test, files and directories are intentionally removed at the beginning of the test to ensure that any existing content doesn't influence the test results.

Learnt from: AVMatthews
PR: y-scope/clp#595
File: components/core/tests/test-clp_s-end_to_end.cpp:109-110
Timestamp: 2024-11-29T22:50:17.206Z
Learning: In `components/core/tests/test-clp_s-end_to_end.cpp`, the success of `constructor.store()` is verified through `REQUIRE` statements and subsequent comparisons.

integration-tests/tests/test_identity_transformation.py (5)

Learnt from: AVMatthews
PR: y-scope/clp#595
File: components/core/tests/test-end_to_end.cpp:59-65
Timestamp: 2024-11-19T17:30:04.970Z
Learning: In 'components/core/tests/test-end_to_end.cpp', during the 'clp-s_compression_and_extraction_no_floats' test, files and directories are intentionally removed at the beginning of the test to ensure that any existing content doesn't influence the test results.

Learnt from: Bill-hbrhbr
PR: y-scope/clp#1050
File: components/clp-package-utils/clp_package_utils/scripts/search.py:100-106
Timestamp: 2025-07-03T20:10:43.789Z
Learning: In the current CLP codebase implementation, dataset validation using validate_dataset() is performed within the native scripts (like clp_package_utils/scripts/native/search.py) rather than at the wrapper script level, where the native scripts handle their own parameter validation.

Learnt from: Bill-hbrhbr
PR: y-scope/clp#1050
File: components/clp-package-utils/clp_package_utils/scripts/search.py:100-106
Timestamp: 2025-07-03T20:10:43.789Z
Learning: In the current CLP codebase implementation, dataset validation using validate_dataset() is performed within the native scripts (like clp_package_utils/scripts/native/search.py) rather than at the wrapper script level, where the native scripts handle their own parameter validation.

Learnt from: Bill-hbrhbr
PR: y-scope/clp#1036
File: components/clp-py-utils/clp_py_utils/clp_metadata_db_utils.py:204-211
Timestamp: 2025-07-03T12:58:18.407Z
Learning: In the CLP codebase, the validate_and_cache_dataset function in components/clp-py-utils/clp_py_utils/clp_metadata_db_utils.py uses in-place updates of the existing_datasets set parameter rather than returning a new set, as preferred by the development team.

Learnt from: haiqi96
PR: y-scope/clp#651
File: components/clp-package-utils/clp_package_utils/scripts/compress.py:0-0
Timestamp: 2025-01-16T16:58:43.190Z
Learning: In the clp-package compression flow, path validation and error handling is performed at the scheduler level rather than in the compress.py script to maintain simplicity and avoid code duplication.

🧬 Code Graph Analysis (1)

integration-tests/tests/conftest.py (2)

integration-tests/tests/fixtures/base_config.py (1)

base_config (9-22)

integration-tests/tests/fixtures/dataset_logs.py (4)

download_and_extract_dataset (44-71)

hive_24hr (12-16)

postgresql (36-40)

spark_event_logs (28-32)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)

GitHub Check: ubuntu-jammy-lint
GitHub Check: ubuntu-jammy-static-linked-bins
GitHub Check: centos-stream-9-static-linked-bins
GitHub Check: centos-stream-9-dynamic-linked-bins
GitHub Check: ubuntu-jammy-dynamic-linked-bins
GitHub Check: build-macos (macos-15, false)

🔇 Additional comments (20)

.gitignore (1)

5-5: LGTM! Standard Python cache directory exclusion.

The recursive **/__pycache__/ pattern is appropriate for the new integration tests infrastructure and follows Python project conventions.

integration-tests/tests/utils/config.py (1)

1-18: LGTM! Well-structured configuration classes.

The frozen dataclasses provide clean, immutable configuration structures with proper type hints. The use of Path objects for directory paths and clear field names makes the code maintainable.

integration-tests/tests/utils/utils.py (2)

16-20: LGTM! Clean environment variable retrieval.

The function provides proper error handling with a descriptive assertion message.

22-26: LGTM! Simple and effective command execution wrapper.

The function provides a clean abstraction for running commands with assertion-based error handling.

integration-tests/.python-version (1)

1-1: LGTM! Appropriate Python version specification.

Python 3.10 is a good choice for the integration tests and aligns with the project's requirements.

docs/src/dev-guide/building-package.md (3)

18-18: LGTM! Proper dependency documentation.

The uv requirement is clearly documented with the appropriate version constraint.

65-72: LGTM! Clear testing instructions.

The Test section provides concise, actionable instructions for running integration tests that align with the new infrastructure.

83-83: LGTM! Proper reference link.

The uv reference link is correctly formatted and provides users with additional documentation.

taskfiles/lint.yaml (1)

779-779: LGTM!

Adding the integration-tests directory to the linting process is correct and maintains code quality consistency across the project.

integration-tests/tests/conftest.py (1)

1-7: LGTM!

The conftest.py file correctly aggregates pytest fixtures following standard pytest practices. The imports are consistent with the fixture modules described in the context.

integration-tests/.pytest.ini (2)

10-11: Verify that relative paths work consistently across different execution contexts.

The environment variables use relative paths (../build). Ensure these paths resolve correctly regardless of where the tests are executed from (e.g., from the root directory, from the integration-tests directory, or from CI/CD pipelines).

12-15: Well-defined test markers.

The custom markers provide clear categorization for different types of integration tests, which will be helpful for selective test execution.

integration-tests/tests/fixtures/base_config.py (2)

8-9: Appropriate fixture scope.

Using session scope for the base_config fixture is correct since this configuration should be shared across all tests in the session.

20-21: Robust directory creation.

The directory creation logic correctly uses parents=True and exist_ok=True to handle cases where directories may already exist or need to be created recursively.

integration-tests/pyproject.toml (1)

1-23: LGTM! Well-structured Python project configuration.

The pyproject.toml file is correctly configured with appropriate metadata, build system, and development dependencies for the integration test suite.

taskfile.yaml (2)

18-18: LGTM! Consistent variable naming convention.

The new variable follows the established naming pattern with G_ prefix for global variables.

166-172: LGTM! Well-structured integration test task.

The task properly:

Sets required environment variables for the test suite

Uses uv run for isolated Python execution

Runs from the correct directory context

The environment variables CLP_BUILD_DIR and CLP_PACKAGE_DIR will be available to the tests for locating binaries and packages.

integration-tests/tests/fixtures/dataset_logs.py (1)

11-40: LGTM! Well-structured dataset fixtures.

The fixtures correctly define session-scoped datasets with clear naming and URLs pointing to Zenodo archives.

integration-tests/tests/test_identity_transformation.py (2)

36-67: LGTM! Well-structured identity transformation test for CLP.

The test correctly implements the identity transformation pattern:

Proper cleanup before and after test execution

Uses appropriate CLI flags (--progress, --remove-path-prefix)

Validates that decompressed output matches original input

Follows established cleanup patterns from learned context

69-108: LGTM! Comprehensive identity transformation test for CLP-S.

The test properly handles CLP-S specific requirements:

Implements double compression/decompression cycle for consistency verification

Accounts for JSON key and row order limitations as documented

Includes appropriate TODO comment referencing official documentation

Proper cleanup of all temporary directories

integration-tests/tests/utils/utils.py

integration-tests/src/integration_tests/__init__.py

integration-tests/tests/fixtures/dataset_logs.py

integration-tests/tests/test_identity_transformation.py

coderabbitai

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6880a0f and c24cdaf.

📒 Files selected for processing (2)

taskfile.yaml (2 hunks)
taskfiles/lint.yaml (1 hunks)

🧰 Additional context used

🧠 Learnings (1)

📓 Common learnings

Learnt from: kirkrodrigues
PR: y-scope/clp#881
File: components/core/tools/scripts/lib_install/ubuntu-jammy/install-prebuilt-packages.sh:35-41
Timestamp: 2025-05-06T09:48:55.408Z
Learning: For installation scripts in the CLP project, prefer explicit error handling over automatic dependency resolution (like `apt-get install -f`) when installing packages to give users more control over their system.

Learnt from: gibber9809
PR: y-scope/clp#955
File: components/core/src/clp_s/search/sql/CMakeLists.txt:8-26
Timestamp: 2025-06-02T18:22:24.060Z
Learning: In the clp project, ANTLR code generation at build time is being removed by another PR. When reviewing CMake files, be aware that some temporary suboptimal configurations may exist to reduce merge conflicts between concurrent PRs, especially around ANTLR_TARGET calls.

Learnt from: quinntaylormitchell
PR: y-scope/clp#968
File: docs/src/user-guide/quick-start/overview.md:53-54
Timestamp: 2025-06-18T20:48:48.990Z
Learning: CLP is designed to run on Linux systems where Python is typically pre-installed, so Python installation links are generally not needed in CLP documentation.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (8)

GitHub Check: ubuntu-jammy-dynamic-linked-bins
GitHub Check: ubuntu-jammy-lint
GitHub Check: ubuntu-jammy-static-linked-bins
GitHub Check: centos-stream-9-static-linked-bins
GitHub Check: centos-stream-9-dynamic-linked-bins
GitHub Check: build-macos (macos-15, true)
GitHub Check: build-macos (macos-15, false)
GitHub Check: build (macos-latest)

🔇 Additional comments (2)

taskfiles/lint.yaml (1)

777-778: Integration-tests directory correctly added to Python lint scope

Including integration-tests ensures Black/Ruff run over the new test code. No issues spotted.

taskfile.yaml (1)

18-18: Variable addition looks good

G_INTEGRATION_TESTS_DIR follows existing naming conventions and will simplify future path changes.

taskfile.yaml

haiqi96

Some initial comment. I will need to take a closer look at uv and pytest to fully review this.

integration-tests/README.md

integration-tests/src/integration_tests/__init__.py

taskfile.yaml

integration-tests/tests/test_identity_transformation.py

integration-tests/tests/fixtures/base_config.py

integration-tests/tests/utils/utils.py

integration-tests/tests/fixtures/dataset_logs.py

integration-tests/tests/utils/utils.py

integration-tests/.python-version

taskfile.yaml

integration-tests/tests/fixtures/dataset_logs.py

integration-tests/tests/test_identity_transformation.py

integration-tests/tests/utils/config.py

integration-tests/tests/test_identity_transformation.py

coderabbitai

Actionable comments posted: 2

♻️ Duplicate comments (3)

integration-tests/tests/fixtures/dataset_logs.py (1)
44-73: Address resource cleanup and error handling improvements.

The download function has several areas that could be improved based on past review feedback:

Resource cleanup: Downloaded archives are not cleaned up after extraction

Error handling: The broad except Exception clause could mask specific issues

Security: Consider validating HTTPS URLs for the datasets

Consider this improvement:
    try:
        download_path_str = str(package_config.uncompressed_logs_dir / f"{dataset_name}.tar.gz")
        # fmt: off
        curl_cmds = [
            "curl",
            "--fail",
            "--location",
            "--output", download_path_str,
            "--show-error",
            dataset_config.tar_url,
        ]
        # fmt: on
        subprocess.run(curl_cmds, check=True)

        extract_path_str = str(package_config.uncompressed_logs_dir / dataset_name)
        shutil.unpack_archive(download_path_str, extract_path_str)
+        # Clean up downloaded archive
+        os.unlink(download_path_str)
-    except Exception as e:
+    except (subprocess.CalledProcessError, shutil.ReadError, OSError) as e:
        raise RuntimeError(f"Failed to download and extract dataset `{dataset_name}`: {e}")
This addresses resource cleanup and provides more specific error handling.
integration-tests/tests/test_identity_transformation.py (2)
126-126: Remove duplicate assignment that causes resource leak

Line 126 reassigns keys_and_rows_sorted, overwriting the context-managed temporary file created on line 125. This causes a resource leak and breaks the intended context management.

Remove this line:
-        keys_and_rows_sorted = NamedTemporaryFile(mode="w+", delete=True)
96-99: Inconsistent naming: "single-file" directories without single-file-archive mode

The variable names single_file_archives_dir and single_file_extract_dir suggest single-file archive mode, but the compression commands don't use the --single-file-archive flag. This creates confusion about the test's intent.

Either rename the variables to remove "single-file" references or add the flag if single-file mode is intended.

Also applies to: 104-105

📜 Review details

Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9587f99 and 6d73f31.

📒 Files selected for processing (6)

integration-tests/.pytest.ini (1 hunks)
integration-tests/tests/conftest.py (1 hunks)
integration-tests/tests/fixtures/dataset_logs.py (1 hunks)
integration-tests/tests/test_identity_transformation.py (1 hunks)
integration-tests/tests/utils/utils.py (1 hunks)
taskfile.yaml (2 hunks)

🧰 Additional context used

🧠 Learnings (6)

📓 Common learnings

Learnt from: Bill-hbrhbr
PR: y-scope/clp#1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.149Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.

Learnt from: haiqi96
PR: y-scope/clp#651
File: components/clp-package-utils/clp_package_utils/scripts/compress.py:0-0
Timestamp: 2025-01-16T16:58:43.190Z
Learning: In the clp-package compression flow, path validation and error handling is performed at the scheduler level rather than in the compress.py script to maintain simplicity and avoid code duplication.

Learnt from: AVMatthews
PR: y-scope/clp#595
File: components/core/tests/test-end_to_end.cpp:59-65
Timestamp: 2024-11-19T17:30:04.970Z
Learning: In 'components/core/tests/test-end_to_end.cpp', during the 'clp-s_compression_and_extraction_no_floats' test, files and directories are intentionally removed at the beginning of the test to ensure that any existing content doesn't influence the test results.

Learnt from: jackluo923
PR: y-scope/clp#718
File: components/core/tools/scripts/utils/create-debian-package.py:41-41
Timestamp: 2025-02-12T22:24:17.723Z
Learning: For the clp-core Debian package creation script, strict version format validation is considered unnecessary complexity and should be avoided.

Learnt from: kirkrodrigues
PR: y-scope/clp#881
File: components/core/tools/scripts/lib_install/ubuntu-jammy/install-prebuilt-packages.sh:35-41
Timestamp: 2025-05-06T09:48:55.408Z
Learning: For installation scripts in the CLP project, prefer explicit error handling over automatic dependency resolution (like `apt-get install -f`) when installing packages to give users more control over their system.

Learnt from: quinntaylormitchell
PR: y-scope/clp#968
File: docs/src/user-guide/quick-start/overview.md:53-54
Timestamp: 2025-06-18T20:48:48.990Z
Learning: CLP is designed to run on Linux systems where Python is typically pre-installed, so Python installation links are generally not needed in CLP documentation.

Learnt from: jackluo923
PR: y-scope/clp#1054
File: components/core/tools/scripts/lib_install/musllinux_1_2/install-prebuilt-packages.sh:6-15
Timestamp: 2025-07-01T14:52:15.217Z
Learning: For installation scripts in the CLP project, maintain consistency in command patterns across platforms rather than applying platform-specific optimizations. When a platform follows a pattern of separate update and install commands (like `apt-get update && apt-get install` or `apk update && apk add`), preserve this pattern for uniform script structure.

Learnt from: jackluo923
PR: y-scope/clp#1054
File: components/core/tools/scripts/lib_install/musllinux_1_2/install-prebuilt-packages.sh:6-15
Timestamp: 2025-07-01T14:52:15.217Z
Learning: For installation scripts in the CLP project, maintain consistency in command patterns across platforms rather than applying platform-specific optimizations. When a platform follows a pattern of separate update and install commands (like `apt-get update && apt-get install` or `apk update && apk add`), preserve this pattern for uniform script structure.

Learnt from: jackluo923
PR: y-scope/clp#1054
File: components/core/tools/scripts/lib_install/musllinux_1_2/install-prebuilt-packages.sh:6-15
Timestamp: 2025-07-01T14:52:15.217Z
Learning: For installation scripts in the CLP project, maintain consistency in command patterns across platforms rather than applying platform-specific optimizations. When a platform follows a pattern of separate update and install commands (like `apt-get update && apt-get install` or `apk update && apk add`), preserve this pattern for uniform script structure.

Learnt from: jackluo923
PR: y-scope/clp#1054
File: components/core/tools/scripts/lib_install/musllinux_1_2/install-prebuilt-packages.sh:27-32
Timestamp: 2025-07-07T17:41:15.655Z
Learning: In CLP installation scripts, consistency across platform scripts is prioritized over defensive programming improvements. For example, when extracting Task binaries with tar in `install-prebuilt-packages.sh`, the extraction pattern should remain consistent with other platform scripts rather than adding defensive flags like `--strip-components=1` to handle potential tarball layout changes.

integration-tests/.pytest.ini (2)

Learnt from: AVMatthews
PR: #595
File: components/core/tests/test-end_to_end.cpp:59-65
Timestamp: 2024-11-19T17:30:04.970Z
Learning: In 'components/core/tests/test-end_to_end.cpp', during the 'clp-s_compression_and_extraction_no_floats' test, files and directories are intentionally removed at the beginning of the test to ensure that any existing content doesn't influence the test results.

Learnt from: AVMatthews
PR: #595
File: components/core/tests/test-clp_s-end_to_end.cpp:109-110
Timestamp: 2024-11-29T22:50:17.206Z
Learning: In components/core/tests/test-clp_s-end_to_end.cpp, the success of constructor.store() is verified through REQUIRE statements and subsequent comparisons.

taskfile.yaml (7)

Learnt from: anlowee
PR: #925
File: taskfiles/deps/main.yaml:97-106
Timestamp: 2025-05-28T18:33:30.155Z
Learning: In the taskfiles dependency system (taskfiles/deps/main.yaml), echo commands are used to generate .cmake settings files that are consumed by the main CMake build process. These files set variables like ANTLR_RUNTIME_HEADER to point to dependency locations for use during compilation.

Learnt from: Bill-hbrhbr
PR: #1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.149Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.

Learnt from: jackluo923
PR: #1054
File: components/core/tools/scripts/lib_install/musllinux_1_2/install-prebuilt-packages.sh:27-32
Timestamp: 2025-07-07T17:41:15.655Z
Learning: In CLP installation scripts, consistency across platform scripts is prioritized over defensive programming improvements. For example, when extracting Task binaries with tar in install-prebuilt-packages.sh, the extraction pattern should remain consistent with other platform scripts rather than adding defensive flags like --strip-components=1 to handle potential tarball layout changes.

Learnt from: kirkrodrigues
PR: #881
File: components/core/tools/scripts/lib_install/ubuntu-jammy/install-prebuilt-packages.sh:35-41
Timestamp: 2025-05-06T09:48:55.408Z
Learning: For installation scripts in the CLP project, prefer explicit error handling over automatic dependency resolution (like apt-get install -f) when installing packages to give users more control over their system.

Learnt from: jackluo923
PR: #1054
File: components/core/tools/scripts/lib_install/musllinux_1_2/install-prebuilt-packages.sh:6-15
Timestamp: 2025-07-01T14:52:15.217Z
Learning: For installation scripts in the CLP project, maintain consistency in command patterns across platforms rather than applying platform-specific optimizations. When a platform follows a pattern of separate update and install commands (like apt-get update && apt-get install or apk update && apk add), preserve this pattern for uniform script structure.

integration-tests/tests/utils/utils.py (1)

Learnt from: anlowee
PR: #925
File: taskfile.yaml:132-142
Timestamp: 2025-05-27T19:54:46.662Z
Learning: When checking for uncommitted changes in generated code directories, git status --porcelain is more comprehensive than git diff --quiet because it detects both modified tracked files and new untracked files, while git diff only detects changes to already-tracked files.

integration-tests/tests/test_identity_transformation.py (20)

Learnt from: AVMatthews
PR: #595
File: components/core/tests/test-end_to_end.cpp:59-65
Timestamp: 2024-11-19T17:30:04.970Z
Learning: In 'components/core/tests/test-end_to_end.cpp', during the 'clp-s_compression_and_extraction_no_floats' test, files and directories are intentionally removed at the beginning of the test to ensure that any existing content doesn't influence the test results.

Learnt from: Bill-hbrhbr
PR: #1036
File: components/clp-py-utils/clp_py_utils/clp_metadata_db_utils.py:204-211
Timestamp: 2025-07-03T12:58:18.407Z
Learning: In the CLP codebase, the validate_and_cache_dataset function in components/clp-py-utils/clp_py_utils/clp_metadata_db_utils.py uses in-place updates of the existing_datasets set parameter rather than returning a new set, as preferred by the development team.

Learnt from: Bill-hbrhbr
PR: #1004
File: components/clp-package-utils/clp_package_utils/scripts/native/decompress.py:139-144
Timestamp: 2025-06-24T08:54:14.438Z
Learning: In the CLP codebase, the get_orig_file_id function signature was changed after a recent merge to no longer accept a dataset parameter, making previous suggestions that reference this parameter invalid.

Learnt from: Bill-hbrhbr
PR: #1050
File: components/clp-package-utils/clp_package_utils/scripts/search.py:100-106
Timestamp: 2025-07-03T20:10:43.789Z
Learning: In the current CLP codebase implementation, dataset validation using validate_dataset() is performed within the native scripts (like clp_package_utils/scripts/native/search.py) rather than at the wrapper script level, where the native scripts handle their own parameter validation.

Learnt from: jackluo923
PR: #1054
File: components/core/tools/scripts/lib_install/musllinux_1_2/install-prebuilt-packages.sh:27-32
Timestamp: 2025-07-07T17:41:15.655Z
Learning: In CLP installation scripts, consistency across platform scripts is prioritized over defensive programming improvements. For example, when extracting Task binaries with tar in install-prebuilt-packages.sh, the extraction pattern should remain consistent with other platform scripts rather than adding defensive flags like --strip-components=1 to handle potential tarball layout changes.

Learnt from: jackluo923
PR: #1054
File: components/core/tools/scripts/lib_install/musllinux_1_2/install-prebuilt-packages.sh:6-15
Timestamp: 2025-07-01T14:52:15.217Z
Learning: For installation scripts in the CLP project, maintain consistency in command patterns across platforms rather than applying platform-specific optimizations. When a platform follows a pattern of separate update and install commands (like apt-get update && apt-get install or apk update && apk add), preserve this pattern for uniform script structure.

Learnt from: jackluo923
PR: #1054
File: components/core/tools/scripts/lib_install/musllinux_1_2/install-prebuilt-packages.sh:6-15
Timestamp: 2025-07-01T14:52:15.217Z
Learning: For installation scripts in the CLP project, maintain consistency in command patterns across different platforms (e.g., using separate update and install commands like apk update && apk add, apt update && apt install, yum update && yum install) rather than platform-specific optimizations, to ensure uniform script structure and readability.

Learnt from: AVMatthews
PR: #595
File: components/core/tests/test-clp_s-end_to_end.cpp:109-110
Timestamp: 2024-11-29T22:50:17.206Z
Learning: In components/core/tests/test-clp_s-end_to_end.cpp, the success of constructor.store() is verified through REQUIRE statements and subsequent comparisons.

Learnt from: AVMatthews
PR: #543
File: components/core/src/clp_s/clp-s.cpp:196-265
Timestamp: 2024-10-07T20:10:08.254Z
Learning: In clp-s.cpp, the run_serializer function interleaves serialization and writing of IR files, making it difficult to restructure it into separate functions.

Learnt from: AVMatthews
PR: #543
File: components/core/src/clp_s/clp-s.cpp:196-265
Timestamp: 2024-10-08T15:52:50.753Z
Learning: In clp-s.cpp, the run_serializer function interleaves serialization and writing of IR files, making it difficult to restructure it into separate functions.

Learnt from: Bill-hbrhbr
PR: #1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.149Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.

Learnt from: haiqi96
PR: #594
File: components/clp-package-utils/clp_package_utils/scripts/del_archives.py:56-65
Timestamp: 2024-11-18T16:49:20.248Z
Learning: When reviewing wrapper scripts in components/clp-package-utils/clp_package_utils/scripts/, note that it's preferred to keep error handling simple without adding extra complexity.

Learnt from: haiqi96
PR: #634
File: components/job-orchestration/job_orchestration/executor/compress/fs_compression_task.py:0-0
Timestamp: 2024-12-16T21:49:06.483Z
Learning: In fs_compression_task.py, when refactoring the archive processing loop in Python, the src_archive_file.unlink() operation should remain outside of the process_archive function.

Learnt from: haiqi96
PR: #619
File: components/core/src/clp/clp/decompression.cpp:313-313
Timestamp: 2024-12-05T16:32:21.507Z
Learning: In the C++ FileDecompressor implementations at components/core/src/clp/clp/FileDecompressor.cpp and components/core/src/glt/glt/FileDecompressor.cpp, the temp_output_path variable and associated logic are used to handle multiple compressed files with the same name, and should be kept. This logic is separate from the temporary output directory code removed in PR #619 and is necessary for proper file handling.

Learnt from: haiqi96
PR: #651
File: components/clp-package-utils/clp_package_utils/scripts/compress.py:0-0
Timestamp: 2025-01-16T16:58:43.190Z
Learning: In the clp-package compression flow, path validation and error handling is performed at the scheduler level rather than in the compress.py script to maintain simplicity and avoid code duplication.

Learnt from: haiqi96
PR: #594
File: components/package-template/src/sbin/del-archives.sh:6-9
Timestamp: 2024-11-15T16:28:08.644Z
Learning: For wrapper scripts in the components/package-template/src/sbin/ directory, keep them simple and avoid adding additional validation code.

Learnt from: Bill-hbrhbr
PR: #1036
File: components/clp-package-utils/clp_package_utils/general.py:564-579
Timestamp: 2025-06-28T07:10:47.295Z
Learning: The validate_dataset function in components/clp-package-utils/clp_package_utils/general.py is designed to be called once upon function startup for dataset validation, not repeatedly during execution, making caching optimizations unnecessary.

integration-tests/tests/fixtures/dataset_logs.py (2)

Learnt from: Bill-hbrhbr
PR: #1036
File: components/clp-package-utils/clp_package_utils/general.py:564-579
Timestamp: 2025-06-28T07:10:47.295Z
Learning: The validate_dataset function in components/clp-package-utils/clp_package_utils/general.py is designed to be called once upon function startup for dataset validation, not repeatedly during execution, making caching optimizations unnecessary.

Learnt from: Bill-hbrhbr
PR: #1036
File: components/clp-py-utils/clp_py_utils/clp_metadata_db_utils.py:204-211
Timestamp: 2025-07-03T12:58:18.407Z
Learning: In the CLP codebase, the validate_and_cache_dataset function in components/clp-py-utils/clp_py_utils/clp_metadata_db_utils.py uses in-place updates of the existing_datasets set parameter rather than returning a new set, as preferred by the development team.

🧬 Code Graph Analysis (2)

integration-tests/tests/conftest.py (2)

integration-tests/tests/fixtures/dataset_logs.py (4)

download_and_extract_dataset (45-73)

hive_24hr (13-17)

postgresql (37-41)

spark_event_logs (29-33)

integration-tests/tests/fixtures/package_config.py (1)

package_config (9-22)

integration-tests/tests/test_identity_transformation.py (4)

integration-tests/tests/utils/config.py (2)

DatasetLogs (15-17)

PackageConfig (6-11)

integration-tests/tests/utils/utils.py (2)

is_dir_tree_content_equal (7-14)

run_and_assert (23-26)

integration-tests/tests/fixtures/package_config.py (1)

package_config (9-22)

integration-tests/tests/fixtures/dataset_logs.py (1)

download_and_extract_dataset (45-73)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)

GitHub Check: centos-stream-9-static-linked-bins
GitHub Check: ubuntu-jammy-lint
GitHub Check: centos-stream-9-dynamic-linked-bins
GitHub Check: ubuntu-jammy-static-linked-bins
GitHub Check: ubuntu-jammy-dynamic-linked-bins
GitHub Check: build-macos (macos-15, false)
GitHub Check: build-macos (macos-14, true)
GitHub Check: build-macos (macos-13, true)
GitHub Check: build-macos (macos-13, false)
GitHub Check: lint-check (ubuntu-latest)
GitHub Check: lint-check (macos-latest)

🔇 Additional comments (10)

integration-tests/tests/utils/utils.py (1)

17-20: Clean implementation with helpful error messaging.

The function properly handles missing environment variables with a clear assertion message.

integration-tests/tests/conftest.py (1)

1-7: Standard pytest fixture aggregation pattern.

Clean implementation that properly imports and exposes fixtures for the integration test suite.

integration-tests/.pytest.ini (1)

1-15: Well-structured pytest configuration for integration tests.

The configuration properly sets up environment variables, custom markers, and test execution options. The markers for binaries, clp, and clp_s provide good test categorization capabilities.

taskfile.yaml (3)

18-18: Integration tests directory variable follows existing naming convention.

The variable name and path structure are consistent with other directory variables in the file.

166-174: Well-implemented test task with proper dependency.

The task correctly depends on the package task and properly sets environment variables that match the pytest configuration. This addresses the dependency issue mentioned in past reviews.

176-181: Useful cache clearing task for test maintenance.

The cache clearing task provides a clean way to reset test state, which is valuable for debugging and ensuring clean test runs.

integration-tests/tests/fixtures/dataset_logs.py (4)

12-17: Dataset fixture follows consistent pattern.

The fixture properly returns a DatasetLogs instance with the correct Zenodo URL structure.

20-25: Dataset fixture follows consistent pattern.

The fixture properly returns a DatasetLogs instance with the correct Zenodo URL structure.

28-33: Dataset fixture follows consistent pattern.

The fixture properly returns a DatasetLogs instance with the correct Zenodo URL structure.

36-41: Dataset fixture follows consistent pattern.

The fixture properly returns a DatasetLogs instance with the correct Zenodo URL structure.

integration-tests/tests/utils/utils.py

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (2)

integration-tests/tests/fixtures/dataset_logs.py (1)
52-84: Improve error handling specificity and add archive cleanup.

The download and extraction logic has areas for improvement:

The bare except Exception clause is too broad and could mask unexpected errors

Downloaded archives are not cleaned up after extraction, potentially wasting disk space

Apply this diff to improve error handling and add cleanup:
-    except Exception as e:
-        raise RuntimeError(f"Failed to download and extract dataset `{dataset_name}`: {e}")
+    except subprocess.CalledProcessError as e:
+        raise RuntimeError(f"Failed to download dataset `{dataset_name}`: {e}")
+    except shutil.ReadError as e:
+        raise RuntimeError(f"Failed to extract dataset `{dataset_name}`: {e}")
+    finally:
+        # Clean up downloaded archive to save disk space
+        download_path = package_config.uncompressed_logs_dir / f"{dataset_name}.tar.gz"
+        download_path.unlink(missing_ok=True)
integration-tests/tests/test_identity_transformation.py (1)
127-139: Fix redundant NamedTemporaryFile creation causing resource leak.

Line 132 creates a new NamedTemporaryFile that shadows the one created in the with statement on line 131, causing a resource leak and incorrect behavior.

Remove the redundant line:
 @contextmanager
 def _sort_json_keys_and_rows(json_fp: Path) -> IO[str]:
     with NamedTemporaryFile(mode="w+", delete=True) as keys_sorted, NamedTemporaryFile(
         mode="w+", delete=True
     ) as flattened, NamedTemporaryFile(mode="w+", delete=True) as keys_and_rows_sorted:
-        keys_and_rows_sorted = NamedTemporaryFile(mode="w+", delete=True)
         subprocess.run(["jq", "--sort-keys", ".", str(json_fp)], check=True, stdout=keys_sorted)
         keys_sorted.flush()
         subprocess.run(["jq", ".", keys_sorted.name], check=True, stdout=flattened)
         flattened.flush()
         subprocess.run(["sort", flattened.name], check=True, stdout=keys_and_rows_sorted)
         keys_and_rows_sorted.flush()
         yield keys_and_rows_sorted

📜 Review details

Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6d73f31 and 6216e68.

📒 Files selected for processing (3)

integration-tests/tests/conftest.py (1 hunks)
integration-tests/tests/fixtures/dataset_logs.py (1 hunks)
integration-tests/tests/test_identity_transformation.py (1 hunks)

🧰 Additional context used

🧠 Learnings (3)

📓 Common learnings

Learnt from: Bill-hbrhbr
PR: y-scope/clp#1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.149Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.

Learnt from: haiqi96
PR: y-scope/clp#651
File: components/clp-package-utils/clp_package_utils/scripts/compress.py:0-0
Timestamp: 2025-01-16T16:58:43.190Z
Learning: In the clp-package compression flow, path validation and error handling is performed at the scheduler level rather than in the compress.py script to maintain simplicity and avoid code duplication.

Learnt from: AVMatthews
PR: y-scope/clp#595
File: components/core/tests/test-end_to_end.cpp:59-65
Timestamp: 2024-11-19T17:30:04.970Z
Learning: In 'components/core/tests/test-end_to_end.cpp', during the 'clp-s_compression_and_extraction_no_floats' test, files and directories are intentionally removed at the beginning of the test to ensure that any existing content doesn't influence the test results.

Learnt from: jackluo923
PR: y-scope/clp#718
File: components/core/tools/scripts/utils/create-debian-package.py:41-41
Timestamp: 2025-02-12T22:24:17.723Z
Learning: For the clp-core Debian package creation script, strict version format validation is considered unnecessary complexity and should be avoided.

Learnt from: kirkrodrigues
PR: y-scope/clp#881
File: components/core/tools/scripts/lib_install/ubuntu-jammy/install-prebuilt-packages.sh:35-41
Timestamp: 2025-05-06T09:48:55.408Z
Learning: For installation scripts in the CLP project, prefer explicit error handling over automatic dependency resolution (like `apt-get install -f`) when installing packages to give users more control over their system.

Learnt from: quinntaylormitchell
PR: y-scope/clp#968
File: docs/src/user-guide/quick-start/overview.md:53-54
Timestamp: 2025-06-18T20:48:48.990Z
Learning: CLP is designed to run on Linux systems where Python is typically pre-installed, so Python installation links are generally not needed in CLP documentation.

Learnt from: jackluo923
PR: y-scope/clp#1054
File: components/core/tools/scripts/lib_install/musllinux_1_2/install-prebuilt-packages.sh:6-15
Timestamp: 2025-07-01T14:52:15.217Z
Learning: For installation scripts in the CLP project, maintain consistency in command patterns across platforms rather than applying platform-specific optimizations. When a platform follows a pattern of separate update and install commands (like `apt-get update && apt-get install` or `apk update && apk add`), preserve this pattern for uniform script structure.

Learnt from: jackluo923
PR: y-scope/clp#1054
File: components/core/tools/scripts/lib_install/musllinux_1_2/install-prebuilt-packages.sh:6-15
Timestamp: 2025-07-01T14:52:15.217Z
Learning: For installation scripts in the CLP project, maintain consistency in command patterns across platforms rather than applying platform-specific optimizations. When a platform follows a pattern of separate update and install commands (like `apt-get update && apt-get install` or `apk update && apk add`), preserve this pattern for uniform script structure.

Learnt from: jackluo923
PR: y-scope/clp#1054
File: components/core/tools/scripts/lib_install/musllinux_1_2/install-prebuilt-packages.sh:6-15
Timestamp: 2025-07-01T14:52:15.217Z
Learning: For installation scripts in the CLP project, maintain consistency in command patterns across platforms rather than applying platform-specific optimizations. When a platform follows a pattern of separate update and install commands (like `apt-get update && apt-get install` or `apk update && apk add`), preserve this pattern for uniform script structure.

Learnt from: jackluo923
PR: y-scope/clp#1054
File: components/core/tools/scripts/lib_install/musllinux_1_2/install-prebuilt-packages.sh:27-32
Timestamp: 2025-07-07T17:41:15.655Z
Learning: In CLP installation scripts, consistency across platform scripts is prioritized over defensive programming improvements. For example, when extracting Task binaries with tar in `install-prebuilt-packages.sh`, the extraction pattern should remain consistent with other platform scripts rather than adding defensive flags like `--strip-components=1` to handle potential tarball layout changes.