Skip to content

Conversation

strixy16
Copy link
Collaborator

@strixy16 strixy16 commented Dec 18, 2024

Created CorrelationWriter class for the analyze portion of the pipeline.

I think there will eventually be a FeatureSetWriter that this should probably inherit from, but I need the Correlation one now for Aerts.

Summary by CodeRabbit

  • New Features

    • Introduced a CorrelationWriter class for managing the writing of correlation data to files with customizable paths and filenames.
    • Added support for saving correlation data in both CSV and Excel formats.
  • Bug Fixes

    • Implemented error handling for invalid correlation data, existing files, and filename format validation.
  • Tests

    • Added a comprehensive suite of unit tests for the CorrelationWriter class, covering various scenarios for saving correlation data.

@strixy16 strixy16 requested a review from jjjermiah December 18, 2024 17:28
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Dec 18, 2024

Important

Review skipped

Review was skipped due to path filters

⛔ Files ignored due to path filters (1)
  • pixi.lock is excluded by !**/*.lock

CodeRabbit blocks several paths by default. You can override this behavior by explicitly including those paths in the path filters. For example, including **/dist/** will override the default block on the dist directory, by removing the pattern from both the lists.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

The pull request introduces a new CorrelationWriter class in the correlation_writer.py file, designed to handle writing correlation data to files. This writer extends the BaseWriter class and provides robust functionality for saving correlation DataFrames with custom file paths, filename formats, and overwrite options. Accompanying test cases in test_correlation_writer.py validate the writer's behavior across various scenarios, including file saving, validation, and error handling.

Changes

File Change Summary
src/readii/io/writers/correlation_writer.py - Added CorrelationWriterError, CorrelationWriterIOError, and CorrelationWriterValidationError exception classes
- Implemented CorrelationWriter data class with save method
- Supports saving correlation DataFrames to CSV and Excel formats
tests/io/writers/test_correlation_writer.py - Added comprehensive test suite for CorrelationWriter
- Created fixtures for generating correlation data and writer instances
- Implemented tests for validation, file saving, and error handling

Sequence Diagram

sequenceDiagram
    participant User
    participant CorrelationWriter
    participant DataFrame
    participant FileSystem

    User->>CorrelationWriter: save(correlation_df)
    CorrelationWriter->>CorrelationWriter: validate input
    alt Invalid DataFrame
        CorrelationWriter-->>User: CorrelationWriterValidationError
    else Valid DataFrame
        CorrelationWriter->>FileSystem: check file existence
        alt File exists and overwrite=False
            CorrelationWriter-->>User: CorrelationWriterIOError
        else File can be saved
            CorrelationWriter->>FileSystem: save DataFrame
            FileSystem-->>CorrelationWriter: file saved
            CorrelationWriter-->>User: return file path
        end
    end
Loading

Poem

🐰 A Correlation Tale of Data's Grace 🐰

With writers sharp and logic neat,
Our data finds its saving seat.
CSV or Excel, it matters not,
Each correlation perfectly caught!

Hop, hop, hooray for clean file streams! 📊


Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR. (Beta)
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (4)
src/readii/io/writers/correlation_writer.py (2)

53-74: Clarify the docstring to explicitly reflect support for both CSV and Excel formats.
The docstring for the save method states “Save the correlation dataframe to a .csv file.” Even though the code also supports “.xlsx,” the docstring might cause confusion.

Consider updating the docstring to clearly indicate possible file extensions:

- """Save the correlation dataframe to a .csv file.
+ """Save the correlation dataframe to either .csv or .xlsx.

105-113: Consider replacing match-case if Python 3.10 support is uncertain.
The match-case syntax requires Python 3.10+. If you anticipate older environments, an if-elif chain might be more compatible. Otherwise, this approach is fine.

- match out_path.suffix:
-     case ".csv":
-         correlation_df.to_csv(out_path, index=True, index_label="")
-     case ".xlsx":
-         correlation_df.to_excel(out_path, index=True, index_label="")
-     case _:
-         ...
+ suffix = out_path.suffix
+ if suffix == ".csv":
+     correlation_df.to_csv(out_path, index=True, index_label="")
+ elif suffix == ".xlsx":
+     correlation_df.to_excel(out_path, index=True, index_label="")
+ else:
+     ...
tests/io/writers/test_correlation_writer.py (2)

5-5: Remove the unused import 'pathlib.Path'.
It’s imported but never referenced, generating a linter warning (F401).

- from pathlib import Path
🧰 Tools
🪛 Ruff (0.8.2)

5-5: pathlib.Path imported but unused

Remove unused import: pathlib.Path

(F401)


7-7: Remove unused import CorrelationWriterError.
This import is not referenced in the test suite, generating a linter warning (F401).

- from readii.io.writers.correlation_writer import CorrelationWriter, CorrelationWriterValidationError, CorrelationWriterError, CorrelationWriterIOError
+ from readii.io.writers.correlation_writer import CorrelationWriter, CorrelationWriterValidationError, CorrelationWriterIOError
🧰 Tools
🪛 Ruff (0.8.2)

7-7: readii.io.writers.correlation_writer.CorrelationWriterError imported but unused

Remove unused import: readii.io.writers.correlation_writer.CorrelationWriterError

(F401)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c27d40f and 480ef5b.

📒 Files selected for processing (2)
  • src/readii/io/writers/correlation_writer.py (1 hunks)
  • tests/io/writers/test_correlation_writer.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
tests/io/writers/test_correlation_writer.py

5-5: pathlib.Path imported but unused

Remove unused import: pathlib.Path

(F401)


7-7: readii.io.writers.correlation_writer.CorrelationWriterError imported but unused

Remove unused import: readii.io.writers.correlation_writer.CorrelationWriterError

(F401)

🔇 Additional comments (1)
src/readii/io/writers/correlation_writer.py (1)

99-101: Ensure correlation DataFrame columns and index checks align with domain requirements.
Currently, the method enforces that the columns and index must be identical, which often implies a square and symmetric correlation matrix. If there's a scenario where a user might want rectangular or partial correlation data, this strict check may be too restrictive.

Would you like to confirm that this enforcement is absolutely required in your domain, or should we make this validation more flexible?

@jjjermiah jjjermiah force-pushed the katys/add-correlation-writer branch from be63b3d to 7b64e94 Compare December 18, 2024 18:04
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (8)
src/readii/io/writers/correlation_writer.py (4)

11-27: Enhance exception docstrings with more details

While the exception hierarchy is well-structured, the docstrings could be more descriptive by including:

  • Specific scenarios when each exception is raised
  • Example error messages
  • How to handle or prevent these exceptions

30-30: Update class docstring to match its purpose

The current docstring mentions "plot figure files" but this class handles correlation data files.

Suggested docstring:

"""Class for managing file writing with customizable paths and filenames for correlation data files."""

105-112: Simplify file format handling

Since we validate file extensions in __post_init__, the match statement's default case is unreachable. Consider simplifying:

-            match out_path.suffix:
-                case ".csv":
-                    correlation_df.to_csv(out_path, index=True, index_label="")
-                case ".xlsx":
-                    correlation_df.to_excel(out_path, index=True, index_label="")
-                case _:
-                    msg = f"Invalid file extension {out_path.suffix}. Must be one of {self.VALID_EXTENSIONS}."
-                    raise CorrelationWriterValidationError(msg)
+            if out_path.suffix == ".csv":
+                correlation_df.to_csv(out_path, index=True, index_label="")
+            else:  # must be .xlsx due to __post_init__ validation
+                correlation_df.to_excel(out_path, index=True, index_label="")

121-131: Enhance example usage with actual correlation data

Consider adding a complete working example that demonstrates creating and saving correlation data:

if __name__ == "__main__":  # pragma: no cover
    import numpy as np
    import pandas as pd
    
    # Create sample correlation data
    data = pd.DataFrame(np.random.rand(5, 5), columns=[f'feature_{i}' for i in range(5)])
    corr_df = data.corr()
    
    writer = CorrelationWriter(
        root_directory=Path("examples"),
        filename_format="{DatasetName}_correlations.csv",
        overwrite=True,
        create_dirs=True
    )
    
    # Save correlation matrix
    writer.save(corr_df, DatasetName="example")
tests/io/writers/test_correlation_writer.py (4)

7-7: Remove unused import

The CorrelationWriterError exception is imported but never used in the tests.

-from readii.io.writers.correlation_writer import CorrelationWriter, CorrelationWriterValidationError, CorrelationWriterError, CorrelationWriterIOError
+from readii.io.writers.correlation_writer import CorrelationWriter, CorrelationWriterValidationError, CorrelationWriterIOError
🧰 Tools
🪛 Ruff (0.8.2)

7-7: readii.io.writers.correlation_writer.CorrelationWriterError imported but unused

Remove unused import: readii.io.writers.correlation_writer.CorrelationWriterError

(F401)


9-17: Consider adding more test cases to random_feature_correlations fixture

The fixture could be parameterized to provide different types of correlation matrices:

@pytest.fixture(params=[10, 100])  # Test with different matrix sizes
def random_feature_correlations(request):
    size = request.param
    random_matrix = np.random.default_rng(seed=10).random((size, size))
    random_df = pd.DataFrame(random_matrix, columns=[f"feature_{i+1}" for i in range(size)])
    return getFeatureCorrelations(random_df, random_df)

35-41: Enhance validation in test_save_valid_correlation

Consider adding more assertions to verify the saved correlation data:

 def test_save_valid_correlation(corr_writer, request, correlation_df):
     """Test saving a valid correlation dataframe."""
     correlation_df = request.getfixturevalue(correlation_df)
     out_path = corr_writer.save(correlation_df, CorrelationType="Pearson")
     assert out_path.exists()
+    # Verify saved content
+    saved_df = pd.read_csv(out_path, index_col=0)
+    pd.testing.assert_frame_equal(saved_df, correlation_df)

54-59: Add tests for additional edge cases

Consider adding tests for:

  1. Saving correlation matrices with non-numeric data
  2. Handling NaN values in correlation data
  3. Very large correlation matrices
  4. Invalid file paths

Example test case:

@pytest.mark.parametrize("invalid_data", [
    pd.DataFrame({'A': ['a', 'b'], 'B': [1, 2]}),  # non-numeric
    pd.DataFrame({'A': [1, np.nan], 'B': [np.nan, 2]}),  # NaN values
])
def test_save_correlation_with_invalid_data(corr_writer, invalid_data):
    """Test saving correlation data with invalid values."""
    with pytest.raises(CorrelationWriterValidationError):
        corr_writer.save(invalid_data, CorrelationType="Pearson")
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 480ef5b and 7b64e94.

⛔ Files ignored due to path filters (1)
  • pixi.lock is excluded by !**/*.lock
📒 Files selected for processing (2)
  • src/readii/io/writers/correlation_writer.py (1 hunks)
  • tests/io/writers/test_correlation_writer.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
tests/io/writers/test_correlation_writer.py

5-5: pathlib.Path imported but unused

Remove unused import: pathlib.Path

(F401)


7-7: readii.io.writers.correlation_writer.CorrelationWriterError imported but unused

Remove unused import: readii.io.writers.correlation_writer.CorrelationWriterError

(F401)

🔇 Additional comments (1)
src/readii/io/writers/correlation_writer.py (1)

45-52: LGTM!

The post-initialization validation is well-implemented with proper error handling.

Comment on lines +40 to +43
VALID_EXTENSIONS: ClassVar[list[str]] = (
".csv",
".xlsx"
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Fix type annotation for VALID_EXTENSIONS

The type hint list[str] doesn't match the tuple literal. Either change the type hint to tuple[str, ...] or convert the tuple to a list.

-    VALID_EXTENSIONS: ClassVar[list[str]] = (
+    VALID_EXTENSIONS: ClassVar[tuple[str, ...]] = (
         ".csv",
         ".xlsx"
     )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
VALID_EXTENSIONS: ClassVar[list[str]] = (
".csv",
".xlsx"
)
VALID_EXTENSIONS: ClassVar[tuple[str, ...]] = (
".csv",
".xlsx"
)

@codecov
Copy link

codecov bot commented Dec 18, 2024

Codecov Report

Attention: Patch coverage is 77.35849% with 12 lines in your changes missing coverage. Please review.

Project coverage is 41.66%. Comparing base (7efc97b) to head (5143c6d).
Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
src/readii/io/writers/correlation_writer.py 77.35% 12 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #96      +/-   ##
==========================================
+ Coverage   40.31%   41.66%   +1.35%     
==========================================
  Files          32       33       +1     
  Lines        1399     1452      +53     
==========================================
+ Hits          564      605      +41     
- Misses        835      847      +12     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

…on-writer' into katys/add-correlation-writer
@strixy16 strixy16 merged commit b241c42 into main Dec 18, 2024
18 checks passed
@strixy16 strixy16 deleted the katys/add-correlation-writer branch December 18, 2024 18:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

2 participants