Feature/tensor support #673

Merged
mdbenito merged 77 commits into develop from feature/tensor-support
May 4, 2025

mdbenito
Collaborator

@mdbenito mdbenito commented Apr 25, 2025

Description

This PR adds support for tensor data to pydvl.valuation.dataset.Dataset through generics, an Array protocol and a collection of array wrapper functions in pydvl.utils.array.
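As a rough illustration of the idea (not the code in this PR), a structural Array type combined with a TypeVar lets a container keep whatever array type it was built with. The names `ArrayLike`, `ArrayT` and `ToyDataset` below are hypothetical and only serve the sketch.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any, Generic, Protocol, TypeVar, runtime_checkable

import numpy as np


@runtime_checkable
class ArrayLike(Protocol):
    """Hypothetical structural type: anything sized, indexable and with a shape."""

    @property
    def shape(self) -> tuple[int, ...]: ...

    def __len__(self) -> int: ...

    def __getitem__(self, key: Any) -> Any: ...


# Bound TypeVar: functions and classes using it return the type they received.
ArrayT = TypeVar("ArrayT", bound=ArrayLike)


@dataclass(frozen=True)
class ToyDataset(Generic[ArrayT]):
    """Toy stand-in for a dataset that keeps the array type it was given."""

    x: ArrayT
    y: ArrayT

    def __post_init__(self) -> None:
        if len(self.x) != len(self.y):
            raise ValueError("x and y must have the same number of rows")

    def __getitem__(self, idx: Any) -> "ToyDataset[ArrayT]":
        # Indexing is delegated to the backend, so the result has the input's type.
        return ToyDataset(self.x[idx], self.y[idx])


data = ToyDataset(np.arange(12.0).reshape(6, 2), np.array([0, 1, 0, 1, 0, 1]))
subset = data[[0, 2, 4]]
```

Because the protocol is structural, any object with `shape`, `__len__` and `__getitem__` satisfies it without inheriting from anything, which is what makes it possible to treat numpy arrays and tensors uniformly.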

‼️ Beware of the AI slop ‼️ This is an attempt at using Claude Code with serena to implement a complete feature. Despite careful design of tasks and subtasks, and continuous hand-holding of the dummy, the result had tons of bugs, some of them subtle, inconsistencies in the use of types, generics and the array utilities, and craploads of duplicate and inane tests which nevertheless left important cases out. I hope to have removed most of this, but some remains, mostly in array.py and the associated tests.

Changes

  • Dataset now supports instantiation with tensors or numpy arrays. The type is preserved.
  • Makes all valuation methods agnostic to the array type, with a few exceptions.
  • Fixes some issues with Dataset indexing.
  • Dataset can take memmapped numpy arrays, or memmap them if mmap=True, reducing the per-node memory cost.
  • Serialization correctly handles memory maps.
  • Updates the MSR notebook, which uses a torch model for the utility, to load the data as tensors.
  • Introduces a new protocol TorchSupervisedModel, which is implemented e.g. by skorch.NeuralNetClassifier, and used in the MSR notebook (not a new dependency).
  • Introduces a new SkorchSupervisedScorer to handle skorch models.
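For readers unfamiliar with memory maps, the following is a minimal numpy-only sketch of what a memmapped array is. It does not show how Dataset handles mmap=True internally, which this description does not detail; the file name and array sizes are made up for the example.

```python
import tempfile
from pathlib import Path

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 8))

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "x.npy"
    np.save(path, x)

    # mmap_mode="r" returns a read-only np.memmap backed by the file on disk,
    # so the data is paged in on demand instead of being loaded up front.
    x_mm = np.load(path, mmap_mode="r")
    is_memmap = isinstance(x_mm, np.memmap)

    # Fancy indexing copies only the selected rows into regular memory.
    rows = np.asarray(x_mm[[0, 10, 20]])
    matches = bool(np.array_equal(rows, x[[0, 10, 20]]))

    # Release the mapping before the temporary directory is removed.
    del x_mm
```

Several processes on the same node can map the same file, so the operating system can share the pages between them rather than each process holding a private copy.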

Checklist

  • Wrote Unit tests (if necessary)
  • Updated Documentation (if necessary)
  • Updated Changelog
  • If notebooks were added/changed, added boilerplate cells are tagged with "tags": ["hide"] or "tags": ["hide-input"]

mdbenito and others added 30 commits March 17, 2025 15:51
# Conflicts:
#	src/pydvl/utils/types.py
#	src/pydvl/valuation/scorers/supervised.py
# Conflicts:
#	src/pydvl/valuation/samplers/classwise.py
- Create array_ops.py with utilities for both numpy arrays and PyTorch tensors
- Implement type-preserving functions for array creation and manipulation
- Add proper type hints with Array protocol and TypeVar for type preservation
- Add utility functions for library-specific operations
- Import array_ops in utils/__init__.py

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
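A sketch of what type-preserving utilities of this kind can look like, assuming a simple dispatch on the input type. The function names `is_tensor`, `array_unique` and `array_concatenate` are illustrative and not necessarily those in array_ops.py; torch is treated as optional.

```python
from __future__ import annotations

from typing import Any, Sequence, TypeVar

import numpy as np

try:  # torch is optional for this sketch
    import torch
except ImportError:
    torch = None

ArrayT = TypeVar("ArrayT")


def is_tensor(a: Any) -> bool:
    """True only if torch is installed and `a` is a torch.Tensor."""
    return torch is not None and isinstance(a, torch.Tensor)


def array_unique(a: ArrayT) -> ArrayT:
    """Sorted unique values, returned in the same array type as the input."""
    if is_tensor(a):
        return torch.unique(a)  # torch.unique sorts by default
    return np.unique(np.asarray(a))


def array_concatenate(arrays: Sequence[ArrayT]) -> ArrayT:
    """Concatenate along the first axis without changing the array type."""
    if not arrays:
        raise ValueError("need at least one array")
    if all(is_tensor(a) for a in arrays):
        return torch.cat(list(arrays))
    if any(is_tensor(a) for a in arrays):
        raise TypeError("cannot mix tensors and numpy arrays")
    return np.concatenate([np.asarray(a) for a in arrays])
```

Rejecting mixed inputs, rather than silently converting them, keeps the returned type predictable for the caller.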
- Add stratified_split_indices utility in array_ops.py to handle both numpy arrays and tensors
- Update RawData.__post_init__ with improved type checking
- Update Dataset.from_arrays to support tensors through type-agnostic operations
- Add type hints and update docstrings for tensor support

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
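The commit above names a stratified_split_indices utility but does not show it. A minimal numpy sketch of a stratified index split follows; the signature (`labels`, `train_size`, `seed`) is an assumption made for illustration and may differ from the function in the PR.

```python
import numpy as np


def stratified_split_indices(labels, train_size=0.8, seed=None):
    """Split indices into train/test so each class keeps roughly `train_size`.

    Hypothetical signature. Labels are converted with np.asarray and numpy
    index arrays are returned.
    """
    y = np.asarray(labels)
    rng = np.random.default_rng(seed)
    train, test = [], []
    for cls in np.unique(y):
        idx = np.flatnonzero(y == cls)  # positions of this class
        rng.shuffle(idx)  # shuffle in place
        n_train = int(round(train_size * len(idx)))
        train.append(idx[:n_train])
        test.append(idx[n_train:])
    return np.sort(np.concatenate(train)), np.sort(np.concatenate(test))


y = np.array([0] * 10 + [1] * 5)
train_idx, test_idx = stratified_split_indices(y, train_size=0.8, seed=0)
```

Splitting per class and then concatenating guarantees that small classes are represented on both sides of the split, which a plain random permutation does not.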
- Update GroupedDataset to handle PyTorch tensors
- Implement type-agnostic data_to_group and group_to_data mappings
- Maintain tensor type in data_indices and logical_indices methods
- Add comprehensive tests for tensor operations in GroupedDataset

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
…sor support

Extended test coverage to validate tensor support in Dataset and GroupedDataset classes:
- Added tests for mixed input types and error handling
- Added tests for edge cases like empty groups
- Added tests for single vs multi-dimensional tensors
- Added test for complex sequences of operations to verify type preservation
- Verified factory methods maintain type consistency

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Updated Sample class to support PyTorch tensors
- Modified IndexSampler to be tensor-agnostic
- Added tests for tensor support in samplers
- Updated hash and equality methods to work with both array types
- Replaced numpy-specific operations with array_ops equivalents

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Fixed type errors in array_ops.py by adding proper type annotations and casts.
- Added overloads for functions to maintain type precision
- Fixed return type annotations for tensor operations
- Added proper casting to ensure type safety
- Fixed tensor-specific operations like .to() and .long()
- Ensured consistent return types match input types

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
…cumentation

- Add comprehensive tests for tensor indices handling in samplers
- Verify Sample.subset is always a numpy array
- Test ClasswiseSample with tensor inputs for both subset and ooc_subset
- Add error handling tests for invalid input types
- Add documentation note about converting tensor indices to numpy arrays
- Tests verify proper conversion and appropriate handling of tensor inputs

This completes step 5.3 of the tensor support implementation plan.
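The behaviour described in the last two commit messages (indices are normalised to numpy arrays, and hashing and equality work regardless of the input type) could be sketched as below. `ToySample` and `to_numpy_indices` are hypothetical stand-ins, not the pyDVL classes; tensors are detected by duck typing so the sketch runs without torch.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any

import numpy as np


def to_numpy_indices(indices: Any) -> np.ndarray:
    """Convert a list, numpy array or tensor-like object to an int64 array."""
    # torch tensors expose detach() and cpu(); duck typing avoids importing torch.
    if hasattr(indices, "detach") and hasattr(indices, "cpu"):
        indices = indices.detach().cpu().numpy()
    return np.asarray(indices, dtype=np.int64)


@dataclass(frozen=True, eq=False)  # eq=False: keep the methods defined below
class ToySample:
    idx: int | None
    subset: Any

    def __post_init__(self) -> None:
        # The dataclass is frozen, so bypass __setattr__ to normalise the field.
        object.__setattr__(self, "subset", to_numpy_indices(self.subset))

    def __eq__(self, other: object) -> bool:
        if not isinstance(other, ToySample):
            return NotImplemented
        return self.idx == other.idx and bool(
            np.array_equal(self.subset, other.subset)
        )

    def __hash__(self) -> int:
        # Arrays are unhashable, so hash their raw bytes instead.
        return hash((self.idx, self.subset.tobytes()))


a = ToySample(0, [3, 1, 2])
b = ToySample(0, np.array([3, 1, 2]))
```

Normalising once at construction means every consumer of the sample can rely on a single index type, and equal samples hash equally whether they were built from lists, arrays or tensors.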
@mdbenito mdbenito marked this pull request as ready for review May 4, 2025 15:02
@mdbenito mdbenito requested a review from Copilot May 4, 2025 15:05
@Copilot Copilot AI left a comment

Pull Request Overview

This PR introduces comprehensive support for tensor data in datasets and valuation methods, enhances PyTorch integration, and updates documentation accordingly.

  • Tensor support has been added to Dataset and valuation methods, preserving input types.
  • New array utilities and skorch dependencies are integrated, with several outdated dataset utilities removed or relocated.
  • Updated documentation and notebooks to reflect the new tensor and PyTorch features.

Reviewed Changes

Copilot reviewed 60 out of 60 changed files in this pull request and generated 1 comment.

| File | Description |
|------|-------------|
| src/pydvl/utils/__init__.py | Exposes array utilities by importing the updated array module. |
| src/pydvl/reporting/plots.py | Fixes a broken reference in the ValuationResult link. |
| requirements-notebooks.txt | Adds a new skorch dependency required for PyTorch model support. |
| notebooks/support/shapley.py | Removes outdated dataset utility functions, narrowing file scope. |
| notebooks/support/influence.py | Removes unused functions and redundant imports. |
| notebooks/support/common.py | Updates type annotations and cleans up unused symbols. |
| notebooks/support/banzhaf.py | Overhauls the Torch classifier model and training loops for PyTorch. |
| Various notebooks | Update import paths to use the new datasets module. |
| mkdocs.yml, docs/*, CHANGELOG.md | Updates documentation and changelog to capture tensor support. |

@mdbenito mdbenito self-assigned this May 4, 2025
@mdbenito mdbenito merged commit 452790c into develop May 4, 2025
@mdbenito mdbenito deleted the feature/tensor-support branch May 4, 2025 15:11