Tabular datamodule (Custom dataset from DataFrame, CSV, or Parquet) #2713

Merged

Conversation

manuelkonrad
Contributor

@manuelkonrad manuelkonrad commented May 18, 2025

📝 Description

Hi, I submitted a similar PR about six months ago (PR #2403). It has not been reviewed yet, and in the meantime, Anomalib has gone from v1 to v2. Therefore, I decided to refactor the feature according to the new structure and submit it as a new PR.

  • This PR adds the Tabular datamodule, which is instantiated directly from a pandas DataFrame. It is an alternative to the Folder datamodule for custom datasets where the labels are not encoded in the directory structure. This is useful when labels are refined regularly, or when sub-sampling large datasets without copying or moving files.
  • The datamodule also includes a from_file constructor which loads the data from a tabular file supported by pandas.
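
As a rough illustration of the idea described above, the following sketch uses only pandas and does not call Anomalib itself. It shows how labels kept in a table column can be refined or sub-sampled without touching files on disk, and how such a table round-trips through a CSV file. The column names (`image_path`, `label`, `split`) and the file paths are assumptions made for this example; they are not taken from this PR thread, so check the datamodule's docstring for the columns it actually expects.

```python
import io

import pandas as pd

# Hypothetical sample table: one row per image, with the label stored in a
# column instead of being encoded in the directory layout.
# Column names and paths are illustrative assumptions.
samples = pd.DataFrame(
    {
        "image_path": [
            "images/0001.png",
            "images/0002.png",
            "images/0003.png",
            "images/0004.png",
        ],
        "label": ["normal", "normal", "abnormal", "normal"],
        "split": ["train", "train", "test", "test"],
    }
)

# Refining a label is a cell update; no file is moved or copied.
samples.loc[samples["image_path"] == "images/0004.png", "label"] = "abnormal"

# Sub-sampling is a row selection over the same files on disk.
train_subset = samples[samples["split"] == "train"].sample(n=1, random_state=0)

# Round-trip through CSV, one of the tabular formats pandas can read back.
buffer = io.StringIO()
samples.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)
```

A table like `samples` (or a file written from it) is the kind of input the description above refers to; how it is passed to the datamodule is defined by the PR's code, not by this sketch.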

✨ Changes

Select what type of change your PR is:

  • 🐞 Bug fix (non-breaking change which fixes an issue)
  • 🔨 Refactor (non-breaking change which refactors the code base)
  • 🚀 New feature (non-breaking change which adds functionality)
  • 💥 Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • 📚 Documentation update
  • 🔒 Security update

✅ Checklist

Before you submit your pull request, please make sure you have completed the following steps:

  • 📋 I have summarized my changes in the CHANGELOG and followed the guidelines for my type of change (skip for minor changes, documentation updates, and test enhancements).
  • 📚 I have made the necessary updates to the documentation (if applicable).
  • 🧪 I have written tests that support my changes and prove that my fix is effective or my feature works (if applicable).

For more information about code review checklists, see the Code Review Checklist.

Signed-off-by: Manuel Konrad <84141230+manuelkonrad@users.noreply.github.com>
@manuelkonrad manuelkonrad force-pushed the feature/tabular-datamodule branch from b90fb9f to d5c34ff Compare May 18, 2025 19:03
@manuelkonrad manuelkonrad mentioned this pull request May 18, 2025
@samet-akcay
Contributor

Thanks for creating this updated PR, and for your patience. We've been a bit sidetracked by some other tasks.

@rajeshgangireddy and @ashwinvaidya17 can you prioritise this PR review please?

@rajeshgangireddy rajeshgangireddy self-requested a review May 19, 2025 07:30
@rajeshgangireddy
Contributor

rajeshgangireddy commented May 19, 2025

Hi @manuelkonrad
Great PR. Thank you.
I have yet to try it out, but so far it LGTM.

If it's not too much, could you also add a simple example notebook under examples/notebooks/100_datamodules?

Contributor

@rajeshgangireddy rajeshgangireddy left a comment

Minor comments.

manuelkonrad and others added 4 commits May 20, 2025 20:06
Signed-off-by: Manuel Konrad <84141230+manuelkonrad@users.noreply.github.com>
Signed-off-by: Manuel Konrad <84141230+manuelkonrad@users.noreply.github.com>
@manuelkonrad
Contributor Author

Hi @rajeshgangireddy, thanks a lot for your helpful review!

I implemented the proposed changes and also added an example notebook.

@rajeshgangireddy
Contributor

Hi @manuelkonrad,
It looks like our semgrep check is failing (unrelated to this PR).
I will first need to fix that and then merge your PR.

Contributor

@samet-akcay samet-akcay left a comment

Great PR! Thanks a lot for your efforts, @manuelkonrad!

I have just one request: please add an example to the docstring that shows the from_file() method. Otherwise, looking great!

samet-akcay and others added 2 commits June 7, 2025 21:07
Signed-off-by: Manuel Konrad <84141230+manuelkonrad@users.noreply.github.com>
@manuelkonrad
Contributor Author

Thanks for the feedback, @samet-akcay! I added the example.

@samet-akcay samet-akcay merged commit c8ab62d into open-edge-platform:main Jun 9, 2025
10 checks passed

3 participants