Tabular datamodule (Custom dataset from DataFrame, CSV, or Parquet) #2713


Open · wants to merge 5 commits into main

Conversation

@manuelkonrad commented May 18, 2025

📝 Description

Hi, I submitted a similar PR about six months ago (PR #2403). It has not been reviewed yet, and in the meantime, Anomalib has gone from v1 to v2. Therefore, I decided to refactor the feature according to the new structure and submit it as a new PR.

  • This PR adds the Tabular datamodule which is instantiated directly from a pandas DataFrame. It is an alternative to the Folder datamodule for custom datasets where the labels are not encoded in the directory structure. Useful for situations where labels are refined regularly or for sub-sampling large datasets without copying or moving files.
  • The datamodule also includes a from_file constructor which loads the data from a tabular file supported by pandas.
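A sketch of the intended usage follows. The column names (`image_path`, `label`, `split`) and the `Tabular` class are taken from the PR description; the exact constructor signature is an assumption and is therefore left commented out.

```python
import pandas as pd

# Labels live in a table rather than in the directory structure, so they
# can be refined or sub-sampled without copying or moving files on disk.
samples = pd.DataFrame(
    {
        "image_path": ["images/0001.png", "images/0002.png", "images/0003.png"],
        "label": ["normal", "normal", "abnormal"],
        "split": ["train", "test", "test"],
    }
)

print(sorted(samples.columns))  # ['image_path', 'label', 'split']

# Hypothetical instantiation (requires this feature branch; signature assumed):
# from anomalib.data import Tabular
# datamodule = Tabular(name="my_dataset", samples=samples)
# datamodule = Tabular.from_file(name="my_dataset", file_path="samples.csv")
```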

✨ Changes

Select what type of change your PR is:

  • 🐞 Bug fix (non-breaking change which fixes an issue)
  • 🔨 Refactor (non-breaking change which refactors the code base)
  • 🚀 New feature (non-breaking change which adds functionality)
  • 💥 Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • 📚 Documentation update
  • 🔒 Security update

✅ Checklist

Before you submit your pull request, please make sure you have completed the following steps:

  • 📋 I have summarized my changes in the CHANGELOG and followed the guidelines for my type of change (skip for minor changes, documentation updates, and test enhancements).
  • 📚 I have made the necessary updates to the documentation (if applicable).
  • 🧪 I have written tests that support my changes and prove that my fix is effective or my feature works (if applicable).

For more information about code review checklists, see the Code Review Checklist.

Signed-off-by: Manuel Konrad <84141230+manuelkonrad@users.noreply.github.com>
@manuelkonrad manuelkonrad force-pushed the feature/tabular-datamodule branch from b90fb9f to d5c34ff Compare May 18, 2025 19:03
@manuelkonrad mentioned this pull request May 18, 2025
@samet-akcay (Contributor)

Thanks for creating this updated PR, and for your patience. We've been a bit sidetracked by some other tasks.

@rajeshgangireddy and @ashwinvaidya17 can you prioritise this PR review please?

@rajeshgangireddy rajeshgangireddy self-requested a review May 19, 2025 07:30
@rajeshgangireddy (Contributor) commented May 19, 2025

Hi @manuelkonrad
Great PR. Thank you.
I am yet to try it out, but so far it LGTM.

If it's not too much trouble, could you also add a simple example notebook under examples/notebooks/100_datamodules?

@rajeshgangireddy (Contributor) left a comment

Minor comments.

Defaults to ``32``.
num_workers (int): Number of workers for data loading.
Defaults to ``8``.
train_augmentations (Transform | None): Augmentations to apply dto the training images

Suggested change:
- train_augmentations (Transform | None): Augmentations to apply dto the training images
+ train_augmentations (Transform | None): Augmentations to apply to the training images

Tabular: Tabular Datamodule
"""
pd_kwargs = pd_kwargs or {}
samples = getattr(pd, f"read_{file_format}")(file_path, **pd_kwargs)
@rajeshgangireddy (Contributor):

Can you add some basic sanity checks before and after loading the files?
As an example:

read_func = getattr(pd, f"read_{file_format}", None)
if read_func is None:
    raise ValueError(f"Unsupported file format: '{file_format}'")

Please also check whether the file exists and whether the samples have the required columns.

@manuelkonrad (Author) commented May 26, 2025

I added the sanity checks. Further checks for the columns are in the make_tabular_dataset function and in the setter method of the parent class AnomalibDataset. Furthermore, I added inference of file format from file suffix as default behavior.
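The behaviour described in this reply (file-existence check, format dispatch, and format inference from the file suffix) could look roughly like the sketch below. The function name `load_samples` is mine, not the PR's; only pandas is assumed.

```python
from pathlib import Path

import pandas as pd


def load_samples(file_path, file_format=None, **pd_kwargs):
    """Load a samples table via the matching pandas ``read_*`` function."""
    path = Path(file_path)
    if not path.exists():
        raise FileNotFoundError(f"File not found: {path}")
    # Infer the format from the suffix when not given, e.g. ".csv" -> "csv".
    file_format = file_format or path.suffix.lstrip(".").lower()
    read_func = getattr(pd, f"read_{file_format}", None)
    if read_func is None:
        raise ValueError(f"Unsupported file format: '{file_format}'")
    return read_func(path, **pd_kwargs)
```

Column checks would then follow after loading, as noted above, in `make_tabular_dataset` and the `AnomalibDataset` setter.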

# Add root to paths
samples["mask_path"] = samples["mask_path"].fillna("")
if root:
    samples["image_path"] = samples["image_path"].map(lambda x: Path(root, x))
@rajeshgangireddy (Contributor):

Check and warn if an image path doesn't exist.

@manuelkonrad (Author):

This check is already done in the setter method of AnomalibDataset.

###########################

# Run match-case twice to add missing columns iteratively
for _ in range(2):
@rajeshgangireddy (Contributor):

Please add a comment on how this check works and why it's run twice.

@manuelkonrad (Author):

I refactored this pattern matching block into more readable control flow blocks. I hope it's a bit clearer now. The main idea is that if the user does not provide all of the columns label_index, label and split, they are inferred from the given columns.
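A minimal sketch of that inference idea, assuming plain string labels and splits; the PR itself works with `LabelName` and `Split` enums, and its exact inference rules may differ from this illustration.

```python
import pandas as pd


def infer_columns(samples: pd.DataFrame) -> pd.DataFrame:
    """Fill in label_index and split when only `label` is provided."""
    # label -> label_index: anything other than "normal" counts as abnormal (1).
    if "label_index" not in samples.columns:
        samples["label_index"] = (samples["label"] != "normal").astype(int)
    # label_index -> split: normal samples train, abnormal samples test.
    if "split" not in samples.columns:
        samples["split"] = samples["label_index"].map({0: "train", 1: "test"})
    return samples
```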

samples = samples.astype({"image_path": "str", "mask_path": "str", "label": "str"})

# Check if anomalous samples are in training set
if len(samples[(samples.label_index == LabelName.ABNORMAL) & (samples.split == Split.TRAIN)]) != 0:
@rajeshgangireddy (Contributor):

Why not use any()?
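For illustration, the two forms are equivalent here; in this hypothetical frame, the literal `1` stands in for `LabelName.ABNORMAL` and `"train"` for `Split.TRAIN`. `.any()` avoids materialising the filtered frame just to take its length.

```python
import pandas as pd

samples = pd.DataFrame(
    {"label_index": [0, 0, 1], "split": ["train", "train", "test"]}
)

bad = (samples["label_index"] == 1) & (samples["split"] == "train")

# len(samples[bad]) != 0 builds a sub-frame; bad.any() just scans the mask.
print(bool(bad.any()))  # False: the only abnormal sample is in the test split
```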

@@ -58,7 +58,7 @@ def dataset_path(project_path: Path) -> Path:
for data_format in list(ImageDataFormat):
# Do not generate a dummy dataset for folder datasets.
@rajeshgangireddy (Contributor):

Update the comment as well.

manuelkonrad and others added 4 commits May 20, 2025 20:06
Signed-off-by: Manuel Konrad <84141230+manuelkonrad@users.noreply.github.com>
Signed-off-by: Manuel Konrad <84141230+manuelkonrad@users.noreply.github.com>
@manuelkonrad (Author) commented

Hi @rajeshgangireddy, thanks a lot for your helpful review!

I implemented the proposed changes and also added an example notebook.
