Tabular datamodule (Custom dataset from DataFrame, CSV, or Parquet) #2713
Conversation
Signed-off-by: Manuel Konrad <84141230+manuelkonrad@users.noreply.github.com>
Force-pushed b90fb9f to d5c34ff
Thanks for creating this updated PR, and for your patience. We've been a bit sidetracked by some other tasks. @rajeshgangireddy and @ashwinvaidya17 can you prioritise this PR review please?
Hi @manuelkonrad, if it's not too much, could you also add a simple example notebook under examples/notebooks/100_datamodules?
Minor comments.
```python
            Defaults to ``32``.
        num_workers (int): Number of workers for data loading.
            Defaults to ``8``.
        train_augmentations (Transform | None): Augmentations to apply dto the training images
```
```diff
- train_augmentations (Transform | None): Augmentations to apply dto the training images
+ train_augmentations (Transform | None): Augmentations to apply to the training images
```
```python
            Tabular: Tabular Datamodule
        """
        pd_kwargs = pd_kwargs or {}
        samples = getattr(pd, f"read_{file_format}")(file_path, **pd_kwargs)
```
Can you add some basic sanity checks before and after loading the files? As an example:

```python
read_func = getattr(pd, f"read_{file_format}", None)
if read_func is None:
    raise ValueError(f"Unsupported file format: '{file_format}'")
```

Please also check that the file exists and that `samples` has the required columns.
I added the sanity checks. Further checks for the columns are in the `make_tabular_dataset` function and in the setter method of the parent class `AnomalibDataset`. Furthermore, I added inference of the file format from the file suffix as the default behavior.
```python
        # Add root to paths
        samples["mask_path"] = samples["mask_path"].fillna("")
        if root:
            samples["image_path"] = samples["image_path"].map(lambda x: Path(root, x))
```
Check and warn if an image path doesn't exist.
This check is already done in the setter method of `AnomalibDataset`.
```python
        ###########################

        # Run match-case twice to add missing columns iteratively
        for _ in range(2):
```
Please add a comment on how this check works and why it's run twice.
I refactored this pattern matching block into more readable control flow blocks. I hope it's a bit clearer now. The main idea is that if the user does not provide all of the columns `label_index`, `label`, and `split`, they are inferred from the given columns.
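To make the inference idea concrete, a sketch of such column completion might look like the following (the function name and the `"normal"`/`"train"` string conventions are illustrative assumptions; anomalib uses `LabelName` and `Split` enums):

```python
import pandas as pd

# Illustrative stand-ins for anomalib's LabelName.NORMAL / LabelName.ABNORMAL
NORMAL, ABNORMAL = 0, 1


def infer_columns(samples: pd.DataFrame) -> pd.DataFrame:
    """Fill in any of label_index / label / split that can be derived from the others."""
    samples = samples.copy()
    if "label_index" not in samples and "label" in samples:
        # Any label other than "normal" is treated as anomalous
        samples["label_index"] = (samples["label"] != "normal").astype(int)
    if "label" not in samples and "label_index" in samples:
        samples["label"] = samples["label_index"].map({NORMAL: "normal", ABNORMAL: "abnormal"})
    if "split" not in samples and "label_index" in samples:
        # Normal samples default to the training split, anomalous ones to test
        samples["split"] = samples["label_index"].map({NORMAL: "train", ABNORMAL: "test"})
    return samples
```

Each branch only fires when its target column is missing, so user-supplied columns are never overwritten.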
```python
        samples = samples.astype({"image_path": "str", "mask_path": "str", "label": "str"})

        # Check if anomalous samples are in training set
        if len(samples[(samples.label_index == LabelName.ABNORMAL) & (samples.split == Split.TRAIN)]) != 0:
```
Why not use `any()`?
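The point of the suggestion: evaluating the boolean mask with `.any()` avoids materializing the filtered DataFrame just to take its length. A sketch with illustrative stand-ins for the `LabelName` and `Split` enum members:

```python
import pandas as pd

# Illustrative stand-ins for LabelName.ABNORMAL and Split.TRAIN
ABNORMAL, TRAIN = 1, "train"

samples = pd.DataFrame(
    {"label_index": [0, 1, 0], "split": ["train", "test", "train"]}
)

# Instead of len(samples[mask]) != 0, evaluate the mask directly:
has_abnormal_train = ((samples.label_index == ABNORMAL) & (samples.split == TRAIN)).any()
```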
tests/conftest.py (outdated)

```diff
@@ -58,7 +58,7 @@ def dataset_path(project_path: Path) -> Path:
     for data_format in list(ImageDataFormat):
         # Do not generate a dummy dataset for folder datasets.
```
Update the comment as well.
Hi @rajeshgangireddy, thanks a lot for your helpful review! I implemented the proposed changes and also added an example notebook.
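A minimal cell from such an example notebook might look like this (a sketch only: the class name `Tabular` comes from this PR, but the constructor arguments are assumptions, so the instantiation is shown as comments):

```python
import pandas as pd

# Build a samples table; column names follow the PR discussion
# (image_path and label, with optional label_index / split / mask_path).
samples = pd.DataFrame(
    {
        "image_path": ["images/000.png", "images/001.png"],
        "label": ["normal", "abnormal"],
    }
)

# Hypothetical instantiation (signature assumed, not verified against the PR):
# from anomalib.data import Tabular
# datamodule = Tabular(name="my_dataset", samples=samples)
# ...or loading from a tabular file instead:
# datamodule = Tabular.from_file(name="my_dataset", file_path="samples.csv")
```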
📝 Description
Hi, I submitted a similar PR about six months ago (PR #2403). It has not been reviewed yet, and in the meantime, Anomalib has gone from v1 to v2. Therefore, I decided to refactor the feature according to the new structure and submit it as a new PR.
This PR adds a `Tabular` datamodule which is instantiated directly from a pandas `DataFrame`. It is an alternative to the `Folder` datamodule for custom datasets where the labels are not encoded in the directory structure. It is useful for situations where labels are refined regularly or for sub-sampling large datasets without copying or moving files. It also provides a `from_file` constructor which loads the data from a tabular file format supported by pandas.

✨ Changes
Select what type of change your PR is:
✅ Checklist
Before you submit your pull request, please make sure you have completed the following steps:
For more information about code review checklists, see the Code Review Checklist.