Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor data modules #558

Closed
wants to merge 16 commits into from
Closed

Refactor data modules #558

wants to merge 16 commits into from

Conversation

djdameln
Copy link
Contributor

Description

  • This PR refactors the data modules to reduce duplication.

  • The AnomalibDataset class was introduced, which replaces the individual dataset classes.

  • The base class AnomalibDataModule was introduced, and the setup method was moved from the concrete subclasses to the new base class.

  • The make_mvtec_dataset, make_btech_dataset and make_dataset functions are replaced by the _create_samples method. The implementation of this method differs between the individual data module classes.

  • The refactor also makes it possible to train when no anomalous images are present in the dataset. In this case, Anomalib will not run the test stage after training, so no performance metrics are reported. The user will be notified about this through a logging message.

  • Fixes Support training with only normal images (no evaluation) #277

Changes

  • Bug fix (non-breaking change which fixes an issue)
  • Refactor (non-breaking change which refactors the code base)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

Checklist

  • My code follows the pre-commit style and check guidelines of this project.
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing tests pass locally with my changes

Copy link
Collaborator

@ashwinvaidya17 ashwinvaidya17 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks the code is much cleaner now. I have two very minor comments.

anomalib/data/base.py Outdated Show resolved Hide resolved
anomalib/data/base.py Show resolved Hide resolved
Copy link
Collaborator

@ashwinvaidya17 ashwinvaidya17 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks

Copy link
Contributor

@samet-akcay samet-akcay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Much better! Thanks!

I only have a single comment regarding the pre-processing operation.

anomalib/data/base.py Outdated Show resolved Hide resolved
@jpcbertoldo jpcbertoldo mentioned this pull request Sep 15, 2022
11 tasks
Copy link
Contributor

@jpcbertoldo jpcbertoldo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice refactor :)
i just wanted to give my 10 cents with two minor improvement suggestions (althogh the one about setup may not be that minor)

anomalib/data/base.py Outdated Show resolved Hide resolved
anomalib/data/base.py Outdated Show resolved Hide resolved
Copy link
Contributor

@samet-akcay samet-akcay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, a lot more cleaner

@samet-akcay samet-akcay requested review from samet-akcay and jpcbertoldo and removed request for jpcbertoldo September 16, 2022 17:15
Copy link
Contributor

@jpcbertoldo jpcbertoldo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AnomalibDataset.__getitem__ is better without those conditions on self.split IMO, tnx : )


i noticed a few minor things (i marked "[minor]" in my comments) but i have a bigger concern about the AnomalibDataModule._create_samples -- feel free to disconsider if you want to move forward with this sooner

a conceptual question: is an AnomalibDataset assumed to be split-specific?

if yes (IMO it should be), then the self._samples.split == split should be encapsulated somewhere in the AnomalibDataset (I'd put ._create_samples as an abstract method inside AnomalibDataset) and the dataframes should be split-specific (i.e. no need for a split column)

AnomalibDataModule.get_samples(split) would be just a multiplexer:

if split == "train":
    return self.train_data.samples  # which btw could be a `@property` hidding `_samples` 

another reason to do this: the AnomalibDataset would be more standalone (in the current form, one needs to instantiate the AnomalibDataModule in order to construct the former properly, or at least without hacking the latter)

image = read_image(image_path)
label_index = self.samples.iloc[index].label_index

item = dict(image_path=image_path, label=label_index)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[minor] inconsistent naming of concepts

within the dataset, label seems refer to a str since we see a label_index (one can guess it's a an int)

outside the dataset, label is an int (so one would expect label_str? or something alike)

if label_index == 0:
mask = np.zeros(shape=image.shape[:2])
else:
mask = cv2.imread(mask_path, flags=0) / 255.0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[minor] flags=cv2.IMREAD_GRAYSCALE would be more self-explanatory
or maybe a read_mask encapsulating the /255.0 as well (an analogous to read_image)

class AnomalibDataModule(LightningDataModule, ABC):
"""Base Anomalib data module."""

def __init__(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[minor] val and test are assumed to be each other in a confusing way

(1) X_batch_size with X \in {train, test}, and test_batch_size is used both for test and val DataLoader

(2) transform_config_Y with Y in {train, val}, and transform_config_val is used both for test and val AnomalibDataset

i.e.

batch size: config is available for test and assumed for test and val
transform: config is available for val and assumed for test and val

self.num_workers = num_workers
self.create_validation_set = create_validation_set

if transform_config_train is not None and transform_config_val is None:
Copy link
Contributor

@jpcbertoldo jpcbertoldo Sep 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[minor] Is there a reason for assuming this?

It could make sense that transform_config_train could have "weak" data augmentations (e.g. tiny brightness changes) but that should not be repeated in the validation set.


if this is to be kept, a warning wouldn't be harmful :)


def setup(self, stage: Optional[str] = None) -> None:
"""Setup train, validation and test data.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
If `stage` is `None`, all splits are setup.

)

if stage in (None, "fit", "validate"):
samples = self.get_samples("val") if self.create_validation_set else self.get_samples("test")
Copy link
Contributor

@jpcbertoldo jpcbertoldo Sep 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[minor]

Suggested change
samples = self.get_samples("val") if self.create_validation_set else self.get_samples("test")
if `self.create_validation_set`:
samples = self.get_samples("val")
else:
warnings.warn("Validation split is not availabe. Test split will be used for validation.")
samples = self.get_samples("test")

if you decide to move _create_samples to AnomalibDataset then this won't be necessary here

bool: Boolean indicating if any anomalous images have been assigned to the dataset or subset.
"""
samples = self.get_samples(split)
return 1 in list(samples.label_index)
Copy link
Contributor

@jpcbertoldo jpcbertoldo Sep 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[minor]

Suggested change
return 1 in list(samples.label_index)
return LABEL_INDEX_ANOMALOUS in list(samples.label_index)

a bit picky but a constant would clearer

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case would using enums be better? We can then have something like Label.Normal and Label.Anomalous

@jpcbertoldo
Copy link
Contributor

i created a sketch of the things i suggested in #564
i only modified base.py to be clear about the things i said above
feel free to consider it or not but if you do then the necessary modifications should be done in the downstream classes (mvtec, folder, and btech)

Copy link
Collaborator

@ashwinvaidya17 ashwinvaidya17 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am fine with the changes

bool: Boolean indicating if any anomalous images have been assigned to the dataset or subset.
"""
samples = self.get_samples(split)
return 1 in list(samples.label_index)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case would using enums be better? We can then have something like Label.Normal and Label.Anomalous

@djdameln djdameln closed this Oct 7, 2022
@djdameln djdameln deleted the da/refactor-datamodules branch December 13, 2022 11:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Data Metrics Metric Component. Tools
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support training with only normal images (no evaluation)
4 participants