Refactor data modules #558
Conversation
Thanks, the code is much cleaner now. I have two very minor comments.
Thanks
Much better! Thanks!
I only have a single comment regarding the pre-processing operation.
nice refactor :)
i just wanted to give my 10 cents with two minor improvement suggestions (although the one about `setup` may not be that minor)
Thanks, a lot cleaner.
`AnomalibDataset.__getitem__` is better without those conditions on `self.split`, IMO. Thanks :)
i noticed a few minor things (i marked "[minor]" in my comments), but i have a bigger concern about `AnomalibDataModule._create_samples` -- feel free to disregard it if you want to move forward with this sooner.
a conceptual question: is an `AnomalibDataset` assumed to be split-specific?

if yes (IMO it should be), then the `self._samples.split == split` logic should be encapsulated somewhere in the `AnomalibDataset` (I'd put `_create_samples` as an abstract method inside `AnomalibDataset`), and the dataframes should be split-specific (i.e. no need for a `split` column). `AnomalibDataModule.get_samples(split)` would be just a multiplexer:

if split == "train":
    return self.train_data.samples  # which btw could be a `@property` hiding `_samples`

another reason to do this: the `AnomalibDataset` would be more standalone (in the current form, one needs to instantiate the `AnomalibDataModule` in order to construct the former properly, or at least without hacking the latter)
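The split-specific design suggested above could be sketched roughly like this (a minimal sketch: the class and method names follow the discussion, but the constructor signature and column names are assumptions):

```python
from abc import ABC, abstractmethod

from pandas import DataFrame


class AnomalibDataset(ABC):
    """Sketch of a split-specific dataset that builds its own samples."""

    def __init__(self, split: str) -> None:
        self.split = split
        # Each dataset instance holds only its own split, so the
        # samples dataframe needs no 'split' column.
        self._samples: DataFrame = self._create_samples()

    @abstractmethod
    def _create_samples(self) -> DataFrame:
        """Build the samples dataframe for this dataset's split."""

    @property
    def samples(self) -> DataFrame:
        """Read-only access to the underlying '_samples' dataframe."""
        return self._samples
```

With this layout, `AnomalibDataModule.get_samples(split)` reduces to the multiplexer shown above, returning `self.train_data.samples`, `self.val_data.samples`, and so on.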
image = read_image(image_path)
label_index = self.samples.iloc[index].label_index

item = dict(image_path=image_path, label=label_index)
[minor] inconsistent naming of concepts:
within the dataset, `label` seems to refer to a `str`, since we see a `label_index` (one can guess it's an `int`);
outside the dataset, `label` is an `int` (so one would expect a `label_str`? or something alike)
if label_index == 0:
    mask = np.zeros(shape=image.shape[:2])
else:
    mask = cv2.imread(mask_path, flags=0) / 255.0
[minor] `flags=cv2.IMREAD_GRAYSCALE` would be more self-explanatory. Or maybe a `read_mask` encapsulating the `/ 255.0` as well (analogous to `read_image`).
class AnomalibDataModule(LightningDataModule, ABC):
    """Base Anomalib data module."""

    def __init__(
[minor] `val` and `test` are assumed to stand in for each other in a confusing way:
(1) batch sizes are `X_batch_size` with X in {train, test}, and `test_batch_size` is used for both the `test` and `val` `DataLoader`;
(2) transform configs are `transform_config_Y` with Y in {train, val}, and `transform_config_val` is used for both the `test` and `val` `AnomalibDataset`.
i.e.
batch size: a config is available for `test` and assumed for both `test` and `val`;
transform: a config is available for `val` and assumed for both `test` and `val`.
self.num_workers = num_workers
self.create_validation_set = create_validation_set

if transform_config_train is not None and transform_config_val is None:
[minor] Is there a reason for assuming this? It could make sense for `transform_config_train` to contain "weak" data augmentations (e.g. tiny brightness changes) that should not be repeated on the validation set.
If this behavior is to be kept, a warning wouldn't be harmful :)
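The fallback-plus-warning could be sketched as follows (a minimal sketch: the helper name `resolve_val_transform_config` is hypothetical, extracted here only to make the behavior testable):

```python
import warnings
from typing import Optional


def resolve_val_transform_config(
    transform_config_train: Optional[str],
    transform_config_val: Optional[str],
) -> Optional[str]:
    """Fall back to the train transform config for validation, but warn about it."""
    if transform_config_train is not None and transform_config_val is None:
        warnings.warn(
            "transform_config_val is None; reusing transform_config_train for "
            "validation. Train-time augmentations may not be appropriate there."
        )
        return transform_config_train
    return transform_config_val
```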
def setup(self, stage: Optional[str] = None) -> None:
    """Setup train, validation and test data.
If `stage` is `None`, all splits are setup.
)

if stage in (None, "fit", "validate"):
    samples = self.get_samples("val") if self.create_validation_set else self.get_samples("test")
[minor] Suggested change:

if self.create_validation_set:
    samples = self.get_samples("val")
else:
    warnings.warn("Validation split is not available. Test split will be used for validation.")
    samples = self.get_samples("test")
if you decide to move `_create_samples` to `AnomalibDataset`, then this won't be necessary here
    bool: Boolean indicating if any anomalous images have been assigned to the dataset or subset.
"""
samples = self.get_samples(split)
return 1 in list(samples.label_index)
[minor] Suggested change:

return LABEL_INDEX_ANOMALOUS in list(samples.label_index)

A bit picky, but a constant would be clearer.
In this case, would using enums be better? We could then have something like `Label.Normal` and `Label.Anomalous`.
i created a sketch of the things i suggested in #564
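Such an enum could be sketched like this (member names are assumptions from the comment above; using `IntEnum` keeps comparisons against the raw 0/1 indices working unchanged):

```python
from enum import IntEnum


class Label(IntEnum):
    """Sketch of label indices as a self-documenting enum."""

    NORMAL = 0
    ANOMALOUS = 1


# IntEnum members compare equal to plain ints, so an expression like
# `1 in list(samples.label_index)` can become
# `Label.ANOMALOUS in list(samples.label_index)` with no behavior change.
```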
I am fine with the changes
Description
This PR refactors the data modules to reduce duplication.
- The `AnomalibDataset` class was introduced, which replaces the individual dataset classes.
- The base class `AnomalibDataModule` was introduced, and the `setup` method was moved from the concrete subclasses to the new base class.
- The `make_mvtec_dataset`, `make_btech_dataset` and `make_dataset` functions are replaced by the `_create_samples` method. The implementation of this method differs between the individual data module classes.
- The refactor also makes it possible to train when no anomalous images are present in the dataset. In this case, Anomalib will not run the test stage after training, so no performance metrics are reported. The user will be notified about this through a logging message.
Fixes #277: Support training with only normal images (no evaluation)
Changes
Checklist