Refactor data modules #558
Conversation
Thanks, the code is much cleaner now. I have two very minor comments.
Thanks
Much better! Thanks!
I only have a single comment regarding the pre-processing operation.
nice refactor :)
i just wanted to give my 10 cents with two minor improvement suggestions (although the one about `setup` may not be that minor)
Thanks, a lot cleaner.
`AnomalibDataset.__getitem__` is better without those conditions on `self.split`, IMO. Thanks :)
i noticed a few minor things (i marked "[minor]" in my comments), but i have a bigger concern about `AnomalibDataModule._create_samples` -- feel free to disregard it if you want to move forward with this sooner.
a conceptual question: is an `AnomalibDataset` assumed to be split-specific?

if yes (IMO it should be), then the `self._samples.split == split` logic should be encapsulated somewhere in the `AnomalibDataset` (I'd put `_create_samples` as an abstract method inside `AnomalibDataset`), and the dataframes should be split-specific (i.e. no need for a `split` column). `AnomalibDataModule.get_samples(split)` would be just a multiplexer:

if split == "train":
    return self.train_data.samples  # which btw could be a `@property` hiding `_samples`

another reason to do this: the `AnomalibDataset` would be more standalone (in the current form, one needs to instantiate the `AnomalibDataModule` in order to construct the former properly, or at least without hacking the latter)
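The split-specific design suggested above could be sketched roughly like this (a minimal sketch: the class and method names follow the discussion, but the constructor signature and column names are assumptions):

```python
from abc import ABC, abstractmethod

from pandas import DataFrame


class AnomalibDataset(ABC):
    """Sketch of a split-specific dataset that builds its own samples."""

    def __init__(self, split: str) -> None:
        self.split = split
        # Each dataset instance holds only its own split, so the
        # samples dataframe needs no 'split' column.
        self._samples: DataFrame = self._create_samples()

    @abstractmethod
    def _create_samples(self) -> DataFrame:
        """Build the samples dataframe for this dataset's split."""

    @property
    def samples(self) -> DataFrame:
        """Read-only access to the underlying '_samples' dataframe."""
        return self._samples
```

With this layout, `AnomalibDataModule.get_samples(split)` reduces to the multiplexer shown above, returning `self.train_data.samples`, `self.val_data.samples`, and so on.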
image = read_image(image_path)
label_index = self.samples.iloc[index].label_index

item = dict(image_path=image_path, label=label_index)
[minor] inconsistent naming of concepts:
within the dataset, `label` seems to refer to a `str`, since we see a `label_index` (one can guess it's an `int`);
outside the dataset, `label` is an `int` (so one would expect a `label_str`? or something alike)
if label_index == 0:
    mask = np.zeros(shape=image.shape[:2])
else:
    mask = cv2.imread(mask_path, flags=0) / 255.0
[minor] `flags=cv2.IMREAD_GRAYSCALE` would be more self-explanatory. Or maybe a `read_mask` encapsulating the `/ 255.0` as well (analogous to `read_image`).
class AnomalibDataModule(LightningDataModule, ABC):
    """Base Anomalib data module."""

    def __init__(
[minor] `val` and `test` are assumed to stand in for each other in a confusing way:
(1) batch sizes are `X_batch_size` with X in {train, test}, and `test_batch_size` is used for both the `test` and `val` `DataLoader`;
(2) transform configs are `transform_config_Y` with Y in {train, val}, and `transform_config_val` is used for both the `test` and `val` `AnomalibDataset`.
i.e.
batch size: a config is available for `test` and assumed for both `test` and `val`;
transform: a config is available for `val` and assumed for both `test` and `val`.
self.num_workers = num_workers
self.create_validation_set = create_validation_set

if transform_config_train is not None and transform_config_val is None:
[minor] Is there a reason for assuming this? It could make sense for `transform_config_train` to contain "weak" data augmentations (e.g. tiny brightness changes) that should not be repeated on the validation set.
If this behavior is to be kept, a warning wouldn't be harmful :)
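The fallback-plus-warning could be sketched as follows (a minimal sketch: the helper name `resolve_val_transform_config` is hypothetical, extracted here only to make the behavior testable):

```python
import warnings
from typing import Optional


def resolve_val_transform_config(
    transform_config_train: Optional[str],
    transform_config_val: Optional[str],
) -> Optional[str]:
    """Fall back to the train transform config for validation, but warn about it."""
    if transform_config_train is not None and transform_config_val is None:
        warnings.warn(
            "transform_config_val is None; reusing transform_config_train for "
            "validation. Train-time augmentations may not be appropriate there."
        )
        return transform_config_train
    return transform_config_val
```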
def setup(self, stage: Optional[str] = None) -> None:
    """Setup train, validation and test data.
If `stage` is `None`, all splits are setup.
)

if stage in (None, "fit", "validate"):
    samples = self.get_samples("val") if self.create_validation_set else self.get_samples("test")
[minor] Suggested change:

if self.create_validation_set:
    samples = self.get_samples("val")
else:
    warnings.warn("Validation split is not available. Test split will be used for validation.")
    samples = self.get_samples("test")
if you decide to move `_create_samples` to `AnomalibDataset`, then this won't be necessary here
    bool: Boolean indicating if any anomalous images have been assigned to the dataset or subset.
"""
samples = self.get_samples(split)
return 1 in list(samples.label_index)
[minor] Suggested change:

return LABEL_INDEX_ANOMALOUS in list(samples.label_index)

A bit picky, but a constant would be clearer.
In this case, would using enums be better? We could then have something like `Label.Normal` and `Label.Anomalous`.
i created a sketch of the things i suggested in #564
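Such an enum could be sketched like this (member names are assumptions from the comment above; using `IntEnum` keeps comparisons against the raw 0/1 indices working unchanged):

```python
from enum import IntEnum


class Label(IntEnum):
    """Sketch of label indices as a self-documenting enum."""

    NORMAL = 0
    ANOMALOUS = 1


# IntEnum members compare equal to plain ints, so an expression like
# `1 in list(samples.label_index)` can become
# `Label.ANOMALOUS in list(samples.label_index)` with no behavior change.
```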
I am fine with the changes
Description
This PR refactors the data modules to reduce duplication.
- The `AnomalibDataset` class was introduced, which replaces the individual dataset classes.
- The base class `AnomalibDataModule` was introduced, and the `setup` method was moved from the concrete subclasses to the new base class.
- The `make_mvtec_dataset`, `make_btech_dataset` and `make_dataset` functions are replaced by the `_create_samples` method. The implementation of this method differs between the individual data module classes.
- The refactor also makes it possible to train when no anomalous images are present in the dataset. In this case, Anomalib will not run the test stage after training, so no performance metrics are reported. The user will be notified about this through a logging message.
Fixes #277: Support training with only normal images (no evaluation)
Changes
Checklist