New datamodules design #572
I went over it and I don't have any major feedback. I feel this is a much cleaner design and more flexible.
(i'm going to be a bit annoying because i think this refactor is important, and it will affect my work a lot)
this version is better than the other one IMO, but overall both diverge from how `LightningDataModule` is supposed to be used (and how its downstream `Dataset` is supposed to be designed)
There are some advantages in this design, but I think following Lightning's patterns is a better way to go because then you make better use of it
Another guideline i'm considering: the behavior of parent classes should be kept minimal (or use the template method pattern, like Lightning) to allow more composability.
i am trying to keep this short, but below i give you 2 reasons not to go this way
i wrote another draft in my fork (didn't open another PR yet): my branch
reason 1
the order of things is not quite right
example:
- `LightningDataModule.prepare_data()` is supposed to not change state and only manage the files in the system to make sure they're in place
- `LightningDataModule.setup()` should load whatever is necessary into memory
In the way things are being done here (correct me if i'm wrong), the setup is being made even before `prepare_data()`.
well, because of (2), the `Dataset` somehow has to load things into memory either (a) lazily (following `LightningDataModule`'s pattern) or (b) by being instantiated at `LightningDataModule.setup()`, which means that the data preparation should not be a method (but a staticmethod could do).
in my draft i solved that by making the `Dataset` follow exactly the same pattern: `.prepare_data()` (which calls a `staticmethod`) and `.setup()` (which will make use of the args from `__init__`).
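A minimal sketch of the pattern described in reason 1, in plain Python so that it runs without Lightning installed. `ToyDataset` and `ToyDataModule` are hypothetical names invented for this illustration, not classes from the PR or from the draft branch. The only thing the sketch is meant to show is the ordering: `prepare_data()` touches the file system through a staticmethod and assigns no state, and `setup()` then loads into memory using the arguments captured in `__init__`.

```python
import tempfile
from pathlib import Path


class ToyDataset:
    """Hypothetical dataset that follows the prepare_data()/setup() order."""

    def __init__(self, root: Path, split: str) -> None:
        # Only store the arguments; nothing is loaded yet.
        self.root = root
        self.split = split
        self.samples = None  # filled in by setup()

    @staticmethod
    def _prepare(root: Path) -> None:
        # File-system work only. A staticmethod has no access to `self`,
        # so it cannot assign instance state.
        for split in ("train", "test"):
            split_dir = root / split
            split_dir.mkdir(parents=True, exist_ok=True)
            for i in range(3):
                (split_dir / f"{i}.txt").write_text(f"{split}-{i}")

    def prepare_data(self) -> None:
        self._prepare(self.root)

    def setup(self) -> None:
        # Load into memory, using the arguments stored in __init__.
        files = sorted((self.root / self.split).glob("*.txt"))
        self.samples = [f.read_text() for f in files]


class ToyDataModule:
    """Hypothetical datamodule mirroring the same two-step order."""

    def __init__(self, root: Path) -> None:
        # Datasets can be instantiated here because __init__ loads nothing.
        self.train_data = ToyDataset(root, "train")
        self.test_data = ToyDataset(root, "test")

    def prepare_data(self) -> None:
        self.train_data.prepare_data()

    def setup(self) -> None:
        self.train_data.setup()
        self.test_data.setup()


datamodule = ToyDataModule(Path(tempfile.mkdtemp()))
datamodule.prepare_data()  # step 1: files in place, no state assigned
datamodule.setup()         # step 2: state loaded into memory
print(datamodule.train_data.samples)
```

In Lightning itself the `Trainer` calls `prepare_data()` before `setup()`; here the two calls are made by hand to keep the example self-contained.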
reason 2
i think the current design is biased by the behavior in `folder.py` and is being forced to fit the (already functional) code of the `make_dataset`-like functions
from what i understand, a `Dataset` should keep the knowledge of how to find and load samples from a specific pre-defined split; it should not deal with dynamic splitting (i.e. creating random subsplits to create a validation set)
i think that kind of behavior should be at the `Datamodule` level because it can ensure the compatibility between the splits once they've all been set up
in my draft i used `torch.utils.data.Subset` to do that and transferred all this behavior to `AnomalibDatamodule`; two advantages:
- downstream classes (see `mvtec.py`) become cleaner
- the behavior is better factored out
you will see that, for instance, `seed` and `create_validation_set` (something like that) don't need to be passed downstream to the child class (then to the function)
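To make reason 2 concrete, here is a rough sketch of datamodule-level splitting. It is not the code from the draft branch: `FixedSplitDataset` and `ToyDataModule` are hypothetical, and the small `Subset` class is a stdlib stand-in that only copies the `(dataset, indices)` shape of `torch.utils.data.Subset` so the example runs without PyTorch. Note that the dataset class never receives a seed or a validation flag.

```python
import random
from typing import Sequence


class Subset:
    """Stand-in with the same (dataset, indices) shape as torch.utils.data.Subset."""

    def __init__(self, dataset: Sequence, indices: Sequence[int]) -> None:
        self.dataset = dataset
        self.indices = list(indices)

    def __getitem__(self, idx: int):
        return self.dataset[self.indices[idx]]

    def __len__(self) -> int:
        return len(self.indices)


class FixedSplitDataset:
    """Hypothetical dataset: only knows how to list one predefined split."""

    def __init__(self, split: str, size: int) -> None:
        self.samples = [f"{split}_{i:03d}.png" for i in range(size)]

    def __getitem__(self, idx: int) -> str:
        return self.samples[idx]

    def __len__(self) -> int:
        return len(self.samples)


class ToyDataModule:
    """Hypothetical datamodule that owns the dynamic splitting."""

    def __init__(self, val_ratio: float = 0.5, seed: int = 0) -> None:
        self.val_ratio = val_ratio
        self.seed = seed

    def setup(self) -> None:
        self.train_data = FixedSplitDataset("train", 8)
        full_test = FixedSplitDataset("test", 6)
        # Shuffle indices with a local RNG so the global random state is untouched.
        indices = list(range(len(full_test)))
        random.Random(self.seed).shuffle(indices)
        n_val = int(len(indices) * self.val_ratio)
        # Both subsets come from one shuffle, so they cannot overlap.
        self.val_data = Subset(full_test, indices[:n_val])
        self.test_data = Subset(full_test, indices[n_val:])


datamodule = ToyDataModule(val_ratio=0.5, seed=0)
datamodule.setup()
print(len(datamodule.val_data), len(datamodule.test_data))
```

Because the validation and test subsets are cut from a single shuffled index list inside `setup()`, their compatibility is checked in one place rather than in every downstream dataset class.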
@djdameln i'm very willing to help on this refactor in particular, but unfortunately i'm struggling to find time to create a neat and well-explained PR like yours. maybe we can make a call? i think it would be more productive
@jpcbertoldo Thanks for your comments. As it turns out, it's quite tricky to get this design right, so it's nice to have an extra pair of eyes on this. I agree with some, but not all of your suggestions.
The call to
I had a look at your draft and correct me if I'm wrong, but I don't think it's necessary to have the implementation of
I do see the added value of moving the setup method to the dataset class, because it allows us to instantiate the dataset object in the constructor of the datamodule. That way we can get rid of the awkward
This is true for MVTec, where the train/test set is fixed, but not for the Folder dataset where we create a random train/test split at runtime. When following your design, we would still need to ensure somehow that the same seed is used between
This was actually the main motivation behind my latest design. By starting with a common dataset object with the 'Full' split, we ensure that we only have to call
Anyway, let's continue the discussion in a call. I'll schedule one for early next week.
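The trade-off discussed in this reply can be illustrated with a small stdlib-only sketch. Both helper names (`split_full_once`, `pick_split`) are hypothetical and are not functions from anomalib. The first splits one 'Full' sample list a single time; the second models the alternative in which each dataset object draws its own split and therefore depends on both callers passing the same seed.

```python
import random
from typing import List, Tuple


def split_full_once(samples: List[str], test_ratio: float, seed: int) -> Tuple[List[str], List[str]]:
    """Shuffle the full sample list one time and cut it into (train, test)."""
    shuffled = sorted(samples)  # sort first so the result ignores input order
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


def pick_split(samples: List[str], split: str, test_ratio: float, seed: int) -> List[str]:
    """Alternative: each dataset object draws its own split independently."""
    train, test = split_full_once(samples, test_ratio, seed)
    return train if split == "train" else test


full = [f"img_{i:02d}.png" for i in range(10)]

# One call: train and test are complementary by construction.
train, test = split_full_once(full, test_ratio=0.3, seed=42)

# Two calls: complementary only because both callers pass the same seed.
train_b = pick_split(full, "train", test_ratio=0.3, seed=42)
test_b = pick_split(full, "test", test_ratio=0.3, seed=42)

print(len(train), len(test))
```

With the single-call version there is no seed to keep in sync between the train and test objects, which is the property the 'Full' split design is after.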
Co-authored-by: Joao P C Bertoldo <[email protected]>
…inotoolkit/anomalib into da/datamodules-alternative
Hmm, ok I see. Makes sense.
Got your point but still would be nice to have a
Right now there is that option using the
Btw, it'd be nice to have an option
Yup.
Yup, good point! Are you going to make any other minor changes?
I think it'd be nice to explicitly state this in the code and/or doc for the sake of the record and to have future devs not question themselves hahaha.
Good idea, I agree that this could be useful to have. I've added the seed argument to the datamodule. Please note that we're not using this argument when running the training from the entrypoint scripts, because we already set the global seed using
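As a rough illustration of why an explicit seed argument and a global seed can coexist, the sketch below uses only the Python standard library. `toy_random_split` is a hypothetical helper written for this example; it is not anomalib's splitting function, and the way the entrypoint scripts set the global seed is not reproduced here. When a seed is passed, a private generator is used; when it is omitted, the split falls back to the global random state.

```python
import random
from typing import List, Optional, Tuple


def toy_random_split(items: List[int], ratio: float, seed: Optional[int] = None) -> Tuple[List[int], List[int]]:
    """Split `items` in two, using a private RNG only when a seed is given."""
    # With a seed: private generator. Without: the global `random` module state.
    rng = random.Random(seed) if seed is not None else random
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * ratio)
    return shuffled[:cut], shuffled[cut:]


items = list(range(10))

# Case 1: no seed argument, reproducibility comes from the global seed.
random.seed(0)
first = toy_random_split(items, 0.5)
random.seed(0)
second = toy_random_split(items, 0.5)

# Case 2: explicit seed, independent of whatever the global state is.
third = toy_random_split(items, 0.5, seed=123)
random.seed(999)
fourth = toy_random_split(items, 0.5, seed=123)

print(first == second, third == fourth)
```

Either route gives repeatable splits; the explicit argument mainly helps custom use cases where the caller does not control the global seed.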
Done.
We're planning to merge this to a feature branch for now, as we're working on adding more functionality to the data side of the library (synthetic anomaly generation, support for video datasets), and we don't want to expose this to the main branch until it's stable. So we'll keep making incremental changes to the datamodules on the feature branch. In terms of overall design I don't expect many more changes though, so if you want to start building stuff on top of these base classes that should be fine. You could just target your PR to the feature branch.
I've added an explanation to the docstring for now. We'll probably update the documentation on the datamodules at a later point.
nice :)
So we don't need to pass the seed to ensure consistency between runs (but I still see the added value of the seed parameter for custom use-cases)
Yes, you are right.
BUT, there are ways of ensuring consistency that are better than others haha.
Let's leave that for another chat : )
Thanks for the changes, great work!
Co-authored-by: Joao P C Bertoldo <[email protected]>
* New datamodules design (#572) * move sample generation to datamodule instead of dataset * move sample generation from init to setup * remove inference stage and add base classes * replace dataset classes with AnomalibDataset * move setup to base class, create samples as class method * update docstrings * refactor btech to new format * allow training with no anomalous data * remove MVTec name from comment * raise NotImplementedError in base class * allow both png and bmp images for btech * use label_index to check if dataset contains anomalous images * refactor getitem in dataset class * use iloc for indexing * move dataloader getters to base class * refactor to add validate stage in setup * implement alternative datamodules solution * small improvements * improve design * remove unused constructor arguments * adapt btech to new design * add prepare_data method for mvtec * implement more generic random splitting function * update docstrings for folder module * ensure type consistency when performing operations on dataset * change imports * change variable names * replace pass with NotImplementedError * allow training on folder without test images * use relative path for normal_test_dir * fix dataset tests * update validation set parameter in configs * change default argument * use setter for samples * hint options for val_split_mode * update assert message and docstring * revert name change dataset vs datamodule * typing and docstrings * remove samples argument from dataset constructor * val/test -> eval * remove Split.Full from enum * sort samples when setting * update warn message * formatting * use setter when creating samples in dataset classes * add tests for new dataset class * add test case for label aware random split * update parameter name in inferencers * move _setup implementation to base class * address codacy issues * fix pylint issues * codacy * update example dataset config in docs * fix test * move base classes to separate files (avoid circular 
import) * add base classes * update docstring * fix imports * validation_split_mode -> val_split_mode * update docs * Update anomalib/data/base/dataset.py Co-authored-by: Joao P C Bertoldo <[email protected]> * get length from self.samples * assert unique indices * check is_setup for individual datasets Co-authored-by: Joao P C Bertoldo <[email protected]> * remove assert in __getitem_\ Co-authored-by: Joao P C Bertoldo <[email protected]> * Update anomalib/data/btech.py Co-authored-by: Joao P C Bertoldo <[email protected]> * clearer assert message * clarify list inversion in comment * comments and typing * validate contents of samples dataframe before setting * add file paths check * add seed to random_split function * fix expected columns * fix typo * add seed parameter to datamodules * set global seed in test entrypoint * add NONE option to valsplitmode * clarify setup behaviour in docstring * fix typo Co-authored-by: Joao P C Bertoldo <[email protected]> Co-authored-by: Joao P C Bertoldo <[email protected]> * Video Datamodules (#676) * move sample generation to datamodule instead of dataset * move sample generation from init to setup * remove inference stage and add base classes * replace dataset classes with AnomalibDataset * move setup to base class, create samples as class method * update docstrings * refactor btech to new format * allow training with no anomalous data * remove MVTec name from comment * raise NotImplementedError in base class * allow both png and bmp images for btech * use label_index to check if dataset contains anomalous images * refactor getitem in dataset class * use iloc for indexing * move dataloader getters to base class * refactor to add validate stage in setup * implement alternative datamodules solution * small improvements * improve design * remove unused constructor arguments * adapt btech to new design * add prepare_data method for mvtec * implement more generic random splitting function * update docstrings for folder module * 
ensure type consistency when performing operations on dataset * change imports * change variable names * replace pass with NotImplementedError * allow training on folder without test images * use relative path for normal_test_dir * fix dataset tests * update validation set parameter in configs * change default argument * use setter for samples * hint options for val_split_mode * update assert message and docstring * revert name change dataset vs datamodule * typing and docstrings * remove samples argument from dataset constructor * val/test -> eval * remove Split.Full from enum * sort samples when setting * update warn message * formatting * use setter when creating samples in dataset classes * add tests for new dataset class * add test case for label aware random split * update parameter name in inferencers * move _setup implementation to base class * address codacy issues * fix pylint issues * codacy * update example dataset config in docs * fix test * move base classes to separate files (avoid circular import) * add base classes * update docstring * fix imports * validation_split_mode -> val_split_mode * update docs * Update anomalib/data/base/dataset.py Co-authored-by: Joao P C Bertoldo <[email protected]> * get length from self.samples * assert unique indices * check is_setup for individual datasets Co-authored-by: Joao P C Bertoldo <[email protected]> * remove assert in __getitem_\ Co-authored-by: Joao P C Bertoldo <[email protected]> * Update anomalib/data/btech.py Co-authored-by: Joao P C Bertoldo <[email protected]> * clearer assert message * clarify list inversion in comment * comments and typing * validate contents of samples dataframe before setting * add file paths check * add seed to random_split function * fix expected columns * fix typo * add pedestrian and avenue datasets and video utils * add seed parameter to datamodules * set global seed in test entrypoint * add NONE option to valsplitmode * clarify setup behaviour in docstring * add basic 
visualization for video datasets * simplify ucsdped implementation * add ucsd and avenue to __all__ * add default value for task * add tests for ucsd and avenue * add tests for video dataset and utils * add download info for avenue dataset * add download info for ucsd pedestrian dataset * more consistent naming * fix path to masks folder in gt dir * pass original image in batch to facilitate visualization * convert mask files for avenue * suppress warning due to torchvision bug * fix bug in avenue masks * store visualizations for each video in separate folder * rename parameters * add warning for clip_length > 1 * fix dataset tests * fix labels tensor shape bug * add pyav to requirements * add description for avenue dataset * use pathlib * Update anomalib/data/avenue.py Co-authored-by: Samet Akcay <[email protected]> * Update anomalib/data/avenue.py Co-authored-by: Samet Akcay <[email protected]> * Update anomalib/data/utils/video.py Co-authored-by: Samet Akcay <[email protected]> * Update anomalib/data/base/video.py Co-authored-by: Samet Akcay <[email protected]> * Update anomalib/data/base/video.py Co-authored-by: Samet Akcay <[email protected]> * Update anomalib/data/ucsd_ped.py Co-authored-by: Samet Akcay <[email protected]> * import video dataset from base * fix bug when collecting ucsd samples * clean up datamodules tests * fix tests * remove redundant test cases * retrieve masks as numpy array * use pathlib * variable name * pathlib * use preprocesser from arguments * fix indexing bug Co-authored-by: Joao P C Bertoldo <[email protected]> Co-authored-by: Samet Akcay <[email protected]> * Update lightning_inference.py * Make `val split ratio` configurable (#760) * make val split ratio configurable * use DeprecationWarning, update config key * Add support for Detection task type (#732) * add basic support for detection task * use enum for task type * formatting * small bugfix * add unit tests for bounding box conversion * update error message * use as_tensor * 
typing and docstring * explicit keyword arguments * simplify bbox handling in video dataset * docstring consistency * add missing licenses * add whitespace for readability * add missing license * Update anomalib/data/utils/boxes.py Co-authored-by: Samet Akcay <[email protected]> * Revert "Update anomalib/data/utils/boxes.py" This reverts commit cec6138. * add test case for custom collate function * docstring * add integration tests for detection dataloading * extend and clean up datamodules tests * add detection task type to visualizer tests * only show pred_boxes during inference * add detection support for torch inference * add detection support for openvino inference * test inference for all task types * pylint Co-authored-by: Samet Akcay <[email protected]> * [Datamodules] Update deprecation messages (#764) * update deprecation messages * raise warnings as DeprecationWarning * Improve image source parsing for Folder dataset (#784) * mask -> mask_dir * properly handle absolute and relative paths * make root path parameter optional * formatting * path -> root * update docs * remove options hint for name parameter * refactor function * Update anomalib/config/config.py Co-authored-by: Samet Akcay <[email protected]> * Update anomalib/config/config.py Co-authored-by: Samet Akcay <[email protected]> * make root and abnormal_dir optional * Update anomalib/data/folder.py Co-authored-by: Samet Akcay <[email protected]> Co-authored-by: Samet Akcay <[email protected]> * Synthetic anomaly for testing and validation (#634) * move sample generation to datamodule instead of dataset * move sample generation from init to setup * remove inference stage and add base classes * replace dataset classes with AnomalibDataset * move setup to base class, create samples as class method * update docstrings * refactor btech to new format * allow training with no anomalous data * remove MVTec name from comment * raise NotImplementedError in base class * allow both png and bmp images for 
btech * use label_index to check if dataset contains anomalous images * refactor getitem in dataset class * use iloc for indexing * move dataloader getters to base class * refactor to add validate stage in setup * implement alternative datamodules solution * small improvements * improve design * remove unused constructor arguments * adapt btech to new design * add prepare_data method for mvtec * implement more generic random splitting function * update docstrings for folder module * ensure type consistency when performing operations on dataset * change imports * change variable names * replace pass with NotImplementedError * allow training on folder without test images * use relative path for normal_test_dir * fix dataset tests * update validation set parameter in configs * change default argument * use setter for samples * hint options for val_split_mode * update assert message and docstring * revert name change dataset vs datamodule * typing and docstrings * remove samples argument from dataset constructor * val/test -> eval * remove Split.Full from enum * sort samples when setting * update warn message * formatting * use setter when creating samples in dataset classes * add tests for new dataset class * add test case for label aware random split * update parameter name in inferencers * move _setup implementation to base class * address codacy issues * fix pylint issues * codacy * update example dataset config in docs * fix test * move base classes to separate files (avoid circular import) * add synthetic dataset class * move augmenter to data directory * add base classes * update docstring * use synthetic dataset in base datamodule * fix imports * clean up synthetic anomaly dataset implementation * fix mistake in augmenter * change default split ratio * remove accidentally added file * validation_split_mode -> val_split_mode * update docs * Update anomalib/data/base/dataset.py Co-authored-by: Joao P C Bertoldo <[email protected]> * get length from self.samples * 
assert unique indices * check is_setup for individual datasets Co-authored-by: Joao P C Bertoldo <[email protected]> * remove assert in __getitem_\ Co-authored-by: Joao P C Bertoldo <[email protected]> * Update anomalib/data/btech.py Co-authored-by: Joao P C Bertoldo <[email protected]> * clearer assert message * clarify list inversion in comment * comments and typing * validate contents of samples dataframe before setting * add file paths check * add seed to random_split function * fix expected columns * fix typo * add seed parameter to datamodules * set global seed in test entrypoint * add NONE option to valsplitmode * clarify setup behaviour in docstring * add logging message * use val_split_ratio for synthetic validation set * pathlib * make synthetic anomaly available for test set * update configs * add tests * simplify test set splitting logic * update docstring * add missing licence * split_normal_and_anomalous -> split_by_label * VideoAnomalib -> AnomalibVideo Co-authored-by: Joao P C Bertoldo <[email protected]> * Bugfixes for Datamodules feature branch (#800) * properly handle NoneType mask_dir and add test case * fix wrong deprecation handling * Deprecate PreProcessor (#795) * deprecate PreProcessor * update configs * update deprecation messages * update video dataset * update inference dataset * move transforms to data module * update and extend transform tests * fix cyclic import * add validity checks for image size and center crop * pass image size as tuple * update path to get_transforms * update error message * fix center crop tuple conversion * update inferencers * remove draem transform config * update changelog * fix cyclic import * add crop size vs image size check * improve readability * mypy * use enum to configure input normalization * update lightning inference * update inference dataset * [Datamodules] Fix bug in bbox score to image score conversion (#803) handle empty box predictions * Improve handling of `test_split_mode='none'` and 
`val_split_mode='none'` (#801) * enable none as split mode * use get to retrieve config keys * update deprecation message and config key * fix to float transform * Detection improvements (#820) * apply pixel threshold to bbox detections * allow visualizing normal boxes * normalize box scores * fix bbox logic in base anomaly module * boxes_scores -> box_scores * fix inferencers * update changelog * update csflow config to new format * remove unused imports * line length * suppress bandit warnings * use torch rng in augmenter * use tuple instead of list * add missing params to dosctring * add missing licence information * COLS -> COLUMNS * typing and variable naming * remove duplicate parameter in docstring * im_dir -> image_dir * typing and docstring * typing * ValSplitMode -> ValidationSplitMode * add missing licence * rename variable * remove empty comment * remove unused class attribute * [Detection] Compute box score when generating boxes from masks (#828) * infer box scores from anomaly maps * discard single pixel boxes * revert discard single pixel boxes * add test case for bbox scores * update torch inferencer * minor refactor * revert val_split_mode -> validation_split_mode Co-authored-by: Joao P C Bertoldo <[email protected]> Co-authored-by: Samet Akcay <[email protected]>
assert unique indices * check is_setup for individual datasets Co-authored-by: Joao P C Bertoldo <[email protected]> * remove assert in __getitem_\ Co-authored-by: Joao P C Bertoldo <[email protected]> * Update anomalib/data/btech.py Co-authored-by: Joao P C Bertoldo <[email protected]> * clearer assert message * clarify list inversion in comment * comments and typing * validate contents of samples dataframe before setting * add file paths check * add seed to random_split function * fix expected columns * fix typo * add seed parameter to datamodules * set global seed in test entrypoint * add NONE option to valsplitmode * clarify setup behaviour in docstring * add logging message * use val_split_ratio for synthetic validation set * pathlib * make synthetic anomaly available for test set * update configs * add tests * simplify test set splitting logic * update docstring * add missing licence * split_normal_and_anomalous -> split_by_label * VideoAnomalib -> AnomalibVideo Co-authored-by: Joao P C Bertoldo <[email protected]> * Bugfixes for Datamodules feature branch (#800) * properly handle NoneType mask_dir and add test case * fix wrong deprecation handling * Deprecate PreProcessor (#795) * deprecate PreProcessor * update configs * update deprecation messages * update video dataset * update inference dataset * move transforms to data module * update and extend transform tests * fix cyclic import * add validity checks for image size and center crop * pass image size as tuple * update path to get_transforms * update error message * fix center crop tuple conversion * update inferencers * remove draem transform config * update changelog * fix cyclic import * add crop size vs image size check * improve readability * mypy * use enum to configure input normalization * update lightning inference * update inference dataset * [Datamodules] Fix bug in bbox score to image score conversion (#803) handle empty box predictions * Improve handling of `test_split_mode='none'` and 
`val_split_mode='none'` (#801) * enable none as split mode * use get to retrieve config keys * update deprecation message and config key * fix to float transform * Detection improvements (#820) * apply pixel threshold to bbox detections * allow visualizing normal boxes * normalize box scores * fix bbox logic in base anomaly module * boxes_scores -> box_scores * fix inferencers * update changelog * update csflow config to new format * remove unused imports * line length * refactor make_mvtec_dataset to improve flexibility * add visa dataset * move download and extract functionality to shared location * move visa subset splitting to separate method * update changelog * add tests for visa dataset * suppress bandit url warning * update test * address PR comments * suppress bandit warnings * use torch rng in augmenter * fix logic in prepare_data * add comments * cleaner zipfile import * address PR comments * use tuple instead of list * add missing params to dosctring * add missing licence information * COLS -> COLUMNS * typing and variable naming * remove duplicate parameter in docstring * im_dir -> image_dir * typing and docstring * typing * ValSplitMode -> ValidationSplitMode * add missing licence * rename variable * remove empty comment * remove unused class attribute * [Detection] Compute box score when generating boxes from masks (#828) * infer box scores from anomaly maps * discard single pixel boxes * revert discard single pixel boxes * add test case for bbox scores * update torch inferencer * minor refactor * revert val_split_mode -> validation_split_mode * use empty string instead of nan as empty mask path * typing Co-authored-by: Joao P C Bertoldo <[email protected]> Co-authored-by: Samet Akcay <[email protected]>
* fix pylint issues * codacy * update example dataset config in docs * fix test * move base classes to separate files (avoid circular import) * add base classes * update docstring * fix imports * validation_split_mode -> val_split_mode * update docs * Update anomalib/data/base/dataset.py Co-authored-by: Joao P C Bertoldo <[email protected]> * get length from self.samples * assert unique indices * check is_setup for individual datasets Co-authored-by: Joao P C Bertoldo <[email protected]> * remove assert in __getitem_\ Co-authored-by: Joao P C Bertoldo <[email protected]> * Update anomalib/data/btech.py Co-authored-by: Joao P C Bertoldo <[email protected]> * clearer assert message * clarify list inversion in comment * comments and typing * validate contents of samples dataframe before setting * add file paths check * add seed to random_split function * fix expected columns * fix typo * add pedestrian and avenue datasets and video utils * add seed parameter to datamodules * set global seed in test entrypoint * add NONE option to valsplitmode * clarify setup behaviour in docstring * Created rbad directory * Keep refactoring region-extractor * rename new_image_sizes to transformed_image_sizes * Renamed the variables in region extractor * post-process function in region extractor * Refactored tile-boxes function * Added feature extractor * Add main.py * Added feature extractor to tests * Update the jupyter notebook * Uncomment loa weights from region.py * Add feature and region extractors * Finished feature-extractor implementation * Rename the algo as rkde * New datamodules design (#572) * move sample generation to datamodule instead of dataset * move sample generation from init to setup * remove inference stage and add base classes * replace dataset classes with AnomalibDataset * move setup to base class, create samples as class method * update docstrings * refactor btech to new format * allow training with no anomalous data * remove MVTec name from comment * raise 
NotImplementedError in base class * allow both png and bmp images for btech * use label_index to check if dataset contains anomalous images * refactor getitem in dataset class * use iloc for indexing * move dataloader getters to base class * refactor to add validate stage in setup * implement alternative datamodules solution * small improvements * improve design * remove unused constructor arguments * adapt btech to new design * add prepare_data method for mvtec * implement more generic random splitting function * update docstrings for folder module * ensure type consistency when performing operations on dataset * change imports * change variable names * replace pass with NotImplementedError * allow training on folder without test images * use relative path for normal_test_dir * fix dataset tests * update validation set parameter in configs * change default argument * use setter for samples * hint options for val_split_mode * update assert message and docstring * revert name change dataset vs datamodule * typing and docstrings * remove samples argument from dataset constructor * val/test -> eval * remove Split.Full from enum * sort samples when setting * update warn message * formatting * use setter when creating samples in dataset classes * add tests for new dataset class * add test case for label aware random split * update parameter name in inferencers * move _setup implementation to base class * address codacy issues * fix pylint issues * codacy * update example dataset config in docs * fix test * move base classes to separate files (avoid circular import) * add base classes * update docstring * fix imports * validation_split_mode -> val_split_mode * update docs * Update anomalib/data/base/dataset.py Co-authored-by: Joao P C Bertoldo <[email protected]> * get length from self.samples * assert unique indices * check is_setup for individual datasets Co-authored-by: Joao P C Bertoldo <[email protected]> * remove assert in __getitem_\ Co-authored-by: Joao P C 
Bertoldo <[email protected]> * Update anomalib/data/btech.py Co-authored-by: Joao P C Bertoldo <[email protected]> * clearer assert message * clarify list inversion in comment * comments and typing * validate contents of samples dataframe before setting * add file paths check * add seed to random_split function * fix expected columns * fix typo * add seed parameter to datamodules * set global seed in test entrypoint * add NONE option to valsplitmode * clarify setup behaviour in docstring * fix typo Co-authored-by: Joao P C Bertoldo <[email protected]> Co-authored-by: Joao P C Bertoldo <[email protected]> * add basic visualization for video datasets * simplify ucsdped implementation * TODO: Investigate torch_model * add ucsd and avenue to __all__ * add default value for task * add tests for ucsd and avenue * add tests for video dataset and utils * add download info for avenue dataset * add download info for ucsd pedestrian dataset * more consistent naming * fix path to masks folder in gt dir * pass original image in batch to facilitate visualization * convert mask files for avenue * suppress warning due to torchvision bug * fix bug in avenue masks * store visualizations for each video in separate folder * rename parameters * add warning for clip_length > 1 * fix dataset tests * fix labels tensor shape bug * add pyav to requirements * Add TODO notes * add todo notes * add description for avenue dataset * use pathlib * Update anomalib/data/avenue.py Co-authored-by: Samet Akcay <[email protected]> * Update anomalib/data/avenue.py Co-authored-by: Samet Akcay <[email protected]> * Update anomalib/data/utils/video.py Co-authored-by: Samet Akcay <[email protected]> * Update anomalib/data/base/video.py Co-authored-by: Samet Akcay <[email protected]> * Update anomalib/data/base/video.py Co-authored-by: Samet Akcay <[email protected]> * Update anomalib/data/ucsd_ped.py Co-authored-by: Samet Akcay <[email protected]> * import video dataset from base * fix bug when collecting 
ucsd samples * clean up datamodules tests * fix tests * remove redundant test cases * add test case for normality model * retrieve masks as numpy array * use pathlib * variable name * pathlib * use preprocesser from arguments * fix indexing bug * Video Datamodules (#676) * move sample generation to datamodule instead of dataset * move sample generation from init to setup * remove inference stage and add base classes * replace dataset classes with AnomalibDataset * move setup to base class, create samples as class method * update docstrings * refactor btech to new format * allow training with no anomalous data * remove MVTec name from comment * raise NotImplementedError in base class * allow both png and bmp images for btech * use label_index to check if dataset contains anomalous images * refactor getitem in dataset class * use iloc for indexing * move dataloader getters to base class * refactor to add validate stage in setup * implement alternative datamodules solution * small improvements * improve design * remove unused constructor arguments * adapt btech to new design * add prepare_data method for mvtec * implement more generic random splitting function * update docstrings for folder module * ensure type consistency when performing operations on dataset * change imports * change variable names * replace pass with NotImplementedError * allow training on folder without test images * use relative path for normal_test_dir * fix dataset tests * update validation set parameter in configs * change default argument * use setter for samples * hint options for val_split_mode * update assert message and docstring * revert name change dataset vs datamodule * typing and docstrings * remove samples argument from dataset constructor * val/test -> eval * remove Split.Full from enum * sort samples when setting * update warn message * formatting * use setter when creating samples in dataset classes * add tests for new dataset class * add test case for label aware random split * 
update parameter name in inferencers * move _setup implementation to base class * address codacy issues * fix pylint issues * codacy * update example dataset config in docs * fix test * move base classes to separate files (avoid circular import) * add base classes * update docstring * fix imports * validation_split_mode -> val_split_mode * update docs * Update anomalib/data/base/dataset.py Co-authored-by: Joao P C Bertoldo <[email protected]> * get length from self.samples * assert unique indices * check is_setup for individual datasets Co-authored-by: Joao P C Bertoldo <[email protected]> * remove assert in __getitem_\ Co-authored-by: Joao P C Bertoldo <[email protected]> * Update anomalib/data/btech.py Co-authored-by: Joao P C Bertoldo <[email protected]> * clearer assert message * clarify list inversion in comment * comments and typing * validate contents of samples dataframe before setting * add file paths check * add seed to random_split function * fix expected columns * fix typo * add pedestrian and avenue datasets and video utils * add seed parameter to datamodules * set global seed in test entrypoint * add NONE option to valsplitmode * clarify setup behaviour in docstring * add basic visualization for video datasets * simplify ucsdped implementation * add ucsd and avenue to __all__ * add default value for task * add tests for ucsd and avenue * add tests for video dataset and utils * add download info for avenue dataset * add download info for ucsd pedestrian dataset * more consistent naming * fix path to masks folder in gt dir * pass original image in batch to facilitate visualization * convert mask files for avenue * suppress warning due to torchvision bug * fix bug in avenue masks * store visualizations for each video in separate folder * rename parameters * add warning for clip_length > 1 * fix dataset tests * fix labels tensor shape bug * add pyav to requirements * add description for avenue dataset * use pathlib * Update anomalib/data/avenue.py 
Co-authored-by: Samet Akcay <[email protected]> * Update anomalib/data/avenue.py Co-authored-by: Samet Akcay <[email protected]> * Update anomalib/data/utils/video.py Co-authored-by: Samet Akcay <[email protected]> * Update anomalib/data/base/video.py Co-authored-by: Samet Akcay <[email protected]> * Update anomalib/data/base/video.py Co-authored-by: Samet Akcay <[email protected]> * Update anomalib/data/ucsd_ped.py Co-authored-by: Samet Akcay <[email protected]> * import video dataset from base * fix bug when collecting ucsd samples * clean up datamodules tests * fix tests * remove redundant test cases * retrieve masks as numpy array * use pathlib * variable name * pathlib * use preprocesser from arguments * fix indexing bug Co-authored-by: Joao P C Bertoldo <[email protected]> Co-authored-by: Samet Akcay <[email protected]> * properly handle batch processing * include batch index in rois tensor * return rkde results as lists * update default rkde config * add basic support for detection task * use enum for task type * formatting * small bugfix * add unit tests for bounding box conversion * update error message * use as_tensor * typing and docstring * explicit keyword arguments * simplify bbox handling in video dataset * docstring consistency * add missing licenses * add whitespace for readability * add missing license * Update anomalib/data/utils/boxes.py Co-authored-by: Samet Akcay <[email protected]> * Revert "Update anomalib/data/utils/boxes.py" This reverts commit cec6138. 
* add test case for custom collate function * docstring * add integration tests for detection dataloading * extend and clean up datamodules tests * add detection task type to visualizer tests * Update lightning_inference.py * only show pred_boxes during inference * add detection support for torch inference * add detection support for openvino inference * test inference for all task types * pylint * Make `val split ratio` configurable (#760) * make val split ratio configurable * use DeprecationWarning, update config key * Add support for Detection task type (#732) * add basic support for detection task * use enum for task type * formatting * small bugfix * add unit tests for bounding box conversion * update error message * use as_tensor * typing and docstring * explicit keyword arguments * simplify bbox handling in video dataset * docstring consistency * add missing licenses * add whitespace for readability * add missing license * Update anomalib/data/utils/boxes.py Co-authored-by: Samet Akcay <[email protected]> * Revert "Update anomalib/data/utils/boxes.py" This reverts commit cec6138. 
* add test case for custom collate function * docstring * add integration tests for detection dataloading * extend and clean up datamodules tests * add detection task type to visualizer tests * only show pred_boxes during inference * add detection support for torch inference * add detection support for openvino inference * test inference for all task types * pylint Co-authored-by: Samet Akcay <[email protected]> * [Datamodules] Update deprecation messages (#764) * update deprecation messages * raise warnings as DeprecationWarning * update rkde * Improve image source parsing for Folder dataset (#784) * mask -> mask_dir * properly handle absolute and relative paths * make root path parameter optional * formatting * path -> root * update docs * remove options hint for name parameter * refactor function * Update anomalib/config/config.py Co-authored-by: Samet Akcay <[email protected]> * Update anomalib/config/config.py Co-authored-by: Samet Akcay <[email protected]> * make root and abnormal_dir optional * Update anomalib/data/folder.py Co-authored-by: Samet Akcay <[email protected]> Co-authored-by: Samet Akcay <[email protected]> * Synthetic anomaly for testing and validation (#634) * move sample generation to datamodule instead of dataset * move sample generation from init to setup * remove inference stage and add base classes * replace dataset classes with AnomalibDataset * move setup to base class, create samples as class method * update docstrings * refactor btech to new format * allow training with no anomalous data * remove MVTec name from comment * raise NotImplementedError in base class * allow both png and bmp images for btech * use label_index to check if dataset contains anomalous images * refactor getitem in dataset class * use iloc for indexing * move dataloader getters to base class * refactor to add validate stage in setup * implement alternative datamodules solution * small improvements * improve design * remove unused constructor arguments * adapt 
btech to new design * add prepare_data method for mvtec * implement more generic random splitting function * update docstrings for folder module * ensure type consistency when performing operations on dataset * change imports * change variable names * replace pass with NotImplementedError * allow training on folder without test images * use relative path for normal_test_dir * fix dataset tests * update validation set parameter in configs * change default argument * use setter for samples * hint options for val_split_mode * update assert message and docstring * revert name change dataset vs datamodule * typing and docstrings * remove samples argument from dataset constructor * val/test -> eval * remove Split.Full from enum * sort samples when setting * update warn message * formatting * use setter when creating samples in dataset classes * add tests for new dataset class * add test case for label aware random split * update parameter name in inferencers * move _setup implementation to base class * address codacy issues * fix pylint issues * codacy * update example dataset config in docs * fix test * move base classes to separate files (avoid circular import) * add synthetic dataset class * move augmenter to data directory * add base classes * update docstring * use synthetic dataset in base datamodule * fix imports * clean up synthetic anomaly dataset implementation * fix mistake in augmenter * change default split ratio * remove accidentally added file * validation_split_mode -> val_split_mode * update docs * Update anomalib/data/base/dataset.py Co-authored-by: Joao P C Bertoldo <[email protected]> * get length from self.samples * assert unique indices * check is_setup for individual datasets Co-authored-by: Joao P C Bertoldo <[email protected]> * remove assert in __getitem_\ Co-authored-by: Joao P C Bertoldo <[email protected]> * Update anomalib/data/btech.py Co-authored-by: Joao P C Bertoldo <[email protected]> * clearer assert message * clarify list inversion in 
comment * comments and typing * validate contents of samples dataframe before setting * add file paths check * add seed to random_split function * fix expected columns * fix typo * add seed parameter to datamodules * set global seed in test entrypoint * add NONE option to valsplitmode * clarify setup behaviour in docstring * add logging message * use val_split_ratio for synthetic validation set * pathlib * make synthetic anomaly available for test set * update configs * add tests * simplify test set splitting logic * update docstring * add missing licence * split_normal_and_anomalous -> split_by_label * VideoAnomalib -> AnomalibVideo Co-authored-by: Joao P C Bertoldo <[email protected]> * Bugfixes for Datamodules feature branch (#800) * properly handle NoneType mask_dir and add test case * fix wrong deprecation handling * Deprecate PreProcessor (#795) * deprecate PreProcessor * update configs * update deprecation messages * update video dataset * update inference dataset * move transforms to data module * update and extend transform tests * fix cyclic import * add validity checks for image size and center crop * pass image size as tuple * update path to get_transforms * update error message * fix center crop tuple conversion * update inferencers * remove draem transform config * update changelog * fix cyclic import * add crop size vs image size check * improve readability * mypy * use enum to configure input normalization * update lightning inference * update inference dataset * expose more parameters and fix wrong return format * fix tdd tests * update config * [Datamodules] Fix bug in bbox score to image score conversion (#803) handle empty box predictions * update config * apply pixel threshold to bbox detections * remove confidence threshold parameter from rkde * hardcode steepness and offset * rename variable * remove unused parameters from config * Improve handling of `test_split_mode='none'` and `val_split_mode='none'` (#801) * enable none as split mode * 
use get to retrieve config keys * update deprecation message and config key * update config with new keys * remove unused parameter * set device in rpn stage * move prediction format conversion to lightning model * clean up torch model * move region- and feature-extractor to separate files * allow visualizing normal boxes * refactor * WIP: simplify region extractor * simplify region extractor * cleanup and docstrings * typing * expose max detections per image parameter * explain configurable parameters * fix wrong config value * remove unnecessary squeeze * box_likelihood -> rcnn_box_threshold * update comments * remove unnecessary typing * separate density estimation stage from torch model * improve readability * change default transform settings * fix to float transform * simplify feature extractor * normalize box scores * further simplify region extractor * update comment * improve prn configurability * remove unnecessary check * use enum for roi stage options * use enum for feature scaling method * re-order parameters * clean up model dir * fix bbox logic in base anomaly module * update key in output dict * boxes_scores -> box_scores * remove notebook * add comments and todo * Detection improvements (#820) * apply pixel threshold to bbox detections * allow visualizing normal boxes * normalize box scores * fix bbox logic in base anomaly module * boxes_scores -> box_scores * fix inferencers * add readme * update changelog * update changelog * update csflow config to new format * initialize max_length as empty tensor * include RKDE in model tests * remove unused imports * line length * move kde classifier to shared location * fix import * re-use RKDE classifier in DFKDE * remove old imports * docstrings * fix codacy issues * load feature extractor weights from url * suppress bandit warnings * use torch rng in augmenter * typing * add fit method to torch model * fix typo * use enum when checking stage * use tuple instead of list * add missing params to dosctring * 
add missing licence information * COLS -> COLUMNS * typing and variable naming * remove duplicate parameter in docstring * im_dir -> image_dir * typing and docstring * typing * ValSplitMode -> ValidationSplitMode * add missing licence * rename variable * remove empty comment * remove unused class attribute * [Detection] Compute box score when generating boxes from masks (#828) * infer box scores from anomaly maps * discard single pixel boxes * revert discard single pixel boxes * add test case for bbox scores * update torch inferencer * minor refactor * revert val_split_mode -> validation_split_mode Co-authored-by: Joao P C Bertoldo <[email protected]> Co-authored-by: Samet <[email protected]>
Description

Summary of the design

* `AnomalibDataset` and its subclasses now have a `setup` method, which is called from `DataModule._setup()`. When called, it creates the samples dataframe using the `create_dataset` function. `setup` must be called before the dataset class can be used. This allows the dataset to be instantiated in the constructor of the datamodule, before `prepare_data` is called.
* `DataModule` is responsible for performing the random subset splitting. It first creates the fixed subsets in the constructor, and then performs any additional subset splitting in `DataModule._setup()`.
* The `_setup` method of the `DataModule` is called only once, and is independent of the `stage` argument.
* On the `DataModule` side, helper functions can be added to the data utils (see `concatenate_datasets` and `random_split` for an example).

Responsibilities of the different classes:
* `create_dataset` functions: Create a dataframe with information about the samples, including any information about the fixed train/val/test split that may follow from the folder organization or annotation files of the dataset.
* `AnomalibDataset` and subclasses: Prepare dataset items and ground truth labels + masks to be used as model input.
* `AnomalibDataModule`: Create dataloaders and perform dynamic subset splitting.

Advantages
Known issues

* Some terminology is still a bit confusing, e.g. inconsistent use of `val`, `infer` and `test` (`transform_config_val` vs `pre_process_infer` vs `test_batch_size`).
* Tests need to be updated.
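
For illustration, the setup flow from the design summary can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the actual anomalib code: `SketchDataset`, `SketchDataModule`, `make_samples` and `split_samples` are hypothetical stand-ins for `AnomalibDataset`, `AnomalibDataModule`, a `create_dataset` function and the `random_split` helper. The sample table is a hard-coded list of dicts rather than a dataframe, and carving the validation set out of the test set is an assumption made for the example.

```python
import random
from typing import Dict, List, Optional, Tuple

Sample = Dict[str, str]


def make_samples(root: str) -> List[Sample]:
    """Hypothetical stand-in for a `create_dataset` function (entries are hard-coded)."""
    train = [{"image_path": f"{root}/train/good/{i:03d}.png", "split": "train"} for i in range(4)]
    test = [{"image_path": f"{root}/test/broken/{i:03d}.png", "split": "test"} for i in range(6)]
    return train + test


def split_samples(samples: List[Sample], ratio: float, seed: int) -> Tuple[List[Sample], List[Sample]]:
    """Hypothetical, simplified random split: shuffle with a fixed seed, then cut at `ratio`."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * ratio)
    return shuffled[:cut], shuffled[cut:]


class SketchDataset:
    """Cheap to construct; the sample table only exists after `setup()`."""

    def __init__(self, root: str, split: str) -> None:
        self.root = root
        self.split = split
        self._samples: Optional[List[Sample]] = None

    @property
    def is_setup(self) -> bool:
        return self._samples is not None

    def setup(self) -> None:
        # Build the sample table lazily, i.e. after prepare_data() has had a chance to run.
        if not self.is_setup:
            self._samples = [s for s in make_samples(self.root) if s["split"] == self.split]

    @property
    def samples(self) -> List[Sample]:
        if self._samples is None:
            raise RuntimeError("Call setup() before using the dataset.")
        return self._samples

    def __len__(self) -> int:
        return len(self.samples)


class SketchDataModule:
    def __init__(self, root: str, val_split_ratio: float = 0.5, seed: int = 0) -> None:
        # Fixed subsets are instantiated in the constructor, but not yet set up.
        self.train_data = SketchDataset(root, "train")
        self.test_data = SketchDataset(root, "test")
        self.val_samples: List[Sample] = []
        self.test_samples: List[Sample] = []
        self.val_split_ratio = val_split_ratio
        self.seed = seed
        self._is_setup = False

    def prepare_data(self) -> None:
        """Placeholder: a real datamodule would download or extract files here."""

    def setup(self, stage: Optional[str] = None) -> None:
        # _setup() runs only once and does not depend on `stage`.
        if not self._is_setup:
            self._setup()
            self._is_setup = True

    def _setup(self) -> None:
        self.train_data.setup()
        self.test_data.setup()
        # Additional (dynamic) subset splitting happens here, on the datamodule side.
        self.val_samples, self.test_samples = split_samples(
            self.test_data.samples, self.val_split_ratio, self.seed
        )


datamodule = SketchDataModule(root="data")
assert not datamodule.train_data.is_setup  # nothing is loaded in the constructor
datamodule.prepare_data()
datamodule.setup(stage="fit")
datamodule.setup(stage="test")  # no-op: _setup() already ran
print(len(datamodule.train_data), len(datamodule.val_samples), len(datamodule.test_samples))  # prints: 4 3 3
```

Because the datasets are only populated in `setup()`, constructing the datamodule touches no files, so `prepare_data` can still run first.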