Dev #60

Merged
merged 14 commits into main from dev
Sep 6, 2024


tlarcher
Collaborator

@tlarcher tlarcher commented Sep 6, 2024

📝 Changelog

Major

  • Added GLC24 pre_extracted habitat dataset and example (see PR 58 in the Links section)

  • Changed how checkpoints are loaded: the state_dict is now loaded into the LightningModule instead of into the model object. This is a breaking change: the examples had to be updated to remove the replacement of the "model." string in the loaded state_dict.

  • Added possibility to download model weights for any Malpolon model given a URL and a few file paths

  • Updated the way checkpoint_path is passed on to models. Added an attribute checkpoint_path for all Malpolon models

    • Updated all examples accordingly
  • Added Malpolon as (local) model provider.

    • Created new module malpolon.models.custom_models which will host custom models proposed by Malpolon
    • Split classes from geolifeclef2024_multimodal_ensemble.py into glc2024_multimodal_ensemble_model.py and glc2024_pre_extracted_prediction_system.py in custom_models, to prevent a circular import from malpolon.models.model_builder after adding Malpolon as a (local) provider
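
One common way to register local models without importing them when the builder module loads (and so without risking the kind of circular import mentioned above) is to store import targets in a dictionary and resolve them only on lookup. The sketch below illustrates that idea only. The registry contents, the `resolve_model` helper and the standard-library stand-in classes are assumptions made for illustration, not Malpolon's actual `model_builder` code.

```python
import importlib

# Hypothetical registry: model name -> (module path, attribute name).
# Standard-library classes stand in for real model classes so the sketch runs anywhere.
LOCAL_MODELS = {
    "counter": ("collections", "Counter"),
    "ordered": ("collections", "OrderedDict"),
}


def resolve_model(name):
    """Import the module only when the model is requested, then return the class."""
    try:
        module_path, attr = LOCAL_MODELS[name]
    except KeyError:
        raise ValueError(f"Unknown local model: {name!r}") from None
    module = importlib.import_module(module_path)
    return getattr(module, attr)


model_cls = resolve_model("counter")
print(model_cls.__name__)
```

Because nothing is imported until `resolve_model` is called, the module holding the registry can itself be imported by the modules it refers to.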

Minor

  • Updated malpolon.data.data_module.export_predict_csv to enable more flexibility when outputting the prediction CSV for a single data point.

Examples

  • Added GLC24 pre-extracted examples (habitat and species) using the MultiModalEnsemble (MME) model
    • Automatic download of the dataset from Kaggle (depending on the value of boolean config parameter data.download_data)
    • Automatic download of the model weights from Seafile if not already on disk, via a new model.model_kwargs.pretrained key in the config file. The weights enable users to directly run our MME model on our GLC24_pre_extracted test set and reach ~30% micro F1-score with ~26% micro precision and ~36% micro recall, as well as ~96% micro AUC.
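
The "download only if not already on disk" behavior described above can be sketched with the standard library alone. This is an illustration under stated assumptions: the `ensure_weights` name, its parameters and the injectable `fetch` argument are hypothetical and do not reflect Malpolon's actual download helper or config handling.

```python
from pathlib import Path
from urllib.request import urlretrieve


def ensure_weights(url, dest, fetch=urlretrieve):
    """Download `url` to `dest` unless the file already exists.

    Returns (path, downloaded), where `downloaded` tells whether a fetch happened.
    `fetch` is injectable so the logic can be exercised without network access.
    """
    dest = Path(dest)
    if dest.exists():
        # Weights are already on disk: skip the download.
        return dest, False
    dest.parent.mkdir(parents=True, exist_ok=True)
    fetch(url, str(dest))
    return dest, True
```

A call such as `ensure_weights(weights_url, "weights/mme.ckpt")`, with `weights_url` taken from the config, would download the file on the first run and return immediately on later runs.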

Tests

  • Added and updated unit tests for GLC24 pre-extracted examples (habitat and species)

🔗 Links

✅ Checklist

  • Lint and tests pass locally with my changes
  • I've added necessary documentation

tlarcher and others added 14 commits August 13, 2024 10:15
* Moved custom models to the new module 'malpolon.models.custom_models'. This includes the glc24 pre_extracted MME model and multi_modal.py. For MME, the classification system and the nn module have been split into two files so that MME can be called from model_builder without triggering a circular import through check_model. Updated examples accordingly.

* Fix: state_dict altered during training.
- state_dict contains a loss parameter pos_weight as key loss.pos_weight. This key is created when the loss is instantiated by GenericPredictionSystem. However, this loss parameter was accessed and modified during the _step() process, which also alters the state_dict. Consequently, when loading the model by its checkpoint, there would be a value mismatch and the model would not load to resume training. This has been fixed by restoring the initial value of the loss parameter within the _step() function before the return statement.
- 'positive_weigh_factor' model hyperparameter has been deleted and replaced by loss parameter 'pos_weight', which achieves the same purpose. In the config file, the 'positive_weigh_factor' model key has been replaced by the subkey 'pos_weight', nested under 'loss_kwargs' in the optimizer section
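
The fix described in this commit, restoring the loss parameter before `_step()` returns so that the saved state matches the state at instantiation, can be illustrated with a small pure-Python sketch. `ToyLoss`, `step` and `batch_scale` are hypothetical stand-ins, and the sketch uses `try/finally` for the restoration, which is a choice made here rather than something the commit states.

```python
import copy


class ToyLoss:
    """Stand-in for a loss object whose `pos_weight` is part of the saved state."""

    def __init__(self, pos_weight):
        self.pos_weight = list(pos_weight)

    def state_dict(self):
        return {"pos_weight": list(self.pos_weight)}


def step(loss, batch_scale):
    """Use a temporarily rescaled pos_weight, then restore the initial value."""
    initial = copy.copy(loss.pos_weight)
    try:
        loss.pos_weight = [w * batch_scale for w in initial]
        # ... the loss would be computed here with the rescaled weights ...
        return sum(loss.pos_weight)
    finally:
        # Restore the value so a checkpoint saved afterwards matches the initial state.
        loss.pos_weight = initial


loss = ToyLoss([1.0, 2.0])
before = loss.state_dict()
step(loss, batch_scale=3.0)
assert loss.state_dict() == before
```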

* Cleaned up remnants of the previous commit's testing

* Added weight download option for all classification systems and updated checkpoint_path call for MME example

* Fixed wrong checkpoint_path path initialization behavior.
- glc24_cnn_multimodal_ensemble: updated example config file and main script to new checkpoint_path behavior, in both training and inference runs
- standard_prediction_systems.py: Fixed wrong checkpoint_path path initialization behavior
- glc2024_pre_extracted_prediction_system.py: added missing checkpoint_path argument and removed checkpoint_path setter as it is carried out by GenericPredictionSystem

* Updated example cnn_on_rgbnir_torchgeo following checkpoint_path update

* Updated example cnn_on_rgbnir_concat following checkpoint_path update

* Updated example cnn_on_rgbnir_glc23_patches following checkpoint_path update

* Reset yaml file glc23 example

* Fixed wrong variable assignment in examples micro_geolifeclef2022/cnn_on_rgb_nir_patches and micro_geolifeclef2022/cnn_on_rgb_patches

* Added predict run part in example geolifeclef2022/cnn_on_rgb_patches and updated main script following checkpoint_path update.
- data_module: Added more flexibility for predictions without targets
- geolifeclef2022 dataset: Added default -1 value for targets in predict mode to comply with standard_prediction_system predict() method
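
The "default -1 target in predict mode" convention mentioned in this commit can be sketched as a dataset that always returns a `(sample, target)` pair, so downstream code does not need a separate branch for unlabeled data. `ToyPatchDataset` is a hypothetical class written for illustration, not the actual geolifeclef2022 dataset.

```python
class ToyPatchDataset:
    """Hypothetical dataset returning (sample, target), with -1 when no label is known."""

    def __init__(self, samples, targets=None):
        self.samples = samples
        self.targets = targets  # None in predict mode

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        # Fall back to the sentinel -1 so the return shape is the same in every mode.
        target = -1 if self.targets is None else self.targets[idx]
        return self.samples[idx], target


predict_set = ToyPatchDataset(["patch_a", "patch_b"])
print(predict_set[0])
```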

* Updated glc22 and microglc22 examples following checkpoint_path update, and added an inference part to the run section of those that lacked one. To that end, added an input argument to the custom GLC22 datamodules and a model output in prediction mode.

* Updated CIFAR-10 example following checkpoint_path update

* Updated all inference examples following checkpoint_path update

* Removed duplicate import

* Updated code docstrings

* Fixed task value from binary to multilabel (doesn't change behavior)

* Added 'malpolon' as model providers.
- model_builder: Added provider method and created new dictionary with model names as keys, and local imports of models as values

- data_module: Added possibility of applying no activation function when running inference, so as to output the model's logits. Enhanced CSV export method's info prints.

- glc2024_multimodal_ensemble_model: Added new init argument and class attribute 'pretrained', which the datamodule uses to determine whether to download pretrained weights (formerly, the datamodule used a standalone 'weights_download' variable). Added docstrings.

- glc2024_pre_extracted_prediction_system: Changed how the model's loss is handled during '_step()' to prevent overwriting the loss parameter during training, which de-synchronized the state_dict() before and after running the model (since loss parameters are automatically added as learnable parameters)
- glc24_cnn_multimodal_ensemble.yaml: Updated config file accordingly. Cleaned config file with correct values.
- glc24_cnn_multimodal_ensemble.py: Updated MME main script accordingly. Changed activation function of inference run from softmax() to sigmoid()
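
Why the switch from softmax() to sigmoid() matters for a multilabel task can be seen on a toy example. Softmax makes the classes compete for a total probability of 1, whereas sigmoid scores each class independently, so several classes can be predicted at once. The logits and the 0.5 threshold below are arbitrary illustrative values, not taken from the MME example.

```python
import math


def softmax(logits):
    """Normalize logits into probabilities that sum to 1 (classes compete)."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(logits):
    """Map each logit to an independent probability in (0, 1)."""
    return [1.0 / (1.0 + math.exp(-x)) for x in logits]


logits = [2.0, 2.0, -3.0]  # two classes are strongly supported
soft = softmax(logits)
sig = sigmoid(logits)

# With a 0.5 threshold, softmax selects nothing because the two likely
# classes split the probability mass, while sigmoid selects both.
selected_softmax = [i for i, p in enumerate(soft) if p >= 0.5]
selected_sigmoid = [i for i, p in enumerate(sig) if p >= 0.5]
print(selected_softmax, selected_sigmoid)
```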

* Updated glc22 tests following class getter changes

* Removed commented dict
* Corrected docstring

* Improved the script that splits CSV observations by species frequency: added callable arguments, reduced computation time, added comments, and made it more generic. Renamed the script to split_obs_per_column_frequency.py

* Fixed unwanted behavior and further improved split_obs_per_column_frequency.py

* Fixed output test name syntax being different from the other splits

* Renamed or deleted files

* Added inference metrics evaluation scripts and output files for GLC24 MME model. Added the top25 predictions files as they are not heavy.

* Added specific .gitignore for MME inference and evaluation folder to exclude heavy files only

* Updated values of previously created .gitignore

* Moved and updated previous specific .gitignore

* Added entries to root .gitignore

* Added task selection (multilabel or other) in malpolon.data.datasets.geolifeclef2024_pre_extracted.GLC24Datamodule

* WiP: glc24 mme habitat integration

* Corrected typos

* Changed multiclass prediction filtering to keep all predictions and probas out of predict_logits_to_class()

* Fixed GLC24 mme habitat download method

* Reset glc24 mme habitat config file

* Added GLC24 MME habitat model dataset as new Malpolon dataset within malpolon.data.datasets.geolifeclef2024_pre_extracted

* Renamed inference evaluation script for GLC24 pre-extracted examples. Added docstrings to said examples.

* Fixed habitat dataset folder not being created before calling symbolic links

* Added docstrings and linting

* Removed unnecessary files

* Added glc24 pre-extracted species unit test

* Added glc24_pre_extracted examples

* Updated test_examples pytest run skips and cleaned file.

* linting

* Docstrings glc24_pre_extracted
…predict_point': Changed checkpoint state_dict loading from model to LightningModule (breaking changes). Added iterable data type compatibility.
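
The breaking change named in this commit comes down to which object receives the checkpoint's state_dict, and therefore whether the key prefix has to be rewritten first. The sketch below uses plain dictionaries with placeholder values. It assumes the network is stored under an attribute named `model`, as the "model." prefix suggests, and both helper functions are hypothetical names introduced for illustration.

```python
# Keys as they would appear in a checkpoint saved from a wrapper module that
# holds the network under an attribute named `model` (values are placeholders).
checkpoint = {
    "state_dict": {
        "model.fc.weight": "W",
        "model.fc.bias": "b",
        "loss.pos_weight": "p",
    }
}


def keys_for_inner_model(state_dict, prefix="model."):
    """Former approach: keep only the network's entries and strip the prefix."""
    return {k[len(prefix):]: v for k, v in state_dict.items() if k.startswith(prefix)}


def keys_for_wrapper(state_dict):
    """New approach: hand the state_dict to the wrapper module unchanged."""
    return dict(state_dict)


print(sorted(keys_for_inner_model(checkpoint["state_dict"])))
print(sorted(keys_for_wrapper(checkpoint["state_dict"])))
```

With the new approach, entries that do not belong to the network, such as `loss.pos_weight`, are loaded along with the weights, and user scripts no longer need any key rewriting.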
@tlarcher tlarcher self-assigned this Sep 6, 2024
@tlarcher tlarcher merged commit 93fd2c3 into main Sep 6, 2024
1 check passed
@tlarcher tlarcher deleted the dev branch September 6, 2024 20:45