
Update how models are compared using snakemake #42

Merged: 48 commits into dev on Mar 9, 2023

Conversation

@enryH enryH commented Feb 20, 2023

Principles:

  1. One script per imputation technique (here: notebooks executed using papermill)
  2. Dump predictions in original data space (here normally: log2 transformed intensities)
    a. for splits of simulated missing data
    b. and predictions of missing data
  3. Allow for model specific pre-processing in each script possible (but ensure dumps are re-transformed)
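The principles above can be sketched as a single papermill driver per imputation technique. The notebook names, directory layout, and parameter keys below are illustrative placeholders, not the repository's actual files:

```python
# Hypothetical sketch: execute one parameterized notebook per imputation model
# with papermill. Each notebook may pre-process its inputs however it likes,
# but dumps predictions back in the original data space
# (here normally: log2-transformed intensities).
def notebook_run_args(model: str, dataset_dir: str) -> dict:
    """Build the papermill arguments for one model's training notebook.

    All names here are illustrative, not the project's real paths.
    """
    return {
        "input_path": f"notebooks/{model}_train.ipynb",
        "output_path": f"{dataset_dir}/models/{model}/{model}_train.ipynb",
        "parameters": {"dataset_dir": dataset_dir, "model_key": model},
    }

if __name__ == "__main__":
    import papermill as pm  # third-party: pip install papermill

    args = notebook_run_args("VAE", "runs/dev_dataset")
    pm.execute_notebook(**args)  # runs the notebook with injected parameters
```

One driver call per model keeps each technique's script independent, which is what makes the later per-model Snakemake rules possible.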

Henry added 30 commits January 12, 2023 18:10
- Ubuntu latest (22.04) seems not to be compatible
- long installations fail: conda-incubator/setup-miniconda#116
- try setting up manually (to avoid updating env)
- try if mamba implementation works on runner instance
- runner instance seems to run into memory issue
  (Kubernetes pod error hints at that)
- improve cmd interface for two key notebooks (as scripts)
- mamba (replacement for conda) better for large environment
- index is wrapped into iterable
  (e.g. for 0 as integer indicating index columns)
- remove "%%time" cell magic -> it lets papermill
  miss errors in a cell
- remove old input
- drop deprecated parameter from read_csv
- option for csv would need to be specified
- raise error when column Index (level) has no name.
- format should be updated eventually
  (reading a seq of configs and metrics with same schema)
- exit before DISEASES part
  -> should ideally be separated into a new script
- functionality copied from API description:
  https://www.uniprot.org/help/id_mapping
- example provided
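The ID mapping flow from the linked UniProt help page can be sketched as follows; the endpoint path follows that API description, while the database names and accession IDs here are only examples:

```python
# Minimal sketch of submitting a UniProt ID mapping job, per
# https://www.uniprot.org/help/id_mapping (field values are illustrative).
from urllib.parse import urlencode

API = "https://rest.uniprot.org/idmapping"

def mapping_request(ids, from_db="UniProtKB_AC-ID", to_db="Gene_Name"):
    """Build the URL-encoded form body for an ID mapping job submission."""
    return urlencode({"from": from_db, "to": to_db, "ids": ",".join(ids)})

if __name__ == "__main__":
    import urllib.request

    body = mapping_request(["P05067", "P12345"]).encode()
    # POST returns a jobId, which is then polled for the mapped results
    with urllib.request.urlopen(f"{API}/run", data=body) as resp:
        print(resp.read())
```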
- disentangle preprocessing from analysis
- metadata expected for long-wide format transformations
start with single dev dataset

- name parameters consistently
- model_key: save and use as given (easier for connecting)
- model abbrev.: RSN, CF, DAE, VAE (make consistent)
- move helper function
- update parameter parsing
- collect dump figure paths
- move one function to package
- pick up sample index name automatically
- model_key and model needed currently for grid search
- "id" (constructed on loading config and metric files)
 + "model" name should be a unique combination in future
- CF model: batch_size (not batch_size_collab)

- ToDo: model pred should be saved by default (as currently done)
- add separate rules for interpolation and median imputation
  -> more separation
- possibility: Optimize models one by one,
  write results and configs to shared database
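The median-imputation baseline that gets its own rule is simple enough to sketch with a pure-Python stand-in (the actual workflow operates on intensity tables; this helper name is illustrative):

```python
# Sketch of median imputation: replace each missing value in a feature's
# vector with the median of the observed values.
from statistics import median

def impute_median(values):
    """Replace None (missing intensity) with the median of observed values."""
    observed = [v for v in values if v is not None]
    med = median(observed)
    return [med if v is None else v for v in values]
```

Keeping this in a separate rule (alongside interpolation) means the cheap baselines never depend on the neural-network training rules.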
- scripts need to be further adapted
Henry added 18 commits February 20, 2023 21:17
- update model abbrev in notebooks (CF, DAE, VAE, RSN)
- missing values (not real_na in comments)
- fake NA -> simulated missing values
- drop n_obs from plot
- each model type will have its own rules

ToDo:
- convoluted setup needs cleaning
- folder layout will be adapted
- prepare for export to separate snakefiles
Make it easier to extend the search with further models

- folder_dataset/models/{model}/
- folder_dataset/models/{model}/run_id/

> single files are not dumped in subfolders (-> metrics/models)

Folders (ordered by hierarchy)
- folder_dataset: dataset specific folder
  (here: level of HeLa development dataset)
- root_model: all model runs (each in subfolder)
- run_id_template: models with different hyperparameters

ToDo: settle on how a model script has to look -> each run should
  produce the same folder structure.
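The hierarchy above can be expressed as a small path helper; the function and argument names are illustrative, the layout follows the folders described in this commit:

```python
# Sketch of the run-directory layout:
#   folder_dataset/models/{model}/{run_id}/
# folder_dataset is the dataset-specific root (e.g. the HeLa development
# dataset level), each model gets a subfolder, and each hyperparameter
# combination gets its own run_id directory inside it.
from pathlib import Path

def run_dir(folder_dataset: str, model: str, run_id: str) -> Path:
    """Return the directory for one model run."""
    return Path(folder_dataset) / "models" / model / run_id
```

A fixed helper like this is one way to guarantee every model script produces the same folder structure.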
- remodel dumping of (real) missing values
- single experiment model setup needs to be adapted
- on the way to better composition
-> next: refactor interpolation
- notebook needs a refactoring
- median vs Median
- best average performance for models with a latent dimension
  -> only models with a latent dimension

Notebook should be cleaned
- metrics format simplified (more general)
- remove interpolation dependence

- remove plot from training scripts
- not imputed -> "prop" gives the share a specific method can impute
- adapt comparison "performance_plots"

-> metrics and "subset" had to be adapted for grid_search

- median: Median
@enryH enryH merged commit 45125ba into dev Mar 9, 2023
@enryH enryH deleted the model_comp_update branch March 9, 2023 08:33