
Update how models are compared using snakemake #42

Merged: 48 commits into dev on Mar 9, 2023

Conversation

@enryH enryH commented Feb 20, 2023

Principles:

  1. One script per imputation technique (here: notebooks executed using papermill)
  2. Dump predictions in original data space (here normally: log2 transformed intensities)
    a. for splits of simulated missing data
    b. and predictions of missing data
  3. Allow for model specific pre-processing in each script possible (but ensure dumps are re-transformed)
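The principles above can be sketched as a single papermill driver per imputation technique. The notebook names, directory layout, and parameter keys below are illustrative placeholders, not the repository's actual files:

```python
# Hypothetical sketch: execute one parameterized notebook per imputation model
# with papermill. Each notebook may pre-process its inputs however it likes,
# but dumps predictions back in the original data space
# (here normally: log2-transformed intensities).
def notebook_run_args(model: str, dataset_dir: str) -> dict:
    """Build the papermill arguments for one model's training notebook.

    All names here are illustrative, not the project's real paths.
    """
    return {
        "input_path": f"notebooks/{model}_train.ipynb",
        "output_path": f"{dataset_dir}/models/{model}/{model}_train.ipynb",
        "parameters": {"dataset_dir": dataset_dir, "model_key": model},
    }

if __name__ == "__main__":
    import papermill as pm  # third-party: pip install papermill

    args = notebook_run_args("VAE", "runs/dev_dataset")
    pm.execute_notebook(**args)  # runs the notebook with injected parameters
```

One driver call per model keeps each technique's script independent, which is what makes the later per-model Snakemake rules possible.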

Henry added 30 commits January 12, 2023 18:10
- Ubuntu latest (22.04) seems not to be compatible
- long installations fail: conda-incubator/setup-miniconda#116
- try setting up manually (to avoid updating env)
- try if mamba implementation works on runner instance
- runner instance seems to run into memory issue
  (Kubernetes pod error hints at that)
- improve cmd interface for two key notebooks (as scripts)
- mamba (replacement for conda) better for large environment
- index is wrapped into iterable
  (e.g. for 0 as integer indicating index columns)
- remove "%%time" cell magic -> it lets papermill
  miss errors in a cell
- remove old input
- drop deprecated parameter from read_csv
- option for csv would need to be specified
- raise error when column Index (level) has no name.
- format should be updated eventually
  (reading a seq of configs and metrics with same schema)
- exit before DISEASES part
  -> should ideally be separated into a new script
- functionality copied from API description:
  https://www.uniprot.org/help/id_mapping
- example provided
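The ID mapping flow from the linked UniProt help page can be sketched as follows; the endpoint path follows that API description, while the database names and accession IDs here are only examples:

```python
# Minimal sketch of submitting a UniProt ID mapping job, per
# https://www.uniprot.org/help/id_mapping (field values are illustrative).
from urllib.parse import urlencode

API = "https://rest.uniprot.org/idmapping"

def mapping_request(ids, from_db="UniProtKB_AC-ID", to_db="Gene_Name"):
    """Build the URL-encoded form body for an ID mapping job submission."""
    return urlencode({"from": from_db, "to": to_db, "ids": ",".join(ids)})

if __name__ == "__main__":
    import urllib.request

    body = mapping_request(["P05067", "P12345"]).encode()
    # POST returns a jobId, which is then polled for the mapped results
    with urllib.request.urlopen(f"{API}/run", data=body) as resp:
        print(resp.read())
```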
- disentangle preprocessing from analysis
- metadata expected for long-wide format transformations
start with single dev dataset

- name parameters consistently
- model_key: save and use as given (easier for connecting)
- model abbrev.: RSN, CF, DAE, VAE (make consistent)
- move helper function
- update parameter parsing
- collect dump figure paths
- move one function to package
- pick up sample index name automatically
- model_key and model needed currently for grid search
- "id" (constructed on loading config and metric files)
 + "model" name should be a unique combination in future
- CF model: batch_size (not batch_size_collab)

- ToDo: model pred should be saved by default (as currently done)
- add separate rules for interpolation and median imputation
  -> more separation
- possibility: Optimize models one by one,
  write results and configs to shared database
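The median-imputation baseline that gets its own rule is simple enough to sketch with a pure-Python stand-in (the actual workflow operates on intensity tables; this helper name is illustrative):

```python
# Sketch of median imputation: replace each missing value in a feature's
# vector with the median of the observed values.
from statistics import median

def impute_median(values):
    """Replace None (missing intensity) with the median of observed values."""
    observed = [v for v in values if v is not None]
    med = median(observed)
    return [med if v is None else v for v in values]
```

Keeping this in a separate rule (alongside interpolation) means the cheap baselines never depend on the neural-network training rules.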
- scripts need to be further adapted
Henry added 18 commits February 20, 2023 21:17
- update model abbrev in notebooks (CF, DAE, VAE, RSN)
- missing values (not real_na in comments)
- fake NA -> simulated missing values
- drop n_obs from plot
- each model type will have its own rules

ToDo:
- convoluted setup needs cleaning
- folder layout will be adapted
- prepare for export to separate snakefiles
Make it easier to extend the search with further models

- folder_dataset/models/{model}/
- folder_dataset/models/{model}/run_id/

> single files are not dumped in subfolders (-> metrics/models)

Folders (ordered by hierarchy)
- folder_dataset: dataset specific folder
  (here: level of HeLa development dataset)
- root_model: all model runs (each in subfolder)
- run_id_template: models with different hyperparameters

ToDo: settle on how a model script has to look -> each run should
  produce the same folder structure.
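The hierarchy above can be expressed as a small path helper; the function and argument names are illustrative, the layout follows the folders described in this commit:

```python
# Sketch of the run-directory layout:
#   folder_dataset/models/{model}/{run_id}/
# folder_dataset is the dataset-specific root (e.g. the HeLa development
# dataset level), each model gets a subfolder, and each hyperparameter
# combination gets its own run_id directory inside it.
from pathlib import Path

def run_dir(folder_dataset: str, model: str, run_id: str) -> Path:
    """Return the directory for one model run."""
    return Path(folder_dataset) / "models" / model / run_id
```

A fixed helper like this is one way to guarantee every model script produces the same folder structure.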
- remodel dumping of (real) missing values
- single experiment model setup needs to be adapted
- on the way to better composition
-> next: refactor interpolation
- notebook needs a refactoring
- median vs Median
- best average performance for models with a latent dimension
  -> only models with a latent dimension

Notebook should be cleaned
- metrics format simplified (more general)
- remove interpolation dependence

- remove plot from training scripts
- not imputed -> "prop" gives the share a specific method can impute
- adapt comparison "performance_plots"

-> metrics and "subset" had to be adapted for grid_search

- median: Median
@enryH enryH merged commit 45125ba into dev Mar 9, 2023
@enryH enryH deleted the model_comp_update branch March 9, 2023 08:33