Comparing changes

base repository: kapsner/mlexperiments
base: edb76c3
head repository: kapsner/mlexperiments
compare: v0.0.1
Commits on Oct 1, 2022

  1. feat: initial commits

    added hyperparameter optimization (grid and bayesian);
    added hyperparameter validation with cross validation;
    added first survival learner surv_glmnet_cox
    kapsner committed Oct 1, 2022

    07f39da
  2. 772bba0
  3. 9ca9385
  4. 2afba3f
  5. cc821d6

Commits on Oct 2, 2022

  1. feat: implemented nested cv

    kapsner committed Oct 2, 2022
    47e0669
  2. d54475c
  3. feat: add xgboost surv

    kapsner committed Oct 2, 2022
    415be47
  4. c73e20c
  5. c8db3b7

Commits on Oct 4, 2022

  1. 576066e
  2. 03daecd
  3. fix: fixed issue with prediction in ranger

    and further small enhancements
    kapsner committed Oct 4, 2022
    b71f4f8
  4. refactor: learners to mllrnrs package

    plus general utility functions to kdry package
    kapsner committed Oct 4, 2022
    b1ee0da

Commits on Oct 5, 2022

  1. 5932e6c
  2. feat: implemented knn grid / nested grid

    bayesian tuning yet to do
    kapsner committed Oct 5, 2022
    5e1d910
  3. 4b6d974

Commits on Oct 6, 2022

  1. 70697fa
  2. chore: updated readme

    kapsner committed Oct 6, 2022
    5fc4a0a
  3. c697e5f
  4. 0ce1f6c
  5. ci: fixed linting errors

    kapsner committed Oct 6, 2022
    a3d3eb0
  6. fix: removing duplicates by name

    instead of value
    kapsner committed Oct 6, 2022
    01be901
  7. 015d5f6

Commits on Oct 7, 2022

  1. 401bdf1

Commits on Oct 8, 2022

  1. refactor: removed metric_perf_h_b

    and moved the learner to be instantiated beforehand and stored as private,
    in preparation for defining optim_higher_better as well as the
    optimization metric / performance metric during instantiation
    kapsner committed Oct 8, 2022
    e63b3da
  2. refactor: metric-performance to cv-class now

    also added predict_args and metric_args
    kapsner committed Oct 8, 2022
    d4209e5
  3. d9f6f67

Commits on Oct 9, 2022

  1. ci: refactored unit tests

    and implemented early failing in case of learners that have no
    parameter tuning like glm and lm
    kapsner committed Oct 9, 2022
    d003088

Commits on Oct 10, 2022

  1. fix: fixed bug in bayesian scoring function

    that occurred when optimizing with metric_higher_better=FALSE;
    now, the score for parbayesianoptimization is multiplied by -1
    in these cases in order to guide the bayesian search in the right
    direction
    kapsner committed Oct 10, 2022
    a1214e1
  2. ci: fixed test to comply with cran checks

    limiting ncores to 2L
    kapsner committed Oct 10, 2022
    8026711
  3. 771a3f1
  4. chore: reformatting message

    kapsner committed Oct 10, 2022
    74fed90

Commits on Oct 11, 2022

  1. refactor: learner to self

    kapsner committed Oct 11, 2022
    263ae51
  2. chore: updated news.md

    kapsner committed Oct 11, 2022
    575266e
  3. c4f5287

Commits on Oct 12, 2022

  1. ci: fixed linting error

    kapsner committed Oct 12, 2022
    35190b6
  2. 54a1ec6

Commits on Oct 15, 2022

  1. 94705eb

Commits on Oct 16, 2022

  1. 54766cc
  2. 294e9f3

Commits on Oct 17, 2022

  1. 081ac26

Commits on Oct 18, 2022

  1. 77e06c8
  2. eba4a91
  3. d06d056

Commits on Oct 19, 2022

  1. 21102f0
  2. 6c0084f

Commits on Oct 20, 2022

  1. docs: updated documentation

    kapsner committed Oct 20, 2022
    0ff5511

Commits on Oct 21, 2022

  1. 6fa520c

Commits on Oct 22, 2022

  1. docs: working on vignette

    kapsner committed Oct 22, 2022
    66177d3
Showing with 10,265 additions and 50 deletions.
  1. +1 −0 .gitignore
  2. +27 −8 DESCRIPTION
  3. +15 −1 NAMESPACE
  4. +258 −15 NEWS.md
  5. +272 −0 R/cv_class_base.R
  6. +307 −0 R/cv_class_nested.R
  7. +162 −0 R/cv_helper_base.R
  8. +54 −0 R/cv_helper_nested.R
  9. +193 −0 R/cv_performance.R
  10. +131 −0 R/cv_predictions.R
  11. +243 −0 R/learner_class_base.R
  12. +110 −0 R/learner_glm.R
  13. +212 −0 R/learner_knn.R
  14. +106 −0 R/learner_lm.R
  15. +352 −0 R/learner_rpart.R
  16. +149 −0 R/ml_class_base.R
  17. +90 −0 R/ml_helper_base.R
  18. +42 −0 R/optimizer_class_base.R
  19. +35 −0 R/optimizer_class_bayesian.R
  20. +10 −0 R/optimizer_class_grid.R
  21. +51 −0 R/optimizer_helper_base.R
  22. +138 −0 R/optimizer_helper_bayesian.R
  23. +77 −0 R/optimizer_helper_grid.R
  24. +195 −0 R/tune_class_base.R
  25. +170 −0 R/tune_helper_base.R
  26. +10 −0 R/utils_eval_params.R
  27. +20 −0 R/utils_method_params_refactor.R
  28. +227 −0 R/utils_metrics.R
  29. +127 −0 R/utils_validate_fold_equality.R
  30. +22 −1 R/zzz.R
  31. +330 −8 README.md
  32. +5 −0 data-raw/.gitignore
  33. +79 −14 data-raw/devstuffs.R
  34. +30 −0 data-raw/wiki.R
  35. +105 −0 man/LearnerGlm.Rd
  36. +121 −0 man/LearnerKnn.Rd
  37. +105 −0 man/LearnerLm.Rd
  38. +137 −0 man/LearnerRpart.Rd
  39. +94 −0 man/MLBase.Rd
  40. +358 −0 man/MLCrossValidation.Rd
  41. +196 −0 man/MLExperimentsBase.Rd
  42. +370 −0 man/MLLearnerBase.Rd
  43. +426 −0 man/MLNestedCV.Rd
  44. +283 −0 man/MLTuneParameters.Rd
  45. +40 −0 man/metric.Rd
  46. +47 −0 man/metric_types_helper.Rd
  47. +108 −0 man/performance.Rd
  48. +84 −0 man/predictions.Rd
  49. +93 −0 man/validate_fold_equality.Rd
  50. 0 rpkgTemplate.Rproj → mlexperiments.Rproj
  51. +2 −2 tests/testthat.R
  52. +97 −0 tests/testthat/test-fold_equality.R
  53. +155 −0 tests/testthat/test-glm.R
  54. +219 −0 tests/testthat/test-glm_predictions.R
  55. +298 −0 tests/testthat/test-knn.R
  56. +147 −0 tests/testthat/test-lm.R
  57. +244 −0 tests/testthat/test-rpart_classification.R
  58. +244 −0 tests/testthat/test-rpart_regression.R
  59. +5 −1 tic.R
  60. +2 −0 vignettes/.gitignore
  61. +282 −0 vignettes/mlexperiments_knn_binary.Rmd
  62. +282 −0 vignettes/mlexperiments_knn_multiclass.Rmd
  63. +370 −0 vignettes/mlexperiments_rpart_binary.Rmd
  64. +284 −0 vignettes/mlexperiments_rpart_multiclass.Rmd
  65. +369 −0 vignettes/mlexperiments_rpart_regression.Rmd
  66. +448 −0 vignettes/mlexperiments_starter.Rmd
1 change: 1 addition & 0 deletions .gitignore
@@ -48,3 +48,4 @@
/Meta/
docs
Meta
inst/doc
35 changes: 27 additions & 8 deletions DESCRIPTION
@@ -1,24 +1,43 @@
Package: rpkgTemplate
Title: Template for creating R packages
Version: 0.0.0.9003
Package: mlexperiments
Title: Toolkit for Machine Learning Experiments
Version: 0.0.1
Authors@R:
person("Lorenz A.", "Kapsner", , "lorenz.kapsner@gmail.com", role = c("cre", "aut", "cph"),
comment = c(ORCID = "0000-0003-1866-860X"))
Description: A template repository for creating R packages.
Description: A set of functions to perform machine learning experiments,
such as (nested) cross-validation and hyperparameter optimization.
License: GPL (>= 3)
URL: https://github.com/kapsner/rpkgTemplate
BugReports: https://github.com/kapsner/rpkgTemplate/issues
URL: https://github.com/kapsner/mlexperiments
BugReports: https://github.com/kapsner/mlexperiments/issues
Depends:
R (>= 2.10)
Imports:
data.table,
magrittr
kdry,
parallel,
progress,
R6,
splitTools,
stats
Suggests:
class,
datasets,
ggpubr,
knitr,
lintr,
mlbench,
mlr3measures,
ParBayesianOptimization,
rpart,
testthat (>= 3.0.1)
VignetteBuilder:
knitr
Date/Publication: 2022-09-14 12:32:28 UTC
Remotes:
github::kapsner/kdry@main,
github::mayer79/splitTools@master
Config/testthat/edition: 3
Config/testthat/parallel: false
Date/Publication: 2022-11-10 18:34:27 UTC
Encoding: UTF-8
Roxygen: list(markdown = TRUE)
RoxygenNote: 7.2.1
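
The Remotes field above pins two dependencies to GitHub branches. A minimal
installation sketch (assuming the `remotes` package is available; these calls
are not part of the diff):

install.packages("remotes")
# install the GitHub-pinned dependencies first, then the package itself
remotes::install_github("kapsner/kdry", ref = "main")
remotes::install_github("mayer79/splitTools", ref = "master")
remotes::install_github("kapsner/mlexperiments")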
16 changes: 15 additions & 1 deletion NAMESPACE
@@ -1,4 +1,18 @@
# Generated by roxygen2: do not edit by hand

export(LearnerGlm)
export(LearnerKnn)
export(LearnerLm)
export(LearnerRpart)
export(MLCrossValidation)
export(MLLearnerBase)
export(MLNestedCV)
export(MLTuneParameters)
export(metric)
export(metric_types_helper)
export(performance)
export(predictions)
export(validate_fold_equality)
import(data.table)
importFrom(magrittr,"%>%")
importFrom(R6,R6Class)
importFrom(data.table,".SD")
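
The exports above form the user-facing API. As a quick orientation, the
metric() wrapper prepares a measure from the mlr3measures package (listed in
Suggests) for use with mlexperiments, e.g.:

library(mlexperiments)
# returns a function that can be assigned to a `performance_metric` field
bacc_fun <- metric("bacc")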
273 changes: 258 additions & 15 deletions NEWS.md
@@ -1,27 +1,270 @@
# rpkgTemplate NEWS
# mlexperiments NEWS

## Unreleased (2022-09-06)
## Unreleased (2022-11-06)

#### Breaking changes

- removed performance metric name
([01e4230](https://github.com/kapsner/mlexperiments/tree/01e423043b8938dc6e39cd72ca95586eb1f824ed))
- prepare for list of metrics
([db8b709](https://github.com/kapsner/mlexperiments/tree/db8b7090f943bea96fd326ad6a8487da46e9c033))

#### New features

- added rpart learner
([808042e](https://github.com/kapsner/mlexperiments/tree/808042e5d5d121adaccd699b9dd9ee155715959f))
- added support for defining multiple metrics
([6cc6340](https://github.com/kapsner/mlexperiments/tree/6cc634057c0d23ab294cc33a46556215b0dcbb6f))
- metric\_types\_helper again from kdry
([22b6cdc](https://github.com/kapsner/mlexperiments/tree/22b6cdcd01421a196014b048d8f2dd59f1e9c6f6))
- added validation function for fold equality
([54a1ec6](https://github.com/kapsner/mlexperiments/tree/54a1ec643052ce650efad85ad303c8a97708371e))
- added functions for predictions and performance calculations on
cv-models
([c4f5287](https://github.com/kapsner/mlexperiments/tree/c4f5287512e3b6136c66145adb2682f9e975320c))
- finalized implementation of knn
([70697fa](https://github.com/kapsner/mlexperiments/tree/70697fa0dd74d7043b0f8fe5dfe3c0f7be391be2))
- implemented knn grid / nested grid
([5e1d910](https://github.com/kapsner/mlexperiments/tree/5e1d910d1e1f008b1dda1382d08138ec787f6908))
- working on implementing knn
([5932e6c](https://github.com/kapsner/mlexperiments/tree/5932e6c18a3659dfb61e2c3c3dea9bb00ffd387a))
- added coxph; working on unit tests
([03daecd](https://github.com/kapsner/mlexperiments/tree/03daecdc3d1e15ae64a25fad984b2be6c5d11149))
- working on xgboost implementation (wip)
([c8db3b7](https://github.com/kapsner/mlexperiments/tree/c8db3b75e0a903cc2f9420de87349f654511172c))
- add xgboost surv
([415be47](https://github.com/kapsner/mlexperiments/tree/415be47895feb361504aac3042fa9f277d87cd5b))
- implemented nested cv
([47e0669](https://github.com/kapsner/mlexperiments/tree/47e066915550e1b58ac5408fe841539245de92cf))
- initial commits
([07f39da](https://github.com/kapsner/mlexperiments/tree/07f39da254f6ae51248361e29b3c42888c1f2325))

#### Bug fixes

- fixed typo
([5067f95](https://github.com/kapsner/mlexperiments/tree/5067f95707971aa948738ce1fc2662288c7e0735))
- fixed cbind of list in predictions
([a81309e](https://github.com/kapsner/mlexperiments/tree/a81309e8a0c010e7955b8cb92aa16a9360c6b7af))
- fixed performance and predictions
([e0174cf](https://github.com/kapsner/mlexperiments/tree/e0174cf95b7470aafd93828bf8beb2565e44cbc7))
- fixing examples
([ed3a461](https://github.com/kapsner/mlexperiments/tree/ed3a46155a47341fb71e2c0fbe44e945fa32b355))
- consolidated metrics as lists, fixed unit tests
([393a4d6](https://github.com/kapsner/mlexperiments/tree/393a4d642a168b501ae13c70116347fbb6a06a29))
- updated recursive function call
([f30c925](https://github.com/kapsner/mlexperiments/tree/f30c92597e37e6b542c73c462f8e0b35ee078e26))
- added recursiveness to calculating performance
([17f38a3](https://github.com/kapsner/mlexperiments/tree/17f38a3aac77c3f2af8b88e06a00283fc8326702))
- fixed issues with data types when calculating performance metrics
([b57c53e](https://github.com/kapsner/mlexperiments/tree/b57c53e4e4702453b53b0ac8bea4ae3468d8b33a))
- fixed bug in bayesian scoring function
([a1214e1](https://github.com/kapsner/mlexperiments/tree/a1214e1dbd4af11b8954fa8ff0d86c44ba01de03))
- fixed general parsing of parameters
([015d5f6](https://github.com/kapsner/mlexperiments/tree/015d5f6bc26272b3bbd22430db4599781cc1dcc5))
- removing duplicates by name
([01be901](https://github.com/kapsner/mlexperiments/tree/01be9011fb9e8437737f16d7f9e842fff892e120))
- fixed issue when selecting parameters
([4b6d974](https://github.com/kapsner/mlexperiments/tree/4b6d9747f105fc509ce86348e1f5f4149f7b1e00))
- fixed issue with prediction in ranger
([b71f4f8](https://github.com/kapsner/mlexperiments/tree/b71f4f809d75c32b2b740839ae9ff180fc27013c))
- fixed some issues; finished xgboost surv implementation
([576066e](https://github.com/kapsner/mlexperiments/tree/576066e47895dbf9fd9f3c37378c300f3eff77e9))
- fixed last issue with seed in grid-search
([cc821d6](https://github.com/kapsner/mlexperiments/tree/cc821d6f1c2b4364e19a92d7277fcda5fbfc7c95))
- fixed issue related to parallel backend in bayesian tuning
([2afba3f](https://github.com/kapsner/mlexperiments/tree/2afba3ff6c1407d8b87942662bb14e48a4497366))
- removed duplicate class names and fixed related issues
([9ca9385](https://github.com/kapsner/mlexperiments/tree/9ca9385167841ccfea490e05f42e94f07f757cac))

#### Refactorings

- further optimizing rpart for reuse
([ce67a4f](https://github.com/kapsner/mlexperiments/tree/ce67a4fcfda8a8a0554dfc63fa4f8e624a0f0272))
- optimized rpart for re-use
([fa0ca53](https://github.com/kapsner/mlexperiments/tree/fa0ca539cec775711ec2c2baec4702b7977e82ae))
- refactored rpart, no user-visible changes
([207aafd](https://github.com/kapsner/mlexperiments/tree/207aafd52852f000abd0e30f261aeb8d8d599e2b))
- metric\_from\_char
([8baaf3a](https://github.com/kapsner/mlexperiments/tree/8baaf3a99d8f56b3f19c4a2dec400854a7585a46))
- switch to .compute\_performance
([3d8565d](https://github.com/kapsner/mlexperiments/tree/3d8565de4f1f9c707c92955b2b5d0f7f36a23a6f))
- switch to .compute\_performance
([fa24034](https://github.com/kapsner/mlexperiments/tree/fa24034719f84e8c97557fece5a0f9f43764c299))
- added .compute\_performance
([fffdf78](https://github.com/kapsner/mlexperiments/tree/fffdf78e10562f42b3da6ff4841697fc1030642f))
- updated performance fun to comply with list
([67792f3](https://github.com/kapsner/mlexperiments/tree/67792f398f956a11bafedcf7e353d2bbd0342fad))
- adapted metric calculation to new list logic
([46500f7](https://github.com/kapsner/mlexperiments/tree/46500f74a2d41c2360bf62c2766c14d2fa590f81))
- metric as function now to list
([8bccfbc](https://github.com/kapsner/mlexperiments/tree/8bccfbcb90194a10e65b4a65a26d07ffc0dc5e18))
- to internal metric\_types\_helper
([024adcb](https://github.com/kapsner/mlexperiments/tree/024adcb7fd765c603e94c41474d64a34791efe1a))
- to internal metric\_types\_helper
([fde6e77](https://github.com/kapsner/mlexperiments/tree/fde6e770fb1e4743447edde46b0df5a7c8d69368))
- to internal metric\_types\_helper
([97deeb7](https://github.com/kapsner/mlexperiments/tree/97deeb7fe20fb22b7da162de427d9935c80338e7))
- moved fix performance types to kdry
([600a660](https://github.com/kapsner/mlexperiments/tree/600a6605bc53b94fd1a885fc66edf2de446e770c))
- format\_xy to kdry
([322db94](https://github.com/kapsner/mlexperiments/tree/322db944f5115dbb6127ca2dab1f4962931d355d))
- updated code to upstream changes
([54766cc](https://github.com/kapsner/mlexperiments/tree/54766cc8b63c84b74170ac3a629ddf0f82cd7da2))
- learner to self
([263ae51](https://github.com/kapsner/mlexperiments/tree/263ae51b365f4db79f1b006052d2dfdf0f7d8570))
- now relying on list.append from kdry
([d9f6f67](https://github.com/kapsner/mlexperiments/tree/d9f6f679139de06dce34d4d3d4f0d168030a1061))
- metric-performance to cv-class now
([d4209e5](https://github.com/kapsner/mlexperiments/tree/d4209e5aad59ad46cc9297e29f9a1e5ca20be493))
- removed metric\_perf\_h\_b
([e63b3da](https://github.com/kapsner/mlexperiments/tree/e63b3da8e465ff73f1a136c5a6a869789e9ff679))
- optimizing code quality
([401bdf1](https://github.com/kapsner/mlexperiments/tree/401bdf14328bb4825b5030a7420559f1cb7e4e1a))
- use of learner\_args is now also possible
([0ce1f6c](https://github.com/kapsner/mlexperiments/tree/0ce1f6c92f503779bf232bfd3cac5a38bee63e60))
- working on making code more straightforward
([c697e5f](https://github.com/kapsner/mlexperiments/tree/c697e5f49e2641ecaad6cf5ea0498056b2a73d00))
- learners to mllrnrs package
([b1ee0da](https://github.com/kapsner/mlexperiments/tree/b1ee0da24fb221e3f73359912ed6da7366309401))
- reorganized code to make use of more inheritance
([772bba0](https://github.com/kapsner/mlexperiments/tree/772bba0cadd439f65fbf04ede5899948acc5b53c))

#### CI

- add workflow to automatically update tic.yml
([31d5711](https://github.com/kapsner/rpkgTemplate/tree/31d57111c513187677556349f94a4adcbceba6e4))
- updated tic.yml
([e823904](https://github.com/kapsner/rpkgTemplate/tree/e8239048abd14b83487d393a6f291c9e49672cc0))
- updated github actions
([70431e5](https://github.com/kapsner/rpkgTemplate/tree/70431e5cfdd56bb9feab76a2076c62d0104575ff))
- removed vignette building from ci
([97f3dbe](https://github.com/kapsner/mlexperiments/tree/97f3dbee1a0658345d0456530b0272187f663ac4))
- explicitly installing doparallel
([a288058](https://github.com/kapsner/mlexperiments/tree/a288058dd65b4dbc335637394f5b010751407be5))
- explicitly installing suggests of kdry
([7682912](https://github.com/kapsner/mlexperiments/tree/768291265b52153a61e9529b5dfb4cb4d48db6a5))
- switch again to code step
([d3e6556](https://github.com/kapsner/mlexperiments/tree/d3e6556b99bda11e7c5963e1e19cc95d62780eac))
- try to fix installation of kdry
([359de02](https://github.com/kapsner/mlexperiments/tree/359de0281a752d2674d46b5c402202e2d8a0f8ca))
- whatever is going on with tic
([0ca30f0](https://github.com/kapsner/mlexperiments/tree/0ca30f0252b347fb5612bf883ca61029fd6bb97e))
- back to previous tic
([e1a4541](https://github.com/kapsner/mlexperiments/tree/e1a45410a5226af057a1e0e4ff02899bbf0ecd35))
- hopefully fixes ci
([17e5b28](https://github.com/kapsner/mlexperiments/tree/17e5b28f97bb43cd7769e1b15758fd545b6ae2c5))
- try fixing
([5cd1579](https://github.com/kapsner/mlexperiments/tree/5cd1579681809863c536039b4ee919a65f55c19c))
- try to fix ci
([7c1dad3](https://github.com/kapsner/mlexperiments/tree/7c1dad366d147362caa2a6d4bba1c7d373cfdb18))
- try to fix ci
([2ba876e](https://github.com/kapsner/mlexperiments/tree/2ba876e5b7d807fd59b0bdc1b3b152bd9c4052e0))
- update tic
([4ffa188](https://github.com/kapsner/mlexperiments/tree/4ffa188ef56ef11dbe352472019031f704ea41fc))
- installing dependencies for kdry
([59b95e0](https://github.com/kapsner/mlexperiments/tree/59b95e0191e8b7c1d7c3b8eab55f2a577ab039bf))
- fixed linting error
([35190b6](https://github.com/kapsner/mlexperiments/tree/35190b62b45ec9773efd922dd9e30485cc5dbff9))
- solve warning for missing global variables
([771a3f1](https://github.com/kapsner/mlexperiments/tree/771a3f10dbf93037933c229948e7743fca52ae0b))
- fixed test to comply with cran checks
([8026711](https://github.com/kapsner/mlexperiments/tree/8026711b9f6f59534cabd51fc60d8dae63c0f45e))
- refactored unit tests
([d003088](https://github.com/kapsner/mlexperiments/tree/d003088c9d19dda872f9e698d9cbcb9f89cb6e95))
- fixed linting errors
([a3d3eb0](https://github.com/kapsner/mlexperiments/tree/a3d3eb0887db86a97b17a3d0b07b90c587477a5a))

#### Docs

- added readme
([8aae171](https://github.com/kapsner/rpkgTemplate/tree/8aae171037c50521a7c643e0647c5be2ac52d07b))
- added vignettes with examples
([b315dad](https://github.com/kapsner/mlexperiments/tree/b315dadba12a4b8d86f70a3233415d86bf392a6a))
- started with vignettes - knn binary
([14c12f8](https://github.com/kapsner/mlexperiments/tree/14c12f871e6f84e266e22335b0c2b4f90611ae40))
- fixed issues in vignette
([33a3951](https://github.com/kapsner/mlexperiments/tree/33a3951ebe3a5d718c82032ac15f37e5f79ff293))
- removed commented output
([d92aad8](https://github.com/kapsner/mlexperiments/tree/d92aad89427e54079ad8e5c6e9a806997bb0874a))
- updated example in vignette
([a958f42](https://github.com/kapsner/mlexperiments/tree/a958f4221782da6232031800452ff85f5b33d359))
- update example in readme
([9663fff](https://github.com/kapsner/mlexperiments/tree/9663fffcba86f60bd3a98055c456b3f87ef226b7))
- updated vignette accordingly
([4442cc5](https://github.com/kapsner/mlexperiments/tree/4442cc53c82e1ceef1c657e56c743e256b9d73b0))
- updated readme
([41dfa7e](https://github.com/kapsner/mlexperiments/tree/41dfa7e7f6346185dd9a12be46f6c7b70fe4ed9f))
- updated readme
([5bb27ca](https://github.com/kapsner/mlexperiments/tree/5bb27caf222018ae8d0bb23abcb285b50be087ec))
- updated documentation
([c2182d4](https://github.com/kapsner/mlexperiments/tree/c2182d4abc9667d64ea366b9e0186363d45fc7cf))
- updated metric types documentation
([dd07e5d](https://github.com/kapsner/mlexperiments/tree/dd07e5da61002c4b2b6a00c1a2f6bd78eb24e4f3))
- finalizing documentation, added example to readme
([587de28](https://github.com/kapsner/mlexperiments/tree/587de2854ec002a6493eb94eafc2479aebb37c3e))
- updated vignette intro
([3e0edd4](https://github.com/kapsner/mlexperiments/tree/3e0edd432cbd0a5997521bab85d3eab19dc29923))
- finalizing vignette; working on documentation
([7c24878](https://github.com/kapsner/mlexperiments/tree/7c2487858aa7d56bcdd2315371f390fdf94ef035))
- updated readme
([8add37f](https://github.com/kapsner/mlexperiments/tree/8add37f81a048273a3118a6d157c218e5ade2509))
- updated readme
([40e38e5](https://github.com/kapsner/mlexperiments/tree/40e38e58d709116dab5856bf5500d46a8f26c4df))
- working on vignette
([84ce219](https://github.com/kapsner/mlexperiments/tree/84ce219883335b6cbfa12dacc9234acd4f9710b5))
- updated readme
([14cf011](https://github.com/kapsner/mlexperiments/tree/14cf0116a1fd6c1684d1ad35039630cd9fcafb70))
- updated readme
([0d4e0ce](https://github.com/kapsner/mlexperiments/tree/0d4e0cecf593908a8d317f35f1f0ab1fc86c94c7))
- working on vignette
([e4158f0](https://github.com/kapsner/mlexperiments/tree/e4158f0822cb36680688779087b0eafa40861d8e))
- working on vignette
([66177d3](https://github.com/kapsner/mlexperiments/tree/66177d3f0579d70860b77d5b02e3e6116e8d2e71))
- working on package vignette
([6fa520c](https://github.com/kapsner/mlexperiments/tree/6fa520cac5b25099744f05fe711da57f381c748a))
- updated documentation
([0ff5511](https://github.com/kapsner/mlexperiments/tree/0ff5511cfc7f7f42508efed6ca295d2b351dd8ef))
- finished most of the package documentation
([21102f0](https://github.com/kapsner/mlexperiments/tree/21102f08051ce4a80f4fa5ea8d2154247fc89d50))
- finished documenting all r6 classes
([d06d056](https://github.com/kapsner/mlexperiments/tree/d06d05646f8eb6e92e32cbc4935e9da1e905e2aa))
- working on documenting r6 classes
([eba4a91](https://github.com/kapsner/mlexperiments/tree/eba4a91d193190450049b2e18e59a90839846dbd))
- working on class documentation
([77e06c8](https://github.com/kapsner/mlexperiments/tree/77e06c89bb0f63202231bd0111de13e53ec8bbb1))
- started working on documentation
([081ac26](https://github.com/kapsner/mlexperiments/tree/081ac26d999b3e809984bf0fd3d4064a4a0dc398))

#### Other changes

- fixed badge url
([681232c](https://github.com/kapsner/rpkgTemplate/tree/681232c309896969005a36f9979d47cbeb114669))
- added news.md
([b1fa3d5](https://github.com/kapsner/rpkgTemplate/tree/b1fa3d50fdbcb8dbb9fffebfd3230e6570e59691))
- fixed canonical form of cran url
([6dbd31a](https://github.com/kapsner/mlexperiments/tree/6dbd31aa49adb572368a4c2ff25e88df56be6525))
- added implementation details to vignettes
([4401139](https://github.com/kapsner/mlexperiments/tree/4401139d83cd9d0204749044b26f06cc3ab7667a))
- updated description and news.md
([49c35b6](https://github.com/kapsner/mlexperiments/tree/49c35b66941491e096465baa993439251bcb09de))
- fixed typo in vignette title
([f51cad2](https://github.com/kapsner/mlexperiments/tree/f51cad2bff63f4ad0dc76a16a499de48f4f2caf3))
- fixed rpart and optimized parsing of parameter grid
([74edba2](https://github.com/kapsner/mlexperiments/tree/74edba2c6ccffd240ef488a40b176e33fbbc1155))
- updated readme example
([06727a0](https://github.com/kapsner/mlexperiments/tree/06727a002e8a85287dc42aa4732d9e0e340ad643))
- finished refactoring rpart
([eb25888](https://github.com/kapsner/mlexperiments/tree/eb25888196d21983efbb452d905b664fef5ba168))
- updated news.md
([a2b1cfb](https://github.com/kapsner/mlexperiments/tree/a2b1cfb55a32a5ff26d2c9cd70385392a0a9bb78))
- fixed typo
([4d497d2](https://github.com/kapsner/mlexperiments/tree/4d497d217083e9aa65ce6f58fd1fb78628b04682))
- updated readme
([c36c9ef](https://github.com/kapsner/mlexperiments/tree/c36c9efbb1d13424043cffff5f1d31cd55f7f21d))
- fixed typo
([c8f7a88](https://github.com/kapsner/mlexperiments/tree/c8f7a88bef27acf1d10bc4766d3b199b28001183))
- fixed typo
([26a9489](https://github.com/kapsner/mlexperiments/tree/26a9489ad1036980dc46102c77857c73274d0788))
- updated readme and description
([6c0084f](https://github.com/kapsner/mlexperiments/tree/6c0084f912f11a681255368b9c64c40109e296ea))
- adaptions to upstream changes
([294e9f3](https://github.com/kapsner/mlexperiments/tree/294e9f3cd19fdfdd7e9febdd87f1421cb8d7af39))
- updated readme, added lifecycle badge
([94705eb](https://github.com/kapsner/mlexperiments/tree/94705eb1dbcaf7ae32424b6fc1ad961f6c8d78d5))
- updated news.md
([575266e](https://github.com/kapsner/mlexperiments/tree/575266e62417fe6f457fe05980c95f9ecf9e7f31))
- reformatting message
([74fed90](https://github.com/kapsner/mlexperiments/tree/74fed90ab55e229cbd77a5941c1de512a5605e11))
- updated readme
([5fc4a0a](https://github.com/kapsner/mlexperiments/tree/5fc4a0a9c24e719bbe2bda0584b86e80c13b15cc))

Full set of changes:
[`115b0f1...8bfe834`](https://github.com/kapsner/rpkgTemplate/compare/115b0f1...8bfe834)
[`edb76c3...5067f95`](https://github.com/kapsner/mlexperiments/compare/edb76c3...5067f95)
272 changes: 272 additions & 0 deletions R/cv_class_base.R
@@ -0,0 +1,272 @@
#' @title R6 Class to perform cross-validation experiments
#'
#' @description
#' The `MLCrossValidation` class is used to construct a cross validation object
#' and to perform a k-fold cross validation for a specified machine learning
#' algorithm using one distinct hyperparameter setting.
#'
#' @details
#' The `MLCrossValidation` class requires a named list of predefined row
#' indices for the cross validation folds, e.g., created with the function
#' [splitTools::create_folds()]. This list also defines the `k` of the k-fold
#' cross-validation. To perform a repeated k-fold cross-validation, simply
#' provide a list with all repeated fold definitions, e.g., created by
#' specifying the argument `m_rep` of [splitTools::create_folds()].
#'
#' @seealso [splitTools::create_folds()]
#'
#' @examples
#' dataset <- do.call(
#' cbind,
#' c(sapply(paste0("col", 1:6), function(x) {
#' rnorm(n = 500)
#' },
#' USE.NAMES = TRUE,
#' simplify = FALSE
#' ),
#' list(target = sample(0:1, 500, TRUE))
#' ))
#'
#' fold_list <- splitTools::create_folds(
#' y = dataset[, 7],
#' k = 3,
#' type = "stratified",
#' seed = 123
#' )
#'
#' cv <- MLCrossValidation$new(
#' learner = LearnerKnn$new(),
#' fold_list = fold_list,
#' seed = 123,
#' ncores = 2
#' )
#'
#' # learner parameters
#' cv$learner_args <- list(
#' k = 20,
#' l = 0,
#' test = parse(text = "fold_test$x")
#' )
#'
#' # performance parameters
#' cv$predict_args <- list(type = "response")
#' cv$performance_metric <- metric("bacc")
#'
#' # set data
#' cv$set_data(
#' x = data.matrix(dataset[, -7]),
#' y = dataset[, 7]
#' )
#'
#' cv$execute()
#'
#' @export
#'
MLCrossValidation <- R6::R6Class( # nolint
classname = "MLCrossValidation",
inherit = MLExperimentsBase,
public = list(
#' @field fold_list A named list of predefined row indices for the cross
#' validation folds, e.g., created with the function
#' [splitTools::create_folds()].
fold_list = NULL,

#' @field return_models A logical. If the fitted models should be returned
#' with the results (default: `FALSE`).
return_models = NULL,

#' @field performance_metric Either a named list with metric functions, a
#' single metric function, or a character vector with metric names from
#' the `mlr3measures` package. The provided functions must take two named
#' arguments: `ground_truth` and `predictions`. For metrics from the
#' `mlr3measures` package, the wrapper function [mlexperiments::metric()]
#' exists in order to prepare them for use with the `mlexperiments`
#' package.
performance_metric = NULL,

#' @field performance_metric_args A list. Further arguments required to
#' compute the performance metric.
performance_metric_args = NULL,

#' @field predict_args A list. Further arguments required to compute the
#' predictions.
predict_args = NULL,

#' @description
#' Create a new `MLCrossValidation` object.
#'
#' @param fold_list A named list of predefined row indices for the cross
#' validation folds, e.g., created with the function
#' [splitTools::create_folds()].
#' @param learner An initialized learner object that inherits from class
#' `"MLLearnerBase"`.
#' @param seed An integer. Needs to be set for reproducibility purposes.
#' @param ncores An integer to specify the number of cores used for
#' parallelization (default: `-1L`).
#' @param return_models A logical. If the fitted models should be returned
#' with the results (default: `FALSE`).
#'
#' @details
#' The `MLCrossValidation` class requires a named list of
#' predefined row indices for the cross validation folds, e.g., created
#' with the function [splitTools::create_folds()]. This list also defines
#' the `k` of the k-fold cross-validation. To perform a repeated k-fold
#' cross-validation, simply provide a list with all repeated fold
#' definitions, e.g., created by specifying the argument `m_rep` of
#' [splitTools::create_folds()].
#'
#' @seealso [splitTools::create_folds()], [mlr3measures::measures],
#' [mlexperiments::metric()]
#'
#' @examples
#' dataset <- do.call(
#' cbind,
#' c(sapply(paste0("col", 1:6), function(x) {
#' rnorm(n = 500)
#' },
#' USE.NAMES = TRUE,
#' simplify = FALSE
#' ),
#' list(target = sample(0:1, 500, TRUE))
#' ))
#' fold_list <- splitTools::create_folds(
#' y = dataset[, 7],
#' k = 3,
#' type = "stratified",
#' seed = 123
#' )
#' cv <- MLCrossValidation$new(
#' learner = LearnerKnn$new(),
#' fold_list = fold_list,
#' seed = 123,
#' ncores = 2
#' )
#'
initialize = function(
learner,
fold_list,
seed,
ncores = -1L,
return_models = FALSE
) {
super$initialize(learner = learner, seed = seed, ncores = ncores)
stopifnot(is.logical(self$return_models <- return_models))
stopifnot(is.list(fold_list) && length(fold_list) >= 3L)
self$fold_list <- fold_list
},

#' @description
#' Execute the cross validation.
#'
#' @return The function returns a data.table with the results of the cross
#' validation. More results are accessible from the field `$results` of
#' the `MLCrossValidation` class.
#'
#' @details
#' All results of the cross validation are saved in the field
#' `$results` of the `MLCrossValidation` class. After successful execution
#' of the cross validation, `$results` contains a list with the items:
#'
#' * "fold" A list of folds containing the following items for each
#' cross validation fold:
#' + "fold_ids" A vector with the utilized in-sample row indices.
#' + "ground_truth" A vector with the ground truth.
#' + "predictions" A vector with the predictions.
#' + "learner.args" A list with the arguments provided to the learner.
#' + "model" If `return_models = TRUE`, the fitted model.
#' * "summary" A data.table with the summarized results (same as
#' the returned value of the `execute` method).
#' * "performance" A list with the value of the performance metric
#' calculated for each of the cross validation folds.
#'
#' @examples
#' dataset <- do.call(
#' cbind,
#' c(sapply(paste0("col", 1:6), function(x) {
#' rnorm(n = 500)
#' },
#' USE.NAMES = TRUE,
#' simplify = FALSE
#' ),
#' list(target = sample(0:1, 500, TRUE))
#' ))
#' fold_list <- splitTools::create_folds(
#' y = dataset[, 7],
#' k = 3,
#' type = "stratified",
#' seed = 123
#' )
#' cv <- MLCrossValidation$new(
#' learner = LearnerKnn$new(),
#' fold_list = fold_list,
#' seed = 123,
#' ncores = 2
#' )
#' cv$learner_args <- list(
#' k = 20,
#' l = 0,
#' test = parse(text = "fold_test$x")
#' )
#' cv$predict_args <- list(type = "response")
#' cv$performance_metric <- metric("bacc")
#'
#' # set data
#' cv$set_data(
#' x = data.matrix(dataset[, -7]),
#' y = dataset[, 7]
#' )
#'
#' cv$execute()
execute = function() {
private$prepare()
return(.run_cv(self = self, private = private))
}
),
private = list(
fun_performance_metric = NULL,
cv_run_model = function(...) {
kwargs <- list(...)
args <- kdry::list.append(
list(
self = self,
private = private
),
kwargs
)
do.call(.cv_fit_model, args)
},
prepare = function() {
stopifnot(
!is.null(private$x), !is.null(private$y),
!is.null(self$fold_list),
ifelse(
test = is.null(self$performance_metric_args),
yes = TRUE,
no = is.list(self$performance_metric_args)
),
is.list(self$performance_metric) ||
is.character(self$performance_metric) ||
is.function(self$performance_metric)
)

if (is.character(self$performance_metric)) {
self$performance_metric <- .metric_from_char(self$performance_metric)
}
if (is.function(self$performance_metric)) {
self$performance_metric <- list(
"performance" = self$performance_metric
)
}
stopifnot(all(sapply(self$performance_metric, is.function)))

# apply parameter_grid stuff
.organize_parameter_grid(self = self, private = private)

stopifnot(
length(intersect(
names(private$method_helper$params_not_optimized),
names(private$execute_params))) == 0L
)
}
)
)
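
The details section above notes that repeated k-fold cross-validation only
requires a fold list that contains all repetitions. A minimal sketch, assuming
the `m_rep` argument of splitTools::create_folds() and reusing `dataset` from
the example above:

# repeated 3-fold CV: m_rep = 2 yields 6 fold definitions in one list
fold_list_rep <- splitTools::create_folds(
  y = dataset[, 7],
  k = 3,
  m_rep = 2,
  type = "stratified",
  seed = 123
)
length(fold_list_rep) # 6

cv_rep <- MLCrossValidation$new(
  learner = LearnerKnn$new(),
  fold_list = fold_list_rep,
  seed = 123,
  ncores = 2
)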
307 changes: 307 additions & 0 deletions R/cv_class_nested.R
@@ -0,0 +1,307 @@
#' @title R6 Class to perform nested cross-validation experiments
#'
#' @description
#' The `MLNestedCV` class is used to construct a nested cross validation
#' object and to perform a nested cross validation for a specified machine
#' learning algorithm: a hyperparameter optimization is performed with the
#' in-sample observations of each of the k outer folds, and the best parameter
#' setting is validated directly on the out-of-sample observations of the
#' respective fold.
#'
#' @details
#' The `MLNestedCV` class requires a named list of predefined
#' row indices for the outer cross validation folds, e.g., created with the
#' function [splitTools::create_folds()]. This list also defines the `k` of
#' the k-fold cross-validation. Furthermore, a strategy needs to be chosen
#' ("grid" or "bayesian") for the hyperparameter optimization as well as the
#' parameter `k_tuning` to define the number of inner cross validation folds.
#'
#' @seealso [splitTools::create_folds()]
#'
#' @examples
#' dataset <- do.call(
#' cbind,
#' c(sapply(paste0("col", 1:6), function(x) {
#' rnorm(n = 500)
#' },
#' USE.NAMES = TRUE,
#' simplify = FALSE
#' ),
#' list(target = sample(0:1, 500, TRUE))
#' ))
#'
#' fold_list <- splitTools::create_folds(
#' y = dataset[, 7],
#' k = 3,
#' type = "stratified",
#' seed = 123
#' )
#'
#' cv <- MLNestedCV$new(
#' learner = LearnerKnn$new(),
#' strategy = "grid",
#' fold_list = fold_list,
#' k_tuning = 3L,
#' seed = 123,
#' ncores = 2
#' )
#'
#' # learner args (not optimized)
#' cv$learner_args <- list(
#' l = 0,
#' test = parse(text = "fold_test$x")
#' )
#'
#' # parameters for hyperparameter tuning
#' cv$parameter_grid <- expand.grid(
#' k = seq(4, 68, 8)
#' )
#' cv$split_type <- "stratified"
#'
#' # performance parameters
#' cv$predict_args <- list(type = "response")
#' cv$performance_metric <- metric("bacc")
#'
#' # set data
#' cv$set_data(
#' x = data.matrix(dataset[, -7]),
#' y = dataset[, 7]
#' )
#'
#' cv$execute()
#'
#' @export
#'
MLNestedCV <- R6::R6Class( # nolint
classname = "MLNestedCV",
inherit = MLCrossValidation,
public = list(
#' @field strategy A character. The strategy to optimize the hyperparameters
#' (either `"grid"` or `"bayesian"`).
strategy = NULL,

#' @field parameter_bounds A named list of tuples to define the parameter
#' bounds of the Bayesian hyperparameter optimization. For further details
#' please see the documentation of the `ParBayesianOptimization` package.
parameter_bounds = NULL,

#' @field parameter_grid A matrix with named columns in which each column
#' represents a parameter that should be optimized and each row represents
#' a specific hyperparameter setting that should be tested throughout the
#' procedure. For `strategy = "grid"`, each row of the `parameter_grid` is
#' considered as a setting that is evaluated. For `strategy = "bayesian"`,
#' the `parameter_grid` is passed further on to the `initGrid` argument of
#' the function [ParBayesianOptimization::bayesOpt()] in order to
#' initialize the Bayesian process. The maximum rows considered for
#' initializing the Bayesian process can be specified with the R option
#' `option("mlexperiments.bayesian.max_init")`, which is set to `50L` by
#' default.
parameter_grid = NULL,

#' @field optim_args A named list of further arguments passed on to the
#' hyperparameter optimization. For further details
#' please see the documentation of the `ParBayesianOptimization` package.
optim_args = NULL,

#' @field split_type A character. The splitting strategy to construct the
#' k cross-validation folds. This parameter is passed further on to the
#' function [splitTools::create_folds()] and defaults to `"stratified"`.
split_type = NULL,

#' @field split_vector A vector. If another criterion than the provided `y`
#' should be considered for generating the cross-validation folds, it can
#' be defined here. It is important that a vector of the same length as
#' `x` is provided here.
split_vector = NULL,

#' @field k_tuning An integer to define the number of cross-validation folds
#' used to tune the hyperparameters.
k_tuning = NULL,

#' @description
#' Create a new `MLNestedCV` object.
#'
#' @param fold_list A named list of predefined row indices for the cross
#' validation folds, e.g., created with the function
#' [splitTools::create_folds()].
#' @param strategy A character. The strategy to optimize the hyperparameters
#' (either `"grid"` or `"bayesian"`).
#' @param learner An initialized learner object that inherits from class
#' `"MLLearnerBase"`.
#' @param k_tuning An integer to define the number of cross-validation folds
#' used to tune the hyperparameters.
#' @param seed An integer. Needs to be set for reproducibility purposes.
#' @param ncores An integer to specify the number of cores used for
#' parallelization (default: `-1L`).
#' @param return_models A logical. If the fitted models should be returned
#' with the results (default: `FALSE`).
#'
#' @details
#' The `MLNestedCV` class requires a named list of predefined
#' row indices for the outer cross validation folds, e.g., created with
#' the function [splitTools::create_folds()]. This list also defines the
#' `k` of the k-fold cross-validation. Furthermore, a strategy needs to
#' be chosen ("grid" or "bayesian") for the hyperparameter optimization
#' as well as the parameter `k_tuning` to define the number of inner
#' cross validation folds.
#'
#' @seealso [splitTools::create_folds()]
#'
#' @examples
#' dataset <- do.call(
#' cbind,
#' c(sapply(paste0("col", 1:6), function(x) {
#' rnorm(n = 500)
#' },
#' USE.NAMES = TRUE,
#' simplify = FALSE
#' ),
#' list(target = sample(0:1, 500, TRUE))
#' ))
#'
#' fold_list <- splitTools::create_folds(
#' y = dataset[, 7],
#' k = 3,
#' type = "stratified",
#' seed = 123
#' )
#'
#' cv <- MLNestedCV$new(
#' learner = LearnerKnn$new(),
#' strategy = "grid",
#' fold_list = fold_list,
#' k_tuning = 3L,
#' seed = 123,
#' ncores = 2
#' )
#'
initialize = function(
learner,
strategy = c("grid", "bayesian"),
k_tuning,
fold_list,
seed,
ncores = -1L,
return_models = FALSE
) {
super$initialize(
learner = learner,
fold_list = fold_list,
seed = seed,
ncores = ncores,
return_models = return_models
)
stopifnot(
!is.null(self$learner$.__enclos_env__$private$fun_optim_cv),
as.integer(k_tuning) >= 3L,
is.integer(self$k_tuning <- as.integer(k_tuning))
)
strategy <- match.arg(strategy)
self$strategy <- strategy
},

#' @description
#' Execute the nested cross validation.
#'
#' @return The function returns a data.table with the results of the nested
#' cross validation. More results are accessible from the field `$results`
#' of the `MLNestedCV` class.
#'
#' @details
#' All results of the cross validation are saved in the field `$results` of
#' the `MLNestedCV` class. After successful execution of the nested cross
#' validation, `$results` contains a list with the items:
#'
#' * "results.optimization" A list with the results of the hyperparameter
#' optimization.
#' * "fold" A list of folds containing the following items for each
#' cross validation fold:
#' + "fold_ids" A vector with the utilized in-sample row indices.
#' + "ground_truth" A vector with the ground truth.
#' + "predictions" A vector with the predictions.
#' + "learner.args" A list with the arguments provided to the learner.
#' + "model" If `return_models = TRUE`, the fitted model.
#' * "summary" A data.table with the summarized results (same as
#' the returned value of the `execute` method).
#' * "performance" A list with the value of the performance metric
#' calculated for each of the cross validation folds.
#'
#' @examples
#' dataset <- do.call(
#' cbind,
#' c(sapply(paste0("col", 1:6), function(x) {
#' rnorm(n = 500)
#' },
#' USE.NAMES = TRUE,
#' simplify = FALSE
#' ),
#' list(target = sample(0:1, 500, TRUE))
#' ))
#'
#' fold_list <- splitTools::create_folds(
#' y = dataset[, 7],
#' k = 3,
#' type = "stratified",
#' seed = 123
#' )
#'
#' cv <- MLNestedCV$new(
#' learner = LearnerKnn$new(),
#' strategy = "grid",
#' fold_list = fold_list,
#' k_tuning = 3L,
#' seed = 123,
#' ncores = 2
#' )
#'
#' # learner args (not optimized)
#' cv$learner_args <- list(
#' l = 0,
#' test = parse(text = "fold_test$x")
#' )
#'
#' # parameters for hyperparameter tuning
#' cv$parameter_grid <- expand.grid(
#' k = seq(4, 68, 8)
#' )
#' cv$split_type <- "stratified"
#'
#' # performance parameters
#' cv$predict_args <- list(type = "response")
#' cv$performance_metric <- metric("bacc")
#'
#' # set data
#' cv$set_data(
#' x = data.matrix(dataset[, -7]),
#' y = dataset[, 7]
#' )
#'
#' cv$execute()
#'
execute = function() {
private$prepare()
stopifnot(
!is.null(self$strategy),
ifelse(
test = self$strategy == "bayesian",
yes = !is.null(self$parameter_bounds),
no = !is.null(self$parameter_grid)
),
!is.null(private$method_helper$execute_params)
)
return(.run_cv(self = self, private = private))
}
),
private = list(
cv_run_model = function(...) {
kwargs <- list(...)
args <- kdry::list.append(
list(
self = self,
private = private
),
kwargs
)
do.call(.cv_run_nested_model, args)
}
)
)
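
The execute() method above requires `parameter_bounds` when
`strategy = "bayesian"`. A sketch of the Bayesian variant, reusing `dataset`
and `fold_list` from the example above; the bounds format (a named list of
lower/upper pairs) follows the ParBayesianOptimization conventions, and the
`iters.n` entry of `optim_args` is an assumption about what is forwarded to
ParBayesianOptimization::bayesOpt():

cv_bayes <- MLNestedCV$new(
  learner = LearnerKnn$new(),
  strategy = "bayesian",
  fold_list = fold_list,
  k_tuning = 3L,
  seed = 123,
  ncores = 2
)

# learner args (not optimized)
cv_bayes$learner_args <- list(l = 0, test = parse(text = "fold_test$x"))

# bounds for the Bayesian search; the grid initializes the process
cv_bayes$parameter_bounds <- list(k = c(2L, 80L))
cv_bayes$parameter_grid <- expand.grid(k = seq(4, 68, 8))
cv_bayes$optim_args <- list(iters.n = 10L) # assumption: passed to bayesOpt()

cv_bayes$split_type <- "stratified"
cv_bayes$predict_args <- list(type = "response")
cv_bayes$performance_metric <- metric("bacc")

cv_bayes$set_data(x = data.matrix(dataset[, -7]), y = dataset[, 7])
cv_bayes$execute()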
162 changes: 162 additions & 0 deletions R/cv_helper_base.R
@@ -0,0 +1,162 @@
.run_cv <- function(self, private) {
cv_results <- .fold_looper(self, private)
outlist <- .cv_postprocessing(
self = self,
private = private,
results_object = cv_results
)
class(outlist) <- c("list", "mlexCV")
self$results <- outlist
return(self$results$summary)
}

.fold_looper <- function(self, private) {
# init a progress bar
pb <- progress::progress_bar$new(
format = "CV progress [:bar] :current/:total (:percent)\n",
total = length(self$fold_list)
)

outlist <- list()

for (fold in names(self$fold_list)) {
message(paste0("\nCV fold: ", fold))
# increment progress bar
pb$tick()

# get fold ids
train_index <- self$fold_list[[fold]]

fold_train <- list(x = kdry::mlh_subset(private$x, train_index),
y = kdry::mlh_subset(private$y, train_index))
fold_test <- list(x = kdry::mlh_subset(private$x, -train_index),
y = kdry::mlh_subset(private$y, -train_index))

run_args <- list(
train_index = train_index,
fold_train = fold_train,
fold_test = fold_test
)

# for nested cv, just overwrite private$cv_run_model in inherited
# class and add hyperparameter search before calling .cv_run_model

outlist[[fold]] <- do.call(private$cv_run_model, run_args)
}
return(outlist)
}

.cv_postprocessing <- function(
self,
private,
results_object
) {

outlist <- list(folds = results_object)

# calculate error metric for each fold
for (fold in names(results_object)) {
perf_args <- kdry::list.append(
list(
ground_truth = results_object[[fold]][["ground_truth"]],
predictions = results_object[[fold]][["predictions"]]
),
self$performance_metric_args
)
outlist[["performance"]][[fold]] <- .compute_performance(
function_list = self$performance_metric,
y = private$y,
perf_args = perf_args
)
}

# calculate performance metrics here
# add them to a nice results table
outlist[["summary"]] <- data.table::rbindlist(
l = sapply(
X = names(results_object),
FUN = function(x) {
# which learner args should be added in the final output?
add_args <- vapply(
X = results_object[[x]][["learner.args"]],
FUN = function(test_args) {
ifelse(
test = length(test_args) == 1L && is.atomic(test_args),
yes = TRUE,
no = FALSE
)
},
FUN.VALUE = logical(1L)
)

ret <- c(
list("fold" = x),
outlist[["performance"]][[x]]
)

if (sum(add_args) > 0) {
ret <- kdry::list.append(
ret,
results_object[[x]][["learner.args"]][add_args]
)
}
return(ret)
},
simplify = FALSE,
USE.NAMES = TRUE
)
)
# return
return(outlist)
}

.cv_fit_model <- function(self, private, train_index, fold_train, fold_test) {

fit_args <- list(
x = fold_train$x,
y = fold_train$y,
seed = private$seed,
ncores = private$ncores
)

if (is.list(self$learner_args)) {

learner_args <- self$learner_args
learner_args <- .method_params_refactor(learner_args, private$method_helper)
learner_args <- .eval_params(learner_args)

fit_args <- kdry::list.append(
fit_args,
learner_args
)
} else {
learner_args <- NULL
}

set.seed(private$seed)
fitted <- do.call(self$learner$fit, fit_args)

# make predictions
pred_args <- list(
model = fitted,
newdata = fold_test$x,
ncores = private$ncores
)
if (!is.null(private$cat_vars)) {
pred_args <- kdry::list.append(pred_args, list(cat_vars = private$cat_vars))
}
pred_args <- kdry::list.append(pred_args, self$predict_args)
preds <- do.call(self$learner$predict, pred_args)

res <- list(
fold_ids = train_index,
ground_truth = fold_test$y,
predictions = preds,
"learner.args" = learner_args
)

if (self$return_models) {
res <- kdry::list.append(res, list(model = fitted))
}
return(res)
}
54 changes: 54 additions & 0 deletions R/cv_helper_nested.R
@@ -0,0 +1,54 @@
.cv_run_nested_model <- function(
self,
private,
train_index,
fold_train,
fold_test
) {

hparam_tuner <- MLTuneParameters$new(
learner = self$learner,
strategy = self$strategy,
seed = private$seed,
ncores = private$ncores
)
hparam_tuner$parameter_bounds <- self$parameter_bounds
hparam_tuner$parameter_grid <-
private$method_helper$execute_params$parameter_grid
hparam_tuner$learner_args <-
private$method_helper$execute_params$params_not_optimized
hparam_tuner$optim_args <- self$optim_args
hparam_tuner$split_type <- self$split_type
hparam_tuner$split_vector <- self$split_vector[train_index]
# run hyperparameter optimization on the training fold
hparam_tuner$set_data(
x = fold_train$x,
y = fold_train$y,
cat_vars = private$method_helper$execute_params$cat_vars
)

# execute optimization
hparam_tuner$execute(k = self$k_tuning)

outlist <- list(results.optimization = hparam_tuner$results)

# adjust best settings to fit the final model with
learner_args <- hparam_tuner$results[["best.setting"]]
learner_args <- learner_args[(names(learner_args) != "setting_id")]
learner_args <- learner_args[!kdry::misc_duplicated_by_names(
learner_args, fromLast = TRUE
)]

self$learner_args <- learner_args

# fit final model
res <- .cv_fit_model(
self = self,
private = private,
train_index = train_index,
fold_train = fold_train,
fold_test = fold_test
)
outlist <- kdry::list.append(outlist, res)
return(outlist)
}
193 changes: 193 additions & 0 deletions R/cv_performance.R
@@ -0,0 +1,193 @@
#' @title performance
#'
#' @description Calculate performance measures from the predictions results.
#'
#' @param object An R6 object of class `"MLCrossValidation"` for which the
#' performance should be computed.
#' @param prediction_results An object of class `"mlexPredictions"` (the output
#' of the function [mlexperiments::predictions()]).
#' @param y_ground_truth A vector with the ground truth of the predicted data.
#' @param type A character to select a pre-defined set of metrics for "binary"
#' and "regression" tasks. If not specified (default: `NULL`), the metrics
#' that were specified during fitting the `object` are used.
#' @param ... A list. Further arguments required to compute the performance
#' metrics.
#'
#' @details
#' The performance metric has to be specified in the `object` that is used to
#' carry out the experiment, i.e., [mlexperiments::MLCrossValidation] or
#' [mlexperiments::MLNestedCV].
#' Please note that the option `return_models = TRUE` must be set in the
#' experiment class so that the predictions, which are required to calculate
#' the performance, can be computed.
#'
#' @return The function returns a data.table with the computed performance
#' metric of each fold.
#'
#' @examples
#' dataset <- do.call(
#' cbind,
#' c(sapply(paste0("col", 1:6), function(x) {
#' rnorm(n = 500)
#' },
#' USE.NAMES = TRUE,
#' simplify = FALSE
#' ),
#' list(target = sample(0:1, 500, TRUE))
#' ))
#'
#' fold_list <- splitTools::create_folds(
#' y = dataset[, 7],
#' k = 3,
#' type = "stratified",
#' seed = 123
#' )
#'
#' glm_optimization <- mlexperiments::MLCrossValidation$new(
#' learner = LearnerGlm$new(),
#' fold_list = fold_list,
#' seed = 123
#' )
#'
#' glm_optimization$learner_args <- list(family = binomial(link = "logit"))
#' glm_optimization$predict_args <- list(type = "response")
#' glm_optimization$performance_metric_args <- list(positive = "1")
#' glm_optimization$performance_metric <- list(
#' auc = metric("auc"), sensitivity = metric("sensitivity"),
#' specificity = metric("specificity")
#' )
#' glm_optimization$return_models <- TRUE
#'
#' # set data
#' glm_optimization$set_data(
#' x = data.matrix(dataset[, -7]),
#' y = dataset[, 7]
#' )
#'
#' cv_results <- glm_optimization$execute()
#'
#' # predictions
#' preds <- mlexperiments::predictions(
#' object = glm_optimization,
#' newdata = data.matrix(dataset[, -7]),
#' na.rm = FALSE,
#' ncores = 2L,
#' type = "response"
#' )
#'
#' # performance
#' mlexperiments::performance(
#' object = glm_optimization,
#' prediction_results = preds,
#' y_ground_truth = dataset[, 7],
#' positive = "1"
#' )
#'
#' # performance - binary
#' mlexperiments::performance(
#' object = glm_optimization,
#' prediction_results = preds,
#' y_ground_truth = dataset[, 7],
#' type = "binary",
#' positive = "1"
#' )
#'
#' @export
#'
performance <- function(
object,
prediction_results,
y_ground_truth,
type = NULL,
...
) {
stopifnot(
inherits(object, what = "MLCrossValidation"),
R6::is.R6(object),
inherits(prediction_results, "mlexPredictions")
)

kwargs <- list(...)
model_names <- setdiff(colnames(prediction_results), c("mean", "sd"))
perf_fun <- object$performance_metric

perf_metric_args <- object$performance_metric_args

if (!is.null(type)) {
type <- match.arg(type, c("regression", "binary"))
if (!requireNamespace("mlr3measures", quietly = TRUE)) {
stop(
paste0(
"Package \"mlr3measures\" must be installed to use ",
"function 'performance()'."
),
call. = FALSE
)
}
if (type == "regression") {
append_metrics <- c(
"mse", "msle", "mae", "mape", "rmse", "rmsle", "rsq", "sse"
)
} else if (type == "binary") {
append_metrics <- c(
"auc", "prauc", "sensitivity", "specificity", "ppv", "npv", "tn", "tp",
"fn", "fp", "tnr", "tpr", "fnr", "fpr", "bbrier", "acc", "ce", "fbeta"
)
}
base_metric_list <- .metric_from_char(append_metrics)
perf_fun <- kdry::list.append(perf_fun, base_metric_list)
}

res <- data.table::rbindlist(
l = lapply(
X = model_names,
FUN = function(mn) {
perf_args <- kdry::list.append(
list(
ground_truth = y_ground_truth,
predictions = prediction_results[[mn]]
),
kwargs
)

perf_args <- kdry::list.append(
perf_args,
perf_metric_args
)

dt_list <- kdry::list.append(
list("model" = mn),
.compute_performance(
function_list = perf_fun,
y = y_ground_truth,
perf_args = perf_args
)
)
data.table::setDT(dt_list)
return(dt_list)
}
)
)

return(res)
}

.compute_performance <- function(
function_list,
y,
perf_args
) {
res <- sapply(
X = names(function_list),
FUN = function(x) {
metric_types_helper(
FUN = function_list[[x]],
y = y,
perf_args = perf_args
)
},
USE.NAMES = TRUE,
simplify = FALSE
)
return(res)
}
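
As documented for the `performance_metric` field of `MLCrossValidation` above,
user-defined metrics must accept the named arguments `ground_truth` and
`predictions`. A minimal sketch of such a custom metric (the function name and
the cutoff logic are illustrative, not part of the package):

# any function with `ground_truth` and `predictions` arguments can serve as a
# metric; further named arguments can be supplied via performance_metric_args
my_accuracy <- function(ground_truth, predictions, cutoff = 0.5) {
  predicted_class <- as.integer(predictions > cutoff)
  mean(predicted_class == ground_truth)
}

# cv$performance_metric <- list(accuracy = my_accuracy)
# cv$performance_metric_args <- list(cutoff = 0.5)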
131 changes: 131 additions & 0 deletions R/cv_predictions.R
Original file line number Diff line number Diff line change
@@ -0,0 +1,131 @@
#' @title predictions
#'
#' @description Apply an R6 object of class `"MLCrossValidation"` to new data
#' to compute predictions.
#'
#' @param object An R6 object of class `"MLCrossValidation"` for which the
#' predictions should be computed.
#' @param newdata The new data for which predictions should be made using
#' the `model`.
#' @param na.rm A logical. If missing values should be removed before computing
#' the mean and standard deviation of the predictions across the different
#' folds for each observation in `newdata`.
#' @param ncores An integer to specify the number of cores used for
#' parallelization (default: `-1L`).
#' @param ... A list. Further arguments required to compute the predictions.
#'
#' @return The function returns a data.table of class `"mlexPredictions"` with
#' one row for each observation in `newdata` and the columns containing
#' the predictions for each fold, along with the mean and standard deviation
#' across all folds.
#'
#' @examples
#' dataset <- do.call(
#' cbind,
#' c(sapply(paste0("col", 1:6), function(x) {
#' rnorm(n = 500)
#' },
#' USE.NAMES = TRUE,
#' simplify = FALSE
#' ),
#' list(target = sample(0:1, 500, TRUE))
#' ))
#'
#' fold_list <- splitTools::create_folds(
#' y = dataset[, 7],
#' k = 3,
#' type = "stratified",
#' seed = 123
#' )
#'
#' glm_optimization <- mlexperiments::MLCrossValidation$new(
#' learner = LearnerGlm$new(),
#' fold_list = fold_list,
#' seed = 123
#' )
#'
#' glm_optimization$learner_args <- list(family = binomial(link = "logit"))
#' glm_optimization$predict_args <- list(type = "response")
#' glm_optimization$performance_metric_args <- list(positive = "1")
#' glm_optimization$performance_metric <- metric("auc")
#' glm_optimization$return_models <- TRUE
#'
#' # set data
#' glm_optimization$set_data(
#' x = data.matrix(dataset[, -7]),
#' y = dataset[, 7]
#' )
#'
#' cv_results <- glm_optimization$execute()
#'
#' # predictions
#' preds <- mlexperiments::predictions(
#' object = glm_optimization,
#' newdata = data.matrix(dataset[, -7]),
#' na.rm = FALSE,
#' ncores = 2L,
#' type = "response"
#' )
#' head(preds)
#'
#' @export
#'
predictions <- function(
object,
newdata,
na.rm = FALSE, # nolint
ncores = -1L,
...
) {
stopifnot(
is.integer(as.integer(ncores)),
ncores != 0L,
inherits(object, what = "MLCrossValidation"),
R6::is.R6(object),
inherits(object$results, "mlexCV"),
isTRUE(object$return_models),
is.logical(na.rm)
)
kwargs <- list(...)
ncores <- kdry::pch_check_available_cores(ncores = as.integer(ncores))
model_names <- names(object$results$folds)
pred_fun <- object$learner$predict

pred_args_base <- kdry::list.append(
list(
newdata = newdata,
ncores = ncores
),
kwargs
)
pred_args_base <- kdry::list.append(
pred_args_base,
object$predict_args
)

res_pre <- sapply(
X = model_names,
FUN = function(m) {
pred_args <- kdry::list.append(
list(model = object$results$folds[[m]]$model),
pred_args_base
)

do.call(pred_fun, pred_args)
},
simplify = FALSE,
USE.NAMES = TRUE
)

res <- data.table::as.data.table(res_pre)

res[, `:=`(
mean = mean(as.numeric(.SD), na.rm = na.rm),
sd = stats::sd(as.numeric(.SD), na.rm = na.rm)
),
.SDcols = colnames(res),
by = seq_len(nrow(res))
]
class(res) <- c("data.frame", "data.table", "mlexPredictions")
return(res)
}
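# A minimal sketch (not part of the package) of the per-observation
# aggregation used in predictions() above: a row-wise mean/sd across the
# fold columns of a data.table, computed by grouping on the row index.
dt <- data.table::data.table(fold1 = c(0.2, 0.9), fold2 = c(0.4, 0.7))
dt[, `:=`(
  mean = mean(as.numeric(.SD)),
  sd = stats::sd(as.numeric(.SD))
),
  .SDcols = c("fold1", "fold2"),
  by = seq_len(nrow(dt))
]
dt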
243 changes: 243 additions & 0 deletions R/learner_class_base.R
@@ -0,0 +1,243 @@
#' @title R6 Class to construct learners
#'
#' @description
#' The `MLLearnerBase` class is used to construct a learner object that can be
#' used with the experiment classes from the `mlexperiments` package. It is
#' thought to serve as a class to inherit from when creating new learners.
#'
#' @details
#' The learner class exposes 4 methods that can be defined:
#' * `$fit` A wrapper around the private function `fun_fit`, which needs to
#' be defined for every learner. The return value of this function is the
#' fitted model.
#' * `$predict` A wrapper around the private function `fun_predict`,
#' which needs to be defined for every learner. The function must accept the
#' three arguments `model`, `newdata`, and `ncores` and is a wrapper around
#' the respective learner's predict-function. In order to allow the passing of
#' further arguments, the ellipsis (`...`) can be used. The function should
#' return the prediction results.
#' * `$cross_validation` A wrapper around the private function
#' `fun_optim_cv`, which needs to be defined when hyperparameters should be
#' optimized with a grid search (required for use with
#' [mlexperiments::MLTuneParameters], and [mlexperiments::MLNestedCV]).
#' * `$bayesian_scoring_function` A wrapper around the private function
#' `fun_bayesian_scoring_function`, which needs to be defined when
#' hyperparameters should be optimized with a Bayesian process (required for
#' use with [mlexperiments::MLTuneParameters], and
#' [mlexperiments::MLNestedCV]).
#'
#' For further details please refer to the package's vignette.
#'
#' @examples
#' MLLearnerBase$new(metric_optimization_higher_better = FALSE)
#'
#' @export
#'
MLLearnerBase <- R6::R6Class( # nolint
classname = "MLLearnerBase",
public = list(
#' @field cluster_export A character vector defining the (internal)
#' functions that need to be exported to the parallelization cluster.
#' This is only required when performing a Bayesian hyperparameter
#' optimization. See also [parallel::clusterExport()].
cluster_export = NULL,

#' @field metric_optimization_higher_better A logical. Defines the direction
#' of the optimization metric used throughout the hyperparameter
#' optimization. This field is set automatically during the initialization
#' of the `MLLearnerBase` object. Its purpose is to make it accessible by
#' the evaluation functions from [mlexperiments::MLTuneParameters].
metric_optimization_higher_better = NULL,

#' @field environment The environment in which to search for the functions
#' of the learner (default: `-1L`).
environment = -1L,

#' @description
#' Create a new `MLLearnerBase` object.
#'
#' @param metric_optimization_higher_better A logical. Defines the direction
#' of the optimization metric used throughout the hyperparameter
#' optimization.
#'
#' @return A new `MLLearnerBase` R6 object.
#'
#' @examples
#' MLLearnerBase$new(metric_optimization_higher_better = FALSE)
#'
initialize = function(
metric_optimization_higher_better # nolint
) {
stopifnot(
is.logical(metric_optimization_higher_better) ||
is.null(metric_optimization_higher_better)
)
self$metric_optimization_higher_better <-
metric_optimization_higher_better
},

#' @description
#' Perform a cross-validation with an `MLLearnerBase`.
#'
#' @details
#' A wrapper around the private function `fun_optim_cv`, which needs to be
#' defined when hyperparameters should be optimized with a grid search
#' (required for use with [mlexperiments::MLTuneParameters], and
#' [mlexperiments::MLNestedCV]).
#' However, the function should never be executed directly but only by the
#' respective experiment wrappers [mlexperiments::MLTuneParameters], and
#' [mlexperiments::MLNestedCV].
#' For further details please refer to the package's vignette.
#'
#' @param ... Arguments to be passed to the learner's cross-validation
#' function.
#'
#' @return The fitted model.
#'
#' @seealso [mlexperiments::MLTuneParameters],
#' [mlexperiments::MLCrossValidation], and
#' [mlexperiments::MLNestedCV]
#'
#' @examples
#' \dontrun{
#' learner <- MLLearnerBase$new(metric_optimization_higher_better = FALSE)
#' learner$cross_validation()
#' }
#'
cross_validation = function(...) {
kwargs <- list(...)
do.call(private$fun_optim_cv, kwargs)
},

#' @description
#' Fit a `MLLearnerBase` object.
#'
#' @details
#' A wrapper around the private function `fun_fit`, which needs to be
#' defined for every learner. The return value of this function is the
#' fitted model.
#' However, the function should never be executed directly but only by the
#' respective experiment wrappers [mlexperiments::MLTuneParameters],
#' [mlexperiments::MLCrossValidation], and
#' [mlexperiments::MLNestedCV].
#' For further details please refer to the package's vignette.
#'
#' @param ... Arguments to be passed to the learner's fitting function.
#'
#' @return The fitted model.
#'
#' @seealso [mlexperiments::MLTuneParameters],
#' [mlexperiments::MLCrossValidation], and
#' [mlexperiments::MLNestedCV]
#'
#' @examples
#' \dontrun{
#' learner <- MLLearnerBase$new(metric_optimization_higher_better = FALSE)
#' learner$fit()
#' }
#'
fit = function(...) {
kwargs <- list(...)
do.call(private$fun_fit, kwargs)
},

#' @description
#' Make predictions from a fitted `MLLearnerBase` object.
#'
#' @details
#' A wrapper around the private function `fun_predict`, which needs to be
#' defined for every learner. The function must accept the three arguments
#' `model`, `newdata`, and `ncores` and is a wrapper around the respective
#' learner's predict-function. In order to allow the passing of further
#' arguments, the ellipsis (`...`) can be used. The function should
#' return the prediction results.
#' However, the function should never be executed directly but only by the
#' respective experiment wrappers [mlexperiments::MLTuneParameters],
#' [mlexperiments::MLCrossValidation], and
#' [mlexperiments::MLNestedCV].
#' For further details please refer to the package's vignette.
#'
#' @param model A fitted model of the learner (as returned by
#' `MLLearnerBase$fit()`).
#' @param newdata The new data for which predictions should be made using
#' the `model`.
#' @param ncores An integer to specify the number of cores used for
#' parallelization (default: `-1L`).
#' @param ... Further arguments to be passed to the learner's predict
#' function.
#'
#' @return The predictions for `newdata`.
#'
#' @seealso [mlexperiments::MLTuneParameters],
#' [mlexperiments::MLCrossValidation], and
#' [mlexperiments::MLNestedCV]
#'
#' @examples
#' \dontrun{
#' learner <- MLLearnerBase$new(metric_optimization_higher_better = FALSE)
#' learner$fit()
#' learner$predict()
#' }
#'
predict = function(model, newdata, ncores, ...) {
kwargs <- list(
model = model,
newdata = newdata,
ncores = ncores
)
catch_kwargs <- list(...)
kwargs <- kdry::list.append(kwargs, catch_kwargs)
do.call(private$fun_predict, kwargs)
},

#' @description
#' Perform a Bayesian hyperparameter optimization with an `MLLearnerBase`.
#'
#' @details
#' A wrapper around the private function `fun_bayesian_scoring_function`,
#' which needs to be defined when hyperparameters should be optimized with
#' a Bayesian process (required for use with
#' [mlexperiments::MLTuneParameters], and [mlexperiments::MLNestedCV]).
#' However, the function should never be executed directly but only by the
#' respective experiment wrappers [mlexperiments::MLTuneParameters], and
#' [mlexperiments::MLNestedCV].
#' For further details please refer to the package's vignette.
#'
#' @param ... Arguments to be passed to the learner's Bayesian scoring
#' function.
#'
#' @return The results of the Bayesian scoring.
#'
#' @seealso [ParBayesianOptimization::bayesOpt()],
#' [mlexperiments::MLTuneParameters], and [mlexperiments::MLNestedCV]
#'
#' @examples
#' \dontrun{
#' learner <- MLLearnerBase$new(metric_optimization_higher_better = FALSE)
#' learner$bayesian_scoring_function()
#' }
#'
bayesian_scoring_function = function(...) {
kwargs <- list(...)
args <- .method_params_refactor(
kwargs,
method_helper
)
res <- do.call(private$fun_bayesian_scoring_function, args)

# take care of transforming results in case higher-better = FALSE
# --> bayesOpt tries to maximize the metric, so it is required to
# inverse score
if (isFALSE(self$metric_optimization_higher_better)) {
res$Score <- as.numeric(I(res$Score * -1L))
}
return(res)
}
),
private = list(
fun_optim_cv = NULL,
fun_bayesian_scoring_function = NULL,
fun_fit = NULL,
fun_predict = NULL
)
)
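# A minimal sketch (not part of the package) of how a custom learner is
# derived from MLLearnerBase: inherit and wire the private fun_* slots in
# $initialize(), analogous to the learners shipped with the package (see,
# e.g., LearnerGlm below). 'LearnerToy' and its wrapped functions are
# hypothetical and for illustration only.
LearnerToy <- R6::R6Class( # nolint
  classname = "LearnerToy",
  inherit = mlexperiments::MLLearnerBase,
  public = list(
    initialize = function() {
      super$initialize(metric_optimization_higher_better = NULL)
      # a simple linear model as the fitting function
      private$fun_fit <- function(x, y, ncores, seed, ...) {
        set.seed(seed)
        stats::lm(y ~ ., data = as.data.frame(x))
      }
      # the predict wrapper must accept model, newdata, and ncores
      private$fun_predict <- function(model, newdata, ncores, ...) {
        stats::predict(model, newdata = as.data.frame(newdata))
      }
      # no tuning implemented here, so the optimization slots stay NULL
      private$fun_optim_cv <- NULL
      private$fun_bayesian_scoring_function <- NULL
    }
  )
)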
110 changes: 110 additions & 0 deletions R/learner_glm.R
@@ -0,0 +1,110 @@
#' @title LearnerGlm R6 class
#'
#' @description
#' This learner is a wrapper around [stats::glm()] in order to perform a
#' generalized linear regression. There is no implementation for tuning
#' parameters.
#'
#' @details
#' Can be used with
#' * [mlexperiments::MLCrossValidation]
#'
#' Implemented methods:
#' * `$fit` To fit the model.
#' * `$predict` To predict new data with the model.
#'
#' @seealso [stats::glm()]
#'
#' @examples
#' LearnerGlm$new()
#'
#' @export
#'
LearnerGlm <- R6::R6Class( # nolint
classname = "LearnerGlm",
inherit = mlexperiments::MLLearnerBase,
public = list(

#' @description
#' Create a new `LearnerGlm` object.
#'
#' @details
#' This learner is a wrapper around [stats::glm()] in order to perform a
#' generalized linear regression. There is no implementation for tuning
#' parameters, thus the only experiment to use `LearnerGlm` for is
#' [mlexperiments::MLCrossValidation].
#'
#' @return A new `LearnerGlm` R6 object.
#'
#' @seealso [stats::glm()]
#'
#' @examples
#' LearnerGlm$new()
#'
initialize = function() {
super$initialize(
metric_optimization_higher_better = NULL # unnecessary here
)
self$environment <- "mlexperiments"
private$fun_fit <- glm_fit
private$fun_predict <- glm_predict

# there is no optimization step here, so all related functions / fields
# are set to NULL
self$cluster_export <- NULL
private$fun_optim_cv <- NULL
private$fun_bayesian_scoring_function <- NULL
}
)
)

# pass parameters as ...
glm_fit <- function(x, y, ncores, seed, ...) {
message("Parameter 'ncores' is ignored for learner 'LearnerGlm'.")
params <- list(...)

if ("cat_vars" %in% names(params)) {
cat_vars <- params[["cat_vars"]]
glm_params <- params[names(params) != "cat_vars"]
} else {
cat_vars <- NULL
glm_params <- params
}

x <- kdry::dtr_matrix2df(matrix = x, cat_vars = cat_vars)

glm_formula <- stats::as.formula(object = "y ~ .")

args <- kdry::list.append(
list(
formula = glm_formula,
data = x
),
glm_params
)

set.seed(seed)
# fit the model
bst <- do.call(stats::glm, args)
return(bst)
}

glm_predict <- function(model, newdata, ncores, ...) {
kwargs <- list(...)

if ("cat_vars" %in% names(kwargs)) {
cat_vars <- kwargs[["cat_vars"]]
} else {
cat_vars <- NULL
}

pred_args <- kdry::list.append(
list(
object = model,
newdata = kdry::dtr_matrix2df(matrix = newdata, cat_vars = cat_vars)
),
kwargs
)

return(do.call(stats::predict.glm, pred_args))
}
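# A minimal sketch (not part of the package) of calling the wrappers above
# directly; within an experiment they are only invoked through the learner's
# $fit() and $predict() methods. The built-in iris data is used purely for
# illustration.
X <- data.matrix(iris[, 1:4])
y <- as.integer(iris$Species == "versicolor")
model <- glm_fit(
  x = X, y = y, ncores = 1L, seed = 123,
  family = binomial(link = "logit")
)
head(glm_predict(model = model, newdata = X, ncores = 1L, type = "response"))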
212 changes: 212 additions & 0 deletions R/learner_knn.R
@@ -0,0 +1,212 @@
#' @title LearnerKnn R6 class
#'
#' @description
#' This learner is a wrapper around [class::knn()] in order to perform a
#' k-nearest neighbor classification.
#'
#' @details
#' Optimization metric: classification error rate.
#'
#' Can be used with
#' * [mlexperiments::MLTuneParameters]
#' * [mlexperiments::MLCrossValidation]
#' * [mlexperiments::MLNestedCV]
#'
#' Implemented methods:
#' * `$fit` To fit the model.
#' * `$predict` To predict new data with the model.
#' * `$cross_validation` To perform a grid search (hyperparameter
#' optimization).
#' * `$bayesian_scoring_function` To perform a Bayesian hyperparameter
#' optimization.
#'
#' For the two hyperparameter optimization strategies ("grid" and "bayesian"),
#' the parameter `metric_optimization_higher_better` of the learner is
#' set to `FALSE` by default as the classification error rate
#' ([mlr3measures::ce()]) is used as the optimization metric.
#'
#' @seealso [class::knn()], [mlr3measures::ce()]
#'
#' @examples
#' LearnerKnn$new()
#'
#' @export
#'
LearnerKnn <- R6::R6Class( # nolint
classname = "LearnerKnn",
inherit = mlexperiments::MLLearnerBase,
public = list(
#'
#' @description
#' Create a new `LearnerKnn` object.
#'
#' @details
#' This learner is a wrapper around [class::knn()] in order to perform a
#' k-nearest neighbor classification. The following experiments are
#' implemented:
#' * [mlexperiments::MLTuneParameters]
#' * [mlexperiments::MLCrossValidation]
#' * [mlexperiments::MLNestedCV]
#' For the two hyperparameter optimization strategies ("grid" and
#' "bayesian"), the parameter `metric_optimization_higher_better` of the
#' learner is set to `FALSE` by default as the classification error rate
#' ([mlr3measures::ce()]) is used as the optimization metric.
#'
#' @seealso [class::knn()], [mlr3measures::ce()]
#'
#' @examples
#' LearnerKnn$new()
#'
initialize = function() {
if (!requireNamespace("class", quietly = TRUE)) {
stop(
paste0(
"Package \"class\" must be installed to use ",
"'learner = \"LearnerKnn\"'."
),
call. = FALSE
)
}
super$initialize(
metric_optimization_higher_better = FALSE # classification error
)
self$environment <- "mlexperiments"
self$cluster_export <- knn_ce()
private$fun_optim_cv <- knn_optimization
private$fun_fit <- knn_fit
private$fun_predict <- knn_predict
private$fun_bayesian_scoring_function <- knn_bsF
}
)
)


knn_ce <- function() {
c("knn_optimization", "knn_fit", "knn_predict", "metric")
}

knn_bsF <- function(...) { # nolint
params <- list(...)

  # 'x', 'y', 'seed', and 'method_helper' are not passed as arguments; they
  # are exported to the parallelization cluster by the Bayesian optimizer
  # and are found in the workers' environment.
  # call to knn_optimization here with ncores = 1, since the Bayesian search
  # is parallelized already / "FUN is fitted n times in m threads"
  set.seed(seed)
bayes_opt_knn <- knn_optimization(
x = x,
y = y,
params = params,
fold_list = method_helper$fold_list,
ncores = 1L, # important, as bayesian search is already parallelized
seed = seed
)

ret <- kdry::list.append(
list("Score" = bayes_opt_knn$metric_optim_mean),
bayes_opt_knn
)

return(ret)
}

knn_optimization <- function(x, y, params, fold_list, ncores, seed) {
stopifnot(
is.list(params),
"k" %in% names(params)
)

# initialize a dataframe to store the results
results_df <- data.table::data.table(
"fold" = character(0),
"metric" = numeric(0)
)

# we do not need test here as it is defined explicitly below
params[["test"]] <- NULL

# loop over the folds
for (fold in names(fold_list)) {

# get row-ids of the current fold
train_idx <- fold_list[[fold]]

# train the model for this cv-fold
args <- kdry::list.append(
list(
x = kdry::mlh_subset(x, train_idx),
test = kdry::mlh_subset(x, -train_idx),
y = kdry::mlh_subset(y, train_idx),
use.all = FALSE,
ncores = ncores,
seed = seed
),
params
)
set.seed(seed)
cvfit <- do.call(knn_fit, args)

# optimize error rate
FUN <- metric("ce") # nolint
perf_args <- list(
predictions = knn_predict(
model = cvfit,
newdata = kdry::mlh_subset(x, -train_idx),
ncores = ncores,
type = "response"
),
ground_truth = kdry::mlh_subset(y, -train_idx)
)
perf <- metric_types_helper(
FUN = FUN,
y = y,
perf_args = perf_args
)

results_df <- data.table::rbindlist(
l = list(
results_df,
list(
"fold" = fold,
"validation_metric" = perf
)
),
fill = TRUE
)
}

res <- list(
"metric_optim_mean" = mean(results_df$validation_metric)
)

return(res)
}

knn_fit <- function(x, y, ncores, seed, ...) {
kwargs <- list(...)
stopifnot("k" %in% names(kwargs))

args <- kdry::list.append(
list(
train = x,
cl = y
),
kwargs
)
args$prob <- TRUE
set.seed(seed)
fit <- do.call(class::knn, args)
return(fit)
}

knn_predict <- function(model, newdata, ncores, ...) {
kwargs <- list(...)
stopifnot("type" %in% names(kwargs))

if (kwargs$type == "response") {
return(model)
} else if (kwargs$type == "prob") {
# there is no knn-model but the probabilities predicted for the test data
return(attributes(model)$prob)
}
}
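# A minimal sketch (not part of the package): class::knn() predicts at fit
# time, so the object returned by knn_fit() already carries the test-set
# labels; knn_predict() merely unwraps them (type = "response") or the vote
# proportions (type = "prob"). The iris split is for illustration only.
X <- data.matrix(iris[, 1:4])
y <- iris$Species
idx <- seq(1L, 150L, by = 2L)
fit <- knn_fit(
  x = X[idx, ], y = y[idx], ncores = 1L, seed = 123,
  k = 5L, test = X[-idx, ]
)
head(knn_predict(model = fit, newdata = X[-idx, ], ncores = 1L,
                 type = "response"))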
106 changes: 106 additions & 0 deletions R/learner_lm.R
@@ -0,0 +1,106 @@
#' @title LearnerLm R6 class
#'
#' @description
#' This learner is a wrapper around [stats::lm()] in order to perform a
#' linear regression. There is no implementation for tuning
#' parameters.
#'
#' @details
#' Can be used with
#' * [mlexperiments::MLCrossValidation]
#'
#' Implemented methods:
#' * `$fit` To fit the model.
#' * `$predict` To predict new data with the model.
#'
#' @seealso [stats::lm()]
#'
#' @examples
#' LearnerLm$new()
#'
#' @export
#'
LearnerLm <- R6::R6Class( # nolint
classname = "LearnerLm",
inherit = mlexperiments::MLLearnerBase,
public = list(

#' @description
#' Create a new `LearnerLm` object.
#'
#' @details
#' This learner is a wrapper around [stats::lm()] in order to perform a
#' linear regression. There is no implementation for tuning
#' parameters, thus the only experiment to use `LearnerLm` for is
#' [mlexperiments::MLCrossValidation].
#'
#' @return A new `LearnerLm` R6 object.
#'
#' @seealso [stats::lm()]
#'
#' @examples
#' LearnerLm$new()
#'
initialize = function() {
super$initialize(
metric_optimization_higher_better = NULL # unnecessary here
)
self$environment <- "mlexperiments"
private$fun_fit <- lm_fit
private$fun_predict <- lm_predict

# there is no optimization step here, so all related functions / fields
# are set to NULL
self$cluster_export <- NULL
private$fun_optim_cv <- NULL
private$fun_bayesian_scoring_function <- NULL
}
)
)

# pass parameters as ...
lm_fit <- function(x, y, ncores, seed, ...) {
message("Parameter 'ncores' is ignored for learner 'LearnerLm'.")
params <- list(...)

if ("cat_vars" %in% names(params)) {
cat_vars <- params[["cat_vars"]]
} else {
cat_vars <- NULL
}

x <- kdry::dtr_matrix2df(matrix = x, cat_vars = cat_vars)

lm_formula <- stats::as.formula(object = "y ~ .")

args <- list(
formula = lm_formula,
data = x
)

set.seed(seed)
# fit the model
bst <- do.call(stats::lm, args)
return(bst)
}

lm_predict <- function(model, newdata, ncores, ...) {
kwargs <- list(...)

if ("cat_vars" %in% names(kwargs)) {
cat_vars <- kwargs[["cat_vars"]]
} else {
cat_vars <- NULL
}

pred_args <- kdry::list.append(
list(
object = model,
newdata = kdry::dtr_matrix2df(matrix = newdata, cat_vars = cat_vars)
),
kwargs
)

return(do.call(stats::predict.lm, pred_args))
}
352 changes: 352 additions & 0 deletions R/learner_rpart.R
@@ -0,0 +1,352 @@
#' @title LearnerRpart R6 class
#'
#' @description
#' This learner is a wrapper around [rpart::rpart()] in order to fit recursive
#' partitioning and regression trees.
#'
#' @details
#' Optimization metric:
#' * classification (`method = "class"`): classification error rate
#' * regression (`method = "anova"`): mean squared error
#'
#' Can be used with
#' * [mlexperiments::MLTuneParameters]
#' * [mlexperiments::MLCrossValidation]
#' * [mlexperiments::MLNestedCV]
#'
#' Implemented methods:
#' * `$fit` To fit the model.
#' * `$predict` To predict new data with the model.
#' * `$cross_validation` To perform a grid search (hyperparameter
#' optimization).
#' * `$bayesian_scoring_function` To perform a Bayesian hyperparameter
#' optimization.
#'
#' Parameters that are specified with `parameter_grid` and / or `learner_args`
#' are forwarded to `rpart`'s argument `control` (see
#' [rpart::rpart.control()] for further details).
#'
#' For the two hyperparameter optimization strategies ("grid" and "bayesian"),
#' the parameter `metric_optimization_higher_better` of the learner is
#' set to `FALSE` by default as the classification error rate
#' ([mlr3measures::ce()]) is used as the optimization metric for
#' classification tasks and the mean squared error ([mlr3measures::mse()]) is
#' used for regression tasks.
#'
#' @seealso [rpart::rpart()], [mlr3measures::ce()], [mlr3measures::mse()],
#' [rpart::rpart.control()]
#'
#' @examples
#' LearnerRpart$new()
#'
#' @export
#'
LearnerRpart <- R6::R6Class( # nolint
classname = "LearnerRpart",
inherit = mlexperiments::MLLearnerBase,
public = list(
#'
#' @description
#' Create a new `LearnerRpart` object.
#'
#' @details
#' This learner is a wrapper around [rpart::rpart()] in order to fit
#' recursive partitioning and regression trees. The following experiments
#' are implemented:
#' * [mlexperiments::MLTuneParameters]
#' * [mlexperiments::MLCrossValidation]
#' * [mlexperiments::MLNestedCV]
#'
#' For the two hyperparameter optimization strategies ("grid" and
#' "bayesian"), the parameter `metric_optimization_higher_better` of the
#' learner is set to `FALSE` by default as the classification error rate
#' ([mlr3measures::ce()]) is used as the optimization metric for
#' classification tasks and the mean squared error ([mlr3measures::mse()])
#' is used for regression tasks.
#'
#' @seealso [rpart::rpart()], [mlr3measures::ce()], [mlr3measures::mse()]
#'
#' @examples
#' LearnerRpart$new()
#'
initialize = function() {
if (!requireNamespace("rpart", quietly = TRUE)) {
stop(
paste0(
"Package \"rpart\" must be installed to use ",
"'learner = \"LearnerRpart\"'."
),
call. = FALSE
)
}
super$initialize(
metric_optimization_higher_better = FALSE # classification error
)
self$environment <- "mlexperiments"
self$cluster_export <- rpart_ce()
private$fun_optim_cv <- rpart_optimization
private$fun_fit <- function(x, y, ncores, seed, ...) {
kwargs <- list(...)
stopifnot(kwargs$method %in% c("class", "anova"))
args <- kdry::list.append(
list(
x = x, y = y, ncores = ncores, seed = seed
),
kwargs
)
return(do.call(rpart_fit, args))
}
private$fun_predict <- rpart_predict
private$fun_bayesian_scoring_function <- rpart_bsF
}
)
)


rpart_ce <- function() {
c("rpart_optimization", "rpart_cv", "rpart_fit", "rpart_fit_fun",
"rpart_predict_base", "rpart_predict", "metric")
}

rpart_bsF <- function(...) { # nolint
params <- list(...)

  # 'x', 'y', 'seed', and 'method_helper' are found in the workers'
  # environment (exported there by the Bayesian optimizer).
  # call to rpart_optimization here with ncores = 1, since the Bayesian search
  # is parallelized already / "FUN is fitted n times in m threads"
  set.seed(seed)
bayes_opt_rpart <- rpart_optimization(
x = x,
y = y,
params = params,
fold_list = method_helper$fold_list,
ncores = 1L, # important, as bayesian search is already parallelized
seed = seed
)

ret <- kdry::list.append(
list("Score" = bayes_opt_rpart$metric_optim_mean),
bayes_opt_rpart
)

return(ret)
}

rpart_cv <- function(
x,
y,
params,
fold_list,
ncores,
seed
) {

outlist <- list()

# loop over the folds
for (fold in names(fold_list)) {

# get row-ids of the current fold
train_idx <- fold_list[[fold]]

y_train <- kdry::mlh_subset(y, train_idx)

# train the model for this cv-fold
args <- kdry::list.append(
list(
y = y_train,
x = kdry::mlh_subset(x, train_idx),
ncores = ncores,
seed = seed
),
params
)
set.seed(seed)
cvfit <- do.call(rpart_fit, args)
outlist[[fold]] <- list(cvfit = cvfit,
train_idx = train_idx)
}
return(outlist)
}

rpart_optimization <- function(x, y, params, fold_list, ncores, seed) {
stopifnot(
is.list(params),
"method" %in% names(params),
params$method %in% c("class", "anova")
)

# check, if this is a classification context and select metric accordingly
if (params$method == "class") {
msg <- "Classification: using 'classification error rate'"
FUN <- mlexperiments::metric("ce") # nolint
pred_type <- "class"
} else {
msg <- "Regression: using 'mean squared error'"
FUN <- mlexperiments::metric("mse") # nolint
pred_type <- "vector"
}
message(paste("\n", msg, "as optimization metric."))

args <- list(
x = x,
y = y,
params = params,
fold_list = fold_list,
ncores = ncores,
seed = seed
)
cv_fit_list <- do.call(rpart_cv, args)

# initialize a dataframe to store the results
results_df <- data.table::data.table(
"fold" = character(0),
"metric" = numeric(0)
)

for (fold in names(cv_fit_list)) {

cvfit <- cv_fit_list[[fold]][["cvfit"]]
train_idx <- cv_fit_list[[fold]][["train_idx"]]

pred_args <- list(
model = cvfit,
newdata = kdry::mlh_subset(x, -train_idx),
ncores = ncores,
type = pred_type
)

preds <- do.call(rpart_predict, pred_args)

perf_args <- list(
predictions = preds,
ground_truth = kdry::mlh_subset(y, -train_idx)
)
perf <- metric_types_helper(
FUN = FUN,
y = y,
perf_args = perf_args
)

results_df <- data.table::rbindlist(
l = list(
results_df,
list(
"fold" = fold,
"validation_metric" = perf
)
),
fill = TRUE
)
}

res <- list(
"metric_optim_mean" = mean(results_df$validation_metric)
)

return(res)
}

rpart_fit_fun <- function(x, y, ncores, seed, ...) {
kwargs <- list(...)

rpart_formula <- stats::as.formula(object = "rpart_y_train ~ .")

rpart_y_train <- y # nolint
train_x <- x

  rpart_control <- NULL
  rpart_control_default <- formals(rpart::rpart.control)
  # collect all arguments that belong to rpart::rpart.control() and remove
  # them from kwargs; everything else is passed on to rpart::rpart() directly
  control_list <- list()
  for (update_arg in names(kwargs)) {
    if (update_arg %in% names(rpart_control_default)) {
      control_list <- c(control_list, kwargs[update_arg])
      # delete item from kwargs
      kwargs[[update_arg]] <- NULL
    }
  }
  if (length(control_list) > 0L) {
    rpart_control <- do.call(rpart::rpart.control, control_list)
  }

args <- kdry::list.append(
list(
formula = rpart_formula,
data = kdry::dtr_matrix2df(train_x, cat_vars = kwargs$cat_vars)
),
kwargs
)
args <- kdry::list.append(
args,
list(control = rpart_control)
)
set.seed(seed)
fit <- do.call(rpart::rpart, args)
return(fit)
}

rpart_fit <- function(x, y, ncores, seed, ...) {
kwargs <- list(...)
stopifnot("method" %in% names(kwargs))
fit_args <- kdry::list.append(
list(
x = x,
y = y,
ncores = ncores,
seed = seed
),
kwargs
)
return(do.call(rpart_fit_fun, fit_args))
}



rpart_predict_base <- function(model, newdata, ncores, kwargs) {
if ("cat_vars" %in% names(kwargs)) {
cat_vars <- kwargs[["cat_vars"]]
rpart_params <- kwargs[names(kwargs) != "cat_vars"]
} else {
cat_vars <- NULL
rpart_params <- kwargs
}

predict_args <- kdry::list.append(
list(
object = model,
newdata = kdry::dtr_matrix2df(matrix = newdata, cat_vars = cat_vars)
),
rpart_params
)

return(do.call(stats::predict, predict_args))
}

rpart_predict <- function(model, newdata, ncores, ...) {
kwargs <- list(...)

args <- list(
model = model,
newdata = newdata,
ncores = ncores,
kwargs = kwargs
)
preds <- do.call(rpart_predict_base, args)

if ("type" %in% names(kwargs)) {
if (kwargs$type == "prob" && ncol(preds) == 2) { # in case of binary classif
preds <- as.vector(preds[, 2])
}
}

  if (isTRUE(kwargs$reshape)) {
    preds <- kdry::mlh_reshape(preds)
  }
return(preds)
}
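# A minimal sketch (not part of the package) of the control-parameter routing
# in rpart_fit_fun(): arguments whose names match rpart::rpart.control()
# formals ('minsplit', 'cp', ...) end up in 'control', everything else (here:
# 'method') is passed on to rpart::rpart() directly.
X <- data.matrix(iris[, 1:4])
y <- iris$Species
tree <- rpart_fit(
  x = X, y = y, ncores = 1L, seed = 123,
  method = "class", minsplit = 10L, cp = 0.02
)
table(rpart_predict(model = tree, newdata = X, ncores = 1L, type = "class"), y)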
149 changes: 149 additions & 0 deletions R/ml_class_base.R
@@ -0,0 +1,149 @@
#' @title Basic R6 Class for the mlexperiments package
#'
MLBase <- R6::R6Class( # nolint
classname = "MLBase",
public = list(
#' @field results A list. This field is used to store the final results of
#' the respective methods.
results = NULL,

#' @description
#' Create a new `MLBase` object.
#'
#' @param seed An integer. Needs to be set for reproducibility purposes.
#' @param ncores An integer to specify the number of cores used for
#' parallelization (default: `-1L`).
#'
#' @return A new `MLBase` R6 object.
#'
#' @examples
#' \dontrun{
#' MLBase$new(
#' seed = 123,
#' ncores = 2
#' )
#' }
#'
initialize = function(seed, ncores = -1L) {
stopifnot(
is.integer(as.integer(ncores)),
is.integer(as.integer(seed)),
ncores != 0L
)
private$seed <- as.integer(seed)

# check available cores
private$ncores <- kdry::pch_check_available_cores(as.integer(ncores))
}
),
private = list(
ncores = NULL,
seed = NULL
)
)

#' @title R6 Class on which the experiment classes are built on
#'
MLExperimentsBase <- R6::R6Class( # nolint
classname = "MLExperimentsBase",
inherit = MLBase,
public = list(
#' @field learner_args A list containing the parameter settings of the
#' learner algorithm.
learner_args = NULL,

#' @field learner An initialized learner object that inherits from class
#' `"MLLearnerBase"`.
learner = NULL,

#' @description
#' Create a new `MLExperimentsBase` object.
#'
#' @param learner An initialized learner object that inherits from class
#' `"MLLearnerBase"`.
#' @param seed An integer. Needs to be set for reproducibility purposes.
#' @param ncores An integer to specify the number of cores used for
#' parallelization (default: `-1L`).
#'
#' @return A new `MLExperimentsBase` R6 object.
#'
#' @examples
#' \dontrun{
#' MLExperimentsBase$new(
#' learner = LearnerKnn$new(),
#' seed = 123,
#' ncores = 2
#' )
#' }
#'
initialize = function(learner, seed, ncores = -1L) {
super$initialize(seed = seed, ncores = ncores)
stopifnot(
# only accept instantiated learners
R6::is.R6(learner),
inherits(learner, "MLLearnerBase")
)
self$learner <- learner
},

#' @description
#' Set the data for the experiment.
#'
#' @param x A matrix with the training data.
#' @param y A vector with the target.
#' @param cat_vars A character vector with the column names of variables
#' that should be treated as categorical features (if applicable /
#' supported by the respective algorithm).
#'
#' @return The function has no return value. It internally performs quality
#' checks on the provided data and, if passed, defines private fields of
#' the R6 class.
#'
#' @examples
#' \dontrun{
#' dataset <- do.call(
#' cbind,
#' c(sapply(paste0("col", 1:6), function(x) {
#' rnorm(n = 500)
#' },
#' USE.NAMES = TRUE,
#' simplify = FALSE
#' ),
#' list(target = sample(0:1, 500, TRUE))
#' ))
#'
#' tuner <- MLExperimentsBase$new(
#' learner = LearnerKnn$new(),
#' seed = 123,
#' ncores = 2
#' )
#'
#' # set data
#' tuner$set_data(
#' x = data.matrix(dataset[, -7]),
#' y = dataset[, 7]
#' )
#' }
#'
set_data = function(x, y, cat_vars = NULL) {
stopifnot(
inherits(x = x, what = c("matrix", "array")),
nrow(x) > 1L, !is.vector(x),
ifelse(
test = is.null(cat_vars),
yes = TRUE,
no = is.character(cat_vars) && is.atomic(cat_vars) &&
  all(cat_vars %in% colnames(x))
)
)
private$x <- x
private$y <- y
private$method_helper$cat_vars <- cat_vars
}
),
private = list(
x = NULL,
y = NULL,
method_helper = NULL
)
)
90 changes: 90 additions & 0 deletions R/ml_helper_base.R
@@ -0,0 +1,90 @@
.organize_parameter_grid <- function(self, private) {
if (!is.null(private$cat_vars)) {
private$method_helper$execute_params$cat_vars <- private$cat_vars
}

if (!is.null(self$parameter_grid)) {
# even if there is only one param setting, expand to grid here to make
# this code working in any case
if (!is.data.frame(self$parameter_grid)) {
self$parameter_grid <- expand.grid(self$parameter_grid)
}
# normalize self$parameter_grid to a plain data.frame here; it is converted
# to a data.table further below, where the data.table-syntax is needed
self$parameter_grid <- as.data.frame(
self$parameter_grid,
stringsAsFactors = FALSE
)

# logic to detect zero-variance variables
zero_variance <- vapply(
X = self$parameter_grid,
FUN = function(x) {
if (is.expression(x)) {
# all expressions are considered to be zero-variance in grid
return(TRUE)
} else {
if (length(unique(x)) == 1L) {
return(TRUE)
} else {
return(FALSE)
}
}
},
FUN.VALUE = logical(1L)
)
vec <- colnames(self$parameter_grid)[zero_variance]

if (length(vec) > 0L) {
# if a column is an expression, data.table currently fails with an
# error, while data.frame works; to select the appropriate columns,
# we then convert back to a data.table
private$method_helper$execute_params$parameter_grid <-
data.table::as.data.table(
self$parameter_grid
)[, .SD, .SDcols = !vec]
params_not_optimized <- data.table::as.data.table(
self$parameter_grid[1L, ]
)[, .SD, .SDcols = vec]
# sapply trick to remove attributes
private$method_helper$execute_params$params_not_optimized <- sapply(
X = names(params_not_optimized),
FUN = function(x) {
params_not_optimized[[x]]
},
simplify = FALSE,
USE.NAMES = TRUE
)
} else {
private$method_helper$execute_params$parameter_grid <-
data.table::as.data.table(self$parameter_grid)
}
}

# append learner_args to params_not_optimized
if (!is.null(self$learner_args)) {
stopifnot(
is.list(self$learner_args),
ifelse(
test = !is.null(
private$method_helper$execute_params$params_not_optimized
),
yes = length(setdiff(
names(self$learner_args),
names(
private$method_helper$execute_params$params_not_optimized
))) == length(self$learner_args),
no = TRUE
),
length(setdiff(names(self$learner_args),
names(private$method_helper$execute_params$parameter_grid))) ==
length(self$learner_args)
)


private$method_helper$execute_params$params_not_optimized <-
kdry::list.append(
self$learner_args,
private$method_helper$execute_params$params_not_optimized
)
}
}
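# A minimal sketch (not part of the package) of the zero-variance split
# above: constant columns (and expression columns) are moved out of the
# search grid into 'params_not_optimized'.
grid <- expand.grid(k = seq(4L, 12L, 4L), l = 0L)
zero_variance <- vapply(
  X = grid,
  FUN = function(x) length(unique(x)) == 1L,
  FUN.VALUE = logical(1L)
)
grid[, !zero_variance, drop = FALSE]            # tuned: column 'k'
as.list(grid[1L, zero_variance, drop = FALSE])  # constant: list(l = 0L)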
42 changes: 42 additions & 0 deletions R/optimizer_class_base.R
@@ -0,0 +1,42 @@
BaseOptimizer <- R6::R6Class( # nolint
inherit = MLBase,
classname = "BaseOptimizer",
public = list(
parameter_grid = NULL,
learner = NULL,
initialize = function(learner, seed, ncores) {
stopifnot(R6::is.R6(learner))
super$initialize(seed = seed, ncores = ncores)
self$learner <- learner
},
execute = function(x, y, method_helper, ncores, seed) {
if (is.null(method_helper$execute_params$parameter_grid)) {
if (is.null(private$method_helper$execute_params$parameter_grid)) {
.organize_parameter_grid(self, private)
}
method_helper$execute_params <- private$method_helper$execute_params
}
FUN <- switch( # nolint
EXPR = private$strategy,
"grid" = .grid_optimize,
"bayesian" = .bayesian_optimize
)
optim_results <- do.call(
what = FUN,
args = list(
self = self,
private = private,
x = x,
y = y,
method_helper = method_helper
)
)
self$results <- optim_results
return(self$results)
}
),
private = list(
strategy = NULL,
method_helper = NULL
)
)
35 changes: 35 additions & 0 deletions R/optimizer_class_bayesian.R
@@ -0,0 +1,35 @@
BayesianOptimizer <- R6::R6Class( # nolint
inherit = BaseOptimizer,
classname = "BayesianOptimizer",
public = list(
#' @field optim_args A list with the arguments that are passed to
#' \code{ParBayesianOptimization::bayesOpt}
optim_args = NULL,
parameter_bounds = NULL,
initialize = function(learner, seed, ncores, ...) {
if (!requireNamespace("ParBayesianOptimization", quietly = TRUE)) {
stop(
paste0(
"Package \"ParBayesianOptimization\" must be installed to use ",
"'strategy = \"bayesian\"'."
),
call. = FALSE
)
}
super$initialize(learner = learner, seed = seed, ncores = ncores)
private$strategy <- "bayesian"
kwargs <- kdry::misc_argument_catcher(...)

default_args <- formals(ParBayesianOptimization::bayesOpt)
self$optim_args <- default_args[!sapply(default_args, is.symbol)]

# update arguments
if (length(kwargs) > 0) {
self$optim_args <- kdry::list.update(
main_list = self$optim_args,
new_list = kwargs
)
}
}
)
)
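# A minimal sketch (not part of the package) of the formals() trick used in
# the initializer above: arguments without a default value are (empty)
# symbols and get filtered out, leaving only the overridable defaults
# (demonstrated with rnorm() to stay dependency-free).
defaults <- formals(rnorm)
defaults[!sapply(defaults, is.symbol)] # keeps mean = 0 and sd = 1; drops 'n'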
10 changes: 10 additions & 0 deletions R/optimizer_class_grid.R
@@ -0,0 +1,10 @@
GridOptimizer <- R6::R6Class( # nolint
inherit = BaseOptimizer,
classname = "GridOptimizer",
public = list(
initialize = function(learner, seed, ncores, ...) {
super$initialize(learner = learner, seed = seed, ncores = ncores)
private$strategy <- "grid"
}
)
)
51 changes: 51 additions & 0 deletions R/optimizer_helper_base.R
@@ -0,0 +1,51 @@
.config_grid_optimizer <- function(self, private) {
stopifnot(
is.list(private$method_helper$fold_list)
)
# init and configure optimizer
optimizer <- GridOptimizer$new(
learner = self$learner,
seed = private$seed,
ncores = private$ncores
)
return(optimizer)
}

.config_bayesian_optimizer <- function(self, private) {
stopifnot(
!is.null(self$parameter_bounds),
is.list(private$method_helper$fold_list)
)
# init and configure optimizer
optimizer <- BayesianOptimizer$new(
learner = self$learner,
seed = private$seed,
ncores = private$ncores,
... = self$optim_args
)
if (private$ncores > 1L) {
optimizer$optim_args$parallel <- TRUE
}
optimizer$parameter_bounds <- self$parameter_bounds

return(optimizer)
}

.run_optimizer <- function(self, private, optimizer) {
optim_results <- optimizer$execute(
x = private$x,
y = private$y,
method_helper = private$method_helper
)

outlist <- .optimize_postprocessing(
self = self,
private = private,
results_object = optim_results,
metric_higher_better = optimizer$learner$metric_optimization_higher_better
)

outlist <- kdry::list.append(outlist, private$method_helper)
class(outlist) <- c("list", "mlexTune")
self$results <- outlist
}
138 changes: 138 additions & 0 deletions R/optimizer_helper_bayesian.R
@@ -0,0 +1,138 @@
.bayesian_optimize <- function(
self, private,
x,
y,
method_helper
) {
stopifnot(!is.null(self$parameter_bounds))
if (self$optim_args$parallel) {
stopifnot(!is.null(self$learner$cluster_export))
message(sprintf(
"\nRegistering parallel backend using %s cores.",
private$ncores
))

cl <- kdry::pch_register_parallel(private$ncores)

self$optim_args$iters.k <- private$ncores

on.exit(
expr = {
kdry::pch_clean_up(cl)
# reset random number generator
RNGkind(kind = "default")
invisible(gc())
}
)
# cluster options
cluster_options <- kdry::misc_subset_options("mlexperiments")
# required for cluster export
assign(
x = "seed",
value = private$seed
)
# export from current env
parallel::clusterExport(
cl = cl,
varlist = c(
"x", "y", "seed", "method_helper", # , "ncores" #, "cluster_load"
"cluster_options"
),
envir = environment()
)

# export learner functions from the learner's namespace or the global env
if (self$learner$environment != -1L) {
# https://stackoverflow.com/questions/67595111/r-package-design-how-to-
# export-internal-functions-to-a-cluster
stopifnot(is.character(self$learner$environment))
ns <- asNamespace(self$learner$environment)
parallel::clusterExport(
cl = cl,
varlist = self$learner$cluster_export,
envir = as.environment(ns)
)
} else {
parallel::clusterExport(
cl = cl,
varlist = self$learner$cluster_export,
envir = -1L
)
}
parallel::clusterSetRNGStream(
cl = cl,
iseed = private$seed
)
parallel::clusterEvalQ(
cl = cl,
expr = {
# set cluster options
options(cluster_options)
RNGkind("L'Ecuyer-CMRG")
# set seed in each job for reproducibility
set.seed(seed)
}
)
}

# in any case, update gsPoints here, as default calculation fails when
# calling bayesOpt with do.call
if (identical(str2lang("pmax(100, length(bounds)^3)"),
self$optim_args[["gsPoints"]])) {
self$optim_args[["gsPoints"]] <- pmax(100, length(self$parameter_bounds)^3)
}

args <- kdry::list.append(
list(
# for each method, a bayesian scoring function is required
FUN = self$learner$bayesian_scoring_function,
bounds = self$parameter_bounds,
initGrid = method_helper$execute_params$parameter_grid
),
self$optim_args
)

# avoid an error in bayesOpt(): only one of initGrid / initPoints may be set
if (!is.null(method_helper$execute_params$parameter_grid)) {
args <- args[names(args) != "initPoints"]
} else {
args <- args[names(args) != "initGrid"]
}

set.seed(private$seed)
opt_obj <- do.call(ParBayesianOptimization::bayesOpt, args)
return(opt_obj)
}

.bayesopt_postprocessing <- function(self, private, object) {
stopifnot(inherits(x = object, what = "bayesOpt"))
exl_cols <- vapply(
X = private$method_helper$execute_params$params_not_optimized,
FUN = is.expression,
FUN.VALUE = logical(1L)
)
optim_results <- cbind(
data.table::as.data.table(object$scoreSummary),
data.table::as.data.table(
private$method_helper$execute_params$params_not_optimized[!exl_cols]
)
)

colnames(optim_results)[grepl(
pattern = "Iteration", x = colnames(optim_results))
] <- "setting_id"

return(optim_results)
}
77 changes: 77 additions & 0 deletions R/optimizer_helper_grid.R
@@ -0,0 +1,77 @@
.grid_optimize <- function(
self, private,
x,
y,
method_helper
) {
stopifnot(
(ngrid <- nrow(method_helper$execute_params$parameter_grid)) > 1L
)

# init a progress bar
pb <- progress::progress_bar$new(
format = "\nParameter settings [:bar] :current/:total (:percent)",
total = ngrid
)

optim_results <- lapply(
X = seq_len(ngrid),
FUN = function(setting_id) {

# increment progress bar
pb$tick()

# get the relevant row from the parameter grid with the hyperparameters to
# use in this loop;
# this code is required to have named arguments and to allow selection of
# expression columns (which is not possible with plain data.table syntax)
grid_search_params <- sapply(
X = colnames(method_helper$execute_params$parameter_grid),
FUN = function(x) {
mhcn <- colnames(method_helper$execute_params$parameter_grid)
xcol <- which(mhcn == x)
method_helper$execute_params$parameter_grid[
setting_id, get(mhcn[xcol])
]
},
simplify = FALSE,
USE.NAMES = TRUE
)

params <- .method_params_refactor(
grid_search_params,
method_helper
)

FUN <- self$learner$cross_validation # nolint

fun_parameters <- list(
"x" = x,
"y" = y,
"params" = params,
"fold_list" = method_helper$fold_list,
"ncores" = private$ncores,
"seed" = private$seed
)

set.seed(private$seed)
fit_grid <- do.call(FUN, fun_parameters)

ret <- data.table::as.data.table(
c(
list("setting_id" = setting_id),
fit_grid,
params[
setdiff(names(params), names(fit_grid))
]
)
)
return(ret)
}
)
return(optim_results)
}
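# A minimal sketch (not part of the package) of the row extraction above: one
# grid row is turned into a named list via sapply()/get(), which keeps the
# parameter names and also works for expression columns that plain
# data.table column selection cannot handle.
pg <- data.table::data.table(k = c(4L, 8L), l = 0L)
sapply(
  X = colnames(pg),
  FUN = function(cn) pg[2L, get(cn)],
  simplify = FALSE,
  USE.NAMES = TRUE
)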
195 changes: 195 additions & 0 deletions R/tune_class_base.R
@@ -0,0 +1,195 @@
#' @title R6 Class to perform hyperparameter tuning experiments
#'
#' @description
#' The `MLTuneParameters` class is used to construct a parameter tuner object
#' and to perform the tuning of a set of hyperparameters for a specified
#' machine learning algorithm using either a grid search or a Bayesian
#' optimization.
#'
#' @details
#' The hyperparameter tuning can be performed with a grid search or a Bayesian
#' optimization. In both cases, each hyperparameter setting is evaluated in a
#' k-fold cross-validation on the dataset specified.
#'
#' @seealso [ParBayesianOptimization::bayesOpt()], [splitTools::create_folds()]
#'
#' @examples
#' knn_tuner <- MLTuneParameters$new(
#' learner = LearnerKnn$new(),
#' seed = 123,
#' strategy = "grid",
#' ncores = 2
#' )
#'
#' @export
#'
MLTuneParameters <- R6::R6Class( # nolint
classname = "MLTuneParameters",
inherit = MLExperimentsBase,
public = list(
#' @field parameter_bounds A named list of tuples to define the parameter
#' bounds of the Bayesian hyperparameter optimization. For further details
#' please see the documentation of the `ParBayesianOptimization` package.
parameter_bounds = NULL,

#' @field parameter_grid A matrix with named columns in which each column
#' represents a parameter that should be optimized and each row represents
#' a specific hyperparameter setting that should be tested throughout the
#' procedure. For `strategy = "grid"`, each row of the `parameter_grid` is
#' considered as a setting that is evaluated. For `strategy = "bayesian"`,
#' the `parameter_grid` is passed further on to the `initGrid` argument of
#' the function [ParBayesianOptimization::bayesOpt()] in order to
#' initialize the Bayesian process. The maximum rows considered for
#' initializing the Bayesian process can be specified with the R option
#' `option("mlexperiments.bayesian.max_init")`, which is set to `50L` by
#' default.
parameter_grid = NULL,

#' @field optim_args A named list of arguments that are passed on to
#' [ParBayesianOptimization::bayesOpt()] when `strategy = "bayesian"`. For
#' further details please see the documentation of the
#' `ParBayesianOptimization` package.
optim_args = NULL,

#' @field split_type A character. The splitting strategy to construct the
#' k cross-validation folds. This parameter is passed further on to the
#' function [splitTools::create_folds()] and defaults to `"stratified"`.
split_type = NULL,

#' @field split_vector A vector. If a criterion other than the provided `y`
#' should be considered for generating the cross-validation folds, it can
#' be defined here. It is important that the vector contains one entry for
#' each observation (row) in `x`.
split_vector = NULL,

#' @description
#' Create a new `MLTuneParameters` object.
#'
#' @param learner An initialized learner object that inherits from class
#' `"MLLearnerBase"`.
#' @param seed An integer. Needs to be set for reproducibility purposes.
#' @param strategy A character. The strategy to optimize the hyperparameters
#' (either `"grid"` or `"bayesian"`).
#' @param ncores An integer to specify the number of cores used for
#' parallelization (default: `-1L`).
#'
#' @return A new `MLTuneParameters` R6 object.
#'
#' @examples
#' MLTuneParameters$new(
#' learner = LearnerKnn$new(),
#' seed = 123,
#' strategy = "grid",
#' ncores = 2
#' )
#'
initialize = function(
learner,
seed,
strategy = c("grid", "bayesian"),
ncores = -1L
) {
super$initialize(learner = learner, seed = seed, ncores = ncores)
stopifnot(
!is.null(self$learner$.__enclos_env__$private$fun_optim_cv)
)
strategy <- match.arg(strategy)
stopifnot(
ifelse(
test = strategy == "bayesian",
yes = !is.null(self$learner$cluster_export),
no = TRUE
),
!is.null(
self$learner$.__enclos_env__$private$fun_bayesian_scoring_function
)
)
private$strategy <- strategy

# init some stuff
private$method_helper <- list()
self$split_type <- "stratified"

private$select_optimizer <- switch(
EXPR = strategy,
"grid" = .config_grid_optimizer,
"bayesian" = .config_bayesian_optimizer
)

},

#' @description
#' Execute the hyperparameter tuning.
#'
#' @param k An integer to define the number of cross-validation folds used
#' to tune the hyperparameters.
#'
#' @details
#' All results of the hyperparameter tuning are saved in the field
#' `$results` of the `MLTuneParameters` class. After successful execution
#' of the parameter tuning, `$results` contains a list with the items
#' \describe{
#' \item{"summary"}{A data.table with the summarized results (same as
#' the returned value of the `execute` method).}
#' \item{"best.setting"}{The best setting (according to the learner's
#' parameter `metric_optimization_higher_better`) identified during the
#' hyperparameter tuning.}
#' \item{"bayesOpt"}{The returned value of
#' [ParBayesianOptimization::bayesOpt()] (only for `strategy =
#' "bayesian"`).}
#' }
#'
#' @return A `data.table` with the results of the hyperparameter
#' optimization. The optimized metric, i.e. the cross-validated evaluation
#' metric is given in the column `metric_optim_mean`. More results are
#' accessible from the field `$results` of the `MLTuneParameters` class.
#'
#' @examples
#' dataset <- do.call(
#' cbind,
#' c(sapply(paste0("col", 1:6), function(x) {
#' rnorm(n = 500)
#' },
#' USE.NAMES = TRUE,
#' simplify = FALSE
#' ),
#' list(target = sample(0:1, 500, TRUE))
#' ))
#' tuner <- MLTuneParameters$new(
#' learner = LearnerKnn$new(),
#' seed = 123,
#' strategy = "grid",
#' ncores = 2
#' )
#' tuner$parameter_bounds <- list(k = c(2L, 80L))
#' tuner$parameter_grid <- expand.grid(
#' k = seq(4, 68, 8),
#' l = 0,
#' test = parse(text = "fold_test$x")
#' )
#' tuner$split_type <- "stratified"
#' tuner$optim_args <- list(
#' iters.n = 4,
#' kappa = 3.5,
#' acq = "ucb"
#' )
#'
#' # set data
#' tuner$set_data(
#' x = data.matrix(dataset[, -7]),
#' y = dataset[, 7]
#' )
#'
#' tuner$execute(k = 3)
#'
execute = function(k) {
.tune_init(self, private, k)
optimizer <- private$select_optimizer(self, private)
return(.run_tuning(self = self, private = private, optimizer = optimizer))
}
),
private = list(
select_optimizer = NULL,
strategy = NULL,
tune_params = NULL
)
)