CPO: Composable Preprocessing Operators #1827
R/measures.R:1435:73: style: Use FALSE instead of the symbol F.
perror = pec(probs, f, data = newdata[, tn], times = grid, exact = F, exactness = 99L,
3660535 to 7044b17
What's the status here? Does this still need to be merged for mlrCPO to work?
Yes please,
Thanks, merging.
zmjones pushed a commit that referenced this pull request Dec 19, 2017
* Introducing Composable Preprocessing Objects. * ParamSet syntactic sugar * Make git ignore emacs temp files * Bugfixes in ParamSetSugar * Automatically generate function from braced expressions * lintr fixes * Creation of CPOObject * Nice printing * CPOObject concatenation * CPO composition * Composition operator also for attachment * wrapping seems to work now * some experiments * Some reorganizing * Implemented CPOFunctional, against all odds. Probably full of bugs still. * Reorg: Organize CPOObjectBased, CPOFunctional the same * Bugfixes * CPOObject now doesn't need to return 'control', just create it. * lintr * ParamSetSugar test * Indentation * Testing most of CPO, excluding ParamSet feasibility checks * Bugfixes, found through tests * setHyperPars: assert uniquely named parameters * Testing hyperparameter feasibility * Test parameter feasibility * Testing actual data transformation * Testing CPO trafo functions * Testing requirement handling * Forgive absence of parameters with unfulfilled requirements * Requirement handling when changing ID * Repair global var problems in S3 methods in cpo tests * lintr * Application operator * Corrected copy-paste caused typo * Inform user when he forgets to construct CPO * Documentation * Make R CMD check --no-test happy * lintr doesnt recognize CPO function definitions as functions * Retrafo set / access functions * paramSetSugar parameter pss.* now have dot prefix for R param matching reasons * retrafo() machinery * Functional CPO now uses retrafo() * Roxygenise * Tests work again * Bugfixes * More informative error messages * More informative error messages * Testing for error handling * Embarrassing! 
* static analyzer safe paramSetSugar * Using NA instead of dot to indicate missing parameter * Cleaning up documentation * Documentation fixes * lintr * Turn chain of preprocs into list, and assemble list into chain * roxygenize * Put common CPO test objects into helper_cpo.R * Refactor chainung and un-chaining * Chaining, unchaining of object based retrafos * use 'predict' to apply retrafos * lint * Adding get / set retrafo state functionality * Adding get state and makefromstate for object based * Cleaning up CPO object based * Testing for retrafo state * Cleaning up CPOFunctional * Adding get state and makefromstate for functional based * lintr * R CMD check * Small test correction Evidently I should clear my .GlobalEnv before running tests. * small comment change * retrafo assignment now checks for type, not function * Adding properties parameters * Added docu, todo * Starting task shape verification things * Get format check its own file * Auxiliary files reorg * changed object based callCPO[Re]Trafo, need to propagate the changes now * One more step towards properties & data shape checking * Tests pass again * cleaning up * Make lint approve of TODOs temporarily * Tests first half of target type functionality * checkLearnerBeforeTrain: Wrong error message when unordered not supported * lintr * NOOP * Finished datasplit tests * Most property tests are done * get CPO from learner * Ported properties and datasplit to functional * Tests pass * roxygenise * Make tests faster * travis timeout ++ * Travis timeout +++ * Rewritten CPO core. A beauty to behold! This removes 'makeCPOFunctional' and 'makeCPOObject' and replaces them both with 'makeCPO'. 
* Travis timeout ++++ * Added 'factor', 'ordered', 'onlyfactor', 'numeric' datasplit Numeric splits also support matrix instead of data frame * Introducing NULLCPO, the neutral element of the CPO monad * Starting targetbound CPO * Targetbound CPO Task conversion backend * static code analyser found bugs * Making big steps towards target CPOs * to-do list, travis timeout ++ again * Pretty much done with target-bound CPO * Introducing stateless CPOs * is.nullcpo * Completing stateless * Roxygenise * Tests pass * lintr * stateless trafo-less CPO * ShapeInfo printing * Nicer ShapeInfo printing * More generics for getting CPO information * Multiplexer, Applicator * Renaming test files * Split up test_cpo_datasplit into *_datasplit and *_properties * Checking par.vals availability at the right places * Proper datasplit numeric / factor / etc handling * New tests * Datasplit numeric, factor, ordered, onlyfactor finally seem to work * Finished datasplit tests * summary bug * example CPOs handle DFs containing non-numeric columns * repair summary * Accept NA vector length * Accept character vectors for discrete character params * cpoSelect CPO * cpoSelect Params Reorg * cpoCbind * Check more rigorously that CPOs don't get called too often. 
* Test cpoCbind with tasks * listCPO * Testing concrete CPOs so far * Fix retrafo column name test * fix.factors * dummy encoder * Column selection by name * invert option for cpoSelect * Starting to implement affect.* * Interpreting subset * Fixing some bugs, implementing some tests, for affect subset * Don't print meta-params for CPO constructors * Tests for affect.* done * Collect meta-CPOs in a separate file * Finishing cpoMeta and its tests * lintr * Export cpoMeta * Fix summary bug * Export some functions I forgot to export * Adding jupyter vignette * Adding html rendered version of vignette * update .gitignore * Compact html vignette * Fixing test bug for impute * Impute CPO + tests * Adding specialised CPO imputers * CPO Imputers get their own file * lintr * Test that dummys are not created when the flag says so. * Updating Vignette * checkMeasures: instead of missing(), use NULL * Feature Filters * Introducing applyCPO: apply a CPO to a Task / df * Introduce composeCPO: composing two CPOs * Introducing attachCPO: Attaching a CPO to a learner. * Adjust properties of imputers that can only handle certain types * Forgot export * filter features now only operate on the columns of the right type * Constant Feature Remover CPO * CPO for fixing factors * cpoDummyEncode works much better now * MissingIndicators CPO * Repairing checkMeasures * Better travis check * cpoCbind bugfixes * bugfix * Vignette updates * use cases ipynb * Recursive application of CPO fix * roxygenise * Avoid warnings when load_all-ing mlr * Now possible to specify packages associated with CPOs * Bugfix * Fix blackboost bug TODO: report this * Revert "Avoid warnings when load_all-ing mlr" This reverts commit ba2cf73. Reverting this because I made an extra pull-request * Revert "Fix blackboost bug" This reverts commit 9e1007f. 
Reverting this b/c I made an extra PR * Fix imputation of empty df bug * cpoScale fix for 1-column data * cpoApplyFun * cpoRangeScale * cpoProbEncode * Impact encoding * Adding new CPOs * rename cpoRangeScale -> cpoScaleRange * More natural handling of 'stateless' cpo * .retrafo.format added, with new 'combined' option * optionally only export subset of parameters * fix export * removing 'CPOS3Primitive' class, as ordered * removing 'CPOS3Constructor' class, as ordered * removing 'CPOS3RetrafoPrimitive', as ordered * simplifying class structure, as ordered * simplifying class structure further * done simplifying class structure * makeCPO documentation * Handle ID correctly for non-exported values * Pretty printing; Bugfixes * A few new CPOs * cpoSpatialSign * tuning CPO test * tuning CPO test II * bugfixes * roxygenise * Vignette: Examples, CPOs, Construction * cleaning up vignette * vignette html export * Bugfix * Tuning vignette * vignette * exporting necessary things * deleting superfluous files which are now in mlrCPO * forgot necessary function * Export makeBaseWrapper * Removing paramSetSugar * Roxygenise * Exporting changeData * Removing the last few bits from before the cpo-mlr-split * Export checkPredictLearnerOutput for mlrCPO * add 'keywords internal' to internal use only functions * roxygenise * overlooked while merging * %%
This is my GSoC project. See the preliminary vignette for a quick overview (more compact version with the R output removed).
Description for a General Audience
@everyone. If you have questions, ideas or feedback, please don't hesitate to write me, here or in other places!
What is this?
Functions for data manipulation and pre-processing ➕ a replacement for makePreprocWrapper ➕ lots of syntactic sugar.
Description
CPOs are called like functions and create an object that has hyperparameters that can be manipulated using getHyperPars, setHyperPars etc.
These objects can be applied to Tasks or data.frames to manipulate data, or can be attached to a Learner to create a wrapped learner similar to makePreprocWrapper.
Custom CPO constructors can be created using makeCPOObject or makeCPOFunctional. Note it is possible to write the (re)transformation operations with curly braces, with the function header getting added automatically.
Implementation details
@berndbischl, @mllg
"CPO" (the name)
Would you want me to use "TaskTransform" instead of CPO (or something entirely different)?
makeParamSet Syntactic Sugar
I wrote a function paramSetSugar that makes creating ParamSets much less painful. Example:
Do you like this idea in general (maybe you want to incorporate it into ParamHelpers?) or would you rather I not use this in my project?
Object based vs. Functional CPOs
I implemented both, one in R/CPOObjectBased.R, the other in R/CPOFunctional.R; the code shared by both is mostly in R/CPOAuxiliary.R. Both have some advantages and disadvantages. The object based could use less memory in theory, since it does not carry around an environment in its model that usually contains the training data. It is also easier to debug if you like to use debugonce. In turn, the functional implementation can be applied directly to Task objects (since the CPO objects are just functions in this case) and could probably quite easily be coerced into collaborating with the magrittr package.
Maybe have a look at the concrete implementations of mine to see which one you like more.
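The memory trade-off described above can be illustrated in base R. The sketch below is not the code from this PR: trafoFunctional, trafoObject and retrafoObject are hypothetical names, and centering columns stands in for an arbitrary preprocessing step. The functional variant returns the retransformation as a closure, which keeps its whole defining environment (including the training data) alive; the object-style variant returns only an explicit control list, in the manner of makePreprocWrapper.

```r
# Hypothetical names; a base-R illustration only, not the PR's implementation.

# Functional style: the retrafo is a closure created inside the trafo call.
trafoFunctional <- function(data) {
  center <- colMeans(data)
  retrafo <- function(newdata) sweep(newdata, 2, center)
  list(data = sweep(data, 2, center), retrafo = retrafo)
}

# Object-based style: the trafo returns the data plus an explicit control
# object, and a separate retrafo function receives that control later.
trafoObject <- function(data) {
  center <- colMeans(data)
  list(data = sweep(data, 2, center), control = list(center = center))
}
retrafoObject <- function(newdata, control) sweep(newdata, 2, control$center)

train <- matrix(c(1, 2, 3, 4, 5, 6), ncol = 2)
test <- matrix(c(10, 20), ncol = 2)

f <- trafoFunctional(train)
o <- trafoObject(train)

# Both styles retransform new data identically.
resF <- f$retrafo(test)
resO <- retrafoObject(test, o$control)

# The closure's environment still holds the training data; the control
# object only holds what the retrafo actually needs.
closureVars <- ls(environment(f$retrafo))
controlVars <- names(o$control)
```

Inspecting closureVars shows that the training data travels along with the closure, whereas controlVars lists only the centering vector. This is the reason an object-based design can be lighter on memory when models are stored.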
Note About Object Based Implementation
The makePreprocWrapper implementation in mlr relies on the transformation function returning an object list(data = [data], control = [control]). I had the idea of just having it return the resulting data, and using R magic to inspect the function's environment to get at the control. See e.g. the implementation of cpoScale:
What is your opinion about this? Alternatives are:
Copying the entire cpo.trafo namespace to cpo.retrafo, so the user wouldn't need to worry about which variables are available and which are not. The downside to this: This would take the entire training data and save it inside the model, might be memory intensive.
I could also stop being fancy and just return the list(data, control) as in makePreprocWrapper.
There is a way to inspect cpo.retrafo and copy only the objects that are used by it, but this inspection is bound to be incomplete (the problem is halting problem equivalent) and could copy more data than the retrafo part needs.
Composition operator
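As background for this section: user-defined %op% operators in R are always left-associative, so a %>>% b %>>% c is evaluated as (a %>>% b) %>>% c. The sketch below uses toy stand-ins (toyCPO, toyLearner, a numeric vector as "task"), all of them hypothetical, to show what that means for chaining. It only mimics the shape of the problem and is not the operator implemented in this PR.

```r
# Toy stand-ins (hypothetical): a "task" is a numeric vector, a "cpo" wraps a
# function, and a "learner" wraps a train function.
toyCPO <- function(f) structure(list(f = f), class = "toyCPO")
toyLearner <- function(train) structure(list(train = train), class = "toyLearner")

`%>>%` <- function(lhs, rhs) {
  if (inherits(lhs, "toyCPO") && inherits(rhs, "toyCPO")) {
    # CPO %>>% CPO: compose, applying lhs first.
    toyCPO(function(x) rhs$f(lhs$f(x)))
  } else if (inherits(lhs, "toyCPO") && inherits(rhs, "toyLearner")) {
    # CPO %>>% Learner: a learner that preprocesses before training.
    toyLearner(function(x) rhs$train(lhs$f(x)))
  } else if (is.numeric(lhs) && inherits(rhs, "toyCPO")) {
    # Task %>>% CPO: apply the CPO to the data right away.
    rhs$f(lhs)
  } else {
    stop("unsupported combination")
  }
}

timesTwo <- toyCPO(function(x) x * 2)
addOne <- toyCPO(function(x) x + 1)
meanLearner <- toyLearner(function(x) mean(x))

task <- c(1, 2, 3)

# Evaluated as (task %>>% timesTwo) %>>% addOne.
transformed <- task %>>% timesTwo %>>% addOne

# Attaching works when the whole CPO chain sits to the left of the learner.
wrapped <- timesTwo %>>% addOne %>>% meanLearner
model <- wrapped$train(task)

# Evaluated as (task %>>% timesTwo) %>>% meanLearner: by the time the learner
# is reached, the left-hand side is already transformed data, not a CPO.
chained <- tryCatch(task %>>% timesTwo %>>% meanLearner,
                    error = function(e) conditionMessage(e))
```

In this toy setup the last expression fails because the operator never sees the CPO and the learner together. Grouping the right-hand part explicitly with parentheses, as in task %>>% (timesTwo %>>% meanLearner), would be needed to express a different reading.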
I chose %>>%, since it is similar to, but not used by, magrittr. It applies to CPOs in conjunction with Learners on the right and Tasks on the left, but does not do Task %>>% CPO %>>% Learner because of the associativity problem.
State of implementation
The current roadmap, as I see it; comments?
(R/ParamSetSugar.R)
(R/CPOObjectBased.R)
(R/CPOFunctional.R)
predict step without being attached to a Learner
(R/CPO_concrete.R)
properties handling
Maybe some day...