Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[R-package] callbacks per #892 #1264

Merged
merged 26 commits into from
Jul 3, 2016
Merged

[R-package] callbacks per #892 #1264

merged 26 commits into from
Jul 3, 2016

Conversation

khotilov
Copy link
Member

@khotilov khotilov commented Jun 9, 2016

The majority of stuff is working, but some things are raw. The code docs would need some more work, and some changes in how things are handled might be needed (see below and also some TODO markers in the code). I'm pr-ing it mainly to get some feedback.
Note that this coming weekend I would not be able to do any work on all this.

Work notes:

  • Added the callbacks that are mostly similar as in python, plus the cb.save_model one. The 'cb.' prefix is purely for convenience when using the tab-completion, and I'm open to other naming suggestions. Some callbacks have a "finalizer" functionality.
  • Instead of the reset_learning_rates callback, added a more general cb.reset_parameters
  • The cb.save_model callback uses save_period=0 to save the last model (the changed default save_period=NULL in xgb.train would not enable the saving). It also allows to include iteration number into the file name.
  • The cb.early_stop callback allows to set a specific metric column name to use. If it is not set, the last evaluation column is used. Choosing the last one is more natural, and python API does that too.
  • Added training continuation to xgb.train
  • Cleaning: making the training flow better structured, commented, with reusable blocks, and the code to be more R-ish
  • WARNING: return value change for the train and cv methods. My goal was to have a somewhat unified output interface among the two, that would also be what an R user would normally expect, which is a list of various output data items (as opposed to the previous part-list/part-attributes mix, or CV's output changing to a different structure when prediction in CV is required).
  • Changed iteration index to be consistently 1-indexed in R
  • Added formatted print methods for xgb.Booster and xgb.cv.result
  • Changed to use the roxygen-handled useDynLib library loading mechanism in place of doing it manually via onLoad/onUnload. That plays better with devtools.
  • Merged the documentation of xgboost and xgb.Booster to reduce the redundancy
  • Added extra examples to the docs of xgb.train and xgboost
  • Added bunch of additional unit tests

List of some things and questions to consider:

  • predictions from CV: 0) leave them as they are 1) create a callback for that 2) return folds' models, and have a separate predict method to use with the cv-output object? I'm torn b/w 2.0 version, lots of changes #1 and fix loss_type #2.
  • It might make sense to ensure that the early stopping callback is the last
  • Why not to use the limit on number of iterations in predict instead on the number of trees?
  • I was using the style option of "single statement after 'if' goes on a new line with no curly brackets". I find that easier to read when there are many if-checks. But I can change it to the "lots-of-{}" style if you wish.
  • There is still a fair amount of parameter name style inconsistencies among the R package methods. Would it be a good time to make those consistent? Or even as close to python interface as makes sense? (can leave the old ones for some time with deprecation warning)
  • In addition to the current 'synchronous' CV method, I would like to add some time later a simple 'sequential' CV-wrapper for xgb.train, which would be better suited for the classic CV-based model evaluation. I'm mentioning it here because it might be done either as a separate strategy option within xgb.cv, or as a separate function (I prefer the latter). Since it might affect the current xgb.cv interface, I'm looking for some thoughts on that.

@tqchen
Copy link
Member

tqchen commented Jun 10, 2016

LGTM, wait @hetong007 for review

@hetong007
Copy link
Member

Hi @khotilov, Thanks for the great contribution. I am reading the code, it is a bit long so I will discuss with you tomorrow.

It is an important improvement of the R-package, so please also add your name to the author list in DESCRIPTION as well :)

@hetong007
Copy link
Member

My thoughts on your questions:

  • predictions from CV: The first option might be the best, since I don't see any other benefit of exposing the models from each fold to users.
  • Number of iterations: If num_parallel_tree is set, then the number of trees is different from the number of iterations (@tqchen is this statement correct?). I think controlling the number of iterations makes more sense.
  • I prefer the style where else is wrapped by two brackets as } else { so that there is no chance for the "Unexpected ELSE error". Another minor point is, having brackets from the beginning makes future improvement easier.
  • It will be great if the parameter names are updated with deprecation warnings.
  • For the sequential CV-wrapper, can you explain more on that point? I might have not understand why it is "better suited for the classic CV-based model evaluation". Thank you.

Some comments on the TODOs in the code:

  • For cb.reset_parameters: If you think there are some parameters better be fixed, otherwise I guess we could just leave it unrestricted.
  • For print.xgb.Booster: The set of parameters is recorded in xgb.train. If you want to print the current value for each parameter, we could manually construct a function initialized with default value for parameters, then track the changes of the parameter in the process.

Another concern is since there are a lot of changes, we might need to mention that to all the users. A new tutorial would be very helpful after these have been merged.

@tqchen
Copy link
Member

tqchen commented Jun 12, 2016

@hetong007 can you give a list of actionable items so after these are changed we can merge it in? This can help us to merge this change in quicker

@hetong007
Copy link
Member

hetong007 commented Jun 13, 2016

@khotilov If you think my comments above make sense, then the list of actions before we merge it in will be:

  • Create a call back for prediction in CV
  • Wrap if statements with brackets
  • Change the parameter style, and add deprecation warnings
  • Document the return values for xgb.train/xgb.cv
  • Add your name to the DESCRIPTION file

I think the above changes makes this commit consistent, and we can merge them in at this point.

After the commit, there are several things for us to do:

  • Write a vignette about this change, preferrably with comprehensive introduction on a simple example.
  • Control the number of iterations instead of control the number of trees
  • Add 'sequential' CV
  • Add the function to print the parameters and their current values.

@khotilov
Copy link
Member Author

Tong:
Thanks for reviewing! I am still travelling and will reply to your
questions in a day or two.
Regards,
Vadim
On Jun 13, 2016 1:47 PM, "Tong He" [email protected] wrote:

@khotilov https://github.com/khotilov If you think my comments above make
sense, then the list of actions before we merge it in will be:

  • Create a call back for prediction in CV
  • Wrap if statements with brackets
  • Change the parameter style, and add deprecation warnings
  • Document the return values for xgb.train/xgb.cv

I think the above changes makes this commit consistent, and we can merge
them in at this point.

After the commit, there are several things for us to do:

  • Write a vignette about this change, preferrably with comprehensive
    introduction on a simple example.
  • Control the number of iterations instead of control the number of trees
  • Add 'sequential' CV
  • Add the function to print the parameters and their current values.


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#1264 (comment), or mute
the thread
https://github.com/notifications/unsubscribe/ABie726IZBzLvHlPZYOV4yfdYR34OxEEks5qLZe9gaJpZM4IyAuZ
.

@khotilov
Copy link
Member Author

Below are my comments and questions

  • On CV predictions: I'll make it into a callback with an option to keep the models. My interest in CV models is usually for the debugging purposes, but some folks seem to utilize them even more broadly.
  • #iterations vs #trees: I see I was too brief in my initial statement on the subject. The idea is to use, say, a nrounds paramater in predict as a primary user-controlled knob for limiting the number of boosting iterations. The ntreelimit would stil need to be there for things like random forest, and when it is set it could take priority over the nrounds. With ntreelimit only, it's a minimal but perhaps too compressed/loaded interface, that requires a user to be quite aware about the internal structure of a model.
  • On unifying the style of parameter names: I'll start with the obvious outliers and then we can discuss how far we might want to go. Other than just the style, my practical issue is that due to many parameter names being different, it is currently awkward to share pieces of code between R and Py.
  • On the style of callback names: I am seeing my mix of a dot and underscores in there as more and more ugly... Should I simply use the dots separators as in all the other xgboost method names?
  • Limiting the parameter set in cb.reset_parameters: it might probably be not worth it for now.
  • Function to print the parameters and their current values: do you mean here that we need a C++ function that can pull in all the current model parameters and their default values, or did I misunderstand?

On the synchronous vs sequential CV... The current synchronous approach has a few downsides:

  • high memory use due to creation of data copies for each split, that have to be simultaneously kept in memory
  • it's somewhat hard to inject a data pre-processing step
  • the procedure design makes the estimation of the optimal number of iterations simple, but we cannot perform, e.g., early stopping within each fold.

And what I mean by the "classic CV-based model evaluation" is when models for each fold are built in a very similar fashion as a non-CV model would be built, preventing any possibilities for information leakage between the folds. I understand that pre-packaged CV methods are usually somewhat limited, and I am not a stranger to writing my own custom CV-based workflows. But my hope is that it should be possible to have a fairly generic but simple xgb.train CV-wrapper, that would help to limit the amount of custom boilerplate code and would not have the mentioned downsides.

@hetong007
Copy link
Member

Hi @khotilov , thank you for the explanations.

To reply to your comments:

  • On sharing code between R and python: in my understanding, sharing the code between these two languages is usually just to share the training parameters. Therefore we only need to make sure the parameter names are the same or similar enough.
  • I prefer to use more dots in the callback names, so that people knows they are functions.
  • Function to print the parameters: we could do it in C++. Another faster way is to do it in R since we know the default values for these parameters, so that we could update the values from the input. Doing it in R also means easier maintainance.

For the CV-related:

  • The first two points make sense to me. I just don't get the point to do "early stopping within each fold", because in this case the number of iterations are likely to be different among all the folds, thus it cannot guide the following training steps.
  • I think this part proposal is meaningful, but it just doesn't belong to this specific callback topic. Can you please open a new issue for this topic? In this way our discussion could be found and organized easier.

Please let me know your thoughts.

@tqchen
Copy link
Member

tqchen commented Jun 22, 2016

any updates on this thread?

@khotilov
Copy link
Member Author

Last weekend was too hectic for me to get it out, but the actions items from the list are mostly done (except the documentation). I'll get on it tomorrow evening.

And I'll open separate issues for CV and parameters later.

@khotilov
Copy link
Member Author

I've added some more tests and fixed some kinks;
moved cv-prediction to a CB, and changed CB names to be dots.style ;
changed several parameters to underscore_style and added a deprecation warning method;
added some more documentation and examples;
made some code pieces simpler;
consolidated all importFrom's in one place;
added a reshape parameter to predict.

@hetong007
Copy link
Member

Hi @khotilov , thanks for the work! I am currently attending the useR! 2016 conference and busying with my slides :P

Will review the code asap.

@khotilov khotilov mentioned this pull request Jul 2, 2016
@tqchen
Copy link
Member

tqchen commented Jul 3, 2016

@hetong007 Let us aim to merge this soon, now that useR has pasted

@hetong007
Copy link
Member

@khotilov It seems good to me now. Can you merge the latest version and ping me? Then I will merge it. Thank you!

@khotilov
Copy link
Member Author

khotilov commented Jul 3, 2016

@hetong007 the merge has completed

@hetong007 hetong007 merged commit 44ed6d5 into dmlc:master Jul 3, 2016
@hetong007
Copy link
Member

Merged, thanks.

@tqchen
Copy link
Member

tqchen commented Jul 3, 2016

Thanks for bring this in @khotilov . As next step, do you mind create a tutorial explaining how this feature can be used for various advanced tasks? We can put it onto the document, as well as the blogpost

@tqchen tqchen added this to the v0.6.1 milestone Jul 6, 2016
@tqchen tqchen modified the milestones: v0.6, v0.6.1 Jul 29, 2016
@tqchen tqchen mentioned this pull request Jul 29, 2016
@lock lock bot locked as resolved and limited conversation to collaborators Jan 19, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants