Regarding the cv_results_ #503

Closed
kaiweiang opened this issue Jun 28, 2018 · 4 comments

kaiweiang commented Jun 28, 2018

Hi,

I was reading the result from cv_results_ and trying to understand

  1. If the holdout resampling strategy is used, how are the mean_test_score and mean_fit_time computed, since the holdout method only has one train and validation set?
  2. If the resampling strategy is CV with 5 folds and the status is timeout, does it mean that one of the 5 folds ran over the time limit, or that the total time taken to run all 5 folds exceeded the limit?
  3. I'm aware the default ml_memory_limit is ~3GB (3072 MB), which is quite large. I'm wondering why some of the algorithm fits take more than that and hence cause a memout, even though my dataset is less than 50 MB. Can you provide some of the scenarios?
  4. For the time and memory limits, is the time taken for data and feature preprocessing included?

Thank you


kaiweiang commented Jun 28, 2018

The other question is: is it possible to estimate per_run_time_limit and ml_memory_limit based on the size of the dataset, to minimize the chance of timeouts and memouts occurring?


mfeurer commented Jul 2, 2018

> If the holdout resampling strategy is used, how are the mean_test_score and mean_fit_time computed, since the holdout method only has one train and validation set?

It's the mean over a single repetition, i.e. simply the score on the holdout set.
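For illustration, here is a minimal sketch (not from the original answer) of inspecting these values; it assumes the AutoSklearnClassifier interface and that cv_results_ follows the scikit-learn dict-of-arrays convention, with mean_test_score, mean_fit_time and status among the keys:

```python
import pandas as pd
import autosklearn.classification
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Holdout is the default resampling strategy: a single train/validation split,
# so each "mean" is computed over exactly one evaluation.
automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=120,
    per_run_time_limit=30,
    resampling_strategy='holdout',
)
automl.fit(X_train, y_train)

# cv_results_ is a dict of arrays, one entry per evaluated configuration.
results = pd.DataFrame(automl.cv_results_)
print(results[['mean_test_score', 'mean_fit_time', 'status']].head())
```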

> If the resampling strategy is CV with 5 folds and the status is timeout, does it mean that one of the 5 folds ran over the time limit, or that the total time taken to run all 5 folds exceeded the limit?

If you're using cv, the time limit covers all five folds. If you use partial-cv, the time limit is per fold (but this disables the use of the ensemble).
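A minimal sketch of the two settings, assuming the resampling_strategy and resampling_strategy_arguments parameters of AutoSklearnClassifier (parameter names may differ between versions):

```python
import autosklearn.classification

# 'cv': per_run_time_limit covers all five folds of one configuration.
automl_cv = autosklearn.classification.AutoSklearnClassifier(
    per_run_time_limit=60,
    resampling_strategy='cv',
    resampling_strategy_arguments={'folds': 5},
)

# 'partial-cv': the limit applies to each fold separately, but ensemble
# building is not available with this strategy.
automl_partial_cv = autosklearn.classification.AutoSklearnClassifier(
    per_run_time_limit=60,
    resampling_strategy='partial-cv',
    resampling_strategy_arguments={'folds': 5},
    ensemble_size=0,  # assumption: disable the ensemble explicitly
)
```

With a cross-validation strategy you typically also need to call refit() on the full training data before predicting, since the individual models were only fitted on folds.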

> I'm aware the default ml_memory_limit is ~3GB (3072 MB), which is quite large. I'm wondering why some of the algorithm fits take more than that and hence cause a memout, even though my dataset is less than 50 MB. Can you provide some of the scenarios?

Possible reasons for running over the memory limit are one-hot encoding and feature expansion mechanisms such as random kitchen sinks or the Nyström kernel approximation.

> For the time and memory limits, is the time taken for data and feature preprocessing included?

Yes, the time and memory limits apply to the execution of the complete pipeline.

> The other question is: is it possible to estimate per_run_time_limit and ml_memory_limit based on the size of the dataset, to minimize the chance of timeouts and memouts occurring?

Potentially yes, but we're not doing this.
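In other words, the limits are set by hand. A minimal sketch with assumed values (the parameter names come from this thread, the numbers are placeholders):

```python
import autosklearn.classification

automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=3600,  # overall optimization budget in seconds
    per_run_time_limit=360,        # budget per pipeline run, incl. preprocessing
    ml_memory_limit=6144,          # per-run memory limit in MB (default is 3072)
)
```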


kaiweiang commented Jul 4, 2018

@mfeurer Thanks for answering all my earlier questions.

I've found that target algorithms like random forest, whose max_depth parameter is set to None (which makes the trees expand until all leaves are pure or contain fewer than min_samples_split samples), are more likely to hit the memory or time limit. So, is there any way to limit max_depth to a certain depth? I'm thinking of set_params but am not sure how to use it correctly.

Apart from that, when initial_configurations_via_metalearning is set to 25, are these 25 configurations of target algorithms randomly chosen by the metalearner?

Thank you


mfeurer commented Jul 19, 2018

> So, is there any way to limit max_depth to a certain depth?

Not really. You could either change the code or create a new component with this hyperparameter activated and then deactivate the original random forest.
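A minimal sketch of the second option; exclude_estimators and add_classifier are my assumptions about the extension API, and the custom component CappedRandomForest is hypothetical (it would have to be implemented following the "extending auto-sklearn" examples):

```python
import autosklearn.classification
import autosklearn.pipeline.components.classification

# Hypothetical custom component that exposes max_depth as a hyperparameter;
# it would need to be implemented as an auto-sklearn classification algorithm
# and registered before constructing the estimator, e.g.:
# autosklearn.pipeline.components.classification.add_classifier(CappedRandomForest)

automl = autosklearn.classification.AutoSklearnClassifier(
    exclude_estimators=['random_forest'],  # deactivate the built-in random forest
)
```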

> when initial_configurations_via_metalearning is set to 25, are these 25 configurations of target algorithms randomly chosen by the metalearner?

No, they are chosen according to a k-nearest-neighbors procedure, as described in Initializing Bayesian Hyperparameter Optimization via Meta-Learning.
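For completeness, a minimal sketch of where this is controlled (25 is the default; setting it to 0 should disable the meta-learning warmstart, as far as I understand):

```python
import autosklearn.classification

automl = autosklearn.classification.AutoSklearnClassifier(
    # number of k-NN-selected configurations used to warmstart the optimizer
    initial_configurations_via_metalearning=25,
)
```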
