2470 report batch times automlsearch #3577
Conversation
* Unpin nlp_primitives but disallow v2.60
* Update release_notes.rst
* more constraint / version updates
* Throws error on describe if uninstantiated
* Added test for Exception
* Fixed linting and added to release_notes
* Made changes that Becca and Jeremy suggested
Codecov Report
@@            Coverage Diff            @@
##              main    #3577    +/-  ##
========================================
+ Coverage     99.7%    99.7%   +0.1%
========================================
  Files          335      335
  Lines        33456    33512     +56
========================================
+ Hits         33327    33383     +56
  Misses         129      129

Continue to review full report at Codecov.
Just some initial comments - lmk if they make sense! I'll give it another final review after your edits. Thanks!
@@ -216,6 +223,61 @@ def test_search_results(X_y_regression, X_y_binary, X_y_multi, automl_type, obje
     )


+def test_search_batch_times(caplog):
+    caplog.clear()
+    X, y = load_data(
We can use a smaller dataset here: check out the X_y_binary fixture in conftest.py, and just search for tests that use it as an example.
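For illustration, a minimal sketch of the suggested setup; the fixture body below is only a stand-in for the real X_y_binary fixture in conftest.py, which may differ in shape and parameters:

import pytest
from sklearn.datasets import make_classification


@pytest.fixture
def X_y_binary():
    # Small synthetic dataset instead of loading a full file from disk.
    return make_classification(n_samples=100, n_features=20, random_state=0)


def test_search_batch_times(X_y_binary, caplog):
    caplog.clear()
    X, y = X_y_binary  # fixture data replaces the load_data(...) call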
@@ -714,7 +776,7 @@ def test_large_dataset_binary(AutoMLTestEnv):
     objective=fraud_objective,
     additional_objectives=["auc", "f1", "precision"],
     max_time=1,
-    max_iterations=1,
+    max_iterations=3,
Why did we have to change this test?
That one was an oops, I think I changed that by accident.
"provider": "categorical", | ||
}, | ||
) | ||
X_train, _, y_train, _ = evalml.preprocessing.split_data( |
We can skip out on all this logic about splitting and datachecks as well. Look at something like test_pipeline_score_raises for an example.
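A sketch of what the simplified test could look like, assuming the spirit of test_pipeline_score_raises (its exact contents aren't shown here): fixture data goes straight into AutoMLSearch, with no split_data call and no data checks:

from evalml.automl import AutoMLSearch


def test_search_batch_times(X_y_binary, caplog):
    X, y = X_y_binary
    automl = AutoMLSearch(
        X_train=X, y_train=y, problem_type="binary", max_iterations=1
    )
    automl.search()  # assertions on caplog / batch times would follow here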
Thanks for tackling this and for doing it so quickly!
evalml/automl/automl_search.py
Outdated
Raises:
    AutoMLSearchException: If all pipelines in the current AutoML batch produced a score of np.nan on the primary objective.

Returns:
    Optinal Dict[int, Dict[str, str]]: Returns dict if timing is set to "return" or "both".
typo: "Optional"
I think we can also just make it so that search always returns this timing dictionary. I think conditional returns are a little troublesome, sometimes.
Do you think I should get rid of "return" and "both" as options for timing if we want it to always return this dictionary (meaning we only keep "log" as an option for timing)?
I'm sorry this took me so long, but yes, I think it's fine to always return the dictionary with all the things in it.
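A sketch of that agreed direction (the helper names here are hypothetical, not evalml's actual internals): keep "log" as the only timing option and return the dictionary unconditionally:

class AutoMLSearch:
    def search(self, show_iteration_plot=True, timing=None):
        batch_times = self._collect_batch_times()  # hypothetical helper
        if timing == "log":
            log_batch_times(self.logger, batch_times)  # logging stays opt-in
        return batch_times  # always returned; no "return"/"both" options needed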
….com/alteryx/evalml into 2470-Report-batch-times-automlsearch
Huge fan of this, but I think we can make things a little clearer and more accessible for our users, most importantly where the argument that controls this behavior lives.
evalml/automl/automl_search.py
Outdated
    Default: None
    log=prints out batch/pipeline timing to console.
Since the only options are "log" or None, I vote that we switch to a boolean flag for this, something like log_timing, which defaults to False. It'd make both our lives as the implementers easier (checking a boolean instead of string equality, not having to validate the input) and make it clearer for users to boot.
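To illustrate the simplification (log_timing is the proposed name here, not a confirmed API): the string option needs explicit validation, while the boolean does not:

# String option: the input has to be validated against allowed values.
def _validate_timing(timing):
    if timing not in (None, "log"):
        raise ValueError(f"timing must be 'log' or None, got {timing!r}")

# Boolean flag: a plain truth test, nothing to validate.
def _should_log_timing(log_timing=False):
    return bool(log_timing)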
agreed as well!
evalml/automl/automl_search.py
Outdated
Returns:
    Dict[int, Dict[str, Timestamp]]: Returns dict.
        Key=batch #, value=Dict[key=pipeline name, value=timestamp of pipeline].
        Inner dict has key called "Total time of batch" with value=total time of batch.
This is really hard to understand without reading closely. I'd refactor it, something more like:

    Dict[int, Dict[str, Timestamp]]: Dictionary keyed by batch number that maps to the timings for pipelines run in that batch,
    as well as the total time for each batch. Pipelines within a batch are labeled by pipeline name.

As a side note, = in docstrings really throws me off. It'd be better to stick to using colons, which maintain consistency with the rest of our docs!
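For concreteness, here is the shape that refactored docstring describes, with made-up pipeline names and durations (only the "Total time of batch" key is confirmed by the PR description):

batch_times = {
    1: {
        "Mode Baseline Binary Classification Pipeline": "00:01",  # illustrative
        "Total time of batch": "00:01",
    },
    2: {
        "Logistic Regression Classifier w/ Imputer": "00:04",  # illustrative
        "Random Forest Classifier w/ Imputer": "00:03",        # illustrative
        "Total time of batch": "00:07",
    },
}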
evalml/automl/automl_search.py
Outdated
log_title(self.logger, "Batch Time Stats")
log_batch_times(self.logger, batch_times)
I would move the call to log_title into log_batch_times itself, since we don't need to call log_batch_times without setting the title as well.
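A sketch of that refactor; log_title below is a stand-in for evalml's helper, with its signature assumed from the diff above:

import logging


def log_title(logger, title):
    # Stand-in for evalml's log_title helper.
    logger.info("* %s *", title)


def log_batch_times(logger, batch_times):
    # The title is now emitted here, so callers only make a single call.
    log_title(logger, "Batch Time Stats")
    for batch_number, pipeline_times in batch_times.items():
        logger.info("Batch %s:", batch_number)
        for name, duration in pipeline_times.items():
            logger.info("  %s: %s", name, duration)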
evalml/automl/automl_search.py
Outdated
@@ -857,16 +858,34 @@ def _handle_keyboard_interrupt(self):
         else:
             leading_char = ""

-    def search(self, show_iteration_plot=True):
+    def search(self, show_iteration_plot=True, timing=None):
I think we should move this to be an argument in AutoMLSearch.__init__ instead of AutoMLSearch.search. Reason being, we have two ways for users to run search. This is one of them, but we're trying to move more over to running the top-level search method instead of manually instantiating AutoMLSearch first. With the argument living here, users have no access to the argument.

If we move the arg to AutoMLSearch.__init__ and add it to the top-level search methods as well, that will ensure users have full access to controlling this.
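A skeleton of the suggested placement, with most parameters elided and the top-level search signature hypothetical:

class AutoMLSearch:
    def __init__(self, timing=False):
        self.timing = timing  # the flag now lives on the instance

    def search(self, show_iteration_plot=True):
        # search() reads self.timing instead of taking its own parameter.
        batch_times = {}  # collected during search (details elided)
        if self.timing:
            print("Batch Time Stats:", batch_times)  # placeholder for logging
        return batch_times


def search(timing=False, **automl_kwargs):
    # The top-level entry point forwards the flag, so users who never build
    # AutoMLSearch themselves still control the behavior.
    automl = AutoMLSearch(timing=timing, **automl_kwargs)
    return automl, automl.search()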
agreed - thanks for covering this @eccabay!
Agreed with @eccabay's comments. @MichaelFu512 can you request a re-review once those changes are in? Thanks!
Awesome Michael, thanks for making all these changes! I just have one small comment, but other than that this is looking great.
evalml/automl/automl_search.py
Outdated
@@ -143,6 +145,7 @@ def search(
         in time series problems, values should be passed in for the time_index, gap, forecast_horizon, and max_delay variables.
     n_splits (int): Number of splits to use with the default data splitter.
     verbose (boolean): Whether or not to display semi-real-time updates to stdout while search is running. Defaults to False.
+    timing (boolean): Whether or not to display pipeline search times to stdout. Defaults to False.
Nitpicky clarification: logging info is not guaranteed to display that information in stdout; that will only happen if the logging level is set low enough to expose it. It'd be more accurate to say:

    Whether or not to write pipeline search times to the logger. Defaults to False.

By default, if timing is set to True, the user would still not see the timings being logged in stdout, since the default logging behavior is at the warning level (which will not show info/debug level logs). To see them, they would either need to configure the logger themselves, or set verbose=True. Alternatively, they can choose to dump the log into a file or otherwise redirect that info, in which case it would be logged somewhere but not appear in stdout.

Sorry for the info dump, I'm just intimately familiar with our logging behavior 😅
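As an example of the first route, a user could surface the info-level timing logs with the standard logging module (the "evalml" logger name is an assumption about the library's logger hierarchy):

import logging
import sys

logging.basicConfig(stream=sys.stdout)              # handler that writes to stdout
logging.getLogger("evalml").setLevel(logging.INFO)  # expose info-level records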
It's always good for me to learn more, so "info dump(s)" are always great.
evalml/automl/automl_search.py
Outdated
@@ -410,6 +414,8 @@ class AutoMLSearch:
         If a parallel engine is selected this way, the maximum amount of parallelism, as determined by the engine, will be used. Defaults to "sequential".

     verbose (boolean): Whether or not to display semi-real-time updates to stdout while search is running. Defaults to False.

+    timing (boolean): Whether or not to display pipeline search times to stdout. Defaults to False.
Same comment about stdout vs logging holds here.

Also, I think you're missing this argument in search_iterative?
….com/alteryx/evalml into 2470-Report-batch-times-automlsearch
Pull Request Description
Added an option to record batch and pipeline search times from automl.search() via an optional parameter called timing. Setting timing=True will output the batch timings to stdout. automl.search() now also returns a dictionary that holds individual batch and pipeline times. There's also a value in the inner dictionary called "Total time of batch" which records how long the batch took in total.
Closes #2470
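A usage sketch based on this description (the demo dataset choice and exact signatures are assumptions, not taken from the PR):

import evalml
from evalml.automl import AutoMLSearch

X, y = evalml.demos.load_breast_cancer()  # any small demo dataset works
automl = AutoMLSearch(
    X_train=X,
    y_train=y,
    problem_type="binary",
    timing=True,   # log batch/pipeline timings
    verbose=True,  # surface info-level logs in stdout
)
batch_times = automl.search()
# Batch numbering starting at 1 is assumed for illustration.
print(batch_times[1]["Total time of batch"])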