
2470 report batch times automlsearch #3577

Merged: 34 commits merged into main, Jul 8, 2022

Conversation

@MichaelFu512 (Contributor) commented Jun 21, 2022

Pull Request Description

Added an option to record batch and pipeline search times from automl.search() via an optional parameter called "timing". Passing timing=True writes the batch timings to stdout. automl.search() now also returns a dictionary that holds the individual batch and pipeline times.

There is also a value in each inner dictionary, "Total time of batch", which records how long the batch took in total.
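For illustration, the returned dictionary might look like the following. This is only a sketch of the shape described above; the pipeline names and durations are made up, not output from evalml:

```python
# Illustrative shape of the timing dictionary: keys are batch numbers,
# each inner dict maps pipeline names to their search times and also
# carries a "Total time of batch" entry. All values here are invented.
batch_times = {
    1: {
        "Mode Baseline Binary Classification Pipeline": "0.82 seconds",
        "Total time of batch": "0.82 seconds",
    },
    2: {
        "Logistic Regression w/ Imputer + Standard Scaler": "2.10 seconds",
        "Random Forest Classifier w/ Imputer": "1.74 seconds",
        "Total time of batch": "3.84 seconds",
    },
}

# Pipelines in a batch are every entry except the batch total.
pipelines_in_batch_2 = [
    name for name in batch_times[2] if name != "Total time of batch"
]
```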

Closes #2470

@codecov
codecov bot commented Jun 21, 2022

Codecov Report

Merging #3577 (93aec54) into main (a303470) will increase coverage by 0.1%.
The diff coverage is 100.0%.

@@           Coverage Diff           @@
##            main   #3577     +/-   ##
=======================================
+ Coverage   99.7%   99.7%   +0.1%     
=======================================
  Files        335     335             
  Lines      33456   33512     +56     
=======================================
+ Hits       33327   33383     +56     
  Misses       129     129             
Impacted Files                                 Coverage Δ
evalml/automl/automl_search.py                 99.6% <100.0%> (+0.1%) ⬆️
evalml/tests/automl_tests/test_automl.py       99.5% <100.0%> (+0.1%) ⬆️
evalml/tests/utils_tests/test_logger.py        100.0% <100.0%> (ø)
evalml/utils/logger.py                         100.0% <100.0%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update a303470...93aec54.

@MichaelFu512 MichaelFu512 marked this pull request as ready for review June 21, 2022 17:49
@MichaelFu512 MichaelFu512 enabled auto-merge (squash) June 21, 2022 19:03
@jeremyliweishih (Collaborator) left a comment

Just some initial comments - lmk if they make sense! I'll give it another final review after your edits. Thanks!

@@ -216,6 +223,61 @@ def test_search_results(X_y_regression, X_y_binary, X_y_multi, automl_type, obje
)


def test_search_batch_times(caplog):
caplog.clear()
X, y = load_data(
(Collaborator)

We can use a smaller dataset here: check out the X_y_binary fixture in conftest.py and just search for tests that use it as an example.
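The X_y_binary fixture itself isn't shown in this thread; as a rough stand-in, a small synthetic binary-classification dataset for a test might be sketched like this (function name and sizes are hypothetical):

```python
import numpy as np

def tiny_binary_dataset(n_rows=100, n_features=20, seed=0):
    # Hypothetical stand-in for a small fixture like X_y_binary:
    # a random feature matrix and a 0/1 target, deterministic via seed.
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_rows, n_features))
    y = rng.integers(0, 2, size=n_rows)
    return X, y

X, y = tiny_binary_dataset()
```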

@@ -714,7 +776,7 @@ def test_large_dataset_binary(AutoMLTestEnv):
objective=fraud_objective,
additional_objectives=["auc", "f1", "precision"],
max_time=1,
max_iterations=1,
max_iterations=3,
(Collaborator)

Why did we have to change this test?

(Contributor Author)

That one was an oops; I think I changed it by accident.

"provider": "categorical",
},
)
X_train, _, y_train, _ = evalml.preprocessing.split_data(
(Collaborator)

We can skip out on all this logic about splitting and datachecks as well. Look at something like test_pipeline_score_raises for an example.

@chukarsten (Contributor) left a comment

Thanks for tackling this and for doing it so quickly!


Raises:
AutoMLSearchException: If all pipelines in the current AutoML batch produced a score of np.nan on the primary objective.

Returns:
Optinal Dict[int, Dict[str, str]]: Returns dict if timing is set to "return" or "both".
(Contributor)
typo: "Optional"

I think we can also just make it so that search always returns this timing dictionary. I think conditional returns is a little troublesome, sometimes.
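A minimal sketch of why conditional returns are troublesome (hypothetical function names; the timing values are invented):

```python
def search_conditional(timing=None):
    # Conditional return: callers must remember which mode they asked for
    # before they can safely use the result, and may get None back.
    timings = {1: {"Total time of batch": "0.5 seconds"}}
    if timing in ("return", "both"):
        return timings
    return None

def search_always():
    # Unconditional return: the result type is the same on every call,
    # so callers never need to branch on None.
    timings = {1: {"Total time of batch": "0.5 seconds"}}
    return timings

forgot_the_flag = search_conditional()  # None: easy to misuse downstream
always_a_dict = search_always()
```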

(Contributor Author)
Do you think I should get rid of "return" and "both" as options for timing if we want it to always return this dictionary (meaning we only keep "log" as an option for timing)?

(Contributor)
I'm sorry this took me so long, but yea, I think yes, it's fine to always return the dictionary with all the things in it.

@eccabay (Contributor) left a comment

Huge fan of this, but I think we can make things a little clearer and more accessible for our users, most importantly where the argument that controls this behavior lives and what it's called.

Comment on lines 868 to 869
Default: None
log=prints out batch/pipeline timing to console.
(Contributor)
Since the only options are "log" or None, I vote that we switch to a boolean flag for this, something like log_timing which defaults to false. It'd make both our lives as the implementers easier (checking a boolean instead of string equality, not having to validate the input) and make it clearer for users to boot.
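The difference the comment is pointing at can be sketched like this (function names are hypothetical, not evalml's API):

```python
def set_timing_option(timing):
    # String option: every entry point needs validation plus a decision
    # about unexpected values like "LOG" or "both".
    valid = (None, "log")
    if timing not in valid:
        raise ValueError(f"timing must be one of {valid}, got {timing!r}")
    return timing == "log"

def set_log_timing(log_timing=False):
    # Boolean flag: nothing to validate, and the intent is self-evident
    # at the call site.
    return bool(log_timing)
```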

(Collaborator)
agreed as well!

Comment on lines 875 to 878
Returns:
Dict[int, Dict[str, Timestamp]]: Returns dict.
Key=batch #, value=Dict[key=pipeline name, value=timestamp of pipeline].
Inner dict has key called "Total time of batch" with value=total time of batch.
(Contributor)
This is really hard to understand without reading closely. I'd refactor it, something more like:

Dict[int, Dict[str, Timestamp]]: Dictionary keyed by batch number that maps to the timings for pipelines run in that batch, 
as well as the total time for each batch. Pipelines within a batch are labeled by pipeline name. 

As a side note, = in docstrings really throws me off. It'd be better to stick to using colons, which maintain consistency with the rest of our docs!

Comment on lines 1031 to 1032
log_title(self.logger, "Batch Time Stats")
log_batch_times(self.logger, batch_times)
(Contributor)
I would move the call to log_title into log_batch_times itself, since we don't need to call log_batch_times without setting the title as well.
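The suggested refactor might look roughly like this. This is a sketch, not evalml's actual logger module; the helper bodies here are simplified stand-ins:

```python
import logging

def log_title(logger, title):
    # Simplified banner helper, similar in spirit to evalml's log_title.
    logger.info("*" * len(title))
    logger.info(title)
    logger.info("*" * len(title))

def log_batch_times(logger, batch_times):
    # Per the suggestion: emit the "Batch Time Stats" title from inside
    # log_batch_times, so callers never have to pair the two calls manually.
    log_title(logger, "Batch Time Stats")
    for batch_number, timings in batch_times.items():
        logger.info("Batch %s time stats:", batch_number)
        for name, duration in timings.items():
            logger.info("%s: %s", name, duration)
```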

@@ -857,16 +858,34 @@ def _handle_keyboard_interrupt(self):
else:
leading_char = ""

def search(self, show_iteration_plot=True):
def search(self, show_iteration_plot=True, timing=None):
(Contributor)
I think we should move this to be an argument in AutoMLSearch.__init__ instead of AutoMLSearch.search. Reason being, we have two ways for users to run search. This is one of them, but we're trying to move more over to running the top level search method instead of manually instantiating AutoMLSearch first. With the argument living here, users have no access to the argument.

If we move the arg to AutoMLSearch.__init__ and add it to the top level search methods as well, that will ensure users have full access to controlling this.
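A toy sketch of that arrangement, with the flag on __init__ and forwarded by a top-level helper (class and function names here are illustrative stand-ins, not evalml's real classes):

```python
class AutoMLSearchSketch:
    # Hypothetical sketch: the flag lives on __init__, so it is reachable
    # whether users construct the object directly or go through the
    # top-level helper below.
    def __init__(self, log_timing=False):
        self.log_timing = log_timing

    def search(self):
        if self.log_timing:
            print("batch timings would be logged here")
        return {}

def top_level_search(log_timing=False):
    # Top-level convenience entry point just forwards the flag.
    automl = AutoMLSearchSketch(log_timing=log_timing)
    automl.search()
    return automl
```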

(Collaborator)
agreed - thanks for covering this @eccabay!

@jeremyliweishih (Collaborator) left a comment

Agreed with @eccabay's comments. @MichaelFu512 can you request a re-review once those changes are in? Thanks!


@MichaelFu512 MichaelFu512 disabled auto-merge July 5, 2022 20:53
@eccabay (Contributor) left a comment

Awesome Michael, thanks for making all these changes! I just have one small comment, but other than that this is looking great.

@@ -143,6 +145,7 @@ def search(
in time series problems, values should be passed in for the time_index, gap, forecast_horizon, and max_delay variables.
n_splits (int): Number of splits to use with the default data splitter.
verbose (boolean): Whether or not to display semi-real-time updates to stdout while search is running. Defaults to False.
timing (boolean): Whether or not to display pipeline search times to stdout. Defaults to False.
(Contributor)
Nitpicky clarification: logging info is not guaranteed to display that information in stdout, that will only happen if the logging level is set low enough to expose it. It'd be more accurate to say:

Whether or not to write pipeline search times to the logger. Defaults to False.

By default, if timing is set to True, the user would still not see the timings being logged in stdout since the default logging behavior is at the warning level (will not show info/debug level logs). To see this, they would either need to configure the logger themselves, or set verbose=True. Alternatively, they can choose to dump the log into a file or otherwise redirect that info, in which case it would be logged somewhere but not appearing in stdout.

Sorry for the info dump, I'm just intimately familiar with our logging behavior 😅
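A short stdlib-only illustration of the behavior described (evalml's actual logger setup is not shown here; the logger name is invented, and the level change just mirrors what configuring verbosity would arrange):

```python
import io
import logging

logger = logging.getLogger("timing_demo")  # hypothetical logger name
logger.setLevel(logging.WARNING)  # mirror the default threshold explicitly
stream = io.StringIO()
logger.addHandler(logging.StreamHandler(stream))

# At WARNING, info-level records are dropped: nothing reaches the stream.
logger.info("Batch 1 took 2.1 seconds")
dropped = stream.getvalue()

# Lowering the level (roughly what verbose=True arranges) exposes them.
logger.setLevel(logging.INFO)
logger.info("Batch 1 took 2.1 seconds")
shown = stream.getvalue()
```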

(Contributor Author)
It's always good for me to learn more so "info dump(s)" are always great.

@@ -410,6 +414,8 @@ class AutoMLSearch:
If a parallel engine is selected this way, the maximum amount of parallelism, as determined by the engine, will be used. Defaults to "sequential".

verbose (boolean): Whether or not to display semi-real-time updates to stdout while search is running. Defaults to False.

timing (boolean): Whether or not to display pipeline search times to stdout. Defaults to False.
(Contributor)

Same comment about stdout vs logging holds here.

Also, I think you're missing this argument in search_iterative?

@chukarsten chukarsten merged commit 7610736 into main Jul 8, 2022
@chukarsten chukarsten deleted the 2470-Report-batch-times-automlsearch branch July 8, 2022 16:28
@chukarsten chukarsten mentioned this pull request Jul 24, 2022
Development

Successfully merging this pull request may close these issues:

Report Batch/Pipeline Times in AutoMLSearch