
2470 report batch times automlsearch #3577

Merged: 34 commits merged into main, Jul 8, 2022

Conversation

@MichaelFu512 (Contributor) commented Jun 21, 2022

Pull Request Description

Added an option to record batch and pipeline search times from automl.search() via an optional parameter called "timing". Passing timing=True writes the batch timings to stdout. automl.search() now also returns a dictionary that holds the individual batch and pipeline times.

There is also a value in each inner dictionary, "Total time of batch", which records how long the batch took in total.
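For illustration, the returned dictionary might look like the following. This is only a sketch of the shape described above; the pipeline names and durations are made up, not output from evalml:

```python
# Illustrative shape of the timing dictionary: keys are batch numbers,
# each inner dict maps pipeline names to their search times and also
# carries a "Total time of batch" entry. All values here are invented.
batch_times = {
    1: {
        "Mode Baseline Binary Classification Pipeline": "0.82 seconds",
        "Total time of batch": "0.82 seconds",
    },
    2: {
        "Logistic Regression w/ Imputer + Standard Scaler": "2.10 seconds",
        "Random Forest Classifier w/ Imputer": "1.74 seconds",
        "Total time of batch": "3.84 seconds",
    },
}

# Pipelines in a batch are every entry except the batch total.
pipelines_in_batch_2 = [
    name for name in batch_times[2] if name != "Total time of batch"
]
```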

Closes #2470

@codecov
codecov bot commented Jun 21, 2022

Codecov Report

Merging #3577 (93aec54) into main (a303470) will increase coverage by 0.1%.
The diff coverage is 100.0%.

@@           Coverage Diff           @@
##            main   #3577     +/-   ##
=======================================
+ Coverage   99.7%   99.7%   +0.1%     
=======================================
  Files        335     335             
  Lines      33456   33512     +56     
=======================================
+ Hits       33327   33383     +56     
  Misses       129     129             
Impacted Files                                 Coverage Δ
evalml/automl/automl_search.py                 99.6% <100.0%> (+0.1%) ⬆️
evalml/tests/automl_tests/test_automl.py       99.5% <100.0%> (+0.1%) ⬆️
evalml/tests/utils_tests/test_logger.py        100.0% <100.0%> (ø)
evalml/utils/logger.py                         100.0% <100.0%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update a303470...93aec54.

@MichaelFu512 MichaelFu512 marked this pull request as ready for review June 21, 2022 17:49
@MichaelFu512 MichaelFu512 enabled auto-merge (squash) June 21, 2022 19:03
@jeremyliweishih (Collaborator) left a comment

Just some initial comments - lmk if they make sense! I'll give it another final review after your edits. Thanks!

@@ -216,6 +223,61 @@ def test_search_results(X_y_regression, X_y_binary, X_y_multi, automl_type, obje
)


def test_search_batch_times(caplog):
caplog.clear()
X, y = load_data(
(Collaborator)

We can use a smaller dataset here: check out the X_y_binary fixture in conftest.py and just search for tests that use it as an example.
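The X_y_binary fixture itself isn't shown in this thread; as a rough stand-in, a small synthetic binary-classification dataset for a test might be sketched like this (function name and sizes are hypothetical):

```python
import numpy as np

def tiny_binary_dataset(n_rows=100, n_features=20, seed=0):
    # Hypothetical stand-in for a small fixture like X_y_binary:
    # a random feature matrix and a 0/1 target, deterministic via seed.
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_rows, n_features))
    y = rng.integers(0, 2, size=n_rows)
    return X, y

X, y = tiny_binary_dataset()
```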

@@ -714,7 +776,7 @@ def test_large_dataset_binary(AutoMLTestEnv):
objective=fraud_objective,
additional_objectives=["auc", "f1", "precision"],
max_time=1,
max_iterations=1,
max_iterations=3,
(Collaborator)

Why did we have to change this test?

(Contributor Author)

That one was an oops; I think I changed it by accident.

"provider": "categorical",
},
)
X_train, _, y_train, _ = evalml.preprocessing.split_data(
(Collaborator)

We can skip out on all this logic about splitting and datachecks as well. Look at something like test_pipeline_score_raises for an example.

@chukarsten (Contributor) left a comment

Thanks for tackling this and for doing it so quickly!


Raises:
AutoMLSearchException: If all pipelines in the current AutoML batch produced a score of np.nan on the primary objective.

Returns:
Optinal Dict[int, Dict[str, str]]: Returns dict if timing is set to "return" or "both".
(Contributor)
typo: "Optional"

I think we can also just make it so that search always returns this timing dictionary. I think conditional returns is a little troublesome, sometimes.
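A minimal sketch of why conditional returns are troublesome (hypothetical function names; the timing values are invented):

```python
def search_conditional(timing=None):
    # Conditional return: callers must remember which mode they asked for
    # before they can safely use the result, and may get None back.
    timings = {1: {"Total time of batch": "0.5 seconds"}}
    if timing in ("return", "both"):
        return timings
    return None

def search_always():
    # Unconditional return: the result type is the same on every call,
    # so callers never need to branch on None.
    timings = {1: {"Total time of batch": "0.5 seconds"}}
    return timings

forgot_the_flag = search_conditional()  # None: easy to misuse downstream
always_a_dict = search_always()
```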

(Contributor Author)
Do you think I should get rid of "return" and "both" as options for timing if we want it to always return this dictionary (meaning we only keep "log" as an option for timing)?

(Contributor)
I'm sorry this took me so long, but yea, I think yes, it's fine to always return the dictionary with all the things in it.

@eccabay (Contributor) left a comment

Huge fan of this, but I think we can make things a little clearer and more accessible for our users, most importantly where the argument that controls this behavior lives and what it's called.

Comment on lines 868 to 869
Default: None
log=prints out batch/pipeline timing to console.
(Contributor)
Since the only options are "log" or None, I vote that we switch to a boolean flag for this, something like log_timing which defaults to false. It'd make both our lives as the implementers easier (checking a boolean instead of string equality, not having to validate the input) and make it clearer for users to boot.
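The difference the comment is pointing at can be sketched like this (function names are hypothetical, not evalml's API):

```python
def set_timing_option(timing):
    # String option: every entry point needs validation plus a decision
    # about unexpected values like "LOG" or "both".
    valid = (None, "log")
    if timing not in valid:
        raise ValueError(f"timing must be one of {valid}, got {timing!r}")
    return timing == "log"

def set_log_timing(log_timing=False):
    # Boolean flag: nothing to validate, and the intent is self-evident
    # at the call site.
    return bool(log_timing)
```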

(Collaborator)
agreed as well!

Comment on lines 875 to 878
Returns:
Dict[int, Dict[str, Timestamp]]: Returns dict.
Key=batch #, value=Dict[key=pipeline name, value=timestamp of pipeline].
Inner dict has key called "Total time of batch" with value=total time of batch.
(Contributor)
This is really hard to understand without reading closely. I'd refactor it, something more like:

Dict[int, Dict[str, Timestamp]]: Dictionary keyed by batch number that maps to the timings for pipelines run in that batch, 
as well as the total time for each batch. Pipelines within a batch are labeled by pipeline name. 

As a side note, = in docstrings really throws me off. It'd be better to stick to using colons, which maintain consistency with the rest of our docs!

Comment on lines 1031 to 1032
log_title(self.logger, "Batch Time Stats")
log_batch_times(self.logger, batch_times)
(Contributor)
I would move the call to log_title into log_batch_times itself, since we don't need to call log_batch_times without setting the title as well.
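The suggested refactor might look roughly like this. This is a sketch, not evalml's actual logger module; the helper bodies here are simplified stand-ins:

```python
import logging

def log_title(logger, title):
    # Simplified banner helper, similar in spirit to evalml's log_title.
    logger.info("*" * len(title))
    logger.info(title)
    logger.info("*" * len(title))

def log_batch_times(logger, batch_times):
    # Per the suggestion: emit the "Batch Time Stats" title from inside
    # log_batch_times, so callers never have to pair the two calls manually.
    log_title(logger, "Batch Time Stats")
    for batch_number, timings in batch_times.items():
        logger.info("Batch %s time stats:", batch_number)
        for name, duration in timings.items():
            logger.info("%s: %s", name, duration)
```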

@@ -857,16 +858,34 @@ def _handle_keyboard_interrupt(self):
else:
leading_char = ""

def search(self, show_iteration_plot=True):
def search(self, show_iteration_plot=True, timing=None):
(Contributor)
I think we should move this to be an argument in AutoMLSearch.__init__ instead of AutoMLSearch.search. Reason being, we have two ways for users to run search. This is one of them, but we're trying to move more over to running the top level search method instead of manually instantiating AutoMLSearch first. With the argument living here, users have no access to the argument.

If we move the arg to AutoMLSearch.__init__ and add it to the top level search methods as well, that will ensure users have full access to controlling this.
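A toy sketch of that arrangement, with the flag on __init__ and forwarded by a top-level helper (class and function names here are illustrative stand-ins, not evalml's real classes):

```python
class AutoMLSearchSketch:
    # Hypothetical sketch: the flag lives on __init__, so it is reachable
    # whether users construct the object directly or go through the
    # top-level helper below.
    def __init__(self, log_timing=False):
        self.log_timing = log_timing

    def search(self):
        if self.log_timing:
            print("batch timings would be logged here")
        return {}

def top_level_search(log_timing=False):
    # Top-level convenience entry point just forwards the flag.
    automl = AutoMLSearchSketch(log_timing=log_timing)
    automl.search()
    return automl
```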

(Collaborator)
agreed - thanks for covering this @eccabay!

@jeremyliweishih (Collaborator) left a comment

Agreed with @eccabay's comments. @MichaelFu512 can you request a re-review once those changes are in? Thanks!


@MichaelFu512 MichaelFu512 disabled auto-merge July 5, 2022 20:53
@eccabay (Contributor) left a comment

Awesome Michael, thanks for making all these changes! I just have one small comment, but other than that this is looking great.

@@ -143,6 +145,7 @@ def search(
in time series problems, values should be passed in for the time_index, gap, forecast_horizon, and max_delay variables.
n_splits (int): Number of splits to use with the default data splitter.
verbose (boolean): Whether or not to display semi-real-time updates to stdout while search is running. Defaults to False.
timing (boolean): Whether or not to display pipeline search times to stdout. Defaults to False.
(Contributor)
Nitpicky clarification: logging info is not guaranteed to display that information in stdout, that will only happen if the logging level is set low enough to expose it. It'd be more accurate to say:

Whether or not to write pipeline search times to the logger. Defaults to False.

By default, if timing is set to True, the user would still not see the timings being logged in stdout since the default logging behavior is at the warning level (will not show info/debug level logs). To see this, they would either need to configure the logger themselves, or set verbose=True. Alternatively, they can choose to dump the log into a file or otherwise redirect that info, in which case it would be logged somewhere but not appearing in stdout.

Sorry for the info dump, I'm just intimately familiar with our logging behavior 😅
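A short stdlib-only illustration of the behavior described (evalml's actual logger setup is not shown here; the logger name is invented, and the level change just mirrors what configuring verbosity would arrange):

```python
import io
import logging

logger = logging.getLogger("timing_demo")  # hypothetical logger name
logger.setLevel(logging.WARNING)  # mirror the default threshold explicitly
stream = io.StringIO()
logger.addHandler(logging.StreamHandler(stream))

# At WARNING, info-level records are dropped: nothing reaches the stream.
logger.info("Batch 1 took 2.1 seconds")
dropped = stream.getvalue()

# Lowering the level (roughly what verbose=True arranges) exposes them.
logger.setLevel(logging.INFO)
logger.info("Batch 1 took 2.1 seconds")
shown = stream.getvalue()
```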

(Contributor Author)
It's always good for me to learn more so "info dump(s)" are always great.

@@ -410,6 +414,8 @@ class AutoMLSearch:
If a parallel engine is selected this way, the maximum amount of parallelism, as determined by the engine, will be used. Defaults to "sequential".

verbose (boolean): Whether or not to display semi-real-time updates to stdout while search is running. Defaults to False.

timing (boolean): Whether or not to display pipeline search times to stdout. Defaults to False.
(Contributor)

Same comment about stdout vs logging holds here.

Also, I think you're missing this argument in search_iterative?

@chukarsten chukarsten merged commit 7610736 into main Jul 8, 2022
@chukarsten chukarsten deleted the 2470-Report-batch-times-automlsearch branch July 8, 2022 16:28
@chukarsten chukarsten mentioned this pull request Jul 24, 2022
Development

Successfully merging this pull request may close these issues:

Report Batch/Pipeline Times in AutoMLSearch