2470 report batch times automlsearch #3577

Merged Jul 8, 2022 · 34 commits

Changes from 23 commits

Commits (34):
c0e7daa
Tried to implement batch/pipeline timing
MichaelFu512 Jun 16, 2022
8fe158a
Unpin nlp_primitives but disallow v2.6.0 (#3574)
rwedge Jun 17, 2022
52ce328
2734 Component Graph Instantiation Error (#3569)
MichaelFu512 Jun 17, 2022
2dcd340
Changed print to log, added function to logger
MichaelFu512 Jun 17, 2022
df8cd27
Linting is correct
MichaelFu512 Jun 17, 2022
03bcf9d
Added test into test_logger.py
MichaelFu512 Jun 17, 2022
550f248
Added tests for batch time results
MichaelFu512 Jun 21, 2022
f87be9e
Updated release note
MichaelFu512 Jun 21, 2022
e9d8e1a
Tried to fix merge conflict
MichaelFu512 Jun 21, 2022
cc5ced0
Merge branch 'main' into 2470-Report-batch-times-automlsearch
MichaelFu512 Jun 21, 2022
095042c
Added Jeremy's and Karsten's suggestions
MichaelFu512 Jun 21, 2022
eac697e
Simplified tests and added new error check
MichaelFu512 Jun 21, 2022
2e99b50
Merge branch '2470-Report-batch-times-automlsearch' of https://github…
MichaelFu512 Jun 21, 2022
349d713
Merge branch 'main' into 2470-Report-batch-times-automlsearch
MichaelFu512 Jun 22, 2022
6fe4dbc
Made it so that the dictionary always returns
MichaelFu512 Jun 22, 2022
3802b8a
Merge branch 'main' into 2470-Report-batch-times-automlsearch
MichaelFu512 Jun 22, 2022
7afc635
Merge branch 'main' into 2470-Report-batch-times-automlsearch
MichaelFu512 Jun 23, 2022
4832fda
Merge branch 'main' into 2470-Report-batch-times-automlsearch
MichaelFu512 Jun 24, 2022
e9b3c19
Merge branch 'main' into 2470-Report-batch-times-automlsearch
MichaelFu512 Jun 27, 2022
fae0bcd
Merge branch 'main' into 2470-Report-batch-times-automlsearch
MichaelFu512 Jun 28, 2022
13ab5da
Merge branch 'main' into 2470-Report-batch-times-automlsearch
MichaelFu512 Jun 29, 2022
f477cf7
Merge branch 'main' into 2470-Report-batch-times-automlsearch
MichaelFu512 Jun 30, 2022
092cb07
Merge branch 'main' into 2470-Report-batch-times-automlsearch
MichaelFu512 Jun 30, 2022
a085d20
Merge branch 'main' into 2470-Report-batch-times-automlsearch
MichaelFu512 Jul 5, 2022
9e35adf
Added comments and made code clearer
MichaelFu512 Jul 5, 2022
3694a31
Fixed tests for search
MichaelFu512 Jul 5, 2022
774d6a7
Fixed release_notes
MichaelFu512 Jul 5, 2022
4ba83d3
Merge branch 'main' into 2470-Report-batch-times-automlsearch
MichaelFu512 Jul 5, 2022
1caad75
Fixed release notes
MichaelFu512 Jul 5, 2022
42fc70c
Update automl_search.py
MichaelFu512 Jul 5, 2022
3223f16
Updated docstring
MichaelFu512 Jul 6, 2022
41695c0
Merge branch '2470-Report-batch-times-automlsearch' of https://github…
MichaelFu512 Jul 6, 2022
cf13faf
Merge branch 'main' into 2470-Report-batch-times-automlsearch
MichaelFu512 Jul 7, 2022
93aec54
Merge branch 'main' into 2470-Report-batch-times-automlsearch
MichaelFu512 Jul 8, 2022
1 change: 1 addition & 0 deletions docs/source/release_notes.rst
@@ -2,6 +2,7 @@ Release Notes
-------------
**Future Releases**
* Enhancements
* Added ability to log how long each batch and pipeline takes in automl.search() :pr:`3577`
* Fixes
* Changes
* Documentation Changes
37 changes: 36 additions & 1 deletion evalml/automl/automl_search.py
@@ -62,6 +62,7 @@
from evalml.utils.gen_utils import contains_all_ts_parameters
from evalml.utils.logger import (
get_logger,
log_batch_times,
log_subtitle,
log_title,
time_elapsed,
@@ -857,16 +858,34 @@ def _handle_keyboard_interrupt(self):
else:
leading_char = ""

def search(self, show_iteration_plot=True):
def search(self, show_iteration_plot=True, timing=None):
Comment from eccabay (Contributor):
I think we should move this to be an argument in AutoMLSearch.__init__ instead of AutoMLSearch.search. Reason being, we have two ways for users to run search: this is one of them, but we're trying to move users over to the top-level search method instead of having them manually instantiate AutoMLSearch first. With the argument living here, those users have no access to it.

If we move the arg to AutoMLSearch.__init__ and add it to the top-level search methods as well, that will ensure users have full control over this.
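
For illustration, a minimal sketch of that restructuring (hypothetical; the wiring is illustrative, not what this PR implements):

    # Hypothetical sketch of the suggestion above: accept the option at
    # construction time so both entry points expose it.
    from evalml.utils.logger import get_logger, log_batch_times

    class AutoMLSearch:
        def __init__(self, X_train=None, y_train=None, problem_type=None, timing=None):
            self._timing = timing  # stored once, consumed by search()
            self.logger = get_logger(__name__)

        def search(self, show_iteration_plot=True):
            batch_times = {}
            # ... run batches, populating batch_times ...
            if self._timing == "log":
                log_batch_times(self.logger, batch_times)
            return batch_times

    # The top-level search helper would forward the same argument, e.g.
    # search(X_train=X, y_train=y, problem_type="binary", timing="log").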

Reply (Collaborator):
agreed - thanks for covering this @eccabay!

"""Find the best pipeline for the data set.

Args:
show_iteration_plot (boolean, True): Shows an iteration vs. score plot in Jupyter notebook.
Disabled by default in non-Jupyter environments.
timing (str, None): Shows timing of the batches and the individual timings of each pipeline.
Default: None
log=prints out batch/pipeline timing to console.
Comment (Contributor):
Since the only options are "log" or None, I vote that we switch to a boolean flag for this, something like log_timing which defaults to false. It'd make both our lives as the implementers easier (checking a boolean instead of string equality, not having to validate the input) and make it clearer for users to boot.
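
Sketched out, the boolean variant would reduce to something like this (hypothetical, following the reviewer's log_timing naming; not part of this diff):

    # Hypothetical: a boolean flag needs no lowercasing or string validation.
    def search(self, show_iteration_plot=True, log_timing=False):
        batch_times = {}
        # ... run batches, populating batch_times ...
        if log_timing:
            log_title(self.logger, "Batch Time Stats")
            log_batch_times(self.logger, batch_times)
        return batch_times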

Reply (Collaborator):
agreed as well!


Raises:
AutoMLSearchException: If all pipelines in the current AutoML batch produced a score of np.nan on the primary objective.
ValueError: If timing is not set to a correct value.

Returns:
Dict[int, Dict[str, Timestamp]]: Returns dict.
Key=batch #, value=Dict[key=pipeline name, value=timestamp of pipeline].
Inner dict has key called "Total time of batch" with value=total time of batch.
Comment (Contributor):
This is really hard to understand without reading closely. I'd refactor it, something more like:

Dict[int, Dict[str, Timestamp]]: Dictionary keyed by batch number that maps to the timings for pipelines run in that batch, 
as well as the total time for each batch. Pipelines within a batch are labeled by pipeline name. 

As a side note, = in docstrings really throws me off. It'd be better to stick to using colons, which maintain consistency with the rest of our docs!
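
For reference, the returned structure looks roughly like this (pipeline names and durations are illustrative):

    # Shape of the dict returned by search(timing="log"); values are
    # elapsed-time strings as produced by time_elapsed().
    batch_times = {
        1: {
            "Mode Baseline Binary Classification Pipeline": "00:01",
            "Total time of batch": "00:01",
        },
        2: {
            "Random Forest Classifier w/ Imputer": "00:03",
            "Total time of batch": "00:04",
        },
    }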

"""
batch_times = {}
if timing is not None:
timing = timing.lower()

if timing != "log" and timing is not None:
raise ValueError(
"""Timing isn't set to a correct value! Please try again using "log"."""
)

if self._searched:
self.logger.error(
"AutoMLSearch.search() has already been run and will not run again on the same instance. Re-initialize AutoMLSearch to search again."
@@ -924,7 +943,10 @@ def search(self, show_iteration_plot=True):
current_batch_pipeline_scores = []
new_pipeline_ids = []
loop_interrupted = False

while self._should_continue():
pipeline_times = {}
start_batch_time = time.time()
computations = []
try:
if not loop_interrupted:
@@ -952,6 +974,7 @@
current_computation_index
]
if computation.done() and not has_been_processed:
start_pipeline_time = time.time()
evaluation = computation.get_result()
data, cached_data, pipeline, job_log = (
evaluation.get("scores"),
@@ -962,6 +985,9 @@
pipeline_id = self._post_evaluation_callback(
pipeline, data, cached_data, job_log
)
pipeline_times[pipeline.name] = time_elapsed(
start_pipeline_time
)
new_pipeline_ids.append(pipeline_id)
computations[current_computation_index] = (computation, True)
computations_left_to_process -= 1
@@ -983,20 +1009,28 @@
current_batch_pipeline_scores = full_rankings[current_batch_idx][
"validation_score"
]

if (
len(current_batch_pipeline_scores)
and current_batch_pipeline_scores.isna().all()
):
raise AutoMLSearchException(
f"All pipelines in the current AutoML batch produced a score of np.nan on the primary objective {self.objective}."
)
if len(pipeline_times) > 0:
pipeline_times["Total time of batch"] = time_elapsed(start_batch_time)
batch_times[self._get_batch_number()] = pipeline_times

self.search_duration = time.time() - self._start
elapsed_time = time_elapsed(self._start)
desc = f"\nSearch finished after {elapsed_time}"
desc = desc.ljust(self._MAX_NAME_LEN)
self.logger.info(desc)

if timing == "log":
log_title(self.logger, "Batch Time Stats")
log_batch_times(self.logger, batch_times)
Comment (Contributor):
I would move the call to log_title into log_batch_times itself, since we don't need to call log_batch_times without setting the title as well.
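
Applied to the helper in evalml/utils/logger.py below, that suggestion would look roughly like this (hypothetical refactor, not part of this diff):

    # Hypothetical: fold the title into the helper so callers make one call.
    def log_batch_times(logger, batch_times):
        log_title(logger, "Batch Time Stats")
        for batch_number, pipeline_times in batch_times.items():
            log_subtitle(logger, "Batch " + str(batch_number) + " time stats:")
            for pipeline_name, duration in pipeline_times.items():
                logger.info("\n" + pipeline_name + ": " + duration)
            logger.info("")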


self._find_best_pipeline()
if self._best_pipeline is not None:
best_pipeline = self.rankings.iloc[0]
@@ -1006,6 +1040,7 @@
f"Best pipeline {self.objective.name}: {best_pipeline['validation_score']:3f}"
)
self._searched = True
return batch_times

def _find_best_pipeline(self):
"""Finds the best pipeline in the rankings If self._best_pipeline already exists, check to make sure it is different from the current best pipeline before training and thresholding."""
55 changes: 55 additions & 0 deletions evalml/tests/automl_tests/test_automl.py
@@ -216,6 +216,61 @@ def test_search_results(X_y_regression, X_y_binary, X_y_multi, automl_type, obje
)


def test_search_batch_times(caplog, X_y_binary, AutoMLTestEnv):
caplog.clear()
X, y = X_y_binary
automl = AutoMLSearch(
X_train=X,
y_train=y,
problem_type="binary",
max_iterations=None,
optimize_thresholds=False,
max_batches=3,
verbose=True,
)
batch_times = None
env = AutoMLTestEnv("binary")
with env.test_context(score_return_value={"Log Loss Binary": 0.3}):
batch_times = automl.search(timing="log")

out = caplog.text
assert isinstance(batch_times, dict)
assert isinstance(list(batch_times.keys())[0], int)
assert isinstance(batch_times[1], dict)
assert isinstance(list(batch_times[1].keys())[0], str)
assert isinstance(batch_times[1]["Total time of batch"], str)
assert isinstance(batch_times[2]["Total time of batch"], str)
assert isinstance(batch_times[3]["Total time of batch"], str)

assert len(batch_times) == 3
assert len(batch_times[1]) == 3
assert len(batch_times[2]) == 3
assert len(batch_times[3]) == 7

assert "Batch Time Stats" in out
assert "Batch 1 time stats" in out
assert "Batch 2 time stats" in out
assert "Batch 3 time stats" in out


def test_search_batch_times_raise_valueerror(caplog, X_y_binary, AutoMLTestEnv):
caplog.clear()
X, y = X_y_binary
automl = AutoMLSearch(
X_train=X,
y_train=y,
problem_type="binary",
max_iterations=None,
optimize_thresholds=False,
max_batches=3,
verbose=True,
)
env = AutoMLTestEnv("binary")
with env.test_context(score_return_value={"Log Loss Binary": 0.3}):
with pytest.raises(ValueError):
automl.search(timing="test")


@pytest.mark.parametrize(
"automl_type",
[ProblemTypes.BINARY, ProblemTypes.MULTICLASS, ProblemTypes.REGRESSION],
15 changes: 15 additions & 0 deletions evalml/tests/utils_tests/test_logger.py
@@ -6,6 +6,7 @@
from evalml import AutoMLSearch
from evalml.utils.logger import (
get_logger,
log_batch_times,
log_subtitle,
log_title,
time_elapsed,
@@ -72,6 +73,20 @@ def test_logger_critical(caplog, logger_env_cleanup):
assert "CRITICAL" in caplog.text


def test_logger_batch_times(caplog, logger_env_cleanup):
logger = get_logger(TEST_LOGGER_NAME)
batch_times = {"1": {"test": "00:01", "tset": "10:00"}, "2": {"pipe": "00:02"}}
log_batch_times(logger, batch_times)
assert "Batch 1 time stats" in caplog.text
assert "test:" in caplog.text
assert "00:01" in caplog.text
assert "tset" in caplog.text
assert "10:00" in caplog.text
assert "Batch 2 time stats" in caplog.text
assert "pipe" in caplog.text
assert "00:02" in caplog.text


@pytest.mark.parametrize(
"time_passed,answer",
[(101199, "28:06:39"), (3660, "1:01:00"), (65, "01:05"), (7, "00:07")],
17 changes: 17 additions & 0 deletions evalml/utils/logger.py
@@ -55,3 +55,20 @@ def time_elapsed(start_time):
return "{0:d}:{1:02d}:{2:02d}".format(h, m, s)
else:
return "{0:02d}:{1:02d}".format(m, s)


def log_batch_times(logger, batch_times):
"""Used to print out the batch times.

Args:
logger: the logger.
batch_times: dict with (batch number, {pipeline name, pipeline time}).
"""
for batch_number in batch_times:
subtitle = "Batch " + str(batch_number) + " time stats:"
log_subtitle(logger, subtitle)
for pipeline_name in batch_times[batch_number]:
logger.info(
"\n" + pipeline_name + ": " + batch_times[batch_number][pipeline_name]
)
logger.info("")
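
A minimal usage sketch of the new helper, mirroring test_logger_batch_times above (logger name and batch contents are illustrative):

    from evalml.utils.logger import get_logger, log_batch_times

    logger = get_logger("evalml.demo")  # any logger name works here
    batch_times = {1: {"Baseline Pipeline": "00:01", "Total time of batch": "00:02"}}
    log_batch_times(logger, batch_times)
    # Logs a "Batch 1 time stats:" subtitle, then one line per entry.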