[test] #5

Closed
wants to merge 213 commits into from
Changes from 1 commit
213 commits
e557b6a
[tune] Throw on overstepping
richardliaw Nov 5, 2018
b755785
Add Tune Multi-Node Tests
richardliaw Nov 5, 2018
32d1242
Add cluster bookkeeping code
richardliaw Nov 5, 2018
9ec3a60
add test for adding node
richardliaw Nov 5, 2018
44fe1e2
multinode test fixes
richardliaw Nov 5, 2018
d9c9e3b
First pass at allowing updatable values
richardliaw Nov 6, 2018
d6cade1
Fix compilation issues
richardliaw Nov 6, 2018
ac74520
Merge branch 'config_updating' into global_state_multinode
richardliaw Nov 6, 2018
a95c718
Add config file parsing
richardliaw Nov 6, 2018
5814655
Full initialization
richardliaw Nov 6, 2018
f63df3f
Merge branch 'config_updating' into global_state_multinode
richardliaw Nov 6, 2018
2824836
Wrote a good test
richardliaw Nov 6, 2018
6e7bd6a
Merge branch 'config_updating' into tune_cluster
richardliaw Nov 6, 2018
4842481
configuration parsing and stuff
richardliaw Nov 7, 2018
8e52103
docs
richardliaw Nov 7, 2018
83d6947
write some tests, make it good
richardliaw Nov 7, 2018
4349adf
Merge branch 'master' into config_updating
richardliaw Nov 7, 2018
8078967
fixed init
richardliaw Nov 7, 2018
2db9f18
Add all config options and bring back stress tests.
Nov 7, 2018
cc8fca2
Merge branch 'config_updating' into tune_cluster
richardliaw Nov 7, 2018
59480dc
Update python/ray/worker.py
richardliaw Nov 7, 2018
6fa9d7c
Update python/ray/worker.py
richardliaw Nov 7, 2018
856547c
TEMP
richardliaw Nov 7, 2018
25e45cd
Fix internalization
richardliaw Nov 7, 2018
2e2b8b0
Merge branch 'config_updating' of github.com:richardliaw/ray into con…
richardliaw Nov 7, 2018
d3fa8f0
some last changes
richardliaw Nov 7, 2018
233f3ee
Merge branch 'config_updating' into tune_cluster
richardliaw Nov 7, 2018
c3c1c9c
skip for now
richardliaw Nov 7, 2018
3e96ec9
Linting and Java fix
Nov 7, 2018
4081c60
add docstring
richardliaw Nov 7, 2018
a916257
Merge branch 'config_updating' into global_state_multinode
richardliaw Nov 7, 2018
5646982
Merge branch 'config_updating' into tune_cluster
richardliaw Nov 8, 2018
38eda57
Merge branch 'master' into tune_cluster
richardliaw Nov 8, 2018
7d19c9f
Merge branch 'master' into global_state_multinode
richardliaw Nov 8, 2018
2f5861c
Fix test, add assertions
richardliaw Nov 8, 2018
90bc3fb
Merge branch 'master' into tune_cluster
richardliaw Nov 8, 2018
d2dccae
Merge branch 'global_state_multinode' into tune_cluster
richardliaw Nov 8, 2018
7f675f7
fix up tests
richardliaw Nov 8, 2018
af0fe9c
pytest ext
richardliaw Nov 8, 2018
8f942fc
Merge branch 'global_state_multinode' into tune_cluster
richardliaw Nov 8, 2018
6194744
Merge branch 'master' into tune_cluster
richardliaw Nov 11, 2018
d01d80c
code to make requeueing work
richardliaw Nov 12, 2018
1e32227
yapf
richardliaw Nov 12, 2018
9db9d16
lint
richardliaw Nov 12, 2018
2639a98
comments
richardliaw Nov 12, 2018
d45c74b
lint
richardliaw Nov 12, 2018
ee6a800
Update multi_node_test_2.py
richardliaw Nov 12, 2018
3f13bfa
lit
richardliaw Nov 12, 2018
5f0d75e
re-enable
richardliaw Nov 12, 2018
b1793bd
lint
richardliaw Nov 12, 2018
0763004
initial nuke test
richardliaw Nov 12, 2018
a1a05f0
Track last result
richardliaw Nov 12, 2018
5abf9d1
Merge branch 'global_state_multinode' into tune_cluster
richardliaw Nov 12, 2018
6ecc2bb
note
richardliaw Nov 12, 2018
eabb28b
Merge branch 'tune_cluster' into tune_cluster-2
richardliaw Nov 12, 2018
09995fa
add checkpointing to trial_runner
richardliaw Nov 12, 2018
24d6e12
trialrunners
richardliaw Nov 12, 2018
dbd1bbc
logging
richardliaw Nov 12, 2018
e2b8380
Redo checkpointing from trial runner
richardliaw Nov 12, 2018
1239c1a
Merge branch 'master' into tune_cluster
richardliaw Nov 14, 2018
7aab84f
fix up tests and checkpointing
richardliaw Nov 14, 2018
6c3d53e
Merge branch 'tune_cluster' into tune_cluster-2
richardliaw Nov 14, 2018
1e8a33d
import error
richardliaw Nov 14, 2018
026192b
Merge branch 'tune_cluster' into tune_cluster-2
richardliaw Nov 14, 2018
637e707
timeout?
richardliaw Nov 15, 2018
162b308
lint
richardliaw Nov 15, 2018
a65fc45
Merge branch 'master' into tune_cluster
richardliaw Nov 16, 2018
b0d5997
Merge branch 'tune_cluster' into tune_cluster-2
richardliaw Nov 17, 2018
e683224
Checkpoint and tests
richardliaw Nov 19, 2018
ef67acf
one full cluster failure
richardliaw Nov 19, 2018
0541f92
lint
richardliaw Nov 21, 2018
267150d
Merge branch 'tune_cluster' into tune_cluster-2
richardliaw Nov 21, 2018
b884fed
Add better test
richardliaw Nov 21, 2018
81ff30f
Merge branch 'master' into tune_cluster-2
richardliaw Nov 21, 2018
8617f2d
error test
richardliaw Nov 22, 2018
f7e31bd
some docs
richardliaw Nov 24, 2018
a2355f8
Tests and better recovery handling
richardliaw Nov 25, 2018
5513099
Add unit test for restoring (but currently failing
richardliaw Nov 25, 2018
782f194
pickle if needed when you set status
richardliaw Nov 25, 2018
d1d5a56
yapf
richardliaw Nov 25, 2018
be445e8
docs and small test for nosaving
richardliaw Nov 25, 2018
c13270b
doc
richardliaw Nov 25, 2018
8a6ed91
more docs
richardliaw Nov 25, 2018
b1e3bf0
test docs
richardliaw Nov 25, 2018
40248aa
py2mock
richardliaw Nov 26, 2018
22930c8
dirpath from tmpdir
richardliaw Nov 26, 2018
82ff45e
fix tsts?
richardliaw Nov 26, 2018
bde644c
yapf
richardliaw Nov 26, 2018
79197b8
Fix up tests
richardliaw Nov 26, 2018
66be742
nits
richardliaw Nov 26, 2018
25be843
nit
richardliaw Nov 26, 2018
ff7b114
test fixup
richardliaw Nov 26, 2018
defe524
yapf
richardliaw Nov 26, 2018
02a9cf8
no skip
richardliaw Nov 26, 2018
f80e318
cluster tests
richardliaw Nov 26, 2018
07df20b
nit
richardliaw Nov 26, 2018
1ff31cb
Fix counting resources test
richardliaw Nov 27, 2018
6998a01
better test and error msg
richardliaw Nov 27, 2018
8ea0f70
Merge branch 'master' into tune_cluster-2
richardliaw Nov 27, 2018
fcbc6de
Tests and better recovery handling
richardliaw Nov 25, 2018
fc5d407
Merge branch 'tune_cluster-2a' into tune_cluster-2_copy
richardliaw Nov 27, 2018
5d8e414
py2mock
richardliaw Nov 26, 2018
9137de0
nit
richardliaw Nov 27, 2018
4453724
Fix counting resources test
richardliaw Nov 27, 2018
5a24499
Remove extraneous changes
richardliaw Nov 27, 2018
2dcba23
Merge branch 'tune_cluster-2a' into tune_cluster-2_copy
richardliaw Nov 27, 2018
b750d4e
docs
richardliaw Nov 27, 2018
14da6ec
yapf
richardliaw Nov 25, 2018
394c0e9
Lint and small changes to tests
richardliaw Nov 27, 2018
48fd3c3
lint
richardliaw Nov 27, 2018
bcf4051
nit
richardliaw Nov 27, 2018
0f67265
small extraneous removals
richardliaw Nov 27, 2018
74b6a93
fix some merge?
richardliaw Nov 27, 2018
1d6c185
Merge branch 'tune_cluster-2a' into tune_cluster-2_copy
richardliaw Nov 27, 2018
4bd54e3
Merge branch 'master' into tune_cluster-2_copy
richardliaw Dec 4, 2018
3d0a2e3
try recover
richardliaw Nov 30, 2018
4bb938f
merge
richardliaw Dec 4, 2018
ac5d8c0
Removed error raising
richardliaw Nov 30, 2018
83e1a26
Rename checkpoint_mode
richardliaw Dec 4, 2018
d77b934
note for pickling
richardliaw Dec 4, 2018
f3071eb
Better UI
richardliaw Dec 5, 2018
0263e93
Fix lint
richardliaw Dec 5, 2018
71db5df
nit
richardliaw Dec 5, 2018
9b9e771
fix up tests
richardliaw Dec 7, 2018
d669463
note
richardliaw Dec 7, 2018
55562a5
nit
richardliaw Dec 7, 2018
6dd5e59
text
richardliaw Dec 7, 2018
0c3ade9
nit
richardliaw Dec 7, 2018
dd1bb6b
Merge branch 'master' into tune_cluster-2
richardliaw Dec 7, 2018
afcf2b9
Merge branch 'master' into tune_cluster-2
richardliaw Dec 8, 2018
e1e7b4e
fix
richardliaw Dec 8, 2018
8961de1
fix usability
richardliaw Dec 11, 2018
adaaf43
Atomic Movement
richardliaw Dec 15, 2018
1976873
removed checkpoint freq
richardliaw Dec 15, 2018
84434f7
Merge branch 'master' into tune_cluster-2
richardliaw Dec 15, 2018
1d63eea
Merge branch 'master' into tune_cluster-2
richardliaw Dec 15, 2018
c2a578b
tweaks to update
richardliaw Dec 15, 2018
60653c1
move sync function
richardliaw Dec 15, 2018
c0d6db5
fix
richardliaw Dec 15, 2018
96c9f70
fixup
richardliaw Dec 15, 2018
2afe291
small modifications
richardliaw Dec 15, 2018
a5647db
fix
richardliaw Dec 15, 2018
f17a648
fix
richardliaw Dec 15, 2018
52ac11b
fix
richardliaw Dec 18, 2018
8a57a5b
fix assumption
richardliaw Dec 18, 2018
530d5af
better error message
richardliaw Dec 18, 2018
758ed30
fix registration
richardliaw Dec 18, 2018
ac8ba8e
Merge branch 'tune_cluster-2' of github.com:richardliaw/ray into tune…
richardliaw Dec 18, 2018
1ff4814
ok fix
richardliaw Dec 18, 2018
1052d3b
fix add test
richardliaw Dec 18, 2018
e0963e5
cloudpickle
richardliaw Dec 18, 2018
de38782
Merge branch 'tune_cluster-2' of github.com:richardliaw/ray into tune…
richardliaw Dec 18, 2018
5358f0c
classmethod
richardliaw Dec 18, 2018
3c1f5bc
Merge branch 'tune_cluster-2' of github.com:richardliaw/ray into tune…
richardliaw Dec 18, 2018
61034c9
fix
richardliaw Dec 18, 2018
cc336f6
Merge branch 'master' into tune_cluster-2
richardliaw Dec 18, 2018
b301963
lint...
richardliaw Dec 18, 2018
8207846
Revert "better error message"
richardliaw Dec 18, 2018
c6ec396
comment
richardliaw Dec 18, 2018
0db58a5
better handling
richardliaw Dec 18, 2018
4d75a38
nit
richardliaw Dec 18, 2018
da21686
a little confusing
richardliaw Dec 18, 2018
386a9fa
lint
richardliaw Dec 18, 2018
9c1cdb9
fixtest
richardliaw Dec 18, 2018
5b69662
fix mess
richardliaw Dec 20, 2018
355d760
docs
richardliaw Dec 20, 2018
f412130
checkpointmode
richardliaw Dec 20, 2018
64e4a09
some reversions
richardliaw Dec 24, 2018
0e78363
Resolve into experiment dir
richardliaw Dec 25, 2018
9c9cd9d
add example
richardliaw Dec 25, 2018
3e8641e
Trial is pickled
richardliaw Dec 25, 2018
c9ca3e2
small path fix
richardliaw Dec 25, 2018
4e7d438
Fix trial serialization
richardliaw Dec 25, 2018
2318f6a
JSONify state
richardliaw Dec 25, 2018
72d027a
make sure trials even without checkpointing are not duplicated
richardliaw Dec 25, 2018
b63f705
small tweaks
richardliaw Dec 25, 2018
31c99df
Merge branch 'master' into tune_cluster-2
richardliaw Dec 25, 2018
b764ab7
Update doc/source/tune-usage.rst
ericl Dec 25, 2018
07a36fc
fix up some components
richardliaw Dec 25, 2018
c2eddef
remove accidental merge
richardliaw Dec 26, 2018
27fecb1
Fix test for uncheckpointables
richardliaw Dec 26, 2018
bc8e67a
back to dict
richardliaw Dec 26, 2018
de1c14b
__getstate__, remove checkpointmode, trial name
richardliaw Dec 26, 2018
33dac8f
resources_to_json
richardliaw Dec 26, 2018
763e6d4
Fix experiments and final comments.
richardliaw Dec 26, 2018
1a15237
Fix more changes
richardliaw Dec 26, 2018
677e09d
resume to prompt on None
richardliaw Dec 26, 2018
83b1523
turn off prompt for tests
richardliaw Dec 26, 2018
d1180c3
Merge
richardliaw Dec 26, 2018
45d5d90
fix for tests
richardliaw Dec 26, 2018
21481f6
Merge branch 'tune_cluster-2' of github.com:richardliaw/ray into tune…
richardliaw Dec 26, 2018
b89b910
Accidentally removed a file
richardliaw Dec 26, 2018
6995a59
example not needed
richardliaw Dec 26, 2018
86b37f6
no need to change this
richardliaw Dec 26, 2018
f92198e
doc changes and small guard
richardliaw Dec 26, 2018
a096772
renames, remove checkpoint_mode
richardliaw Dec 26, 2018
ec726d2
Update python/ray/tune/trial_runner.py
ericl Dec 26, 2018
bf45f42
Simpify, fix tests, address comments, make expmt mandatory
richardliaw Dec 26, 2018
0fe066a
Merge branch 'tune_cluster-2' of github.com:richardliaw/ray into tune…
richardliaw Dec 26, 2018
3f2e1ae
typo
richardliaw Dec 26, 2018
e7b4f20
some more tests
richardliaw Dec 26, 2018
80ace1d
lint
richardliaw Dec 27, 2018
0348eb4
test fix and move env logic
richardliaw Dec 27, 2018
fc6802b
fix
richardliaw Dec 27, 2018
d1f1c0b
fix tests
richardliaw Dec 27, 2018
7650e9f
Update train.py
ericl Dec 27, 2018
54af15c
Update python/ray/tune/tune.py
ericl Dec 27, 2018
8503e17
Update python/ray/tune/tune.py
ericl Dec 27, 2018
4294f42
fix py2 test
richardliaw Dec 28, 2018
b6083b3
Merge branch 'tune_cluster-2' of github.com:richardliaw/ray into tune…
richardliaw Dec 28, 2018
b8da076
grammar
richardliaw Dec 28, 2018
c92313f
fix
richardliaw Dec 28, 2018
577da9b
fix travis
richardliaw Dec 28, 2018
Simpify, fix tests, address comments, make expmt mandatory
richardliaw committed Dec 26, 2018
commit bf45f4268ae706f9e8e10b5bcd4bbdfae58bbcc9
3 changes: 2 additions & 1 deletion python/ray/tune/test/cluster_tests.py
@@ -276,7 +276,8 @@ def test_cluster_down_simple(start_connected_cluster, tmpdir):
     assert cluster.wait_for_nodes()

     dirpath = str(tmpdir)
-    runner = TrialRunner(BasicVariantGenerator(), checkpoint_dir=dirpath)
+    runner = TrialRunner(
+        BasicVariantGenerator(), metadata_checkpoint_dir=dirpath)
     kwargs = {
         "stopping_criterion": {
             "training_iteration": 2
29 changes: 4 additions & 25 deletions python/ray/tune/test/trial_runner_test.py
@@ -617,29 +617,6 @@ def train(config, reporter):
             self.assertEqual(trial.status, Trial.TERMINATED)
             self.assertEqual(trial.last_result[TIMESTEPS_TOTAL], 99)

-    def testSpecifyAlgorithm(self):
-        """Tests run_experiments works without specifying experiment."""
-
-        def train(config, reporter):
-            for i in range(100):
-                reporter(timesteps_total=i)
-
-        register_trainable("f1", train)
-
-        alg = BasicVariantGenerator()
-        alg.add_configurations({
-            "foo": {
-                "run": "f1",
-                "config": {
-                    "script_min_iter_time_s": 0
-                }
-            }
-        })
-        trials = run_experiments(search_alg=alg)
-        for trial in trials:
-            self.assertEqual(trial.status, Trial.TERMINATED)
-            self.assertEqual(trial.last_result[TIMESTEPS_TOTAL], 99)
-
     def testAutoregisterTrainable(self):
         def train(config, reporter):
             for i in range(100):
@@ -1663,7 +1640,8 @@ def testTrialSaveRestore(self):
         ray.init(num_cpus=3)
         tmpdir = tempfile.mkdtemp()

-        runner = TrialRunner(BasicVariantGenerator(), checkpoint_dir=tmpdir)
+        runner = TrialRunner(
+            BasicVariantGenerator(), metadata_checkpoint_dir=tmpdir)
         trials = [
             Trial(
                 "__fake",
@@ -1722,7 +1700,8 @@ def testTrialNoSave(self):
         ray.init(num_cpus=3)
         tmpdir = tempfile.mkdtemp()

-        runner = TrialRunner(BasicVariantGenerator(), checkpoint_dir=tmpdir)
+        runner = TrialRunner(
+            BasicVariantGenerator(), metadata_checkpoint_dir=tmpdir)

         runner.add_trial(
             Trial(
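Note: both test hunks above exercise the same API change, the rename of TrialRunner's constructor argument from checkpoint_dir to metadata_checkpoint_dir. Below is a minimal usage sketch of the renamed argument, loosely mirroring the tests; the trainable, trial arguments, and driving loop are illustrative assumptions, not taken from this diff.

import tempfile

import ray
from ray.tune import register_trainable
from ray.tune.trial import Trial
from ray.tune.trial_runner import TrialRunner
from ray.tune.suggest import BasicVariantGenerator

def dummy(config, reporter):
    # Trivial function trainable for illustration; reports two results.
    for i in range(2):
        reporter(timesteps_total=i)

register_trainable("dummy", dummy)

ray.init(num_cpus=2)
metadata_dir = tempfile.mkdtemp()

# The runner now takes `metadata_checkpoint_dir`: the directory where the
# experiment-level (metadata) checkpoint is written as the runner steps.
runner = TrialRunner(
    BasicVariantGenerator(), metadata_checkpoint_dir=metadata_dir)
runner.add_trial(
    Trial("dummy", stopping_criterion={"training_iteration": 2}))
while not runner.is_finished():
    runner.step()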
4 changes: 3 additions & 1 deletion python/ray/tune/trial.py
@@ -427,7 +427,9 @@ def __getstate__(self):
     def __setstate__(self, state):
         logger_started = state.pop("__logger_started__")
         state["resources"] = json_to_resources(state["resources"])
-        for key in ["_checkpoint", "config", "custom_loggers", "sync_function"]:
+        for key in [
+                "_checkpoint", "config", "custom_loggers", "sync_function"
+        ]:
             state[key] = cloudpickle.loads(hex_to_binary(state[key]))

         self.__dict__.update(state)
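The loop above restores attributes that Trial.__getstate__ serialized with cloudpickle and stored as hex strings. A standalone sketch of that round-trip pattern; binary_to_hex and hex_to_binary here are local stand-ins for the helpers in ray.utils, defined only to keep the sketch self-contained.

import binascii

import cloudpickle

def binary_to_hex(data):
    # Stand-in for ray.utils.binary_to_hex.
    return binascii.hexlify(data).decode()

def hex_to_binary(hex_string):
    # Stand-in for ray.utils.hex_to_binary.
    return binascii.unhexlify(hex_string)

# Attributes that are not JSON-serializable are cloudpickled into hex strings
# when the trial is saved, then decoded back into live objects on restore.
state = {"config": binary_to_hex(cloudpickle.dumps({"lr": 0.01}))}
restored = cloudpickle.loads(hex_to_binary(state["config"]))
assert restored == {"lr": 0.01}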
6 changes: 3 additions & 3 deletions python/ray/tune/trial_executor.py
@@ -25,7 +25,7 @@ def __init__(self, queue_trials=False):
             automatic scale-up.
         """
         self._queue_trials = queue_trials
-        self._checkpoints = {}
+        self._cached_trial_state = {}

     def set_status(self, trial, status):
         """Sets status and checkpoints metadata if needed.
@@ -53,13 +53,13 @@ def try_checkpoint_metadata(self, trial):
             return
         try:
             logger.debug("Saving trial metadata.")
-            self._checkpoints[trial.trial_id] = trial.__getstate__()
+            self._cached_trial_state[trial.trial_id] = trial.__getstate__()
         except Exception:
             logger.exception("Error checkpointing trial metadata.")

     def get_checkpoints(self):
         """Returns a copy of mapping of the trial ID to pickled metadata."""
-        return self._checkpoints.copy()
+        return self._cached_trial_state.copy()

     def has_resources(self, resources):
         """Returns whether this runner has at least the specified resources."""
30 changes: 19 additions & 11 deletions python/ray/tune/trial_runner.py
@@ -68,8 +68,8 @@ def __init__(self,
                Trial objects.
            scheduler (TrialScheduler): Defaults to FIFOScheduler.
            launch_web_server (bool): Flag for starting TuneServer
-            metadata_checkpoint_dir (str): Path where global checkpoints are stored
-                and restored from.
+            metadata_checkpoint_dir (str): Path where
+                global checkpoints are stored and restored from.
            server_port (int): Port number for launching TuneServer
            verbose (bool): Flag for verbosity. If False, trial results
                will not be output.
@@ -103,7 +103,7 @@ def __init__(self,
         self._metadata_checkpoint_dir = metadata_checkpoint_dir

     def checkpoint(self):
-        """Saves execution state to `self._metadata_checkpoint_dir` if provided."""
+        """Saves execution state to `self._metadata_checkpoint_dir`."""
         if not self._metadata_checkpoint_dir:
             return
         metadata_checkpoint_dir = self._metadata_checkpoint_dir
@@ -114,12 +114,14 @@ def checkpoint(self):
                 self.trial_executor.get_checkpoints().values()),
             "runner_data": self.__getstate__()
         }
-        tmp_file_name = os.path.join(metadata_checkpoint_dir, ".tmp_checkpoint")
+        tmp_file_name = os.path.join(metadata_checkpoint_dir,
+                                     ".tmp_checkpoint")
         with open(tmp_file_name, "w") as f:
             json.dump(runner_state, f, indent=2)

-        os.rename(tmp_file_name,
-                  os.path.join(metadata_checkpoint_dir, TrialRunner.CKPT_FILE_NAME))
+        os.rename(
+            tmp_file_name,
+            os.path.join(metadata_checkpoint_dir, TrialRunner.CKPT_FILE_NAME))
         return metadata_checkpoint_dir

     @classmethod
@@ -134,7 +136,7 @@ def restore(cls,
            all ongoing trials.

        Args:
-            metadata_checkpoint_dir (str): Path to checkpoint (previously specified).
+            metadata_checkpoint_dir (str): Path to metadata checkpoints.
            search_alg (SearchAlgorithm): Search Algorithm. Defaults to
                BasicVariantGenerator.
            scheduler (TrialScheduler): Scheduler for executing
@@ -144,12 +146,18 @@ def restore(cls,
        Returns:
            runner (TrialRunner): A TrialRunner to resume experiments from.
        """
-        with open(os.path.join(metadata_checkpoint_dir, TrialRunner.CKPT_FILE_NAME),
-                  "r") as f:
+        with open(
+                os.path.join(metadata_checkpoint_dir,
+                             TrialRunner.CKPT_FILE_NAME), "r") as f:
             runner_state = json.load(f)

-        logger.warning("Tune recovery is still experimental. "
-                       "There is limited search algorithm recovery support. ")
+        logger.warning("".join([
+            "Attempting to resume experiment from {}. ".format(
+                metadata_checkpoint_dir),
+            "This feature is experimental, "
+            "and may not work with all search algorithms. ",
+            "This will ignore any new changes to specification."
+        ]))

         from ray.tune.suggest import BasicVariantGenerator
         runner = TrialRunner(
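The checkpoint() change above keeps the metadata write atomic: the runner state is dumped to a temporary file in the same directory and then renamed over the real checkpoint, so a crash mid-write never leaves a truncated checkpoint behind. A minimal sketch of that pattern follows; CKPT_FILE_NAME is an assumed value, since the diff only references TrialRunner.CKPT_FILE_NAME without showing it.

import json
import os

# Assumed filename for illustration only.
CKPT_FILE_NAME = "experiment_state.json"

def checkpoint_runner_state(metadata_checkpoint_dir, runner_state):
    # Write to a temp file in the target directory first.
    tmp_file_name = os.path.join(metadata_checkpoint_dir, ".tmp_checkpoint")
    with open(tmp_file_name, "w") as f:
        json.dump(runner_state, f, indent=2)
    # On POSIX, rename within the same filesystem is atomic, so readers never
    # observe a half-written checkpoint file.
    os.rename(tmp_file_name,
              os.path.join(metadata_checkpoint_dir, CKPT_FILE_NAME))
    return metadata_checkpoint_dir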
60 changes: 25 additions & 35 deletions python/ray/tune/tune.py
@@ -36,17 +36,13 @@ def _make_scheduler(args):


 def _find_checkpoint_dir(exp_list):
-    if exp_list:
-        exp = exp_list[0]
-        # TODO(rliaw): Make sure this is resolved earlier.
-        return os.path.join(exp.spec["local_dir"], exp.name)
-    else:
-        return None
+    assert exp_list, "Experiments must be specified via `run_experiments`"
+    exp = exp_list[0]
+    # TODO(rliaw): Make sure this is resolved earlier.
+    return os.path.join(exp.spec["local_dir"], exp.name)


 def try_restore_runner(checkpoint_dir, search_alg, scheduler, trial_executor):
-    logger.warn("Restoring from previous experiment and "
-                "ignoring any new changes to specification.")
     new_runner = None
     try:
         new_runner = TrialRunner.restore(checkpoint_dir, search_alg, scheduler,
@@ -56,7 +52,7 @@ def try_restore_runner(checkpoint_dir, search_alg, scheduler, trial_executor):
     return new_runner


-def run_experiments(experiments=None,
+def run_experiments(experiments,
                     search_alg=None,
                     scheduler=None,
                     with_server=False,
@@ -76,8 +72,6 @@ def run_experiments(experiments=None,
        scheduler (TrialScheduler): Scheduler for executing
            the experiment. Choose among FIFO (default), MedianStopping,
            AsyncHyperBand, and HyperBand.
-        checkpoint_dir (str): Path at which experiment checkpoints are stored
-            and restored from.
        with_server (bool): Starts a background Tune server. Needed for
            using the Client API.
        server_port (int): Port number for launching TuneServer.
@@ -118,31 +112,27 @@ def run_experiments(experiments=None,
     # and it conducts the implicit registration.
     experiments = convert_to_experiment_list(experiments)
     checkpoint_dir = _find_checkpoint_dir(experiments)
-    if checkpoint_dir:
-        logger.info("Using checkpoint dir: {}.".format(checkpoint_dir))
+
     runner = None
+    restore = False
+
+    if os.path.exists(
+            os.path.join(checkpoint_dir, TrialRunner.CKPT_FILE_NAME)):
+        if resume:
+            restore = True
+        elif resume is None and not os.environ.get("TUNE_RESUME_PROMPT_OFF"):
+            msg = "Would you like to resume your experiment from '{}'?".format(
+                checkpoint_dir)
+            restore = click.confirm(msg, default=True)
+    else:
+        logger.info(
+            "Did not find checkpoint file in {}.".format(checkpoint_dir))

-    if resume:
-        if not checkpoint_dir:
-            raise ValueError(
-                "checkpoint_dir not detected. "
-                "Set resume=False or set a local_dir."
-            )
-        if not os.path.exists(
-                os.path.join(checkpoint_dir, TrialRunner.CKPT_FILE_NAME)):
-            logger.warn(
-                "Did not find checkpoint file in {}.".format(checkpoint_dir))
-        else:
-            runner = try_restore_runner(checkpoint_dir, search_alg, scheduler,
-                                        trial_executor)
-    elif resume is None and not os.environ.get("TUNE_RESUME_PROMPT_OFF"):
-        if os.path.exists(os.path.join(checkpoint_dir, TrialRunner.CKPT_FILE_NAME)):
-            if click.confirm("Detected checkpoint dir: {}. Restore?".format(
-                    checkpoint_dir)):
-                runner = try_restore_runner(checkpoint_dir, search_alg,
-                                            scheduler, trial_executor)
-            else:
-                logger.info("Overriding checkpoint and restarting experiment.")
+    if restore:
+        runner = try_restore_runner(checkpoint_dir, search_alg, scheduler,
+                                    trial_executor)
+    else:
+        logger.info("Starting a new experiment.")

     if not runner:
         if scheduler is None:
@@ -156,7 +146,7 @@ def run_experiments(experiments=None,
         runner = TrialRunner(
             search_alg,
             scheduler=scheduler,
-            checkpoint_dir=checkpoint_dir,
+            metadata_checkpoint_dir=checkpoint_dir,
             launch_web_server=with_server,
             server_port=server_port,
             verbose=verbose,
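Taken together, the tune.py changes make `experiments` mandatory, derive the metadata checkpoint directory from the first experiment's local_dir and name, and reduce resume handling to a single restore flag (with an interactive prompt when resume is None, unless TUNE_RESUME_PROMPT_OFF is set). A hedged usage sketch of these semantics; the trainable, experiment name, and stopping criterion are illustrative and not taken from the diff.

import ray
from ray.tune import register_trainable, run_experiments

def my_train(config, reporter):
    for i in range(10):
        reporter(timesteps_total=i)

register_trainable("my_train", my_train)

ray.init()
# On a fresh run this starts a new experiment; on a rerun with resume=True it
# restores from the metadata checkpoint under <local_dir>/resumable_exp
# (the default local_dir is ~/ray_results), ignoring any changes to the
# specification below.
run_experiments(
    {
        "resumable_exp": {
            "run": "my_train",
            "stop": {"timesteps_total": 9},
        }
    },
    resume=True)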