Failed to sample in interval ... #647

Closed
mgermain opened this issue Aug 30, 2021 · 2 comments
Labels
bug: Indicates an unexpected problem or unintended behavior
medium: The bug breaks a feature but it can still be used or causes a confusing user experience
reproduced

Comments

mgermain commented Aug 30, 2021

Describe the bug
I started 3 workers on the same TPE experiment.

  • After ~15 hours one worker stopped with the following error.
  • After ~55 hours another worker stopped with the same error.
RuntimeError: Failed to sample in interval (-18.420680743952367, -9.210340371976182)

After ~60 hours I checked on the workers: the third one was still running, and the other two had stopped.
I relaunched the two stopped workers and they now run fine, with no problem sampling new trials.

Expected behavior
If a worker stops because it considers the search converged, it should not be possible to relaunch it.
And when one worker stops, shouldn't the other workers also stop the next time they sample?

Steps to reproduce
Describe the steps to reproduce the bug, if applicable:

Environment (please complete the following information):

  • OS: Linux instance-2 5.4.0-1051-gcp 18.04.1-Ubuntu SMP Sun Aug 1 20:38:04 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
  • Python version: 3.9.6
  • Oríon version: 0.1.16
  • Database: MongoDB, atlas

Additional context

2021-08-30 06:22:22,941::DEBUG::orion.core.worker.consumer::Parsing results from file and fill corresponding Trial object.
2021-08-30 06:22:22,942::INFO::orion.core.worker.experiment::Completed trials with results: [Result(name='objective', type='objective', value=-0.0)]
2021-08-30 06:22:22,976::DEBUG::orion.client.experiment::Trying to reserve a new trial to evaluate.
2021-08-30 06:22:22,976::DEBUG::orion.core.worker.experiment::reserving trial with (score: None)
2021-08-30 06:22:22,983::DEBUG::orion.core.worker.experiment::reserved trial (trial: None)
2021-08-30 06:22:22,993::DEBUG::orion.client.experiment::#### Failed to pull a new trial from database.
2021-08-30 06:22:22,993::DEBUG::orion.client.experiment::#### Fetch most recent completed trials and update algorithm.
2021-08-30 06:22:23,003::DEBUG::orion.core.worker.producer::### Fetch completed trials to observe:
2021-08-30 06:22:23,013::DEBUG::orion.core.worker.producer::### [Trial(experiment=ObjectId('612974a9fba8f7e34bf6e518'), status='completed', params=/data/resize/size:1032,/model/pretrained/base_model_lr:1e-08,/model/pretrained/classif_head/nb_layers:3,/model/pretrained/classif_head/nb_units:75,/optimizer/lr:2e-05,/trainer/seed:4242)]
2021-08-30 06:22:23,013::DEBUG::orion.core.worker.producer::### Convert them to list of points and their results.
2021-08-30 06:22:23,013::DEBUG::orion.core.worker.producer::### Observe them.
2021-08-30 06:22:23,018::DEBUG::orion.core.worker.producer::### Create fake trials to observe:
2021-08-30 06:22:23,018::DEBUG::orion.core.worker.producer::### Fetch active trials to observe:
2021-08-30 06:22:23,018::DEBUG::orion.core.worker.producer::### [Trial(experiment=ObjectId('612974a9fba8f7e34bf6e518'), status='reserved', params=/data/resize/size:1032,/model/pretrained/base_model_lr:1e-08,/model/pretrained/classif_head/nb_layers:3,/model/pretrained/classif_head/nb_units:553,/optimizer/lr:0.004,/trainer/seed:4242)]
2021-08-30 06:22:23,019::DEBUG::orion.core.worker.producer::### Use defined ParallelStrategy to assign them fake results.
2021-08-30 06:22:23,019::DEBUG::orion.core.worker.producer::### Register lie to database: Trial(experiment=ObjectId('612974a9fba8f7e34bf6e518'), status='reserved', params=/data/resize/size:1032,/model/pretrained/base_model_lr:1e-08,/model/pretrained/classif_head/nb_layers:3,/model/pretrained/classif_head/nb_units:553,/optimizer/lr:0.004,/trainer/seed:4242)
2021-08-30 06:22:23,026::DEBUG::orion.core.worker.producer::### [Trial(experiment=ObjectId('612974a9fba8f7e34bf6e518'), status='completed', params=/data/resize/size:1032,/model/pretrained/base_model_lr:1e-08,/model/pretrained/classif_head/nb_layers:3,/model/pretrained/classif_head/nb_units:553,/optimizer/lr:0.004,/trainer/seed:4242)]
2021-08-30 06:22:23,026::DEBUG::orion.core.worker.producer::### Convert them to list of points and their results.
2021-08-30 06:22:23,026::DEBUG::orion.core.worker.producer::### Observe them.
2021-08-30 06:22:23,028::DEBUG::orion.client.experiment::#### Produce new trials.
2021-08-30 06:22:23,038::DEBUG::orion.core.worker.producer::### Algorithm suggests new points.
Traceback (most recent call last):
  File "/home/mathieu.germain/.conda/envs/optina/bin/orion", line 33, in <module>
    sys.exit(load_entry_point('orion==0.1.16', 'console_scripts', 'orion')())
  File "/home/mathieu.germain/.conda/envs/optina/lib/python3.9/site-packages/orion/core/cli/__init__.py", line 37, in main
    return orion_parser.execute(argv)
  File "/home/mathieu.germain/.conda/envs/optina/lib/python3.9/site-packages/orion/core/cli/base.py", line 89, in execute
    returncode = function(args)
  File "/home/mathieu.germain/.conda/envs/optina/lib/python3.9/site-packages/orion/core/cli/hunt.py", line 207, in main
    workon(experiment, ignore_code_changes=ignore_code_changes, **worker_config)
  File "/home/mathieu.germain/.conda/envs/optina/lib/python3.9/site-packages/orion/core/cli/hunt.py", line 163, in workon
    client.workon(
  File "/home/mathieu.germain/.conda/envs/optina/lib/python3.9/site-packages/orion/client/experiment.py", line 723, in workon
    trials = self.executor.wait(
  File "/home/mathieu.germain/.conda/envs/optina/lib/python3.9/site-packages/orion/executor/joblib_backend.py", line 32, in wait
    return joblib.Parallel(n_jobs=self.n_workers)(futures)
  File "/home/mathieu.germain/.conda/envs/optina/lib/python3.9/site-packages/joblib/parallel.py", line 1041, in __call__
    if self.dispatch_one_batch(iterator):
  File "/home/mathieu.germain/.conda/envs/optina/lib/python3.9/site-packages/joblib/parallel.py", line 859, in dispatch_one_batch
    self._dispatch(tasks)
  File "/home/mathieu.germain/.conda/envs/optina/lib/python3.9/site-packages/joblib/parallel.py", line 777, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "/home/mathieu.germain/.conda/envs/optina/lib/python3.9/site-packages/joblib/_parallel_backends.py", line 208, in apply_async
    result = ImmediateResult(func)
  File "/home/mathieu.germain/.conda/envs/optina/lib/python3.9/site-packages/joblib/_parallel_backends.py", line 572, in __init__
    self.results = batch()
  File "/home/mathieu.germain/.conda/envs/optina/lib/python3.9/site-packages/joblib/parallel.py", line 262, in __call__
    return [func(*args, **kwargs)
  File "/home/mathieu.germain/.conda/envs/optina/lib/python3.9/site-packages/joblib/parallel.py", line 262, in <listcomp>
    return [func(*args, **kwargs)
  File "/home/mathieu.germain/.conda/envs/optina/lib/python3.9/site-packages/orion/client/experiment.py", line 745, in _optimize
    with self.suggest() as trial:
  File "/home/mathieu.germain/.conda/envs/optina/lib/python3.9/site-packages/orion/client/experiment.py", line 546, in suggest
    trial = reserve_trial(self._experiment, self._producer)
  File "/home/mathieu.germain/.conda/envs/optina/lib/python3.9/site-packages/orion/client/experiment.py", line 54, in reserve_trial
    producer.produce()
  File "/home/mathieu.germain/.conda/envs/optina/lib/python3.9/site-packages/orion/core/worker/producer.py", line 111, in produce
    new_points = self.suggest()
  File "/home/mathieu.germain/.conda/envs/optina/lib/python3.9/site-packages/orion/core/worker/producer.py", line 97, in suggest
    return self.naive_algorithm.suggest(num)
  File "/home/mathieu.germain/.conda/envs/optina/lib/python3.9/site-packages/orion/core/worker/primary_algo.py", line 67, in suggest
    points = self.algorithm.suggest(num)
  File "/home/mathieu.germain/.conda/envs/optina/lib/python3.9/site-packages/orion/algo/tpe.py", line 301, in suggest
    candidates = self._suggest_bo(max(num - len(samples), 0))
  File "/home/mathieu.germain/.conda/envs/optina/lib/python3.9/site-packages/orion/algo/tpe.py", line 339, in _suggest_bo
    return self._suggest(num, suggest_bo)
  File "/home/mathieu.germain/.conda/envs/optina/lib/python3.9/site-packages/orion/algo/tpe.py", line 316, in _suggest
    for candidate in function(num - len(points)):
  File "/home/mathieu.germain/.conda/envs/optina/lib/python3.9/site-packages/orion/algo/tpe.py", line 337, in suggest_bo
    return [self._suggest_one_bo() for _ in range(num)]
  File "/home/mathieu.germain/.conda/envs/optina/lib/python3.9/site-packages/orion/algo/tpe.py", line 337, in <listcomp>
    return [self._suggest_one_bo() for _ in range(num)]
  File "/home/mathieu.germain/.conda/envs/optina/lib/python3.9/site-packages/orion/algo/tpe.py", line 357, in _suggest_one_bo
    dim_samples = self._sample_real_dimension(
  File "/home/mathieu.germain/.conda/envs/optina/lib/python3.9/site-packages/orion/algo/tpe.py", line 416, in _sample_real_dimension
    return self.sample_one_dimension(
  File "/home/mathieu.germain/.conda/envs/optina/lib/python3.9/site-packages/orion/algo/tpe.py", line 408, in sample_one_dimension
    new_point = sampler(dimension, below_points[j], above_points[j])
  File "/home/mathieu.germain/.conda/envs/optina/lib/python3.9/site-packages/orion/algo/tpe.py", line 465, in _sample_real_point
    candidate_points = gmm_sampler_below.sample(self.n_ei_candidates)
  File "/home/mathieu.germain/.conda/envs/optina/lib/python3.9/site-packages/orion/algo/tpe.py", line 588, in sample
    raise RuntimeError(
RuntimeError: Failed to sample in interval (-18.420680743952367, -9.210340371976182)
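
The traceback ends in a sampler that gives up after failing to draw points inside an interval. The following is a minimal sketch of how such a failure can occur, assuming a rejection sampler with a fixed retry budget. The function `sample_in_interval` and its parameters are hypothetical and are not taken from Oríon's `tpe.py`; the means and standard deviations are made up for illustration.

```python
import random


def sample_in_interval(rng, mu, sigma, low, high, n, max_attempts=10):
    """Rejection-sample n points from N(mu, sigma) restricted to [low, high].

    Hypothetical helper: raises RuntimeError if max_attempts rounds of
    drawing do not yield n points inside the interval.
    """
    accepted = []
    for _ in range(max_attempts):
        draws = [rng.gauss(mu, sigma) for _ in range(n)]
        # Keep only the draws that fall inside the interval.
        accepted.extend(x for x in draws if low <= x <= high)
        if len(accepted) >= n:
            return accepted[:n]
    raise RuntimeError(f"Failed to sample in interval ({low}, {high})")


rng = random.Random(1234)
low, high = -18.420680743952367, -9.210340371976182

# A component centred inside the interval: sampling succeeds.
ok = sample_in_interval(rng, mu=-13.8, sigma=1.0, low=low, high=high, n=5)

# A very narrow component centred outside the interval (assumed values):
# essentially no draw lands inside, so the retry budget runs out.
try:
    sample_in_interval(rng, mu=-5.0, sigma=1e-3, low=low, high=high, n=5)
    failed = False
except RuntimeError as exc:
    failed = True
    message = str(exc)
```

Whether this is the mechanism behind the error reported here is an assumption; the sketch only shows that a retry-limited sampler can fail on some draws and succeed on others, which would be consistent with relaunched workers sampling normally.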

Possible solution
If you think you know what the problem is, let us know! Your opinion helps us.

mgermain added the bug label Aug 30, 2021
bouthilx (Member) commented Aug 31, 2021

For reference, based on offline discussions:

A minimal search space representative of the use-case:

space = {
  'a': 'choices([1, 2])',
  'b': 'choices([1, 2, 3])',
  'c': 'choices([1, 2])',
  'd': 'choices([1, 2, 3, 4])',
  'e': 'loguniform(1e-8, 1e-4, precision=1)',
  'f': 'uniform(0, 3, discrete=True, shape=())',
  'g': 'uniform(25, 500, discrete=True, shape=())',
  'h': 'loguniform(1e-5, 1e-2, precision=1)',
  'i': 'choices([1, 2])'
}

TPE configuration:

experiment:
  algorithms:
    tpe:
      seed: 1234

I will try to reproduce the issue. If successful I will see if I can reproduce with a simpler search space.
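
As a side note, the bounds in the reported error message appear to be the natural logarithms of the limits of dimension 'e' in the search space above, `loguniform(1e-8, 1e-4, precision=1)`, which suggests the failing sampler was working on that dimension in log space. A quick check, using only the numbers already quoted in this issue:

```python
import math

# Bounds of dimension 'e' from the search space above.
low, high = 1e-8, 1e-4

# Interval quoted in the RuntimeError message.
reported = (-18.420680743952367, -9.210340371976182)

log_low, log_high = math.log(low), math.log(high)

# Compare with a tolerance rather than exact float equality.
matches = (
    math.isclose(log_low, reported[0], rel_tol=1e-12)
    and math.isclose(log_high, reported[1], rel_tol=1e-12)
)
print(log_low, log_high, matches)
```

Dimension 'h', `loguniform(1e-5, 1e-2, precision=1)`, would give a different log interval, so it does not match the message.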

bouthilx added the medium and reproduced labels Sep 1, 2021
bouthilx (Member) commented Sep 2, 2021

Fixed in #650

bouthilx closed this as completed Sep 2, 2021