
YAHPO Gym always requires full configuration, also in case of forbidden hyperparameters #94

Open
LukasFehring opened this issue Mar 5, 2025 · 2 comments

Comments

LukasFehring commented Mar 5, 2025

I observed this behavior, for example, in the case of "rbv2_ranger" on instance '470'.

The following example was created in a fresh environment with yahpogym. check=False is required because ConfigSpace 0.6.1 does not contain the needed check_valid_configuration method:

from yahpo_gym import benchmark_set

benchmark = benchmark_set.BenchmarkSet(scenario="rbv2_ranger", check=False)
benchmark.set_instance(value="470")

config = {
    "min.node.size": 50,
    "mtry.power": 0.0,
    "num.impute.selected.cpo": "impute.mean",
    "num.trees": 1000,
    "respect.unordered.factors": "ignore",
    "sample.fraction": 0.55,
    "splitrule": "gini",
    "task_id": "470",
}

print(benchmark.objective_function(config))
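For illustration, the gap can be reproduced without YAHPO Gym itself. This is a hypothetical stdlib-only sketch (hyperparameter names taken from the rbv2_ranger optimization space shown later in this thread; its only condition is that num.random.splits is active when splitrule == 'extratrees') that computes which active hyperparameters the config above is missing:

```python
# Hypothetical stdlib-only sketch, not YAHPO Gym code: which active
# hyperparameters of the rbv2_ranger optimization space are missing
# from the config above?
OPT_SPACE = {
    "min.node.size", "mtry.power", "num.impute.selected.cpo",
    "num.random.splits", "num.trees", "repl", "replace",
    "respect.unordered.factors", "sample.fraction", "splitrule",
    "task_id", "trainsize",
}

def active_params(config: dict) -> set:
    """All parameters are active except num.random.splits, which is
    only active when splitrule == 'extratrees' (the single condition
    of this space)."""
    active = set(OPT_SPACE)
    if config.get("splitrule") != "extratrees":
        active.discard("num.random.splits")
    return active

config = {
    "min.node.size": 50,
    "mtry.power": 0.0,
    "num.impute.selected.cpo": "impute.mean",
    "num.trees": 1000,
    "respect.unordered.factors": "ignore",
    "sample.fraction": 0.55,
    "splitrule": "gini",
    "task_id": "470",
}

missing = active_params(config) - config.keys()
print(sorted(missing))  # ['repl', 'replace', 'trainsize']
```

So even though the forbidden (inactive) parameter num.random.splits may be omitted, three active parameters are absent, which is what the evaluation stumbles over.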
@sumny changed the title from "YahpoGym always requries full configuraiton, also in case of forbidden hyperparameters" to "YAHPO Gym always requries full configuration, also in case of forbidden hyperparameters" on Mar 5, 2025
sumny (Collaborator) commented Mar 5, 2025

Thanks for opening this issue.
Can you maybe elaborate a bit on what exactly the issue is and what the expected behavior should be?
In the meantime, let me explain:

benchmark.get_opt_space()
Configuration space object:
  Hyperparameters:
    min.node.size, Type: UniformInteger, Range: [1, 100], Default: 50
    mtry.power, Type: UniformFloat, Range: [0.0, 1.0], Default: 0.0
    num.impute.selected.cpo, Type: Categorical, Choices: {impute.mean, impute.median, impute.hist}, Default: impute.mean
    num.random.splits, Type: UniformInteger, Range: [1, 100], Default: 1
    num.trees, Type: UniformInteger, Range: [1, 2000], Default: 1000
    repl, Type: Categorical, Choices: {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}, Default: 10
    replace, Type: Categorical, Choices: {TRUE, FALSE}, Default: TRUE
    respect.unordered.factors, Type: Categorical, Choices: {ignore, order, partition}, Default: ignore
    sample.fraction, Type: UniformFloat, Range: [0.1, 1.0], Default: 0.55
    splitrule, Type: Categorical, Choices: {gini, extratrees}, Default: gini
    task_id, Type: Constant, Value: 470
    trainsize, Type: UniformFloat, Range: [0.03, 1.0], Default: 0.525
  Conditions:
    num.random.splits | splitrule == 'extratrees'

tells you what a configuration should look like based on the optimization space (which sets the instance value to a constant and potentially drops fidelity parameters, fixing them at their highest value). I.e., benchmark.config_space is not necessarily the search space that is optimized over; it contains more parameters than the actual optimization space, which you get with get_opt_space().
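As a rough dict-based sketch of that relationship (illustrative only, not YAHPO Gym's actual implementation; the instance ids other than "470" are hypothetical): the optimization space can be thought of as the full configuration space with the instance parameter pinned to a constant:

```python
# Illustrative sketch only: the optimization space as the full
# configuration space with the instance parameter pinned to a constant.
# Parameter names follow the rbv2_ranger output below; extra instance
# ids here are hypothetical placeholders.
full_space = {
    "num.trees": ("int", 1, 2000),
    "trainsize": ("float", 0.03, 1.0),
    "task_id": ("categorical", ["470", "3", "31"]),  # all instances
}

def opt_space(space: dict, instance: str) -> dict:
    """Return a copy of the space with the instance fixed to a constant."""
    pinned = dict(space)
    pinned["task_id"] = ("constant", instance)
    return pinned

print(opt_space(full_space, "470")["task_id"])  # ('constant', '470')
```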

If we sample a point from the optimization space, we can also see what the benchmark expects a configuration to contain:

benchmark.get_opt_space().sample_configuration(1)
Configuration(values={
  'min.node.size': 71,
  'mtry.power': 0.9155386229529013,
  'num.impute.selected.cpo': 'impute.hist',
  'num.random.splits': 34,
  'num.trees': 1485,
  'repl': '1',
  'replace': 'TRUE',
  'respect.unordered.factors': 'partition',
  'sample.fraction': 0.16011881922566848,
  'splitrule': 'extratrees',
  'task_id': '470',
  'trainsize': 0.5538487728906515,
})

I.e., a configuration must always contain all parameters that are active.
This is because the surrogate model is trained over all instances and points for a given scenario and is able to handle missing-value imputation. Therefore, if you disable checks, the surrogate will still return a prediction, even for an incomplete configuration (because it can handle missing values), but the output will likely not be sensible.
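A toy illustration of that failure mode (a hypothetical stand-in, not YAHPO Gym's actual surrogate): a model that imputes a neutral placeholder for any missing feature will always return a number, but for an incomplete configuration that number is computed from placeholders rather than values the caller chose:

```python
# Toy stand-in for a surrogate that tolerates missing features by
# imputing a neutral value. It always returns *a* prediction, but for
# an incomplete configuration the result reflects imputed placeholders,
# not the caller's intent.
FEATURES = ["num.trees", "sample.fraction", "trainsize", "repl"]
IMPUTE_VALUE = 0.0  # placeholder used for any missing feature

def toy_predict(config: dict) -> float:
    row = [float(config.get(name, IMPUTE_VALUE)) for name in FEATURES]
    # Stand-in "model": a fixed weighted sum; the real surrogate is learned.
    return sum(0.1 * x for x in row)

full = {"num.trees": 1000, "sample.fraction": 0.55,
        "trainsize": 0.525, "repl": 10}
partial = {"num.trees": 1000, "sample.fraction": 0.55}  # trainsize, repl missing

print(toy_predict(full))     # uses the caller's values
print(toy_predict(partial))  # silently imputed: a number, but not sensible
```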

If you keep check = True:

benchmark = benchmark_set.BenchmarkSet(scenario="rbv2_ranger", check=True)

config = {
    "min.node.size": 50,
    "mtry.power": 0.0,
    "num.impute.selected.cpo": "impute.mean",
    "num.trees": 1000,
    "respect.unordered.factors": "ignore",
    "sample.fraction": 0.55,
    "splitrule": "gini",
    "task_id": "470",
}

print(benchmark.objective_function(config))

you will actually be told that your point is not fully specified:

ValueError: Active hyperparameter 'repl' not specified!

So in general, unless you always specify points fully, setting check = False can be misleading: the surrogate will still try to predict values for a configuration that it has never seen and that does not actually exist in this space (since the space requires full specification of all active parameters).

What I am not sure about is the "check=False is required because ConfigSpace 0.6.1 does not contain the needed check_valid_configuration method" part of your question.
Can you provide more details here? As far as I recall, check = True performs an internal check of the provided configuration prior to evaluating it with the surrogate model, and this is done with the check_configuration method of the config_space object itself (of class ConfigSpace.configuration_space.ConfigurationSpace).
Admittedly, YAHPO Gym is still in need of some overhaul to work with newer versions of ConfigSpace (which will hopefully be done with a v2), but the check itself should work.

LukasFehring (Author) commented Mar 6, 2025

Hi, thank you for your swift answer :)

Yes, the issue appears to be caused by us using the new ConfigSpace. Because of the old ConfigSpace pin, SMAC and other libraries cannot, by default, be used with yahpogym. For that reason, we use an updated version created by @benjamc and patch local references. This was done for CARP-S.

git clone https://github.com/benjamc/yahpo_gym.git lib/yahpo_gym
$CONDA_RUN_COMMAND $PIP install -e lib/yahpo_gym/yahpo_gym
cd $CARPS_ROOT/carps
mkdir benchmark_data
cd benchmark_data
git clone https://github.com/slds-lmu/yahpo_data.git
cd ../..
$CONDA_RUN_COMMAND python $CARPS_ROOT/scripts/patch_yahpo_configspace.py
$CONDA_RUN_COMMAND $PIP install ConfigSpace --upgrade

In order to still be able to use the library, we start from a default configuration and replace all optimized parameters, as indicated below. Would you suggest setting those parameters differently? We would assume that the surrogate was trained with the defaults?

def _train(self, config: Configuration, seed: int = 0):
    # Start with the default configuration and overwrite the optimized
    # values; otherwise YAHPO Gym fails because of missing parameters.
    final_config = self.benchmark._get_config_space().get_default_configuration()
    for name, value in config.items():
        final_config[name] = value

    res = self.benchmark.objective_function(configuration=final_config)
    return res
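As a stdlib sketch of that "start from defaults, overwrite optimized values" pattern (plain dicts instead of ConfigSpace objects; default values copied from the rbv2_ranger space shown above): the caveat from the earlier discussion is that any default you do not overwrite, including conditional parameters such as num.random.splits, still reaches the surrogate, so defaults should only fill parameters you deliberately fix.

```python
# Stdlib sketch of the fill-from-defaults pattern, using plain dicts.
# Defaults mirror the rbv2_ranger optimization space printed earlier.
# Caveat: defaults for parameters you never optimize (e.g. the
# conditional num.random.splits) are still fed to the surrogate.
DEFAULTS = {
    "min.node.size": 50, "mtry.power": 0.0,
    "num.impute.selected.cpo": "impute.mean", "num.random.splits": 1,
    "num.trees": 1000, "repl": "10", "replace": "TRUE",
    "respect.unordered.factors": "ignore", "sample.fraction": 0.55,
    "splitrule": "gini", "task_id": "470", "trainsize": 0.525,
}

def complete(config: dict) -> dict:
    """Return DEFAULTS with the optimized values layered on top."""
    return {**DEFAULTS, **config}

final = complete({"num.trees": 1500, "splitrule": "extratrees"})
print(final["num.trees"], final["repl"])  # 1500 10
```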
