Multi-objective ensemble API #1485

Merged: 5 commits merged into development on May 30, 2022
Conversation

@mfeurer (Contributor) commented May 24, 2022:

On a high level, this PR adds:

  • an API for passing custom ensemble classes to Auto-sklearn,
  • API updates for using multi-objective ensembles, and
  • the possibility to use a metric that requires access to X (see the sketch below).
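For orientation, here is a minimal sketch of how this estimator-side API could be used. It is an illustration under assumptions, not the verbatim diff: the ensemble_class/ensemble_kwargs argument names and passing a list of metrics for multi-objective optimization are inferred from the description above.

# Hedged sketch of the estimator-side API described above; the argument names
# (ensemble_class, ensemble_kwargs, a sequence of metrics) are assumptions
# based on this PR's description, not the verbatim diff.
import autosklearn.metrics
from autosklearn.classification import AutoSklearnClassifier
from autosklearn.ensembles.ensemble_selection import EnsembleSelection

automl = AutoSklearnClassifier(
    time_left_for_this_task=120,
    # custom ensemble class plus its kwargs, replacing the deprecated
    # ensemble_size argument
    ensemble_class=EnsembleSelection,
    ensemble_kwargs={"ensemble_size": 50},
    # two objectives -> multi-objective ensemble building
    metric=[autosklearn.metrics.precision, autosklearn.metrics.recall],
)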

On a lower level, this PR also adds:

  • Updates to the ensemble-building module:
    • New functionality to retrieve the identifiers and weights of an ensemble (see the sketch after this list).
    • Because of that, I was able to improve the type definitions for the ensemble-building submodule. More concretely, no ensemble files are exempt from type checking any more.
    • Pass a different seed to the ensemble builder every time it is called.
    • Make candidate selection (n_best) aware of multiple objectives.
    • Pass information about all available runs to the ensemble class.
    • Available ensemble classes are now shown in the docs.
    • Improved docstring for ensemble selection.
  • Estimators API:
    • Deprecate the ensemble_size argument.
  • Examples:
    • Add a Pareto-front plot to the multi-objective Auto-sklearn example.
  • Tests:
    • Add a new case to the AutoML tests that checks multi-objective optimization.
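As a usage note for the identifier/weight retrieval mentioned in the first bullet, a sketch continuing the example above. get_models_with_weights() is the existing estimator-level helper; the new ensemble-level accessor is only described in prose here, and its exact method name is not visible in this excerpt.

# Hedged sketch: inspect the composition of the fitted ensemble.
# X_train / y_train are assumed to be a prepared training split.
automl.fit(X_train, y_train)

# Existing estimator helper yielding (weight, fitted pipeline) pairs; the new
# ensemble-level accessor for (identifier, weight) pairs is analogous.
for weight, model in automl.get_models_with_weights():
    print(f"weight={weight:.3f}", model)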

@codecov (bot) commented May 24, 2022:

Codecov Report

Merging #1485 (d8a863c) into development (4b21134) will decrease coverage by 0.41%.
The diff coverage is 75.08%.

❗ Current head d8a863c differs from pull request most recent head 0ba05e9. Consider uploading reports for the commit 0ba05e9 to get more accurate results

@@               Coverage Diff               @@
##           development    #1485      +/-   ##
===============================================
- Coverage        84.22%   83.81%   -0.42%     
===============================================
  Files              151      152       +1     
  Lines            11488    11662     +174     
  Branches          1994     2037      +43     
===============================================
+ Hits              9676     9774      +98     
- Misses            1279     1339      +60     
- Partials           533      549      +16     

Impacted file tree graph

@eddiebergman (Contributor) left a comment:

I assume this is mostly the same as the other code I saw; just a brief question on RandomState.

@mfeurer mfeurer requested a review from eddiebergman May 24, 2022 17:07
Comment on lines 82 to 100
Expects
-------
* Auto-sklearn can predict and has a model
* _load_pareto_front returns one scikit-learn ensemble
"""
# Check that the predict function works
X = np.array([[1.0, 1.0, 1.0, 1.0]])
print(automl.predict(X))
assert automl.predict_proba(X).shape == (1, 3)
assert automl.predict(X).shape == (1,)

pareto_front = automl._load_pareto_front()
assert len(pareto_front) == 1
for ensemble in pareto_front:
    assert isinstance(ensemble, (VotingClassifier, VotingRegressor))
    y_pred = ensemble.predict_proba(X)
    assert y_pred.shape == (1, 3)
    y_pred = ensemble.predict(X)
    assert y_pred in ["setosa", "versicolor", "virginica"]
@eddiebergman (Contributor) commented May 24, 2022:

It's not very clear why only one scikit-learn ensemble should be expected here, but I assume it's because of the default parameter for ensemble selection.

It also seems this test is very specific to this single case (a fitted multi-objective iris classifier).

I had the same problem when considering cases, and my solution was just to have general tests. We can push this through for now, knowing it will break if we add any other cases with the "multiobjective" tag.

For the longer-term solution, I have a few ideas:

  • We just use make_automl and make_dataset and construct the specific automl instance inside this test, so that the specifics being tested are directly evident. This is the same as the old way of doing things and gives up caching, but at least all relevant setup assumptions are stated clearly in the test.
  • We encode these extra specifics somehow:
    • The case just returns extra info:
    def case_classifier_fitted_holdout_multiobjective(...):
        ...
        return (model, extra_info)
    • The extra specifics are saved directly on the model object and accessed there. This does add a lot more introspection capabilities to the model, which may be helpful for future additions.

Happy to hear any other ideas on this, though. I admit the caching solution, as it is, isn't perfect for this reason, but it does allow the tests to be a lot more modular.

@mfeurer (Contributor, Author) replied:

> It's not very clear why only one scikit-learn ensemble should be expected here, but I assume it's because of the default parameter for ensemble selection.

Correct.

> It also seems this test is very specific to this single case (a fitted multi-objective iris classifier).

Correct as well.

> I had the same problem when considering cases, and my solution was just to have general tests. We can push this through for now, knowing it will break if we add any other cases with the "multiobjective" tag.

Very glad you see it this way.

> Happy to hear any other ideas on this, though.

For the 2nd idea, would we check whether the AutoML system was built on iris and then use it? Besides that, could we maybe add a filter on which dataset(s) were used to build the AutoML system?

@eddiebergman (Contributor) replied:

> For the 2nd idea, would we check whether the AutoML system was built on iris and then use it? Besides that, could we maybe add a filter on which dataset(s) were used to build the AutoML system?

Yup, it's definitely possible. The easiest way is to just do it in the test itself, i.e. if extra_info["dataset"] != "iris": pass, but I'm not the biggest fan of that solution.

The overarching problem is that you can't use @parametrize and @tags together, i.e. you can't associate a parameter with a tag.

I guess my preferred solution is to include more general things in the extra_info, or encode them on the model, meaning the tests don't have to do any filtering:

extra_info = {
    "X_shape": X.shape,
    "y_shape": y.shape,
    "labels": ...
}
return (automl, extra_info)

It's not the cleanest, but at least it means this test could theoretically work for any other "multiobjective"-tagged case, as long as it provides the necessary extra_info.
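To make that concrete, a hypothetical rewrite of the reviewed test against such an extra_info dict could look like this; every name below is illustrative rather than taken from the diff:

import numpy as np

def test_pareto_front_predicts(case):
    # `case` would yield (automl, extra_info) from any "multiobjective"-tagged
    # case; nothing below is hard-coded to iris.
    automl, extra_info = case
    n_features = extra_info["X_shape"][1]
    X = np.ones((1, n_features))

    for ensemble in automl._load_pareto_front():
        y_pred = ensemble.predict(X)
        assert y_pred.shape == (1,)
        assert y_pred[0] in extra_info["labels"]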

@mfeurer mfeurer merged commit 25f0be6 into development May 30, 2022
@mfeurer mfeurer deleted the moo_api branch May 30, 2022 13:39
@eddiebergman eddiebergman linked an issue Jun 10, 2022 that may be closed by this pull request
eddiebergman added a commit that referenced this pull request Aug 18, 2022
* Multi-objective ensemble API

Co-authored-by: eddiebergman <[email protected]>

* update for rebase, add loading of X_data in ensemble builder

* Add unit tests

* Fix unittest?, increase coverage (hopefully)

* Rename methods to be Pareto set methods

Co-authored-by: eddiebergman <[email protected]>
Successfully merging this pull request may close these issues.

[Question] Multi-objective auto-sklearn?