feat: Added AutoMLTablesTrainingJob and tests #62

Merged

Contributor

@ivanmkc ivanmkc commented Nov 13, 2020

Added support for AutoMLTablesTraining.

Fixes https://b.corp.google.com/issues/172282518
Depends on #49

More client-side validation could be added, but for now validation is deferred to the backend services pending further discussion.

@google-cla google-cla bot added the "cla: yes" label Nov 13, 2020
@ivanmkc ivanmkc force-pushed the imkc--trainingjob-automl-tables-training-job branch 2 times, most recently from 338063f to 2802dc8 Compare November 17, 2020 07:30
"transformations": self._column_transformations,
"trainBudgetMilliNodeHours": budget_milli_node_hours,
# optional inputs
"weightColumnName": weight_column,
Contributor Author

@ivanmkc ivanmkc Nov 17, 2020

In google3/google/cloud/aiplatform/publicfiles/trainingjob/definition/automl_tables.proto, this column is referred to as "weight_column_name".

However, gs://google-cloud-aiplatform/schema/trainingjob/definition/automl_tabular_1.0.0.yaml names it "weightColumn". That name is incorrect and results in an error at training time.

@sasha-gitg why is there a discrepancy here?

I imagine the protos are the source of truth and the yaml is just out of date? If so, what's our plan to keep users from running into this, since they don't have access to the protos AFAIK?

Member

Following up on this.

Member

Synced with the team. The yaml is incorrect and will be updated. The field should be weightColumnName. If users are using our Model Builder SDK then hopefully we have caught these issues early during our development. This should become less of an issue for the service as we move forward because it will support all previously versioned yaml schemas.
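
For illustration, a minimal sketch of how the optional field could be mapped so that the corrected key is always used. The helper name `build_training_task_inputs` and the choice to drop unset optional fields are assumptions, not taken from this PR; only the three keys shown in the diff above are used.

```python
from typing import Any, Dict, List, Optional


def build_training_task_inputs(
    column_transformations: List[Dict[str, Any]],
    budget_milli_node_hours: int,
    weight_column: Optional[str] = None,
) -> Dict[str, Any]:
    """Builds the camelCase inputs dict (hypothetical helper)."""
    inputs: Dict[str, Any] = {
        "transformations": column_transformations,
        "trainBudgetMilliNodeHours": budget_milli_node_hours,
        # Per this thread the key is "weightColumnName", not "weightColumn".
        "weightColumnName": weight_column,
    }
    # Assumption: optional inputs that were not supplied are left out entirely.
    return {key: value for key, value in inputs.items() if value is not None}
```

Keeping the SDK argument name (`weight_column`) separate from the wire name means a schema correction like this one only touches the mapping, not the public signature.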

@ivanmkc ivanmkc force-pushed the imkc--trainingjob-automl-tables-training-job branch from 2759662 to 36d6b7c Compare November 17, 2020 10:38
def __init__(
self,
display_name: str,
optimization_objective: str,
Contributor Author

@ivanmkc ivanmkc Nov 17, 2020

According to the protos and yaml, the optimization_objective is optional since there is a default if it's not supplied.

Also, the design doc doesn't have the optimization_prediction_type parameter either. Sure, we can support this by inferring that info based on the optimization_objective, but is that what you intended?

My preferred approach is to create an OptimizationObjective abstract class with one subclass per optimization_objective type. That would encapsulate the optimization_objective, optimization_prediction_type, optimization_objective_recall_value and optimization_objective_precision_value parameters in a single object.

However, from our other convos it seems like you might prefer just passing in a string for optimization_objective and having the other parameters be optional.

Let me know what you prefer regarding the type for optimization_objective and whether or not to include an optimization_prediction_type parameter.
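
To make the proposal concrete, here is a rough sketch of what such a class hierarchy might look like. Everything below is hypothetical: the class names, the `as_parameters` method, and the objective and prediction-type strings are placeholders for illustration and are not confirmed by the protos or the yaml discussed in this thread.

```python
import abc
from typing import Any, Dict


class OptimizationObjective(abc.ABC):
    """Bundles an objective with its prediction type and extra values."""

    prediction_type: str
    name: str

    @abc.abstractmethod
    def extra_parameters(self) -> Dict[str, float]:
        """Returns objective-specific values such as a recall target."""

    def as_parameters(self) -> Dict[str, Any]:
        params: Dict[str, Any] = {
            "optimization_prediction_type": self.prediction_type,
            "optimization_objective": self.name,
        }
        params.update(self.extra_parameters())
        return params


class MinimizeRmse(OptimizationObjective):
    prediction_type = "regression"  # assumed value
    name = "minimize-rmse"  # assumed value

    def extra_parameters(self) -> Dict[str, float]:
        return {}


class MaximizePrecisionAtRecall(OptimizationObjective):
    prediction_type = "classification"  # assumed value
    name = "maximize-precision-at-recall"  # assumed value

    def __init__(self, recall_value: float) -> None:
        if not 0.0 <= recall_value <= 1.0:
            raise ValueError("recall_value must be between 0 and 1")
        self.recall_value = recall_value

    def extra_parameters(self) -> Dict[str, float]:
        return {"optimization_objective_recall_value": self.recall_value}
```

With this shape, an invalid combination (for example a recall value on a regression objective) cannot be expressed at all, at the cost of users having to learn one class per objective.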

Member

With respect to optimization_prediction_type, we received feedback to include it as the top-level argument because it requires less knowledge overhead than optimization_objective. Additionally, we should include all the values flattened in the API surface and validate or ignore arguments based on valid combinations. There's precedent for this pattern:

https://github.com/scikit-learn/scikit-learn/blob/0fb307bf3/sklearn/linear_model/_logistic.py#L1011

So the current state of the input arguments LGTM. Just elevate optimization_prediction_type over optimization_objective and make optimization_objective optional, and add comments to explain the defaults that apply for each optimization_prediction_type.
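
As a sketch of the flattened pattern described above, the function below takes every value as a plain keyword argument and rejects combinations that do not belong together. The function name, the `_REQUIRED_VALUE` table, and the objective strings are assumptions made for illustration; the actual valid combinations are defined by the service, not by this thread.

```python
from typing import Any, Dict, Optional

# Hypothetical table: objective -> the one extra argument it requires.
_REQUIRED_VALUE = {
    "maximize-precision-at-recall": "optimization_objective_recall_value",
    "maximize-recall-at-precision": "optimization_objective_precision_value",
}


def resolve_objective_arguments(
    optimization_prediction_type: str,
    optimization_objective: Optional[str] = None,
    optimization_objective_recall_value: Optional[float] = None,
    optimization_objective_precision_value: Optional[float] = None,
) -> Dict[str, Any]:
    """Validates the flattened arguments and returns only the ones that are set."""
    values = {
        "optimization_objective_recall_value": optimization_objective_recall_value,
        "optimization_objective_precision_value": optimization_objective_precision_value,
    }
    # None when the objective is unset or needs no extra value.
    required = _REQUIRED_VALUE.get(optimization_objective)
    for name, value in values.items():
        if name == required and value is None:
            raise ValueError(
                f"{name} is required when optimization_objective is "
                f"{optimization_objective!r}"
            )
        if name != required and value is not None:
            raise ValueError(
                f"{name} is not valid when optimization_objective is "
                f"{optimization_objective!r}"
            )
    resolved = {
        "optimization_prediction_type": optimization_prediction_type,
        # Left unset so the backend default for the prediction type applies.
        "optimization_objective": optimization_objective,
        **values,
    }
    return {key: value for key, value in resolved.items() if value is not None}
```

This sketch raises on invalid combinations; ignoring them with a warning, as the comment above also allows, would be an equally reasonable choice.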

Contributor Author

Thanks for the context, sgtm!

@ivanmkc ivanmkc force-pushed the imkc--trainingjob-automl-tables-training-job branch 2 times, most recently from 5831100 to c175714 Compare November 18, 2020 12:29
@@ -31,4 +34,11 @@
"""
init = initializer.global_config.init

__all__ = ("gapic", "CustomTrainingJob", "Model", "Dataset", "Endpoint")
__all__ = (
Contributor Author

The linter did this

Contributor

looks good.

@sirtorry sirtorry self-requested a review November 18, 2020 21:30
@ivanmkc ivanmkc force-pushed the imkc--trainingjob-automl-tables-training-job branch 2 times, most recently from c12d2b9 to 7858dfd Compare November 19, 2020 18:53
Member

@sasha-gitg sasha-gitg left a comment

LGTM! Minor comments.

validation_fraction_split: float,
test_fraction_split: float,
weight_column: Optional[str] = None,
budget_milli_node_hours: int = 1000,
Contributor Author

I'm guessing that setting this to less than 1000 milli node hours will cause the backend to use 1000 instead, and that the maximum value is probably handled the same way.

I have no idea if this is true though and this seems opaque to users.

Ideally, the backend should respond with a warning if the parameters are out of bounds.
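
Until the backend does that, one option would be a client-side warning along these lines. This is only a sketch: the function name is hypothetical, and the bounds are placeholder assumptions (the lower bound mirrors the default of 1000 in the signature above, the upper bound is a guess), since this thread does not confirm what the service enforces.

```python
import warnings

# Placeholder bounds for illustration only; not confirmed by the backend.
ASSUMED_MIN_MILLI_NODE_HOURS = 1000
ASSUMED_MAX_MILLI_NODE_HOURS = 72000


def check_budget(
    budget_milli_node_hours: int,
    minimum: int = ASSUMED_MIN_MILLI_NODE_HOURS,
    maximum: int = ASSUMED_MAX_MILLI_NODE_HOURS,
) -> int:
    """Warns if the budget is outside the assumed range; never modifies it."""
    if not minimum <= budget_milli_node_hours <= maximum:
        warnings.warn(
            f"budget_milli_node_hours={budget_milli_node_hours} is outside "
            f"the assumed range [{minimum}, {maximum}]; the backend may "
            "silently substitute a different value.",
            UserWarning,
            stacklevel=2,
        )
    return budget_milli_node_hours
```

The value is passed through unchanged so the backend stays the single place where the budget is actually decided.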

@ivanmkc ivanmkc force-pushed the imkc--trainingjob-automl-tables-training-job branch 3 times, most recently from 2e4dc7b to caf7dc4 Compare November 24, 2020 13:52
Added the training job subclass for tables
@ivanmkc ivanmkc force-pushed the imkc--trainingjob-automl-tables-training-job branch from caf7dc4 to 072850f Compare November 24, 2020 14:07
@ivanmkc ivanmkc merged commit aa5e15d into googleapis:dev Nov 24, 2020
dizcology pushed a commit to dizcology/python-aiplatform that referenced this pull request Nov 30, 2020
Added the training job subclass for tables

Co-authored-by: Ivan Cheung <[email protected]>
dizcology pushed a commit to dizcology/python-aiplatform that referenced this pull request Dec 22, 2020
Added the training job subclass for tables

Co-authored-by: Ivan Cheung <[email protected]>
Labels
cla: yes This human has signed the Contributor License Agreement.

3 participants