refactor: Pull out reusable code in CustomTrainingJob to use in other training jobs #49

ivanmkc · 2020-11-11T03:19:17Z

Summary

There is reusable code in CustomTrainingJob, specifically in the init and run methods. Other training jobs such as AutoMLTablesTrainingJob can use it as well.

The code mainly has to do with creating and running a training pipeline.
The leftover code in CustomTrainingJob will thus only be related to custom training and not generic pipeline code.

In preparation for: https://b.corp.google.com/issues/172282518

Remaining questions

1. In the TrainingJob.run method, training_fraction_split, validation_fraction_split, test_fraction_split: float are all only needed when the dataset parameter is also provided. I've grouped this together into DatasetWithSplits. I'd want to replace the CustomTrainingJob.run parameters with this as well. What do you all think?
~~2. Should we put TrainingJob into its own file and separate out CustomTrainingJob and AutoMLTablesTrainingJob into their own? If not, how should we organize things as to not have massive files?~~
~~3. Should I make TrainingJob an abstract base class to prevent initialization? I can also do this by overriding new:~~

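class TrainingJob(object):

    def __new__(cls, *args, **kwargs):
        # Block direct instantiation of the base class; subclasses pass through.
        if cls is TrainingJob:
            raise TypeError("base class may not be instantiated")
        # object.__new__ rejects extra arguments when __new__ is overridden,
        # so forward only cls; __init__ still receives *args and **kwargs.
        return super().__new__(cls)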

~~4. Docstrings: On which functions are they needed? I'll try to look up the Google guidelines.~~

TODO

~~- [ ] Write unit tests for the generic TrainingJob class.~~

Run linters

Testing

Passes all existing unit tests

ivanmkc · 2020-11-11T03:24:09Z

google/cloud/aiplatform/training_job_base.py

+        )
+        self._gca_resource = None
+
+    def _create_managed_model(self, model_display_name: str) -> gca_model.Model:


Encapsulated this logic into a function to emphasize the parameter dependencies to generate a managed model.

Hm, seems like this isn't used by AutoML. I guess it's specific to Custom Training then? Will move there then

I ended up having each subclass create their own Model and pass it in to the base class. This is due to the fact that each training job type has different requirements as defined in the yaml files.

ivanmkc · 2020-11-11T03:24:21Z

google/cloud/aiplatform/training_job_base.py

+
+        return managed_model
+
+    def _create_input_data_config(self,


Encapsulated this logic into a function to emphasize the parameter dependencies to generate an input data config.

ivanmkc · 2020-11-11T03:26:21Z

google/cloud/aiplatform/training_job_base.py

+            base_output_dir: Optional[str] = None):
+        """Runs the training job.
+        """
+        if self._has_run:


This is the only functionality that differs from the existing CustomTrainingJob implementation.

This validation was previously performed earlier, but I've moved it into the base class (i.e. it now occurs later than before) to hide it from the subclass implementations.
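
A minimal sketch of the guard described above, assuming the check raises once the job has a backing resource. Only `_has_run` and `_gca_resource` come from the diff; the class name `TrainingJobSketch`, the `_run_pipeline` hook, and the `RuntimeError` are hypothetical and used purely for illustration.

```python
class TrainingJobSketch:
    """Hypothetical stand-in for the base class; not the SDK implementation."""

    def __init__(self):
        # No backing resource exists until run() succeeds.
        self._gca_resource = None

    @property
    def _has_run(self) -> bool:
        """Helper property to check if this training job has been run."""
        return self._gca_resource is not None

    def run(self, **kwargs):
        # The base class owns the guard, so subclasses never repeat it.
        if self._has_run:
            # Assumed error type and message for this sketch.
            raise RuntimeError("Training job has already been run.")
        self._gca_resource = self._run_pipeline(**kwargs)
        return self._gca_resource

    def _run_pipeline(self, **kwargs):
        # Hypothetical hook; a real subclass would create the pipeline here.
        return {"state": "done", "args": kwargs}
```

With this shape, a second call to `run()` on the same object fails in one place regardless of which subclass is used.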

ivanmkc · 2020-11-11T03:26:48Z

google/cloud/aiplatform/training_job_base.py

+        return model
+
+    @property
+    def state(self) -> gca_pipeline_state.PipelineState:


Copied verbatim from CustomModelTrainingJob

ivanmkc · 2020-11-11T03:27:35Z

google/cloud/aiplatform/training_job_base.py

+        )
+
+    @property
+    def _model_upload_fail_string(self) -> str:


Copied verbatim from CustomModelTrainingJob

ivanmkc · 2020-11-11T03:27:39Z

google/cloud/aiplatform/training_job_base.py

+        )
+
+    @property
+    def _has_run(self) -> bool:


Copied verbatim from CustomModelTrainingJob

ivanmkc · 2020-11-11T03:27:44Z

google/cloud/aiplatform/training_job_base.py

+        """Helper property to check if this training job has been run."""
+        return self._gca_resource is not None
+
+    def _assert_has_run(self):


Copied verbatim from CustomModelTrainingJob

ivanmkc · 2020-11-11T03:28:37Z

google/cloud/aiplatform/training_job_base.py

+    ]
+)
+
+class DatasetWithSplits():


New class to encapsulate all the parameters that need to be passed together.

ivanmkc · 2020-11-11T03:28:59Z

google/cloud/aiplatform/training_job_base.py

+        self.test_fraction_split=test_fraction_split
+
+
+class TrainingJob(base.AiPlatformResourceNoun):


Can potentially make an abc to prevent instantiation. Thoughts?

sasha-gitg · 2020-11-11T15:07:54Z

Remaining questions

In the TrainingJob.run method, training_fraction_split, validation_fraction_split, test_fraction_split: float are all only needed when the dataset parameter is also provided. I've grouped this together into DatasetWithSplits. I'd want to replace the CustomTrainingJob.run parameters with this as well. What do you all think?

The preference from the design is to flatten coupled arguments as long as the objects they represent are not deeply nested. There is precedent in ML libraries for this pattern:

keras.Model.fit with validation_split, similar to the API surface discussed here.
sklearn.linear_model.LogisticRegression: particular parameter value combinations are invalid.

The aim for this SDK is to feel familiar to practitioners that use these libraries. We'll gate on feedback on whether we should change this.
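
To illustrate the flattened style, here is a rough sketch of a `run`-like function that keeps the split fractions as top-level arguments and checks the coupling at call time. The function name `run_sketch`, the default values, and the validation rule are assumptions for illustration, not the SDK's actual signature or behavior.

```python
from typing import Optional


def run_sketch(
    dataset: Optional[str] = None,
    training_fraction_split: float = 0.8,
    validation_fraction_split: float = 0.1,
    test_fraction_split: float = 0.1,
) -> Optional[dict]:
    """Hypothetical flattened signature; the splits only matter with a dataset."""
    if dataset is None:
        # Without a dataset there is nothing to split, so the fractions are ignored.
        return None

    total = training_fraction_split + validation_fraction_split + test_fraction_split
    # Assumed rule for this sketch: the fractions may not exceed 1 in total.
    if total > 1.0 + 1e-9:
        raise ValueError(f"Split fractions must not exceed 1.0 in total, got {total}.")

    return {
        "dataset": dataset,
        "training_fraction": training_fraction_split,
        "validation_fraction": validation_fraction_split,
        "test_fraction": test_fraction_split,
    }
```

The trade-off is that invalid combinations are caught at run time rather than by the type of a wrapper object such as DatasetWithSplits.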

Should we put TrainingJob into its own file and separate out CustomTrainingJob and AutoMLTablesTrainingJob into their own? If not, how should we organize things as to not have massive files?

We should gate this on whether the organization makes sense to group these classes together in the same module not necessarily that there is some threshold of number of lines we want to stay under. Since these all relate to training using the PipelineService I think we're safe to keep them in a single module.

Should I make TrainingJob an abstract base class to prevent initialization? I can also do this by overriding new:

Looking at the code, I'm not sure it would prevent initialization because it looks like the interface is completely defined.

We should use an ABC to define an interface that concrete classes should implement. If our parent class already fully implements that interface it should be usable. To me, based on the code, it looks like you may want to separate out some of the shared logic as private methods in the ABC and require the concrete classes implement some of the public interface. Either that or derive AutoMLTablesTrainingJob from CustomTrainingJob.
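
A small sketch of that split, assuming shared pipeline logic sits in a private helper and the public `run` is left abstract. The names `TrainingJobBase`, `CustomJobSketch`, and `_run_pipeline` are hypothetical and do not reflect the final SDK classes.

```python
import abc


class TrainingJobBase(abc.ABC):
    """Hypothetical base class: shared pipeline logic lives in private helpers."""

    def __init__(self, display_name: str):
        self._display_name = display_name

    def _run_pipeline(self, task_inputs: dict) -> dict:
        # Shared, private logic that every concrete job reuses.
        return {"display_name": self._display_name, "inputs": task_inputs}

    @abc.abstractmethod
    def run(self, **kwargs) -> dict:
        """Concrete classes must implement the public entry point."""


class CustomJobSketch(TrainingJobBase):
    def run(self, **kwargs) -> dict:
        # Job-specific arguments are assembled here (assumed shape).
        return self._run_pipeline({"kind": "custom", **kwargs})
```

Because `run` is abstract, `TrainingJobBase("x")` raises `TypeError` without needing a custom `__new__`, while `CustomJobSketch("x").run()` works as usual.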

Docstrings: On which functions are they needed? I'll try to look up the Google guidelines.

All methods except for unit tests.

ivanmkc · 2020-11-11T18:20:07Z

Thanks for the clarifications @sasha-gitg, they all make sense to me. I will make the changes.

ivanmkc · 2020-11-12T01:36:34Z

We should gate this on whether the organization makes sense to group these classes together in the same module not necessarily that there is some threshold of number of lines we want to stay under. Since these all relate to training using the PipelineService I think we're safe to keep them in a single module.

Sure thing. One consideration for splitting off the base class into its own module is whether other teams will be building off the base class or whether it will just be us. Since it looks like it's just us, a relatively small group of developers, keeping it in the same module sgtm.

However, putting everything in one module has the issue that we're combining all the import statements.

Examining the import dependencies is typically a great way to check the coupling of your class. For example, if AutoMLTablesTrainingJob was in its own module, then I can quickly see that it only depends on Tables related classes and not CustomTraining ones. Putting everything into one module makes that hard to see.

However, we also want to be able to import training jobs in one line:

from training_jobs import CustomTrainingJob, AutoMLTablesTrainingJob

It seems like we can keep everything in one TrainingJob module but still have multiple files:

https://stackoverflow.com/questions/24100558/how-can-i-split-a-module-into-multiple-files-without-breaking-a-backwards-compa/24100645

Would you be open to this approach or do you still prefer all training_job's in one file?

ivanmkc · 2020-11-12T07:15:59Z

@sasha-gitg What do you want me to do with the tests? Your CustomTraining tests already ensure coverage of the super class. Do you want me to refactor the tests in some way as well?

sasha-gitg · 2020-11-12T15:25:42Z

@sasha-gitg What do you want me to do with the tests? Your CustomTraining tests already ensure coverage of the super class. Do you want me to refactor the tests in some way as well?

If tests are passing with full coverage after refactoring then we can move forward with the tests we have.

sasha-gitg · 2020-11-12T16:04:59Z

However, putting everything in one module has the issue that we're combining all the import statements.

Examining the import dependencies is typically a great way to check the coupling of your class. For example, if AutoMLTablesTrainingJob was in its own module, then I can quickly see that it only depends on Tables related classes and not CustomTraining ones. Putting everything into one module makes that hard to see.

This is a more compelling argument to split the files but I would move this decision to the PR that implements AutoMLTablesTrainingJob so we can concretely see the coupling we are trying to avoid.

However, we also want to be able to import training jobs in one line:

from training_jobs import CustomTrainingJob, AutoMLTablesTrainingJob

I don't see value in this considering we expose all classes on the SDK's surface and the expectation is that our internal code imports implementations directly. Our style guide also enforces module level imports.

It seems like we can keep everything in one TrainingJob module but still have multiple files:

https://stackoverflow.com/questions/24100558/how-can-i-split-a-module-into-multiple-files-without-breaking-a-backwards-compa/24100645

Would you be open to this approach or do you still prefer all training_job's in one file?

Yes, this is the approach we would take to namespace all these classes to the same module when splitting them into different files. But if we're going to add a level of indirection we should justify that tradeoff with concrete benefits.

As the SDK grows these benefits will become more apparent but I would prefer we avoid adding premature complexity. I do think that implementing AutoMLTablesTrainingJob will surface those concrete benefits.
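
For reference, the re-export approach discussed above could look roughly like the sketch below. It builds a throwaway package on disk so the example runs on its own; the package name `training_jobs_sketch` and the file names `_custom.py` and `_automl.py` are hypothetical, and the classes are empty placeholders.

```python
import importlib
import pathlib
import sys
import tempfile

# Build a throwaway package: one file per class, re-exported from __init__.py.
root = pathlib.Path(tempfile.mkdtemp())
pkg = root / "training_jobs_sketch"  # hypothetical package name
pkg.mkdir()
(pkg / "_custom.py").write_text("class CustomTrainingJob:\n    pass\n")
(pkg / "_automl.py").write_text("class AutoMLTablesTrainingJob:\n    pass\n")
(pkg / "__init__.py").write_text(
    "from ._custom import CustomTrainingJob\n"
    "from ._automl import AutoMLTablesTrainingJob\n"
)

# Make the temporary directory importable and load the package.
sys.path.insert(0, str(root))
importlib.invalidate_caches()
training_jobs_sketch = importlib.import_module("training_jobs_sketch")

# Callers see a single namespace even though the classes live in separate files.
custom_cls = training_jobs_sketch.CustomTrainingJob
automl_cls = training_jobs_sketch.AutoMLTablesTrainingJob
```

The cost is the extra level of indirection in `__init__.py`, which is the trade-off that would need to be justified.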

ivanmkc · 2020-11-12T22:57:04Z

Sounds good, thanks for the review! I'll move everything into the training_jobs.py file then.

ivanmkc · 2020-11-12T23:19:03Z

However, we also want to be able to import training jobs in one line:
from training_jobs import CustomTrainingJob, AutoMLTablesTrainingJob

I don't see value in this considering we expose all classes on the SDK's surface and the expectation is that our internal code imports implementations directly. Our style guide also enforces module level imports.

Sorry, I meant to say that we want to import CustomTrainingJob and AutoMLTablesTrainingJob as part of the same module. Not necessarily a class-level import.

ivanmkc · 2020-11-12T23:29:17Z

google/cloud/aiplatform/training_jobs.py

+                For tabular Datasets, all their data is exported to
+                training, to pick and choose from.
+            training_fraction_split (float):
+                The fraction of the input data that is to be


@sasha-gitg sometimes I see Required, sometimes Optional, and sometimes nothing is written here. Any guidance on why nothing is written for this parameter? (I copied it from the CustomTrainingJob class.)

These are generally copied from the protos if they represent the same field: https://github.com/googleapis/python-aiplatform/blob/dev/google/cloud/aiplatform_v1beta1/types/training_pipeline.py#L260

So we inherit the arg commenting from the service.

For comments we add, it's only necessary we mark them as Required if they are indeed so.

ivanmkc · 2020-11-13T05:32:26Z

google/cloud/aiplatform/training_jobs.py

+        Args:
+            display_name (str):
+                Required. The user-defined name of this TrainingPipeline.
+            container_uri (str):


Is this only relevant to Custom Training?

@sasha-gitg

Ok removed from base class

Please remove the docstring.

sasha-gitg

The tests are broken in the build. Those should pass to ensure this refactor did not break the current class.

Nit: The PR title should technically be called "refactor:..." as we shouldn't be altering behavior with this change.

sasha-gitg · 2020-11-13T13:52:32Z

google/cloud/aiplatform/training_jobs.py

-            "model_serving_container_image_uri and model_display_name passed in. "
-            "Ensure that your training script saves to model to "
-            "os.environ['AIP_MODEL_DIR']."
+        return super().run_job(


There's no need to call super here since this method is not overridden.
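
A tiny illustration of the point, using hypothetical `Base` and `Child` classes: when a method is not overridden, normal attribute lookup on `self` already resolves to the parent's implementation, so `super()` adds nothing.

```python
class Base:
    def run_job(self) -> str:
        return "base run_job"


class Child(Base):
    def run(self) -> str:
        # run_job is not overridden in Child, so self.run_job() already
        # resolves to Base.run_job through the method resolution order.
        return self.run_job()

    def run_with_super(self) -> str:
        # Equivalent here, but it hides any override added further down the hierarchy.
        return super().run_job()
```

Both `Child().run()` and `Child().run_with_super()` return the same value; the plain `self` call is the more conventional spelling.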

sasha-gitg · 2020-11-13T13:56:11Z

google/cloud/aiplatform/training_jobs.py

+    def _model_upload_fail_string(self) -> str:
+        """Helper property for model upload failure."""
+        return (
+            f"Training Pipeline {self.resource_name} is not configured to upload a "


This message seems to only apply to custom training jobs.

ivanmkc · 2020-11-16T16:22:09Z

sgtm @sasha-gitg! will make the changes today

ivanmkc · 2020-11-17T07:06:35Z

@sasha-gitg put in the fixes!

sasha-gitg

Looks good! Requested a few minor changes. Also need to resolve merge conflicts with distributed training PR (those changes should only affect the CustomTrainingJob class).

sasha-gitg · 2020-11-17T13:33:40Z

google/cloud/aiplatform/training_jobs.py

@@ -26,6 +26,7 @@
 import time
 from typing import Callable, List, Optional, Sequence, Union

+from abc import ABC, abstractmethod


Module level import here.

I was wondering about this. So according to the style guide I should use import abc, and then refer to abc.abstractmethod every time in the code?

What about the line above it?
from typing import Callable, List, Optional, Sequence, Union

Yes on abc.abstractmethod.

typing is an exception to the rule; see typing-imports.

sasha-gitg · 2020-11-17T13:38:21Z

google/cloud/aiplatform/training_jobs.py

+        Args:
+            display_name (str):
+                Required. The user-defined name of this TrainingPipeline.
+            container_uri (str):


Please remove the docstring.

ivanmkc · 2020-11-17T14:01:45Z

google/cloud/aiplatform/training_jobs.py

+            input_data_config = gca_training_pipeline.InputDataConfig(
+                fraction_split=fraction_split,
+                dataset_id=dataset.name,
+                gcs_destination=gca_io.GcsDestination(


Is the field gcs_destination only used for Custom Training? The site https://cloud.google.com/ai-platform-unified/docs/reference/rest/v1beta1/projects.locations.trainingPipelines#TrainingPipeline.FIELDS.training_task_definition only says:

object (GcsDestination) The Cloud Storage location where the training data is to be written to. In the given directory a new directory is created with name: dataset-<dataset-id>-<annotation-type>-<timestamp-of-training-call> where timestamp is in YYYY-MM-DDThh:mm:ss.sssZ ISO-8601 format. All training input data is written into that directory. The AI Platform environment variables representing Cloud Storage data URIs are represented in the Cloud Storage wildcard format to support sharded data. e.g.: "gs://.../training-*.jsonl"

AIP_DATA_FORMAT = "jsonl" for non-tabular data, "csv" for tabular data
AIP_TRAINING_DATA_URI = "gcsDestination/dataset---/training-*.${AIP_DATA_FORMAT}"
AIP_VALIDATION_DATA_URI = "gcsDestination/dataset---/validation-*.${AIP_DATA_FORMAT}"
AIP_TEST_DATA_URI = "gcsDestination/dataset---/test-*.${AIP_DATA_FORMAT}"

What training data is being written? Not too familiar with AutoML yet and it isn't obvious to me why this is needed.

@sasha-gitg

Good point. Yes, I think this is custom training specific. Though, it would be worth confirming by using the API without passing the field.

… training jobs (googleapis#49)

* Extracted reusable CustomTrainingJob code into TrainingJob base class
  Moved around some functions
  Completed refactor to use base class
  bug: remove requirement for import_schema_uri when passing in gcs_source (googleapis#46)
  Ran linters
  Removed model from TrainingJob and moved to CustomTrainingJob
  Removed DatasetWithSplits
  Added doc strings and simplified training_job_base code
  Moved TrainingJob class into training_jobs.py
  Removed container_uri from base TrainingJob class
  Addressed comments
  Fixed managed model
  Ran linter
  Fixed issues with abc, doc string and super call
  Refactored to create input data config separately
* Ran linter

Co-authored-by: Ivan Cheung

google-cla bot added the cla: yes label Nov 11, 2020

ivanmkc requested review from vinnysenthil, sasha-gitg, dizcology and sirtorry November 11, 2020 03:21

ivanmkc force-pushed the imkc--trainingjob-base-class-refactor branch from 53a91f9 to 79c1a16 Compare November 11, 2020 03:23

ivanmkc force-pushed the imkc--trainingjob-base-class-refactor branch from 8548eb2 to 8a2b53d Compare November 12, 2020 23:16

ivanmkc mentioned this pull request Nov 13, 2020

feat: Added AutoMLTablesTrainingJob and tests #62

Merged

sasha-gitg requested changes Nov 13, 2020

ivanmkc changed the title ~~fix: Pull out reusable code in CustomTrainingJob to use in other training jobs~~ refactor: Pull out reusable code in CustomTrainingJob to use in other training jobs Nov 17, 2020

sasha-gitg approved these changes Nov 17, 2020

ivanmkc force-pushed the imkc--trainingjob-base-class-refactor branch 2 times, most recently from 3e525b7 to e019332 Compare November 18, 2020 09:47

ivanmkc force-pushed the imkc--trainingjob-base-class-refactor branch from e019332 to b9f9c8f Compare November 18, 2020 10:25

Ran linter

8db3cc1

ivanmkc force-pushed the imkc--trainingjob-base-class-refactor branch from 72a78d0 to 8db3cc1 Compare November 18, 2020 10:55

ivanmkc merged commit 52de070 into googleapis:dev Nov 18, 2020
