
Simplify champion-challenger approach #7

Merged: 55 commits, May 9, 2023
Changes from 44 commits (55 commits total)
74ea424
Remove TFDV and add model monitoring for XGBoost
felix-datatonic Apr 26, 2023
cf248f9
Add custom batch prediction component
felix-datatonic Apr 26, 2023
be4171f
Add custom wait GCP component
felix-datatonic Apr 27, 2023
6d318bc
Persist training on GCS in model folder
felix-datatonic Apr 27, 2023
01199dc
Add model monitoring for tensorflow
felix-datatonic Apr 28, 2023
c6398ba
Only sync assets if folder exists
felix-datatonic Apr 28, 2023
fd40562
Only sync assets if folder exists
felix-datatonic Apr 28, 2023
96c5122
Update XGBoost prediction pipeline to match new component inputs
felix-datatonic Apr 28, 2023
0cebdc5
Update docstrings
felix-datatonic Apr 28, 2023
9eff36b
Update and remove outdated docs
felix-datatonic Apr 28, 2023
115e35e
Fix unit tests
felix-datatonic Apr 28, 2023
fcace39
Remove unused helper components
felix-datatonic Apr 30, 2023
e63d644
Remove unused helper components
felix-datatonic Apr 30, 2023
262b587
Update E2E tests
felix-datatonic Apr 30, 2023
ed9c19c
Remove unused helper components
felix-datatonic Apr 30, 2023
f41c871
Resolve minor issue
felix-datatonic May 2, 2023
fff51e0
Remove unused container
felix-datatonic May 2, 2023
dbcceb9
Remove unused container
felix-datatonic May 2, 2023
7d0ee5a
Update load_dataset_to_bq
felix-datatonic May 2, 2023
b403e2f
Update pipelines pip dependencies
felix-datatonic May 2, 2023
b8ab093
Merge model_batch_predict and wait_gcp_resources
felix-datatonic May 2, 2023
817043c
Restore fail_on_model_not_found in lookup_model
felix-datatonic May 2, 2023
550d8e4
Update training pipelines with new lookup_model outputs
felix-datatonic May 2, 2023
67f9539
Create and fix unit tests
felix-datatonic May 2, 2023
4696770
Restore asset folders
felix-datatonic May 2, 2023
785eb45
Add missing dep in aiplatform components
felix-datatonic May 2, 2023
5e82559
Update e2e tests
felix-datatonic May 2, 2023
7468755
Minor fixes
felix-datatonic May 2, 2023
da47c1b
Minor fixes
felix-datatonic May 2, 2023
de789eb
Reduce assertions in e2e tests
felix-datatonic May 2, 2023
66dd5e2
Add working train pipelines with new champion-challenger approach
felix-datatonic May 1, 2023
41f9a72
Update xgboost training pipeline
felix-datatonic May 1, 2023
fdada28
Minor fixes in XGBoost training
felix-datatonic May 2, 2023
10e40af
Update tensorflow training
felix-datatonic May 3, 2023
36ce2e3
Remove unused tensorflow components
felix-datatonic May 3, 2023
b0ff1ef
Add tensorflow training script
felix-datatonic May 3, 2023
6d9b3df
Add prediction pipelines
felix-datatonic May 3, 2023
60a204d
Remove debug pipelines
felix-datatonic May 3, 2023
3f2178e
Update docs
felix-datatonic May 3, 2023
d5e55d6
Add unit test for update_best_model component
felix-datatonic May 3, 2023
5a96065
Update model names in pipelines
felix-datatonic May 3, 2023
7799263
Fix test_dataset_uri condition in training pipelines
felix-datatonic May 3, 2023
f705b89
Fix test_dataset_uri condition in training pipelines
felix-datatonic May 3, 2023
408c073
Address PR review comments
felix-datatonic May 3, 2023
52ae013
Remove hard-coded staging bucket
felix-datatonic May 3, 2023
cb965c9
Fix assets destination in e2e cloudbuild trigger
felix-datatonic May 4, 2023
7fde5d2
Fix assets destination in e2e cloudbuild trigger
felix-datatonic May 4, 2023
c981e2e
Use lookup model in training pipelines
felix-datatonic May 4, 2023
4ee3bcb
Update display names of pipeline steps
felix-datatonic May 4, 2023
df7c9aa
Fix model batch predict
felix-datatonic May 4, 2023
01581ac
Change tensorflow prediction pipeline from JSONL to BQ inputs/outputs
felix-datatonic May 5, 2023
1cd5efc
Enable caching when exporting to GCS
felix-datatonic May 5, 2023
f556484
Update tensorflow training pipeline
felix-datatonic May 5, 2023
efd8b9f
Address minor comments and remove TODOs
felix-datatonic May 5, 2023
2486246
Fix tensorflow training pipeline
felix-datatonic May 5, 2023
8 changes: 2 additions & 6 deletions CONTRIBUTING.md
@@ -116,16 +116,15 @@ We use End-to-end (E2E) pipeline tests to ensure that our pipelines are running
- That common tasks (components), which are stored in a dictionary object (`common_tasks`), occurred in the pipeline
- That if any task in a conditional tasks dictionary object occurred in the pipeline, all remaining tasks in that conditional group occurred as well
- That these pipeline tasks output the correct artifacts, by checking whether they have been saved to a GCS URI or generated successfully in Vertex AI.

Note:
These dictionary objects (`common_tasks`, `conditional_tasks`) are defined in `test_e2e.py` in each pipeline folder, e.g. `./pipelines/tests/xgboost/training/test_e2e.py`.
The E2E test allows only one common task group, but the number of conditional task groups is not limited. To define the correct task groups, inspect the pipeline job in Vertex AI.
For example, the XGBoost training pipeline has two conditional task groups, each enclosed in a dashed frame.
Thus, in `./pipelines/tests/xgboost/training/test_e2e.py`, there are two dictionaries, one per conditional task group.


![Conditional tasks in XGB](docs/images/conditional_tasks_snippet.png)
- Optionally check for executed tasks and created output artifacts.
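
A hypothetical sketch of how such task dictionaries and the resulting check might look (the task names, artifact lists, and the `check_tasks` helper are illustrative assumptions, not the repository's actual code):

```python
# Hypothetical sketch: task dictionaries as used by the E2E tests.
# Task names and artifact lists below are illustrative assumptions.
common_tasks = {
    "extract-bq-to-dataset": ["dataset"],  # task name -> expected output artifacts
    "train-model": ["model"],
}

conditional_tasks = [
    # Each dict is one conditional group: if any task in it ran,
    # every other task in the same group must have run too.
    {"upload-model": ["model"], "update-best-model": []},
    {"skip-upload-log": []},
]

def check_tasks(executed, common, conditional):
    """Return True iff all common tasks ran and every triggered conditional group completed."""
    if not all(name in executed for name in common):
        return False
    for group in conditional:
        ran = [name for name in group if name in executed]
        if ran and len(ran) != len(group):
            return False  # group was triggered but only partially executed
    return True

print(check_tasks({"extract-bq-to-dataset": {}, "train-model": {}}, common_tasks, conditional_tasks))  # True
```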

#### How to run end-to-end (E2E) pipeline tests
E2E tests are run on each PR that is merged to the main branch. You can also run them on your local machine:
@@ -282,6 +281,3 @@ To make sure that assets are available while running the ML pipelines, `make run
### Common assets

Within the [assets](./assets/) folder, there are common files stored which need to be uploaded to Google Cloud Storage so that the pipelines running on Vertex AI can consume them, namely:

- TFDV schema for [detecting input data anomalies](https://www.tensorflow.org/tfx/guide/tfdv#schema_based_example_validation): This schema file can be created using a [sample notebook](pipelines/schema_creation.ipynb) to ensure that new training data complies with our data assumptions and constraints as part of the training pipeline.
- TFDV schema for [detecting data skew](https://www.tensorflow.org/tfx/guide/tfdv#training-serving_skew_detection): This schema file is used to detect training-serving skew in the prediction pipeline. It can be created similarly to other schema files. However, it will need to include [skew detection settings](https://www.tensorflow.org/tfx/data_validation/get_started#checking_data_skew_and_drift).
4 changes: 3 additions & 1 deletion Makefile
@@ -63,7 +63,9 @@ test-all-components: ## Run unit tests for all pipeline components
done

sync-assets: ## Sync assets folder to GCS. Must specify pipeline=<training|prediction>
@gsutil -m rsync -r -d ./pipelines/pipelines/${PIPELINE_TEMPLATE}/$(pipeline)/assets ${PIPELINE_FILES_GCS_PATH}/$(pipeline)/assets
if [ -d "./pipelines/pipelines/${PIPELINE_TEMPLATE}/$(pipeline)/assets/" ] ; then \
gsutil -m rsync -r -d ./pipelines/pipelines/${PIPELINE_TEMPLATE}/$(pipeline)/assets ${PIPELINE_FILES_GCS_PATH}/$(pipeline)/assets ; \
fi ;
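
The guard above only syncs when the local assets folder exists, so pipelines without assets no longer fail. A minimal standalone shell sketch of the same pattern (bucket and directory names are hypothetical; the real `gsutil` call is commented out since it requires gcloud credentials):

```shell
#!/bin/sh
# Mirrors the Makefile guard: rsync assets to GCS only if the local folder exists.
sync_assets() {
    assets_dir="$1"
    gcs_dest="$2"
    if [ -d "$assets_dir" ]; then
        echo "syncing $assets_dir -> $gcs_dest"
        # gsutil -m rsync -r -d "$assets_dir" "$gcs_dest"
    else
        echo "skipping: $assets_dir does not exist"
    fi
}

mkdir -p /tmp/demo_assets
sync_assets /tmp/demo_assets gs://example-bucket/training/assets      # syncing ...
sync_assets /tmp/no_such_dir gs://example-bucket/prediction/assets    # skipping ...
```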

run: ## Compile pipeline, copy assets to GCS, and run pipeline in sandbox environment. Must specify pipeline=<training|prediction>. Optionally specify enable_pipeline_caching=<true|false> (defaults to default Vertex caching behaviour)
@ $(MAKE) compile-pipeline && \
11 changes: 2 additions & 9 deletions README.md
@@ -191,7 +191,6 @@ When triggering ad hoc runs in your dev/sandbox environment, or when running the
### Assets

In each pipeline folder, there is an `assets` directory (`pipelines/pipelines/<xgboost|tensorflow>/<training|prediction>/assets/`). This can be used for any additional files that may be needed during execution of the pipelines.
For the example pipelines, it may contain data schemata (for Data Validation) or training scripts. This [notebook](pipelines/schema_creation.ipynb) gives an example of schema generation.
This directory is rsync'd to Google Cloud Storage when running a pipeline in the sandbox environment or as part of the CD pipeline (see [CI/CD setup](cloudbuild/README.md)).

## Testing
@@ -254,11 +253,11 @@ Below is a diagram of how the files are published in each environment in the `e2
└── TAG_NAME or GIT COMMIT HASH <-- Git tag used for the release (release.yaml) OR git commit hash (e2e-test.yaml)
├── prediction
│ ├── assets
│ │ └── tfdv_schema_prediction.pbtxt
│ │ └── some_useful_file.json
│ └── prediction.json <-- compiled prediction pipeline
└── training
├── assets
│ └── tfdv_schema_training.pbtxt
│ └── training_task.py
└── training.json <-- compiled training pipeline
```

@@ -268,9 +267,3 @@
For more details on setting up CI/CD, see the [separate README](cloudbuild/README.md).

For a full walkthrough of the journey from changing the ML pipeline code to having it scheduled and running in production, please see the guide [here](docs/PRODUCTION.md).

### Using Dataflow

The `generate_statistics` pipeline component generates statistics about a given dataset (using the [`generate_statistics_from_csv`](https://www.tensorflow.org/tfx/data_validation/api_docs/python/tfdv/generate_statistics_from_csv) function in the [TensorFlow Data Validation](https://www.tensorflow.org/tfx/guide/tfdv) package) and can optionally be run using [Dataflow](https://cloud.google.com/dataflow/) to scale to huge datasets.

For instructions on how to do this, see the [README](pipeline_components/_tfdv/generate_statistics.md) for this component.
22 changes: 0 additions & 22 deletions containers/tfdv/Dockerfile

This file was deleted.

32 changes: 0 additions & 32 deletions containers/tfdv/README.md

This file was deleted.

Binary files removed (content not shown):
- docs/images/conditional_tasks_snippet.png
- docs/images/prediction_pipeline_example.png
- docs/images/tensorflow_component_championmodel.png
- docs/images/tensorflow_component_schema.png
- docs/images/tensorflow_component_training.png
- docs/images/tensorflow_prediction_component_skew.png
- docs/images/training_pipeline_example.png
- docs/images/xgboost_component_training.png
- four further binary files (names not shown)
1 change: 0 additions & 1 deletion pipeline_components/_tensorflow/.python-version

This file was deleted.

18 changes: 0 additions & 18 deletions pipeline_components/_tensorflow/Pipfile

This file was deleted.
