BigQuery email exports example #474

Merged: 32 commits, Dec 7, 2020

Conversation

nehanene15
Contributor

This PR contains an example of automated scheduling of BigQuery exports to email(s). Main features include:

  • Source code for a Cloud Function that uses the BigQuery and Storage APIs to execute the query and export the results
  • A script to deploy the pipeline, including creation of service accounts, Cloud Scheduler, Pub/Sub, and the Cloud Functions code.

@pull-request-size bot added the size/L label (Denotes a PR that changes 100-499 lines) on May 7, 2020
@googlebot

All (the pull request submitter and all commit authors) CLAs are signed, but one or more commits were authored or co-authored by someone other than the pull request submitter.

We need to confirm that all authors are ok with their commits being contributed to this project. Please have them confirm that by leaving a comment that contains only @googlebot I consent. in this pull request.

Note to project maintainer: There may be cases where the author cannot leave a comment, or the comment is not properly detected as consent. In those cases, you can manually confirm consent of the commit author(s), and set the cla label to yes (if enabled on your project).


@googlebot added the cla: no label (All committers have NOT signed a CLA) on May 7, 2020
@ishitashah24
Contributor

@googlebot I consent

@bharathkkb
Member

@googlebot i consent

@googlebot

We found a Contributor License Agreement for you (the sender of this pull request), but were unable to find agreements for all the commit author(s) or Co-authors. If you authored these, maybe you used a different email address in the git commits than was used to sign the CLA (login here to double check)? If these were authored by someone else, then they will need to sign a CLA as well, and confirm that they're okay with these being contributed to Google.
In order to pass this check, please resolve this problem and then comment @googlebot I fixed it.. If the bot doesn't comment, it means it doesn't think anything has changed.


@ishitashah24
Contributor

@googlebot

@ishitashah24
Contributor

@googlebot I fixed it.

@googlebot

CLAs look good, thanks!


@googlebot added the cla: yes label (All committers have signed a CLA) and removed the cla: no label on May 7, 2020
export_compression = "NONE"
export_destination_fmt = "NEWLINE_DELIMETED_JSON"
export_use_avro = False
export_field_delimeter = ","
These should be read from env vars.

compression=export_compression,
destination_format=export_destination_fmt,
field_delimeter=export_field_delimeter,
use_avro_logical_types=export_use_avro)
This is a leaky abstraction: use_avro_logical_types is more specific than export_use_avro.
In fact, whether the export uses Avro at all would be controlled by export_destination_fmt.
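
For illustration, a minimal sketch of what the reviewer is describing, assuming the google-cloud-bigquery client library (the EXPORT_* env var names are illustrative, not the PR's actual variables): the format is chosen solely by destination_format, and the Avro flag only qualifies Avro exports.

import os
from google.cloud import bigquery

# destination_format alone selects the export format:
# "CSV", "NEWLINE_DELIMITED_JSON", or "AVRO".
job_config = bigquery.job.ExtractJobConfig(
    compression=os.environ.get("EXPORT_COMPRESSION", "NONE"),
    destination_format=os.environ.get("EXPORT_FORMAT", "NEWLINE_DELIMITED_JSON"),
    field_delimiter=os.environ.get("EXPORT_FIELD_DELIMITER", ","),  # only used for CSV
    # Only meaningful when destination_format == "AVRO".
    use_avro_logical_types=os.environ.get("EXPORT_USE_AVRO_LOGICAL_TYPES", "False") == "True",
)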


# Set variables
signed_url_expiration_hrs = 24
from_email = "[email protected]"
This should be read from an env var.

# Set variables
signed_url_expiration_hrs = 24
from_email = "[email protected]"
to_email = "[email protected]"
These should be read from env vars.

default = 1
}

variable "function_bucket_1" {
All 3 functions can probably just put their source in a single "cloud_functions_source_bucket". What's the value of a bucket for each function?

Contributor Author

This is because Terraform bundles all of the files inside the provided GCS bucket into one Cloud Function. Giving each function its own bucket lets its respective main.py and requirements.txt map to a unique Cloud Function.

description = "GCS bucket for function 3 code that sends email."
}

variable "function_name_1" {
more descriptive name?
perhaps run_query_function_name

default = "bq-email-run-query"
}

variable "function_name_2" {
more descriptive name?
perhaps export_results_function_name

}

variable "function_bucket" {
description = "Bucket for function code."
variable "function_name_3" {
more descriptive name?
perhaps email_results_function_name


@jaketf left a comment

This is architecturally so much better. Thank you for the massive refactor!
I think you can make this solution much more reusable by having your Cloud Functions accept configuration (e.g. query, bucket, format, to/from email) as environment variables.
That way a user can reuse the exact same Python source and redeploy this solution for many queries just by setting env vars.
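
As a rough sketch of this suggestion (the variable names are assumptions, not the PR's final choices), configuration could be pulled from the environment at module load so misconfiguration fails fast:

import os

def require_env(name):
    """Return the env var value or fail fast with a clear error."""
    value = os.environ.get(name)
    if not value:
        raise KeyError(f"Required environment variable {name} is not set")
    return value

QUERY = require_env("QUERY")
BUCKET_NAME = require_env("BUCKET_NAME")
EXPORT_FORMAT = os.environ.get("EXPORT_FORMAT", "NEWLINE_DELIMITED_JSON")
FROM_EMAIL = require_env("FROM_EMAIL")
TO_EMAIL = require_env("TO_EMAIL")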

@nehanene15 requested a review from jaketf on October 16, 2020 17:12

@jaketf left a comment

The Cloud Functions look nice now. Thank you for the updates!
Can you please add some unit tests (e.g. mock the API calls and assert that you're making the expected calls for a given set of env vars)? Ideally we'd have integration tests for each function and the end-to-end flow, but I'm happy to leave those to a future PR. We are trying to have cloudbuild files run the tests for each asset in this repo.
As an example of tests for a similar function using GCS and BQ, you can look at this

@jaketf

jaketf commented Nov 5, 2020

We've updated the default branch for this repo to "main":
https://sfconservancy.org/news/2020/jun/23/gitbranchname/
https://github.com/github/renaming
Please change the target branch to main.


The functional steps are listed here:

**Cloud Scheduler:** A [Cloud Scheduler](https://cloud.google.com/scheduler) job defines the time frame between periodic email exports.

@jaketf Nov 16, 2020

Not to move the goal post again, and perhaps you've already ruled this out (or built this a while ago), but why not use BigQuery Data Transfer Service scheduled queries (native to BQ, with no Cloud Scheduler dependency)? https://cloud.google.com/bigquery/docs/scheduling-queries#configuration_options

These can be set up with Pub/Sub notifications out of the box:
https://cloud.google.com/bigquery-transfer/docs/transfer-run-notifications

This would get rid of Cloud Scheduler, Pub/Sub topic #1, and Cloud Function #1, and would instead use BQ scheduled queries to write directly to a Pub/Sub topic. There would also be no need for the logging sink / filter, because we can handle this more explicitly.

I think this is a much cleaner solution (and one that will eliminate your VPC-SC caveat!).
It should be a small change, since you can reuse Cloud Function #2 as-is and just remove the Terraform for the extra Pub/Sub topics / Cloud Function / logging sink.
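
For reference, a hedged sketch of how a scheduled query with a Pub/Sub notification could be created with the google-cloud-bigquery-datatransfer client (project, dataset, topic, and query values are placeholders, not the PR's actual configuration):

from google.cloud import bigquery_datatransfer

client = bigquery_datatransfer.DataTransferServiceClient()

transfer_config = bigquery_datatransfer.TransferConfig(
    destination_dataset_id="my_dataset",
    display_name="bq-email-export-query",
    data_source_id="scheduled_query",
    params={
        "query": "SELECT * FROM `my_project.my_dataset.my_table`",
        "destination_table_name_template": "email_export_results",
        "write_disposition": "WRITE_TRUNCATE",
    },
    schedule="every 24 hours",
    # Each completed run is published here, triggering the export Cloud Function.
    notification_pubsub_topic="projects/my_project/topics/bq-email-export",
)

client.create_transfer_config(
    parent="projects/my_project/locations/us",
    transfer_config=transfer_config,
)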

Contributor Author

Definitely! I have implemented this in the most recent commit.

@nehanene15 requested a review from jaketf on November 18, 2020 16:41
@nehanene15 changed the base branch from master to main on November 19, 2020 19:31
blob_path = log_entry['protoPayload']['serviceData'][
'jobCompletedEvent']['job']['jobConfiguration']['extract'][
'destinationUris'][0]
bucket_name = get_bucket(blob_path)
It would be cleaner to use storage.Blob.from_string(), since you only use this string parsing to construct the blob object anyway. That will do away with the need for those unit tests.

Kudos to @danieldeleo for teaching me about this function recently in another code review!
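
A small sketch of the suggested call, assuming the google-cloud-storage library (the URI is a placeholder):

from google.cloud import storage

storage_client = storage.Client()
# Parses "gs://bucket/path/to/object" directly into a Blob, so no manual
# string splitting (or unit tests for it) is needed.
blob = storage.Blob.from_string("gs://my-export-bucket/exports/results.json",
                                client=storage_client)
print(blob.bucket.name, blob.name)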


def get_destination_uri():
"""Returns destination GCS URI for export"""
return f"gs://{os.environ.get('BUCKET_NAME')}/{os.environ.get('OBJECT_NAME')}"
Note that os.environ.get will silently return None if the env var is not present; this should raise if these environment variables are not set.

nit: This function is also essentially just a global constant (these env vars are not expected to change throughout the lifetime of the function).

I would suggest refactoring this to be a global constant at the top of the file:

DESTINATION_URI = f"gs://{os.environ['BUCKET_NAME']}/{os.environ['OBJECT_NAME']}"

This will raise a KeyError if these env vars don't exist and "fail faster", instantly on module load, rather than waiting until you call this function (notably after instantiating a BigQuery client that is doomed never to be usefully used if this destination path is not properly formed).

Also, this seems like it will overwrite the same file on each scheduled run. Should we instead encode a timestamp in the GCS destination path to distinguish between different scheduled exports?

import time

DESTINATION_URI = f"gs://{os.environ['BUCKET_NAME']}/{time.monotonic()}/{os.environ['OBJECT_NAME']}"

Contributor Author

Good point. I'll read all the env vars using os.environ['BUCKET_NAME'] so it fails faster. I'll also add a function to read the env vars for unit testing. Since I'll be reading them via a function, a module-level constant can't easily use it, so I'll most likely keep the get_destination_uri function, but I'm open to other suggestions.

That's fine.

"""Entrypoint for Cloud Function"""

data = base64.b64decode(event['data'])
pubsub_message = json.loads(data)
Naming nit: event is a Pub/Sub message (attributes plus a binary payload). Once you base64-decode it and json.loads it, it's a dict that represents the BigQuery DTS TransferRun object.

Suggestion:
s/pubsub_message/upstream_bq_dts_transfer_run/g
or
s/pubsub_message/scheduled_query_transfer_run/g

Contributor Author

I'm unsure of what you mean by the suggestions. I can update the pubsub_message var to reflect that it's a dict by renaming to transfer_run_dict. Would that address this comment or am I misunderstanding?

Yes that would work.

table_name = pubsub_message['params'][
'destination_table_name_template']

bq_client = bigquery.Client()
Could you set client_info (like this) so that API calls made by this tool can be tracked by user agent?
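
A minimal sketch of what that could look like (the user-agent string is an illustrative placeholder):

from google.api_core.client_info import ClientInfo
from google.cloud import bigquery

# A custom user agent lets API usage from this tool be attributed in logs.
bq_client = bigquery.Client(
    client_info=ClientInfo(user_agent="google-pso-example/bq-email-exports"))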

bq_client = bigquery.Client()

destination_uri = get_destination_uri()
dataset_ref = bigquery.DatasetReference(project_id, dataset_id)
This is safer in case dataset_id includes a project, e.g. project_id.dataset_id or project_id:dataset_id.

Suggested change:
- dataset_ref = bigquery.DatasetReference(project_id, dataset_id)
+ dataset_ref = bigquery.DatasetReference.from_string(dataset_id, default_project=project_id)

pubsub_message = json.loads(data)
error = pubsub_message.get('errorStatus')
if error:
logging.error(RuntimeError(f"Error in upstream query job:{error}"))
nit: also log job id for full context.

Contributor Author

The built-in error message from the object already includes the job ID. The error block looks like this:
{'code': 5, 'message': 'Not found: Dataset nehanene-dev:tf_test; JobID: nehanene-dev:5fb3f4cc-0000-2d86-97ed-94eb2c042718'}

I'll print out just the error message instead of the whole error block for prettiness, though.

else:
blob_path = log_entry['protoPayload']['serviceData'][
'jobCompletedEvent']['job']['jobConfiguration']['extract'][
'destinationUris'][0]
It seems like you should check whether destinationUris has multiple entries and log a warning that you're only returning the first one.
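
A small sketch of that guard (the URIs shown are placeholders):

import logging

destination_uris = ["gs://my-bucket/part-000.json", "gs://my-bucket/part-001.json"]

if len(destination_uris) > 1:
    logging.warning("Export produced %d destination URIs; only the first (%s) is used.",
                    len(destination_uris), destination_uris[0])
blob_path = destination_uris[0]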

bucket = storage_client.bucket(bucket_name)
blob = bucket.blob(object_name)

# Cloud Functions service account must have Service Account Token Creator role
What happens if this permission is missing?
Can you add a try/except block to catch the error and remind the user that os.getenv("FUNCTION_IDENTITY") needs roles/iam.serviceAccountTokenCreator?
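
A hedged sketch of such a guard (the URI is a placeholder and the signing call only loosely mirrors the PR's; the broad except is deliberate, since the exact exception raised for a denied signing call can vary):

import os
from datetime import timedelta
from google.cloud import storage

blob = storage.Blob.from_string("gs://my-export-bucket/exports/results.json",
                                client=storage.Client())
try:
    url = blob.generate_signed_url(
        version="v4",
        expiration=timedelta(hours=24),
        service_account_email=os.getenv("FUNCTION_IDENTITY"))
except Exception as err:
    raise RuntimeError(
        f"Could not sign URL. Ensure {os.getenv('FUNCTION_IDENTITY')} has "
        "roles/iam.serviceAccountTokenCreator.") from err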

service_account_email=os.getenv("FUNCTION_IDENTITY"),
)

url = blob.generate_signed_url(
It'd be nice if this solution could also send unsigned URLs (based on an env var). That way you could email a URL to a GCS object that is only accessible by authenticated users (e.g. they would log in with their GCP identity and then download):
https://storage.cloud.google.com/{bucket_id}/{object_id}
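
A tiny sketch of building such an authenticated-browser URL (the helper name and URI are illustrative; the PR's later code toggles this behind a SIGNED_URL env var, as shown further down):

def get_auth_url(blob_path):
    """Authenticated URL; the recipient must sign in with a GCP identity to download."""
    return "https://storage.cloud.google.com/" + blob_path.replace("gs://", "", 1)

print(get_auth_url("gs://my-export-bucket/exports/results.json"))
# -> https://storage.cloud.google.com/my-export-bucket/exports/results.json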



@pytest.fixture
def mock_env(monkeypatch):
There are several other env vars you read to construct the export request; you can unit test those as well.

It's perhaps most important to unit test the behavior when these env vars are not properly set. You can assert that the expected exception is raised or that you fall back to reasonable defaults.
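
A sketch of that kind of test using the monkeypatch fixture (the function under test is a stand-in here, assuming env vars are read with os.environ[...] so a KeyError is expected when one is missing):

import os
import pytest

def get_destination_uri():
    """Stand-in for the function under test."""
    return f"gs://{os.environ['BUCKET_NAME']}/{os.environ['OBJECT_NAME']}"

def test_destination_uri(monkeypatch):
    monkeypatch.setenv("BUCKET_NAME", "my-bucket")
    monkeypatch.setenv("OBJECT_NAME", "results.json")
    assert get_destination_uri() == "gs://my-bucket/results.json"

def test_missing_env_var_raises(monkeypatch):
    monkeypatch.delenv("BUCKET_NAME", raising=False)
    monkeypatch.setenv("OBJECT_NAME", "results.json")
    with pytest.raises(KeyError):
        get_destination_uri()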

@nehanene15 requested a review from jaketf on November 24, 2020 15:41
@@ -60,4 +66,10 @@ def main(event, context):

def get_destination_uri():
"""Returns destination GCS URI for export"""
return f"gs://{os.environ.get('BUCKET_NAME')}/{os.environ.get('OBJECT_NAME')}"
return (f"gs://{get_env('BUCKET_NAME')}/"
f"{time.strftime('%Y%m%d-%H%M%S')}/{get_env('OBJECT_NAME')}")
Nit: this is not a very common way to represent a timestamp (https://docs.python.org/3/library/datetime.html#datetime.datetime.timestamp). I'd suggest just using a Unix timestamp, which is most explicit, or something like yyyy=%Y/mm=%m/dd=%d/hr=%H/, which will be familiar to Hadoop users.
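
For illustration, the two suggested path styles (bucket and object names are placeholders):

import time

BUCKET = "my-export-bucket"
OBJECT = "results.json"

# Unix timestamp: unambiguous and sortable.
unix_path = f"gs://{BUCKET}/{int(time.time())}/{OBJECT}"

# Hive/Hadoop-style partition path.
partition_path = f"gs://{BUCKET}/{time.strftime('yyyy=%Y/mm=%m/dd=%d/hr=%H')}/{OBJECT}"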

Also, currently this is the Cloud Function execution time. Would it make more sense to use the scheduled time of the query (read from the transfer run)?

Contributor Author

Yeah, I changed it to use the schedule time from DTS in ISO format.

blob_path = destination_uris[0]
blob = storage.Blob.from_string(blob_path)
url = generate_signed_url(blob) if get_env(
'SIGNED_URL') == 'True' else get_auth_url(blob_path)
Nitty nit: I like to use strtobool for this sort of thing, as it accepts more possible user inputs and does the right thing.

https://docs.python.org/3/distutils/apiref.html#distutils.util.strtobool
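
A minimal sketch of that suggestion (note that distutils is deprecated as of Python 3.10 and removed in 3.12, so a small local helper is a reasonable alternative on newer runtimes):

import os
from distutils.util import strtobool  # accepts "y"/"yes"/"true"/"1"/"on", etc.

use_signed_url = bool(strtobool(os.environ.get("SIGNED_URL", "True")))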

@nehanene15 requested a review from jaketf on December 2, 2020 17:47

@jaketf left a comment

LGTM!

@jaketf merged commit 5f2606d into GoogleCloudPlatform:main on Dec 7, 2020
rosmo pushed a commit to rosmo/professional-services that referenced this pull request Feb 1, 2021
* initial commit

* config files added

* new signing func

* Consolidated deploy script and minor edits to main.py

* created readme

* Updated to generalize variable names

* Added exception catching, cleaned up scripts, and updated README

* Quick fixes to README

* Ran shellcheck

* Added comments to main.py

* updated the license

* Updated code format and licenses

* Moved files into directory

* Moving gitignore files into dir

* Revert "Moving gitignore files into dir"

This reverts commit 23f9993.

* Revert "Moved files into directory"

This reverts commit ea33ceb.

* Moved files to examples/ and updated README

* lint

* yapf

* Added Terraform for provisioning, updated Scheduler to take a payload with configs, and updated syntax in main.py

* lint

* Updated TF code to set APIs, archive files, and templatize payload. Added query and export config options in payload. Set timeouts for query and export jobs.

* Major refactor of code to microservices-driven architecture

* pep8 style reformat

* Added env vars for reproducibility, renamed folders/vars for specificity

* added unit tests and DTS for scheduled queries

* added support for unsigned urls and better error catching

* added schedule time to GCS path, strtobool for signed URL env var

Co-authored-by: ishitashah <[email protected]>
Co-authored-by: bharathkkb <[email protected]>
Co-authored-by: ishitashah24 <[email protected]>
Co-authored-by: Jacob Ferriero <[email protected]>
rosmo pushed a commit to rosmo/professional-services that referenced this pull request Mar 17, 2022
Labels: cla: yes (All committers have signed a CLA), size/XL (Denotes a PR that changes 500-999 lines)
7 participants