Google Analytics 4 Flattener. A Google Cloud Platform (GCP) solution that unnests (
flattens) Google Analytics Data 4 stored in Bigquery.
The GCP resources for the solutions are installed via Deployment Manager.
- About the config file
- Intraday flattening
- Controlling the type of output
- Config file examples
- Backfilling steps
- Google Analytics 4 (GA4) is the newest generation of Google Analytics.
- When you export data from GA4 into BigQuery, the schema will
have nested and repeated fields.
- Nested data represents a struct. In BigQuery, it can be described as a column inside of a column.
- Repeated data represents an array. In BigQuery, it can be described as a row inside of a row.
- Schema with nested and repeated data:
- Can make it harder to query the data (it takes a learning curve to get used to it).
- May be incompatible with other data systems. It maybe be impossible to import data into another database system or into a spreadsheet.
- The purpose of GA4 Flattener is to solve the above problems by flattening/unnesting GA4 data.
- Flattening/unnesting means converting all nested and repeated fields to regular fields (i.e., not a struct or an array).
- As a result, our data becomes simpler and more easily compatible with other systems.
- To imagine flat data, picture a Google Spreadsheet / Microsoft Excel table.
- GA4 Flattener keeps your original GA4 data with nested and repeated fiels, but it also writes 4 flat tables into the same BigQuery dataset.
- Google Cloud Platform SDK. Download and install from these instructions: https://cloud.google.com/sdk/docs/install
- Python >= 3.7. Download and install from https://python.org.
- Web browser
- git (optional for cloning this code repository)
-
Browse to https://console.cloud.google.com/ to create Google GCP project or use an existing project that has Google Analytics data flowing to it. Referred to as [PROJECT_ID].
-
Grant the installing user (you most likely) the basic IAM role of
Owner
. -
As the installing user for [PROJECT_ID], grant the following pre-defined IAM roles to [PROJECT_NUMBER]@cloudservices.gserviceaccount.com (built in service account). Otherwise deployment will fail with permission errors. See https://cloud.google.com/deployment-manager/docs/access-control for detailed explanation.
- Logs Configuration Writer
- Cloud Functions Developer
- Pub/Sub Admin
If your GCP project is brand new, you might not have [PROJECT_NUMBER]@cloudservices.gserviceaccount.com yet. To fix this, enable Compute Engine API and then disable it. The service account [PROJECT_NUMBER]@cloudservices.gserviceaccount.com will appear in your GCP project under IAM.
-
As the installing user for [PROJECT_ID], enable the following APIs
- Cloud Build API
- Cloud Deployment Manager V2 API
- Cloud Functions API
- Artifact Registry API
- Identity and Access Management (IAM) API
- Cloud Scheduler API (if you need to flatten intraday tables)
-
As the installing user for [PROJECT_ID], create a bucket or use an existing bucket for staging code, for example: **
[PROJECT_NUMBER]**-function-code-staging
. Referred to as [BUCKET_NAME].We recommend adding a label to this GCS bucket, for example:
key: bucket
value: ga4-flattener-function-code-staging
This label will help you roughly estimate GCS cost related to flattener.
-
Clone this github repo or download the source code from the releases section to your local machine or cloud shell.
-
Edit the ga_flattener.yaml and ga_flattener_colon.yaml files, specifically all occurrences of properties--> codeBucket value. Set the value to [BUCKET_NAME] (see step above)
-
Execute command in Google Cloud SDK Shell:
gcloud config set project **[PROJECT_ID]**
- alternatively, you can use this commands:
# instantiate a configuration (connect to a GCP project) gcloud init # activate the configuration gcloud config configurations activate {configuration_name} # make sure you're in the right GCP project) gcloud config configurations list
-
Execute command:
gcloud config set account <[email protected]>
. Note - This must be the installing user from above prerequisites. Feel free to skip this part unless it's required - I usually skip it as GCP already detects the user. -
Navigate (locally) to root directory of this repository
-
If [PROJECT_ID] does NOT contain a colon (:) execute command:
gcloud deployment-manager deployments create **[deployment_name]** --config ga_flattener.yaml
otherwise follow these steps:
-
execute command:
gcloud deployment-manager deployments create **[deployment_name]** --config ga_flattener_colon.yaml
-
Trigger function (with a blank message) named [deployment_name]-cfconfigbuilderps. It will create the necessary configuration file in the applications Google Coud Storage bucket. An easy method to do this is to browse to https://console.cloud.google.com/functions and click the cloud function named [deployment_name] -cfconfigbuilderps and go to the testing section and click "TEST THIS FUNCTION".
The steps above can be summarized in the template below. When I install the flattener, I like referring to the commands below. I adjust the commands to a specific project run one command at a time.
# INSTALL # initiate the configuration gcloud init # view configurations gcloud config configurations list # activate the right configuration gcloud config configurations activate {configuration} # dir cd {code_directory} # install gcloud deployment-manager deployments create ga4-flattener-deployment --config ga_flattener.yaml # DESCRIBE # verify installation gcloud deployment-manager deployments describe ga4-flattener-deployment # UNINSTALL # delete the flattener config file gsutil rm gs://ga4-flattener-deployment-{project_number}-adswerve-ga-flat-config/config_datasets.json # uninstall gcloud deployment-manager deployments delete ga4-flattener-deployment -q
-
Deployment name must be a match of regex
'[a-z](?:[-a-z0-9]{0,61}[a-z0-9])?'
-
Deployment naming conventions are as follows:
- Starts with a lowercase letter
- Can contain lowercase letters, numbers and hyphens
- Ends with a lowercase letter or number
- Up to 63 characters long (a beginning lowercase letter + 61 chars (hyphens, lowercase letters and number) + the ending lowercase letter or a number)
-
Note that deployment name cannot have underscores in its name, but can have hyphens.
-
An example of a valid name is in this command:
gcloud deployment-manager deployments create ga4-flattener-deployment --config ga_flattener.yaml
-
Please refer to the documentation for more examples of valid deployment names.
-
Below are some ideas, you don't have to do all of them.
-
Important checks:
-
Check to make sure the deployement command gave no errors
-
Check to make sure the config file is correct. See below more information about the config file.
-
Run backfill (at least for a couple of days) to make sure there are no errors in the Cloud Functions and flat data is being written. See the section about backfill.
-
-
Optional checks:
-
Check GCP Cloud Functions to make sure they deployed successfully.
-
Check GCP logs to make sure there are no errors in the flattening Cloud Function.
-
- After installation, a configuration file named
config_datasets.json
exists in gs://[deployment_name] -[PROJECT_NUMBER]-adswerve-ga-flat-config/ (Cloud Storage Bucket within [PROJECT_ID]).
-
This file contains all the datasets that have
events_yyyymmdd
tables and which tables to unnest. -
This configuration is required for this GA4 flattener solution to run daily or to backfill historical data.
-
Edit this file accordingly to include or exclude certain datasets or tables to unnest.
-
You can enable intraday flattening via this config file and specify its frequency in hours or minutes.
-
You can also enable sharded and/or partitioned output in the config file.
If you want non-default config file options, you need to do the following:
-
Manually download the
config_datasets.json
file from GCS (Google Cloud Storage) -
Edit it locally in a text/code upload
-
Upload the config file back to GCS and overwrite the old version.
-
In addition to daily tables
events_yyyymmdd
, you may also have the tableevents_intraday_yyyymmdd
, which has data from today and refreshes in real-time. -
To get the intraday BigQuery table, you need to check
Frequency
:Streaming
in your GA4 BigQuery linking. -
By default, the flattener does not flatten the intraday table.
-
You can enable intraday flattening by editing the config file (see instructions below).
-
There are two kinds of intraday table, you can have either of them, neither of them or both. The flattener can write flat intraday to a BigQuery table and/or to a BigQuery SQL view.
-
About intraday table flattening:
-
If you don't have have a daily table yet for a specific date (usually, today or yesterday), but you have an intraday table for that date, then the intraday table will be flattened. You will have flat tables, such as
flat_events_yyyymmdd
, which are based on a specific date's intraday table ```events_intraday_yyyymmdd``. -
At night, Google deletes the intraday table for a specific date and writes a daily table for that date.
-
The flattener Cloud Function runs and ovewrites the flat table
flat_events_yyyymmdd
. Now you have flat tables based on daily tableevents_yyyymmdd
.
-
-
About intraday view flattening:
- If it is enabled, you will have SQL views, for example
view_flat_events_{todays_date}
. Such views will be automatically created and deleted based on the source intraday tableevents_intraday_yyyymmdd
.
- If it is enabled, you will have SQL views, for example
-
You can enable intraday flattening by editing the config file and supplying the following part:
"intraday_flattening": { "intraday_flat_tables_schedule": { "frequency": 1, "units": "hours" }, "intraday_flat_views": true }
"units"
can be"hours"
or"minutes"
"frequency"
is an integer.- if your units are minutes, then the frequency should be between 1 and 59.
-
The flattened intraday tables will be overwritten at the specified frequency.
-
In the intraday table, user properties are not accurate, until the daily table arrives
-
The flat intraday table doesn't say it's based on intraday data in its name. The flat intraday table is called
flat_events_{date}
. -
therefore, please interpret flattened intraday data with caution
Enabling partitioned output in the config file
-
By default, the flattener produces flat sharded tables, because the original GA tables are also sharded.
-
You can enable the following in the config file:
A) sharded output
B) paritioned output
C) both sharded and partitioned output. It means you have two sets of flat tables: sharded flat tables and partitioned flat tables.
- You configure output type by changing the following part of the config file:
"output": {"sharded": true, "partitioned": false}
. If you only want partitioned output, This default option should be changed to this:"output": {"sharded": false, "partitioned": true}
Example 1 - default config
{
"analytics_222460912": {
"tables_to_flatten": [
"events",
"event_params",
"user_properties",
"items"
],
"output_format": {
"sharded": true,
"partitioned": false
},
"intraday_flattening": {
"intraday_flat_tables_schedule": null,
"intraday_flat_views": true
}
}
}
- This is the default config file.
- 4 flat tables will be created.
- There will be no flattening of the intraday table written to a BigQuery table, however, there will be an intraday SQL view.
- Only sharded output is produced.
Example 2 - excluding tables from flattening.
{
"analytics_222460912": {
"tables_to_flatten": [
"events",
"event_params"
],
"output_format": {
"sharded": true,
"partitioned": false
},
"intraday_flattening": {
"intraday_flat_tables_schedule": null,
"intraday_flat_views": true
}
}
}
- This config file will only create 2 flat tables for one GA4 property.
Example 3 - adding more datasets, intraday flattening and partitioned output.
{
"analytics_222460912": {
"tables_to_flatten": [
"events",
"event_params",
"user_properties",
"items"
],
"output_format": {
"sharded": true,
"partitioned": true
},
"intraday_flattening": {
"intraday_flat_tables_schedule": {
"frequency": 1,
"units": "hours"
},
"intraday_flat_views": true
}
},
"analytics_123456789": {
"tables_to_flatten": [
"events",
"event_params",
"user_properties",
"items"
],
"output_format": {
"sharded": true,
"partitioned": true
},
"intraday_flattening": {
"intraday_flat_tables_schedule": {
"frequency": 30,
"units": "minutes"
},
"intraday_flat_views": true
}
}
}
-
This config file will create all 4 flat tables for each of the 2 GA4 properties.
-
In both properties, we will also do intraday flattening.
-
In both properties, we create both shaded and partitioned output.
-
See the examples in the
./sample_config/*
in this repo.
The following steps are only required if you plan to backfill historical tables.
-
Make sure that you performed the steps from backfilling prerequisites.
-
Install Python >= 3.7
-
From Mac Terminal or Windows Command Prompt, upgrade pip:
Mac:
python3 -m pip install --upgrade pip
Windows:
py -m pip install --upgrade pip
-
Navigate to the root directory of the source code that was downloaded or cloned in step 6 above.
-
From a command prompt, install python virtual environments:
Mac:
python3 -m pip install --user virtualenv
Windows:
py -m pip install --user virtualenv
-
Create a virtual environment for the source code in step 6:
Mac:
python3 -m venv venv_ga_flattener
Windows:
py -m venv venv_ga_flattener
-
Activate the virtual environment in the step above:
Mac:
source venv_ga_flattener/bin/activate
Windows:
.\venv_ga_flattener\Scripts\activate
-
Install the python dependent packages into the virtual environment:
Mac:
pip install -r cf/requirements.txt
Windows:
pip install -r cf/requirements.txt
-
Modify values in the configuration section of
tools/pubsub_message_publish.py
accordingly. Use a small date range to start, like yesterday only. -
From gcloud CLI, authenticate the installing user using command:
gcloud auth application-default login
-
Run tools/pubsub_message_publish.py locally, for example from the package root:
python -m tools.pubsub_message_publish
This will publish a simulated logging event of GA4 data being ingested
into BigQuery. Check dataset(s) that are configured for new date sharded tables such as (depending on what is
configured): flat_event_params_yyyymmdd
, flat_events_yyyymmdd
, flat_items_yyyymmdd
, flat_user_properties_yyyymmdd
.
Tip: if you are having issues running the backfill locally (on your machine) due to some local environment peculiarities, try running the backfill on GCP using Cloud Shell.
Activate GCP Cloud Shell and run the following commands:
git clone https://github.com/adswerve/google_analytics_flattener_ga4
# go into root dir of the project
cd google_analytics_flattener_ga4
# make sure you're in the right directory:
ls -a
# Make the required changes to backfill Python script. Modify values in the configuration section of `tools/pubsub_message_publish.py` accordingly
python3 -m pip install --upgrade pip
python3 -m pip install --user virtualenv
python3 -m venv venv_ga_flattener
source venv_ga_flattener/bin/activate
pip install -r cf/requirements.txt
# you will only need this command on your machine. While using Cloud Shell, you can skip this command
# gcloud auth application-default login
# make sure we're inside the venv and inside the root flattener directory
python -m tools.pubsub_message_publish
# authorize the request in GCP UI if required
# the backfill script will print something like this:
# INFO: Publishing backfill message to topic ga4-flattener-deployment-topic for {project_id}.analytics_{ga4_property_id}.events_{date_shard}
-
Delete the config_datasets.json file from gs://[deployment_name]-[PROJECT_NUMBER]-adswerve-ga-flat-config/ (Cloud Storage Bucket within [PROJECT_ID])
- You can do this in GSC (Google Cloud Storage) UI or via command line:
gsutil rm gs://[deployment_name]-[PROJECT_NUMBER]-adswerve-ga-flat-config/config_datasets.json
- You can do this in GSC (Google Cloud Storage) UI or via command line:
-
Remove solution:
gcloud deployment-manager deployments delete **[deployment_name]** -q
-
If you were doing intraday flattening, manually delete a Cloud Scheduler job(s) in the UI
-
Message: AccessDeniedException: 403 [PROJECT_NUMBER]@cloudbuild.gserviceaccount.com does not have storage.objects.list access to the Google Cloud Storage bucket.
-
Resolution: Ensure the value (Cloud Storage bucket name) configured in "codeBucket" setting of ga_flattener* .yaml is correct. [PROJECT_NUMBER]@cloudbuild.gserviceaccount.com only requires GCP predefined role of Cloud Build Service Account
-
Message: google.auth.exceptions.DefaultCredentialsError: Could not automatically determine credentials. Please set GOOGLE_APPLICATION_CREDENTIALS or explicitly create credentials and re-run the application. For more information, please see https://cloud.google.com/docs/authentication/getting-started
-
Resolution: Ensure you run the gcloud command
gcloud auth application-default login
as this sets up the required authentication and it'll just work.
- cf : Pub/Sub triggered cloud function that executes a destination query to unnest(flatten)
the
analytics_{ga4_property_id}.events_yyyymmdd
table immediately upon arrival in BigQuery into these tables, depending on the configuration:- flat_event_params_yyyymmdd
- flat_events_yyyymmdd
- flat_items_yyyymmdd
- flat_user_properties_yyyymmdd
- tests : units test for both cloud functions and deployment manager templates
- cfconfigbuilder(ps) : cloud function that finds all BigQuery datasets that have an events table and adds them to the default configuration on Google's Cloud Storage in the following location: [deployment_name]-[PROJECT_NUMBER]-adswerve-ga-flat-config\config_datasets.json
- dm_helper.py: provides consistent names for GCP resources accross solution. Configuration and constants also found in the class in this file
- dmt-*: any files prefixed with dmt_ are python based Deployment Manager templates
- ga_flattener.yaml: Deployment Manager configuration file. The entire solution packages in this file. Used in the deployment manager create command
- tools/pubsub_message_publish.py : python based utility to publish a message to simulate an event that's being monitored in GCP logging. Useful for smoke testing and back-filling data historically.
- LICENSE: BSD 3-Clause open source license
- To make sure BigQuery analysis cost is billed to the right project and the SQL queries run in the correct project, activate the correct gcloud configuration:
gcloud config configurations activate {configuration_name}
- If necessary, create a new gcloud configuration:
gcloud init
- To run unit tests on your local machine, authenticate locally with
gcloud auth application-default login
. You could also save a file sa.json with GCP account credentials in the sandbox directory and specify path to it intest_base.py
. - Set the environment variables in
test_base.py
:"deployment"
,"project"
and"username"
. - Unit tests run locally assume that GA4 flattener is installed into your GCP project.
- GitHub CI/CD pipeline uses automatically saves sa.json into the sandbox directory and
sets
GOOGLE_APPLICATION_CREDENTIALS
to the filepath.