This lab showcases the fine-grained access control made possible by BigLake, with a minimum viable example of ice cream sales forecasting in a Spark notebook hosted on a personal auth Cloud Dataproc cluster.
Sales forecasting with Prophet
- Just enough knowledge of creating and using BigLake tables on files in Cloud Storage
- Just enough knowledge of Row and Column Level Security setup with BigLake
- Introduction to notebooks on Dataproc in case you are new to Dataproc
- Accessing BigLake through PySpark with the BigQuery Spark connector from Google Cloud (see the sketch after this list)
- Just enough Terraform for automating provisioning, which can be repurposed for your workloads
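For orientation, below is a hedged sketch of the access pattern the lab's notebook uses: reading the BigLake table from PySpark through the Spark BigQuery connector. PROJECT_NAME is a placeholder, and the connector is assumed to be available on the cluster (it is on the lab's Dataproc clusters).

```python
# Hedged sketch: read the BigLake table via the Spark BigQuery connector.
# Because reads go through BigQuery, BigLake row- and column-level policies are
# enforced: each persona sees only the rows and columns they are entitled to.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("biglake-finegrained-read").getOrCreate()

PROJECT_NAME = "your-project-id"  # placeholder

sales_df = (
    spark.read.format("bigquery")
    .load(f"{PROJECT_NAME}.biglake_dataset.IceCreamSales")
)
sales_df.show(10)
```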
Kaggle dataset for Icecream Sales
Row Level Security (RLS) and Column Level Security (CLS) are showcased.
Three users are created as part of the lab, with fine-grained access implemented -
- usa_user@ - RLS & CLS: has access to all columns of data with Country in USA
- aus_user@ - RLS & CLS: has access to all columns of data with Country in Australia
- mkt_user@ - CLS: has access to all columns except Discount and Net_Revenue, but to data from all countries
Through a PySpark notebook that is run as each of the three user personas, we will learn how access varies based on fine-grained permissions.
This lab features Dataproc Personal Auth Clusters as the Spark infrastructure, and JupyterLab on Dataproc as the notebook infrastructure.
About Cloud Dataproc personal auth clusters:
- Dataproc Personal Cluster Authentication is intended for interactive jobs run by an individual (human) user. Long-running jobs and operations should configure and use an appropriate service account identity.
- When you create a cluster with Personal Cluster Authentication enabled, the cluster will only be usable by a single identity. Other users will not be able to run jobs on the cluster or access Component Gateway endpoints on the cluster.
- Clusters with Personal Cluster Authentication enabled automatically enable and configure Kerberos on the cluster for secure intra-cluster communication. However, all Kerberos identities on the cluster will interact with Google Cloud resources as the same user, providing identity propagation and fine-grained auditability.
So effectively, the architecture is as depicted below-
This section covers the Column Level Security setup.
1. What's involved
Effectively, only the users usa_user@ and aus_user@ have access to the columns IceCreamSales.Discount and IceCreamSales.Net_Revenue (a query sketch illustrating this follows the numbered items below).
2. Taxonomy:
3. Policy Tag:
4. Table:
5. Grants:
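You can verify this with a simple query run as each persona. The sketch below (hedged; it uses the BigQuery Python client) selects the policy-tagged columns: usa_user@ and aus_user@ get rows back (only for their own country, due to row-level security), while mkt_user@ and your admin account get an access-denied error.

```python
# Hedged sketch: run as each persona to see column-level security in action.
from google.cloud import bigquery

client = bigquery.Client()

sql = """
SELECT Country, Discount, Net_Revenue
FROM `biglake_dataset.IceCreamSales`
LIMIT 5
"""

# usa_user@ / aus_user@: rows for their own country only (row-level security).
# mkt_user@ / admin: access denied on the policy-tagged columns.
for row in client.query(sql).result():
    print(dict(row.items()))
```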
This section covers the Row Level Security setup.
1. What's involved
2. Example
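For reference, here is a hedged sketch of DDL equivalent to the row access policies the lab's Terraform creates, submitted through the BigQuery Python client; the group emails are placeholders for your organization's groups.

```python
# Hedged sketch: row access policy DDL equivalent to what Terraform provisions.
# Group emails are placeholders; the lab's Terraform creates these for you.
from google.cloud import bigquery

client = bigquery.Client()

client.query("""
CREATE OR REPLACE ROW ACCESS POLICY Australia_filter
ON `biglake_dataset.IceCreamSales`
GRANT TO ('group:australia-sales@YOUR_ORG_NAME')
FILTER USING (Country = 'Australia')
""").result()

client.query("""
CREATE OR REPLACE ROW ACCESS POLICY US_filter
ON `biglake_dataset.IceCreamSales`
GRANT TO ('group:us-sales@YOUR_ORG_NAME')
FILTER USING (Country = 'United States')
""").result()
```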
- Cloud IAM - Users, groups, group memberships, roles
- Cloud Storage - raw data & notebook, Dataproc temp bucket and staging bucket
- Dataplex Data Catalog - policy tag taxonomy, policy tag
- BigLake - fine-grained row level and column level security on a CSV in Cloud Storage
- Cloud Dataproc - Spark on JupyterLab for forecasting ice cream sales
- Data preprocessing at scale: Spark, specifically PySpark
- Forecasting: Prophet with Python (a minimal sketch follows this list)
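The forecasting piece boils down to a fit/predict loop with Prophet. The sketch below is a minimal, self-contained illustration on made-up monthly data; the lab's notebook does the same thing on the preprocessed IceCreamSales data instead.

```python
# Hedged sketch: a minimal Prophet forecast on made-up monthly data.
# Prophet expects a frame with columns 'ds' (date) and 'y' (value).
import pandas as pd
from prophet import Prophet

history = pd.DataFrame({
    "ds": pd.date_range("2021-01-01", periods=24, freq="MS"),  # month starts
    "y": range(100, 124),                                      # stand-in for monthly revenue
})

model = Prophet()
model.fit(history)

future = model.make_future_dataframe(periods=12, freq="MS")    # forecast 12 months ahead
forecast = model.predict(future)
print(forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].tail())
```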
~ 90 minutes
Fully scripted, with detailed instructions; intended for learning rather than as a challenge
# | Google Cloud Collaborators | Contribution |
---|---|---|
1. | Dataproc Product Management and Engineering (Antonio Scaramuzzino and Pedro Melendez) | Inspiration, vision and sample |
2. | Jay O'Leary, Customer Engineer | Automation of lab |
3. | Anagha Khanolkar, Customer Engineer | Architecture, Diagrams, Narrative, Lab Guide, Testing, UX |
Community contributions to improve the lab are very much appreciated.
If you have any questions or if you found any problems with this repository, please report through GitHub issues.
Note the project number and project ID.
We will need this for the rest of the lab.
This is needed for the networking setup.
Go to Cloud IAM and, through the UI, grant yourself the Security Admin role.
This is needed to set project-level policies.
In the UI, set the context to the organization level (instead of the project).
Go to Cloud IAM and, through the UI, grant yourself the Organization Policy Administrator role at the organization level.
Don't forget to set the UI context back to the project you created in Step 1 above.
Go to admin.google.com...
- You will add three users:
1. One user with access to all USA records in the dataset
2. One user with access to all Australia records in the dataset
3. One marketing user with access to both USA and Australia records but restricted to certain columns
- While you can use any usernames you want, we recommend you use the following as we have tested with these:
1. usa_user
2. aus_user
3. mkt_user
To make it easier to switch between the three personas (users), we recommend you set up three profiles in your browser.
- To add a profile:
- Click on your profile picture at the far right of the screen, next to the vertical 3 dots.
- Then click on '+ Add' at the bottom of the screen, as shown below:
We recommend you set up three profiles:
- One for the USA User
- One for the Australia User
- And one for the Marketing User
For more information see these instructions --> Add Profile Instructions
The following services and resources will be created via Terraform scripts:
- VPC, Subnetwork and NAT rules
- IAM groups for USA and Australia
- IAM permissions for user principals and Google-managed default service accounts
- GCS buckets for each user principal and a Dataproc temp bucket
- Dataplex Policy for Column level Access
- BigQuery Dataset, Table and Row Level Policies
- Dataproc 'Personal auth' (kerberized) clusters: one cluster each for the USA, Australia and Marketing users
- Pre-created Jupyter notebooks, uploaded to the GCS buckets
- Terraform for automation
- Cloud Shell for executing Terraform
This section covers creating the environment via Terraform from Cloud Shell.
- Launch cloud shell
- Clone this git repo
- Provision foundational resources such as Google APIs and Organization Policies
- Provision the GCP data Analytics services and their dependencies for the lab
Instructions for launching and using cloud shell are available here.
cd ~
git clone https://github.com/j-f-oleary-bigdata/biglake-finegrained-lab
cd ~/biglake-finegrained-lab/
Browse and familiarize yourself with the layout and optionally, review the scripts for an understanding of the constructs as well as how dependencies are managed.
- Define variables for use with Terraform
- Initialize Terraform
- Run a Terraform plan & study it
- Apply the Terraform to create the environment
- Validate the environment created
Modify the below as appropriate for your deployment, e.g. region, zone, etc. Be sure to use the correct case for the GCP region and zone.
Make the corrections as needed below, then copy and paste the text into the Cloud Shell session.
PROJECT_ID=`gcloud config list --format "value(core.project)" 2>/dev/null`
PROJECT_NBR=`gcloud projects describe $PROJECT_ID | grep projectNumber | cut -d':' -f2 | tr -d "'" | xargs`
PROJECT_NAME=`gcloud projects describe ${PROJECT_ID} | grep name | cut -d':' -f2 | xargs`
GCP_ACCOUNT_NAME=`gcloud auth list --filter=status:ACTIVE --format="value(account)"`
LOCATION="us-central1"
ORG_ID=`gcloud organizations list | grep DISPLAY_NAME | cut -d':' -f2 | xargs`
YOUR_GCP_MULTI_REGION="US"
USA_USERNAME="usa_user"
AUS_USERNAME="aus_user"
MKT_USERNAME="mkt_user"
echo "PROJECT_ID=$PROJECT_ID"
echo "PROJECT_NBR=$PROJECT_NBR"
echo "LOCATION=$LOCATION"
echo "ORG_ID=$ORG_ID"
echo "USA_USERNAME=$USA_USERNAME"
echo "AUS_USERNAME=$AUS_USERNAME"
echo "MKT_USERNAME=$MKT_USERNAME"
Foundational resources in this lab consist of Google APIs and organization policies.
The command below needs to run in Cloud Shell from ~/biglake-finegrained-lab/org_policy
cd ~/biglake-finegrained-lab/org_policy
terraform init
The Terraform below first enables the Google APIs needed for the demo and then updates organization policies. It needs to run in Cloud Shell from ~/biglake-finegrained-lab/org_policy.
Time taken to complete: <5 minutes
terraform apply \
-var="project_id=${PROJECT_ID}" \
--auto-approve
Needs to run in cloud shell from ~/biglake-finegrained-lab/demo
cd ~/biglake-finegrained-lab/demo
terraform init
Needs to run in cloud shell from ~/biglake-finegrained-lab/demo
terraform plan \
-var="project_id=${PROJECT_ID}" \
-var="project_nbr=${PROJECT_NBR}" \
-var="org_id=${ORG_ID}" \
-var="location=${LOCATION}" \
-var="usa_username=${USA_USERNAME}" \
-var="aus_username=${AUS_USERNAME}" \
-var="mkt_username=${MKT_USERNAME}"
Needs to run in cloud shell from ~/biglake-finegrained-lab/demo.
Time taken to complete: <10 minutes
terraform apply \
-var="project_id=${PROJECT_ID}" \
-var="project_nbr=${PROJECT_NBR}" \
-var="org_id=${ORG_ID}" \
-var="location=${LOCATION}" \
-var="usa_username=${USA_USERNAME}" \
-var="aus_username=${AUS_USERNAME}" \
-var="mkt_username=${MKT_USERNAME}" \
--auto-approve
From your default GCP account (NOT to be confused with the three users we created), go to the Cloud Console, and validate the creation of the following resources-
Validate the IAM users in the project by navigating on the Cloud Console to -
- Yourself
- usa_user
- aus_user
- mkt_user
- Group: australia-sales with email: australia-sales@YOUR_ORG_NAME with the user aus_user@ in it
- Group: us-sales with email: us-sales@YOUR_ORG_NAME with the user usa_user@ in it
a) User Principals:
- usa_user: Viewer, Dataproc Editor
- aus_user: Viewer and Dataproc Editor
- mkt_user: Viewer and Dataproc Editor
b) Google-managed Compute Engine default service account:
- YOUR_PROJECT_NUMBER-compute@developer.gserviceaccount.com: Dataproc Worker
c) BigQuery Connection Default Service Account:
Covered below
- dataproc-bucket-aus-YOUR_PROJECT_NUMBER
- dataproc-bucket-mkt-YOUR_PROJECT_NUMBER
- dataproc-bucket-usa-YOUR_PROJECT_NUMBER
- dataproc-temp-YOUR_PROJECT_NUMBER
- dataproc-bucket-aus-YOUR_PROJECT_NUMBER: Storage Admin to aus_user@
- dataproc-bucket-mkt-YOUR_PROJECT_NUMBER: Storage Admin to mkt_user@
- dataproc-bucket-usa-YOUR_PROJECT_NUMBER: Storage Admin to usa_user@
- dataproc-temp-YOUR_PROJECT_NUMBER: Storage Admin to all three users created
Validate the creation of-
- VPC called default
- Subnet called default
- Firewall called subnet-firewall
- Cloud Router called nat-router
- Cloud NAT gateway called nat-config
From your default login (not as one of the three users created above), go to the Cloud Console and then the Dataproc UI, and validate the creation of the following three Dataproc clusters:
Each of the three buckets (dataproc-bucket-aus/usa/mkt-YOUR_PROJECT_NUMBER) below should contain the following, in this exact directory structure:
- notebooks/jupyter/IceCream.ipynb
Navigate to Dataplex->Policy Tag Taxonomies and you should see a policy tag taxonomy called -
- Business-Critical-YOUR_PROJECT_NUMBER
Click on the Policy Tag Taxonomy in Dataplex and you should see a Policy Tag called -
- Financial Data
Each of the two users, usa_user@ and aus_user@, is granted the datacatalog.categoryFineGrainedReader role on the policy tag created
Navigate to BigQuery in the Cloud Console and you should see, under "External Connections" -
- An external connection called 'us-central1.biglake.gcs'
In the BigQuery console, you should see a dataset called-
- biglake_dataset
bqcx-YOUR_PROJECT_NUMBER@gcp-sa-bigquery-condel.iam.gserviceaccount.com: Storage Object Viewer
A BigLake table called IceCreamSales (a hedged DDL sketch follows this list) -
- That uses the BigLake connection 'us-central1.biglake.gcs'
- With CSV configuration
- On the CSV file at gs://dataproc-bucket-aus-YOUR_PROJECT_NUMBER/data/IceCreamSales.csv
- With a set schema
- With the column 'Discount' tied to the policy tag created, 'Financial Data'
- With the column 'Net_Revenue' tied to the policy tag created, 'Financial Data'
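For reference, the sketch below shows DDL roughly equivalent to the BigLake table the Terraform creates, submitted through the BigQuery Python client. The column types are assumptions, the connection name should match what you see under External Connections, and the policy tags on Discount and Net_Revenue are attached separately (the Terraform handles that).

```python
# Hedged, reference-only sketch: Terraform has already created this table.
# Column types are assumptions; policy tags are attached outside this DDL.
from google.cloud import bigquery

client = bigquery.Client()

client.query("""
CREATE EXTERNAL TABLE IF NOT EXISTS `biglake_dataset.IceCreamSales`
(
  Month STRING,
  Country STRING,
  Gross_Revenue NUMERIC,
  Discount NUMERIC,
  Net_Revenue NUMERIC
)
WITH CONNECTION `us-central1.biglake.gcs`
OPTIONS (
  format = 'CSV',
  uris = ['gs://dataproc-bucket-aus-YOUR_PROJECT_NUMBER/data/IceCreamSales.csv'],
  skip_leading_rows = 1
)
""").result()
```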
Validate the creation of Row Access Policies, one each for aus_user@ and usa_user@ -
- Row Access Policy for the BigLake table IceCreamSales called 'Australia_filter' associated with the IAM group australia-sales@ on filter Country="Australia"
- Row Access Policy for the BigLake table IceCreamSales called 'US_filter' associated with the IAM group us-sales@ on filter Country="United States"
So far, you have completed the environment setup and validation. In this sub-module, you will experience the fine-grained access control made possible by BigLake.
In your current default user login, navigate to BigQuery on the Cloud Console. You should see a dataset biglake_dataset and a table called "biglake_dataset.IceCreamSales".
Run the query below in the BQ query UI-
SELECT * FROM `biglake_dataset.IceCreamSales` LIMIT 1000
You should not see any results; in fact, you should see the following error -
Access Denied: BigQuery BigQuery: User has neither fine-grained reader nor masked get permission to get data protected by policy tag "Business-Critical-225879788342 : Financial Data" on columns biglake_dataset.IceCreamSales.Discount, biglake_dataset.IceCreamSales.Net_Revenue.
This is a demonstration of the principle of least privilege: administrators should not have access to the data within the IceCreamSales table.
This section demonstrates how you can use BigLake to restrict access based on policies from a PySpark notebook. You can also run a query against the table in BigQuery directly and see the same security enforced.
- Row Level Security: "usa_user" can only access data for (Country=)United States in the IceCreamSales table
- Column Level Security: "usa_user" can see the columns Discount and Net_Revenue
What to expect:
- You will log on as the usa_user in an incognito browser
- First, you will launch Cloud Shell in the Cloud Console and create a personal authentication session that you will keep running for the duration of this lab section
- Next, you will go to the Dataproc UI on the Cloud Console, go to "WEB INTERFACES" and launch JupyterLab
- In JupyterLab, you will first launch a terminal session and authenticate yourself and get a Kerberos ticket by running 'kinit'
- Then you will run through the notebook
Switch profiles to the usa_user account in your Chrome browser. Make sure to select the project you created in the step above.
NOTE: If the Chrome profile for the user does not show the user as part of an organization, close that browser and open an incognito browser and login and complete the lab.
In this example, the project is 'biglake-demov4' as shown below:
- Go to console.cloud.google.com
- Launch cloud shell
- Paste the below to create a personal authentication session
PROJECT_ID=`gcloud config list --format "value(core.project)" 2>/dev/null`
USER_PREFIX="usa"
gcloud dataproc clusters enable-personal-auth-session \
--project=${PROJECT_ID} \
--region=us-central1 \
--access-boundary=<(echo -n "{}") \
${USER_PREFIX}-dataproc-cluster
- You will be prompted with the below; respond with 'Y', followed by Enter
A personal authentication session will propagate your personal credentials to the cluster, so make sure you trust the cluster and the user who created it.
Do you want to continue (Y/n)?
- You will see the following text
Injecting initial credentials into the cluster usa-dataproc-cluster...done.
Periodically refreshing credentials for cluster usa-dataproc-cluster. This will continue running until the command is interrupted...working.
- LEAVE this Cloud Shell RUNNING while you complete the next steps, DO NOT SHUT DOWN
Still signed in as the USA user, in a separate tab in the same browser window, navigate to the cloud console (console.cloud.google.com) and then the Dataproc UI:
- Click on the usa-dataproc-cluster link
- Then click on the 'WEB INTERFACES' link
- Scroll to the bottom of the page and you should see a link for 'Jupyter Lab'
- Click on the 'JupyterLab' link (not to be confused with Jupyter) and this should bring up a new tab as shown below:
- In JupyterLab, click on "File" -> "New Launcher" and then "Terminal" (at the bottom of the screen, under 'Other')
- In the terminal, we will authenticate by running kinit; copy and paste the below into the terminal window:
kinit -kt /etc/security/keytab/dataproc.service.keytab dataproc/$(hostname -f)
- Next validate the creation of the Kerberos ticket by running the below command-
klist
Author's output-
Ticket cache: FILE:/tmp/krb5cc_1001
Default principal: dataproc/gdpsc-usa-dataproc-cluster-m.us-central1-a.c.biglake-dataproc-spark-lab.internal@US-CENTRAL1-A.C.BIGLAKE-DATAPROC-SPARK-LAB.INTERNAL
Valid starting Expires Service principal
10/18/22 14:44:05 10/19/22 00:44:05 krbtgt/US-CENTRAL1-A.C.BIGLAKE-DATAPROC-SPARK-LAB.INTERNAL@US-CENTRAL1-A.C.BIGLAKE-DATAPROC-SPARK-LAB.INTERNAL
renew until 10/25/22 14:44:05
- You can then close the terminal screen.
About the notebook:
This notebook demonstrates fine-grained, BigLake-powered permissions with an ice cream sales forecasting example: PySpark for preprocessing and Python with Prophet for forecasting, with the source data in a BigLake table.
- From the JupyterLab tab you created above, double click on the 'IceCream.ipynb' file as shown below...
- Then click on the icon on the right that says 'Python 3' with a circle next to it...
- A dialog box that says 'Select Kernel' will appear; choose 'PySpark' and hit Select.
- You can now run all cells, from the 'Run' -> 'Run All Cells' menu.
- Below cell 13, you should see data only for the 'United States' as shown below:
This concludes the exercise of row and column level security powered by BigLake. Let's repeat the same with the user aus_user@.
This section demonstrates how you can use BigLake to restrict access based on policies.
- Row Level Security: "aus_user" can only access data for (Country=)Australia in the IceCreamSales table
- Column Level Security: "aus_user" can see the columns Discount and Net_Revenue
Follow steps 5.2.1 through 5.2.4 from above, abbreviated here for your convenience -
- Login to an incognito browser as aus_user
- Use the command below to start a personal auth session in gcloud
PROJECT_ID=`gcloud config list --format "value(core.project)" 2>/dev/null`
USER_PREFIX="aus"
gcloud dataproc clusters enable-personal-auth-session \
--project=${PROJECT_ID} \
--region=us-central1 \
--access-boundary=<(echo -n "{}") \
${USER_PREFIX}-dataproc-cluster
- Log into the aus-dataproc-cluster cluster, and go to "WEB INTERFACES" and click on JupyterLab
- In JupyterLab, open terminal and run kinit to authenticate and get a ticket
kinit -kt /etc/security/keytab/dataproc.service.keytab dataproc/$(hostname -f)
- The major difference is that in cell 12, you should see data only for 'Australia', as shown below:
5.4. Principle of Least Privilege: Restricted column access for the marketing user (no access to financial data)
This section demonstrates how you can use BigLake to restrict access based on policies.
- Row Level Security: mkt_user@ can access data for any country in the IceCreamSales table (unlike aus_user@ and usa_user@ that could see data only for their country)
- Column Level Security: mkt_user@ can see all the columns except sensitive data columns Discount and Net_Revenue for which the user does not have permissions
Follow steps 5.2.1 through 5.2.4 from above, abbreviated here for your convenience -
- Login to an incognito browser as mkt_user
- Use the command below to start a personal auth session in gcloud
PROJECT_ID=`gcloud config list --format "value(core.project)" 2>/dev/null`
USER_PREFIX="mkt"
gcloud dataproc clusters enable-personal-auth-session \
--project=${PROJECT_ID} \
--region=us-central1 \
--access-boundary=<(echo -n "{}") \
${USER_PREFIX}-dataproc-cluster
- Log into the mkt-dataproc-cluster cluster, and go to "WEB INTERFACES" and click on JupyterLab
- In JupyterLab, open terminal and run kinit to authenticate and get a ticket
kinit -kt /etc/security/keytab/dataproc.service.keytab dataproc/$(hostname -f)
- Cell 6 will throw an error because mkt_user does not have access to all the columns; specifically, it does not have access to Discount and Net_Revenue.
Edit cell 5 as follows and run the rest of the cells. They should execute fine.
rawDF = spark.read \
.format("bigquery") \
.load(f"{PROJECT_NAME}.biglake_dataset.IceCreamSales") \
.select("Month", "Country", "Gross_Revenue")
To run the rest of the notebook from cell 5, go to the menu and click on "Run"->"Run Selected Cell And All Below"
- What's different is that mkt_user@ -
- Cannot see Discount and Net_Revenue
- Can see data for both Australia and the United States (the quick check sketched below confirms this)
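A minimal check, run as mkt_user@ in the notebook (this reuses the spark session and PROJECT_NAME from the notebook, and mirrors the edited cell 5 above):

```python
# Hedged check: both countries are visible to mkt_user@, while the
# policy-tagged columns Discount and Net_Revenue are not selectable.
rawDF = spark.read \
    .format("bigquery") \
    .load(f"{PROJECT_NAME}.biglake_dataset.IceCreamSales") \
    .select("Month", "Country", "Gross_Revenue")

rawDF.groupBy("Country").count().show()
```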
This concludes the validation of column level security with BigLake for the user, mkt_user@.
Congratulations on completing the lab!
You can (a) shut down the project altogether in the GCP Cloud Console, or (b) use Terraform to destroy the environment. Use (b) at your own risk, as it can be a little glitchy, while (a) is guaranteed to stop the billing meter promptly.
Needs to run in cloud shell from ~/biglake-finegrained-lab/demo
cd ~/biglake-finegrained-lab/demo
terraform destroy \
-var="project_id=${PROJECT_ID}" \
-var="project_nbr=${PROJECT_NBR}" \
-var="org_id=${ORG_ID}" \
-var="location=${LOCATION}" \
-var="usa_username=${USA_USERNAME}" \
-var="aus_username=${AUS_USERNAME}" \
-var="mkt_username=${MKT_USERNAME}" \
--auto-approve
This concludes the lab.