This repository compiles prescriptive guidance and code samples demonstrating how to operationalize AlphaFold batch inference using Vertex AI Pipelines.
The following diagram depicts the architecture of the solution.
Key design patterns:
- Inference pipelines are implemented using Kubeflow Pipelines (KFP) SDK v2
- Feature engineering, model inference, and protein relaxation steps of AlphaFold inference are encapsulated in reusable KFP components
- Where possible, inference steps run in parallel
- Each step runs on the hardware best suited to it. For example, data preprocessing steps run on CPUs, while model predictions and structure relaxations run on GPUs
- A high-performance NFS file system, Cloud Filestore, is used to manage the genetic databases used during the inference workflow
- Artifacts and metadata created during pipeline runs are tracked in Vertex ML Metadata, supporting robust experiment management and analysis
The core of the solution is a set of parameterized KFP components that encapsulate key tasks in the AlphaFold inference workflow. The AlphaFold KFP components can be composed to implement optimized inference workflows. Currently, the repo contains two example pipelines:
- Universal pipeline. This pipeline mirrors the functionality and settings of the AlphaFold inference script but optimizes elapsed time and compute resource utilization. The pipeline orchestrates the inference workflow using three discrete tasks: feature engineering, model prediction, and protein relaxation. The feature engineering task wraps the AlphaFold data pipelines. The model prediction and protein relaxation tasks wrap the AlphaFold model runner and the AlphaFold Amber relaxation, respectively. The feature engineering task runs on a CPU-only compute node, while the model prediction and relaxation steps run on GPU compute nodes and are parallelized. For example, for a monomer scenario with the default settings, the pipeline starts 5 model prediction operations in parallel on 5 GPU nodes.
- Monomer optimized pipeline. This pipeline demonstrates how to further optimize the inference workflow by parallelizing feature engineering steps. It uses KFP components that encapsulate genetic database search tools (hhsearch, jackhmmer, hhblits, etc.) to execute database searches in parallel. Each tool runs on the CPU platform best suited to it. For example, hhblits and hhsearch run on C2 series machines that feature Intel processors with the AVX2 instruction set, while jackhmmer runs on an N2 series machine. This pipeline only supports folding monomers.
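The task graph of the Universal pipeline can be illustrated with a small plain-Python sketch. This is not KFP code and not the repo's implementation: the three functions are stand-ins for the feature engineering, model prediction, and relaxation tasks, the `model_1` to `model_5` names are illustrative, and a thread pool stands in for the parallel GPU nodes. It only shows the shape of the workflow, assuming one feature engineering run followed by five independent predict-then-relax branches.

```python
from concurrent.futures import ThreadPoolExecutor

# Assumption: five branches, matching the monomer default described above.
NUM_MODELS = 5


def feature_engineering(sequence):
    """Stand-in for the CPU-only feature engineering task (runs once)."""
    return {"sequence": sequence, "num_residues": len(sequence)}


def predict(features, model_index):
    """Stand-in for one model prediction task (one per GPU node)."""
    return {"model": f"model_{model_index}", "num_residues": features["num_residues"]}


def relax(prediction):
    """Stand-in for the relaxation task that follows each prediction."""
    return {**prediction, "relaxed": True}


def predict_and_relax(features, model_index):
    """One independent branch of the fan-out: predict, then relax."""
    return relax(predict(features, model_index))


def run_universal_workflow(sequence):
    """Run feature engineering once, then fan out the model branches."""
    features = feature_engineering(sequence)
    with ThreadPoolExecutor(max_workers=NUM_MODELS) as pool:
        futures = [
            pool.submit(predict_and_relax, features, index)
            for index in range(1, NUM_MODELS + 1)
        ]
        # Results are collected in submission order, not completion order.
        return [future.result() for future in futures]
```

In the actual pipelines each branch is a KFP task scheduled on its own compute node, so the branches share nothing except the feature engineering output.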
The repository also includes a set of Jupyter notebooks that demonstrate how to configure, submit, and analyze pipeline runs.
src/components
- KFP components encapsulating AlphaFold inference tasks
src/pipelines
- Example inference pipelines
env-setup
- Terraform for setting up a sandbox environment
Jupyter notebooks are located in the root of the repo.
AlphaFold inference utilizes a set of genetic databases. To maximize database search performance when running multiple inference pipelines concurrently, the databases are hosted on a high performance NFS file share managed by Cloud Filestore.
Before running the pipelines you need to configure a Cloud Filestore instance and populate it with genetic databases.
The Environment requirements section describes how to configure the GCP environment required to run the pipelines, including Cloud Filestore configuration.
The repo also includes an example Terraform configuration that builds a sandbox environment meeting the requirements. If you intend to use the provided Terraform configuration you need to pre-stage the genetic databases and model parameters in a Google Cloud Storage bucket. When the Terraform configuration is applied, the databases will be copied from the GCS bucket to the provisioned Filestore instance and the model parameters will be copied to the provisioned regional GCS bucket.
Follow the instructions on the AlphaFold repo to download the genetic databases and model parameters. Make sure to download both the full size and the reduced version of BFD.
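Before pre-staging the download in GCS, it may help to confirm that the local download contains both BFD variants and the model parameters. The following is a minimal sketch, not part of the repo. The folder names in `EXPECTED_DATABASE_DIRS` are assumptions based on the layout described on the AlphaFold repo, and the list is deliberately partial, so adjust it to the AlphaFold release you downloaded.

```python
from pathlib import Path

# Assumed top-level folder names; verify against your AlphaFold download.
EXPECTED_DATABASE_DIRS = (
    "bfd",        # full size BFD
    "small_bfd",  # reduced BFD
    "mgnify",
    "pdb70",
    "pdb_mmcif",
    "uniref90",
    "params",     # model parameters
)


def missing_database_dirs(download_root, expected=EXPECTED_DATABASE_DIRS):
    """Return the expected folders that are absent or empty under download_root."""
    root = Path(download_root)
    missing = []
    for name in expected:
        folder = root / name
        # A folder that exists but holds no entries is treated as not downloaded.
        if not folder.is_dir() or not any(folder.iterdir()):
            missing.append(name)
    return missing
```

Calling `missing_database_dirs` with the path of your download directory returns an empty list when every expected folder is present and non-empty. The check only looks at top-level folders and does not verify file contents or sizes.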
The diagram below summarizes the Google Cloud environment configuration required to run AlphaFold inference pipelines.
- All services should be provisioned in the same project and the same compute region
- To maintain high performance access to genetic databases, the database files are stored on an instance of Cloud Filestore. To integrate the instance with Vertex AI services the following configuration is required:
- A Filestore instance should be provisioned on a VPC that is peered to the Google services network.
- A Filestore instance should be provisioned with the connect-mode setting set to PRIVATE_SERVICE_ACCESS
- An NFS file share that hosts genetic databases must be accessible without authentication
- All genetic databases referenced on the AlphaFold repo, including both a full and a reduced size BFD, should be copied to the file share. The database files can be arranged in any folder layout. However, if possible, we recommend using the same directory structure as described on the AlphaFold repo, as this is the default configuration of the example inference pipelines. If a different directory structure is preferable, the pipelines can be easily modified.
- Vertex Pipelines should be used with a custom service account. The account should be provisioned with the following roles:
- storage.admin
- aiplatform.user
- An instance of Vertex Workbench is used as a development environment to customize pipelines and submit and analyze pipeline runs. The instance should be provisioned on the same VPC as the instance of Filestore.
- A regional GCS bucket located in the same region as Vertex AI services is used to manage artifacts created by pipelines.
- AlphaFold model parameters should be copied to the regional bucket. The pipelines assume that the parameters can be retrieved from the bucket. The default location for the parameters configured in the pipelines is gs://<BUCKET_NAME>/params.
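The requirements in this list can be summarized as a small checklist. The sketch below is illustrative only. `EnvironmentSpec` and its field names are hypothetical and do not correspond to any type in the repo or in the Google Cloud client libraries. The checks simply restate the bullets: one region, PRIVATE_SERVICE_ACCESS connect mode, Workbench on the Filestore VPC, and the two service account roles.

```python
from dataclasses import dataclass

# Roles named in the requirements list.
REQUIRED_ROLES = ("storage.admin", "aiplatform.user")


@dataclass
class EnvironmentSpec:
    """Hypothetical description of an environment, filled in by hand."""
    vertex_region: str
    filestore_region: str
    bucket_region: str
    filestore_connect_mode: str
    filestore_network: str
    workbench_network: str
    service_account_roles: tuple


def find_violations(spec):
    """Return a list of requirement violations; an empty list means none found."""
    violations = []
    regions = {spec.vertex_region, spec.filestore_region, spec.bucket_region}
    if len(regions) > 1:
        violations.append("services are not all in the same region")
    if spec.filestore_connect_mode != "PRIVATE_SERVICE_ACCESS":
        violations.append("Filestore connect-mode is not PRIVATE_SERVICE_ACCESS")
    if spec.filestore_network != spec.workbench_network:
        violations.append("Workbench is not on the same VPC as Filestore")
    for role in REQUIRED_ROLES:
        if role not in spec.service_account_roles:
            violations.append(f"service account is missing role {role}")
    return violations


def default_params_uri(bucket_name):
    """Build the default model parameter location, gs://<BUCKET_NAME>/params."""
    return f"gs://{bucket_name}/params"
```

The checklist does not query Google Cloud. It only compares values you supply, so it is best treated as a reading aid for the requirements rather than a validation tool.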
The repo includes an example Terraform configuration that can be used to provision a sandbox environment that complies with the requirements detailed in the previous section. The configuration builds the sandbox environment as follows:
- Creates a VPC and a subnet to host a Filestore instance and a Vertex Workbench instance
- Configures VPC Peering between the VPC and the Google services network
- Creates a Filestore instance
- Creates a regional GCS bucket
- Creates a Vertex Workbench instance
- Creates service accounts for Vertex AI
- Copies the genetic databases from a pre-staging GCS location to the Filestore file share
- Copies the AlphaFold model parameters from a pre-staging GCS location to the provisioned regional GCS bucket
You need to be a project owner to set up the sandbox environment.
You will be using Cloud Shell to start and monitor the Terraform setup process.
In the Google Cloud Console, navigate to your project and open Cloud Shell. Make sure you are logged on as the project's owner.
Run the following commands to enable the required services.
export PROJECT_ID=<YOUR PROJECT ID>
gcloud config set project $PROJECT_ID
gcloud services enable \
cloudbuild.googleapis.com \
compute.googleapis.com \
cloudresourcemanager.googleapis.com \
iam.googleapis.com \
container.googleapis.com \
cloudtrace.googleapis.com \
iamcredentials.googleapis.com \
monitoring.googleapis.com \
logging.googleapis.com \
notebooks.googleapis.com \
aiplatform.googleapis.com \
file.googleapis.com \
servicenetworking.googleapis.com
First, clone the repo.
git clone https://github.com/GoogleCloudPlatform/vertex-ai-alphafold-inference-pipeline.git
cd vertex-ai-alphafold-inference-pipeline/env-setup
Set the environment variables below to reflect your environment. Terraform will attempt to create new resources, so make sure that resources with the specified names do not already exist.
REGION
- your compute region
ZONE
- your compute zone
NETWORK_NAME
- the name for the VPC network
SUBNET_NAME
- the name for the VPC subnet
WORKBENCH_INSTANCE_NAME
- the name for the Vertex Workbench instance
FILESTORE_INSTANCE_ID
- the instance ID of the Filestore instance. See Naming your instance
GCS_BUCKET_NAME
- the name of the GCS regional bucket. See Bucket naming guidelines
GCS_DBS_PATH
- the path to the GCS location of the genetic databases and model parameters. Terraform will copy the databases, replicating the folder structure on GCS. Terraform will also copy the model parameters to the regional bucket. The parameters should be in <GCS_DBS_PATH>/params
export REGION=<YOUR REGION>
export ZONE=<YOUR ZONE>
export NETWORK_NAME=<YOUR NETWORK NAME>
export SUBNET_NAME=<YOUR SUBNET NAME>
export WORKBENCH_INSTANCE_NAME=<YOUR WORKBENCH INSTANCE NAME>
export FILESTORE_INSTANCE_ID=<YOUR INSTANCE ID>
export GCS_BUCKET_NAME=<YOUR BUCKET NAME>
export GCS_DBS_PATH=<YOUR GCS LOCATION FOR GENETIC DBS>
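Because Terraform consumes every one of these variables, it can be useful to confirm that none was left unset before starting. The following is a minimal sketch under that assumption, not part of the repo. It checks only that each variable has a non-empty value that is not an angle-bracket placeholder; it does not validate names against Filestore or GCS naming rules.

```python
import os

# Variables exported in the steps above, including PROJECT_ID from earlier.
REQUIRED_VARIABLES = (
    "PROJECT_ID",
    "REGION",
    "ZONE",
    "NETWORK_NAME",
    "SUBNET_NAME",
    "WORKBENCH_INSTANCE_NAME",
    "FILESTORE_INSTANCE_ID",
    "GCS_BUCKET_NAME",
    "GCS_DBS_PATH",
)


def unset_variables(env=None):
    """Return the required variables that are missing, empty, or placeholders."""
    env = os.environ if env is None else env
    problems = []
    for name in REQUIRED_VARIABLES:
        value = env.get(name, "").strip()
        # Treat values such as "<YOUR REGION>" as not yet filled in.
        is_placeholder = value.startswith("<") and value.endswith(">")
        if not value or is_placeholder:
            problems.append(name)
    return problems
```

Running `unset_variables()` in a Python session started from the same shell returns an empty list when all nine variables are set.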
Start Terraform configuration. This step may take a few minutes so be patient.
terraform init
terraform apply \
-var=project_id=$PROJECT_ID \
-var=region=$REGION \
-var=zone=$ZONE \
-var=network_name=$NETWORK_NAME \
-var=subnet_name=$SUBNET_NAME \
-var=workbench_instance_name=$WORKBENCH_INSTANCE_NAME \
-var=filestore_instance_id=$FILESTORE_INSTANCE_ID \
-var=gcs_bucket_name=$GCS_BUCKET_NAME \
-var=gcs_dbs_path=$GCS_DBS_PATH
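Each `-var` flag in the command above is the lowercase form of the corresponding environment variable. The sketch below, which is an illustration and not part of the repo, builds the argument list from that mapping so the same variables can be reused for both `apply` and `destroy`. The function only constructs the command; it does not run Terraform.

```python
# Environment variables passed to Terraform, in the order used above.
TERRAFORM_VARIABLES = (
    "PROJECT_ID",
    "REGION",
    "ZONE",
    "NETWORK_NAME",
    "SUBNET_NAME",
    "WORKBENCH_INSTANCE_NAME",
    "FILESTORE_INSTANCE_ID",
    "GCS_BUCKET_NAME",
    "GCS_DBS_PATH",
)


def terraform_command(action, env):
    """Build the terraform argument list for 'apply' or 'destroy'."""
    if action not in ("apply", "destroy"):
        raise ValueError(f"unsupported action: {action}")
    command = ["terraform", action]
    for name in TERRAFORM_VARIABLES:
        # A missing variable raises KeyError rather than passing an empty value.
        command.append(f"-var={name.lower()}={env[name]}")
    return command
```

The returned list can be passed to `subprocess.run` with `os.environ` as the `env` argument to this function, from the env-setup directory after `terraform init` has been run.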
In addition to provisioning and configuring the required services, the Terraform configuration starts a Vertex Training job that copies the reference databases from the GCS location to the provisioned Filestore instance. You can monitor the job using the links printed out by Terraform. The job may take a couple of hours to complete.
In the sandbox environment, an instance of Vertex Workbench is used as a development and experimentation environment to customize, start, and analyze inference pipeline runs. A couple of setup steps are required before you can use the example notebooks.
Connect to JupyterLab on your Vertex Workbench instance and start a JupyterLab terminal.
Step 1. Clone the repo from the JupyterLab terminal
git clone https://github.com/GoogleCloudPlatform/vertex-ai-alphafold-inference-pipeline.git
Step 2. Build the container image that encapsulates custom KFP components used by the inference pipelines
PROJECT_ID=$(gcloud config list --format 'value(core.project)')
IMAGE_URI=gcr.io/${PROJECT_ID}/alphafold-components
cd vertex-ai-alphafold-inference-pipeline
gcloud builds submit --timeout "2h" --tag ${IMAGE_URI} . --machine-type=e2-highcpu-8
You are now ready to walk through the sample notebooks that demonstrate how to run and customize pipelines.
Before walking through the example notebooks, make sure that the Vertex Training job that populates the Filestore instance has completed.
If you want to remove the resources created for the demo, execute the following commands from Cloud Shell.
cd ~/vertex-ai-alphafold-inference-pipeline/env-setup
terraform destroy \
-var=project_id=$PROJECT_ID \
-var=region=$REGION \
-var=zone=$ZONE \
-var=network_name=$NETWORK_NAME \
-var=subnet_name=$SUBNET_NAME \
-var=workbench_instance_name=$WORKBENCH_INSTANCE_NAME \
-var=filestore_instance_id=$FILESTORE_INSTANCE_ID \
-var=gcs_bucket_name=$GCS_BUCKET_NAME \
-var=gcs_dbs_path=$GCS_DBS_PATH