
How to query multi-sample VCF data with Hail


Introduction

This page describes how to query the Polaris 1 Diversity Cohort data using Hail.

Prerequisite: access to the data in BaseSpace

Setting up your execution environment

We will set up an Amazon EC2 instance in the same region as the data for faster transfer and lower latency.

  • Launch an Amazon EC2 instance in the Frankfurt region (a.k.a. eu-central-1)

    • The size depends on the queries you plan to run. For a first experiment, you can try the following properties (an AWS CLI sketch follows this list):
      • AMI: Ubuntu Server 16.04 LTS
      • Instance size: r3.2xlarge (8 vCPU, 61 GB RAM, 160 GB SSD, $0.665/hour)
      • Root disk space: 100 GB
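If you prefer the AWS CLI to the web console, a launch along these lines should work; the AMI ID, key pair, and security group below are placeholders to replace with your own (the root volume size matches the 100 GB above):

 aws ec2 run-instances \
   --region eu-central-1 \
   --image-id ami-XXXXXXXX \
   --instance-type r3.2xlarge \
   --key-name my-key-pair \
   --security-group-ids sg-XXXXXXXX \
   --block-device-mappings '[{"DeviceName":"/dev/sda1","Ebs":{"VolumeSize":100}}]'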
  • Install Docker to access our prebuilt image containing Hail and BaseSpace CLI tools (see appendix if you prefer to install them yourself)

 sudo apt update
 sudo apt install -y docker.io
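Before pulling our image, you can optionally confirm that Docker is working with its built-in hello-world check:

 sudo docker run --rm hello-world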
  • Launch our prebuilt Docker image and authenticate with BaseSpace to access the data remotely
 sudo docker run -it --rm -v /tmp:/tmp illuminabioinformatics/hail-and-basespace-cli
 <You should now be inside your Docker container>
 
 bs authenticate --api-server=https://api.euc1.sh.basespace.illumina.com
 <Open the URL in the browser you usually use to log in to BaseSpace>
 
 # Check that you see the data
 bs list datasets --project-name "Polaris 1 Diversity Cohort"
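Note that the --rm flag discards the container, and your BaseSpace login with it, when you exit. If you restart the container often, you could persist credentials by also mounting the CLI's config directory; the ~/.basespace path is our assumption about where bs keeps its configuration, so verify it in your container first:

 sudo docker run -it --rm -v /tmp:/tmp -v $HOME/.basespace:/root/.basespace illuminabioinformatics/hail-and-basespace-cli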
  • Copy the hail.vds directory locally (it should take about 5 minutes)
 bscp "config://~default/Projects/Polaris 1 Diversity Cohort/AppResult/Hail_ingest" /tmp

 # Recreate an empty file that bscp forgets to copy (until we fix that bug)
 touch /tmp/hail.vds/rdd.parquet/_SUCCESS
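At this point it is worth sanity-checking the local copy; exact sizes will vary, but the directory should contain the VDS metadata alongside the rdd.parquet subdirectory:

 ls /tmp/hail.vds
 du -sh /tmp/hail.vds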

You are now ready to run your first Hail query.

Running Hail

python
<You are now inside the Python interpreter>

import hail
hc = hail.HailContext()
vds = hc.read("/tmp/hail.vds")
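
# Optional sanity check before querying: the number of samples in the dataset.
# sample_ids is our assumption for Hail 0.1's API; adjust if your version differs.
print len(vds.sample_ids)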

# Count singletons on chromosome 21
print vds.filter_intervals(hail.Interval.parse('21')).filter_variants_expr("va.pass && va.qc.AC==1").count()

# Count singletons across the whole dataset
#  The first line below returns immediately: Spark RDD operations are lazy and are only evaluated when needed
#  The second line triggers evaluation of the filter_variants_expr query and should take about 5 minutes
vds_singletons = vds.filter_variants_expr("va.pass && va.qc.AC==1") 
print vds_singletons.count()
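
If you plan to reuse the singleton set, you can persist it so later sessions skip the five-minute filter. A minimal sketch, assuming Hail 0.1's write and export_variants methods (check your version's documentation for the exact signatures):

# Persist the filtered dataset for later sessions
vds_singletons.write("/tmp/singletons.vds")

# Export the singleton sites and their allele counts to a tab-separated file
vds_singletons.export_variants("/tmp/singletons.tsv", "variant = v, AC = va.qc.AC")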