
How to query multi-sample VCF data with Hail


Introduction

This page describes how to query the Polaris 1 Diversity Cohort data using Hail.

Prerequisite: access to the data in BaseSpace

Setting up your execution environment

We will set up an Amazon EC2 instance in the same region as the data for faster transfer and lower latency.

  • Launch an Amazon EC2 instance in the Frankfurt region (a.k.a. eu-central-1)

    • The size depends on the queries you plan to run. For a first experiment, you can try the following properties (an AWS CLI sketch follows this list):
      • AMI: Ubuntu Server 16.04 LTS
      • Instance size: r3.2xlarge (8 vCPU, 61 GB RAM, 160 GB SSD, $0.665/hour)
      • Root disk space: 100 GB
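If you prefer the AWS CLI to the web console, a launch along these lines should work; the AMI ID, key pair, and security group below are placeholders to replace with your own (the root volume size matches the 100 GB above):

 aws ec2 run-instances \
   --region eu-central-1 \
   --image-id ami-XXXXXXXX \
   --instance-type r3.2xlarge \
   --key-name my-key-pair \
   --security-group-ids sg-XXXXXXXX \
   --block-device-mappings '[{"DeviceName":"/dev/sda1","Ebs":{"VolumeSize":100}}]'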
  • Install Docker to access our prebuilt image containing Hail and BaseSpace CLI tools (see appendix if you prefer to install them yourself)

 sudo apt update
 sudo apt install -y docker.io
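Before pulling our image, you can optionally confirm that Docker is working with its built-in hello-world check:

 sudo docker run --rm hello-world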
  • Launch our prebuilt Docker image and authenticate with BaseSpace to access the data remotely
 sudo docker run -it --rm -v /tmp:/tmp illuminabioinformatics/hail-and-basespace-cli
 <You should now be inside your Docker container>
 
 bs authenticate --api-server=https://api.euc1.sh.basespace.illumina.com
 <Open the URL in the browser you usually use to log in to BaseSpace>
 
 # Check that you see the data
 bs list datasets --project-name "Polaris 1 Diversity Cohort"
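Note that the --rm flag discards the container, and your BaseSpace login with it, when you exit. If you restart the container often, you could persist credentials by also mounting the CLI's config directory; the ~/.basespace path is our assumption about where bs keeps its configuration, so verify it in your container first:

 sudo docker run -it --rm -v /tmp:/tmp -v $HOME/.basespace:/root/.basespace illuminabioinformatics/hail-and-basespace-cli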
  • Copy the hail.vds directory locally (it should take about 5 minutes)
 bscp "config://~default/Projects/Polaris 1 Diversity Cohort/AppResult/Hail_ingest" /tmp

 # Recreate an empty file that bscp forgets to copy (until we fix that bug)
 touch /tmp/hail.vds/rdd.parquet/_SUCCESS
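At this point it is worth sanity-checking the local copy; exact sizes will vary, but the directory should contain the VDS metadata alongside the rdd.parquet subdirectory:

 ls /tmp/hail.vds
 du -sh /tmp/hail.vds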

You are now ready to run your first Hail query.

Running Hail

python
<You are now inside the Python interpreter>

import hail
hc = hail.HailContext()
vds = hc.read("/tmp/hail.vds")
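
# Optional sanity check before querying: the number of samples in the dataset.
# sample_ids is our assumption for Hail 0.1's API; adjust if your version differs.
print len(vds.sample_ids)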

# Count singletons on chromosome 21
print vds.filter_intervals(hail.Interval.parse('21')).filter_variants_expr("va.pass && va.qc.AC==1").count()

# Count singletons across the whole dataset
#  The first line below returns immediately: Spark RDD operations are lazy and are only evaluated when needed
#  The second line triggers evaluation of the filter_variants_expr query and should take about 5 minutes
vds_singletons = vds.filter_variants_expr("va.pass && va.qc.AC==1") 
print vds_singletons.count()
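
If you plan to reuse the singleton set, you can persist it so later sessions skip the five-minute filter. A minimal sketch, assuming Hail 0.1's write and export_variants methods (check your version's documentation for the exact signatures):

# Persist the filtered dataset for later sessions
vds_singletons.write("/tmp/singletons.vds")

# Export the singleton sites and their allele counts to a tab-separated file
vds_singletons.export_variants("/tmp/singletons.tsv", "variant = v, AC = va.qc.AC")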