How to query multi-sample VCF data with Hail
This page describes how to query the Polaris 1 Diversity Cohort data using Hail.
Prerequisite: access to the data in BaseSpace.
We will set up an Amazon EC2 instance in the same region as the data for faster data transfer and lower latency.
- Launch an Amazon EC2 instance in the Frankfurt region (a.k.a. eu-central-1). The right size depends on the queries you plan to run; for a first experiment, you can try the following properties (a scripted alternative is sketched after this list):
  - AMI: Ubuntu Server 16.04 LTS
  - Instance type: r3.2xlarge (8 vCPUs, 61 GB RAM, 160 GB SSD, $0.665/hour)
  - Root disk space: 100 GB
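If you prefer to launch the instance from a script rather than the console, here is a minimal sketch using boto3. It is not part of the original instructions, and the AMI ID and key pair name are placeholders you would replace with your own values.

```python
# Hypothetical sketch: launch the EC2 instance with boto3 (pip install boto3).
# The AMI ID and key pair name below are placeholders, not real values.
import boto3

ec2 = boto3.client("ec2", region_name="eu-central-1")  # Frankfurt
response = ec2.run_instances(
    ImageId="ami-XXXXXXXX",     # placeholder: current Ubuntu 16.04 LTS AMI for eu-central-1
    InstanceType="r3.2xlarge",  # 8 vCPUs, 61 GB RAM
    KeyName="my-key-pair",      # placeholder: your EC2 key pair
    MinCount=1,
    MaxCount=1,
    BlockDeviceMappings=[
        # 100 GB root volume; Ubuntu AMIs use /dev/sda1 as the root device
        {"DeviceName": "/dev/sda1", "Ebs": {"VolumeSize": 100}}
    ],
)
print(response["Instances"][0]["InstanceId"])
```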
- Install Docker to access our prebuilt image containing Hail and the BaseSpace CLI tools (see the appendix if you prefer to install them yourself):

```bash
sudo apt update
sudo apt install -y docker.io
```
- Launch our prebuilt Docker image and authenticate with BaseSpace to access the data remotely:

```bash
sudo docker run -it --rm -v /tmp:/tmp illuminabioinformatics/hail-and-basespace-cli
# You should now be inside the Docker container
bs authenticate --api-server=https://api.euc1.sh.basespace.illumina.com
# Open the URL in the browser you usually use to log in to BaseSpace
# Check that you can see the data
bs list datasets --project-name "Polaris 1 Diversity Cohort"
```
- Copy the hail.vds directory locally (this should take about 5 minutes):

```bash
bscp "config://~default/Projects/Polaris 1 Diversity Cohort/AppResult/Hail_ingest" /tmp
# Recreate an empty file that bscp forgets to copy (until we fix that bug)
touch /tmp/hail.vds/rdd.parquet/_SUCCESS
```
You are now ready to run your first Hail query. Start a Python interpreter inside the container with `python`, then:

```python
import hail

hc = hail.HailContext()
vds = hc.read("/tmp/hail.vds")

# Count singletons on chromosome 21
print vds.filter_intervals(hail.Interval.parse('21')).filter_variants_expr("va.pass && va.qc.AC == 1").count()

# Count singletons in the whole dataset.
# The first statement below returns immediately: Spark RDD operations are lazy
# and only evaluated when actually needed. The count() call triggers evaluation
# of the filter_variants_expr query and should take about 5 minutes.
vds_singletons = vds.filter_variants_expr("va.pass && va.qc.AC == 1")
print vds_singletons.count()
```
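Once you have a filtered dataset, you may want to write the sites out for downstream work. As a sketch (not part of the original walkthrough), Hail 0.1's export_variants can dump the singleton sites to a TSV; the column expressions below assume Hail 0.1's variant schema:

```python
# Sketch: write the singleton sites to a tab-separated file.
# Column expressions use Hail 0.1's expression language (v.contig, v.start, ...).
vds_singletons.export_variants(
    "/tmp/singletons.tsv",
    "Chrom = v.contig, Pos = v.start, Ref = v.ref, Alt = v.alt()"
)
```

Like the count above, this triggers evaluation of the lazy pipeline, so expect it to take a similar amount of time.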