Skip to content
forked from ergo70/vcf_fdw

A multicorn based foreign data wrapper for VCF 4.0 / 4.1 files using PyVCF.

License

Notifications You must be signed in to change notification settings

mulroony/vcf_fdw

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

9 Commits
 
 
 
 
 
 
 
 

Repository files navigation

A multicorn based foreign data wrapper for VCF 4.0 / 4.1 files using PyVCF.

Requirements

  1. PostgreSQL 9.2 or later
  2. Multicorn 1.3.2 or later
  3. PyVCF 0.6.8 or later
  4. One VCF file per sample compressed with bgzip and tabix indexed

The filesystem layout must be like this:

  • <basedir>
    • <species>
      • <sample><suffix>
      • <sample><suffix>.tbi

This example is for data from the 1000 Genomes project, assuming a suffix of .vcf.gz:

  • <basedir>
    • human
      • HG00099.vcf.gz
      • HG00099.vcf.gz.tbi
      • HG00100.vcf.gz
      • HG00100.vcf.gz.tbi
      • ..more..

Installation

  1. Install the multicorn extension
  2. sudo python setup.py install
  3. Execute the following SQL commands in PostgreSQL:
CREATE EXTENSION multicorn;

CREATE SERVER multicorn_vcf FOREIGN DATA WRAPPER multicorn
OPTIONS (wrapper 'vcf_fdw.VCFForeignDataWrapper');

CREATE FOREIGN TABLE vcf.<species>
   (chrom text ,
    pos integer ,
    id text ,
    ref text ,
    alt text[] ,
    qual real ,
    heterozygosity real ,
    sample text ,
    species text ,
    info text ,
    depth integer ,
    partition integer ,
    genotype text ,
    filter text ,
    issnp boolean ,
    issv boolean ,
    isindel boolean ,
    ismonomorphic boolean ,
    isdeletion boolean ,
    issvprecise boolean ,
    istransition boolean ,
    source text )
   SERVER multicorn_vcf
   OPTIONS (basedir '<basedir>', species '<species>', suffix '<suffix>');

Basic Usage

CAUTION: Since tabix indexes only chromosome and position, query performance suffers for queries that do not specify chromosome and/or position in the WHERE clause!

The query below retrieves info and genotype for all samples in the vcf files within the specified region.

SELECT info, genotype FROM vcf.<species> WHERE chrom = '1' AND pos between 500000 AND 10000000;

The query below retrieves info and genotype for all named samples in the vcf files within the specified region.

SELECT info, genotype FROM vcf.<species> WHERE chrom = '1' AND pos between 500000 AND 10000000 AND sample IN ('HG00099','HG00100','HG00101','HG00230');

The query below retrieves info and genotype for all named chromosomes and all named samples in the vcf files within the specified region.

SELECT info, genotype FROM vcf.<species> WHERE chrom in ('1','X') AND pos between 500000 AND 10000000 AND sample IN ('HG00099','HG00100','HG00101','HG00230');

The query below retrieves info and genotype for all named samples in the vcf files within the specified region filtered by SNP.

SELECT info, genotype, issnp FROM vcf.<species> WHERE chrom = '1' AND pos between 500000 AND 10000000 AND sample IN ('HG00099','HG00100','HG00101','HG00230') AND issnp = TRUE;

Disclaimer

This software comes without any warranty whatsoever. Use at your own risk.

About

A multicorn based foreign data wrapper for VCF 4.0 / 4.1 files using PyVCF.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Python 100.0%