forked from iainrb/plinktools
-
Notifications
You must be signed in to change notification settings - Fork 0
Process Plink genotyping files in Python
License
wtsi-npg/plinktools
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
Plinktools: Python code to process Plink genotyping files Author: Iain Bancarz, [email protected] 1. INTRODUCTION Plinktools is a Python package for processing Plink format genotyping data. It was developed for the following applications not supported by the standard Plink executable. 1.1 Fast binary merge The standard Plink merge function performs extended cross-checking and recoding of data. This is rather slow, and does not scale well with an increasing number of inputs. To address this issue, Plinktools implements a fast binary merge for two important special cases: Congruent SNP sets with disjoint samples, and congruent samples with disjoint SNPs. In both cases, input and output is in the default SNP-major format. In the case of congruent samples, the fast merge strips off headers and concatenates the input .bed files with no need for recoding. For congruent SNPs, a fast merge can be done if the number of samples in each input is divisible by 4; otherwise recoding is necessary and the merge will be slowed, although still somewhat faster than Plink. 1.2 Heterozygosity calculation by high/low MAF As part of the Wellcome Trust Oxford SOP for exome chip QC, heterozygosity for each sample is calculated separately for SNPs with minor allele frequency above and below 1%. This test has been implemented in Plinktools. 1.3 Extended diff A wrapper for Plink's capability to compare two datasets. It constructs and executes the appropriate command, computes some additional statistics, and records the results in machine-readable JSON format. The script reports all differing calls, and those which differ only by transposition of major and minor alleles. (Such a transposition may be introduced by certain Plink recoding operations.) The extended diff also supports a simple equivalence test on two datasets, via the --brief option. The standard Plink executable does not provide a simple yes/no diff. The extended diff obtains its input solely from the output of the Plink executable. Specifically, it parses text output from Plink 1.0.7 and is not guaranteed to work with any other version. It does not use any of the plinktools Python classes to parse .bed files directly, although it could be adapted to do so. 2. USAGE 2.1 Contents Plinktools includes three front-end scripts: merge_bed.py to merge two or more Plink binary datasets het_by_maf.py to compute heterozygosity for high/low MAF diff.py for extended diff Run any script with --help for more information. 2.2 Prerequisites Python >= 2.7 Plink executable should be on the PATH environment variable: Required for tests and diff functionality. 3. SEE ALSO The Plink data format was created by Shaun Purcell et al. See: http://pngu.mgh.harvard.edu/~purcell/plink Plinktools was created to support the WTSI genotyping pipeline: https://github.com/wtsi-npg/genotyping Plinktools is a prerequisite for the WTSI extension of the zCall genotype caller: https://github.com/wtsi-npg/zCall
About
Process Plink genotyping files in Python
Resources
License
Stars
Watchers
Forks
Packages 0
No packages published
Languages
- Python 100.0%