
R Classes for Population Genetic Data

Christine Ewers-Saucedo edited this page Mar 9, 2015 · 12 revisions

The goal of this page is to discuss ways to encode population genetic data in R. The table below lists the currently available classes for storing population genetic data, with their strengths and weaknesses.

| Class (package) | Strengths | Weaknesses |
|---|---|---|
| DNAbin (ape) | stores all sets of sequences (aligned or not)<br>uses matrices (aligned) or lists, so the usual R commands (names, rownames, [, [[, $) can be used<br>many as.DNAbin methods in ape (incl. from BioConductor)<br>efficient functions in ape (dist.dna, seg.sites, base.freq, read.FASTA) and in pegas (haplotype) | less compact than 2-bit coding (but by a factor of 4 at most) |
| loci (pegas) | low memory usage<br>all levels of ploidy and any number of alleles<br>genotypes can be phased<br>any kind of individual data can be associated in the data frame<br>efficient to compute genotype and allele frequencies | not really appropriate for some analyses (e.g., multivariate analyses)<br>the treatment of NAs needs improvement (especially when data are read with read.vcf()) |
| genind (adegenet) | stores allelic frequencies; ideal for multivariate analyses<br>additional slots for individual data<br>all levels of ploidy | requires more memory<br>less efficient to compute frequencies |
| genpop (adegenet) | equivalent to genind at group level; ideal for multivariate analysis | requires more memory |
| genlight (adegenet) | stores binary SNPs using bit-level coding; very memory efficient<br>all levels of ploidy | more computationally intensive to handle; less functionality<br>assumes bi-allelic loci |
| genclone (poppr) | stores population metadata (@hierarchy slot); users don't have to keep track of multiple objects for the same data<br>stores multilocus genotype definitions (@mlg slot) for clonal populations | inherits from the genind object, so it has all the same weaknesses plus slightly more memory |
| genambig (polysat) | stores microsatellite data with ambiguous ploidy<br>exports to genpop objects | does not handle any other data type<br>cannot easily be transferred to any other object |
| phyDat (phangorn) | very general: inspired by R's data.frame, factor and contrasts; can contain any discrete data type; nucleotides, amino acids and codons have some more support<br>can be converted to and from DNAbin objects (as.DNAbin / as.phyDat)<br>a few generic functions work on it (c, unique, subset) as well as utility functions (baseFreq, allSitePattern, etc.)<br>"efficient" maximum likelihood, maximum parsimony and distance functions in phangorn | designed with phylogenetic analysis in mind; requires alignments, where all sequences have the same length<br>data are not necessarily very memory efficient (stored as integer + contrast matrix), but only unique site patterns and their weights (as double) are stored |
| gtype (strataG) | a simple R list containing a matrix where the first column is a stratification scheme and the following columns are either haplotypes or diploid loci; for haploid data, the gtype object can also contain a list of DNA sequences<br>can be converted to a data.frame or matrix with the appropriate as. functions<br>has manipulation functions such as subset (selects certain strata and/or loci), merge (combines multiple gtypes) and summary<br>can create input files for Genepop, STRUCTURE, fastsimcoal, Arlequin, MEGA, and PHASE | can likely be made more efficient in terms of storage and preprocessing for other analytical routines in the package |
| vareff (VarEff) | the only input format for the package VarEff, which is the only R package I could find that estimates effective population sizes from the present to ancestral time using microsatellite markers under the coalescent (also known as skyline plots) | it is very difficult to convert any genotype data format into the vareff format |
| other | | |
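
Two of the memory trade-offs named in the table can be illustrated with a small sketch: the "factor 4 at most" gap between one-byte-per-base storage and 2-bit coding (DNAbin row), and the storage of only unique site patterns with their weights (phyDat row). The sketch below is written in Python purely for illustration; the function names (`pack_2bit`, `unpack_2bit`, `site_patterns`) are hypothetical and do not correspond to anything in ape or phangorn. It also assumes unambiguous A/C/G/T data, since a 2-bit code has no room for gaps or ambiguity codes.

```python
from collections import Counter

# Hypothetical 2-bit code: only the four unambiguous bases are representable.
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack_2bit(seq):
    """Pack an unambiguous DNA string into bytes, four bases per byte."""
    packed = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        packed[i // 4] |= CODES[base] << (2 * (i % 4))
    return bytes(packed)


def unpack_2bit(packed, n):
    """Recover the first n bases from a 2-bit packed sequence."""
    return "".join(BASES[(packed[i // 4] >> (2 * (i % 4))) & 3] for i in range(n))


def site_patterns(alignment):
    """Collapse the columns of an alignment into unique patterns with weights."""
    return Counter(zip(*alignment))


seq = "ACGTACGTAC"
packed = pack_2bit(seq)
print(len(seq), "bytes at one byte per base;", len(packed), "bytes with 2-bit coding")

alignment = ["ACGTAA", "ACGTAA", "ACTTAA"]
for pattern, weight in site_patterns(alignment).items():
    print("".join(pattern), weight)
```

The packed form is never more than four times smaller than one byte per base, which is the bound quoted in the table. The pattern counter stores each distinct alignment column once, so its size depends on the number of distinct columns rather than on the alignment length.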

Possible lines of action:

  • improve the efficiency of loci2genind and genind2loci in pegas
  • explain, in the targeted vignette, why it's useful to have different classes
  • consider methods for transferring data between several object classes, akin to loci2genind
  • improve pegas::read.vcf:
      • recode some parts to be more efficient
      • make it possible to select some loci without reading the whole file (support for tabix?)
      • make it possible to read the extra information (e.g., genotype likelihoods)
      • support for BCF files?
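
The idea of selecting loci without loading the whole file can be sketched as a streaming filter. This is only an illustration in Python, not a proposal for the pegas API: `select_loci` is a hypothetical name, and the sketch assumes an uncompressed, tab-delimited VCF whose records are grouped by chromosome and sorted by position. Without an index such as the one tabix provides, the reader still has to scan from the top of the file, but it keeps only the matching records in memory and stops as soon as it has passed the requested region.

```python
def select_loci(lines, chrom, start, end):
    """Yield VCF data lines on `chrom` with start <= POS <= end.

    Assumes records are grouped by chromosome and sorted by position,
    so scanning can stop once the requested region has been passed.
    """
    seen_chrom = False
    for line in lines:
        if line.startswith("#"):
            continue  # skip meta-information and header lines
        fields = line.split("\t", 2)  # only CHROM and POS are needed
        if fields[0] != chrom:
            if seen_chrom:
                break  # already past the requested chromosome
            continue
        seen_chrom = True
        pos = int(fields[1])
        if pos > end:
            break  # already past the requested region
        if pos >= start:
            yield line.rstrip("\n")


# A tiny in-memory example; a real file could be passed as open(path).
vcf = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\n",
    "1\t100\t.\tA\tG\n",
    "1\t200\t.\tC\tT\n",
    "1\t300\t.\tG\tA\n",
    "2\t150\t.\tT\tC\n",
]
for record in select_loci(vcf, "1", 150, 250):
    print(record)
```

Because the function accepts any iterable of lines, the same code works on an open file handle, which is read lazily line by line. An indexed format would allow jumping straight to the region instead of scanning up to it.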