
Working with Large Files


Suppose I have a 955,000 x 2,065 file of genotypes. This is from roughly 955,000 loci and 2,060 individuals, with five information columns (name, chromosome, position, strand, alleles). Roughly how much space would it take to hold this in memory?

Let's say that each SNP name is around 50 bytes, each chromosome is around 8 bytes, each allele field is 4 bytes, each strand is 1 byte, and each position is 4 bytes (a 32-bit integer). If each genotype takes around 8 bytes, each locus needs roughly 8 x 2,000 = 16,000 bytes for its genotypes. For scripting languages these estimates will be low, since there's overhead for every object stored. The total would be:

> (50 + 8 + 4 + 1 + 4 + 16000)*955000
[1] 15343985000 # ~ 14.3 Gb
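
The ~14.3 Gb annotation is just this byte total divided by 1024^3 bytes per Gb; as a quick check:

> (50 + 8 + 4 + 1 + 4 + 16000)*955000 / 1024^3
[1] 14.2902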

Some Real Benchmarks

I have a large file (955,690 x 2,065) of genotypes in HapMap format: 417Mb gzip-compressed, 3.7Gb uncompressed. Skipping some metadata columns and setting colClasses and nrows, this takes:

      user  system elapsed
   767.237   4.066 772.356

to load into R using read.delim, which is about 13 minutes of wall clock time. The code used:

HAPMAP_FIXED_COLS  <- c(rs="character", alleles="character",
                        chrom="character", pos="integer", strand="character",
                        assembly="NULL", center="NULL", 
                        protLSID="NULL", assayLSID="NULL",
                        panelLSID="NULL", QCcode="NULL")

readHapmap <- function(file, nlines) {
  # read just the header row to get the column names, then drop the fixed
  # columns to recover the sample names
  header <- names(read.delim(file, header=TRUE, nrows=1))
  samples <- header[-c(1:length(HAPMAP_FIXED_COLS))]
  # create colClasses: the fixed columns, then one character column per sample
  colclasses <- c(HAPMAP_FIXED_COLS, rep("character", length(samples)))
  # time the full read; columns with colClasses "NULL" are skipped entirely
  print(system.time(hapmap <- read.delim(file, header=TRUE, stringsAsFactors=FALSE,
                                         colClasses=colclasses, nrows=nlines,
                                         col.names=c(names(HAPMAP_FIXED_COLS), samples))))
  hapmap
}
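
Calling this on the genotype file looks something like the following sketch (the filename is a placeholder; nlines matches the 955,690 loci above):

> hapmap <- readHapmap("genotypes_hapmap.txt", nlines=955690)  # placeholder filename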

In R, the object size is 15839231000 bytes (14.8Gb), similar to our earlier guesstimate.
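
That byte count is presumably what object.size reports; printing it with units gives the Gb figure directly (assuming the hapmap data frame returned by readHapmap above):

> print(object.size(hapmap), units="Gb")   # 14.8 Gb, given the size above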

Serializing this file takes quite a while:

> system.time(save(hapmap, file="hapmap.Rdata"))
    user   system  elapsed
1396.391   29.588 1454.836

Or about 24 minutes of wall clock time. The R binary file is roughly 849Mb, about twice the size of the gzipped text file. To load back in:

   user  system elapsed
835.837  16.862 868.570

Or about 14 minutes of wall clock time. This is not an improvement over parsing the plaintext file; in fact, it's slightly slower than the read.delim call above.
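
The reload itself would presumably just be a load() call on the file written above, timed the same way as the save:

> system.time(load("hapmap.Rdata"))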

Dealing with Large Files Efficiently in R

Todo
