Working with Large Files
Suppose I have a 955,000 x 2,065 file of genotypes: 955,000 loci and 2,060 individuals, plus five information columns (name, chromosome, position, strand, alleles). Roughly how much memory would it take to hold this?
Let's say that each SNP name is around 50 bytes, each chromosome is around 8 bytes, each allele is 4 bytes, each strand is 1 byte, and each position is 4 bytes (a 32-bit integer). For each locus, the genotypes will take roughly 8 bytes x 2,000 individuals = 16,000 bytes. For scripting languages these estimates will be low, since there's overhead for every object stored. The total would be:
> (50 + 8 + 4 + 1 + 4 + 16000)*955000
[1] 15343985000 # ~14.3 GB
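One way to sanity-check an estimate like this, if the file is on hand, is to read a small slice and extrapolate from its measured in-memory size. A minimal sketch (the file name and sample size here are placeholders):

# read the first few thousand rows, measure their size in memory, and extrapolate
n_sample <- 5000
n_total  <- 955690
slice <- read.delim("genotypes.hapmap.txt", header=TRUE,
                    stringsAsFactors=FALSE, nrows=n_sample)
bytes_per_row <- as.numeric(object.size(slice)) / n_sample
bytes_per_row * n_total / 1024^3   # projected size of the full table, in GB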
I have a large file (955,690 x 2,065) of genotypes in HapMap format. The size is 417 MB compressed with gzip, 3.7 GB uncompressed. Skipping some meta-information columns, and setting colClasses and nrows, loading this takes:
user system elapsed
767.237 4.066 772.356
to load into R using read.delim, which is around 13 minutes. The code used:
HAPMAP_FIXED_COLS <- c(rs="character", alleles="character",
                       chrom="character", pos="integer", strand="character",
                       assembly="NULL", center="NULL",
                       protLSID="NULL", assayLSID="NULL",
                       panelLSID="NULL", QCcode="NULL")

readHapmap <- function(file, nlines) {
  # read just the header line to recover the sample names after the fixed columns
  header <- names(read.delim(file, header=TRUE, nrows=1))
  samples <- header[-c(1:length(HAPMAP_FIXED_COLS))]
  # create colClasses: fixed columns as declared ("NULL" columns are skipped),
  # plus one "character" per sample
  colclasses <- c(HAPMAP_FIXED_COLS, rep("character", length(samples)))
  # time the full read and report it
  print(system.time({hapmap <- read.delim(file, header=TRUE, stringsAsFactors=FALSE,
                                          colClasses=colclasses, nrows=nlines,
                                          col.names=c(names(HAPMAP_FIXED_COLS), samples))}))
  hapmap
}
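A call along these lines would then read the full table; the file name is hypothetical, and nlines matches the row count mentioned above:

hapmap <- readHapmap("genotypes.hapmap.txt", nlines=955690)
dim(hapmap)   # rows and retained (non-NULL) columns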
In R, the resulting object is 15,839,231,000 bytes (14.8 GB), similar to our guesstimate earlier.
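That figure presumably comes from object.size(); to report it directly in gigabytes:

# in-memory size of the loaded table ("Gb" is R's 1024-based unit label)
print(object.size(hapmap), units="Gb")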
Serializing this object with save() takes quite a while:
> system.time(save(hapmap, file="hapmap.Rdata"))
user system elapsed
1396.391 29.588 1454.836
Or about 24 minutes wall clock time. The R binary file is roughly 849 MB, about twice the size of the gzipped text file. To load back in:
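The timing below presumably comes from a call along these lines, re-reading the file written by save() above:

# reload the serialized object; load() restores it under its original name, hapmap
system.time(load("hapmap.Rdata"))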
user system elapsed
835.837 16.862 868.570
Or about 14 minutes wall clock time. This is not an improvement over parsing the plaintext file.
Todo