bulk data in RDS format #9

Open
markziemann opened this issue Feb 16, 2022 · 2 comments

@markziemann
Owner

No description provided.

@markziemann
Owner Author

markziemann commented Jul 14, 2022

Goal: to provide an RDS object which contains all the bulk data in matrix format.

The approach below is slow and uses a lot of memory, but it works:

library("reshape2")
library("tictoc")
tic()
l <- read.table("ecoli_se.tsv.bz2")
x <- t(acast(l, V1~V2, value.var="V3"))
saveRDS(object=x,file="ecoli_se.Rds",compress="bzip2")
toc()
92.993 sec elapsed

For E. coli, the data file is 144 MB, while the equivalent Rds file is 73 MB. If the Rds is compressed with the bzip2 algorithm, it reduces to 54 MB.
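For reference, one way to check those sizes from R; the gzip (default) save is an assumption here, since the snippet above only writes the bzip2 version, and the file names are illustrative:

# compare on-disk sizes of the same matrix under different compression settings
saveRDS(object=x, file="ecoli_se_gz.Rds")                     # default gzip compression
saveRDS(object=x, file="ecoli_se_bz2.Rds", compress="bzip2")  # bzip2 compression

file.size("ecoli_se.tsv.bz2") / 1e6  # long-format table, ~144 MB
file.size("ecoli_se_gz.Rds") / 1e6   # ~73 MB
file.size("ecoli_se_bz2.Rds") / 1e6  # ~54 MB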

This will not scale for human and mouse, as those tabular files are >600 GB in size, so I will investigate out-of-memory methods like disk.frame.

@markziemann
Owner Author

markziemann commented Jul 14, 2022

Potential approach:

  1. Estimate the amount of memory available on the system and plan to use half of it: load data from one run and quantify the memory used. For example, if one sample consumes 10 MB and there is 2000 MB of RAM free, then we can load 100 datasets in a chunk at a time.
  2. Load in 100 datasets from the top of the file, convert to wide, and save the chunk as an RDS object:
library("disk.frame")
library(data.table)
library("R.utils")

# figure out chunk size
x <- fread("ecoli_se.tsv.bz2",nrows=200000)

chunk_size <- length(unique(x$V2))

# direct read from compressed file not working
xx <- zip_to_disk.frame("ecoli_se.tsv.bz2", shardby = "V1" , header=FALSE)

# this works
xx <- csv_to_disk.frame("ecoli_se.tsv", in_chunk_size = chunk_size, header=FALSE)

# no need to predetermine chunk size as data can be sharded by column 1
xx <- csv_to_disk.frame("ecoli_se.tsv", shardby = "V1" , header=FALSE)

# get number of runs
nruns <- nrow(xx)/chunk_size

#next determine how many result chunks to write
w <- t(acast(xx[1:(chunk_size*1000),], V1~V2, value.var="V3"))

# chunk ranges
starts <- seq(1,nruns,1000)
ends <- c(starts-1,nruns)
ends <- ends[2:length(ends)]

dir.create("tmp")
i=1
# the dimensions are twice as wide as expected
w <- t(acast(xx[,], V1~V2, value.var="V3"))

mystart <- (starts[i] * chunk_size ) - chunk_size + 1
myend <- ends[i] * chunk_size
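
The snippet above stops before actually writing the chunks to disk. A minimal sketch of that write loop, under the same assumption that rows are grouped contiguously by run; the tmp/chunk_*.Rds file names are hypothetical:

# hypothetical sketch: cast each range of 1000 runs to wide format and save it
for (i in seq_along(starts)) {
  mystart <- (starts[i] * chunk_size) - chunk_size + 1
  myend <- ends[i] * chunk_size
  w <- t(acast(xx[mystart:myend,], V1~V2, value.var="V3"))
  saveRDS(object=w, file=paste0("tmp/chunk_", i, ".Rds"), compress="bzip2")
}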

  3. Then merge all the RDS files together into a larger RDS.
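A minimal sketch of that merge step, assuming the chunks were written as tmp/chunk_1.Rds, tmp/chunk_2.Rds, ... as in the loop sketched above; the output file name is illustrative:

# read every chunk back in and column-bind them into one matrix
chunk_files <- list.files("tmp", pattern="\\.Rds$", full.names=TRUE)
chunks <- lapply(chunk_files, readRDS)

# each chunk holds a disjoint set of runs as columns and (assuming the genes
# are identical across runs) the same rows, so a cbind gives the full matrix
big <- do.call(cbind, chunks)

saveRDS(object=big, file="ecoli_se_all.Rds", compress="bzip2")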
