bulk data in RDS format #9

Open
markziemann opened this issue Feb 16, 2022 · 2 comments

@markziemann
Owner

No description provided.

@markziemann
Owner Author

markziemann commented Jul 14, 2022

Goal: to provide an RDS object which contains all the bulk data in matrix format.

The approach below is slow and uses a lot of memory, but it works:

library("reshape2")
library("tictoc")
tic()
l <- read.table("ecoli_se.tsv.bz2")
x <- t(acast(l, V1~V2, value.var="V3"))
saveRDS(object=x,file="ecoli_se.Rds",compress="bzip2")
toc()
92.993 sec elapsed

For E. coli, the data file is 144 MB, while the equivalent Rds file is 73 MB. If the Rds is compressed with the bzip2 algorithm, it reduces to 54 MB.
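For reference, one way to check those sizes from R; the gzip (default) save is an assumption here, since the snippet above only writes the bzip2 version, and the file names are illustrative:

# compare on-disk sizes of the same matrix under different compression settings
saveRDS(object=x, file="ecoli_se_gz.Rds")                     # default gzip compression
saveRDS(object=x, file="ecoli_se_bz2.Rds", compress="bzip2")  # bzip2 compression

file.size("ecoli_se.tsv.bz2") / 1e6  # long-format table, ~144 MB
file.size("ecoli_se_gz.Rds") / 1e6   # ~73 MB
file.size("ecoli_se_bz2.Rds") / 1e6  # ~54 MB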

This will not scale for human and mouse, as those tabular files are >600 GB in size, so I will investigate out-of-memory methods like disk.frame.

@markziemann
Owner Author

markziemann commented Jul 14, 2022

Potential approach:

  1. Estimate the amount of memory available on the system and plan to use half of it: load data from one run and quantify the memory used. For example, if one sample consumes 10 MB and there is 2000 MB of RAM free, then we can load 100 datasets in a chunk at a time.
  2. Load in 100 datasets from the top of the file, convert to wide, and save the chunk as an RDS object:
library("disk.frame")
library(data.table)
library("R.utils")

# figure out chunk size
x <- fread("ecoli_se.tsv.bz2",nrows=200000)

chunk_size <- length(unique(x$V2))

# direct read from compressed file not working
xx <- zip_to_disk.frame("ecoli_se.tsv.bz2", shardby = "V1" , header=FALSE)

# this works
xx <- csv_to_disk.frame("ecoli_se.tsv", in_chunk_size = chunk_size, header=FALSE)

# no need to predetermine chunk size as data can be sharded by column 1
xx <- csv_to_disk.frame("ecoli_se.tsv", shardby = "V1" , header=FALSE)

# get number of runs
nruns <- nrow(xx)/chunk_size

#next determine how many result chunks to write
w <- t(acast(xx[1:(chunk_size*1000),], V1~V2, value.var="V3"))

# chunk ranges
starts <- seq(1,nruns,1000)
ends <- c(starts-1,nruns)
ends <- ends[2:length(ends)]

dir.create("tmp")
i=1
# the dimensions are twice as wide as expected
w <- t(acast(xx[,], V1~V2, value.var="V3"))

mystart <- (starts[i] * chunk_size ) - chunk_size + 1
myend <- ends[i] * chunk_size
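
The snippet above stops before actually writing the chunks to disk. A minimal sketch of that write loop, under the same assumption that rows are grouped contiguously by run; the tmp/chunk_*.Rds file names are hypothetical:

# hypothetical sketch: cast each range of 1000 runs to wide format and save it
for (i in seq_along(starts)) {
  mystart <- (starts[i] * chunk_size) - chunk_size + 1
  myend <- ends[i] * chunk_size
  w <- t(acast(xx[mystart:myend,], V1~V2, value.var="V3"))
  saveRDS(object=w, file=paste0("tmp/chunk_", i, ".Rds"), compress="bzip2")
}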

  3. Then merge all the RDS files together into a larger RDS.
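A minimal sketch of that merge step, assuming the chunks were written as tmp/chunk_1.Rds, tmp/chunk_2.Rds, ... as in the loop sketched above; the output file name is illustrative:

# read every chunk back in and column-bind them into one matrix
chunk_files <- list.files("tmp", pattern="\\.Rds$", full.names=TRUE)
chunks <- lapply(chunk_files, readRDS)

# each chunk holds a disjoint set of runs as columns and (assuming the genes
# are identical across runs) the same rows, so a cbind gives the full matrix
big <- do.call(cbind, chunks)

saveRDS(object=big, file="ecoli_se_all.Rds", compress="bzip2")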
