Solution for a large number of files? #163

Open
kendonB opened this issue Apr 7, 2017 · 4 comments

Comments

kendonB commented Apr 7, 2017

Hi, I'm new to remake.

I haven't worked out a nice way to deal with a large number of files in remake. I have an application where a function or script reads a large number of files (think daily data with a file per day), which I'd like to track in remake, and outputs a smaller but still large number of files (think monthly aggregates of some subset of the daily data).

As far as I can tell, in the current version, I would have to make individual targets for each raw data download, and only have a single target for each of the monthly aggregations.

This cries out for a loop, but I haven't seen how to do that in remake.

Thoughts? Thanks for any help.

wlandau commented Apr 7, 2017

Good question. I think this relates to #2. There are some great templating ideas on that thread. For your use case, you might use the yaml package to write each day's remake.yml file, though I suppose keeping it up to date could be a pain. The solution proposed by @krlmlr on Nov 25, 2016 seems ideal for you, but I do not think that functionality has been implemented yet.
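
As a rough illustration of the yaml idea, the sketch below builds the target list in a loop and serializes it with `yaml::as.yaml()`. It rests on several assumptions that are not from this thread: `download_day()` and `aggregate_month()` are hypothetical user functions living in a hypothetical `R/functions.R`, the `data/` and `output/` paths are made up, and the layout (`sources`, `targets`, `command`, `depends`, and the `target_name` placeholder) follows remake's documented file format as I understand it, so check it against the remake README before relying on it.

```r
# Hypothetical sketch: one file target per day, one aggregate per month.
# download_day() and aggregate_month() are assumed user functions, not
# part of remake; all paths are illustrative.
library(yaml)

days <- seq(as.Date("2017-01-01"), as.Date("2017-02-28"), by = "day")
daily_files <- file.path("data", paste0(format(days, "%Y-%m-%d"), ".csv"))
months <- format(days, "%Y-%m")

# One target per daily file; target_name is assumed to expand to the file name.
daily_targets <- lapply(daily_files, function(f) {
  list(command = "download_day(target_name)")
})
names(daily_targets) <- daily_files

# One target per month, depending on that month's daily files.
monthly_targets <- lapply(split(daily_files, months), function(files) {
  list(command = "aggregate_month(target_name)", depends = as.list(files))
})
names(monthly_targets) <- file.path(
  "output", paste0(names(monthly_targets), ".csv")
)

remake_spec <- list(
  sources = list("R/functions.R"),
  targets = c(
    list(all = list(depends = as.list(names(monthly_targets)))),
    daily_targets,
    monthly_targets
  )
)

# Serialize; in a real project, write to "remake.yml" in the project root.
yaml_text <- as.yaml(remake_spec)
out_file <- file.path(tempdir(), "remake.yml")
writeLines(yaml_text, out_file)
```

Rerunning a script like this whenever the date range changes is one way to keep the file up to date, at the cost of treating remake.yml as generated output rather than something edited by hand.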

Last summer, I wrote remakeGenerator to programmatically generate remake.yml files from data frames of commands. It has some functions to manipulate these data frames (analyses(), summaries(), expand(), evaluate(), gather()) so you do not have to write everything by hand.

kendonB commented Apr 7, 2017

@wlandau Do you have thoughts on using remake + remakeGenerator vs Drake for these sorts of biggish data projects? I'm running things which use memory up to around 180GB and total disk usage of around 300-400GB.

wlandau commented Apr 7, 2017

@kendonB After my company lets me release the latest drake patch (which fixes issues 16, 17, 18, and 19), I am not sure how the two options will stack up in terms of performance. I do know that remake has been around longer and has stood the test of far more projects, and I have not tested drake on files that large. I could compare remake and drake in more detail, but I do not think this is the place for that. However, I do think this is a good opportunity for some benchmarking. Is your project public? If you use remake and remakeGenerator, maybe I could port it to drake and compare.

wlandau commented Apr 7, 2017

@kendonB I thought about your use case a little more, and I think there may be more to say.

  • Speed: I have not done much benchmarking to compare drake to remake, so I cannot really speak to this yet (except that drake issue 18 will be patched relatively soon).
  • Storage: drake and remake both use storr to maintain the cache. For each file target, both packages track the file's fingerprint rather than the file itself. My intuition says that both should have about the same storage efficiency.
  • Memory: I am not sure how remake manages objects in memory (though #156, "Reduce memory consumption", improves this). As for drake, I try to conserve memory using envir.R. Before each parallelizable stage of targets, drake loads the targets it needs and unloads the targets it will never need again. During each stage, newly-made targets are stored in memory in case they will be needed to make future targets, a decision that prioritizes speed over memory consumption.
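
To make the storage point concrete, here is a minimal sketch of what tracking a fingerprint instead of the file itself can look like. It uses `tools::md5sum()` from base R purely for illustration; the choice of MD5 and the helper name `fingerprint()` are assumptions, and this is not necessarily the hash or the code path that remake, drake, or storr actually use.

```r
# Illustration only: md5 is an assumed choice of hash, and fingerprint()
# is a hypothetical helper, not a function from remake, drake, or storr.
fingerprint <- function(path) unname(tools::md5sum(path))

f <- tempfile(fileext = ".csv")
writeLines("day,value\n2017-01-01,1", f)
before <- fingerprint(f)

# Rewriting identical contents leaves the fingerprint unchanged,
# so a file target like this would be considered up to date.
writeLines("day,value\n2017-01-01,1", f)
unchanged <- identical(before, fingerprint(f))

# Any change in contents produces a different fingerprint,
# which is the signal to rebuild downstream targets.
writeLines("day,value\n2017-01-01,2", f)
changed <- !identical(before, fingerprint(f))
```

The stored fingerprint is a short fixed-length string no matter how large the file is, which is why the cache itself should stay small even when the tracked files add up to hundreds of gigabytes. Computing the hash still requires reading each file, so hashing time grows with file size.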
