Large Data Storage in Drake #6124

Closed
stonier opened this issue May 17, 2017 · 11 comments
@stonier
Contributor

stonier commented May 17, 2017

EDIT (eric), as of 2019-01-29: I've generalized this title to not necessarily be Girder-specific, but just handle large data storage in general.
The text below is relevant to stuff being Girder-specific.


This issue is to land some information on the table and permit discussion for interested parties. The problem being solved here is how to store, retrieve and consume large data (meshes, ...) files in the drake bazel workspace.

Kitware recently presented a demo triggered by previous discussions w/ David and others.


From @jamiesnape:

The backstory is really these issues, and a discussion that we had when we visited Cambridge in March, of which you possibly have the minutes:

#3257

In this demo, we will show using Girder to store large object files referenced from a Git repository. Girder is a scalable, extensible, open-source, Python-based data management framework for the web, developed by Kitware based on years of experience working in the scientific-data-management space. For this demo, we have deployed Girder to Amazon EC2, and are using Amazon S3 as a file storage backend to match the existing type of infrastructure that Kitware maintains for the Drake project.

We have created an example code repository (https://github.com/jcfr/bazel-large-files-with-girder) with a Bazel build system and test files. The test files are STL meshes, and rather than an actual unit test, our “tests” will display a mesh viewer to demonstrate the current contents of the test object file. Using this system, a developer can add a large test file as test data; the test file will be stored in Girder, and only a description of the SHA-512 checksum of the full object will be added to the Git repository. The test file can change contents, and Girder will support hosting the multiple versions of the file, along with downloading the full object contents via its SHA-512 checksum. Our approach directly integrates with the Bazel build system to leverage its dependency resolution mechanism in order to selectively download to the sandbox only those files needed to run the specific tests requested, potentially caching the downloaded files.
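
To make the checksum-pointer idea above concrete, here is a minimal Python sketch. It is not the code from the demo repository. The `.sha512` sidecar naming, the cache layout, and the `download` callable (a stand-in for whatever actually talks to Girder) are all assumptions made for illustration.

```python
import hashlib
from pathlib import Path
from typing import Callable


def sha512_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-512 digest of a file, reading it in chunks."""
    digest = hashlib.sha512()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_pointer(data_file: Path) -> Path:
    """Write a small `<name>.sha512` file next to the data file.

    Assumption: only this pointer is committed to Git, while the large
    file itself is uploaded to the storage server.
    """
    pointer = data_file.with_name(data_file.name + ".sha512")
    pointer.write_text(sha512_of(data_file) + "\n")
    return pointer


def fetch(pointer: Path, cache_dir: Path,
          download: Callable[[str], bytes]) -> Path:
    """Resolve a pointer file to the real contents, using a local cache.

    `download` is a hypothetical callable that maps a checksum to the
    file bytes; a real setup would query the storage server here.
    """
    expected = pointer.read_text().strip()
    cached = cache_dir / expected  # cache entries are keyed by checksum
    if cached.exists() and sha512_of(cached) == expected:
        return cached  # cache hit, no download needed
    payload = download(expected)
    if hashlib.sha512(payload).hexdigest() != expected:
        raise ValueError(f"checksum mismatch for {pointer.name}")
    cache_dir.mkdir(parents=True, exist_ok=True)
    cached.write_bytes(payload)
    return cached
```

Because the cache is keyed by checksum, a changed test file gets a new pointer and a new cache entry, and older versions remain retrievable for as long as the server keeps them.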


Interesting features

  • relatively simple bazel integration (dig around the demo repo)
  • only downloads for the bazel run or bazel test as needed
  • existing web frontend to girder that views your collection more sanely than looking up
  • github organisation based authentication for girder

In its current state, for this particular use case, it could be almost a drop-in. Note that this is separate from the OSRC work to develop a more general solution that can support use cases beyond a bazel workspace (it could feasibly amalgamate with this bazel support once in place).


Current State

We have ~100MB+ of data files in drake. There is an occasional decision paralysis when deciding on whether to add more data files or not. There is not a strict need right now, but an anticipated need.

Proposal (after discussion with @sammy-tri)

Have Kitware drop in a solution (while they're active on it). Trial it for a couple of months and see if there is uptake - this will answer the question of whether the anticipated need is real or otherwise.

@jamiesnape
Contributor

Sounds good to me.

@jwnimmer-tri
Collaborator

It's important that the Bazel integration offer a way to prefetch the data files without compiling or running all of the tests, to support road warrior builds. Even better if bazel fetch ... is that way, which would then also allow for prefetching a relevant subset based on the usual label and dependency rules.

@jamiesnape
Contributor

bazel fetch ... should work out of the box, yes.

@EricCousineau-TRI
Contributor

Quick question about GitHub authentication:
Is this just for the web interface, or would it also permit direct GETing / POSTing (e.g. via a wrapped Python script as in the example repo) using SSH keys, and validate the keys against what is stored in GitHub?

@jamiesnape
Contributor

Not sure that functionality is there at the moment, but I imagine we can add it for you.

@jwnimmer-tri
Collaborator

All issues must have owners. I'm assigning one arbitrarily. (Fix it up if I'm wrong.)

@EricCousineau-TRI
Contributor

As a latent update, we've been using a flavor of this repo in Anzu for about a year now:
https://github.com/RobotLocomotion/bazel-external-data
It's not great, not recommended, but has gotten the job done enough for usage in Anzu (though with pain points for upload / branching workflows).

\cc @RussTedrake

@jwnimmer-tri
Collaborator

@EricCousineau-TRI I'm not sure what additional action we should anticipate under the umbrella of this issue? Is there more we should do, or should we close this issue? We have a few other issues open about handling model assets via RobotLocomotion/models, so I'm not sure what else we anticipate here.

@EricCousineau-TRI
Contributor

I'd like to move for keeping this open for 2 months. I'd like to come back to this and prototype using external_data from Drake, mainly to see if it does anything for pain points.
(My issue about the "install story" seems moot now; for now, we just shrug and install prod models normally.)
I don't care if we end up using external_data, but it's just a means to test.

I've set a calendar item for me to close this if I do not get back to it.
That work?

@jwnimmer-tri
Collaborator

If the only further action is you doing some personal testing, then it seems more suitable to keep that in your personal TODO list instead of the team's collaboratively-maintained TODO list. On the other hand, we've kept this open for several years without any change, so I can't really object to another two months, either.

@EricCousineau-TRI
Contributor

Didn't make it happen in time, gonna close. Can re-open later if need be.
