Large Data Storage in Drake #6124

stonier · 2017-05-17T15:35:39Z

EDIT (eric), as of 2019-01-29: I've generalized this title so that it is not Girder-specific and covers large data storage in general.
The text below is specific to Girder.

This issue puts some information on the table and opens discussion for interested parties. The problem being solved here is how to store, retrieve, and consume large data files (meshes, ...) in the Drake Bazel workspace.

Kitware recently presented a demo triggered by previous discussions w/ David and others.

From @jamiesnape:

The backstory is really these issues, and a discussion that we had when we visited Cambridge in March, of which you possibly have the minutes:

#3257

In this demo, we will show using Girder to store large object files referenced from a Git repository. Girder is a scalable, extensible, open-source, Python-based data management framework for the web, developed by Kitware based on years of experience working in the scientific-data-management space. For this demo, we have deployed Girder to Amazon EC2, and are using Amazon S3 as a file storage backend to match the existing type of infrastructure that Kitware maintains for the Drake project.

We have created an example code repository (https://github.com/jcfr/bazel-large-files-with-girder) with a Bazel build system and test files. The test files are STL meshes, and rather than an actual unit test, our “tests” will display a mesh viewer to demonstrate the current contents of the test object file. Using this system, a developer can add a large test file as test data; the test file will be stored in Girder, and only a description of the SHA-512 checksum of the full object will be added to the Git repository. The test file can change contents, and Girder will support hosting multiple versions of the file, along with downloading the full object contents via its SHA-512 checksum. Our approach directly integrates with the Bazel build system to leverage its dependency resolution mechanism in order to selectively download to the sandbox only those files needed to run the specific tests requested, potentially caching the downloaded files.
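
To make the checksum-pointer idea above concrete, here is a minimal Python sketch. It is not taken from the demo repository: the `.sha512` pointer-file suffix and the helper names `sha512_of`, `write_pointer`, and `verify` are assumptions made for illustration. It only shows the general pattern of keeping a small checksum file in Git and checking a large file against it.

```python
import hashlib
import tempfile
from pathlib import Path


def sha512_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-512 digest of a file, reading it in chunks."""
    digest = hashlib.sha512()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_pointer(data_file: Path) -> Path:
    """Write '<name>.sha512' next to the data file (hypothetical naming).

    Only this small pointer file would be committed to Git; the large
    file itself would live in the external store.
    """
    pointer = data_file.with_name(data_file.name + ".sha512")
    pointer.write_text(sha512_of(data_file) + "\n")
    return pointer


def verify(data_file: Path, pointer: Path) -> bool:
    """Check that a (downloaded) file matches the checksum in its pointer."""
    return sha512_of(data_file) == pointer.read_text().strip()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        mesh = Path(tmp) / "example.stl"
        mesh.write_bytes(b"solid example\nendsolid example\n")
        pointer = write_pointer(mesh)
        print(pointer.name, verify(mesh, pointer))
        # Changing the file contents invalidates the old pointer,
        # which is what lets the store host multiple versions by hash.
        mesh.write_bytes(b"solid changed\nendsolid changed\n")
        print(verify(mesh, pointer))
```

Because the pointer changes whenever the contents change, each committed revision identifies exactly one stored object, which fits the multiple-versions behavior described above.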

Interesting features

relatively simple bazel integration (dig around the demo repo)
only downloads for the bazel run or bazel test as needed
existing web frontend to girder that views your collection more sanely than looking up
github organisation based authentication for girder

In its current state, for this particular use case, it could be almost a drop-in. Note that this is separate from the OSRC work to develop a more general solution that can support use cases beyond a Bazel workspace (it could feasibly amalgamate with this Bazel support once in place).

Current State

We have ~100MB+ of data files in drake. There is an occasional decision paralysis when deciding on whether to add more data files or not. There is not a strict need right now, but an anticipated need.

Proposal (after discussion with @sammy-tri)

Have Kitware drop in a solution (while they're active on it). Trial it for a couple of months and see if there is uptake; this will answer the question of whether the anticipated need is real or not.

jamiesnape · 2017-05-17T15:41:48Z

Sounds good to me.

jwnimmer-tri · 2017-05-17T15:45:07Z

It's important that the Bazel integration offer a way to prefetch the data files without compiling or running all of the tests, to support road warrior builds. Even better if bazel fetch ... is that way, which would then also allow for prefetching a relevant subset based on the usual label and dependency rules.
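
For illustration only, the prefetch requirement can be sketched in Python independently of Bazel. This is not how bazel fetch or the demo repository works; the `prefetch` function, the pointer files holding a SHA-512 digest, and the `fetch` callable standing in for the remote store are all hypothetical. The point is simply that everything a build needs can be pulled into a local cache up front, so later runs work offline.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Callable, Iterable, List


def prefetch(pointers: Iterable[Path], cache_dir: Path,
             fetch: Callable[[str], bytes]) -> List[str]:
    """Fetch every object named by a pointer file that is not yet cached.

    Each pointer file is assumed to contain one hex SHA-512 digest.
    Returns the digests that were actually fetched on this call.
    """
    cache_dir.mkdir(parents=True, exist_ok=True)
    fetched = []
    for pointer in pointers:
        digest = pointer.read_text().strip()
        target = cache_dir / digest
        if target.exists():
            continue  # already cached, no network access needed
        data = fetch(digest)
        if hashlib.sha512(data).hexdigest() != digest:
            raise ValueError(f"checksum mismatch for {pointer}")
        target.write_bytes(data)
        fetched.append(digest)
    return fetched


if __name__ == "__main__":
    # An in-memory dict stands in for the remote store in this sketch.
    blob = b"solid example\nendsolid example\n"
    store = {hashlib.sha512(blob).hexdigest(): blob}
    with tempfile.TemporaryDirectory() as tmp:
        pointer = Path(tmp) / "example.stl.sha512"
        pointer.write_text(next(iter(store)) + "\n")
        cache = Path(tmp) / "cache"
        print(len(prefetch([pointer], cache, store.__getitem__)))
        # A second call finds everything cached and fetches nothing.
        print(len(prefetch([pointer], cache, store.__getitem__)))
```

Passing only the pointer files reachable from a chosen set of targets would give the "relevant subset" behavior; here that selection is left to the caller.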

jamiesnape · 2017-05-17T15:49:03Z

bazel fetch ... should work out of the box, yes.

EricCousineau-TRI · 2017-05-23T19:14:33Z

Quick question about GitHub authentication:
Is this just for the web interface, or would it also permit direct GETing / POSTing (e.g. via a wrapped Python script as in the example repo) using SSH keys, validating the keys against those stored in GitHub?

jamiesnape · 2017-05-23T19:19:02Z

Not sure that functionality is there at the moment, but I imagine we can add it for you.

jwnimmer-tri · 2017-06-19T15:02:06Z

All issues must have owners. I'm assigning one arbitrarily. (Fix it up if I'm wrong.)

EricCousineau-TRI · 2019-01-29T22:09:16Z

As a belated update, we've been using a flavor of this repo in Anzu for about a year now:
https://github.com/RobotLocomotion/bazel-external-data
It's not great, not recommended, but has gotten the job done enough for usage in Anzu (though with pain points for upload / branching workflows).

\cc @RussTedrake

jwnimmer-tri · 2020-05-28T17:18:12Z

@EricCousineau-TRI I'm not sure what additional action we should anticipate under the umbrella of this issue? Is there more we should do, or should we close this issue? We have a few other issues open about handling model assets via RobotLocomotion/models, so I'm not sure what else we anticipate here.

EricCousineau-TRI · 2020-05-28T17:53:37Z

I'd like to move for keeping this open for 2 months. I'd like to come back to this and prototype using external_data from Drake, mainly to see if it does anything for pain points.
(My issue about the "install story" seems moot now; for now, we just shrug and install prod models normally.)
I don't care if we end up using external_data, but it's just a means to test.

I've set a calendar item for me to close this if I do not get back to it.
That work?

jwnimmer-tri · 2020-05-28T18:02:24Z

If the only further action is you doing some personal testing, then it seems more suitable to keep that in your personal TODO list instead of the team's collaboratively-maintained TODO list. On the other hand, we've kept this open for several years without any change, so I can't really object to another two months, either.

EricCousineau-TRI · 2020-07-27T21:18:58Z

Didn't make it happen in time, gonna close. Can re-open later if need be.

stonier added the type: idea label May 17, 2017

jwnimmer-tri assigned stonier Jun 19, 2017

jamiesnape mentioned this issue Jun 19, 2017

Create more flexible git hooks system #3171

Closed

RussTedrake mentioned this issue Jul 7, 2017

Bring back the littledog URDF #6523

Closed

EricCousineau-TRI mentioned this issue Jul 25, 2017

Add simple local ICP algorithm w/ unittest #6655

Closed

EricCousineau-TRI mentioned this issue Oct 26, 2017

hashsum_download: Provide queries for existence, file information, etc. w.r.t. an existing collection girder/girder#2446

Open

jwnimmer-tri assigned EricCousineau-TRI and unassigned stonier Jun 9, 2018

jwnimmer-tri added the unused team: kitware label Sep 27, 2018

EricCousineau-TRI added the priority: backlog label Dec 8, 2018

EricCousineau-TRI changed the title ~~Data Storage via Girder~~ Large Data Storage in Drake Jan 29, 2019

This was referenced Jan 29, 2019

Include YCB objects in drake examples #10024

Closed

Fetch remote resources (at runtime, with caching) #9498

Closed

EricCousineau-TRI mentioned this issue Aug 6, 2019

doc: Add instructions for adding model artifacts #10811

Merged

jwnimmer-tri added the component: build system (Bazel, CMake, dependencies, memory checkers, linters) label Apr 28, 2020

EricCousineau-TRI closed this as completed Jul 27, 2020

jwnimmer-tri mentioned this issue Jul 24, 2021

[tools] Deprecate expose_all_files and remove external_data stubs #15469

Merged
