bundler/partitioner middleware #2708

Closed
jorgeorpinel opened this issue Nov 1, 2019 · 3 comments
Labels
feature request (Requesting a new feature), question (I have a question?)

Comments

@jorgeorpinel
Contributor

See iterative/dvc.org/issues/682 for context.

It seems that large datasets (in the TBs) tend to get bundled and/or partitioned in different ways and formats, such as HDFS/HDF5/TFRecord files. This poses a challenge for DVC data versioning, which calculates checksums at the file (or directory) level.

What would be the easiest way to extend DVC to support this kind of dataset storage practice? Perhaps even a tool separate from DVC itself: some sort of middleware that provides transparency between the actual dataset, however it's organized into bundles and partitions, and DVC commands.

@ghost

ghost commented Nov 1, 2019

@jorgeorpinel, I guess one way would be to support chunking, but I'm not sure whether HDFS has an API to interact with chunks instead of files.
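
To make the chunking idea concrete, here is a minimal sketch of per-chunk checksums compared with a single file-level checksum. It is not how DVC works internally; the function names, the fixed chunk size, and the choice of SHA-256 are all assumptions made for illustration, and it operates on in-memory bytes rather than on HDFS.

```python
import hashlib
from typing import List

# Hypothetical chunk size, kept tiny so the example is readable.
# A real tool would likely use chunks in the MB range.
CHUNK_SIZE = 4


def file_checksum(data: bytes) -> str:
    """Single checksum over the whole payload (file-level granularity)."""
    return hashlib.sha256(data).hexdigest()


def chunk_checksums(data: bytes, chunk_size: int = CHUNK_SIZE) -> List[str]:
    """One checksum per fixed-size chunk of the payload."""
    return [
        hashlib.sha256(data[i:i + chunk_size]).hexdigest()
        for i in range(0, len(data), chunk_size)
    ]


def changed_chunks(old: List[str], new: List[str]) -> List[int]:
    """Indices of chunks that differ or exist in only one of the versions."""
    longest = max(len(old), len(new))
    return [
        i for i in range(longest)
        if i >= len(old) or i >= len(new) or old[i] != new[i]
    ]


v1 = b"aaaabbbbccccdddd"
v2 = b"aaaabbbbXXXXdddd"  # only the third chunk was modified

print(file_checksum(v1) == file_checksum(v2))  # False
print(changed_chunks(chunk_checksums(v1), chunk_checksums(v2)))  # [2]
```

With a file-level checksum, any change marks the whole file as new; with per-chunk checksums, only the modified chunk would need to be stored or transferred again. Fixed-size chunking is the simplest variant and does not handle insertions that shift later bytes.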

@jorgeorpinel
Contributor Author

It seems HDF5 supports chunking, and there's an HDF5 Connector for Hadoop "to extract metadata and raw data from HDF5 and netCDF4 files on HDFS".

@efiop
Contributor

efiop commented Jul 14, 2020

Closing as stale. HDFS supports such files and provides a built-in checksum, so dvc uses it with external outs/deps (though there are some limitations, like the checksum not being a hash).

@efiop efiop closed this as completed Jul 14, 2020