bundler/partitioner middleware #2708

jorgeorpinel · 2019-11-01T18:35:32Z

See iterative/dvc.org/issues/682 for context.

It seems that large datasets (in the TBs) tend to get bundled and/or partitioned in different ways and formats, such as HDFS/HDF5/TFRecord files. This poses a challenge for DVC data versioning, which calculates checksums at the file (or directory) level.

What would be the easiest way to extend DVC to support this kind of dataset storage practice? Perhaps even a tool separate from DVC itself: some sort of middleware that provides transparency between DVC commands and the actual dataset, however it's organized into bundles and partitions.

ghost · 2019-11-01T18:52:57Z

@jorgeorpinel, I guess one way would be to support chunking, but I'm not sure whether HDFS has an API to interact with chunks instead of files.
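To illustrate the chunking idea, here is a minimal sketch, not how DVC works today: it assumes fixed-size chunks and SHA-256 digests, and the names `chunk_checksums` and `changed_chunks` are hypothetical. It only shows that per-chunk checksums let you detect which parts of a large bundle changed, instead of treating the whole file as modified.

```python
import hashlib
from typing import List

# Tiny chunk size so the demo is easy to follow; a real tool would
# presumably use chunks in the MB range (assumption).
CHUNK_SIZE = 4


def chunk_checksums(data: bytes, chunk_size: int = CHUNK_SIZE) -> List[str]:
    """Return one SHA-256 hex digest per fixed-size chunk of `data`."""
    return [
        hashlib.sha256(data[i:i + chunk_size]).hexdigest()
        for i in range(0, len(data), chunk_size)
    ]


def changed_chunks(old: List[str], new: List[str]) -> List[int]:
    """Indices of chunks in `new` that differ from `old` or did not exist in it."""
    return [
        i for i, digest in enumerate(new)
        if i >= len(old) or old[i] != digest
    ]


v1 = b"aaaabbbbccccdddd"
v2 = b"aaaabbbbCCCCdddd"  # only the third chunk differs

print(changed_chunks(chunk_checksums(v1), chunk_checksums(v2)))  # [2]
```

With a file-level checksum, `v2` would count as an entirely new file; with chunk-level checksums only chunk 2 would need to be stored again. Fixed-size chunks break down when bytes are inserted or removed mid-file, since every later chunk shifts, so this is only a starting point.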

jorgeorpinel · 2019-11-02T16:26:27Z

It seems HDF5 supports chunking, and there's an HDF5 Connector for Hadoop "to extract metadata and raw data from HDF5 and netCDF4 files on HDFS".

efiop · 2020-07-14T12:08:04Z

Closing as stale. HDFS supports such files and provides a built-in checksum, so DVC uses it with external outs/deps (though there are some limitations, such as the checksum not being a hash).

jorgeorpinel added the "question" and "feature request" labels Nov 1, 2019

jorgeorpinel mentioned this issue Nov 1, 2019

best-practice: dataset partitioning and/or data compression? iterative/dvc.org#682

Closed

weekly-digest bot mentioned this issue Nov 4, 2019

Weekly Digest (28 October, 2019 - 4 November, 2019) #2726

Closed

efiop closed this as completed Jul 14, 2020
