It seems like large datasets (in the TBs) tend to get bundled and/or partitioned in various ways and formats, such as HDFS, HDF5, or TFRecord files. This poses a challenge for DVC data versioning, which calculates checksums at the file (or directory) level.
What would be the easiest way to extend DVC to support this kind of dataset storage practice? Perhaps even a tool separate from DVC itself, acting as a sort of middleware that provides transparency between the actual dataset, however it is organized into bundles and partitions, and DVC commands.
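For reference, a minimal sketch of how DVC tracks such a dataset today with directory-level checksumming (the paths are hypothetical):

```shell
# DVC hashes the directory contents as a whole and records a single
# checksum in data/tfrecords.dvc; any change inside the partitioned
# dataset invalidates the entire directory entry.
dvc add data/tfrecords
git add data/tfrecords.dvc data/.gitignore
git commit -m "Track partitioned TFRecord dataset as one unit"
```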
Closing as stale. HDFS supports such files and provides a built-in checksum, so DVC uses it with external outputs/dependencies (though there are some limitations, e.g. the checksum is not a content hash).
See iterative/dvc.org/issues/682 for context.
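For example, a sketch of a pipeline stage that declares an HDFS path as an external dependency, so DVC relies on the checksum reported by HDFS rather than downloading and hashing the multi-TB contents itself (stage name, paths, and command are assumptions):

```shell
# hdfs://example.com/datasets/raw is a hypothetical external dependency;
# DVC records the checksum provided by HDFS for change detection instead
# of computing an md5 over the bundled/partitioned files locally.
dvc run -n preprocess \
        -d hdfs://example.com/datasets/raw \
        -o features.parquet \
        python preprocess.py
```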