best-practice: dataset partitioning and/or data compression? #682
A recent support case involving partitioning (and possibly compression) came up in a chat conversation, where a user has a Parquet Hadoop-style data warehouse (laid out in a partitioned directory structure) and wanted to track and version it in place (externally), since it amounts to PBs of data.
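For illustration, here is a minimal sketch (with made-up column names; the user's actual structure wasn't shared here) of how a Hive/Hadoop-style partitioned Parquet warehouse like that is typically produced:

```python
# Minimal sketch (hypothetical columns) of producing a Hive/Hadoop-style
# partitioned Parquet layout. Requires pandas with pyarrow installed.
import pandas as pd

df = pd.DataFrame({
    "year": [2023, 2023, 2024],
    "month": [11, 12, 1],
    "value": [1.0, 2.0, 3.0],
})

# Writes a directory tree such as:
#   warehouse/year=2023/month=11/<part>.parquet
#   warehouse/year=2023/month=12/<part>.parquet
#   warehouse/year=2024/month=1/<part>.parquet
df.to_parquet("warehouse", partition_cols=["year", "month"])
```

Each `key=value/` level is just a directory on disk, so from DVC's point of view the whole thing is an ordinary (if enormous) directory tree.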
@jorgeorpinel What do you mean by partitioning? Could you please elaborate?
Data files or compressed archives that are divided into several parts, as old RAR files allowed, to mention a simple example, or as Apache Parquet seems to support, for a more current data-science example. I suppose Hadoop also partitions HDFS datasets to distribute them across a cluster, but honestly I'm not familiar enough with Hadoop to know whether this would have an impact on DVC usage, or whether that kind of partitioning is transparent to 3rd-party tools. An example of the simple multi-part case is sketched below.
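A minimal sketch of the RAR-style multi-part case, i.e. writing one large file as fixed-size chunks (the `.partNNNN` naming scheme here is made up):

```python
# Split a large file into fixed-size parts, RAR-style. Hypothetical
# naming scheme and chunk size; just to illustrate what "partitioned
# into several parts" means for a plain data file.
from pathlib import Path

def split_file(src: str, dst_dir: str, chunk_size: int = 64 * 1024 * 1024) -> None:
    """Write src as dst_dir/<name>.part0000, .part0001, ..."""
    out = Path(dst_dir)
    out.mkdir(parents=True, exist_ok=True)
    with open(src, "rb") as f:
        index = 0
        while chunk := f.read(chunk_size):
            (out / f"{Path(src).name}.part{index:04d}").write_bytes(chunk)
            index += 1
```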
Notes on compression and bundles (archives) from chat with @shcheklein:
Another case of bundling/partitioning was recently mentioned on Discord. Summary:
My thought here is to implement some sort of bundler/partitioner middleware that is transparent to DVC. It would need adaptors for known formats like H5, TFRecords, ZIP, etc., I guess (a rough sketch of the idea follows). UPDATE: Extracted to iterative/dvc/issues/2708
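A rough sketch of what such middleware could look like: a common adaptor interface plus one concrete adaptor per format. All names here are hypothetical; nothing like this exists in DVC today.

```python
# Hypothetical bundler/partitioner middleware interface with a ZIP
# adaptor. ZIP_STORED archives without compressing the contents.
import zipfile
from pathlib import Path
from typing import Protocol

class BundleAdaptor(Protocol):
    def bundle(self, src_dir: str, bundle_path: str) -> None: ...
    def unbundle(self, bundle_path: str, dst_dir: str) -> None: ...

class ZipAdaptor:
    def bundle(self, src_dir: str, bundle_path: str) -> None:
        with zipfile.ZipFile(bundle_path, "w", zipfile.ZIP_STORED) as zf:
            for path in sorted(Path(src_dir).rglob("*")):
                if path.is_file():
                    zf.write(path, path.relative_to(src_dir))

    def unbundle(self, bundle_path: str, dst_dir: str) -> None:
        with zipfile.ZipFile(bundle_path) as zf:
            zf.extractall(dst_dir)
```

The point of the interface is that DVC (or any other tool) would only ever see the bundle file; which format adaptor produced it stays an internal detail.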
Should this be a Best Practice instead of a Use Case? See #72 (comment).
Hi @efiop @dberenbaum. What do you guys think about this idea? Could it be a good general use case/entry point for new users interested in solving this? If not, is it worth keeping the proposal as a prospective "best practice" instead? Otherwise I'll close this one. Thanks
It seems like more of a best practice to me, but I'm still unclear on what the suggestion is. For a use case, it seems like this would need to show off some feature of DVC, but I don't really see how DVC does anything to solve issues of partitioning or compression. If it's about how we suggest users partition or compress data for DVC, then it makes sense as a best practice. Do we actually have some standard recommendation that would make sense as a best practice?
I don't think so. But I know this topic comes up now and then. I think we mainly assume/recommend no compression at least, but some storage platforms can automatically enable it. Related: iterative/dvc/issues/1239. As for partitioning, I know some large file formats like Parquet (and maybe HDFS?) allow it (even without compression). Related: iterative/dvc/issues/829
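For reference, in the Parquet case compression is just a per-file writer option, independent of partitioning, so "partitioned but uncompressed" is perfectly possible. A minimal pyarrow sketch (table contents and file names are made up):

```python
# The same table written with and without a compression codec; Snappy is
# pyarrow's default. Partitioning (see earlier sketch) is orthogonal.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"id": [1, 2, 3], "value": ["a", "b", "c"]})

pq.write_table(table, "data_snappy.parquet")                     # Snappy (default)
pq.write_table(table, "data_plain.parquet", compression="none")  # uncompressed
```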
Closing, as I don't think this has come up much, and it will also be something we have to document once/if iterative/dvc#829 gets implemented.
Both compression and partitioning seem like very relevant topics since we're in the big data field, yet not much of this is covered in our docs. I'm not even sure how many DVC features support or consider these matters.
Needs more exploration, and let's definitely wait until after #674 is done. Also waiting for feedback on iterative/dvc#1239 (comment). Thoughts, @iterative/engineering?