
best-practice: dataset partitioning and/or data compression? #682

Closed
jorgeorpinel opened this issue Oct 9, 2019 · 10 comments
Labels
A: docs Area: user documentation (gatsby-theme-iterative) status: stale You've been groomed! type: discussion Requires active participation to reach a conclusion.

Comments

@jorgeorpinel (Contributor) commented Oct 9, 2019

Both compression and partitioning seem like very relevant topics since we're in the big data field, yet not much of this is covered in our docs. I'm not even sure how many DVC features support or consider these matters.

This needs more exploration; let's definitely wait until after #674 is done.

Also waiting for feedback on iterative/dvc#1239 (comment)

Thoughts @iterative/engineering?

@jorgeorpinel jorgeorpinel added type: discussion Requires active participation to reach a conclusion. A: docs Area: user documentation (gatsby-theme-iterative) use-cases labels Oct 9, 2019
@jorgeorpinel (Author)

A recent support case involving partitioning (and possibly compression), for example, came up in this chat conversation, where the user has a Parquet Hadoop-style data warehouse with this structure:

/datawarehouse/date=20190101/file1.parquet
/datawarehouse/date=20190101/file2.parquet
/datawarehouse/date=20190102/file1.parquet
/datawarehouse/date=20190102/file2.parquet

and wanted to track and version it in place (externally) as it amounts to PBs of data.
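For illustration, the `date=YYYYMMDD` layout above is the Hive-style partitioning convention. A minimal sketch of building such paths (the `partition_path` helper is hypothetical, not part of DVC):

```python
from pathlib import PurePosixPath

def partition_path(root: str, date: str, filename: str) -> PurePosixPath:
    # Hive-style layout: <root>/date=<YYYYMMDD>/<filename>
    # Each date=... directory is one partition and could, in principle,
    # be tracked/versioned separately instead of the whole warehouse.
    return PurePosixPath(root) / f"date={date}" / filename

p = partition_path("/datawarehouse", "20190101", "file1.parquet")
print(p)  # /datawarehouse/date=20190101/file1.parquet
```

Tracking per-partition rather than per-warehouse keeps each versioned unit small, which matters when the whole tree amounts to PBs.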

@efiop (Contributor) commented Oct 9, 2019

@jorgeorpinel What do you mean by partitioning? Could you please elaborate?

@jorgeorpinel (Author) commented Oct 9, 2019

Data files or compressed archives that are divided into several parts, like old RAR files allowed you to do (to give a simple example), or like Apache Parquet seems to support (for a more current, data-science-oriented example).

I suppose Hadoop also partitions HDFS data sets to distribute them across a cluster, but honestly I'm not very familiar with Hadoop, so I'm not sure whether this would have an impact on DVC usage or whether that kind of partitioning is transparent to third-party tools.

@jorgeorpinel jorgeorpinel changed the title use-cases: new case study about data set partitioning and file compression? use-cases: new case study/ies about data set partitioning and/or data compression? Oct 11, 2019
@jorgeorpinel (Author) commented Oct 11, 2019

Notes on compression and bundles (archives) from chat with @shcheklein:

(Compression) ...can work really well for tabular data (CSV/TSV/JSON/text
...we can compress on a file level (not a bundle of files). So all files are always compressed the same way
while preserving "deduplication" (avoidance of file duplicates in cache)

The only reason that comes to mind to use bundles (tar, for example, to avoid the CPU cost of compressing images that are already compressed) is to overcome some DVC problems when it has to work with a lot of files in a single directory.
So it might help performance-wise, at the cost of extracting the bundles, managing these splits manually, and potentially losing deduplication on the remote end if other projects reuse the same files but split them into bundles differently.

Bundling/unbundling them vs. having a simple dir raises questions: What if I have just one more file? Should I wait for more or bundle it alone? If I want to remove a file from one of the bundles, I will get a new checksum. If I decide to remove half of the files in a bundle, should I merge the rest into another bundle or keep it as is?
(Bundle == zip archive in that specific case; Compression is a completely separate topic by itself.)
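The "compress on a file level while preserving deduplication" idea above can be sketched like this. This is not DVC's actual mechanism, just a hypothetical illustration (the `compress_blob`/`cache_key` helpers are made up): as long as the compression is deterministic, identical files compress to identical blobs, so a content-addressed cache still stores them only once.

```python
import gzip
import hashlib

def compress_blob(data: bytes) -> bytes:
    # mtime=0 makes gzip output deterministic, so identical input
    # files always produce byte-identical compressed blobs
    return gzip.compress(data, mtime=0)

def cache_key(data: bytes) -> str:
    # Content-address the compressed blob; equal keys mean a
    # content-addressed cache stores the file only once
    return hashlib.md5(compress_blob(data)).hexdigest()

k1 = cache_key(b"date,value\n20190101,42\n")
k2 = cache_key(b"date,value\n20190101,42\n")
print(k1 == k2)  # True: same file content -> same cache entry
```

Bundling files together (a tar/zip of many files) breaks exactly this property: the same file inside two differently-split bundles hashes differently.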

@jorgeorpinel (Author) commented Nov 1, 2019

Another case of bundling/partitioning was recently mentioned on Discord. Summary:

have to group the images on NAS2 into H5 files by SKU to get the upload speed ... they each need to be exploded into images again, one folder per SKU ... need horizontal splits (by [random] SKU) and vertical splits (i.e. across h5 files for train/validate/test) and I need this versioned so i can lookup to see which model a SKU was trained against.

My thought here is to implement some sort of bundler/partitioner middleware that is transparent to DVC. It would need adaptors to known formats like H5, TFRecords, ZIP, etc., I guess.

UPDATE: Extracted to iterative/dvc/issues/2708
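To make the middleware idea concrete, here is a rough sketch of the bundling half under stated assumptions: `bundle_by_key` is a hypothetical helper (not an existing DVC or third-party API) that groups files by some key, such as the SKU from the Discord case, and packs each group into one archive that DVC would then track as a single file. A real adaptor for H5/TFRecords would replace the tar step.

```python
import io
import tarfile
from collections import defaultdict

def bundle_by_key(files, key_fn):
    # Group (path, bytes) pairs by key (e.g. SKU) and pack each
    # group into one in-memory tar archive. The resulting bundles
    # are what a tool like DVC would track, instead of many small files.
    groups = defaultdict(list)
    for path, data in files:
        groups[key_fn(path)].append((path, data))
    bundles = {}
    for key, members in groups.items():
        buf = io.BytesIO()
        with tarfile.open(fileobj=buf, mode="w") as tar:
            for path, data in members:
                info = tarfile.TarInfo(name=path)
                info.size = len(data)
                tar.addfile(info, io.BytesIO(data))
        bundles[key] = buf.getvalue()
    return bundles

files = [
    ("sku1/img_a.png", b"fake-image-bytes"),
    ("sku1/img_b.png", b"fake-image-bytes"),
    ("sku2/img_c.png", b"fake-image-bytes"),
]
bundles = bundle_by_key(files, key_fn=lambda p: p.split("/")[0])
print(sorted(bundles))  # ['sku1', 'sku2']
```

The "unbundle" direction (exploding each archive back into one folder per SKU) would be the matching adaptor on the consuming side.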

@jorgeorpinel (Author)

Should this be a Best Practice instead of a Use Case? See #72 (comment).

@jorgeorpinel (Author)

Hi @efiop @dberenbaum. What do you think about this idea? Could it be a good general use case / entry point for new users interested in solving this?

If not, is it worth keeping the proposal as a prospective "best practice" instead?

Otherwise I'll close this one. Thanks

@dberenbaum (Contributor)

It seems like more of a best practice to me, but I'm still unclear on what the suggestion is. For a use case, it seems like this would need to show off some feature of DVC, but I don't really see how DVC does anything to solve issues of partitioning or compression. If it's about how we suggest users partition or compress data for DVC, then it makes sense as a best practice. Do we actually have some standard recommendation that would make sense as a best practice?

@jorgeorpinel jorgeorpinel changed the title use-cases: new case study/ies about data set partitioning and/or data compression? best-practice: dataset partitioning and/or data compression? May 17, 2021
@jorgeorpinel (Author) commented May 17, 2021

Do we actually have some standard recommendation that would make sense as a best practice?

I don't think so. But I know this topic comes up now and then.

I think we mainly assume/recommend no compression, at least; but some storage platforms can enable it automatically. Related: iterative/dvc/issues/1239

As for partitioning I know some large file formats like Parquet (and maybe HDFS?) allow it (even without compression). Related: iterative/dvc/issues/829

@jorgeorpinel (Author)

Closing, as I don't think this has come up much, and it will be something we have to document anyway once/if iterative/dvc#829 gets implemented.

@jorgeorpinel jorgeorpinel added the status: stale You've been groomed! label Apr 27, 2022
3 participants