
best-practice: dataset partitioning and/or data compression? #682

Closed
jorgeorpinel opened this issue Oct 9, 2019 · 10 comments
Labels
A: docs Area: user documentation (gatsby-theme-iterative) status: stale You've been groomed! type: discussion Requires active participation to reach a conclusion.

Comments

@jorgeorpinel (Contributor) commented Oct 9, 2019

Both compression and partitioning seem like very relevant topics since we're in the big data field, yet not much of this is covered in our docs. I'm not even sure how many DVC features support or consider these matters.

This needs more exploration; let's definitely wait until after #674 is done.

Also waiting for feedback on iterative/dvc#1239 (comment)

Thoughts @iterative/engineering?

@jorgeorpinel jorgeorpinel added type: discussion Requires active participation to reach a conclusion. A: docs Area: user documentation (gatsby-theme-iterative) use-cases labels Oct 9, 2019
@jorgeorpinel (Author)

A recent support case involving partitioning (and possibly compression), for example, came up in this chat conversation, where the user has a Parquet Hadoop-style data warehouse with this structure:

/datawarehouse/date=20190101/file1.parquet
/datawarehouse/date=20190101/file2.parquet
/datawarehouse/date=20190102/file1.parquet
/datawarehouse/date=20190102/file2.parquet

and wanted to track and version it in place (externally) as it amounts to PBs of data.
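For illustration, the `date=YYYYMMDD` layout above is the Hive-style partitioning convention. A minimal sketch of building such paths (the `partition_path` helper is hypothetical, not part of DVC):

```python
from pathlib import PurePosixPath

def partition_path(root: str, date: str, filename: str) -> PurePosixPath:
    # Hive-style layout: <root>/date=<YYYYMMDD>/<filename>
    # Each date=... directory is one partition and could, in principle,
    # be tracked/versioned separately instead of the whole warehouse.
    return PurePosixPath(root) / f"date={date}" / filename

p = partition_path("/datawarehouse", "20190101", "file1.parquet")
print(p)  # /datawarehouse/date=20190101/file1.parquet
```

Tracking per-partition rather than per-warehouse keeps each versioned unit small, which matters when the whole tree amounts to PBs.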

@efiop (Contributor) commented Oct 9, 2019

@jorgeorpinel What do you mean by partitioning? Could you please elaborate?

@jorgeorpinel (Author) commented Oct 9, 2019

Data files or compressed archives that are divided into several parts, like old RAR files allowed you to do (to give a simple example), or like Apache Parquet seems to support (for a more current, data-science-oriented example).

I suppose Hadoop also partitions HDFS data sets to distribute them across a cluster, but honestly I'm not very familiar with Hadoop, so I'm not sure whether this would have an impact on DVC usage or whether that kind of partitioning is transparent to third-party tools.

@jorgeorpinel jorgeorpinel changed the title use-cases: new case study about data set partitioning and file compression? use-cases: new case study/ies about data set partitioning and/or data compression? Oct 11, 2019
@jorgeorpinel (Author) commented Oct 11, 2019

Notes on compression and bundles (archives) from chat with @shcheklein:

(Compression) ...can work really well for tabular data (CSV/TSV/JSON/text
...we can compress on a file level (not a bundle of files). So all files are always compressed the same way
while preserving "deduplication" (avoidance of file duplicates in cache)

The only reason that comes to mind to use bundles (tar, for example, to avoid the CPU cost of compressing images that are already compressed) is to overcome some DVC problems when it has to work with a lot of files in a single directory.
So it might help performance-wise, at the cost of extracting the bundles, managing these splits manually, and potentially losing deduplication on the remote end if other projects reuse the same files but split them into bundles differently.

Bundling/unbundling them vs. having a simple dir raises questions: What if I have just one more file? Should I wait for more or bundle it alone? If I want to remove a file from one of the bundles, I will get a new checksum. If I decide to remove half of the files in a bundle, should I merge the rest into another bundle or keep it as is?
(Bundle == zip archive in that specific case; Compression is a completely separate topic by itself.)
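The "compress on a file level while preserving deduplication" idea above can be sketched like this. This is not DVC's actual mechanism, just a hypothetical illustration (the `compress_blob`/`cache_key` helpers are made up): as long as the compression is deterministic, identical files compress to identical blobs, so a content-addressed cache still stores them only once.

```python
import gzip
import hashlib

def compress_blob(data: bytes) -> bytes:
    # mtime=0 makes gzip output deterministic, so identical input
    # files always produce byte-identical compressed blobs
    return gzip.compress(data, mtime=0)

def cache_key(data: bytes) -> str:
    # Content-address the compressed blob; equal keys mean a
    # content-addressed cache stores the file only once
    return hashlib.md5(compress_blob(data)).hexdigest()

k1 = cache_key(b"date,value\n20190101,42\n")
k2 = cache_key(b"date,value\n20190101,42\n")
print(k1 == k2)  # True: same file content -> same cache entry
```

Bundling files together (a tar/zip of many files) breaks exactly this property: the same file inside two differently-split bundles hashes differently.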

@jorgeorpinel (Author) commented Nov 1, 2019

Another case of bundling/partitioning was recently mentioned on Discord. Summary:

have to group the images on NAS2 into H5 files by SKU to get the upload speed ... they each need to be exploded into images again, one folder per SKU ... need horizontal splits (by [random] SKU) and vertical splits (i.e. across h5 files for train/validate/test) and I need this versioned so i can lookup to see which model a SKU was trained against.

My thought here is to implement some sort of bundler/partitioner middleware that is transparent to DVC. It would need adaptors to known formats like H5, TFRecords, ZIP, etc., I guess.

UPDATE: Extracted to iterative/dvc/issues/2708
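To make the middleware idea concrete, here is a rough sketch of the bundling half under stated assumptions: `bundle_by_key` is a hypothetical helper (not an existing DVC or third-party API) that groups files by some key, such as the SKU from the Discord case, and packs each group into one archive that DVC would then track as a single file. A real adaptor for H5/TFRecords would replace the tar step.

```python
import io
import tarfile
from collections import defaultdict

def bundle_by_key(files, key_fn):
    # Group (path, bytes) pairs by key (e.g. SKU) and pack each
    # group into one in-memory tar archive. The resulting bundles
    # are what a tool like DVC would track, instead of many small files.
    groups = defaultdict(list)
    for path, data in files:
        groups[key_fn(path)].append((path, data))
    bundles = {}
    for key, members in groups.items():
        buf = io.BytesIO()
        with tarfile.open(fileobj=buf, mode="w") as tar:
            for path, data in members:
                info = tarfile.TarInfo(name=path)
                info.size = len(data)
                tar.addfile(info, io.BytesIO(data))
        bundles[key] = buf.getvalue()
    return bundles

files = [
    ("sku1/img_a.png", b"fake-image-bytes"),
    ("sku1/img_b.png", b"fake-image-bytes"),
    ("sku2/img_c.png", b"fake-image-bytes"),
]
bundles = bundle_by_key(files, key_fn=lambda p: p.split("/")[0])
print(sorted(bundles))  # ['sku1', 'sku2']
```

The "unbundle" direction (exploding each archive back into one folder per SKU) would be the matching adaptor on the consuming side.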

@jorgeorpinel (Author)

Should this be a Best Practice instead of a Use Case? See #72 (comment).

@jorgeorpinel (Author)

Hi @efiop @dberenbaum. What do you think about this idea? Could it be a good general use case / entry point for new users interested in solving this?

If not, is it worth keeping the proposal as a prospective "best practice" instead?

Otherwise I'll close this one. Thanks

@dberenbaum (Contributor)

It seems like more of a best practice to me, but I'm still unclear on what the suggestion is. For a use case, it seems like this would need to show off some feature of DVC, but I don't really see how DVC does anything to solve issues of partitioning or compression. If it's about how we suggest users partition or compress data for DVC, then it makes sense as a best practice. Do we actually have some standard recommendation that would make sense as a best practice?

@jorgeorpinel jorgeorpinel changed the title use-cases: new case study/ies about data set partitioning and/or data compression? best-practice: dataset partitioning and/or data compression? May 17, 2021
@jorgeorpinel (Author) commented May 17, 2021

Do we actually have some standard recommendation that would make sense as a best practice?

I don't think so. But I know this topic comes up now and then.

I think we mainly assume/recommend no compression, at least; but some storage platforms can enable it automatically. Related: iterative/dvc/issues/1239

As for partitioning I know some large file formats like Parquet (and maybe HDFS?) allow it (even without compression). Related: iterative/dvc/issues/829

@jorgeorpinel (Author)

Closing, as I don't think this has come up much, and it will be something we have to document anyway once/if iterative/dvc#829 gets implemented.

@jorgeorpinel jorgeorpinel added the status: stale You've been groomed! label Apr 27, 2022
3 participants