guide: Data Management #4042

Closed
wants to merge 88 commits into from

jorgeorpinel
Contributor

@jorgeorpinel jorgeorpinel commented Oct 13, 2022

Goal: Clarify DVC's approach to data management (storage, caching, remotes, and the different workflows for handling large data files and directories) early in the guide, and possibly concentrate related info from existing docs here.

p.s. closes #1762

Main file to review: content/docs/user-guide/data-management/index.md

In review app: https://dvc-org-guide-data-mgmt-yrwkeh.herokuapp.com/doc/user-guide/data-management


Next PRs:

some updates around the topic in existing docs
@jorgeorpinel jorgeorpinel added A: docs Area: user documentation (gatsby-theme-iterative) p1-important Active priorities to deal within next sprints C: guide Content of /doc/user-guide labels Oct 13, 2022
@jorgeorpinel jorgeorpinel self-assigned this Oct 13, 2022
@shcheklein shcheklein temporarily deployed to dvc-org-guide-data-mgmt-lfnbat October 13, 2022 06:16 Inactive
@github-actions
Contributor

github-actions bot commented Oct 13, 2022

ab55389

Link Check Report

All 18 links passed!


@shcheklein shcheklein temporarily deployed to dvc-org-guide-data-mgmt-lfnbat October 14, 2022 20:40 Inactive
@jorgeorpinel jorgeorpinel marked this pull request as ready for review October 14, 2022 20:41
@shcheklein shcheklein temporarily deployed to dvc-org-guide-data-mgmt-lfnbat October 17, 2022 23:19 Inactive
@jorgeorpinel

This comment was marked as resolved.

@shcheklein shcheklein temporarily deployed to dvc-org-guide-data-mgmt-lfnbat October 18, 2022 04:38 Inactive
@jorgeorpinel jorgeorpinel force-pushed the guide/data-mgmt-flows branch from d8afae3 to 0ad8293 Compare October 18, 2022 06:09
@shcheklein shcheklein temporarily deployed to dvc-org-guide-data-mgmt-lfnbat October 18, 2022 06:09 Inactive
add prospective figure titles
@jorgeorpinel jorgeorpinel force-pushed the guide/data-mgmt-flows branch from 0ad8293 to a3687ec Compare October 18, 2022 06:23
@shcheklein shcheklein temporarily deployed to dvc-org-guide-data-mgmt-lfnbat October 18, 2022 06:24 Inactive
@yathomasi yathomasi temporarily deployed to dvc-org-guide-data-mgmt-lfnbat October 18, 2022 06:52 Inactive
@yathomasi

This comment was marked as resolved.

@shcheklein shcheklein temporarily deployed to dvc-org-guide-data-mgmt-lfnbat October 19, 2022 01:10 Inactive
jorgeorpinel added a commit that referenced this pull request Feb 20, 2023
* guide: draft structure of Data Mgmt and
some updates around the topic in existing docs

* guide: full text for draft intro to DM

* guide: hide cloud versioning info
per #4042 (review)

* guide: clarify Data Mgmt parts and
add prospective figure titles

* guide: add figure drafts to Data Mgmt

* guide: SCM->VC (Data Mgmt)

* guide: update 2 figs and add 1 more (Data Mgmt)

* guide: roll back unrelated changes
per #4042 (review)

* guide: mention clouds first (DM) and

and update fig. 1
per #4042 (review)

* guide: flatten DM index
per #4042 (review)

* guide: udpates to DM/ DV
moved from #4053 (review)

* guide: add DM/ Data Versioning page

per #4042 (comment)

* guide: update outdated link

* guide: revert more unrelatedly chaqnged files

per #4042 (review)

* guide: remove unused ref link

* guide: DM/ Remote Storage (not just Setup) and

and some links from cmd refs
and avoid term "data remote"
and some admons nearby...

* guide: remove a comment

* guide: draft for DM/ Remote Storage content

* ref: expand config.remote and link to/from Remotes guide

* ref: fix remote config file examples

* guide: complete Remote Config section and

and add Project config section to DM/ DV guide

* ref: rewrite remote add and modify Descs

* guide: complete list of supported storage types

* ref: rewrite remote index page from

extracted from #4053

* guide: clarify `remote modify` phrase in

in the Remote config section of DM/ Remote Storage

* Update content/docs/user-guide/data-management/data-versioning.md

* guide: update versioning config

per #4058 (review)

* guide: don't call remote storage "additional" here

(in the DM/ Remote Storage guide)
per #4058 (review)

Co-authored-by: Dave Berenbaum <[email protected]>

* guide: pull -> download (DM/ RS intro)

* guide: remove "optional" from Remote Storage nav & title

per #4058 (review)

* guide: splits and notes around Data Mgmt index page

rel. #4042 (comment)

* guide: Data Mgmt intro + note updates

* guide: draft of all contents +

+ remove comments

* guide: small impros to Data Mgmt

in prep for #4042 (review)

* guide: rewrite Data Mgmt index in before/after form

per #4042 (review)

* guide: add draft figure for Data Mgmt

* guide: simplify/refocus data mgmt index

per #4042 (review)

* work around commented header bug

* guide: drop DM/ DV page

* guide: rewrite DM intro and

- hide benefits (for now)
- remove codification comment block

* guide: use DM table instead of figure for now

* guide: rewrite Data Mgmt story

* guide: add draft figures to Data Mgmt

* guide: simplify Data Mgmt story and benefits

* guide: remove unused images (DM)

* guide: update Data Mgmt figures (v1)

* guide: rewrite text of Data Mgmt index

* guide: update Data Mgmt figures

* guide: iterate on Data Mgmt again

* guide: update Data Mgmt figs

* guide: more supporting info about Data Mgmt

* guide: update figures (much more concrete) and

and matching text updates

* guide: edits to How it works (Data Mgmt)

* guide: update Data Mgmt figures

Rel. #4042 (comment)

* guide: emphaisze dataset versions in UG fig 1

Rel. #4042 (comment)

* guide: update Data Mgmt figures (with notes),

expand img captions,
and update text accordingly.

* guide: more updates to text and figure styles,

esp. to the first half
and comment some stuff out (temporary)

* guide: update figures and text (Data Mgmt) ...

Using a tabs toggle for the 2nd fig.

* guide: Data Management text (section 1)

finalized for this version of figures

* guide: Data Management (main text)

finalized for this version of figures

* guide: Data Management (secondary text)

pending diagram and code sample(s)

* guide: add DVC data mgmt technical diagram &

dummy sample CLI blocks

* guide: update Data Mgmt text

* guide: udpate text and 2nd figure (Data Mgmt)

* guide: draft 2nd and 3rd figures

* guide: rewrite Data Mgmt/ How it works &

and Benefits/ Tradeoffs

Probably still unfinished... Missing more data versioning info? See HTML comments.

* guide: update drafts of Data Mgmt figures 2, 3

* guide: Data Mgmt improvements and

hide the benefits list for now

* guide: separate from Data Mgmt work

Rel. #4042

* Apply suggestions from code review

* Merge branch main +

* other: links to Remotes guide

* install: Remote Storage guide links

* start: Remote Storage guide links +

* guide: links to Remote Storage page

* Restyled by prettier (#4323)

Co-authored-by: Restyled.io <[email protected]>

---------

Co-authored-by: Dave Berenbaum <[email protected]>
Co-authored-by: rogermparent <[email protected]>
Co-authored-by: restyled-io[bot] <32688539+restyled-io[bot]@users.noreply.github.com>
Co-authored-by: Restyled.io <[email protected]>
@shcheklein shcheklein temporarily deployed to dvc-org-guide-data-mgmt-yrwkeh February 21, 2023 20:38 Inactive
@dberenbaum
Contributor

@shcheklein Are you waiting for me to review this one?

TBH I have some reservations:

  • It's long for an index page
  • The diagrams are complex
  • It ignores cloud versioning
  • Both the intro and "Tradeoffs" sections read more like our use cases than our UG

@shcheklein
Member

@dberenbaum thanks for the feedback. Arguably (@jorgeorpinel can confirm) this PR was one of the most complicated in the making.

Overall, the intention was to explain to users the general changes to the way they deal with data when using DVC. In our experience, the whole concept of content-addressable storage + a new layer of indirection that comes from Git and codification is not something a person would expect. It takes a lot of mental effort to understand the implications of these changes.

The intention here was to:

  • state a general problem (something that every team solves - e.g. they have to come up with some data conventions, data workflow, etc)
  • show an image that would resonate with them (kinda "before") - a messy cloud + people using data directly with AWS CLI, boto, etc
  • show an image that shows the same situation with DVC (codified, less mess, different way to access)
  • highlight in one paragraph benefits
  • here or in the next section, start explaining technical details: cache, how exactly storage is organized, etc. (actual UG content, but we need to set the ground).

Having said all that, I agree with the points you mentioned. Things that come to my mind:

  • Try to shorten the intro (remove the third diagram)
  • Try to simplify them - here I would love to hear your take
  • Tradeoffs - remove them? Move them into a separate subsection?
  • Same for the How it works? Move into a separate page?

> Both the intro and "Tradeoffs" sections read more like our use cases than our UG

Yes, they overlap. But we need to set the ground, as I mentioned, and the biggest difference is that we go technical (e.g. we show CLI commands on diagrams, etc). Anyway, any other ideas for the intro?
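
For readers who have not seen content-addressable storage before, here is a minimal sketch of the idea and of the extra layer of indirection mentioned above. It is an illustration under stated assumptions, not DVC's implementation: the function names, the `.pointer` file format, the SHA-256 hash, and the two-character subdirectory layout are all hypothetical choices made for this example.

```python
import hashlib
from pathlib import Path


def add_to_cache(data_file: Path, cache_dir: Path) -> str:
    """Store a file under a name derived from its content; return the hash."""
    content = data_file.read_bytes()
    digest = hashlib.sha256(content).hexdigest()  # hash choice is an assumption
    # Hypothetical layout: the first two hex characters form a subdirectory.
    target = cache_dir / digest[:2] / digest[2:]
    target.parent.mkdir(parents=True, exist_ok=True)
    if not target.exists():  # identical content is stored only once
        target.write_bytes(content)
    return digest


def write_pointer(data_file: Path, digest: str) -> Path:
    """Write a small text file that a VCS such as Git could track instead of the data."""
    pointer = data_file.with_name(data_file.name + ".pointer")
    pointer.write_text(f"hash: {digest}\npath: {data_file.name}\n")
    return pointer


def checkout(pointer: Path, cache_dir: Path) -> Path:
    """Restore the data file named in the pointer from the cache."""
    fields = dict(line.split(": ", 1) for line in pointer.read_text().splitlines())
    digest = fields["hash"]
    restored = pointer.with_name(fields["path"])
    restored.write_bytes((cache_dir / digest[:2] / digest[2:]).read_bytes())
    return restored
```

The point of the sketch is the indirection: the project only refers to data through the small pointer file, so the obfuscated layout of the cache never has to be read or edited by hand.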

@dberenbaum
Contributor

> • Try to shorten the intro (remove the third diagram)

👍

> • Try to simplify them - here I would love to hear your take

Are there versions of the diagrams I can modify?

> • Tradeoffs - remove them? Move them into a separate subsection?

I would vote to remove this section. If there's anything you think is crucial, can we try to incorporate it into the sections above?

> • Same for the How it works? Move into a separate page?

I think it's fine as is if we make the other changes.

@dberenbaum
Contributor

dberenbaum commented Feb 27, 2023

Following up with specific suggestions for the diagrams.

Diagram 1

I think this one is pretty good, but I have some minor suggestions:

  • scp ssh://...: I guess the idea is to show the raw data comes from somewhere else? Not sure it's worth it and may add more confusion. If you think it's important, we probably need a way to show it outside the S3/"Mass storage" box. Otherwise, I would suggest changing to aws s3 cp ....
  • I get the initial impression that the "ML project" workspace is less organized than the "Mass storage" because the "Mass storage" is shorter, but I think we are trying to show the opposite. For example (see a proposed update to "Mass storage" below):
    • training and validation are expanded in "ML project" but collapsed in "Mass storage." Why not expand in "Mass storage" also?
    • There are not that many versions in "Mass storage." Even where there is a training_split_2, there's no corresponding validation_split_2.
    • On the other hand, the Sarah/ directory doesn't correspond to anything from the left. That may help show how the storage becomes a mess, but I think it adds more confusion since it's not clear what it is.
Mass storage
---
dataset.zip
training_split/
  image_1
  ...
  image_1000
training_split_2/
  image_1
  image_3
  ...
validation_split/
  image_1001
  ...
  image_1100
validation_split_2/
  image_2
  image_7
  ...
models/
  model_linear
  model_sarah
  model_final

Diagram 2

Can we abstract away the cloud storage box on the right with something like "Your cloud storage," similar to the "Git hosting" box?

My reasoning:

  • We already have detail of how the cache works on the left, and this would help simplify the diagrams.
  • This would cover both cache and cloud-versioned remotes (and be more future-proof).
  • It emphasizes the indirection and similarity to Git: as with Git, the structure of what you push doesn't matter because you interact via the DVC interface.

Other suggestions for diagram 2:

  • Can we use the same title where we have "Mass storage" in diagram 1 and "Storage" in diagram 2? Can we use the same color for them?
  • For the "ML project," can we use the same color in both diagrams and use the same basic project structure?
  • For the push/pull arrows, can we make them the same for DVC and Git? It seems like there's something different between them since the DVC section mentions DVC, has two arrows, and labels them push and pull. The Git section only has a single bidirectional arrow. Can we make them identical except for one mentioning DVC and the other mentioning Git so it's clear they work similarly?
  • Do we need "Git repository v1" at the bottom? It looks like it also applies to the DVC cache.
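
To make the "structure of what you push doesn't matter" point concrete, here is a toy sketch of push and pull as symmetric operations over hash-named objects. This is an assumption-laden illustration, not how DVC or Git transfer data: `sync`, `push`, and `pull` are hypothetical helpers, and both the cache and the storage are modeled as flat local directories.

```python
import shutil
from pathlib import Path


def sync(src: Path, dst: Path) -> list[str]:
    """Copy objects that exist in src but not in dst; return their names."""
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for obj in sorted(src.iterdir()):
        if obj.is_file() and not (dst / obj.name).exists():
            shutil.copy2(obj, dst / obj.name)
            copied.append(obj.name)
    return copied


def push(cache: Path, remote: Path) -> list[str]:
    """Upload whatever the remote is missing."""
    return sync(cache, remote)


def pull(cache: Path, remote: Path) -> list[str]:
    """Download whatever the local cache is missing."""
    return sync(remote, cache)
```

In this model the two directions are the same operation with the arguments swapped, which is the symmetry the identical arrows in the diagram would convey.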

@jorgeorpinel
Contributor Author

jorgeorpinel commented Mar 8, 2023

> long for an index page

I don't think we officially made a distinction between index pages and the first page in each section (which is sometimes the first to load - example). But anyway, an easy fix is to move How it works + maybe Tradeoffs to a separate page (as suggested by @shcheklein).

> diagrams are complex

> Try to simplify them... versions of the diagrams I can modify?

We did so many iterations that I was just using pen and paper by the end haha (faster). The designer produced them from there.

> ignores cloud versioning

Not entirely @dberenbaum (see the 💡 Remote storage admon), but I did avoid going into that in detail since it's closer to backing up/sharing data (which has its own guide page), and trying to include it further complicated the doc.

> intro and "Tradeoffs" sections read more like our use cases

The main difference is the level of technicality (ditto), but another option to consider, esp. if we manage to produce less complex diagrams, is to move the intro + maybe Benefits to a Use Case and start this page with the content of How it works (and maybe some of the current diagrams).

@jorgeorpinel
Contributor Author

jorgeorpinel commented Mar 8, 2023

On the 1st diagram:

> the "ML project" workspace is less organized

Initially there were file counts next to the directory sizes to avoid opening trees. Maybe reinstating that, or skipping that aspect altogether, is better than expanding dirs everywhere (too long).

> not that many versions in "Mass storage."

If you think adding a few more makes it more realistic (a problem readers can relate to) then I'd agree.

> the Sarah/ directory ... may help show how the storage becomes a mess

Direct storage access leads to different people doing whatever they want/can (messy, insecure, no trace, etc).

On the 2nd diagram:

> abstract away the cloud storage box on the right

Good idea! Would def. make it simpler and better cover cloud versioning. Its only purpose was to show more than one version controlled by DVC (while only one is on the left side), but a) it's kind of hard to get that without analyzing the details and b) maybe that can be expressed in a different way, like stacking Git versions on the left side.

> "Git repository v1" at the bottom... also applies to the DVC cache

Not sure I get how it applies to the cache.

@jorgeorpinel
Contributor Author

Q: Is there a subset of actions from above (except diagram updates) I can take to make this mergeable?

shcheklein pushed a commit that referenced this pull request Mar 9, 2023
* ref: update links from API to Remotes guide

* guide: update links around Remote Storage and

and other updates to nearby Markdown (e.g. proper admons)

* Roll back unrelated changes

* Restyled by prettier (#4261)

Co-authored-by: Restyled.io <[email protected]>

* ref: bring cloud versioning copy edits of import-url

from
https://github.com/iterative/dvc.org/pull/4260/files#diff-ef95e18c4bd039757695065a23946dc27e28b4727ce07c670cdc096e34dbe3b3

* ref: clarify import-url with cloud versioning

per #4142 (review)

* ref: updates to import-url --version-aware and

update --rev

* ref: add import-url --version aware to Synopsis

per #4089 (comment)

* Restyled by prettier (#4266)

Co-authored-by: Restyled.io <[email protected]>

* Restyled by prettier (#4322)

Co-authored-by: Restyled.io <[email protected]>

* Update content/docs/command-reference/remote/modify.md

Co-authored-by: Oded Messer <[email protected]>

* Update content/docs/command-reference/remote/modify.md

Co-authored-by: Oded Messer <[email protected]>

* Update content/docs/command-reference/push.md

Co-authored-by: Oded Messer <[email protected]>

* yarn format-all

---------

Co-authored-by: Dave Berenbaum <[email protected]>
Co-authored-by: rogermparent <[email protected]>
Co-authored-by: restyled-io[bot] <32688539+restyled-io[bot]@users.noreply.github.com>
Co-authored-by: Restyled.io <[email protected]>
Co-authored-by: Oded Messer <[email protected]>
@dberenbaum
Contributor

dberenbaum commented Mar 10, 2023

> Q: Is there a subset of actions from above (except diagram updates) I can take to make this mergeable?

> • Try to shorten the intro (remove the third diagram)
> • Tradeoffs - remove them?

I'd like to make those changes and update the other diagrams to make it mergeable.

Here is some proposed polishing of the diagrams that incorporates my feedback above:

Diagram 1

Data versioning - Frame 1

Diagram 2

Data versioning - Frame 2

After spending some time on diagram 2, I think we should drop all the cache/hash details. I think we are trying to do too much in one diagram (showing the value prop and how it works), and dropping those details stresses the simplification and indirection that DVC provides.

If we want a diagram for how it works, I would put it in the next section. Here's one idea for that diagram (not sure it's needed since it's explained well in that section already):

Data versioning - Frame 3

Source: https://miro.com/app/board/uXjVP0xtE7A=/ (I know it doesn't help you @jorgeorpinel 😉 )

@shcheklein
Member

I can do the diagrams (working on it). I need a better representation of a messy mass storage atm.

@shcheklein
Member

shcheklein commented Mar 11, 2023

The first diagram, feedback @dberenbaum @jorgeorpinel

source

@dberenbaum
Contributor

@shcheklein I think it does a better job of showing a messy mass storage, although I worry that we focus so much on creating a messy storage that it becomes incomprehensible to readers.

I'd like to see how it connects to diagram 2, like a before and after of the same project.

@shcheklein
Member

> although I worry that we focus so much on creating a messy storage that it becomes incomprehensible to readers.

Yep, but that's the whole point of this - I want this to resonate with some pain points. I agree that it gets complicated though. I don't know a better way for now.

What I definitely don't like is when it gets too simple to the point of not delivering the message at all, or when it's artificial and you "don't trust" it, etc.

Maybe we can take a completely different approach, I don't know.

One more point to remember here - this is the User Guide, so it's fine for it to be more technical / detail-oriented.

@shcheklein shcheklein temporarily deployed to dvc-org-guide-data-mgmt-1g8c0d March 24, 2023 04:38 Inactive
@dberenbaum
Contributor

@shcheklein Do you plan to get back to updating the 2nd diagram? Do you want me to unblock and merge as is? Do you want me to take it over?

@shcheklein
Member

@dberenbaum I plan to get back to it. Can't promise any deadlines though. If you have some ideas for the second diagram - I can try them.

@@ -0,0 +1,38 @@
# Storage locations
Contributor

Not sure this needs its own page, but just noticed it is not included in the sidebar.

@omesser omesser marked this pull request as draft April 24, 2023 12:24
@dberenbaum dberenbaum mentioned this pull request May 3, 2023
@shcheklein
Member

Clearly we don't have enough capacity for this. Closing and we can get back to it later.

@shcheklein shcheklein closed this May 16, 2023
@yathomasi yathomasi deleted the guide/data-mgmt-flows branch July 11, 2023 07:15
Labels
A: docs Area: user documentation (gatsby-theme-iterative) C: guide Content of /doc/user-guide p1-important Active priorities to deal within next sprints
Development

Successfully merging this pull request may close these issues.

ref: clarify about obfuscated remote file structure (same as cache)
5 participants