guide: Data Management #4042
some updates around the topic in existing docs
Link Check Report: All 18 links passed!
d8afae3 to 0ad8293
add prospective figure titles
0ad8293 to a3687ec
* guide: draft structure of Data Mgmt and some updates around the topic in existing docs * guide: full text for draft intro to DM * guide: hide cloud versioning info per #4042 (review) * guide: clarify Data Mgmt parts and add prospective figure titles * guide: add figure drafts to Data Mgmt * guide: SCM->VC (Data Mgmt) * guide: update 2 figs and add 1 more (Data Mgmt) * guide: roll back unrelated changes per #4042 (review) * guide: mention clouds first (DM) and and update fig. 1 per #4042 (review) * guide: flatten DM index per #4042 (review) * guide: udpates to DM/ DV moved from #4053 (review) * guide: add DM/ Data Versioning page per #4042 (comment) * guide: update outdated link * guide: revert more unrelatedly chaqnged files per #4042 (review) * guide: remove unused ref link * guide: DM/ Remote Storage (not just Setup) and and some links from cmd refs and avoid term "data remote" and some admons nearby... * guide: remove a comment * guide: draft for DM/ Remote Storage content * ref: expand config.remote and link to/from Remotes guide * ref: fix remote config file examples * guide: complete Remote Config section and and add Project config section to DM/ DV guide * ref: rewrite remote add and modify Descs * guide: complete list of supported storage types * ref: rewrite remote index page from extracted from #4053 * guide: clarify `remote modify` phrase in in the Remote config section of DM/ Remote Storage * Update content/docs/user-guide/data-management/data-versioning.md * guide: update versioning config per #4058 (review) * guide: don't call remote storage "additional" here (in the DM/ Remote Storage guide) per #4058 (review) Co-authored-by: Dave Berenbaum <[email protected]> * guide: pull -> download (DM/ RS intro) * guide: remove "optional" from Remote Storage nav & title per #4058 (review) * guide: splits and notes around Data Mgmt index page rel. 
#4042 (comment) * guide: Data Mgmt intro + note updates * guide: draft of all contents + + remove comments * guide: small impros to Data Mgmt in prep for #4042 (review) * guide: rewrite Data Mgmt index in before/after form per #4042 (review) * guide: add draft figure for Data Mgmt * guide: simplify/refocus data mgmt index per #4042 (review) * work around commented header bug * guide: drop DM/ DV page * guide: rewrite DM intro and - hide benefits (for now) - remove codification comment block * guide: use DM table instead of figure for now * guide: rewrite Data Mgmt story * guide: add draft figures to Data Mgmt * guide: simplify Data Mgmt story and benefits * guide: remove unused images (DM) * guide: update Data Mgmt figures (v1) * guide: rewrite text of Data Mgmt index * guide: update Data Mgmt figures * guide: iterate on Data Mgmt again * guide: update Data Mgmt figs * guide: more supporting info about Data Mgmt * guide: update figures (much more concrete) and and matching text updates * guide: edits to How it works (Data Mgmt) * guide: update Data Mgmt figures Rel. #4042 (comment) * guide: emphaisze dataset versions in UG fig 1 Rel. #4042 (comment) * guide: update Data Mgmt figures (with notes), expand img captions, and update text accordingly. * guide: more updates to text and figure styles, esp. to the first half and comment some stuff out (temporary) * guide: update figures and text (Data Mgmt) ... Using a tabs toggle for the 2nd fig. * guide: Data Management text (section 1) finalized for this version of figures * guide: Data Management (main text) finalized for this version of figures * guide: Data Management (secondary text) pending diagram and code sample(s) * guide: add DVC data mgmt technical diagram & dummy sample CLI blocks * guide: update Data Mgmt text * guide: udpate text and 2nd figure (Data Mgmt) * guide: draft 2nd and 3rd figures * guide: rewrite Data Mgmt/ How it works & and Benefits/ Tradeoffs Probably still unfinished... 
Missing more data versioning info? See HTML comments. * guide: update drafts of Data Mgmt figures 2, 3 * guide: Data Mgmt improvements and hide the benefits list for now * guide: separate from Data Mgmt work Rel. #4042 * Apply suggestions from code review * Merge branch main + * other: links to Remotes guide * install: Remote Storage guide links * start: Remote Storage guide links + * guide: links to Remote Storage page * Restyled by prettier (#4323) Co-authored-by: Restyled.io <[email protected]> --------- Co-authored-by: Dave Berenbaum <[email protected]> Co-authored-by: rogermparent <[email protected]> Co-authored-by: restyled-io[bot] <32688539+restyled-io[bot]@users.noreply.github.com> Co-authored-by: Restyled.io <[email protected]>
@shcheklein Are you waiting for me to review this one? TBH I have some reservations:
@dberenbaum thanks for the feedback. Arguably (@jorgeorpinel can confirm) this PR was one of the most complicated to make. Overall, the intention was to explain to users the general changes to the way they deal with DVC. In our experience the whole concept of content-addressable storage + a new layer of indirection that comes from Git and codification is not something a person would expect. It takes a lot of mental effort to understand the implications of these changes. The intention here was to:
Saying all that, I agree on the points you mentioned. Things that come to my mind:
Yes, they overlap. But we need to set the ground as I mentioned and the biggest difference is that we go technical (e.g. we show CLI commands on diagrams, etc). Anyways, any other idea on the intro? |
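For readers following the thread, the "content-addressable storage + a new layer of indirection" idea mentioned above can be sketched in a few lines. This is a toy illustration only, not DVC's actual cache layout, hash choice, or metafile format: the function names, the `.pointer` file, the JSON schema, and the use of SHA-256 are all assumptions made for the example.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def store(cache_dir: Path, data: bytes) -> str:
    """Save data under a path derived from its content and return the digest."""
    digest = hashlib.sha256(data).hexdigest()
    target = cache_dir / digest[:2] / digest[2:]
    target.parent.mkdir(parents=True, exist_ok=True)
    if not target.exists():  # identical content is only ever stored once
        target.write_bytes(data)
    return digest


def write_pointer(pointer: Path, digest: str) -> None:
    """Write the small text file that would be tracked in version control."""
    pointer.write_text(json.dumps({"sha256": digest}))


def load(cache_dir: Path, pointer: Path) -> bytes:
    """Follow the pointer (the indirection layer) to the stored content."""
    digest = json.loads(pointer.read_text())["sha256"]
    return (cache_dir / digest[:2] / digest[2:]).read_bytes()


def demo() -> dict:
    """Store two versions of a hypothetical dataset, then switch back to the first."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        cache = root / "cache"
        pointer = root / "data.csv.pointer"  # hypothetical pointer file name

        v1 = store(cache, b"a,b\n1,2\n")
        write_pointer(pointer, v1)
        first = load(cache, pointer)

        v2 = store(cache, b"a,b\n1,2\n3,4\n")
        write_pointer(pointer, v2)
        second = load(cache, pointer)

        # Re-storing identical content returns the same digest and adds no file.
        same_digest = store(cache, b"a,b\n1,2\n") == v1

        # "Checking out" the old version only means restoring the old pointer.
        write_pointer(pointer, v1)
        restored = load(cache, pointer)

        cached_files = sum(1 for p in cache.rglob("*") if p.is_file())
        return {
            "first": first,
            "second": second,
            "restored": restored,
            "same_digest": same_digest,
            "digests_differ": v1 != v2,
            "cached_files": cached_files,
        }


if __name__ == "__main__":
    print(demo())
```

The point of the sketch is that the working copy only ever holds a tiny pointer, while every version of the data lives in the store under a name derived from its content. Switching versions is a change to the pointer, which is the part a version control system would track.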
👍
Are there versions of the diagrams I can modify?
I would vote to remove this section. If there's anything you think is crucial, can we try to incorporate it into the sections above?
I think it's fine as is if we make the other changes. |
Following up with specific suggestions for the diagrams.

Diagram 1

I think this one is pretty good, but I have some minor suggestions:

Diagram 2

Can we abstract away the cloud storage box on the right with something like "Your cloud storage," similar to the "Git hosting" box? My reasoning:
Other suggestions for diagram 2:
I don't think we officially made a difference between index pages and the first page in each section (which sometimes are the first to load - example). But anyway, an easy fix is to move How it works + maybe Tradeoff to a separate page (as suggested by @shcheklein).
We did so many iterations that I was just using pen and paper by the end haha (faster). The designer produced them from there.
Not entirely @dberenbaum (see the 💡 Remote storage admon), but I did avoid going into that in detail since it's closer to backing up/sharing data (which has its own guide page), and trying to include it further complicated the doc.
The main difference is the level of technicality (ditto), but another option to consider, esp. if we manage to produce less complex diagrams, is to move the intro + maybe Benefits to a Use Case and start this page with the content of How it works (and maybe some of the current diagrams).
On the 1st diagram:
Initially there were file counts next to the directory sizes to avoid opening trees. Maybe reinstating that or skipping that aspect altogether is better than expanding dirs everywhere (too long).
If you think adding a few more makes it more realistic (a problem readers can relate to) then I'd agree.
Direct storage access leads to different people doing whatever they want/can (messy, insecure, no trace, etc.).

On the 2nd diagram:
Good idea! Would def. make it simpler and better cover cloud versioning. Its only purpose was to show more than one version controlled by DVC (while only one is on the left side), but a) it's kind of hard to get that without analyzing the details and b) maybe that can be expressed in a different way, like stacking Git versions on the left side.
Not sure I get how it applies to the cache. |
Q: Is there a subset of actions from above (except diagram updates) I can take to make this mergeable? |
* guide: draft structure of Data Mgmt and some updates around the topic in existing docs * guide: full text for draft intro to DM * guide: hide cloud versioning info per #4042 (review) * guide: clarify Data Mgmt parts and add prospective figure titles * guide: add figure drafts to Data Mgmt * guide: SCM->VC (Data Mgmt) * guide: update 2 figs and add 1 more (Data Mgmt) * guide: roll back unrelated changes per #4042 (review) * guide: mention clouds first (DM) and and update fig. 1 per #4042 (review) * guide: flatten DM index per #4042 (review) * guide: udpates to DM/ DV moved from #4053 (review) * guide: add DM/ Data Versioning page per #4042 (comment) * guide: update outdated link * guide: revert more unrelatedly chaqnged files per #4042 (review) * guide: remove unused ref link * guide: DM/ Remote Storage (not just Setup) and and some links from cmd refs and avoid term "data remote" and some admons nearby... * guide: remove a comment * guide: draft for DM/ Remote Storage content * ref: expand config.remote and link to/from Remotes guide * ref: fix remote config file examples * guide: complete Remote Config section and and add Project config section to DM/ DV guide * ref: rewrite remote add and modify Descs * guide: complete list of supported storage types * ref: rewrite remote index page from extracted from #4053 * guide: clarify `remote modify` phrase in in the Remote config section of DM/ Remote Storage * Update content/docs/user-guide/data-management/data-versioning.md * guide: update versioning config per #4058 (review) * guide: don't call remote storage "additional" here (in the DM/ Remote Storage guide) per #4058 (review) Co-authored-by: Dave Berenbaum <[email protected]> * guide: pull -> download (DM/ RS intro) * guide: remove "optional" from Remote Storage nav & title per #4058 (review) * guide: splits and notes around Data Mgmt index page rel. 
#4042 (comment) * guide: Data Mgmt intro + note updates * guide: draft of all contents + + remove comments * guide: small impros to Data Mgmt in prep for #4042 (review) * guide: rewrite Data Mgmt index in before/after form per #4042 (review) * guide: add draft figure for Data Mgmt * guide: simplify/refocus data mgmt index per #4042 (review) * work around commented header bug * guide: drop DM/ DV page * guide: rewrite DM intro and - hide benefits (for now) - remove codification comment block * guide: use DM table instead of figure for now * guide: rewrite Data Mgmt story * guide: add draft figures to Data Mgmt * guide: simplify Data Mgmt story and benefits * guide: remove unused images (DM) * guide: update Data Mgmt figures (v1) * guide: rewrite text of Data Mgmt index * guide: update Data Mgmt figures * guide: iterate on Data Mgmt again * guide: update Data Mgmt figs * guide: more supporting info about Data Mgmt * guide: update figures (much more concrete) and and matching text updates * guide: edits to How it works (Data Mgmt) * guide: update Data Mgmt figures Rel. #4042 (comment) * guide: emphaisze dataset versions in UG fig 1 Rel. #4042 (comment) * guide: update Data Mgmt figures (with notes), expand img captions, and update text accordingly. * guide: more updates to text and figure styles, esp. to the first half and comment some stuff out (temporary) * guide: update figures and text (Data Mgmt) ... Using a tabs toggle for the 2nd fig. * guide: Data Management text (section 1) finalized for this version of figures * guide: Data Management (main text) finalized for this version of figures * guide: Data Management (secondary text) pending diagram and code sample(s) * guide: add DVC data mgmt technical diagram & dummy sample CLI blocks * guide: update Data Mgmt text * guide: udpate text and 2nd figure (Data Mgmt) * guide: draft 2nd and 3rd figures * guide: rewrite Data Mgmt/ How it works & and Benefits/ Tradeoffs Probably still unfinished... 
Missing more data versioning info? See HTML comments. * guide: update drafts of Data Mgmt figures 2, 3 * guide: Data Mgmt improvements and hide the benefits list for now * guide: separate from Data Mgmt work Rel. #4042 * Apply suggestions from code review * Merge branch main + * ref: update links from API to Remotes guide * guide: update links around Remote Storage and and other updates to nearby Markdown (e.g. proper admons) * Roll back unrelated changes * Restyled by prettier (#4261) Co-authored-by: Restyled.io <[email protected]> * ref: bring cloud versioning copy edits of import-url from https://github.com/iterative/dvc.org/pull/4260/files#diff-ef95e18c4bd039757695065a23946dc27e28b4727ce07c670cdc096e34dbe3b3 * ref: clarify import-url with cloud versioning per #4142 (review) * ref: updates to import-url --version-aware and update --rev * ref: add import-url --version aware to Synopsis per #4089 (comment) * Restyled by prettier (#4266) Co-authored-by: Restyled.io <[email protected]> * Restyled by prettier (#4322) Co-authored-by: Restyled.io <[email protected]> * Update content/docs/command-reference/remote/modify.md Co-authored-by: Oded Messer <[email protected]> * Update content/docs/command-reference/remote/modify.md Co-authored-by: Oded Messer <[email protected]> * Update content/docs/command-reference/push.md Co-authored-by: Oded Messer <[email protected]> * yarn format-all --------- Co-authored-by: Dave Berenbaum <[email protected]> Co-authored-by: rogermparent <[email protected]> Co-authored-by: restyled-io[bot] <32688539+restyled-io[bot]@users.noreply.github.com> Co-authored-by: Restyled.io <[email protected]> Co-authored-by: Oded Messer <[email protected]>
I'd like to make those changes and update the other diagrams to make it mergeable. Here is some proposed polishing of the diagrams that incorporates my feedback above:

Diagram 1

Diagram 2

After spending some time on diagram 2, I think we should drop all the cache/hash details. I think we are trying to do too much in one diagram (showing the value prop and how it works), and dropping those details stresses the simplification and indirection that DVC provides. If we want a diagram for how it works, I would put it in the next section. Here's one idea for that diagram (not sure it's needed since it's explained well in that section already):

Source: https://miro.com/app/board/uXjVP0xtE7A=/ (I know it doesn't help you @jorgeorpinel 😉)
I can do the diagrams (working on it). I need a better representation of a messy mass storage atm. |
The first diagram, feedback @dberenbaum @jorgeorpinel |
@shcheklein I think it does a better job of showing a messy mass storage, although I worry that we focus so much on creating a messy storage that it becomes incomprehensible to readers. I'd like to see how it connects to diagram 2, like a before and after of the same project. |
yep, but that's the whole point of this - I want this to resonate with some pain points. I agree that it gets complicated though. I don't know a better way for now. What I definitely don't like is when it gets too simple to the point of not delivering the message at all, or when it's artificial and you "don't trust" it, etc. Maybe we can take a completely different approach, I don't know. One more point here to remember - this is the User Guide, so it's fine for it to be more technical / detail-oriented.
hide 3rd diagam and Tradeoff sectyion
@shcheklein Do you plan to get back to updating the 2nd diagram? Do you want me to unblock and merge as is? Do you want me to take it over? |
@dberenbaum I plan to get back to it. Can't promise any deadlines though. If you have some ideas for the second diagram - I can try them. |
@@ -0,0 +1,38 @@
# Storage locations
Not sure this needs its own page, but just noticed it is not included in the sidebar.
Clearly we don't have enough capacity for this. Closing and we can get back to it later. |
Goal: Clarify DVC's approach to data mgmt (storage, caching, remotes, and different workflows to handle large data files and dirs) early in the guide + possibly concentrate related info from existing docs here.
Main file to review: content/docs/user-guide/data-management/index.md
In review app: https://dvc-org-guide-data-mgmt-yrwkeh.herokuapp.com/doc/user-guide/data-management
Next PRs:
Remote storage guide: remote storage #4058