Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add the option for imports to be backed up to the remote #4581

Closed
wants to merge 1 commit into from

Conversation

charlesbaynham
Copy link
Contributor

@charlesbaynham charlesbaynham commented Sep 20, 2020

Closes #4527 - see this and Discord at https://discordapp.com/channels/485586884165107732/563406153334128681/751199400931491900 and https://discordapp.com/channels/485586884165107732/565699007037571084/751203534351106119 for some discussion.

@shcheklein
Copy link
Member

@charlesbaynham thanks for the PR!! πŸ™

@efiop @charlesbaynham two questions to clarify:

  1. Why don't we use push: true/false or save: true/false ? To make it a bit more generic since it can be applied to other outputs as well, right? I remember we had tickets/requests to turn off pushing some artifacts. This PR can cover that to my mind.

  2. Do we take into account this new field in the dvc gc -c?

@jorgeorpinel
Copy link
Contributor

jorgeorpinel commented Sep 21, 2020

Hi! I also have some questions (left in the issue: #4527 (comment)) β€” mostly arguing against having an explicit backup/push/save in all outs going fwd.

@charlesbaynham
Copy link
Contributor Author

charlesbaynham commented Sep 21, 2020

@jorgeorpinel from your comments in the linked issue:

Does update reset backup to false? What about re-importing (running import again but without --backup?

update does not reset backup to false, it leaves it as it was. Re-importing (without the --backup flag), however, does remove the flag. So you can use import to alter the settings of the import, or update to retrigger it with whatever settings it was set up with.

@charlesbaynham
Copy link
Contributor Author

charlesbaynham commented Sep 21, 2020

So if we're still having to differentiate .dvc files based on contents of deps (regular vs. import) for backward compatibility, doesn't that defeat the purpose of adding the field to all outs going fwd for consistency? I'm of the inclination that backup should only have an effect on import .dvc files...

In a way yes, we are still having to go to deps in order to find out what to do with our outs. But the difference is that we're only doing in the dvc code once - when outs gets loaded from the file - rather than every time we need to use it. That keeps concerns separate (i.e. dependencies vs. outputs) since code which only deals with outputs can now be decoupled from the deps logic fully.

In terms of clutter in the .dvc file, the only files which would be altered by this change would be those that explicitly use the --backup option. And if a canny user wants to manually add a backup: false field to their dvc added file then I don't see why not let them: the logic is cleaner and easier and it won't cause them any surprises.

Also, what about outs in dvc.yaml files? would those also always have the backup field? (This would affect docs BTW) β€” again, feels like a big change for a small use case.

Again, outs could have the field if you wanted to add it, but they wouldn't get it by default.

@charlesbaynham
Copy link
Contributor Author

charlesbaynham commented Sep 21, 2020

@shcheklein

@charlesbaynham thanks for the PR!! πŸ™

You're welcome!

@efiop @charlesbaynham two questions to clarify:

1. Why don't we use `push: true/false` or `save: true/false` ? To make it a bit more generic since it can be applied to other outputs as well, right? I remember we had tickets/requests to turn off pushing some artifacts. This PR can cover that to my mind.

Yes, I'm not totally sold on the naming yet. We discussed push but decided that it would be confusing as a flag for the import command. (dvc import --push ... implies an action to me, but actually it's marking the file for an action in the future). cache is taken already. save is quite good. I quite like having the same flag in outs and in the import command, but that's not essential.

Very open to suggestions for better names!

2. Do we take into account this new field in the `dvc gc -c`?

That's not something I've thought about at all I'm afraid. It might "just work" since the alterations to get_used_cache should mark the imports files as being part of the cache. But I don't know anything about the dvc gc code so I'm not sure.

@jorgeorpinel
Copy link
Contributor

jorgeorpinel commented Sep 22, 2020

the difference is that we're only doing in the dvc code once - when outs gets loaded from the file - rather than every time we need to use it

OK, but it's too low level IMO. Implementation should probably not define requirements πŸ™‚. The question is how should it work so it provides the desired utility, without complicating the other 95% of use cases (that may not need it)? I still think the backup field only makes sense for imports, at least for now. Up to @shcheklein and @efiop though.

the only files which would be altered by this change would be those that explicitly use the --backup option

Sorry, I'm confused. I thought that "The backup entry in outs should be added for all outs" (from #4527 (comment))

I don't see why not let them: the logic is cleaner and easier and it won't cause them any surprises.

It's not about not letting users do something hypothetical. It's just that applying this globally seems out of scope for #4527. It's also speculative to assume that such change would be welcome generally. We don't even know if there's a need for backup: false β€” you can just not use remote storage, for example.

what about outs in dvc.yaml files?

Again, outs could have the field if you wanted

So this has been tested with dvc.yaml files?

Thanks

@jorgeorpinel
Copy link
Contributor

Very open to suggestions for better names!

I don't like save. How about store (as in "remote storage")?

@jorgeorpinel
Copy link
Contributor

Oh and finally, from iterative/dvc.org#1788 (review):

what happens to a regular output after it has been pushed if you then set it to backup:false manually, and run pull?

you get an error for pull/fetch if the file isn't already in your local cache:

WARNING: Cache 'HashInfo(name='md5', value='bbe02f946d5455d74616fc9777557c22', dir_info=None)' not found. > > File 'file.txt' won't be created.                                          
1 file failed                                                                                                                                                                        
ERROR: failed to pull data from the cloud - Checkout failed for following targets:
file.txt
Is your cache up to date?

Could the error message be less cryptic and provide a hint on what's the cause? (backup: false) But see, this change can get complicated very quickly πŸ˜‹

@charlesbaynham
Copy link
Contributor Author

charlesbaynham commented Sep 22, 2020

Sorry, I'm confused. I thought that "The backup entry in outs should be added for all outs" (from #4527 (comment))

Ah, I realise I've been unclear. When I wrote that issue, I was expecting to go through adding backup: true to all outs. However, this PR doesn't do that: it only adds it to files where the presence of the --backup flag changes the backup setting away from its default. Its default is true for all files except imports, where it's false. So the only dvc files that will be altered are imports which were made with the --backup flag, so no changes to existing repos.

What I mean by "added to all outs" is that it's part of the schema, and will be understood by dvc if it's manually added to an out of something that isn't an import. But I haven't added anything on the CLI to achieve this, since it's an obscure feature with cryptic error messages (see #4581 (comment)). So for someone only using dvc via the cli, they'll never see backup in their dvc files unless they call import --backup.

@charlesbaynham
Copy link
Contributor Author

charlesbaynham commented Sep 22, 2020

Could the error message be less cryptic and provide a hint on what's the cause? (backup: false)

Can do if you think it's worth it. I didn't write that error message, it's coming from checkout - it's the normal error if you try to checkout when your cache is missing a file.

I feel like if a power user wants to alter the dvc files under-the-hood then they shouldn't expect finely crafted error messages. But if that's something you'd usually support, I'm sure we can add some code to checkout which identifies backup: false stages.

@charlesbaynham
Copy link
Contributor Author

OK, name poll! @efiop , @shcheklein , @jorgeorpinel could you please vote with emojis for the name of the flag to be added to import/entry to outs:

backup = πŸš€
push = πŸŽ‰
save = πŸ˜„
store = ❀️

@shcheklein
Copy link
Member

We don't even know if there's a need for backup: false β€” you can just not use remote storage, for example.

Good point. The point is, that we do have tickets/requests for this. Something like: "I want to push everything except my models ... or except data." Usually because of some regulations/security.

Another potentially relevant issue- support remote per output (type). I.e. push models into one remote, data into another etc.

Before we introduce any flags into DVC files (that we'll have to support, document etc), I would def spend a few hours trying to see if we can generalize this.

@charlesbaynham
Copy link
Contributor Author

@shcheklein Do you have any examples of these requests? I had a browse through the issues but I couldn't find any. I did find this one (#4040) which is the separate-remotes-by-data-type issue.

@shcheklein
Copy link
Member

@charlesbaynham sure, some of them that I was able to find within a few minutes:

#2095
#4040
https://stackoverflow.com/questions/58952962/how-to-use-different-remotes-for-different-folders

A lot of requests come from Discord or "use-case" discussion, os it's not that easy to find them usually.

@jorgeorpinel
Copy link
Contributor

jorgeorpinel commented Sep 23, 2020

Thanks for the clarifications @charlesbaynham !

it's an obscure feature with cryptic error messages
if a power user wants to alter the dvc files under-the-hood then they shouldn't expect finely crafted error messages

It won't be that obscure after we document it πŸ™‚ (merging iterative/dvc.org/pull/1788)

I didn't write that error ... it's the normal error if you try to checkout when your cache is missing a file.
we can add some code to checkout which identifies backup: false stages

Got it. I think it would be worth handling the specific case (if we do implement backup mode for all outs), if it's not too hard (implementation can dictate this req. πŸ˜‰)

we do have tickets/requests for this. Something like: "I want to push everything except my models
support remote per output

True @shcheklein! But like you mentioned, most such requests are for configuring default remotes per output (#4519 is another example). If that supports remote: null/false, then it's equivalent to the proposed store: false here. The Q is: in that case, should store: false have the same effect? Or be ignored by non-imports

I still incline to only affect imports, but I no longer have a very strong opinion. Again, up to you guys.

@charlesbaynham
Copy link
Contributor Author

charlesbaynham commented Sep 23, 2020

@jorgeorpinel You wrote in #2095 (comment):

so, add a flag per output, something like --out-local to avoid pushing it to remote at all? it should be pretty straightforward. My only concern that number of different options for different types of outputs double every time. We should think about some general mechanism to specify options per output?

I completely agree about the doubling options in

https://github.com/iterative/dvc/blob/0cef951257f0d670c321c44364cfd0cdb2339c55/dvc/stage/utils.py#L43-L68

and in the associated CLI. That's exactly what I had to do here, although I didn't add persist or no_cache variants since import won't create these, and that's currently the only way of making no_backup files via the CLI. I also didn't add --outs-no-backup to dvc run, although I could do. But a more scalable method would definitely be better, especially if we're considering adding differing default remotes etc.

For the dvc file, how about a field called remoteor push_remote for each out? This could take the special values true to mean "push like normal" or false to mean "don't push at all" (equivalent to backup: false). In the future, it could also take strings (e.g. "data") to mean "only push to the remote called "data".

@jorgeorpinel
Copy link
Contributor

jorgeorpinel commented Sep 23, 2020

(Ivan wrote that, not me πŸ˜‹)

My take on this is the scope keeps growing and getting complicated. Probably just implement backup mode for imports as a first step, then decide about the rest in #2095?

@charlesbaynham
Copy link
Contributor Author

charlesbaynham commented Sep 23, 2020

(Ivan wrote that, not me πŸ˜‹)

So he did, sorry @shcheklein !

The true / false setting for "backup mode" in the dvc file (name tbc) could later be extended to also accept (a list of) named remotes for #2095 later on. If that's the plan, we should choose the name accordingly now.

@shcheklein
Copy link
Member

The true / false setting for "backup mode" in the dvc file (name tbc) could later be extended to also accept (a list of) named remotes for #2095 later on. If that's the plan, we should choose the name accordingly now.

yep, that's exactly my concerns/points! Expanding YAML schema is one of those things that we should be doing carefully- since we'll have to support it, it will complicate code potentially (e.g. if import_stage and some_flag = true is not as good as a general code if out.push: add_to_array_to_push)

Also, even if we don't cover different remote names here (it can be stretching it too far), we should at least consider covering a very close case- optionally disable push/save for regular outputs.

@jorgeorpinel
Copy link
Contributor

jorgeorpinel commented Sep 23, 2020

I think the name store still works for all cases. For now it can just accept true/false (and apply to imports only). Later it can accept remote names and apply to all outs. remote can be added as an alias too.

@jorgeorpinel
Copy link
Contributor

jorgeorpinel commented Sep 23, 2020

BTW if/when store applies to all outs, I think the default value should be true everywhere including for imports, but dvc import(-url) should generate a .dvc file with explicit store: false by default. That would address my earlier consistency concerns (no need to differentiate import .dvc files from other outs).

@charlesbaynham
Copy link
Contributor Author

charlesbaynham commented Sep 24, 2020

That's how I originally wanted to do it, but this causes a problem for backwards compatibility. You need a way to migrate existing import .dvc files without changing their behaviour, which means you still need some code which uses the old logic.

@charlesbaynham
Copy link
Contributor Author

I've renamed backup to store as I think we've agreed. Current status is that store :

  • Defaults to true for normal files and to false for imports
  • Is only added to DVC files by the CLI when you call dvc import --store (otherwise, the defaults reproduce current behaviour)
  • Will affect normal files if store: false is added manually to the DVC file, preventing them from being pushed to any remote

@jorgeorpinel I've also added a hint to the error message. You now get e.g.:

$ dvc pull
WARNING: Cache 'HashInfo(name='md5', value='31ebdfce8b77ac49d7f5506dd1495830', dir_info=None)' not found. File 'no_store.txt' won't be created.                                
1 file failed_not_stored                                                                                                                                                       
ERROR: failed to pull data from the cloud - The following files are marked as `store: false` so the remote was not searched, but are not present in local cache:
no_store.txt

Checkout failed for following targets:
no_store.txt
Is your cache up to date?
<https://error.dvc.org/missing-files>

@charlesbaynham charlesbaynham force-pushed the backup-imports branch 2 times, most recently from d021cc4 to 2fe25b1 Compare September 28, 2020 12:04
Comment on lines 266 to 271
prepend = (
"The following files are marked as `store: false` so the "
"remote was not searched, but are not present in local "
"cache:\n{}".format("\n".join(failed_not_stored))
)
m = prepend + "\n\n" + m
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
prepend = (
"The following files are marked as `store: false` so the "
"remote was not searched, but are not present in local "
"cache:\n{}".format("\n".join(failed_not_stored))
)
m = prepend + "\n\n" + m
m = (
"The following outputs are marked as `store: false` so "
"remote storage was disabled, and are not present in the "
"cache:\n{}\n\n".format("\n".join(failed_not_stored))
) + m

May need wrapping?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And how about also introducing a link to somewhere in the docs where this will be explained? (In anticipation of iterative/dvc.org/pull/1788)

Copy link
Contributor

@jorgeorpinel jorgeorpinel Oct 1, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From #4581 (comment):

ERROR: failed to pull data from the cloud - The following files...
no_store.txt

Checkout failed for following targets:
no_store.txt

Those 2 new lines and repeated list items may be confusing though. Not sure how to improve the UI here and it's not a blocking problem but would be nice to try πŸ™‚

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a docs link like this:

The following outputs are marked as `store: false` so remote storage was disabled, and are not present in the cache:
foo
<https://error.dvc.org/checkout-failed-no-store>
Checkout failed for following targets:
foo
Is your cache up to date?
<https://error.dvc.org/missing-files>

and a section to the docs in iterative/dvc.org#1788.

Comment on lines 58 to 63
msg = f"""Incompatible options for output '{path}'.
The dvc file specifies an output with both "cache == False" and
"store == True" which is not possible to satisfy.
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like the indentation is off but not sure.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's right: I copied the other examples I saw. That'll produce:

Incompatible options for output 'path'.
    The dvc file specifies an output with both "cache == False" and
    "store == True" which is not possible to satisfy.

dvc/output/base.py Outdated Show resolved Hide resolved
@@ -311,6 +325,9 @@ def dumpd(self):
if self.persist:
ret[self.PARAM_PERSIST] = self.persist

if self.store == self.stage.is_repo_import:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably

Suggested change
if self.store == self.stage.is_repo_import:
if self.stage.is_repo_import and self.store:

?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, actually that is right (although unclear): the purpose of this is to only add store: <something> to the DVC file if it differs from the default. Default is store: true for most outputs and store: false for repo imports. So this code writes the value of store to the DVC file only if

  • it's a repo import and store: true
  • or it's a normal import and store: false

dvc/output/base.py Outdated Show resolved Hide resolved
dvc/output/base.py Outdated Show resolved Hide resolved
@efiop
Copy link
Contributor

efiop commented Nov 7, 2020

@charlesbaynham Thanks a lot for your patience. Really sorry for this is taking so long.

This PR looks great, just a few minor details to adjust. Let me see if we can take it from here...

@efiop
Copy link
Contributor

efiop commented Nov 7, 2020

@charlesbaynham One more thing, could you please check the committer email in your commits and add it to https://github.com/settings/emails so that github is able to associate the great work you've done with your github account?

@efiop efiop force-pushed the backup-imports branch 3 times, most recently from 12f4f88 to 95eeb25 Compare November 8, 2020 16:29
@charlesbaynham
Copy link
Contributor Author

@charlesbaynham Thanks a lot for your patience. Really sorry for this is taking so long.

This PR looks great, just a few minor details to adjust. Let me see if we can take it from here...

Not at all, and sorry I haven't got around to this: it's on my list! Although I'm very happy for you to take it from here if you want to, of course.

SCHEMA = output.SCHEMA.copy()
del SCHEMA[BaseOutput.PARAM_CACHE]
del SCHEMA[BaseOutput.PARAM_METRIC]
del SCHEMA[BaseOutput.PARAM_STORE]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These lines are not coupling outputs with dependencies. Maybe some refactoring is needed. But this is not related to this PR.

@pared pared removed their assignment Dec 1, 2020
@efiop
Copy link
Contributor

efiop commented Dec 2, 2020

@charlesbaynham Thanks again for the PR and all the effort you've put into it! πŸ™ Even though the functionality is pretty simple and nicely implemented by you, the main reason why this hasn't been merged yet is that store option really highlighted the current mess of terminology and defaults that we have going on between dvcfiles created by dvc add, dvc import, dvc import-url and Stage class. This is apparent in other areas as well, but this might be the final nail πŸ™‚

We are currently starting to work on 2.0 changes and there are some plans to at least reconsider dvc file format as well as internal Stage class split, which might help set this store functionality straight. We'll be having some discussions about it this week, so still keeping this PR on hold for now. πŸ™

@charlesbaynham
Copy link
Contributor Author

Hey @efiop, thanks for letting me know! Do say if there's anything I can do to help

@efiop
Copy link
Contributor

efiop commented Sep 8, 2021

It has been a while, but we now have more proper mechanisms to handle multiple remotes, as well as remote option for outputs. E.g. imports effectively mean smth like

remote:
   pull: source_remote
   push: false

and "backups" could mean just setting push to the remote you want to backup to. We will be soon ready to come back to this. Thank you so much for your patience! πŸ™

@skshetry
Copy link
Member

Closing as stale. Thanks @charlesbaynham.

@skshetry skshetry closed this Jun 13, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

import: Allow pushing imported files to remote
7 participants