Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add: support rollback/recovery from partial/failed dvc add #6387

Closed
pmrowla opened this issue Aug 4, 2021 · 17 comments
Closed

add: support rollback/recovery from partial/failed dvc add #6387

pmrowla opened this issue Aug 4, 2021 · 17 comments
Labels
A: data-management Related to dvc add/checkout/commit/move/remove bug Did we break something? p1-important Important, aka current backlog of things to do

Comments

@pmrowla
Copy link
Contributor

pmrowla commented Aug 4, 2021

If the user does a long running dvc add directory that dies somewhere in the middle, there is no obvious (to the user) way to recover from the state where

  1. DVC has not generated a .dvc file
  2. half of the users files are still in the workspace (and have not been moved into cache)
  3. half of the user’s files have been moved into cache, but have not been checked out/linked (and now appear to be lost)

Currently it's possible to recover from this state manually as long as we have the .dir file for the original complete directory, but not in a straightforward way

discord context https://discord.com/channels/485586884165107732/485596304961962003/872423982463340555

@pmrowla pmrowla added the p1-important Important, aka current backlog of things to do label Aug 4, 2021
@skshetry
Copy link
Member

skshetry commented Aug 4, 2021

Related: #6252.

DVC when committing the cache, moves the files to the cache and then after that is complete, relinks the file back to the workspace. This might result in data loss if it happens to fail in between.

@efiop
Copy link
Contributor

efiop commented Aug 4, 2021

As noted by @karajan1001 , we could try to use hardlink instead of move() so that we have files in both odb and workspace and even if checkout breaks - the workspace will be intact. And if hardlink is not supported it means that rename() will also not be supported and we'll need to copy stuff anyway, so there is no reason to use move()(copy + rename) and we could again keep the workspace intact.

@efiop
Copy link
Contributor

efiop commented Aug 4, 2021

@isidentical Btw, notice how ^^^ is related to our copy/upload/download/link unification discussion 🙂

@karajan1001
Copy link
Contributor

As noted by @karajan1001 , we could try to use hardlink instead of move() so that we have files in both odb and workspace and even if checkout breaks - the workspace will be intact. And if hardlink is not supported it means that rename() will also not be supported and we'll need to copy stuff anyway, so there is no reason to use move()(copy + rename) and we could again keep the workspace intact.

But, the hardlink for default is a bit dangerous. Users might modify it manually without knowing it is a hardlink, and make a corrupt cache.

@efiop
Copy link
Contributor

efiop commented Aug 6, 2021

@karajan1001 Good point! I meant that we could use hardlink temporarily, just so that we have an easier time rolling back if something goes wrong.

@skshetry skshetry mentioned this issue Aug 17, 2021
1 task
@skshetry
Copy link
Member

skshetry commented Aug 27, 2021

I'm bumping the priority here, we already have a guideline against breaking the user's workspace, so we need to realize the correct implementation.

Another user nearly lost their ~140GB of data. We should take data loss seriously. Even if we don't add a support for transactions (rollback/commit), we should not break user's workspace by moving files to the cache.

@skshetry skshetry added bug Did we break something? p0-critical Critical issue. Needs to be fixed ASAP. and removed p1-important Important, aka current backlog of things to do labels Aug 27, 2021
@skshetry
Copy link
Member

Here's the suggestion on how to recover from a failed dvc add from @pmrowla, added for better discovery:

for the record, in this situation where dvc add directory fails or is aborted midway through the copy/move files to cache operation, but before we generate a .dvc file, no data loss has actually occurred. Files which appear to be missing are in .dvc/cache but have not been re-linked back into the workspace.

recovering from this state and completing the partial add can be done manually:

the .dir file for the directory is generated and saved in .dvc/cache prior to moving any files, and the relevant file can be found (eventually) by inspecting any dir files which are present in the cache:

ls .dvc/cache/**/*.dir

at this point, the users workspace will contain some portion of files which have not yet been moved into the DVC cache, while the DVC cache contains the remaining portion (which appear to have been missing or lost in the workspace)

to resolve the issue and "complete" the original add, you can just do

dvc add directory

which will add the remaining workspace ("non-missing") files to the DVC cache, and then generate a directory.dvc file which contains a hash for that partial workspace state

at this point, you can just manually edit directory.dvc so that it only contains

outs:
- md5: <original_dir_hash>.dir
  path: directory

where original_dir_hash is the directory hash which was identified at the beginning of the recovery process

from here

dvc checkout

will checkout the entire/original dataset back into the workspace

@efiop efiop added p2-medium Medium priority, should be done, but less important and removed p0-critical Critical issue. Needs to be fixed ASAP. labels Oct 14, 2021
@efiop
Copy link
Contributor

efiop commented Oct 21, 2021

Another user on discord running into this: https://discord.com/channels/485586884165107732/485596304961962003/900792694023028766

Also, note that we've stopped poisoning our local cache with .dir, but our staging db is still memory-only, so you are no longer able to recover the metadata 🙁

Unfortunately, we are not ready to make staging persistent right now, and we clearly shouldn't bring back poisoning, so instead we should take a closer look at the approach with trying to reflink/hardlink/copy(in this order and without symlink) instead of moving during odb.add in dvc/fs/utils.py::transfer(). These days our object transferring (talking about dvc/objects/transfer.py one) and checkingout are separate (Kudos @pmrowla for unifying object save() and object transfer() 🙏 ) and in more-or-less good shape that they should be able to handle the aforementioned approach instead of move. This will allow us to be failure-proof during transfer(before we've saved .dir into .dvc/cache, user's workspace will be intact) and during later failures (e.g. checkout) we will already have .dir that could be used for recovery (by us in the midterm, but as a hack in the shortterm). So this might be a good partial solution.

@efiop efiop added p0-critical Critical issue. Needs to be fixed ASAP. p1-important Important, aka current backlog of things to do and removed p2-medium Medium priority, should be done, but less important p0-critical Critical issue. Needs to be fixed ASAP. labels Oct 22, 2021
@skshetry
Copy link
Member

skshetry commented Oct 22, 2021

We don't save .dir files now before transfer?

@efiop
Copy link
Contributor

efiop commented Oct 22, 2021

@skshetry Correct, because that would be poisoning the cache. We keep it in memfs-based staging odb.

@skshetry
Copy link
Member

skshetry commented Oct 22, 2021

Why would a complete save poison the cache? The original issue was about the partial commit/save poisoning the cache, not the full save iirc.

@efiop
Copy link
Contributor

efiop commented Oct 22, 2021

@skshetry It is not a complete save though. Complete is when .dir and all files inside of it are all in cache. But before transfer we don't have those files, so we can't save .dir before we complete transfering them.

The approach I was talking about is about keeping users workspace intact, before and during transfer. Currently we use move, which is destructive.

efiop added a commit to efiop/dvc that referenced this issue Oct 24, 2021
If supported, reflink, will provide the fastest transfer possible.
That being said, right now we use `move` in most of the places,
which still takes presidence.

Pre-requisite for iterative#6387
efiop added a commit that referenced this issue Oct 24, 2021
If supported, reflink, will provide the fastest transfer possible.
That being said, right now we use `move` in most of the places,
which still takes presidence.

Pre-requisite for #6387
@dberenbaum
Copy link
Collaborator

@efiop @skshetry Any thoughts on priority or how to solve this one?

@emremrah
Copy link

emremrah commented Feb 17, 2022

Suggestion: at least prioritize to implement a progress bar to show the ETA. Lack of pb was the reason I was interupted the add process. Thanks again for the support for getting the data back.

@daavoo daavoo added the A: data-management Related to dvc add/checkout/commit/move/remove label Feb 22, 2022
@karajan1001
Copy link
Contributor

karajan1001 commented Apr 22, 2022

A Related case but in dvc remove instead of dvc add. https://discord.com/channels/485586884165107732/485596304961962003/966686784081756180

@dberenbaum
Copy link
Collaborator

@efiop @skshetry Any thoughts on priority or how to solve this one?

Ping @iterative/dvc. It sounds severe but I can't recall it coming up frequently. Thoughts on priority? Are we likely to solve it soon?

@efiop
Copy link
Contributor

efiop commented Apr 2, 2023

Seems like this was solved by #6860 for the most part. Let's close for now.

@efiop efiop closed this as completed Apr 2, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A: data-management Related to dvc add/checkout/commit/move/remove bug Did we break something? p1-important Important, aka current backlog of things to do
Projects
None yet
Development

No branches or pull requests

7 participants