push: Unnotified error when pushing data into HTTP remote #7564

Closed
themaikelman opened this issue Apr 10, 2022 · 16 comments
Labels: A: data-sync (Related to dvc get/fetch/import/pull/push), bug (Did we break something?), fs: http (Related to the HTTP filesystem), research

Comments

@themaikelman

themaikelman commented Apr 10, 2022

Bug Report

Issue name

push: Unnotified error when pushing data into HTTP remote

Description

This issue happens when pushing a large batch of files to an HTTP DVC remote. dvc push reports that everything is correct. However, when downloading the files, some of them turn out not to have been uploaded correctly, and so they do not exist on the remote.

Reproduce

  • Download a dataset
  • Add it to DVC
  • Push the data (-> No error)
  • Remove the cache and tmp folders inside .dvc to ensure the data is downloaded from the remote
  • Pull the data again (-> Some files missing)
  • Push the data again (-> "Everything is up to date.")

More detailed:

  1. export DATASET_FOLDER=cars_train
  2. export REMOTE_NAME=my_http_remote
    // Download random dataset
  3. wget http://ai.stanford.edu/~jkrause/car196/cars_train.tgz
  4. tar -xf cars_train.tgz
  5. rm cars_train.tgz
  6. export REMOTE_NAME_FILE="${REMOTE_NAME/"-"/"_"}"
    // Try with HTTP remote:
    // Add and push the data
  7. dvc remote default ${REMOTE_NAME}
  8. dvc add $DATASET_FOLDER
  9. dvc push -v $DATASET_FOLDER
    // Download the data and check
  10. dvc remote default ${REMOTE_NAME}
  11. rm -rf $DATASET_FOLDER
  12. rm -rf .dvc/cache
  13. rm -rf .dvc/tmp
  14. dvc pull -v $DATASET_FOLDER

// Try to push the data again

$ dvc push -v $DATASET_FOLDER
Everything is up to date.

Expected

All the data in the remote, of course ;)

Environment information

Output of dvc doctor:

$ dvc doctor
DVC version: 2.10.1 (pip)
---------------------------------
Platform: Python 3.8.10 on Linux-5.4.0-104-generic-x86_64-with-glibc2.29
Supports:
        azure (adlfs = 2021.10.0, knack = 0.9.0, azure-identity = 1.7.1),
        webhdfs (fsspec = 2022.1.0),
        http (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
        https (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
        ssh (sshfs = 2021.11.2)

Additional Information (if any):
We had the "Session is closed" problem previously:
pull: Using jobs>1 fails with RuntimeError: Session is closed in http remote #7421

Solved with:
fs.http: prevent hangs under some network conditions #7460

And we have proposed this:
dvc push doesn't recognise that files are missing in remote storage #4164
Force push option #7268
push: add --force option to force push without .dir optimization #7532

But the problem is more serious, because you have no way of knowing that the push failed (we would have to ask users to run it at least twice to make sure the data has been uploaded correctly...)

Additionally, when you try to push the files again, the .dir optimization prevents them from being uploaded again, and dvc thinks that everything is uploaded. If the dataset has subfolders, the problem is even worse, as re-adding the files does not correct the issue because of the .dir optimization.
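
Since dvc push gives no indication that an upload failed, one way to check independently is to compare the local cache against the remote before the cache is deleted. The following is only a minimal sketch, not part of DVC: it assumes a DVC 2.x-style cache layout (`.dvc/cache/<first 2 hex chars>/<rest>`), assumes the remote answers HEAD requests with 404 for missing objects, and the helper names `cache_objects`, `http_exists` and `find_missing` are made up for illustration.

```python
import urllib.error
import urllib.request
from pathlib import Path
from typing import Callable, Iterable, List


def cache_objects(cache_dir: Path) -> List[str]:
    """List cache objects as '<2 chars>/<rest>' paths (assumed DVC 2.x layout)."""
    return sorted(
        f"{p.parent.name}/{p.name}"
        for p in cache_dir.glob("??/*")
        if p.is_file()
    )


def http_exists(url: str) -> bool:
    """Return True if a HEAD request succeeds, False on 404.

    Assumption: the remote supports HEAD; a custom remote may not.
    """
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=30):
            return True
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return False
        raise  # any other status is unexpected, so surface it


def find_missing(
    base_url: str,
    objects: Iterable[str],
    exists: Callable[[str], bool] = http_exists,
) -> List[str]:
    """Return the objects for which `exists(<base_url>/<object>)` is False."""
    base = base_url.rstrip("/")
    return [obj for obj in objects if not exists(f"{base}/{obj}")]
```

Run after dvc push and before removing .dvc/cache, `find_missing(remote_url, cache_objects(Path(".dvc/cache")))` would then list the objects that never reached the remote, where `remote_url` is the url value from .dvc/config.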

@daavoo daavoo added bug Did we break something? A: data-sync Related to dvc get/fetch/import/pull/push fs: http Related to the HTTP filesystem labels Apr 11, 2022
@daavoo
Contributor

daavoo commented Apr 11, 2022

cc @dtrifiro could you take a look?

@dtrifiro dtrifiro self-assigned this Apr 12, 2022
@dtrifiro dtrifiro added this to DVC Apr 12, 2022
@dtrifiro dtrifiro moved this to Backlog in DVC Apr 12, 2022
@dtrifiro dtrifiro moved this from Backlog to Todo in DVC Apr 12, 2022
@efiop efiop added the research label Apr 12, 2022
@efiop efiop moved this from Todo to Backlog in DVC Apr 12, 2022
@pmrowla pmrowla removed this from DVC Apr 19, 2022
@dtrifiro dtrifiro added this to DVC May 3, 2022
@dtrifiro dtrifiro moved this to Backlog in DVC May 3, 2022
@efiop efiop assigned dtrifiro and unassigned dtrifiro May 3, 2022
@dtrifiro dtrifiro removed this from DVC May 17, 2022
@themaikelman
Author

Hello! We have finally got the necessary permissions and have published a version of the HTTP remote here: https://github.com/atekoa/dvc-http-remote
We have added two examples to make it easier to use; the error is easier to reproduce with the complex example.
Greetings, and sorry for the delay.

@dtrifiro
Contributor

dtrifiro commented Jun 7, 2022

Thanks! I will look into it asap

@dtrifiro
Contributor

dtrifiro commented Jun 10, 2022

Hi @atekoa,

I tried following your instructions but it seems I cannot get the remote (simple case) to work:

failed to transfer 'md5: 878750719fec346635c5beb4c2132a46': ClientOSError: [Errno None] Can not write request body for http://localhost:8080/remote?remote=0/87/8750719fec346635c5beb4c2132a46

As a note, it seems to be trying to write to /remote in the container, but that folder does not exist. Even after creating it, I still get the above message.

@skshetry
Member

Also, note that DVC's HTTP filesystem does not support a path in the query or fragment part of the URL; it has to be part of the path.
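
To make the distinction concrete, here is a small sketch using only the Python standard library. It assumes, as the failing URL above suggests, that the object path is simply appended to the configured remote URL; `PATH_STYLE` is a hypothetical alternative URL shape and `append_object` / `describe` are illustrative helpers, not DVC functions.

```python
from urllib.parse import urlsplit

# The URL configured in this issue: the remote id lives in the query string.
QUERY_STYLE = "http://localhost:8080/remote?remote=0"
# Hypothetical alternative: the remote id is part of the path.
PATH_STYLE = "http://localhost:8080/remote/0"

OBJECT = "87/8750719fec346635c5beb4c2132a46"


def append_object(base_url: str, obj: str) -> str:
    """Append an object path to the remote URL (assumed behaviour)."""
    return f"{base_url}/{obj}"


def describe(url: str) -> dict:
    """Show which URL component the object path ended up in."""
    parts = urlsplit(url)
    return {"path": parts.path, "query": parts.query}


# With the query-style URL the object path lands in the query string,
# so the path component stays "/remote" for every object.
print(describe(append_object(QUERY_STYLE, OBJECT)))
# With the path-style URL each object gets its own path.
print(describe(append_object(PATH_STYLE, OBJECT)))
```

With the query-style URL, every object shares the same path component, so anything that treats the path as the object's identity cannot tell the objects apart.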

@themaikelman
Author

I have re-launched everything from the beginning and it has worked correctly for me. I have corrected some paths in the Readme that were wrong, but they should not be the cause of the problem.
Maybe it's a permissions issue. I'm running this example on Ubuntu 20.04 with docker-compose version 1.29.2, build 5becea4c and Docker version 20.10.12, build e91ed57

@themaikelman
Author

sequenceDiagram
participant Terminal
participant http_remote
participant StorageSite

Terminal->>Terminal: dvc add/git add/git commit
Terminal->>http_remote: dvc push (http://localhost:8080/remote?remote=0/87/8750719fec346635c5beb4c2132a46)
http_remote->>StorageSite: (local or azure) io.Copy



The HTTP URL contains the remote, so dvc appends the folder/file to the URL:

cat .dvc/config 
[core]
    remote = localhost
['remote "localhost"']
    url = http://localhost:8080/remote?remote=0
    ssl_verify = false

The URL that we receive is parsed with gorilla/mux:

dvcV1 := r.
	Path(pathPrefix+"/{folder}/{file}").
	Queries("remote", "{remote}").
	Subrouter()

dvcV2 := r.
	Path(pathPrefix).
	Queries("remote", "{remote}/{folder}/{file}").
	Subrouter()

So, in this case with DVC2, the URL will be parsed as:
{remote} = 0
{folder} = 87
{file} = 8750719fec346635c5beb4c2132a46
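
For readers who do not use Go, the same decomposition can be sketched in Python. This is only an analogue of the query pattern described above, not the server's code; `ObjectRef` and `parse_v2` are hypothetical names.

```python
from typing import NamedTuple
from urllib.parse import parse_qs, urlsplit


class ObjectRef(NamedTuple):
    remote: str
    folder: str
    file: str


def parse_v2(url: str) -> ObjectRef:
    """Split the 'remote' query value into remote/folder/file."""
    values = parse_qs(urlsplit(url).query).get("remote", [])
    if len(values) != 1:
        raise ValueError("expected exactly one 'remote' query parameter")
    parts = values[0].split("/")
    if len(parts) != 3 or not all(parts):
        raise ValueError("expected '<remote>/<folder>/<file>'")
    return ObjectRef(*parts)


ref = parse_v2(
    "http://localhost:8080/remote?remote=0/87/8750719fec346635c5beb4c2132a46"
)
print(ref)
```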

@themaikelman
Author

The folder is created when you launch docker-compose, so this should not be the problem:

https://github.com/atekoa/dvc-http-remote/blob/main/main.go#L20

@themaikelman
Author

@dtrifiro maybe I can give you some temporary access (via email?) to our HTTP remote dev environment so you can focus on the DVC calling part, keeping in mind that we had other, possibly related, bugs before:
#7421
#7460

@themaikelman
Author

Hello @dtrifiro ,
First of all, I wanted to thank you for the effort you are putting into this issue. I was wondering whether you have been able to work on it, or whether I can do something to help.
Thanks in advance

@dtrifiro
Contributor

dtrifiro commented Jul 6, 2022

Hi @atekoa,
I've been a bit busy in the past few weeks, so I haven't had time to work on this. I should be able to have another go sometime next week though 🙂

@dtrifiro
Contributor

I could not reproduce the issue. Closing this as it's likely related to the custom remote being used.

@guysmoilov
Contributor

Possibly related: #8100

@dtrifiro
Contributor

Hey @atekoa, would you mind trying the fix suggested in the above issue to see if it solves your issue? pip install dvc-data==0.1.15

@samtzai

samtzai commented Aug 22, 2022

Hey @atekoa, would you mind trying the fix suggested in the above issue to see if it solves your issue? pip install dvc-data==0.1.15

I have tried the fix over a multi-folder dataset against our http remote and we do not get the error now. We will check this during next week and will provide additional info.

@themaikelman
Author

dvc-data==0.1.15

It works!
