-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
dvc push
fails with "Timeout on reading data from socket" with HTTP storage
#27
Comments
@sisp, the Gitlab generic packages repository is new to me, does it support uploads to arbitrary prefix? Also, would it be possible to try uploading with |
Thanks for taking the time to reply to my issue, @skshetry! 🙏
You've actually brought up a limitation of GitLab's generic packages repository for usage as a DVC storage backend that I hadn't realized at first. The generic packages repository - from a general point of view - has exactly 3 nesting levels:
That's because the URL path is structured like this:
When I use it as a DVC data storage (i.e. not run-cache storage, I'll get to that in a second), I define those 3 nesting levels as follows (according to the cache directory structure spec):
When you asked about "arbitrary prefix", I stumbled over the run-cache structure spec and realized the run-cache may also be pushed which seems to have more nesting levels, so this might need some thinking - possibly a topic for a dedicated issue on using GitLab's generic packages repository as a storage type including why I think there are some nice benefits if this was officially possible.
It doesn't work either. Here's the code snippet for reproduction: import fsspec
fs = fsspec.filesystem("http", headers={"DEPLOY-TOKEN": "<token>"})
with open("data/data.txt", "rb") as src:
fs.put_file(src, "https://gitlab.com/api/v4/projects/<project_id>/packages/generic/dvc-data/test/data.txt", method="PUT") Running it gives the following error after (in my case) ~5 min:
|
@skshetry I've been doing some digging and think the error originates from the AIOHTTP client timeout configuration in What do you think? And can you confirm that my understanding of the internals is correct? If that's the cause and |
@sisp, we added timeout because large downloads were freezing. See iterative/dvc#7414 and iterative/dvc#7460. However, we are open to introducing |
@skshetry Thanks for the pointers, I hadn't seen those issues. In that case, disabling all timeouts seems to be only a workaround for cases that don't suffer from the random hangs and it may not be robust. Nevertheless, it would at least enable large file uploads to HTTP remote storage when network conditions are good which simply doesn't work right now when the transfer time of a file exceeds 5 minutes. I think introducing data blocks (see iterative/dvc#829) could avoid the problem (in most cases, i.e. when a certain minimum upload bandwidth is available) as well. When each block is sufficiently small such that its transfer time is significantly less than 5 minutes, the timeout would not kick in. Besides, it might be more storage-efficient due to block-level deduplication. That said, would you like the configurable timeout you suggested in any case? I could send some PRs that add this feature. After all, an implementation of iterative/dvc#829 might not become available for a while. |
@sisp, my understanding about How much time does it take if you don't set the timeout? Is it possible for you to try with small files and see if the issue still occurs? Also, can you try uploading with import requests
with open("data/data.txt", "rb") as src:
requests.put(
"https://gitlab.com/api/v4/projects/<project_id>/packages/generic/dvc-data/test/data.txt",
data=src,
headers={"DEPLOY-TOKEN": "<token>"},
timeout=60,
) If you can create a PR for the timeouts, we can review and merge that, but I am not convinced it is the main issue here. |
@skshetry When I don't set the timeout, it takes ~15 minutes, and the transfer succeeds then. I have an upload bandwidth of ~10 Mbit/s. When I upload smaller files which take less than 5 minutes to transfer per file, there is no problem even with
Since Not sure whether it's helpful, but when I upload the 1 GB file with cURL, this is the verbose output:
|
|
@sisp, I don't see any response from the server in the above curl output. |
@skshetry Sorry, I copied the output before cURL was actually done, somehow not thinking it would contain helpful information. 🤦♂️ I've edited the above comment with the server response output. |
Could you please check if httpx solves this issue? Just trying to confirm if it is http1.1 issue (curl is using http/2): # pip install httpx[http2]
import httpx
client = httpx.Client(http2=True)
with open("data/data.txt", "rb") as src:
r = client.put(
"https://gitlab.com/api/v4/projects/<project_id>/packages/generic/dvc-data/test/data.txt",
content=src,
headers={"DEPLOY-TOKEN": "<token>"},
timeout=60,
)
r.raise_for_status() |
@skshetry The
|
@skshetry I'm wondering, isn't it expected that this timeout kicks in when a file upload time exceeds the timeout? After the connection to the server has been established, there is no read happening until the upload has completed. Right? Perhaps the reason why cURL works fine is that there seems to be no read timeout set or even available for cURL. How about the following change to |
@skshetry In addition to my previous suggestion, or as an alternative one, how about adding support for chunked transfer encoding? I've adapted the import requests
with open("data/data.txt", "rb") as src:
def gen(chunk_size):
chunk = src.read(chunk_size)
while chunk:
yield chunk
chunk = src.read(chunk_size)
requests.put(
"https://gitlab.com/api/v4/projects/<project_id>/packages/generic/dvc-data/test/data.txt",
data=gen(1024 * 1024 * 32), # 32 MB
headers={"DEPLOY-TOKEN": "<deploy_token>"},
timeout=60,
) Not every HTTP remote storage may support chunked transfer encoding or it may not be preferred for some reason, so we could make it configurable. The default would be as it is now. One downside of chunked transfer encoding seems to be that HTTP/2 doesn't support it, so it sounds like it's not future-proof:
|
Just to let you know that I have been having these problems lately, which are really slowing down our team since we cannot rely on dvc. We are not using github packages, but dagshub remotes. The issue is even worse with 2.16.0, since in that version it does not even attempt to push the data to the remote. If you repeat the process with dvc 2.9.5, everything works as expected. The problem I am having is that dvc does not complain. It just finishes, and when I look at the remote, the file is smaller than expected. This leads to deep trouble, since the md5 is there in the remote and the only option to modify it is to use curl and manually post the right file to the corresponding md5 url. To me it is much more important that when I push something, it is correctly pushed, than having some hangs now and then while downloading. I strongly support removing the sock_read timeout. In my team we are back to 2.9.5 just for this reason. |
We already do chunking uploads. See https://github.com/fsspec/filesystem_spec/blob/5d6c87e562396f352645df7cd555a2b4b6eec84a/fsspec/implementations/http.py#L274-L278. I was wondering if it was chunking uploads causing timeouts, hence I asked to check with httpx, but that also fails. |
@cerquide, Does uploads work correctly if you set |
It does if you go back to dvc 2.9.5. Have not checked specifically to change sock_read=None in a higher version |
@cerquide GitLab, but I think that's what you meant. Just making sure we're talking about the same thing. 😉
@cerquide That's likely related to iterative/dvc-objects#116. It should be fixed in the next DVC release.
@skshetry Ah, I hadn't noticed that. Strange that it works with |
Ooops sorry, you are right, fortunately I was saying we are not using them, it would have been worst if I said we are experts. |
@cerquide, would it be possible to try with |
@skshetry, in my case (using dagshub https remote), changing to |
@cerquide, can you try installing pip install dvc-objects==0.1.7 This will complain that it is incompatible, but you can ignore for now. Thanks. |
@skshetry If I read the |
@skshetry, with When pulling, for some of the big files I received errors like
but this was maybe due to the number of jobs being too high, with -j2 it worked flawlessly. |
We have been getting reports that the timeout on sock_read was raising timeout error even for chunked uploads, and sometimes even uploading zero-byte files. See: https://github.com/iterative/dvc/issues/8065 and iterative/dvc#8100. These kinds of logics don't belong here, and should be upstreamed (eg: RetryClient/ClientTimeout, etc). We added timeout in iterative/dvc#7460 because of freezes in iterative/dvc#7414. I think we can rollback this for now given that there are lots of report of failures/issues with this line, and if we get any new reports of hangs, we'll investigate it separately.
We have been getting reports that the timeout on sock_read was raising timeout error even for chunked uploads, and sometimes even uploading zero-byte files. See: https://github.com/iterative/dvc/issues/8065 and iterative/dvc#8100. These kinds of logics don't belong here, and should be upstreamed (eg: RetryClient/ClientTimeout, etc). We added timeout in iterative/dvc#7460 because of freezes in iterative/dvc#7414. I think we can rollback this for now given that there are lots of report of failures/issues with this line, and if we get any new reports of hangs, we'll investigate it separately.
DVC over http works like charm, but only for small files. When I try to push a bigger file (say 20MB), I get
|
We should probably get to implementing the configuration of timeout. Original report from @michuhu: |
Also see #15 |
Why is a timeout the solution to freezing downloads? It doesn't seem to fix the underlying issue; justs make DVC unusable on a slow connection. |
Hi, @jinensetpal, we are investigating a potential bug in aiohttp chunking uploads at the moment. yeah, the timeout is not a fix, but a workaround at the moment. |
Relevant upstream issue: aio-libs/aiohttp#7149 |
Gotcha; thanks @skshetry! |
Merged iterative/dvc#8722.
Waiting for upstream in the meantime |
Also facing this issue! How can I set to |
I'm guessing: Can't test since I'm on my smartphone |
Hi folks! How can I set the read_timeout? I am trying to add a new config as suggested, but I got an error:
|
Hi @marcusos, it should be enough to set it using Please make sure you're running |
Hi @dtrifiro, thank you for the help, now I am getting the following error:
Any idea on how to overcome this? dvc = 2.43.0 (via homebrew) |
@marcusos, is it possible to use |
Thank you @skshetry ! |
Closed by aio-libs/aiohttp#7150, thanks to @dtrifiro. We are still waiting for a release though. Meanwhile, there are |
Bug Report
Description
When trying to push some moderately large amount of data (e.g. 1 GB) to a GitLab generic packages repository (via the HTTP remote storage type), the upload fails after a few (e.g. 5) minutes with the following timeout error:
Reproduce
Settings > Repository > Deploy tokens
with scopewrite_package_registry
.mkdir data head -c 1G /dev/urandom > data/data.txt dvc add data/data.txt
Expected
Successful upload of the data.
When I upload the same data to the GitLab generic packages repository via cURL, it works fine:
Environment information
Output of
dvc doctor
:Additional Information (if any):
n/a
The text was updated successfully, but these errors were encountered: