Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dvc push fails with "Timeout on reading data from socket" with HTTP storage #27

Closed
sisp opened this issue Jul 28, 2022 · 41 comments
Closed
Labels
🐛 bug upstream Issues which need to be resolved in an upstream dependency

Comments

@sisp
Copy link
Contributor

sisp commented Jul 28, 2022

Bug Report

Description

When trying to push some moderately large amount of data (e.g. 1 GB) to a GitLab generic packages repository (via the HTTP remote storage type), the upload fails after a few (e.g. 5) minutes with the following timeout error:

$ dvc push -v
2022-07-28 17:33:07,919 DEBUG: Preparing to transfer data from '/home/user/dev/dvc-test/.dvc/cache' to 'https://gitlab.com/api/v4/projects/<project_id>/packages/generic/dvc-data'
2022-07-28 17:33:07,919 DEBUG: Preparing to collect status from 'https://gitlab.com/api/v4/projects/<project_id>/packages/generic/dvc-data'
2022-07-28 17:33:07,919 DEBUG: Collecting status from 'https://gitlab.com/api/v4/projects/<project_id>/packages/generic/dvc-data'
2022-07-28 17:33:07,920 DEBUG: Querying 2 hashes via object_exists                                                                                                                                         
2022-07-28 17:33:08,249 DEBUG: Preparing to collect status from '/home/user/dev/dvc-test/.dvc/cache'                                                                
2022-07-28 17:33:08,250 DEBUG: Collecting status from '/home/user/dev/dvc-test/.dvc/cache'
2022-07-28 17:33:08,331 DEBUG: Uploading '/home/user/dev/dvc-test/.dvc/cache/1c/90d4b78d043e2a58e0746231c9024f' to 'https://gitlab.com/api/v4/projects/<project_id>/packages/generic/dvc-data/1c/90d4b78d043e2a58e0746231c9024f'
2022-07-28 17:38:09,183 ERROR: failed to transfer 'md5: 1c90d4b78d043e2a58e0746231c9024f' - : Timeout on reading data from socket                                                                          
------------------------------------------------------------                                                                                                                                               
Traceback (most recent call last):
  File "<site-packages>/fsspec/asyn.py", line 25, in _runner
    result[0] = await coro
  File "<site-packages>/fsspec/implementations/http.py", line 294, in _put_file
    async with meth(rpath, data=gen_chunks(), **kw) as resp:
  File "<site-packages>/aiohttp_retry/client.py", line 98, in __aenter__
    return await self._do_request()
  File "<site-packages>/aiohttp_retry/client.py", line 86, in _do_request
    raise e
  File "<site-packages>/aiohttp_retry/client.py", line 71, in _do_request
    response: ClientResponse = await self._request(
  File "<site-packages>/aiohttp/client.py", line 559, in _request
    await resp.start(conn)
  File "<site-packages>/aiohttp/client_reqrep.py", line 898, in start
    message, payload = await protocol.read()  # type: ignore[union-attr]
  File "<site-packages>/aiohttp/streams.py", line 616, in read
    await self._waiter
aiohttp.client_exceptions.ServerTimeoutError: Timeout on reading data from socket

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<site-packages>/dvc/data/transfer.py", line 25, in wrapper
    func(fs_path, *args, **kwargs)
  File "<site-packages>/dvc/data/transfer.py", line 162, in func
    return dest.add(
  File "<site-packages>/dvc/objects/db.py", line 126, in add
    self._add_file(
  File "<site-packages>/dvc/objects/db.py", line 98, in _add_file
    return fs.utils.transfer(
  File "<site-packages>/dvc/fs/utils.py", line 96, in transfer
    _try_links(links, from_fs, from_path, to_fs, to_path)
  File "<site-packages>/dvc/fs/utils.py", line 66, in _try_links
    return _copy(from_fs, from_path, to_fs, to_path)
  File "<site-packages>/dvc/fs/utils.py", line 44, in _copy
    return to_fs.upload(from_path, to_path)
  File "<site-packages>/dvc/fs/base.py", line 386, in upload
    return self.put_file(from_info, to_info, callback=cb, size=size)
  File "<site-packages>/dvc/fs/http.py", line 151, in put_file
    super().put_file(
  File "<site-packages>/dvc/fs/base.py", line 329, in put_file
    self.fs.put_file(
  File "<site-packages>/fsspec/asyn.py", line 85, in wrapper
    return sync(self.loop, func, *args, **kwargs)
  File "<site-packages>/fsspec/asyn.py", line 63, in sync
    raise FSTimeoutError from return_result
fsspec.exceptions.FSTimeoutError
------------------------------------------------------------
2022-07-28 17:38:09,197 ERROR: failed to push data to the cloud - 1 files failed to upload                                                                                                                 
------------------------------------------------------------
Traceback (most recent call last):
  File "<site-packages>/dvc/repo/push.py", line 68, in push
    pushed += self.cloud.push(
  File "<site-packages>/dvc/data_cloud.py", line 85, in push
    return transfer(
  File "<site-packages>/dvc/data/transfer.py", line 176, in transfer
    _do_transfer(
  File "<site-packages>/dvc/data/transfer.py", line 116, in _do_transfer
    raise FileTransferError(total_fails)
dvc.exceptions.FileTransferError: 1 files failed to transfer

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<site-packages>/dvc/commands/data_sync.py", line 58, in run
    processed_files_count = self.repo.push(
  File "<site-packages>/dvc/repo/__init__.py", line 48, in wrapper
    return f(repo, *args, **kwargs)
  File "<site-packages>/dvc/repo/push.py", line 72, in push
    raise UploadError(exc.amount)
dvc.exceptions.UploadError: 1 files failed to upload

Reproduce

  1. Create a new project on gitlab.com and take a note of the project ID.
  2. Create a deploy token under Settings > Repository > Deploy tokens with scope write_package_registry.
  3. Clone the project from gitlab.com locally and change the current working directory to the cloned project.
  4. Initialize DVC and configure the remote storage:
    dvc init
    dvc remote add -d gitlab https://gitlab.com/api/v4/projects/<project_id>/packages/generic/dvc-data
    dvc remote modify gitlab method PUT
    dvc remote modify gitlab auth custom
    dvc remote modify --local gitlab custom_auth_header 'DEPLOY-TOKEN'
    dvc remote modify --local gitlab password <deploy_token>
  5. Create some data and and track it with DVC:
    mkdir data
    head -c 1G /dev/urandom > data/data.txt
    dvc add data/data.txt
  6. Push the data to the DVC remote storage:
    dvc push -v

Expected

Successful upload of the data.

When I upload the same data to the GitLab generic packages repository via cURL, it works fine:

curl \
  --header "DEPLOY-TOKEN: <deploy_token>" \
  --upload-file data/data.txt \
  https://gitlab.com/api/v4/projects/<project_id>/packages/generic/dvc-data/test/data.txt

Environment information

Output of dvc doctor:

$ dvc doctor
DVC version: 2.10.2 (snap)
---------------------------------
Platform: Python 3.8.10 on Linux-5.13.0-48-generic-x86_64-with-glibc2.29
Supports:
	azure (adlfs = 2022.4.0, knack = 0.9.0, azure-identity = 1.10.0),
	gdrive (pydrive2 = 1.10.1),
	gs (gcsfs = 2022.3.0),
	webhdfs (fsspec = 2022.3.0),
	http (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
	https (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
	s3 (s3fs = 2022.3.0, boto3 = 1.21.21),
	ssh (sshfs = 2022.3.1)
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/mapper/vgubuntu-root
Caches: local
Remotes: https, https
Workspace directory: ext4 on /dev/mapper/vgubuntu-root
Repo: dvc, git

Additional Information (if any):

n/a

@skshetry
Copy link
Member

@sisp, the Gitlab generic packages repository is new to me, does it support uploads to arbitrary prefix?

Also, would it be possible to try uploading with fsspec's HTTPFileSystem and see if works? Thanks.

@sisp
Copy link
Contributor Author

sisp commented Jul 30, 2022

Thanks for taking the time to reply to my issue, @skshetry! 🙏


[...] does it support uploads to arbitrary prefix?

You've actually brought up a limitation of GitLab's generic packages repository for usage as a DVC storage backend that I hadn't realized at first. The generic packages repository - from a general point of view - has exactly 3 nesting levels:

  1. package name
  2. package version
  3. file name

That's because the URL path is structured like this:

/projects/<project_id>/packages/generic/<package_name>/<package_version>/<file_name>

When I use it as a DVC data storage (i.e. not run-cache storage, I'll get to that in a second), I define those 3 nesting levels as follows (according to the cache directory structure spec):

  1. package name -> dvc-data (hardcoded, arbitrarily chosen name)
  2. package version -> first 2 characters of the file hash
  3. file name -> remaining characters of the file hash

When you asked about "arbitrary prefix", I stumbled over the run-cache structure spec and realized the run-cache may also be pushed which seems to have more nesting levels, so this might need some thinking - possibly a topic for a dedicated issue on using GitLab's generic packages repository as a storage type including why I think there are some nice benefits if this was officially possible.


Also, would it be possible to try uploading with fsspec's HTTPFileSystem and see if works?

It doesn't work either. Here's the code snippet for reproduction:

import fsspec

fs = fsspec.filesystem("http", headers={"DEPLOY-TOKEN": "<token>"})

with open("data/data.txt", "rb") as src:
    fs.put_file(src, "https://gitlab.com/api/v4/projects/<project_id>/packages/generic/dvc-data/test/data.txt", method="PUT")

Running it gives the following error after (in my case) ~5 min:

$ python upload.py 
Traceback (most recent call last):
  File "<site-packages>/fsspec/asyn.py", line 53, in _runner
    result[0] = await coro
  File "<site-packages>/fsspec/implementations/http.py", line 291, in _put_file
    async with meth(rpath, data=gen_chunks(), **kw) as resp:
  File "<site-packages>/aiohttp/client.py", line 1138, in __aenter__
    self._resp = await self._coro
  File "<site-packages>/aiohttp/client.py", line 559, in _request
    await resp.start(conn)
  File "<site-packages>/aiohttp/client_reqrep.py", line 913, in start
    self._continue = None
  File "<site-packages>/aiohttp/helpers.py", line 721, in __exit__
    raise asyncio.TimeoutError from None
asyncio.exceptions.TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/user/dev/dvc/upload.py", line 7, in <module>
    fs.put_file(src, "https://gitlab.com/api/v4/projects/<project_id>/packages/generic/dvc-data/test/data.txt", method="PUT")
  File "<site-packages>/fsspec/asyn.py", line 111, in wrapper
    return sync(self.loop, func, *args, **kwargs)
  File "<site-packages>/fsspec/asyn.py", line 94, in sync
    raise FSTimeoutError from return_result
fsspec.exceptions.FSTimeoutError

@sisp
Copy link
Contributor Author

sisp commented Aug 3, 2022

@skshetry I've been doing some digging and think the error originates from the AIOHTTP client timeout configuration in dvc-objects where the sock_read=self.REQUEST_TIMEOUT setting seems to cause the timeout after exactly 5 minutes. Although self.REQUEST_TIMEOUT is set to 60 seconds, I suspect the 5 minutes timeout results from retries, but I'm not 100% sure. In any case, when I set sock_read=None then the 1 GB test data upload finishes successfully and doesn't time out after 5 minutes anymore. So empirically it appears to fix the problem. From my understanding, sock_read specifies the time until data is read from the server which may take a while when uploading a large data file. So it makes sense to me to set sock_read=None to disable that timeout.

What do you think? And can you confirm that my understanding of the internals is correct? If that's the cause and sock_read=None is a valid fix, I'd be happy to send a PR to dvc-objects.

@skshetry
Copy link
Member

skshetry commented Aug 4, 2022

@sisp, we added timeout because large downloads were freezing. See iterative/dvc#7414 and iterative/dvc#7460.

However, we are open to introducing timeout=0 option that disables any timeout. This will require support in http filesystem in dvc-objects and a new config option in dvc.

@sisp
Copy link
Contributor Author

sisp commented Aug 4, 2022

@skshetry Thanks for the pointers, I hadn't seen those issues. In that case, disabling all timeouts seems to be only a workaround for cases that don't suffer from the random hangs and it may not be robust. Nevertheless, it would at least enable large file uploads to HTTP remote storage when network conditions are good which simply doesn't work right now when the transfer time of a file exceeds 5 minutes.

I think introducing data blocks (see iterative/dvc#829) could avoid the problem (in most cases, i.e. when a certain minimum upload bandwidth is available) as well. When each block is sufficiently small such that its transfer time is significantly less than 5 minutes, the timeout would not kick in. Besides, it might be more storage-efficient due to block-level deduplication.

That said, would you like the configurable timeout you suggested in any case? I could send some PRs that add this feature. After all, an implementation of iterative/dvc#829 might not become available for a while.

@skshetry
Copy link
Member

skshetry commented Aug 4, 2022

@sisp, my understanding about sock_read timeout is that it's a timeout waiting to read the data. Either the server is misconfigured, or there's some bug in aiohttp or your internet connection must be poor. There's no response for 60 seconds, which is strange.

How much time does it take if you don't set the timeout? Is it possible for you to try with small files and see if the issue still occurs? Also, can you try uploading with requests and see if it also times out:

import requests


with open("data/data.txt", "rb") as src:
  requests.put(
    "https://gitlab.com/api/v4/projects/<project_id>/packages/generic/dvc-data/test/data.txt",
    data=src,
    headers={"DEPLOY-TOKEN": "<token>"},
    timeout=60,
)

If you can create a PR for the timeouts, we can review and merge that, but I am not convinced it is the main issue here.

@sisp
Copy link
Contributor Author

sisp commented Aug 4, 2022

@skshetry When I don't set the timeout, it takes ~15 minutes, and the transfer succeeds then. I have an upload bandwidth of ~10 Mbit/s. When I upload smaller files which take less than 5 minutes to transfer per file, there is no problem even with read_sock=self.REQUEST_TIMEOUT. When I run the upload script you posted that uses requests, I get a timeout error after ~15 minutes:

$ python upload_requests.py
Traceback (most recent call last):
  File "<site-packages>/urllib3/connectionpool.py", line 449, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "<site-packages>/urllib3/connectionpool.py", line 444, in _make_request
    httplib_response = conn.getresponse()
  File "/home/user/.pyenv/versions/3.9.13/lib/python3.9/http/client.py", line 1371, in getresponse
    response.begin()
  File "/home/user/.pyenv/versions/3.9.13/lib/python3.9/http/client.py", line 319, in begin
    version, status, reason = self._read_status()
  File "/home/user/.pyenv/versions/3.9.13/lib/python3.9/http/client.py", line 280, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/home/user/.pyenv/versions/3.9.13/lib/python3.9/socket.py", line 704, in readinto
    return self._sock.recv_into(b)
  File "/home/user/.pyenv/versions/3.9.13/lib/python3.9/ssl.py", line 1241, in recv_into
    return self.read(nbytes, buffer)
  File "/home/user/.pyenv/versions/3.9.13/lib/python3.9/ssl.py", line 1099, in read
    return self._sslobj.read(len, buffer)
socket.timeout: The read operation timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<site-packages>/requests/adapters.py", line 489, in send
    resp = conn.urlopen(
  File "<site-packages>/urllib3/connectionpool.py", line 787, in urlopen
    retries = retries.increment(
  File "<site-packages>/urllib3/util/retry.py", line 550, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "<site-packages>/urllib3/packages/six.py", line 770, in reraise
    raise value
  File "<site-packages>/urllib3/connectionpool.py", line 703, in urlopen
    httplib_response = self._make_request(
  File "<site-packages>/urllib3/connectionpool.py", line 451, in _make_request
    self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
  File "<site-packages>/urllib3/connectionpool.py", line 340, in _raise_timeout
    raise ReadTimeoutError(
urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='gitlab.com', port=443): Read timed out. (read timeout=60)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/user/dev/dvc-test/upload_requests.py", line 5, in <module>
    requests.put(
  File "<site-packages>/requests/api.py", line 130, in put
    return request("put", url, data=data, **kwargs)
  File "<site-packages>/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "<site-packages>/requests/sessions.py", line 587, in request
    resp = self.send(prep, **send_kwargs)
  File "<site-packages>/requests/sessions.py", line 701, in send
    r = adapter.send(request, **kwargs)
  File "<site-packages>/requests/adapters.py", line 578, in send
    raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='gitlab.com', port=443): Read timed out. (read timeout=60)

Since requests also fails, it might be a server-side issue. But cURL works. 🤔 Any suggestions for further debugging?

Not sure whether it's helpful, but when I upload the 1 GB file with cURL, this is the verbose output:

$ curl -v -i \
    --header "DEPLOY-TOKEN: <deploy_token>" \
    --upload-file data/data.txt \
    https://gitlab.com/api/v4/projects/<project_id>/packages/generic/dvc-data/test/data.txt
*   Trying 2606:4700:90:0:f22e:fbec:5bed:a9b9:443...
* TCP_NODELAY set
* Connected to gitlab.com (2606:4700:90:0:f22e:fbec:5bed:a9b9) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/certs/ca-certificates.crt
  CApath: /etc/ssl/certs
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
* TLSv1.3 (IN), TLS handshake, Finished (20):
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.3 (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
* ALPN, server accepted to use h2
* Server certificate:
*  subject: C=US; ST=California; L=San Francisco; O=Cloudflare, Inc.; CN=gitlab.com
*  start date: Jul  4 00:00:00 2022 GMT
*  expire date: Oct  2 23:59:59 2022 GMT
*  subjectAltName: host "gitlab.com" matched cert's "gitlab.com"
*  issuer: C=US; O=Cloudflare, Inc.; CN=Cloudflare Inc ECC CA-3
*  SSL certificate verify ok.
* Using HTTP2, server supports multi-use
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* Using Stream ID: 1 (easy handle 0x561a178e25f0)
> PUT /api/v4/projects/<project_id>/packages/generic/dvc-data/test/data.txt HTTP/2
> Host: gitlab.com
> user-agent: curl/7.68.0
> accept: */*
> deploy-token: <deploy_token>
> content-length: 1073741824
> 
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* old SSL session ID is stale, removing
* Connection state changed (MAX_CONCURRENT_STREAMS == 256)!
* We are completely uploaded and fine
< HTTP/2 201 
HTTP/2 201 
< date: Thu, 04 Aug 2022 11:22:38 GMT
date: Thu, 04 Aug 2022 11:22:38 GMT
< content-type: application/json
content-type: application/json
< content-length: 25
content-length: 25
< cache-control: max-age=0, private, must-revalidate
cache-control: max-age=0, private, must-revalidate
< content-security-policy: default-src 'none'
content-security-policy: default-src 'none'
< etag: W/"6025f9a740871d09d0230776e3dbc7bf"
etag: W/"6025f9a740871d09d0230776e3dbc7bf"
< vary: Origin
vary: Origin
< x-content-type-options: nosniff
x-content-type-options: nosniff
< x-frame-options: SAMEORIGIN
x-frame-options: SAMEORIGIN
< x-request-id: 9b42ccd4fa4b83bc8e20e18455a6a620
x-request-id: 9b42ccd4fa4b83bc8e20e18455a6a620
< x-runtime: 0.612624
x-runtime: 0.612624
< strict-transport-security: max-age=31536000
strict-transport-security: max-age=31536000
< referrer-policy: strict-origin-when-cross-origin
referrer-policy: strict-origin-when-cross-origin
< gitlab-lb: fe-25-lb-gprd
gitlab-lb: fe-25-lb-gprd
< gitlab-sv: api-gke-us-east1-c
gitlab-sv: api-gke-us-east1-c
< cf-cache-status: DYNAMIC
cf-cache-status: DYNAMIC
< expect-ct: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
expect-ct: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
< report-to: {"endpoints":[{"url":"https:\/\/a.nel.cloudflare.com\/report\/v3?s=n6pP%2FSYOkDh%2BknPkRth0TioYB8ffnOcVtSOLbxRA2wLm0TBHNH48tqx8eWSkccCCtoIxmllEWc4UeGxVfLM3gGGxalWm3yS0yEGMOCqnozJB6zocN5gkIqFyn51%2FnhdaEVNNNXLYP08%3D"}],"group":"cf-nel","max_age":604800}
report-to: {"endpoints":[{"url":"https:\/\/a.nel.cloudflare.com\/report\/v3?s=n6pP%2FSYOkDh%2BknPkRth0TioYB8ffnOcVtSOLbxRA2wLm0TBHNH48tqx8eWSkccCCtoIxmllEWc4UeGxVfLM3gGGxalWm3yS0yEGMOCqnozJB6zocN5gkIqFyn51%2FnhdaEVNNNXLYP08%3D"}],"group":"cf-nel","max_age":604800}
< nel: {"success_fraction":0.01,"report_to":"cf-nel","max_age":604800}
nel: {"success_fraction":0.01,"report_to":"cf-nel","max_age":604800}
< server: cloudflare
server: cloudflare
< cf-ray: 7356d45c1cc29bce-FRA
cf-ray: 7356d45c1cc29bce-FRA

< 
* Connection #0 to host gitlab.com left intact
{"message":"201 Created"}

@sisp
Copy link
Contributor Author

sisp commented Aug 4, 2022

I just noticed that the upload using requests succeeded, i.e. the file is present in GitLab's generic packages repository, although a timeout exception was raised. 🤔

@skshetry
Copy link
Member

skshetry commented Aug 4, 2022

@sisp, I don't see any response from the server in the above curl output.

@sisp
Copy link
Contributor Author

sisp commented Aug 4, 2022

@skshetry Sorry, I copied the output before cURL was actually done, somehow not thinking it would contain helpful information. 🤦‍♂️ I've edited the above comment with the server response output.

@skshetry
Copy link
Member

skshetry commented Aug 4, 2022

Could you please check if httpx solves this issue? Just trying to confirm if it is http1.1 issue (curl is using http/2):

# pip install httpx[http2]
import httpx

client = httpx.Client(http2=True)
with open("data/data.txt", "rb") as src:
  r = client.put(
    "https://gitlab.com/api/v4/projects/<project_id>/packages/generic/dvc-data/test/data.txt",
    content=src,
    headers={"DEPLOY-TOKEN": "<token>"},
    timeout=60,
  )
  r.raise_for_status()

@sisp
Copy link
Contributor Author

sisp commented Aug 4, 2022

@skshetry The httpx-based upload script also fails with a timeout exception after ~15 minutes:

$ python upload_httpx.py 
Traceback (most recent call last):
  File "<site-packages>/httpcore/_exceptions.py", line 8, in map_exceptions
    yield
  File "<site-packages>/httpcore/backends/sync.py", line 26, in read
    return self._sock.recv(max_bytes)
  File "/home/user/.pyenv/versions/3.9.13/lib/python3.9/ssl.py", line 1226, in recv
    return self.read(buflen)
  File "/home/user/.pyenv/versions/3.9.13/lib/python3.9/ssl.py", line 1101, in read
    return self._sslobj.read(len)
socket.timeout: The read operation timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<site-packages>/httpx/_transports/default.py", line 60, in map_httpcore_exceptions
    yield
  File "<site-packages>/httpx/_transports/default.py", line 218, in handle_request
    resp = self._pool.handle_request(req)
  File "<site-packages>/httpcore/_sync/connection_pool.py", line 253, in handle_request
    raise exc
  File "<site-packages>/httpcore/_sync/connection_pool.py", line 237, in handle_request
    response = connection.handle_request(request)
  File "<site-packages>/httpcore/_sync/connection.py", line 90, in handle_request
    return self._connection.handle_request(request)
  File "<site-packages>/httpcore/_sync/http2.py", line 146, in handle_request
    raise exc
  File "<site-packages>/httpcore/_sync/http2.py", line 114, in handle_request
    status, headers = self._receive_response(
  File "<site-packages>/httpcore/_sync/http2.py", line 231, in _receive_response
    event = self._receive_stream_event(request, stream_id)
  File "<site-packages>/httpcore/_sync/http2.py", line 262, in _receive_stream_event
    self._receive_events(request, stream_id)
  File "<site-packages>/httpcore/_sync/http2.py", line 283, in _receive_events
    events = self._read_incoming_data(request)
  File "<site-packages>/httpcore/_sync/http2.py", line 343, in _read_incoming_data
    raise exc
  File "<site-packages>/httpcore/_sync/http2.py", line 329, in _read_incoming_data
    data = self._network_stream.read(self.READ_NUM_BYTES, timeout)
  File "<site-packages>/httpcore/backends/sync.py", line 26, in read
    return self._sock.recv(max_bytes)
  File "/home/user/.pyenv/versions/3.9.13/lib/python3.9/contextlib.py", line 137, in __exit__
    self.gen.throw(typ, value, traceback)
  File "<site-packages>/httpcore/_exceptions.py", line 12, in map_exceptions
    raise to_exc(exc)
httpcore.ReadTimeout: The read operation timed out

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/user/dev/dvc-test/upload_httpx.py", line 5, in <module>
    r = client.put(
  File "<site-packages>/httpx/_client.py", line 1167, in put
    return self.request(
  File "<site-packages>/httpx/_client.py", line 815, in request
    return self.send(request, auth=auth, follow_redirects=follow_redirects)
  File "<site-packages>/httpx/_client.py", line 902, in send
    response = self._send_handling_auth(
  File "<site-packages>/httpx/_client.py", line 930, in _send_handling_auth
    response = self._send_handling_redirects(
  File "<site-packages>/httpx/_client.py", line 967, in _send_handling_redirects
    response = self._send_single_request(request)
  File "<site-packages>/httpx/_client.py", line 1003, in _send_single_request
    response = transport.handle_request(request)
  File "<site-packages>/httpx/_transports/default.py", line 218, in handle_request
    resp = self._pool.handle_request(req)
  File "/home/user/.pyenv/versions/3.9.13/lib/python3.9/contextlib.py", line 137, in __exit__
    self.gen.throw(typ, value, traceback)
  File "<site-packages>/httpx/_transports/default.py", line 77, in map_httpcore_exceptions
    raise mapped_exc(message) from exc
httpx.ReadTimeout: The read operation timed out

In contrast to the requests-based upload script, no file is present in GitLab's generic packages repository afterwards.

@sisp
Copy link
Contributor Author

sisp commented Aug 5, 2022

@sisp, my understanding about sock_read timeout is that it's a timeout waiting to read the data.

@skshetry I'm wondering, isn't it expected that this timeout kicks in when a file upload time exceeds the timeout? After the connection to the server has been established, there is no read happening until the upload has completed. Right? Perhaps the reason why cURL works fine is that there seems to be no read timeout set or even available for cURL.

How about the following change to dvc-objects? When DVC downloads files, sock_read remains as it is now. When DVC uploads files, sock_read=None is set because the delay until data will be read may be long. iterative/dvc#7414 and iterative/dvc#7460 are only about download issues, so I'd imagine this fix should not affect them.

@sisp
Copy link
Contributor Author

sisp commented Aug 5, 2022

@skshetry In addition to my previous suggestion, or as an alternative one, how about adding support for chunked transfer encoding? I've adapted the requests-based upload script to use chunked transfer encoding and it works fine with the timeout setting:

import requests


with open("data/data.txt", "rb") as src:
  def gen(chunk_size):
    chunk = src.read(chunk_size)
    while chunk:
      yield chunk
      chunk = src.read(chunk_size)

  requests.put(
    "https://gitlab.com/api/v4/projects/<project_id>/packages/generic/dvc-data/test/data.txt",
    data=gen(1024 * 1024 * 32),  # 32 MB
    headers={"DEPLOY-TOKEN": "<deploy_token>"},
    timeout=60,
  )

Not every HTTP remote storage may support chunked transfer encoding or it may not be preferred for some reason, so we could make it configurable. The default would be as it is now.

One downside of chunked transfer encoding seems to be that HTTP/2 doesn't support it, so it sounds like it's not future-proof:

Note: HTTP/2 doesn't support HTTP 1.1's chunked transfer encoding mechanism, as it provides its own, more efficient, mechanisms for data streaming.

https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Transfer-Encoding

@cerquide
Copy link

cerquide commented Aug 5, 2022

Just to let you know that I have been having these problems lately, which are really slowing down our team since we cannot rely on dvc. We are not using github packages, but dagshub remotes. The issue is even worse with 2.16.0, since in that version it does not even attempt to push the data to the remote.
It seems that everything starts with release 2.10.0 where the timeout was introduced.

iterative/dvc#7460 (comment)

If you repeat the process with dvc 2.9.5, everything works as expected.

The problem I am having is that dvc does not complain. It just finishes, and when I look at the remote, the file is smaller than expected. This leads to deep trouble, since the md5 is there in the remote and the only option to modify it is to use curl and manually post the right file to the corresponding md5 url.

To me it is much more important that when I push something, it is correctly pushed, than having some hangs now and then while downloading. I strongly support removing the sock_read timeout. In my team we are back to 2.9.5 just for this reason.

@skshetry
Copy link
Member

skshetry commented Aug 5, 2022

@skshetry In addition to my previous suggestion, or as an alternative one, how about adding support for chunked transfer encoding?

We already do chunking uploads. See https://github.com/fsspec/filesystem_spec/blob/5d6c87e562396f352645df7cd555a2b4b6eec84a/fsspec/implementations/http.py#L274-L278. I was wondering if it was chunking uploads causing timeouts, hence I asked to check with httpx, but that also fails.

@skshetry
Copy link
Member

skshetry commented Aug 5, 2022

Just to let you know that I have been having these problems lately, which are really slowing down our team since we cannot rely on dvc. We are not using github packages, but dagshub remotes. The issue is even worse with 2.16.0, since in that version it does not even attempt to push the data to the remote. It seems that everything starts with release 2.10.0 where the timeout was introduced.

@cerquide, Does uploads work correctly if you set sock_read=None? I think it's a different issue than timeout.

@cerquide
Copy link

cerquide commented Aug 5, 2022

It does if you go back to dvc 2.9.5. Have not checked specifically to change sock_read=None in a higher version

@sisp
Copy link
Contributor Author

sisp commented Aug 5, 2022

We are not using github packages [...]

@cerquide GitLab, but I think that's what you meant. Just making sure we're talking about the same thing. 😉

The issue is even worse with 2.16.0, since in that version it does not even attempt to push the data to the remote.

@cerquide That's likely related to iterative/dvc-objects#116. It should be fixed in the next DVC release.

We already do chunking uploads. [...]

@skshetry Ah, I hadn't noticed that. Strange that it works with requests and chunked transfer encoding but not with aiohttp. 🤔

@cerquide
Copy link

cerquide commented Aug 5, 2022

@cerquide GitLab, but I think that's what you meant. Just making sure we're talking about the same thing. wink

Ooops sorry, you are right, fortunately I was saying we are not using them, it would have been worst if I said we are experts.

@skshetry
Copy link
Member

skshetry commented Aug 5, 2022

@cerquide, would it be possible to try with sock_read=None and see if it fixes the issue?

@cerquide
Copy link

cerquide commented Aug 5, 2022

@cerquide, would it be possible to try with sock_read=None and see if it fixes the issue?

@skshetry, in my case (using dagshub https remote), changing to sock_read=None fixes the issue both for dvc 2.10.2 and for dvc 2.15.0 (in this case the file has moved to fs/implementations/http.py). With dvc 2.16.0 I think @sisp is right and I am suffering from a different disease.

@skshetry
Copy link
Member

skshetry commented Aug 5, 2022

@cerquide, can you try installing dvc-objects==0.1.7 with dvc==2.16.0, and see if sock_read=None fixes the issue:

pip install dvc-objects==0.1.7

This will complain that it is incompatible, but you can ignore for now. Thanks.

@sisp
Copy link
Contributor Author

sisp commented Aug 5, 2022

@skshetry If I read the requests library code right, there is no read timeout for chunked requests. I've tried understanding how aiohttp's chunked transfer encoding works for requests, but it looks a bit convoluted at first glance. I'm not deeply familiar with the exact protocol of chunked transfer encoding, but I wonder whether the read timeout should be reset after each successful chunk. I think aiohttp doesn't do that, but I also don't know whether that would actually be correct.

@cerquide
Copy link

cerquide commented Aug 5, 2022

@cerquide, can you try installing dvc-objects==0.1.7 with dvc==2.16.0, and see if sock_read=None fixes the issue:

@skshetry, with sock_read=None pushing works like a charm for me with dvc-objects==0.1.7 and dvc==2.16.0.

When pulling, for some of the big files I received errors like

ERROR: failed to transfer 'md5: 57bba9a44a71691b2bb2ec11159f55e7' - Response payload is not completed

but this was maybe due to the number of jobs being too high, with -j2 it worked flawlessly.

skshetry referenced this issue in iterative/dvc-objects Aug 5, 2022
We have been getting reports that the timeout on sock_read was raising
timeout error even for chunked uploads, and sometimes even uploading
zero-byte files.

See: https://github.com/iterative/dvc/issues/8065
and iterative/dvc#8100.

These kinds of logics don't belong here, and should be upstreamed (eg: RetryClient/ClientTimeout, etc).
We added timeout in iterative/dvc#7460 because of freezes in
iterative/dvc#7414.

I think we can rollback this for now given that there are lots of report of failures/issues with this line,
and if we get any new reports of hangs, we'll investigate it separately.
efiop referenced this issue in iterative/dvc-objects Aug 5, 2022
We have been getting reports that the timeout on sock_read was raising
timeout error even for chunked uploads, and sometimes even uploading
zero-byte files.

See: https://github.com/iterative/dvc/issues/8065
and iterative/dvc#8100.

These kinds of logics don't belong here, and should be upstreamed (eg: RetryClient/ClientTimeout, etc).
We added timeout in iterative/dvc#7460 because of freezes in
iterative/dvc#7414.

I think we can rollback this for now given that there are lots of report of failures/issues with this line,
and if we get any new reports of hangs, we'll investigate it separately.
@michuhu
Copy link

michuhu commented Oct 28, 2022

DVC over http works like charm, but only for small files. When I try to push a bigger file (say 20MB), I get Timeout on reading data from socket. I can see was an issue on github and it should be fixed for dvc-objects==0.1.7, however, I have dvc_objects = 0.7.0 and the problem occurs.

2022-10-27 13:22:39,243 ERROR: failed to transfer 'md5: 07d00c327f993560a68ba6c02127320d' - : Timeout on reading data from socket
------------------------------------------------------------
Traceback (most recent call last):
  File "fsspec/asyn.py", line 53, in _runner
  File "fsspec/implementations/http.py", line 297, in _put_file
  File "aiohttp_retry/client.py", line 149, in __aenter__
  File "aiohttp_retry/client.py", line 134, in _do_request
  File "aiohttp_retry/client.py", line 100, in _do_request
  File "aiohttp/client.py", line 560, in _request
  File "aiohttp/client_reqrep.py", line 899, in start
  File "aiohttp/streams.py", line 616, in read
aiohttp.client_exceptions.ServerTimeoutError: Timeout on reading data from socket
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "dvc_data/hashfile/transfer.py", line 40, in wrapper
  File "dvc_data/hashfile/transfer.py", line 202, in func
  File "dvc_data/hashfile/db/__init__.py", line 105, in add
  File "dvc_objects/db.py", line 136, in add
  File "dvc_objects/fs/generic.py", line 132, in transfer
  File "dvc_objects/fs/generic.py", line 93, in _try_links
  File "dvc_objects/fs/generic.py", line 69, in copy
  File "dvc_objects/fs/callbacks.py", line 58, in func
  File "dvc_objects/fs/callbacks.py", line 41, in wrapped
  File "dvc_http/__init__.py", line 146, in put_file
  File "dvc_objects/fs/base.py", line 412, in put_file
  File "fsspec/asyn.py", line 111, in wrapper
  File "fsspec/asyn.py", line 94, in sync
fsspec.exceptions.FSTimeoutError
------------------------------------------------------------
2022-10-27 13:22:39,344 ERROR: failed to push data to the cloud - 1 files failed to upload
------------------------------------------------------------
Traceback (most recent call last):
  File "dvc/commands/data_sync.py", line 59, in run
  File "dvc/repo/__init__.py", line 48, in wrapper
  File "dvc/repo/push.py", line 117, in push
dvc.exceptions.UploadError: 1 files failed to upload
------------------------------------------------------------

@pared
Copy link
Contributor

pared commented Oct 28, 2022

We should probably get to implementing the configuration of timeout.

Original report from @michuhu:
https://discord.com/channels/485586884165107732/563406153334128681/1035466365659779093

@dtrifiro
Copy link
Contributor

Also see #15

@omesser omesser added the bug label Dec 14, 2022
@jinensetpal
Copy link

@sisp, we added timeout because large downloads were freezing. See iterative/dvc#7414 and iterative/dvc#7460.

However, we are open to introducing timeout=0 option that disables any timeout. This will require support in http filesystem in dvc-objects and a new config option in dvc.

Why is a timeout the solution to freezing downloads? It doesn't seem to fix the underlying issue; justs make DVC unusable on a slow connection.

@skshetry
Copy link
Member

Hi, @jinensetpal, we are investigating a potential bug in aiohttp chunking uploads at the moment. yeah, the timeout is not a fix, but a workaround at the moment.

@dtrifiro
Copy link
Contributor

dtrifiro commented Dec 29, 2022

Relevant upstream issue: aio-libs/aiohttp#7149
In the meantime, once iterative/dvc#8722 is merged, it will be possible to set the sock_read timeout in the remote config while we wait for a proper upstream fix.

@jinensetpal
Copy link

Gotcha; thanks @skshetry!

@efiop efiop transferred this issue from iterative/dvc Jan 1, 2023
@dtrifiro dtrifiro moved this from In Progress to Review In Progress in DVC Jan 2, 2023
@dtrifiro dtrifiro added upstream Issues which need to be resolved in an upstream dependency 🐛 bug and removed bug labels Jan 2, 2023
@dtrifiro
Copy link
Contributor

dtrifiro commented Jan 5, 2023

Merged iterative/dvc#8722.

read_timeout can be now set in the http remote config. Set to 0 to disable timeout. Please note that this may result in indeterminate hangs under certain network conditions.

Waiting for upstream in the meantime

@jaanli
Copy link

jaanli commented Jan 6, 2023

Also facing this issue! How can I set to 0 to disable the timeout @dtrifiro ? Thank you!!

@jinensetpal
Copy link

jinensetpal commented Jan 6, 2023

I'm guessing: dvc config set <remote>.read_timeout 0

Can't test since I'm on my smartphone

@marcusos
Copy link

Hi folks! How can I set the read_timeout? I am trying to add a new config as suggested, but I got an error:

configuration error - config file error: extra keys not allowed @ data['remote']['origin']['timeout']

@dtrifiro
Copy link
Contributor

Hi @marcusos, it should be enough to set it using dvc remote modify <remote name> read_timeout 0.

Please make sure you're running dvc>=2.39.0 and dvc-http>=2.30.1

@marcusos
Copy link

marcusos commented Jan 24, 2023

Hi @dtrifiro, thank you for the help, now I am getting the following error:

dvc remote modify origin read_timeout 0
dvc push -r origin data/preprocessed/loan_club.dvc
ERROR: unexpected error - read_timeout and timeout parameters conflict, please setup timeout.read

Any idea on how to overcome this?

dvc = 2.43.0 (via homebrew)

@skshetry
Copy link
Member

skshetry commented Jan 25, 2023

@marcusos, is it possible to use pip? We released a new version of dvc-http yesterday that should fix this, but homebrew will have to wait for the next dvc release.

@marcusos
Copy link

marcusos commented Feb 5, 2023

Thank you @skshetry !

@skshetry
Copy link
Member

Closed by aio-libs/aiohttp#7150, thanks to @dtrifiro.

We are still waiting for a release though. Meanwhile, there are connect_timeout/read_timeout options that you can set.

@github-project-automation github-project-automation bot moved this from Review In Progress to Done in DVC Feb 14, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🐛 bug upstream Issues which need to be resolved in an upstream dependency
Projects
No open projects
Archived in project
Development

No branches or pull requests

10 participants