Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fs.http: prevent hangs under some network conditions #7460

Merged
merged 2 commits into from
Mar 28, 2022

Conversation

dtrifiro
Copy link
Contributor

@dtrifiro dtrifiro commented Mar 14, 2022

When using an https remote (such as in the dvc-bench remote, see iterative/dvc-bench#319), under certain situations, dvc pull would freeze and never return. Note that this did not happen 100% of the time, but often enough to make the bench CI always fail (note: CI was running a matrix strategy over ~12 workers, pulling a 25k file dataset on each).

Investigation pointed to network issues (as the issue could not be reproduced locally). I ended up tracking this to a connection being dropped/lost and aiohttp not realizing this. Due to our timeout defaults for aiohttp.ClientSession:

aiohttp.ClientTimeout(
            total=None,
            connect=self.REQUEST_TIMEOUT,
            sock_connect=self.REQUEST_TIMEOUT,
            sock_read=None,
        )

(note that sock_read=None), aiohttp would keep trying to read from a dead socket thus hang.

By instantiating a aiohttp.TCPConnector with enable_cleanup_closed=True, the underlying transport is forcefully closed when the connection is lost, so that the request fails and RetryClient retries.

Thanks @pared and @karajan1001

fixes #7414

@dtrifiro dtrifiro requested a review from a team as a code owner March 14, 2022 13:58
@dtrifiro dtrifiro requested review from pmrowla and pared March 14, 2022 13:58
@dberenbaum
Copy link
Collaborator

@dtrifiro Is it possible to show the difference before and after? Possibly showing the output from each, a video, or a reproducible script?

dvc/fs/http.py Outdated Show resolved Hide resolved
@@ -114,7 +120,7 @@ async def get_client(self, **kwargs):
total=None,
connect=self.REQUEST_TIMEOUT,
sock_connect=self.REQUEST_TIMEOUT,
sock_read=None,
sock_read=self.REQUEST_TIMEOUT,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dtrifiro
Copy link
Contributor Author

Copy link
Contributor

@pared pared left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amazing research!

@efiop
Copy link
Contributor

efiop commented Mar 15, 2022

pared left a comment
Amazing research!

+1, amazing work 🔥

@dtrifiro dtrifiro changed the title fix hang on pull fs.http: prevent hangs Mar 15, 2022
dvc/fs/http.py Outdated
Comment on lines 91 to 93
client_kwargs["connector"] = aiohttp.TCPConnector(
enable_cleanup_closed=True
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there more detail on this? I'd like to see why this is not the default in aiohttp.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like aiohttp used to do this by default 5 years back but was changed by this commit aio-libs/aiohttp@02b0951 as an option. But I could not find any reasoning.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From https://docs.aiohttp.org/en/stable/client_reference.html#tcpconnector:

Some ssl servers do not properly complete SSL shutdown process, in that case asyncio leaks SSL connections. If this parameter is set to True, aiohttp additionally aborts underlining transport after 2 seconds. It is off by default.

From the description of how it works, it seems more reasonable that this isn't the default.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was too wondering why this is not the default, but digging around could not find any reasoning either. I guess this only becomes an issue under particular network conditions (broken TLS connection?), which were often encountered in dvc bench due to our large number of concurrent requests in dvc bench.

Copy link
Collaborator

@dberenbaum dberenbaum Mar 15, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like this could be the context for the change: aio-libs/aiohttp#1767.

Edit to add some more explanation: The discussion is right around the time of the change, the contributor is in the discussion, and they discuss the related (now outdated) attribute cleanup_closed_disabled.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we could also ask 😄

@dtrifiro dtrifiro changed the title fs.http: prevent hangs fs.http: prevent hangs under some network conditions Mar 15, 2022
@dtrifiro dtrifiro force-pushed the fix/7414-hang-on-pull branch from 1745f16 to 12de7e7 Compare March 15, 2022 18:15
@dtrifiro
Copy link
Contributor Author

Added some context comments with the last force push

@efiop efiop requested a review from skshetry March 17, 2022 17:28
@@ -114,7 +123,7 @@ async def get_client(self, **kwargs):
total=None,
connect=self.REQUEST_TIMEOUT,
sock_connect=self.REQUEST_TIMEOUT,
sock_read=None,
sock_read=self.REQUEST_TIMEOUT,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am curious to see if only this will solve the problem in dvc-bench, and a bit hesitant on adding TCPConnector
without trying out this solution first. Would you mind opening another PR with just this line change? We can merge that, and we can keep this PR open for a 1-2 days to monitor.

Copy link
Contributor Author

@dtrifiro dtrifiro Mar 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've tried, it solves the issues on my machine ™️ (MacOS) but it still freezes on the dvc-bench CI

@dtrifiro
Copy link
Contributor Author

Gonna test out a few extra scenarios before merging this

@dtrifiro dtrifiro merged commit 5741927 into iterative:main Mar 28, 2022
@dtrifiro dtrifiro deleted the fix/7414-hang-on-pull branch March 28, 2022 06:21
skshetry added a commit to iterative/dvc-objects that referenced this pull request Aug 5, 2022
We have been getting reports that the timeout on sock_read was raising
timeout error even for chunked uploads, and sometimes even uploading
zero-byte files.

See: https://github.com/iterative/dvc/issues/8065
and iterative/dvc#8100.

These kinds of logics don't belong here, and should be upstreamed (eg: RetryClient/ClientTimeout, etc).
We added timeout in iterative/dvc#7460 because of freezes in
iterative/dvc#7414.

I think we can rollback this for now given that there are lots of report of failures/issues with this line,
and if we get any new reports of hangs, we'll investigate it separately.
efiop pushed a commit to iterative/dvc-objects that referenced this pull request Aug 5, 2022
We have been getting reports that the timeout on sock_read was raising
timeout error even for chunked uploads, and sometimes even uploading
zero-byte files.

See: https://github.com/iterative/dvc/issues/8065
and iterative/dvc#8100.

These kinds of logics don't belong here, and should be upstreamed (eg: RetryClient/ClientTimeout, etc).
We added timeout in iterative/dvc#7460 because of freezes in
iterative/dvc#7414.

I think we can rollback this for now given that there are lots of report of failures/issues with this line,
and if we get any new reports of hangs, we'll investigate it separately.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

pull: Using jobs>1 fails with RuntimeError: Session is closed in http remote pull: freezes on large datasets
6 participants