Data Transfer Stall #7972
@aschmahmann is going to provide more context/info.
There was a report that a node was not sending data over bitswap despite having the data locally. The user was able to run […]. I investigated further using […].
Matt noted that disabling QUIC seemed to help. This issue could be related to libp2p/go-libp2p-quic-transport#57 (where we could be failing to timeout somewhere in bitswap, backing everything up).
Although I have a vague understanding of the high-level issue, I don't find enough information here to translate it into an actionable item.
I do not know of a way to reproduce this issue. My recollection from some comments @gmasgras made is that the situation we saw was: […]
It might be that simulating this situation and checking the gateway once every 30 minutes would at least let us know if this continues to happen, and could enable more live debugging on the node. cc @coryschwartz, who's started on some scripts that might help here.
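To make the monitoring idea concrete, here is a minimal sketch of such a periodic check; the gateway URL, CID, and timeout are placeholders rather than the production setup:

```bash
#!/usr/bin/env bash
# Probe a gateway for a CID we know is pinned, once every 30 minutes,
# and log any request that doesn't complete within the timeout.
GATEWAY_URL="https://example-gateway.dwebops.net"   # placeholder gateway host
CID="QmExampleCid"                                  # placeholder CID known to be pinned
while true; do
  if ! curl -sf -m 120 -o /dev/null "${GATEWAY_URL}/ipfs/${CID}"; then
    echo "$(date -u +%FT%TZ) possible stall: ${GATEWAY_URL}/ipfs/${CID} gave no answer within 120s"
  fi
  sleep 1800
done
```

Running one of these per gateway in parallel would give a rough, continuous signal of when a stall appears and on which node.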
Ya, this is harder. The place I'd investigate is tracking how an inbound message gets put into a queue and then gets dropped, stuck forever, or stuck for a very long time (IIRC I've waited over 30 minutes on a response). This means starting with the message-handling code and tracing how a message enters and leaves the queue.
After adding some content to an IPFS node and trying to read it through the gateways, some of the content would throw an HTTP 504 after several minutes. I set up a second IPFS node with routing disabled and peered it only to the gateways. What I found is that some of the gateways responded with an error some of the time. Using […], I ran this rudimentary script to check each gateway against a set of CIDs:

GATEWAYS=(
"/dns/gateway-bank1-am6.dwebops.net/tcp/4001/p2p/12D3KooWEfQkss14RPkFoGCby5ZRo5F2UYEMEfSAbYUdjeAE6Dk3"
"/dns/gateway-bank1-ams1.dwebops.net/tcp/4001/p2p/QmXAdzKCUZTg9SLLRK6nrfyhJmK1S43n2y5EBeGQ7zqhfW"
"/dns/gateway-bank1-dc13.dwebops.net/tcp/4001/p2p/12D3KooWJfq7u5cYc65HAZT3mE3irDH5KYSUdFxXTdf2KTB7FQgf"
"/dns/gateway-bank1-mrs1.dwebops.net/tcp/4001/p2p/QmYZmw1PQXtKVeAjRfn8TnKYAurSdezRd1kdDjenzxTF6y"
"/dns/gateway-bank1-nrt1.dwebops.net/tcp/4001/p2p/QmQ41LAQUdH6JQCeSR1qMT8qdPAkZ9XgkvqYHs9HiAdh5Z"
"/dns/gateway-bank1-ny5.dwebops.net/tcp/4001/p2p/12D3KooWDUGVYVaJKmGR3eJCbgihEYgYAfpnZqJYzhzeaJrLxq6j"
"/dns/gateway-bank1-sg1.dwebops.net/tcp/4001/p2p/12D3KooWGoaeFe2h4uhTEAhaN6yTccUMn5uizxvyRaWkd2x8ZYHP"
"/dns/gateway-bank1-sjc1.dwebops.net/tcp/4001/p2p/QmXYHU1M8t86gRKvUmE42smybPKXqTPEUPDUaCVtLmS6Bw"
"/dns/gateway-bank2-ams1.dwebops.net/tcp/4001/p2p/QmRzfM4kLwk6AjnZVmFra9RdMguWyjU4j8tctNz6dzxjxc"
"/dns/gateway-bank2-dc13.dwebops.net/tcp/4001/p2p/12D3KooWAAW7p19cyLb39w5V5AuvX4c5gTPtQv8RuZo2tCKeFpZ7"
"/dns/gateway-bank2-mrs1.dwebops.net/tcp/4001/p2p/QmSPz3WfZ1xCq6PCFQj3xFHAPBRUudbogcDPSMtwkQzxGC"
"/dns/gateway-bank2-nrt1.dwebops.net/tcp/4001/p2p/QmXygeCqKjpcbT9E7jgSbAvcDnPaJp9gQDzEVFZ6zzjVc2"
"/dns/gateway-bank2-ny5.dwebops.net/tcp/4001/p2p/12D3KooWDBprpmGqQay9f513WvQ9cZfXAtW1TiqPP2fpuEGnZsqh"
"/dns/gateway-bank2-sg1.dwebops.net/tcp/4001/p2p/12D3KooWDK1MAZuySwEv49Zmm7JVgsUgjGwZVB8Kdzdwr7mLcNL2"
"/dns/gateway-bank2-sjc1.dwebops.net/tcp/4001/p2p/QmVLb8bi8oAmLwrTdjGurtVGN7FaJvhXP5UM3s56uyqfxL"
)
for gw in "${GATEWAYS[@]}"
do
(
echo "---------------"
echo checking $gw
for cid in $(cat pinned.txt)
do
echo -n "${cid} "
./vole bitswap check $cid $gw 2>&1
echo "return status: $?"
done
) | tee $(sed 's/\//_/g'<<<"${gw}.tst") &
done
wait
The problem doesn't seem to persist. If I have trouble hitting a CID from one of the gateways once and then try again later, the problem has resolved itself. This might not always be the case, but when I ran this I only saw errors on one specific gateway node, and I captured a profile of the node running there in case it's useful: https://gist.github.com/coryschwartz/3eaae387933f16ee7b455ffd3eb6d122. One gateway server does take on more load than the rest, and it is not uncommon for its nginx response time to exhaust the full 10-minute timeout. Across all gateways, the P99 nginx response time is 10 minutes (the nginx timeout), and load of course tracks the other metrics (throughput, etc.). I still need to look further to understand exactly what's going on here, but I wonder if just adding some load shedding to the gateways would make the service more performant for users. Nginx could be configured to try the local node first and, after a short timeout, try another node.
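On the load-shedding idea, a rough nginx sketch of "try the local node, then fail over quickly"; the upstream names and timeout values are placeholders, not the current gateway config:

```nginx
upstream ipfs_gateways {
    server 127.0.0.1:8080;                     # local go-ipfs gateway, tried first
    server other-gateway.example:8080 backup;  # placeholder fallback node
}

server {
    listen 80;

    location / {
        proxy_pass http://ipfs_gateways;
        # Give the local node a short window instead of the full 10 minutes...
        proxy_connect_timeout 5s;
        proxy_read_timeout 60s;
        # ...and retry the request on the next upstream on errors, timeouts, or 504s.
        proxy_next_upstream error timeout http_504;
        proxy_next_upstream_timeout 120s;
    }
}
```

The obvious trade-off is that a short proxy_read_timeout also cuts off legitimately slow fetches of large content, so the numbers would need tuning.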
@coryschwartz I need more information to reproduce this. Could you please expand on […] and the file […]?
@aschmahmann I need more precision as to what a 'bitswap stall' means. Is it the […]? In that case I can consistently get that response from the GW (with ipfs-shipyard/vole#1).
I guess what we're looking for is a node not responding to Bitswap requests for a long time. If the node is under severe load then there may not be much to be done. However, if we can get multiple rounds of wantlist messages through before hearing an answer to our request, then (unless the load is on the disk) load probably isn't the issue. Vole responding with […]: it looks like that request is resolving now and I didn't have much time to see the error, but generally seeing that is a bad sign. Vole right now (without your PR) is a little impatient; I suspect adding a bit more debuggability/configurability around timeouts could be useful.
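Until vole grows its own timeout knobs, one way to tell a slow node from a stalled one is to wrap the existing check in a hard timeout and retry a few times. The CID and gateway multiaddr below are placeholders; the vole invocation is the same as in the script above:

```bash
cid="QmExampleCid"                                               # placeholder CID
gw="/dns/example-gateway.dwebops.net/tcp/4001/p2p/12D3KooW..."   # placeholder multiaddr
for attempt in 1 2 3; do
  # timeout(1) kills the check if it hangs, so a stall shows up as repeated failures.
  if timeout 120 ./vole bitswap check "$cid" "$gw"; then
    echo "attempt ${attempt}: answered"
    break
  else
    echo "attempt ${attempt}: check failed or gave no answer within 120s"
    sleep 60
  fi
done
```

If all three attempts fail while the node still answers other traffic, that looks much more like a stall than ordinary load.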
We currently do load shedding by limiting the number of concurrent connections and requests so that we're not oversubscribed on CPU (load < threads).
The node is under heavier load compared to the others, but it's not overloaded.
@aschmahmann 👍 if you agree: the problem is the lack of response from the GW (either […]). We still can't trigger this consistently. The previous tests on GW […]
This is possible; however, when I was initially testing this, the node that wasn't responding did actually have the CID already.
I wanted to hop in here because I saw a lot of the focus getting thrown at gateway nodes. When we encountered this issue it wasn't just gateway nodes that were running into problems: when a storage node (a node that's storing our IPFS content long term) ran into this issue, it wouldn't serve content to any other node. So, for example: […]
As far as I'm able to tell, the issue lies with the node that has the content needing to be retrieved, not with the nodes looking to retrieve that data.
👍, yep in the gateway example above the issue was also on the node holding the data (in that case one of the gateways was holding it and another should've been retrieving it from the first gateway node).
This is interesting: if we're talking about the same bug, then this should make it easier to diagnose, since once it occurs it should be sticky. Perhaps 10 seconds is too low a threshold for "when is it stalled", since within an hour the issue went from detectable to non-detectable on the gateway node above. @obo20, a few clarifying questions: […]
Previously we would notice that content wasn't replicating from certain nodes, or wasn't able to be pulled from the gateways after upload.
This was happening seemingly randomly. At first it was every couple of days, but as we slowly transitioned our nodes over to disabling QUIC, we noticed it didn't happen anymore. Now that we have QUIC disabled on all host nodes, we haven't run into this at all (though that may be pure coincidence).
This is mostly a check to make sure we're talking about the same bug.
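For anyone else wanting to rule QUIC in or out the same way, the transport can be switched off in the go-ipfs config and the daemon restarted; if the swarm listen addresses include /quic entries, those may need to be removed as well:

```bash
# Disable the QUIC transport (takes effect after a daemon restart).
ipfs config --json Swarm.Transports.Network.QUIC false

# Re-enable it later with:
#   ipfs config --json Swarm.Transports.Network.QUIC true
```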
Just an idea: maybe it's not a network issue but a database one? I had some issues a while ago where a node could not access some CIDs while clearly in possession of them, after running the GC somehow messed up the database. On a restart the issue was gone for a while and reappeared after some hours.
I'm skeptical about this being related to GC. When we hit this, we hadn't run GC since restarting the node. Unless you mean the node had been GC'd at some point in its life.
Just ran into this again. This was a node that had accidentally been overlooked when we turned off QUIC for our network. I ran the go-ipfs/bin/.collect-profiles script on this node and can share the output if needed. Something else to consider, since this seems to be related to QUIC: almost all of our nodes have multi-terabyte datastores and are storing hundreds of thousands to millions of root CIDs. I'm not sure if this has any effect, but it seemed potentially useful to point out.
Please do. I want to see if we're stuck opening and/or reading from a stream somewhere.
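For a lighter-weight capture than the full profile bundle, a goroutine dump alone is usually enough to spot goroutines stuck on stream opens or reads. A sketch, assuming the default API address on 127.0.0.1:5001:

```bash
# Full goroutine dump from a running go-ipfs node (default API on 127.0.0.1:5001).
curl -s 'http://127.0.0.1:5001/debug/pprof/goroutine?debug=2' -o goroutines.txt

# debug=2 output annotates long-blocked goroutines with how long they've waited,
# so stalls show up as entries like "goroutine 123 [select, 33 minutes]:".
grep -n 'minutes' goroutines.txt | head
```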
@obo20 gave us a stack trace with the culprit:
This is happening because QUIC blocks forever if it can't open a stream (e.g., too many open streams). So: […]
@schomatis: the first fix is to make […]
Full stacks attached. Take a look for blocked […]
Note: there could be multiple issues here, but we can at least fix this one and move on.
This should be fixed once we get go-ipfs to use a newer version of go-bitswap with ipfs/go-bitswap#477 included.
Hi there, […]
It's running the old version of go-ipfs (before the fixes). This'll be a good chance to check.
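A quick way to confirm which go-bitswap a deployed daemon was actually built with, assuming the node's go-ipfs is new enough to have the version deps subcommand:

```bash
# Print the daemon's build dependencies and pick out go-bitswap.
ipfs version deps | grep go-bitswap
```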
Closing for now, we'll reopen if we see continued reports.
Update (May 28th): we have deployed #8082 to monitor bitswap with the latest fix, ipfs/go-bitswap#477. We are still running into the 'this request was not processed in time' error, but this is likely a false positive due to an incorrect timeout setting in the monitoring tool that we should calibrate for a GW node in production. We need to make this setting a normal IPFS config option that can be modified through the standard CLI, to avoid restarting the node and to better calibrate the timeout. (There's a follow-up issue about it that might be picked up next week: #8157.)
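For illustration only, exposing that timeout as a normal config option would let it be tuned like any other setting through the existing ipfs config command. The key name below is hypothetical; the real change is tracked in #8157:

```bash
# Hypothetical key name, for illustration only; see #8157 for the actual setting.
ipfs config --json Internal.BitswapMonitorCheckTimeout '"120s"'

# Read the current value back.
ipfs config Internal.BitswapMonitorCheckTimeout
```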
Version information:
go-ipfs v0.8.0
Description:
We saw a bitswap stall on some of the preload nodes where they were unable to download any content, even from connected nodes (ipfs/js-ipfs#3577).
This looks related to #5183 but that issue was mostly about specific content. That is, a specific request would fail, and we'd get into a stuck state where we couldn't download that specific content.
The preload nodes did not appear to be spinning, stuck, etc. but restarting them fixed the issue.
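For anyone looking at a node in this state, a few standard CLI checks from the stuck node's side help confirm it matches this picture (illustrative, nothing preload-specific):

```bash
# Overall bitswap counters: blocks received/sent, wantlist size, partners.
ipfs bitswap stat

# What this node is currently asking its peers for.
ipfs bitswap wantlist

# Confirm the node is still connected to the peers that should have the data.
ipfs swarm peers | head
```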
Possible causes: