-
Notifications
You must be signed in to change notification settings - Fork 491
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Drop ping sending rate-limit suggestion #918
Conversation
What about the receiver?
Also we could add a note the sender must not send another ping when it has not yet received a pong, |
There's not a lot of reason to explicitly rate-limit ping messages on the sending side, hosts on the internet can *always* send you as much traffic as they want, its up to you whether you want to talk back to them. This seems to have been intended as a cutoff where nodes can skip responding to pings below a certain rate, but in practice 30 seconds is much too long a time to learn that your peer has disconnected. We could reduce the threshold, but its not like this is the only place in the spec where a peer can request a message response, and that is unlikely to change, making it of highly dubious value.
372e900
to
49e1c1c
Compare
Oops, sorry.
I dunno if its worth bothering? If there is a concern over outbound buffer fills or inbound traffic floods, rate-limiting pings does ~nothing to address that. Both must be addressed at a different level as there will never not be a way to get someone to send a response message in the spec. |
Since you removed |
Ah, yes, I'd misunderstood the context for your comment, agreed. |
What would be a reasonable amount of time to know that your peer has been disconnected? |
We've switched to 5 seconds, which is probably too long in most cases but too short on Tor. There's no right answer, so the spec shouldn't be opinionated about it. |
A ping-timeout of 5 seconds might cause rust-lightning to disconnect from a channel peer if they are under heavy load (e.g. packet flood). Is the assumption you're making that this doesn't happen? On another note, I do agree with this spec pull that 30seconds is arbitrary. |
Five seconds is a huge amount of time (for non-tor connections). Getting a message shouldn't take longer than a second, and getting it back shouldn't take longer than a second. Taking three seconds to process a message probably indicates the peer is blocked on a disk write writing to a disk connected via smoke signals. I admit lots of nodes are basically smoke-signal-operated (ie running on RPis with SD cards), but the flip side has real nontrivial cost - accepting ping timeouts of more than ~500ms implies you may not detect a TCP socket getting killed when your app went to the background. This appears to be especially nasty on iOS where sockets hang when the app is in the background, and users want to open an app and hit "send" right away. You can ping right before the send, but of course now you've added an extra 250ms to your payment latency. Obviously I'm being a bit facetious here, but the point is there isn't a really good answer - it depends a lot on the peer you're talking to (is it an LSP running on "real" hardware over clearnet or is it an RPi that goes on vacation for seconds at a time and communicates over Tor), your network (Tor or 2G modem or a datacenter?), etc. The spec should not have an opinion on the ping latency, and its up to users to tune their deployment to what makes the sense. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ACK
ACK Rational on not adding a "no more than 1 inflight ping" rule (yet or at all) is that it would only be enforceable if the responder (pong sender) needs to set a random number that the sender needs to use on the next ping in order for it to be valid. Since this is bikeshedding to add without need, we can just drop the rate limit and care if that ever gets a problem at all. |
cc @Roasbeef @Crypt-iQ there was some hesitation from the lnd side in the meeting - noting that y'all were thinking of maybe adding a rate-limit in the future. Obviously that would be incompatible with this PR as written, so if there's hesitation/a desire to do that in the future we should probably discuss it now rather than later. On the LDK end, we may actually start pinging even more often, using it to gate how quickly we send gossip messages to ensure we're always able to get channel messages through quickly even during full gossip sync. We'll still only have one in-flight ping at once, but may end up with pings happening at almost the RTT of a peer. |
We haven't made any strong commitments, but just that it's something we considered doing after a newer member of the team brought up concerns re not enforcing the deadline, and being flooded with ping messages. I think it'll be one of those reactive things, where if there's an issue, or people report their node is handling too many (?) pings, we'd start to throttle. As I mentioned before, I don't have a strong opinion on this. |
These two things are mutually exclusive - "we've removed it from the spec and well-behaved implementations can do ~whatever" and "we may add a rate-limiter later, dunno, YOLO" are incompatible from each other. One thing we've been looking at on the LDK side is to rate-limit gossip sync by sending pings in between to ensure we keep consistent low latency. This would imply substantial ping volume, but not as a % of total messages. |
Would you feel better if this PR was amended to say something like "no more than 50% of messages on the wire may be pings if more than one message is received per second"? That way you can still rate-limit to your heart's content later but it doesn't impact others. |
Ack. We don't enforce a limit either, and if you're being painful we will have to start anyway, whatever spec says :) |
If I say: "we don't plan to rate limit in the immediate future, but will examine it on an on going basis", are you satisfied as far as this PR? I think this is just a communication/language thing: I'm expressing the desire for optionality in the future, but I'm not committing to anything (as I don't know what the future holds, if someone starts spamming and destabilizing the network, we'll put in mitigations at the implementation level).
Sure doesn't seem to hurt, also echoes @rustyrussell's comments earlier today about a heuristic that could be used to decide if peers are misbehaving. I think this PR is fine to land, we don't do rate limiting atm, but only send pings every 15 seconds or so. |
Can you elaborate on this? Doesn't that assume certain implementation-level processing/prioritization details (particularly the internal queueing system used)? |
I think you've been saying that for a while, I don't see how that changes anything. I don't think anyone is suggesting we can't revisit this if someone starts ping-spamming (not that I buy that it would be useful given they would just start udp-flood-spamming or some other messages), but what you're saying, to me, is "well, we reserve the right to ignore the spec in this regard later, but you go ahead and do you", which isn't helpful here. There is no "well, we can merge this but maybe we'll ignore it later", the spec exists so that we can agree on something and be compatible, not so that we can throw it out later cause its annoying. If you want to reserve the right to do something that the spec forbids in the future, the spec must not forbid it. I agree there is a communication issue here, but I'm entirely unsure of why this is such an issue.
My question isn't "does it hurt" its "what will lnd commit to not disconnecting for in the future". ie "what can we put in the spec to ensure compatibility with the various implementations, which is the whole point of the spec...
Not directly - we don't actually care if a peer prioritizes pings and responds to them immediately with gossip handled in the background (modulo TCP buffering delays and the like), all we care about is making sure our pings get through in a timely manner so that we can quickly detect disconnection. |
Just a question about your example: what if I don't have a udp port open on my router? An attacker could fill up my bandwidth-pipes, but it's a much higher threshold than if I had a tcp port open and the kernel keeps getting interrupted for each received packet. Right? |
I doubt that? 1Gbps of UDP garbage is pretty easy for modern Linux to shrug off, but 1Gbps of UDP garbage is probably enough to make most home internet connections drop 20-50% of inbound packets, more than enough to make TCP unusable. Of course even if your ISP has some DDoS protection in place you can always replace "UDP flood" with "9735-syn flood" or even "flood of random odd-value messages" or "flood of repeating the same update_channel message over and over again" or..... The "standard" method for these kinds of things is to track the "useful-message-vs-useless-message" ratio for any peer and get upset if its too high, but that's somewhat hard for lightning because you're expected to send update_channel messages a bunch which are duplicative with what other nodes sent, and you have to be really careful not to let any node flood you with a message that meets the "useful" criteria (like sending the same update_fee over and over again, or a different update_fee by 1 sat or...). Luckily in lightning we have a concept of "I like that peer" - "I have a channel with that peer", and you can firewall out other peers as needed, which seems like the only "real" solution to all of this. |
As an aside: I have a 1 year old arch linux install running on an five-ish year-old macbook. It cannot shrug off a packet flood and I don't think I'm hitting 1Gbps. Though it could be the age of the machine. I used
So I think this is the model that Laolu is referring to when saying rate-limiting pings might be a thing in lnd - not accepting connections except for certain peers we have channels with. Home node users / plebnet users without DDoS protection are not included here. So the unconnected-attacker's packet flood does not apply. If you bypass a threshold of sending too much junk, you probably should be disconnected (not banned because of Tor). Though determining a threshold here is pretty tricky if you can send almost-useless |
This sounds like a bug in your NIC driver, plus, yes, that's probably a pretty old machine for the purposes of this discussion.
How does this have anything to do with rate-limiting pings?
Huh?
So we're back where we started? Let me rephrase the whole issue here - the PR text, as currently written, says that a peer can literally spin-loop pings because it wants to make the connection constant-bit-rate. This is in fact a totally reasonable thing to do for privacy! Set the connection to 1Mbps by sending pings in a loop! If merged as-is, the spec says you MUST NOT disconnect just because a peer is doing that. Rate-limiting and not responding to these pings would be a spec violation, you MUST NOT do this. Y'all keep coming in and saying "well, we may do that in the future" which implies that you are NACK'ing this PR, because you're saying you want to (maybe) violate this proposed spec change in the future. Obviously if you don't have a channel with a peer, you don't generally have to accept a connection with it, who cares, not your problem, but if you have a channel, this PR would require you to be able to handle ping "floods" as a condition of being a lightning node that cares about being able to stay connected to channel peers. My point is that you are likely confused in wanting to do that explicitly, hence the conversation about the value of such rate-limiting, and I think this PR still makes sense. If you still disagree, I proposed alternate wording above. |
You had previously brought up that there was no point to doing this since you could packet flood anyways. A random attacker won't be able to packet flood since we are working with a different model (cloud deployment with DDoS protection, whitelisting).
I had thought the motivation was for your mobile users and not privacy. If a node has 1000 channels, and each is spamming you with pings, you are now subject to a 1Gbps ping flood? Why would you not want to discourage this?
I don't see the |
Rate-limiting pings still doesn't solve your issue, though, I can send you an update_fee message in a loop, or repeat gossip messages over and over again, or even sign new gossip messages in a loop so that they're "useful".
There are many reasons why allowing more pings could be useful. Indeed, in our case we had some mobile clients that had issues detecting disconnections and we had to ship a faster ping time in order for things to work. Constant bandwidth lightning connections is also probably something we need to move towards for privacy reasons, and pings aren't a terrible way to go about it, though dummy unknown odd messages may be even better. Maybe 1Mbps is too much, I dunno, but the point being just a raw rate-limit of pings isn't helpful in any context I can come up with.
|
We rate-limit gossip messages, rate-limiting pings if they're excessive isn't too much of a difference in my opinion. The same goes for rate-limiting other messages depending on the amount of work it costs us to process. This pull could use a feature-bit and then lnd just won't have to implement the feature-bit which seems like the best of both worlds? Constant-bandwidth connections makes sense in theory to me, but seems like it is not that great in certain use-cases (high channel count). So I don't think it's enough of a reason for us to not rate-limit pings if they're excessive. |
Do you also rate-limit channel messages?
Dear god, please lets not start shoving in feature bits for trivial things, that just makes everyone's code more complicated. If you have an issue with the PR, lets discuss how to make it better such that it fits within your idea of what lnd will accept.
Sure, you'd have to do it at a reasonable rate, but let's say its 64Kbps, that's probably more than enough for any channel updates to not be delayed, but also means you'll have a connection that's 99% ping messages, making "useful-messages"-based peer criteria though. |
No, but it's something I've idly thought about before.
Sure I see in this pull that there's no mention of a timely response for pings. What happens if rust-lightning gets a response that is delayed - I recall you mentioning that you wanted to know if your peers were up or not? Additionally, are you padding your pings out and if so, by how much? |
Disconnect + reconnect, cause it probably means the socket is actually closed but the host didn't tell us, at least on iOS.
Currently I believe not, but it may be something useful in the future if/when we implement constant bandwidth mode. |
But do you not end up requiring your peers to have a certain bandwidth or is that not an issue except for the potato-router model? |
Yea, I mean I think we require a full packets per second or so! If you don't get that, you probably can't be considered "Internet-connected" lol. |
Would the constant-bandwidth model require a certain bandwidth? And if so, how would it work if it requires 1Mbps while my speed is 500Kbps? |
Yes, it would disconnect. I suppose 1Mbps was a bit overly aggressive - a "real" CBR model here would probably be more like: (a) open a separate, non-CBR connection that you use only for gossip, (b) set the HTLC-only connection to CBR at, like, 32Kbps or 128Kbps or so. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM 🌛
@Crypt-iQ what is your current thinking on this one? Should we just land it and amend the spec here if y'all decide to go down a ratelimiting path later (presuming its not something that'll happen fast/be needed any time soon)? Should we set a limit here that people are happy with? Or just rely on the general "well, if you dont have a channel with them, you dont have to stay connected in any case, really" take? |
I think this pull is fine and we may not need to go down the ratelimiting path at all |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ACK 49e1c1c
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ACK 49e1c1c
As @rustyrussell said: c-Lightning does not enforce this. they have a FIXME
in their code to be spec compliant at some point and include the 30 seconds limit: https://github.com/ElementsProject/lightning/blob/d22fd599973175e3d1a484a7aa2efd9916116bc5/common/ping.c#L16
Unless I oversaw something Eclair currently also does not enforce the rate-limiting: https://github.com/ACINQ/eclair/blob/59b403559bdfc0ba884503bb12a14846b7aa5d2d/eclair-core/src/main/scala/fr/acinq/eclair/io/PeerConnection.scala#L190 They seem however to send a heartbeat ping every 30 seconds if no other message was sent and have the ping interval as a config parameter.
Electrum seemed to not have implemented a rate limiting of 30 seconds for incoming ping messages: https://github.com/spesmilo/electrum/blob/56b03e2e8d6923c4d396c1272d3fafed57919a8f/electrum/lnpeer.py#L230
I did not check rust-lightning as @TheBlueMatt proposed the PR
I also did not check lnd as @Roasbeef and @Crypt-iQ acked the PR after the lengthy discussion.
Given the current code of c-lightning, eclair and electrum we seem to have met the 2 implementation threshold and should merge.
Thanks for looking into it, that's correct, we don't enforce the rate-limiting, that's why we're fine with this PR as even old eclair nodes will be compliant with it. |
There's not a lot of reason to explicitly rate-limit ping messages
on the sending side, hosts on the internet can always send you
as much traffic as they want, its up to you whether you want to
talk back to them.
This seems to have been intended as a cutoff where nodes can skip
responding to pings below a certain rate, but in practice 30
seconds is much too long a time to learn that your peer has
disconnected.
We could reduce the threshold, but its not like this is the only
place in the spec where a peer can request a message response, and
that is unlikely to change, making it of highly dubious value.