bug: message not delivered during interop test #2369
Comments
Thanks for submitting the issue @romanzac! Given its random nature, it seems to be a timing issue. Note that gossipsub doesn't ensure message delivery, and there is no native mechanism that allows the sender node to know whether its messages were properly sent. Maybe a naïve solution would be to increase a "wait" somewhere within the tests. Could you kindly add some more detail about the node interaction? Maybe some kind of simple diagram where we can see the message flow and what is expected. On the other hand, I'm not very familiar with the interop tests, so instructions on how to run the test locally would be golden as well :) Thanks in advance!
Related to #2388
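As an illustration of the "wait" idea above, a fixed sleep can be replaced by polling with a timeout. This is only a minimal sketch: `wait_until` and the predicate passed to it are hypothetical names, not helpers taken from the interop test suite.

```python
import time
from typing import Callable


def wait_until(predicate: Callable[[], bool],
               timeout_s: float = 10.0,
               interval_s: float = 0.5) -> bool:
    """Poll `predicate` until it returns True or `timeout_s` elapses.

    Returns True if the condition was met, False if the wait timed out.
    The predicate is always evaluated at least once.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Hypothetical usage in a test, where `node2_received(msg)` would be a
# check written against whatever API the test already uses:
#   assert wait_until(lambda: node2_received(msg), timeout_s=20)
```

Polling keeps the test fast when the message arrives quickly, and makes a timeout an explicit failure. It does not address the underlying delivery question.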
I went through the test "test_publish_after_node1_restarts" and here are the steps:
Our tests look a bit bumpy, yet I would not expect them to fail. We test REST API responsiveness as the best practice "known at that time" for determining that node1 is ready after restart. I would raise a question here:
Please note this problem is not entirely technical. What we want in the end is stable app (frontend) behavior for the end user. This could be translated as the frontend <> Waku interaction being reliable enough to abstract away the possibly non-deterministic behavior of the Waku network. I would also include Status-go or other stakeholders in setting the reliability expectation. After that, I would move on to figuring out the technical details.
Related to waku-org/docs.waku.org#165
So if I understand correctly, the problem is that there is a potential race condition where the REST API is still available and responsive, but the node is already in the process of shutting down and potentially disconnected from peers, and hence cannot publish new messages (although the REST API still accepts them). It sounds like the right way to fix this would be to change the order in which components are shut down: disable most REST API endpoints (apart from health?) and then start the actual shutdown of the p2p layer? I don't think we can get to a 100% safe flow where no messages are lost, especially with non-graceful shutdowns like OOM kills. So my thoughts would be to
I would not make the node persist messages and automatically retry after restart; that should be the job of some higher-level abstraction in my opinion. Hope this helps :)
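The shutdown ordering suggested in this comment can be sketched with a toy model. Everything here (`ToyNode`, its fields, the status codes) is hypothetical and only illustrates the ordering; it is not how nwaku or go-waku are implemented.

```python
class ToyNode:
    """Toy model of a node with a REST publish endpoint and a p2p layer."""

    def __init__(self) -> None:
        self.accepting_publish = True   # REST publish endpoint enabled
        self.p2p_running = True         # p2p layer connected to peers
        self.relayed: list[str] = []    # payloads handed to the p2p layer

    def rest_publish(self, payload: str) -> int:
        """Return an HTTP-like status code for a publish request."""
        if not self.accepting_publish:
            return 503  # reject explicitly so the caller knows to retry elsewhere
        if not self.p2p_running:
            # This is the race described above: the request is accepted
            # but the message can no longer reach any peer.
            return 500
        self.relayed.append(payload)
        return 200

    def health(self) -> str:
        """Health stays queryable during shutdown."""
        if self.accepting_publish and self.p2p_running:
            return "ready"
        return "shutting_down"

    def shutdown(self) -> None:
        # Step 1: stop accepting new publish requests first.
        self.accepting_publish = False
        # Step 2: only then tear down the p2p layer.
        self.p2p_running = False
```

With this ordering a caller sees either a successful publish or an explicit rejection during a graceful shutdown. Non-graceful shutdowns such as OOM kills are not covered by it.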
Thanks for the comprehensive reply. It looks like we are on track to improve the readings of node health state. I have to admit I am not sure where exactly to pin the expectation for how reliable message delivery should be. Let's do some math with WhatsApp statistics. Based on the report at https://whatsthebigdata.com/whatsapp-statistics/, their 2.78B users exchange 140B messages per day. That is about 50 messages per day per user. With a 0.2% non-delivery rate, that is 280M messages lost per day?! Divided by 50, we would have 5,600,000 angry users every day. Is that the case? I don't think so. I believe WhatsApp delivers with a much lower loss rate, orders of magnitude lower. Perhaps 1000 angry users is acceptable? If my math is correct, that would mean 99.99996428% of messages are delivered. I remember from the "Waku & Vac DST/QA" meeting that the Vac DST team is looking at the reliability problem and trying various simulations. My question is how the RLN feature could fit with reliability when I cannot send a retry message within the same epoch (let's assume 1 sec)? What would our guidance for integrators be here?
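The arithmetic above can be checked with a few lines. The input figures are the ones quoted in the comment; the function names are made up for this sketch, and "one angry user per 50 lost messages" is the comment's own simplification.

```python
MESSAGES_PER_DAY = 140_000_000_000   # 140B, as quoted above
USERS = 2_780_000_000                # 2.78B, as quoted above
MESSAGES_PER_USER = 50               # rounded figure used in the comment


def lost_messages(loss_per_thousand: int) -> int:
    """Messages lost per day for a loss rate given in tenths of a percent."""
    return MESSAGES_PER_DAY * loss_per_thousand // 1000


def angry_users(lost: int) -> int:
    """Users affected if each of them loses a full day's worth of messages."""
    return lost // MESSAGES_PER_USER


def delivery_percent(target_angry_users: int) -> float:
    """Delivery rate needed to keep the number of affected users at the target."""
    allowed_lost = target_angry_users * MESSAGES_PER_USER
    return 100 * (1 - allowed_lost / MESSAGES_PER_DAY)


print(MESSAGES_PER_DAY / USERS)        # roughly 50 messages per user per day
print(lost_messages(2))                # 0.2% loss
print(angry_users(lost_messages(2)))
print(f"{delivery_percent(1000):.8f}%")
```

The numbers in the comment hold up: 0.2% loss is 280M messages, or 5.6M users, and a 1000-user target allows 50,000 lost messages per day, i.e. a delivery rate of about 99.999964%.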
I would be really happy if we could seal reliability problems within Waku. Integrators should not need to worry about anything related to gossipsub communication while using Waku in lib mode. Once they want to operate their own Waku network, of course, it is a different story.
Indeed, we want to improve what we can to make Waku as reliable as possible.
Strictly speaking, (1) and (2) are the only elements directly under Waku's control. Neither will ever guarantee the end-to-end message reliability that is needed within the (encrypted) application layer. Just as WhatsApp cannot rely on the underlying TCP/IP network for its app reliability, applications building on Waku would still need an end-to-end reliability overlay. Since this is a tall order for many small app teams, we are also working on: I think we might be publishing a blog article on this topic soon. :) cc @vpavlin
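To make "end-to-end reliability overlay" concrete, here is a minimal acknowledgement-and-retry sketch at the application layer. It is an assumption-laden illustration: `ReliableSender`, its methods, and the `send` callback are hypothetical and do not correspond to any Waku API or to the work mentioned above.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ReliableSender:
    """Tracks unacknowledged messages and resends them a bounded number of times."""

    send: Callable[[str, str], None]   # best-effort transport: (msg_id, payload)
    max_attempts: int = 3
    pending: dict = field(default_factory=dict)  # msg_id -> [payload, attempts]

    def publish(self, msg_id: str, payload: str) -> None:
        """Send once and remember the message until it is acknowledged."""
        self.pending[msg_id] = [payload, 1]
        self.send(msg_id, payload)

    def on_ack(self, msg_id: str) -> None:
        """Called when the recipient's application-level ack arrives."""
        self.pending.pop(msg_id, None)

    def retry_unacked(self) -> list:
        """Resend unacked messages; return ids that exhausted their attempts."""
        gave_up = []
        for msg_id, entry in list(self.pending.items()):
            payload, attempts = entry
            if attempts >= self.max_attempts:
                gave_up.append(msg_id)
                del self.pending[msg_id]
            else:
                entry[1] = attempts + 1
                self.send(msg_id, payload)
        return gave_up
```

In such a scheme `retry_unacked` would run on a timer. With RLN, the timer interval would presumably have to be at least one epoch, which is the constraint raised in the earlier comment.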
I can't find any failure related to "message not delivered" in the daily test reports for the past 2 months. I believe message delivery reliability has improved beyond the degree noticeable by the interop tests. We can discuss the next level of tests further outside this issue (they probably need more scale and randomness). I would like to close this issue; let me know if there is any objection: @fbarbu15 @jm-clius @vpavlin @Ivansete-status
Problem
On rare occasions, a message is not delivered when a node is shut down during the test sequence.
Impact
It is low occurrence, medium impact. The node may shut down before the message is sent/published to other nodes.
To reproduce
Run interop tests at https://github.com/waku-org/waku-interop-tests/actions/workflows/go_waku_daily.yml
A similar issue also happened with nwaku during execution of test_publish_after_node1_restarts.
https://github.com/waku-org/waku-interop-tests/actions/workflows/nim_waku_daily.yml
Expected behavior
A node which is shutting down should reject new messages. When a message gets stuck, the node retries the delivery the next time it starts. I would assume the shutdown was part of a restart. An open question is what to do when the node stays shut down for long periods of time. Are the messages lost?
Related to waku-org/go-waku#1014
Screenshots/logs
data_attachments_8d428324ec7625a3.txt