Olm sessions can wedge when using multiple Clients (e.g. with NSE process) #3110
Comments
Thanks for opening an issue. Indeed it looks like the session cache in the NSE isn't being reloaded, for some reason, and addressing #2624 might help with this. |
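For readers following along, here is a minimal sketch of what "reloading the session cache" could look like. It is a toy model and an assumption on my part, not the matrix-sdk-crypto implementation: `Store`, `SessionCache`, the `generation` counter and `demo` are all hypothetical names, and the session state is reduced to a single integer. The idea is that every write bumps a generation number in the shared store, and a process throws away its cached sessions whenever the generation it last saw no longer matches.

```rust
use std::collections::HashMap;

/// Hypothetical shared store: a generation counter plus per-sender session state.
#[derive(Default)]
struct Store {
    generation: u64,
    sessions: HashMap<String, u64>,
}

impl Store {
    /// Every write bumps the generation so that other readers can detect it.
    fn save(&mut self, id: &str, state: u64) {
        self.sessions.insert(id.to_owned(), state);
        self.generation += 1;
    }
}

/// Hypothetical per-process cache that remembers the generation it was filled at.
#[derive(Default)]
struct SessionCache {
    seen_generation: u64,
    sessions: HashMap<String, u64>,
}

impl SessionCache {
    /// Returns the session state, dropping all cached entries first if the
    /// store has been written to since this cache last looked at it.
    fn get(&mut self, store: &Store, id: &str) -> Option<u64> {
        if self.seen_generation != store.generation {
            self.sessions.clear();
            self.seen_generation = store.generation;
        }
        if let Some(&state) = self.sessions.get(id) {
            return Some(state);
        }
        // Cache miss: fall back to the store and remember the result.
        let state = store.sessions.get(id).copied()?;
        self.sessions.insert(id.to_owned(), state);
        Some(state)
    }
}

/// Reads a session, lets "another process" update it, then reads it again.
fn demo() -> (Option<u64>, Option<u64>) {
    let mut store = Store::default();
    let mut nse_cache = SessionCache::default();

    store.save("alice", 1);
    let first = nse_cache.get(&store, "alice");

    // Stands in for a write made by the other process.
    store.save("alice", 2);
    let second = nse_cache.get(&store, "alice");

    (first, second)
}
```

In this model the second read sees the updated state because the generation check invalidates the cache. If that check is skipped, the second read returns the stale value, which is the kind of behaviour being discussed here.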
@kegsay do you see this line in the Rust logs, out of curiosity?
|
No I do not see this log line. I did trace the "crypto store mismatch" stuff to see if it was ever wrong, and based on my reading of it, it was always very sane and reasonable. |
Presumably this only affects Element X iOS, because that is the only platform that uses two processes? |
Yes. Web with multiple tabs would also be similarly affected. |
I don't understand why we can't just disable the in-memory cache here. This would mean we exclusively hit the database, which would work (sqlite supports multi-process access just fine: https://www.sqlite.org/faq.html#q5). If there are performance concerns, can we just profile it and see whether that's indeed the case?
For what it's worth, it's not only about having atomic transactions and enabling the WAL in sqlite. A few reasons off the top of my head:
|
Both firefox and chrome appear to be able to handle locking cross-tabs at the very least when I did some testing to try to trip it up with read-modify-writes, which may be enough? |
As Kegan has said, indexeddb is reasonably resilient to multiple "processes" attempting to write to it at once, although its transaction semantics are annoying and interact poorly with asynchronous code. However, I don't think we should concern ourselves too much with indexeddb here; multiprocess behaviour on web is not currently an issue due to element-hq/element-web#25157, and right now I think it's more likely that we'll solve multi-tab operation by having a single worker process which is shared between tabs at the JS level than by trying to make concurrent matrix-sdk-crypto processes truly robust. (Note that the NSE process is an easier problem to solve than the web process, because the NSE process does less stuff. For example, it will never send encrypted messages, which means a whole class of failure modes is ruled out.) In other words: if indexeddb support were the only thing worrying us here, I wouldn't count that as a good argument. |
iOS Share Extension enters the room (This is another process that will be used to send encrypted messages, for what it's worth.) |
ugh. |
also, i think NSE /can/ send encrypted messages (via quick reply functionality) - i.e. replying from lockscreen etc. |
(also, if rust-sdk /does/ have to support concurrent access to the same underlying store - e.g. for NSE, it feels really weird not to avail ourselves of that as a feature on EWR, imo, rather than coming up with an entirely different service-worker based architecture that solves the same underlying problem) |
NSE can send messages in EI today, though accidentally, which is what element-hq/element-ios#7751 is talking about - specifically key sharing requests. |
Context Links: |
For the record, I believe this is currently incorrect: the share extension and quick reply are currently disabled. Though it doesn't entirely change the point here: we need a "proper" fix to that eventually.
hrmrmrmrm you may be right. Web locks make implementing the concurrency in the Rust less horrifying than it used to be. That said: I don't think we've really figured out the semantics for how different processes can call Anyway: my real point here is that even if we massively improve the concurrency support in the crypto-sdk to the point where the iOS extension processes function correctly (which is a lot of work in itself), that doesn't necessarily get us to the full level of concurrency that web would need. |
Related NSE problem, causing re-use of megolm index (detected as replay attack) element-hq/element-ios#7499 |
I agree, we may need to be more creative to support multi-tab web (e.g. Web Locks), but I don't think this should be a blocker for progress on this issue.
I think @bnjbvr is saying much the same thing here, that if we want true multi-process then we need additional locks in place. |
It seems like there are still a lot of unknowns here. For example, as well as the case Kegan mentioned in the OP, we've also seen cases where the NSE process appears to reload the Another theory is that maybe the current cross-process lock is not sound. We're not seeing much evidence of that in the logs though. |
#3313 is likely a big cause of this. |
@richvdh @andybalaam and myself have identified the root cause on this. We re-use the
This matches our minimal working example, which does Bob receive -> Alice send -> Bob receive. This wedges the session; it assumes that new Olm messages are sent for each message, i.e. a rotation period of 1 message. If we remove the cache, the test passes. We hypothesise it may also be the root cause of element-hq/element-ios#7751 and element-hq/element-ios#7480.
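To illustrate why removing the cache makes the test pass, here is a rough toy model of two processes sharing one store. Everything in it is an assumption made for illustration: `SharedStore`, `Process`, `decrypt` and `run_scenario` are hypothetical names, the "ratchet" is just a counter of the next expected message index, and none of this is the real matrix-sdk-crypto code or the actual test.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for the shared on-disk store (e.g. sqlite).
/// Maps a sender to the next message index the session expects.
type SharedStore = Arc<Mutex<HashMap<String, u64>>>;

/// Hypothetical stand-in for one process (main app or NSE).
struct Process {
    store: SharedStore,
    cache: HashMap<String, u64>,
    use_cache: bool,
}

impl Process {
    fn new(store: SharedStore, use_cache: bool) -> Self {
        Process { store, cache: HashMap::new(), use_cache }
    }

    /// "Decrypts" message number `index`. Succeeds only if this process's
    /// view of the session expects exactly that index.
    fn decrypt(&mut self, sender: &str, index: u64) -> Result<(), String> {
        let cached = if self.use_cache {
            self.cache.get(sender).copied()
        } else {
            None
        };
        // Prefer the in-memory copy; fall back to the shared store.
        let expected = match cached {
            Some(value) => value,
            None => self.store.lock().unwrap().get(sender).copied().unwrap_or(0),
        };
        if index != expected {
            return Err(format!("expected index {expected}, got {index}"));
        }
        // Advance the session and persist it.
        let next = expected + 1;
        self.store.lock().unwrap().insert(sender.to_owned(), next);
        if self.use_cache {
            self.cache.insert(sender.to_owned(), next);
        }
        Ok(())
    }
}

/// Main app decrypts, then the NSE decrypts, then the main app decrypts again.
/// Returns the result of the final decrypt.
fn run_scenario(use_cache: bool) -> Result<(), String> {
    let store: SharedStore = Arc::new(Mutex::new(HashMap::new()));
    let mut main_app = Process::new(store.clone(), use_cache);
    let mut nse = Process::new(store.clone(), use_cache);

    main_app.decrypt("alice", 0)?;
    nse.decrypt("alice", 1)?;
    main_app.decrypt("alice", 2)
}
```

In this model `run_scenario(true)` returns an `Err`, because the main app still holds the session state from before the NSE advanced it, while `run_scenario(false)` returns `Ok(())` since every read goes to the shared store.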
Correction: we have identified one root cause of this, which appears likely to cause a lot of the observed cases. If #3313 is real, it is likely another cause of these symptoms. |
Adds a test for #3110 that fails before the fix and passes afterwards.
I believe the fix for this is finally released today, in Element X iOS 1.6.7. |
I've been looking into bug reports on Element X, and this appears on iOS: (snipped a lot of data)
The important thing here is that the session was merrily decrypting to-device events fine, until the NSE process took over. At that point, it failed to find the session, and then failed to decrypt the event (due to the sender chain index being >0 I guess) which then marked the Olm session as wedged.
I believe the problem lies in decrypt_olm_message because the NSE process fails to find the session in existing_sessions, implying a cache miss when it should have hit.