Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clients may throw away one-time keys which are still published on the server, or have messages in flight #2356

Closed
5 tasks done
Tracked by #245
richvdh opened this issue Feb 23, 2017 · 12 comments
Closed
5 tasks done
Tracked by #245
Assignees
Labels
A-E2EE O-Occasional Affects or can be seen by some users regularly or most users rarely S-Major Severely degrades major functionality or product features, with no satisfactory workaround T-Defect T-Story

Comments

@richvdh
Copy link
Member

richvdh commented Feb 23, 2017

Background: On first login, clients generate about 50 one-time-keys. Each key consists of a public part and a secret part. The public parts are uploaded to the server, and the secret parts are retained on the device. A public one-time-key can then be "claimed" later by other devices, to initiate a secure Olm channel between the two devices. (The secure channel is then used for sharing Megolm message keys.) Whenever a client finds that there are less than 50 keys on the server, it will generate more key pairs and upload the public parts.

However, there is a limit to the number of secret keys that a client can keep hold of, and over time it may accumulate unused secret keys that appear to have never been used. The client has little alternative but to throw away such old keys.

Problem: there is nothing in the spec about the order in which servers give out one-time keys. In particular, Synapse does not give them out in the obvious first-in first-out order (see below). It is therefore possible for some very old public keys to hang around on the server. By the time those keys are eventually claimed by the sender, it is quite possible that the receiving device has forgotten the secret part of the corresponding key.

The net effect is that the recipient cannot read the messages sent over the Olm channel, does not receive the message keys, and the user sees an Unable To Decrypt error.

(Worse: we have mitigations to "unwedge" such broken Olm channels, but they don't actually work very well in this case due to matrix-org/matrix-rust-sdk#3427.)


Original description:

element-hq/element-web#2782 was a special-case of this, but there are other cases when we can throw away important one-time keys. It may be impossible to make this completely water-tight, but I'd like to add some logging when we throw away one-time keys, and reconsider the heuristic. (Does /claim give out the oldest key? which is the one we expire first?)

@richvdh
Copy link
Member Author

richvdh commented Jun 6, 2017

(Does /claim give out the oldest key? which is the one we expire first?)

The specific scenario I'm worrying about here is:

  1. Alice creates 50 OTKs, numbered 1 to 50, and uploads them to the server.
  2. Bob wants to message Alice and claims one of the keys. Alice's server hands out one of the keys randomly: let's say key 10. Alice does not immediately receive Bob's pre-key message. (Potentially, Bob's client goes offline. Or Bob may even claim the key maliciously.)
  3. Alice's server tells her that she has only 49 OTKs left on the server. She must make a new key; to do so, she must throw away the private part of one of her existing OTKs. She throws away the oldest one - key 1. However, the public part of that key is still on Alice's server.
  4. Later, Carol wants to message Alice, and claims a key. She is given key 1. Alice cannot decrypt it, because she threw away the private part. Sad times.

Synapse's algorithm for picking a key for /keys/claim is:

SELECT key_id, key_json FROM e2e_one_time_keys_json WHERE user_id = ? AND device_id = ?
 AND algorithm = ? LIMIT 1

Since that's totally unsorted, you've got a good chance of some very old keys sitting around for ages.

@MadLittleMods MadLittleMods added O-Occasional Affects or can be seen by some users regularly or most users rarely S-Major Severely degrades major functionality or product features, with no satisfactory workaround and removed P2 labels Mar 15, 2023
@MadLittleMods
Copy link

@richvdh Is this issue still useful to you?

Besides fixing the issue itself, has logging around this been added with the latest crypto updates (perhaps in the Rust implementation)? Is this area easier to trace now?

@richvdh
Copy link
Member Author

richvdh commented Mar 15, 2023

It's still a potential problem. I don't really remember if more logging got added; the fact that nothing is linked here suggests that it didn't.

Element-R won't necessarily fix the problem in itself but is rewriting this area, so it's not worth investigating in the legacy implementation.

@BillCarsonFr
Copy link
Member

Rust SDK is keeping a lot more OTK (50 * 100). Lib olm appears to keep 100.
So this would be a lot less common on rust.

@richvdh
Copy link
Member Author

richvdh commented Feb 27, 2024

This is mitigated in Element R, because the Rust SDK keeps up to 5000 OTKs before it starts throwing them away.

However, it is still a problem, because that 5000 limit will eventually fill up (from network blips etc which cause keys to be claimed but not used), and then the Rust SDK will start throwing keys away. Whilst the Rust SDK will throw away the "oldest" key, we have no guarantee that that key is not still active on the server, so the problem recurs.

@kegsay
Copy link

kegsay commented Feb 27, 2024

To further clarify why this is a problem, consider what happens when all 5000 keys are used on the client and 50 of those keys are on the server and someone calls /keys/claim for that device:

  • Synapse picks a random key to give to the caller of /keys/claim,
  • Synapse then drops the count to 49 and sends it down /sync,
  • The client sees 49 and uploads one more OTK. It has 5001 keys now so needs to throw away a key, which it does by age (oldest key gets thrown).
  • What happens if the oldest key was the one that Synapse provided to the caller?

In this case, when the caller encrypts for that device, the pre-key message will be undecryptable because the client threw away the OTK private key.

@kegsay
Copy link

kegsay commented Mar 15, 2024

This is a ticking time bomb. We are likely to see this as the age of client's DBs get older and more OTKs accumulate which are never claimed, consuming the 5000 buffer.

A heuristic like "oldest" could work, or perhaps more usefully, the server could send back the key IDs it gave out rather than just decrementing the count by 1. The latter is more invasive though.

@t3chguy t3chguy transferred this issue from element-hq/element-web Mar 15, 2024
@t3chguy
Copy link
Member

t3chguy commented Mar 15, 2024

Moving as this seems to apply to all clients, including the Rust SDK based ones

@richvdh richvdh changed the title We can still throw away one-time keys which have messages in-flight We can throw away one-time keys which have messages in-flight Apr 30, 2024
@richvdh richvdh removed the Z-Awaiting-Element-R Issues in code which will be replaced by the port of Element's crypto layer to Rust label May 28, 2024
@richvdh
Copy link
Member Author

richvdh commented Jul 9, 2024

Synapse's algorithm for picking a key for /keys/claim is:

SELECT key_id, key_json FROM e2e_one_time_keys_json WHERE user_id = ? AND device_id = ?
 AND algorithm = ? LIMIT 1

Since that's totally unsorted, you've got a good chance of some very old keys sitting around for ages.

It's actually worse than "totally unsorted". In practice, postgres will use the first matching entry from the e2e_one_time_keys_json_uniqueness index, which covers (user_id, device_id, algorithm, key_id) -- so Synapse will end up handing out the OTK with the lexicographically-lowest key_id.

Now, the key ID is a base64-encoding of a 32 bit int; in particular that means that, for example, the key ids for the 208th through 255th keys (AAAA0A - AAAA/w) sort before those of the 0th through 207th keys (AAAAAA-AAAAxw). So as soon as the 208th key is published, it will be issued by the next /keys/claim, even though we may have many of the earlier keys sitting around -- none of which will be issued until the client publishes another 50 keys or so. This pattern repeats for every 4-character prefix (256 keys) -- and gets even worse at key 13312 (key ID AAA0AA), which sorts before any earlier-generated key.

Key IDs later in the alphabet (AAA.?., where . is any character and ? is, say, s-z) are particularly likely to get dropped because of this: they are the keys likely to be unclaimed at the point the client starts publishing earlier-sorted key IDs, and therefore those most likely to still be unclaimed while the client is generating keys with much later key IDs.

@richvdh richvdh changed the title We can throw away one-time keys which have messages in-flight We can throw away one-time keys which are still published on the server, or have messages in flight Jul 9, 2024
@kegsay
Copy link

kegsay commented Jul 11, 2024

element-hq/synapse#17267 can cause us to have indefinitely claimed OTKs because the receiver timed out the request.

@richvdh
Copy link
Member Author

richvdh commented Jul 11, 2024

element-hq/synapse#17267 can cause us to have indefinitely claimed OTKs because the receiver timed out the request.

What do you mean by "indefinitely claimed OTKs"? An OTK that is claimed but never used, causing it to get "stuck" on the client? That's true; the same can result from clients issuing a /keys/claim request but losing connectivity before the response arrives. But that's a separate issue to this one?

Edit: well, it's not a separate issue: it's the reason that clients end up having to throw away the private part of OTKs. What I really mean is: it's pretty much expected behaviour.

@kegsay
Copy link

kegsay commented Jul 12, 2024

Indeed, the reason why I'm flagging it is because it outlines a real-world example of how we can accumulate keys on the client.

@richvdh richvdh changed the title We can throw away one-time keys which are still published on the server, or have messages in flight Clients may throw away one-time keys which are still published on the server, or have messages in flight Sep 12, 2024
@richvdh richvdh self-assigned this Nov 4, 2024
richvdh added a commit to element-hq/synapse that referenced this issue Nov 4, 2024
Currently, one-time-keys are issued in a somewhat random order. (In practice,
they are issued according to the lexicographical order of their key IDs.) That
can lead to a situation where a client gives up hope of a given OTK ever
being used, whilst it is still on the server.

Fixes: element-hq/element-meta#2356
@richvdh richvdh added the T-Story label Nov 5, 2024
richvdh added a commit to element-hq/synapse that referenced this issue Nov 6, 2024
Currently, one-time-keys are issued in a somewhat random order. (In
practice, they are issued according to the lexicographical order of
their key IDs.) That can lead to a situation where a client gives up
hope of a given OTK ever being used, whilst it is still on the server.

Related: element-hq/element-meta#2356
@richvdh richvdh closed this as completed Dec 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-E2EE O-Occasional Affects or can be seen by some users regularly or most users rarely S-Major Severely degrades major functionality or product features, with no satisfactory workaround T-Defect T-Story
Projects
None yet
Development

No branches or pull requests

6 participants