Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cross-process lock doesn't seem to work #3313

Open
richvdh opened this issue Apr 9, 2024 · 4 comments
Open

Cross-process lock doesn't seem to work #3313

richvdh opened this issue Apr 9, 2024 · 4 comments

Comments

@richvdh
Copy link
Member

richvdh commented Apr 9, 2024

We have a lock whose job it is to prevent both the main process and the NSE (notifications) process on Element X from using the OlmAccount at the same time. It also stops the two processes from performing an "encryption sync" (ie, a sliding-sync request in which we request to-device messages) concurrently.

However, we have evidence that it doesn't actually work. We have a rageshake where the main process makes a sync request:

2024-04-03T16:18:41.253319Z DEBUG matrix_sdk::http_client: Sending request 
  | crates/matrix-sdk/src/http_client/mod.rs:179 
  | spans: 
    root
    next_sync_with_lock
      > store_generation=26880
    sync_once
      > pos="29"
    send
      > server_versions=[V1_0, V1_1, V1_2, V1_3, V1_4, V1_5]
      > config=RequestConfig { timeout: 60s }
      > request_id="REQ-351"
      > method=POST
      > uri="https://slidingsync.lab.matrix.org/_matrix/client/unstable/org.matrix.msc3575/sync"
      > request_size="72 B"
      > request_body=b"{\"conn_id\":\"encryption\",\"extensions\":{\"to_device\":{\"since\":\"58922486\"}}}"
...
2024-04-03T16:19:02.361031Z DEBUG matrix_sdk::http_client: Got response 
  | crates/matrix-sdk/src/http_client/mod.rs:185 
  | spans: 
    root
    next_sync_with_lock
      > store_generation=26880
    sync_once
      > pos="29"
    send
      > server_versions=[V1_0, V1_1, V1_2, V1_3, V1_4, V1_5]
      > config=RequestConfig { timeout: 60s }
      > request_id="REQ-351"
      > method=POST
      > uri="https://slidingsync.lab.matrix.org/_matrix/client/unstable/org.matrix.msc3575/sync"
      > request_size="72 B"
      > request_body=b"{\"conn_id\":\"encryption\",\"extensions\":{\"to_device\":{\"since\":\"58922486\"}}}"
      > status=200
      > response_size="1.4 kiB"

... and overlapping with that, the NSE process makes another /sync request:

2024-04-03T16:18:59.168928Z DEBUG matrix_sdk::http_client: Sending request 
  | crates/matrix-sdk/src/http_client/mod.rs:179 
  | spans: 
    root
    get_notification
      > room_id="!wMBBpsZzEENxZzGCoY:matrix.org"
      > event_id="$Bydvzcl4X70NQyfw00EIeIWd3fN_Rh8rxqxdYDxLUFw"
    retry_decryption
    run_fixed_iterations
      > store_generation=26881
    sync_once
    send
      > server_versions=[V1_0, V1_1, V1_2, V1_3, V1_4, V1_5, V1_6, V1_7, V1_8, V1_9]
      > config=RequestConfig { timeout: 7s }
      > request_id="REQ-209"
      > method=POST
      > uri="https://slidingsync.lab.matrix.org/_matrix/client/unstable/org.matrix.msc3575/sync"
      > request_size="155 B"
      > request_body=b"{\"conn_id\":\"encryption\",\"txn_id\":\"6cad88800b6540f589143dc318fde96a\",\"extensions\":{\"to_device\":{\"enabled\":true,\"since\":\"58922486\"},\"e2ee\":{\"enabled\":true}}}"
...
2024-04-03T16:18:59.328466Z DEBUG matrix_sdk::http_client: Got response 
  | crates/matrix-sdk/src/http_client/mod.rs:185 
  | spans: 
    root
    get_notification
      > room_id="!wMBBpsZzEENxZzGCoY:matrix.org"
      > event_id="$Bydvzcl4X70NQyfw00EIeIWd3fN_Rh8rxqxdYDxLUFw"
    retry_decryption
    run_fixed_iterations
      > store_generation=26881
    sync_once
    send
      > server_versions=[V1_0, V1_1, V1_2, V1_3, V1_4, V1_5, V1_6, V1_7, V1_8, V1_9]
      > config=RequestConfig { timeout: 7s }
      > request_id="REQ-209"
      > method=POST
      > uri="https://slidingsync.lab.matrix.org/_matrix/client/unstable/org.matrix.msc3575/sync"
      > request_size="155 B"
      > request_body=b"{\"conn_id\":\"encryption\",\"txn_id\":\"6cad88800b6540f589143dc318fde96a\",\"extensions\":{\"to_device\":{\"enabled\":true,\"since\":\"58922486\"},\"e2ee\":{\"enabled\":true}}}"
      > status=200
      > response_size="1.6 kiB"

This shouldn't be possible, because both operations are done holding the cross-process lock (the main process in next_sync_with_lock, and the NSE process in run_fixed_iterations).

Now, I also see this in the log from the main process, in the middle of that next_sync_with_lock operation:

2024-04-03T16:18:51.391736Z  INFO matrix_sdk_common::store_locks: exiting the lease extension loop 
  | crates/matrix-sdk-common/src/store_locks.rs:237 
  | spans: 
    root

... which I believe means we are dropping the cross-process lock. That would obviously explain a lot of things, but I can't figure out how it happens.

@richvdh
Copy link
Member Author

richvdh commented Apr 9, 2024

I'm not really sure how best to debug this. We could probably start by turning on all the trace logging in CrossProcessStoreLock. It might also be informative to give each CrossProcessStoreLockGuard a sequence number, and log when each is allocated and dropped.

@Hywan
Copy link
Member

Hywan commented Apr 18, 2024

I'll try to check.

@kegsay
Copy link
Member

kegsay commented May 15, 2024

I think I repro'd this. It's flakey though. https://github.com/matrix-org/complement-crypto/actions/runs/9094842308/job/24996764717?pr=52#step:15:790 shows TestMultiprocessDupeOTKUpload failing. That test:

  • Creates Alice and Bob and join them to a room
  • Bob sends a message. This consumes one of Alice's OTKs.
  • Alice sees the OTK get consumed in /sync and tries to upload a new OTK.
  • This OTK is tarpitted in the test, artificially delaying the response by 4 seconds.
  • During this gap, an NSE process is made and NotificationClient.GetNotification is called.
  • If the lock isn't working, the NSE process will also do /sync, see the OTK get consumed, and upload a different key with the same key ID, causing an HTTP 400 to be returned.

This is exactly what the test shows: notification_test.go:466: /keys/upload returned an error, duplicate key upload? POST http://hs1:8008/_matrix/client/v3/keys/upload (token=syt_dXNlci0zNS1hbGljZQ_YwyfILDCncfsOAbiZBSL_3CgJtE) req_len=538 => HTTP 400

@kegsay
Copy link
Member

kegsay commented Jun 28, 2024

I'll try to come back to this next week to see if I can get a reliable repro case.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants