fix(tee): fix race condition in batch locking #3342

pbeza · 2024-11-28T11:55:23Z

What ❔

After [scaling][1] [zksync-tee-prover][2] to two instances/replicas on Azure for azure-stage2, azure-testnet2, and azure-mainnet2, we started seeing [duplicated proving for some batches][3].

While this is not an error, it wastes resources. The cause was a race condition in batch locking. This PR fixes the issue by adding atomic batch locking.

[1]: https://github.com/matter-labs/gitops-kubernetes/pull/7033/files
[2]: https://github.com/matter-labs/zksync-era/blob/aaca32b6ab411d5cdc1234c20af8b5c1092195d7/core/bin/zksync_tee_prover/src/main.rs
[3]: https://grafana.matterlabs.dev/goto/M1I_Bq7HR?orgId=1

Why ❔

To fix a bug that only shows up when zksync-tee-prover runs on multiple instances.

Checklist

PR title corresponds to the body of PR (we generate changelog entries from PRs).
Tests for the changes have been added / updated.
Documentation comments have been added / updated.
Code has been formatted via zkstack dev fmt and zkstack dev lint.

slowli

Dumb question: How is the locking made atomic in this PR? AFAIU, the first SELECT statement, if run concurrently, can still return the same L1 batch number unless some kind of row-level locking is implemented (cf. SELECT FOR UPDATE SKIP LOCKED in this contract verifier query). I'm not even sure the UPDATE query will fail for the transaction committed last in case of a race (maybe it would with the serializable isolation level, but I'd argue that erroring is not the best course of action here; row-level locks seem to work better).
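The row-level locking idea raised here can be illustrated without a database. The sketch below is an in-memory analogy, not the SQL from this PR: each row gets its own `Mutex`, and `try_lock` loosely plays the role of `SKIP LOCKED` by skipping rows that another worker currently holds. All names (`Row`, `claim_next`, `run_workers`) are hypothetical and exist only for illustration.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical stand-in for one row of a proof-generation table.
struct Row {
    batch_number: u32,
    picked: bool,
}

/// Claims the first unpicked row whose lock is free.
/// A row held by another worker is skipped rather than waited on,
/// loosely mirroring `SELECT ... FOR UPDATE SKIP LOCKED`.
fn claim_next(table: &[Mutex<Row>]) -> Option<u32> {
    for slot in table {
        if let Ok(mut row) = slot.try_lock() {
            // The check and the update happen under the same lock,
            // so no other worker can observe the row in between.
            if !row.picked {
                row.picked = true;
                return Some(row.batch_number);
            }
        }
    }
    None
}

/// Spawns `workers` threads that claim batches until none are left,
/// and returns every claimed batch number in sorted order.
fn run_workers(batches: u32, workers: usize) -> Vec<u32> {
    let table: Arc<Vec<Mutex<Row>>> = Arc::new(
        (0..batches)
            .map(|n| Mutex::new(Row { batch_number: n, picked: false }))
            .collect(),
    );
    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let table = Arc::clone(&table);
            thread::spawn(move || {
                let mut mine = Vec::new();
                while let Some(n) = claim_next(&table) {
                    mine.push(n);
                }
                mine
            })
        })
        .collect();
    let mut claimed: Vec<u32> = handles
        .into_iter()
        .flat_map(|h| h.join().expect("worker panicked"))
        .collect();
    claimed.sort_unstable();
    claimed
}
```

Calling `run_workers(50, 4)` should return each batch number exactly once, because a row can only flip to `picked` while its lock is held. A worker that finds every remaining row locked gets `None` instead of waiting.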


pbeza · 2024-11-29T18:13:12Z

Dumb question: How is the locking made atomic in this PR? (...)

Not a dumb question at all! The dumb one here was me! ;P I totally misunderstood what SQL transactions can actually handle in this context. Had to brush up on the finer details of SQL locking. Thanks for steering me in the right direction! These two links were super helpful:


pbeza · 2024-11-29T18:56:52Z

@slowli, I’ve addressed your code review comments. Take a look when you get a chance.

It’s kinda hard to test properly without deploying it to stage and letting it run for a while. Specifically, let me know if locking rows in the proof_generation_details table is okay (instead of just locking tee_proof_generation_details rows).


pbeza · 2024-12-03T13:34:28Z

@slowli, @haraldh suggested locking the entire tee_proof_generation_details table to keep things simpler. He also raised a concern that if one TEE prover locks the batch, a second TEE prover instance will just get a no job response instead of waiting for new batches to become available.

Let me know if this more fine-grained locking approach still works for you, or if we’re missing something – or maybe there’s an easier way we haven’t considered.
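The coarser, table-level approach mentioned above can be sketched in the same spirit. This is again an in-memory analogy under assumed names (`Table`, `claim_next_table_locked`, `run_table_locked`), not the SQL used in the repository: a single `Mutex` guards the whole table, so a second worker waits for the lock instead of skipping ahead.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical table: (batch number, picked flag) pairs behind ONE lock.
type Table = Mutex<Vec<(u32, bool)>>;

/// Locks the whole table, then claims the first unpicked batch.
/// Other workers block on `lock()` until the claim is finished.
fn claim_next_table_locked(table: &Table) -> Option<u32> {
    let mut rows = table.lock().expect("table lock poisoned");
    let row = rows.iter_mut().find(|row| !row.1)?;
    row.1 = true;
    Some(row.0)
}

/// Spawns `workers` threads that claim batches until none are left,
/// and returns every claimed batch number in sorted order.
fn run_table_locked(batches: u32, workers: usize) -> Vec<u32> {
    let table: Arc<Table> =
        Arc::new(Mutex::new((0..batches).map(|n| (n, false)).collect()));
    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let table = Arc::clone(&table);
            thread::spawn(move || {
                let mut mine = Vec::new();
                while let Some(n) = claim_next_table_locked(&table) {
                    mine.push(n);
                }
                mine
            })
        })
        .collect();
    let mut claimed: Vec<u32> = handles
        .into_iter()
        .flat_map(|h| h.join().expect("worker panicked"))
        .collect();
    claimed.sort_unstable();
    claimed
}
```

The trade-off is the usual one: a single lock is simpler to reason about and never hands out a false "no job" answer while batches remain, but every claim is serialized through it.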


…3358)

## What ❔

Commit a7dc0ed (PR #3342) was supposed to fix a race condition in batch locking by introducing SQL row-locking, but it [didn't work][2] as expected.

![Screenshot From 2024-12-04 11-32-32](https://github.com/user-attachments/assets/959ffc3c-593f-409a-87ab-68ec197040a0)

Now we are switching back to coarser-grained table-level locking, as [originally suggested][1] by Harald. The original fix was hard to test without deploying it to `stage` because of the nondeterministic nature of the problem, so we needed to merge it to the `main` branch to test it properly.

[1]: #3342 (comment)
[2]: https://grafana.matterlabs.dev/goto/AhEd5FVNg?orgId=1

## Why ❔

To fix the bug that only activates after running `zksync-tee-prover` on multiple instances.

## Checklist

- [x] PR title corresponds to the body of PR (we generate changelog entries from PRs).
- [ ] Tests for the changes have been added / updated.
- [ ] Documentation comments have been added / updated.
- [x] Code has been formatted via `zkstack dev fmt` and `zkstack dev lint`.

🤖 I have created a release *beep* *boop*

---

## [25.3.0](core-v25.2.0...core-v25.3.0) (2024-12-11)

### Features

* change seal criteria for gateway ([#3320](#3320)) ([a0a74aa](a0a74aa))
* **contract-verifier:** Download compilers from GH automatically ([#3291](#3291)) ([a10c4ba](a10c4ba))
* integrate gateway changes for some components ([#3274](#3274)) ([cbc91e3](cbc91e3))
* **proof-data-handler:** exclude batches without object file in GCS ([#2980](#2980)) ([3e309e0](3e309e0))
* **pruning:** Record L1 batch root hash in pruning logs ([#3266](#3266)) ([7b6e590](7b6e590))
* **state-keeper:** mempool io opens batch if there is protocol upgrade tx ([#3360](#3360)) ([f6422cd](f6422cd))
* **tee:** add error handling for unstable_getTeeProofs API endpoint ([#3321](#3321)) ([26f630c](26f630c))
* **zksync_cli:** Health checkpoint improvements ([#3193](#3193)) ([440fe8d](440fe8d))

### Bug Fixes

* **api:** batch fee input scaling for `debug_traceCall` ([#3344](#3344)) ([7ace594](7ace594))
* **tee:** correct previous fix for race condition in batch locking ([#3358](#3358)) ([b12da8d](b12da8d))
* **tee:** fix race condition in batch locking ([#3342](#3342)) ([a7dc0ed](a7dc0ed))
* **tracer:** adds vm error to flatCallTracer error field if exists ([#3374](#3374)) ([5d77727](5d77727))

---

This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).

---------

Co-authored-by: zksync-era-bot

pbeza requested review from haraldh and slowli November 28, 2024 14:03

pbeza force-pushed the tee/fix/atomic-batch-locking branch from a86fc98 to e95cb27 Compare November 28, 2024 18:10

pbeza requested a review from RomanBrodetski November 29, 2024 11:59

pbeza force-pushed the tee/fix/atomic-batch-locking branch from e95cb27 to 46dcfde Compare November 29, 2024 12:06

pbeza force-pushed the tee/fix/atomic-batch-locking branch from 46dcfde to 7d96c1c Compare November 29, 2024 12:10

slowli reviewed Nov 29, 2024

core/lib/dal/src/models/storage_tee_proof.rs (outdated)

core/lib/dal/src/tee_proof_generation_dal.rs (outdated)

pbeza added 2 commits November 29, 2024 17:44

Merge remote-tracking branch 'origin/main' into tee/fix/atomic-batch-locking

17977dc

Addressed review comments

464b5d4

pbeza added 2 commits November 29, 2024 19:19

fixup! Addressed review comments

574896b

Merge remote-tracking branch 'origin/main' into tee/fix/atomic-batch-locking

66a9154

pbeza commented Nov 29, 2024

core/lib/dal/src/tee_proof_generation_dal.rs

pbeza requested a review from slowli November 29, 2024 19:00

slowli previously approved these changes Dec 2, 2024


pbeza added 2 commits December 3, 2024 12:35

Merge remote-tracking branch 'origin/main' into tee/fix/atomic-batch-locking

bef72a4

Address Alex's code review comments

6796543

pbeza dismissed slowli’s stale review via 6796543 December 3, 2024 11:48

fixup! Address Alex's code review comments

ecd8c33

pbeza requested a review from slowli December 3, 2024 12:19

slowli previously approved these changes Dec 3, 2024


Address Harald's code review comments

a1f99ab

pbeza dismissed slowli’s stale review via a1f99ab December 3, 2024 13:24

pbeza requested a review from slowli December 3, 2024 13:38

Revert "Address Harald's code review comments"

de8de11

This reverts commit a1f99ab.

haraldh approved these changes Dec 3, 2024


haraldh enabled auto-merge December 3, 2024 16:32

haraldh added this pull request to the merge queue Dec 3, 2024

slowli approved these changes Dec 3, 2024


Merged via the queue into main with commit a7dc0ed Dec 3, 2024
32 checks passed

haraldh deleted the tee/fix/atomic-batch-locking branch December 3, 2024 17:15

zksync-era-bot mentioned this pull request Dec 3, 2024

chore(main): release core 25.3.0 #3313

Merged

pbeza mentioned this pull request Dec 4, 2024

fix(tee): correct previous fix for race condition in batch locking #3358

Merged

4 tasks
