RFC: SNARK pool batching #4882
## Summary
[summary]: #summary

Complicate the SNARK pool so that it can do "batch verification": verifying several proofs at once.

## Motivation
[motivation]: #motivation

Verification of new SNARK work becomes a throughput bottleneck at high transactions-per-second configurations (see [#3981](https://github.com/CodaProtocol/coda/pull/3981)). There is a more efficient "batch verification" that we can use: by verifying N SNARKs at once, we spend `600ms + N*10ms` instead of `N * 600ms` <sup>(citation needed)</sup>. This comes with a catch: if batch verification fails, we only know that _some_ SNARK in the batch failed to verify, but not which one. When a batch fails, we need to do extra work to attribute the failure to a particular sender IP / prover and to determine which of the SNARKs we can actually include in the pool. By sending incorrect or fraudulent proofs, an attacker can therefore degrade the throughput of a block producer: each transaction in a block needs to come with two (or occasionally one) proofs, and if not enough proofs are available, the block producer is forced to leave profit on the table.
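To make the tradeoff concrete: under this cost model, a batch of 100 proofs takes roughly `600ms + 100 * 10ms ≈ 1.6s`, versus `100 * 600ms = 60s` when each proof is verified on its own.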
## Detailed design
[detailed-design]: #detailed-design

The proposed mechanism is to have two "batch modes", Batch All and Batch by Prover, and to split provers into two categories: Trusted and Untrusted.

Provers start out Untrusted and have their proofs checked separately from trusted work. Proofs from Untrusted provers are batched "by prover": for each prover, all of their proofs are verified together in an individual batch. If such a batch fails to verify, that prover is banned and none of their proofs are used. After `TRUST_AMOUNT_PROOFS` verified proofs, a prover is promoted to Trusted. All proofs from Trusted provers are verified in a single batch; if that batch fails, the proofs are re-batched "by prover", and the provers of the failing batches are punished.
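As a minimal OCaml sketch of the per-prover trust transitions described above (the type, function names, and the placeholder value for `TRUST_AMOUNT_PROOFS` are illustrative only, not taken from the existing pool code):

```ocaml
(* Illustrative sketch; not the actual SNARK pool implementation. *)
type trust_status =
  | Untrusted of { verified_so_far : int }  (* proofs verified in a per-prover batch *)
  | Trusted                                 (* proofs go into the shared "batch all" *)
  | Banned                                  (* proofs are dropped *)

(* Placeholder: the real TRUST_AMOUNT_PROOFS is an unresolved question below. *)
let trust_amount_proofs = 10

(* A batch attributed to this prover verified successfully. *)
let record_success status ~proofs_in_batch =
  match status with
  | Untrusted { verified_so_far } ->
      let verified_so_far = verified_so_far + proofs_in_batch in
      if verified_so_far >= trust_amount_proofs then Trusted
      else Untrusted { verified_so_far }
  | Trusted | Banned -> status

(* A batch attributed to this prover failed to verify: ban them outright. *)
let record_failure (_ : trust_status) = Banned
```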
This changes how SNARK proofs are collected and verified: instead of verifying them immediately as they arrive, we wait for some time, collecting the received proofs into their respective batches. We'll do this like so:

- Store a queue of pending SNARK work, starting out empty.
- Upon receipt of the first piece of SNARK work into the queue, start a `MAX_SNARK_DELAY`-second timer.
- When the queue reaches `MAX_SNARK_BATCH_LENGTH`, or when the timer expires, clear the queue and start the batch verification process.

The timer ensures that, even when SNARK work is coming in slowly, we aren't adding too much latency to gossip verification.
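A rough sketch of this queue-and-timer logic, assuming Jane Street's `Core` for `Queue` and hiding the actual timer and verifier behind hypothetical `start_timer` and `verify_batch` callbacks; the length constant is a placeholder:

```ocaml
open Core

type 'work t =
  { queue : 'work Queue.t
  ; mutable timer_armed : bool
  }

let create () = { queue = Queue.create (); timer_armed = false }

(* Placeholder for MAX_SNARK_BATCH_LENGTH. *)
let max_snark_batch_length = 64

(* Hand the accumulated batch to the verifier and reset the queue. *)
let flush t ~verify_batch =
  let batch = Queue.to_list t.queue in
  Queue.clear t.queue ;
  t.timer_armed <- false ;
  verify_batch batch

let add t work ~start_timer ~verify_batch =
  Queue.enqueue t.queue work ;
  (* The first piece of work into an empty queue arms the MAX_SNARK_DELAY timer. *)
  if not t.timer_armed then (
    t.timer_armed <- true ;
    start_timer (fun () -> if t.timer_armed then flush t ~verify_batch) ) ;
  (* Hitting the length cap flushes the batch immediately. *)
  if Queue.length t.queue >= max_snark_batch_length then flush t ~verify_batch
```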
> @emberian is there a need for a […] Have a state type:
>
> ```ocaml
> type verifier_state =
>   { mutable out_for_verification : Snark_work.t list option
>   ; queue : Snark_work.t Queue.t
>   }
> ```

> I guess basically you are suggesting waiting […]
In order to sort provers into those two categories, we need an unforgeable notion of prover identity. I propose we add a new signature over the SNARK work, using a key that is present in the best-tip ledger. This adds a new failure mode to SNARK proofs: signing with a key that is not present in the ledger. Users should be discouraged from storing value in that account, because the private key will by necessity be relatively unprotected: it needs to be available on the SNARK coordinator to sign proofs before broadcasting them.
> How much time does it take to check a signature? Is that the 10 ms in […]

> Aha, good point, this doesn't account for the extra signature verification time. I can add a benchmark for this, and I suspect it'll be slower than it could be (using a different signature scheme) but not as slow as not batching.

> @imeckler gave me new observed numbers: […]

> The identity can be included as part of the signature-of-knowledge message and thus implicitly included in SNARK verification.
Concretely, this would change the `Snark_pool_diff` type from `Work.t * Ledger_proof.t One_or_two.t Priced_proof.t` to `{signature: Signature.t; body: Snark_work_body.t}`, where `Snark_work_body` is a stable type with contents `Public_key.t * Work.t * Ledger_proof.t One_or_two.t Priced_proof.t`.
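Sketched as OCaml types, assuming the surrounding modules (`Signature`, `Public_key`, `Work`, `Ledger_proof`, `One_or_two`, `Priced_proof`) keep their current interfaces; the module and field layout here is illustrative, not the final versioned definition:

```ocaml
(* Proposed shape of the signed diff; a sketch only. *)
module Snark_work_body = struct
  (* Stable type: the prover's public key plus the priced work being announced. *)
  type t = Public_key.t * Work.t * Ledger_proof.t One_or_two.t Priced_proof.t
end

module Snark_pool_diff = struct
  type t =
    { signature : Signature.t  (* signs [body]; the key must be present in the best-tip ledger *)
    ; body : Snark_work_body.t
    }
end
```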
By requiring provers to have an identity present in the ledger, and by banning the IPs _and identities_ of senders of bad proofs, we avoid a Sybil attack in which one party can arbitrarily degrade TPS for individual block producers at the low cost of an IP address.
### Testing

In the integration tests: add a way to spin up a new SNARK coordinator node, broadcast a bad proof, and verify that the node is banned and that the "batch by prover" mode works.
In the SNARK pool unit tests: manually inject some bogus SNARK pool diffs.
## Drawbacks
[drawbacks]: #drawbacks

Requiring an account in order to submit proofs is a really nasty operational requirement. One possible mitigation: run a server with a massive, horizontally scaled SNARK verifier pool that accepts proofs from anyone and signs them with its own account. SNARK pools (analogous to mining pools) could provide this service to their clients.
## Rationale and alternatives
[rationale-and-alternatives]: #rationale-and-alternatives

Some alternatives to "batching by prover":

- Do a split-in-half-and-conquer on the failing batches. This can end up wasting a substantial amount of work.
- Unbatch entirely and verify the proofs sequentially. This lets us keep good proofs from bad provers, but is somewhat contrary to the immediate punishment that follows a failure.
An early idea was an extra "not yet known" state before provers become Untrusted. In that extra state we would not do any batching, so as to avoid committing potentially a lot of time to verifying a bogus batch before any reputation is established. This was deemed unnecessary.
One idea for identifying provers was to use the fee recipient account, since it's already encoded into the SoK. Without an additional signature this allows forgery, as only public information is needed, and if the proof fails to verify then nothing can be inferred from it. Using the fee recipient account itself also seems unwise: it will be storing actual value, and having its private key just lying around on the SNARK coordinator is risky (and a regression from today's world, where SNARK coordinators and workers need no sensitive information).
> Not sure if I understand the forgery part. The SoK is baked into the snark, so if someone sent it with a different key, the proof would fail to verify?

> The SoK digest is baked into the snark (and I'm not sure there's a way to extract it? maybe it's a public input), and if the proof fails to verify because it was sent with a different key, we still don't know who to blame for the bad proof.
It is not possible to use IP addresses alone for the batching categories, because the gossip-net sender of a proof is unrelated to who produced it.
One idea that was discussed was broadcasting "SNARK fraud proofs", consisting of the signed-but-wrong proof, so that the entire network could ban that prover. This seems unnecessary: it imposes a "we need to try and fail to verify this proof" cost on the entire network. It may still be preferable to letting the attacker poison as many batches as they want across the network while they get banned by each individual node, however.
We might want to instaban senders of work whose signing key isn't present in the best-tip ledger. There are some complications around the network not actually having consensus on the best tip yet. We could use the locked ledger instead, but this adds a lot of delay to becoming a new SNARK coordinator.
## Prior art
[prior-art]: #prior-art

I don't know.
## Unresolved questions
[unresolved-questions]: #unresolved-questions

Is requiring the signing key to be present in the ledger an acceptable price to pay? Are there any other ways to avoid the Sybil attack? Is the Sybil attack worth mitigating?
All of the constants above need to be determined:

`MAX_SNARK_DELAY`: With randomsub, a network of 10k nodes has an expected diameter of around 4, so a delay of around 5 seconds shouldn't put undue burden on the SNARK pool (an effective diameter of 20s, versus our 3-minute slot time and roughly 2-minute proving time). Assuming 600ms of overhead per batch, this is a 12% overhead, compared to a 0.3% overhead if we could somehow batch all proofs received in a 3-minute slot. The 12% batching overhead can be cranked down by increasing the max batching window.
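Working those numbers from the cost model above: 600ms of per-batch overhead amortized over a 5-second batching window is `600ms / 5s = 12%`, while amortizing it over a full 180-second slot would be `600ms / 180s ≈ 0.3%`.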
`MAX_SNARK_BATCH_LENGTH` can almost be computed from the longest amount of time we are willing to accept as a sunk cost for a bad batch, modulo the complication that untrusted provers incur the extra fixed cost per batch.
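As a back-of-the-envelope using the cost model above (with the caveat that the per-proof cost still needs benchmarking): if `T_sunk` is the longest verification time we are willing to waste on a batch that turns out to be bad, then roughly `MAX_SNARK_BATCH_LENGTH ≈ (T_sunk - 600ms) / 10ms`.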
`TRUST_AMOUNT_PROOFS` is hard to set. It's not clear that there are notions of economic incentive to appeal to here (and [rationality is self-defeating](https://bford.info/2019/09/23/rational/)). Is there any pressing reason it shouldn't just be `1`?
> Do you think batching SNARKs that are required immediately (when the block producer is scheduled to create the next block) would also be nice? If we wait too long to batch-verify, or some SNARKs are in a batch that won't be verified until after the diff is created, then we might not be able to include as many transactions as if they were serialized. But this might not be as beneficial from a performance standpoint.
>
> Also, we have this `delay` factor in the scan state, so if there are enough provers and the coordinator picks the work that is needed first, then the nodes producing blocks should have the required work at least a block before they need it (if the delay is at least 1). So maybe it is ok.

> @deepthiskumar how can we detect which of these proofs are required in the next block? I think it makes sense to eagerly start a batch before the timeout/max batch size if there's a pending block production deadline, but I'm not sure how to know which proofs to give that treatment.

> You can query the scan state at the best tip to get all the required work for the next block. `work_statements_for_new_diff` in `transaction_snark_scan_state.ml` will fetch you that.