Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

approval-distribution: process assignments and votes in parallel #732

Closed
sandreim opened this issue Jan 23, 2023 · 22 comments
Closed

approval-distribution: process assignments and votes in parallel #732

sandreim opened this issue Jan 23, 2023 · 22 comments
Assignees
Labels
I4-refactor Code needs refactoring.

Comments

@sandreim
Copy link
Contributor

sandreim commented Jan 23, 2023

Follow up from some earlier scaling tests with paritytech/polkadot#6247. The initial attempts did made it apparent that approval-distribution and approval-voting are bottlenecks increasing approval checking finality lag in our tests with 250 paravalidators and 50 parachains at the time.
That argument still holds true, but only at much larger scale than what we were experimenting at the time. The real issue was paritytech/polkadot#6400 which was solved.

Bumping up to 350+ para validators requires that we make approval-voting and approval-distribution do more work in the same amount of time. The first step is to implement the parallelization changes from paritytech/polkadot#6247 which would result in maxing out cpu usage in approval-voting which sits right now at 80% due to the serial processing and waiting for completion of vote imports issued by approval-distribution. This will enable us to further improve how fast the subsystem churns through work by making approval-voting parallelise signature checks.

Important note: The implementation will make use of internal queueing of messages. We have to be careful to preserve the backpressure so we need these internal queues to be bounded, for example by a fixed number of imports that can be pending at a given time.

@eskimor
Copy link
Member

eskimor commented Jan 23, 2023

I am still in favor of trying real hard to improve performance/reduce work somehow. Using multiple cores just for signature checking in approval voting ... maxing out multiple cores, even when all parachains are 100% idle? That does sound like something we should be able to improve.

@sandreim
Copy link
Contributor Author

Yes, I created #729 and if I understand correctly it does exactly what you mean and helps a lot.

I think we should implement improvements at the protocol/cryptography level as well as architectural. While in perfect network conditions maybe the changes suggested in #729 might be enough to scale to 500+ validators, but if we go higher or upper tranches trigger or disputes happen then we clearly need a solution to be able to handle bursts of assignments and votes efficiently and scale beyond 1 CPU.

@rphmeier
Copy link
Contributor

maxing out multiple cores, even when all parachains are 100% idle? That does sound like something we should be able to improve.

FWIW, we just have these constant overheads for every parachain block. The flip side is that they don't get worse when parachains get heavier.

@alexggh alexggh self-assigned this May 30, 2023
@alexggh
Copy link
Contributor

alexggh commented Jun 12, 2023

Spent some time introspecting and profiling the approval voting to understand why we see the high ToF in approval-distribution, here are my findings, in the order of the impact:

  1. We are processing a lot of assignments and votes, which come in bursts, and the crypto for checking them takes 2-3 hundreds of micro-seconds(with rare occasions where it takes 2-3millis) , that capes the full capacity of approval-voting assignments/votest processing to about 3k checks.
    At peak performance that's how many assignments/approval we are processing, so if the messages arrive in the queue at about the same time in the approval-distribution queue, just by the nature of processing them serially we are going to see higher than 1s ToF. There are also a lot of messages that contain more than one assignments, so there will be messages that keep approval-distribution busy for a few milliseconds, just because processing all the assignments/votes in the message take time to be processed, see process_incoming_peer_message. Besides this, our system doesn't process assignments/approvals at peak capacity because of the items bellow.

Reducing the assignments from this paritytech/polkadot#6782 will definitely help with the number of messages we have to process.

  1. There are other operations that keep the approval-voting main loop busy, occasionally:
  • handle_actions(issue_approval) , every 6 seconds we have around 6 messages of this operation which takes 3-4 milliseconds to be processed, with all the useful time being spent in signing the approval here.

  • ActiveLeaves - takes about 20 millis every 6 seconds.

  • Occasionally(once or twice a second) approval-voting db flushes takes 1-2 milliseconds

While this operations happen relatively rarely, they do happen at peak time, so they might be just the straw that break the camel's back.

  1. Occasionally(every few seconds), there seems to be situations where approval-distribution is waiting after approval-voting for 3-4 millis, but approval-voting doesn't do any work it is just stuck in the main select my theory is that in those cases the subsystem future doesn't execute because the system is busy executing other tasks. (Do you guys have any other theory) ?

@eskimor
Copy link
Member

eskimor commented Jun 12, 2023

Great work! On signature checking, do we know how much of it is approvals, vs. assignments?

@burdges
Copy link

burdges commented Jun 12, 2023

ToF?

We need better benchmarks in schnorrkel but verifying an approval should take roughly 40 microseconds, so like ed25519, and verifying an assignment should be less than 3*25 + 40 = 115 micro seconds, so in 1 millisecond on one core you'd maybe 25 approvals, or 8 assignments, or 6 of each.

I've some tricks to speed up the VRF parts..

  1. Actually use all the features correctly, ala Support SIMD on Rust stable dalek-cryptography/curve25519-dalek#520
  2. We can prove multiple VRFs with one signature, which helps our average case some, and really helps if attackers push no-shows on multiple candidates.
  3. Thin VRF merges the verification equations, which maybe saves something right now, but definitely makes batching cheaper.

An elligator hash-to-curve should be fast, but we could cache the hash-to-curve if we uniformized them by excluding the validator public key from the hashed-to-curve. As the extreme we could compute a base point table for those uniformized hash-to-curves, which saves 10 microseconds if not batching.

I'd think sr25519 and ed25519 cost the same, so we'd only speed up approvals verification though either batching or else by adopting a considerably larger but non-batchable signature, like Rabin-Williams. We no longer think network throughout bottlenecks us here?

We already have something placing multiple assignments into the same message? Are the assignments by the same signer or possibly different signers?

handle_actions(issue_approval) , every 6 seconds we have around 6 messages of this operation which takes 3-4 milliseconds to be processed, with all the useful time being spent in signing the approval here.

Why 6? An sr25519 signing operation should take only 15 microseconds, just like ed25519, so maybe the keystore has problems here? We need roughly 25+25+15 = 65 microseconds for VRF signing, btw.

ActiveLeaves - takes about 20 millis every 6 seconds.

Is this querying approvals data or something about the chain?

Occasionally(once or twice a second) approval-voting db flushes takes 1-2 milliseconds

mmap fixes this one. approvals-voting db needs to be serialized in memory for this, but that's not really problematic.

@sandreim
Copy link
Contributor Author

sandreim commented Jun 12, 2023

Spent some time introspecting and profiling the approval voting to understand why we see the high ToF in approval-distribution, here are my findings, in the order of the impact:

Nice work @alexggh

  1. We are processing a lot of assignments and votes, which come in bursts, and the crypto for checking them takes 2-3 hundreds of micro-seconds(with rare occasions where it takes 2-3millis) , that capes the full capacity of approval-voting assignments/votest processing to about 3k checks.
    At peak performance that's how many assignments/approval we are processing, so if the messages arrive in the queue at about the same time in the approval-distribution queue, just by the nature of processing them serially we are going to see higher than 1s ToF. There are also a lot of messages that contain more than one assignments, so there will be messages that keep approval-distribution busy for a few milliseconds, just because processing all the assignments/votes in the message take time to be processed, see process_incoming_peer_message. Besides this, our system doesn't process assignments/approvals at peak capacity because of the items bellow.

Reducing the assignments from this paritytech/polkadot#6782 will definitely help with the number of messages we have to process.

Yes! But this improvement becomes less useful if we just scale the validators up. It is more useful when we actually need to do more work per validator - more parachains.

  1. There are other operations that keep the approval-voting main loop busy, occasionally:
  • handle_actions(issue_approval) , every 6 seconds we have around 6 messages of this operation which takes 3-4 milliseconds to be processed, with all the useful time being spent in signing the approval here.
  • ActiveLeaves - takes about 20 millis every 6 seconds.
  • Occasionally(once or twice a second) approval-voting db flushes takes 1-2 milliseconds

While this operations happen relatively rarely, they do happen at peak time, so they might be just the straw that break the camel's back.

  1. Occasionally(every few seconds), there seems to be situations where approval-distribution is waiting after approval-voting for 3-4 millis, but approval-voting doesn't do any work it is just stuck in the main select my theory is that in those cases the subsystem future doesn't execute because the system is busy executing other tasks. (Do you guys have any other theory) ?

I also observed this some time ago. My theory back then:

  • both approval voting and approval distribution run on their own threads (for obvious reasons), but that doesn't mean anything in terms of priority
  • it might happen that due to system load, the OS will prefer to schedule any other threads vs the approval voting thread.
  • we can also observe the high CPU load of the networking stack which I believe also plays a role here

@alexggh
Copy link
Contributor

alexggh commented Jun 12, 2023

ToF?

Time of flight and it defines how much time a message arrived in the approval-distribution waits till it gets picked up for processing.

do we know how much of it is approvals, vs. assignments?

They seem to be evenly spread, e.g 1 second snapshot: 1717 assignments adding up to 640ms of processing time and 2040 approvals adding up to 360ms.

We need better benchmarks in schnorrkel but verifying an approval should take roughly 40 microseconds, so like ed25519, and verifying an assignment should be less than 3*25 + 40 = 115 micro seconds, so in 1 millisecond on one core you'd maybe 25 approvals, or 8 assignments, or 6 of each

I'll dig a bit deeper here, but in both my instrumentation on the tracing we have here assignment imports seem to take 200-300 micro seconds and approvals imports between 100 and 200 micro seconds. Also, I noticed rare cases where checking of a RelayVRFDelay assignment takes 1-2 ms.

handle_actions(issue_approval) , every 6 seconds we have around 6 messages of this operation which takes 3-4 milliseconds to be processed, with all the useful time being spent in signing the approval here.

Why 6?

It seems we are doing it per candidate, so I think we have more at the same time.

An sr25519 signing operation should take only 15 microseconds, just like ed25519, so maybe the keystore has problems here?

What problems are you thinking about ?

We need roughly 25+25+15 = 65 microseconds for VRF signing, btw.

Will re-check it.

ActiveLeaves - takes about 20 millis every 6 seconds.

Is this querying approvals data or something about the chain?

Yes, it is asking the runtime for block information a few times.

I've some tricks to speed up the VRF parts..

Actually use all the features correctly, ala dalek-cryptography/curve25519-dalek#520

Will look to see if this is used on the images I did the testing, if not I'll check what impact it has.

We already have something placing multiple assignments into the same message? Are the assignments by the same signer or possibly different signers?

Are you referring to this paritytech/polkadot#6782, if yes. The answer is no, I wasn't running that PR when I did the measurements.

We no longer think network throughout bottlenecks us here?

The messages are already in the approval-distribution queue, so I would say the network shouldn't be the bottleneck here.

@burdges
Copy link

burdges commented Jun 12, 2023

We no longer think network throughout bottlenecks us here?

The messages are already in the approval-distribution queue, so I would say the network shouldn't be the bottleneck here.

If I understand, we do not currently know what performance issues the networking itself brings, so we should not yet assume we can make life harder for the networking either.

Actually use all the features correctly, ala dalek-cryptography/curve25519-dalek#520

Will look to see if this is used on the images I did the testing, if not I'll check what impact it has.

Ask @koute maybe?

Is this querying approvals data or something about the chain?

Yes, it is asking the runtime for block information a few times.

Relay VRF preoutputs probably, maybe something else. We copy & keep the relay VRF output outside the chain once we start approvals work, no?

An sr25519 signing operation should take only 15 microseconds, just like ed25519, so maybe the keystore has problems here?

What problems are you thinking about ?

I've no idea how the keystore works internally, but unrealted..

I noticed one reason signing might be slow or unpredictable: We do a syscall every time we sign anything in https://github.com/w3f/schnorrkel/blob/master/src/lib.rs#L228 which gives linux an excuse to bump the thread. I'd previously used ThreadRng but removed it to simplify maintenance in w3f/schnorrkel@c333370 I'll revert that commit, fix rand, and make some other changes.

They seem to be evenly spread, e.g 1 second snapshot: 1717 assignments adding up to 640ms of processing time and 2040 approvals adding up to 360ms.

So 3x and 4x longer than my guestimates for the bare crypto, respectively. We do need better benchmarks on schnorrkel of course, like the rand issue for example.

We already have something placing multiple assignments into the same message? Are the assignments by the same signer or possibly different signers?

Are you referring to this paritytech/polkadot#6782, if yes. The answer is no, I wasn't running that PR when I did the measurements.

No. We discussed batching together other approvals messages too, but maybe no issue exists yet. If we're already signing 6 around the same time then maybe this saves lots.

@koute
Copy link
Contributor

koute commented Jun 12, 2023

Will look to see if this is used on the images I did the testing, if not I'll check what impact it has.

It's not. This still requires passing a special flag to the compiler when compiling polkadot so that the SIMD instructions actually get used.

But I have another PR to curve25519-dalek which makes it work automatically, and it was recently merged in. As soon as there's a new release I'll be integrating that in schnorrkel.

The expected speedup thanks to this is somewhere between ~30%-50%, depending on the hardware.

@alexggh
Copy link
Contributor

alexggh commented Jun 13, 2023

No. We discussed batching together other approvals messages too, but maybe no issue exists yet. If we're already signing 6 around the same time then maybe this saves lots.

It doesn't exists yet, I think it is this one: #701

Relay VRF preoutputs probably, maybe something else. We copy & keep the relay VRF output outside the chain once we start approvals work, no?

Yes, we do that.

used ThreadRng but removed it to simplify maintenance in w3f/schnorrkel@c333370 I'll revert that commit, fix rand, and make some other changes.

Thank you, I'll try to see if I can obtain some numbers with that reverted.

But I have another PR to curve25519-dalek which makes it work automatically, and it was recently merged in. As soon as there's a new release I'll be integrating that in schnorrkel.

@koute Could you point me to the PR, so I can bring it locally and get out some numbers.

@koute
Copy link
Contributor

koute commented Jun 13, 2023

But I have another PR to curve25519-dalek which makes it work automatically, and it was recently merged in. As soon as there's a new release I'll be integrating that in schnorrkel.

@koute Could you point me to the PR, so I can bring it locally and get out some numbers.

It's merged to curve25519-dalek's main, so if you pull that in it'll automatically work as long as the simd cargo feature is enabled. (Note: there are two backends, and the faster avx512 one requires a nightly compiler; when compiled on a non-nightly compiler it will be silently disabled. The crate will automatically pick the fastest available one at runtime.)

That said, actually integrating it is a little bit more complex because the new version of curve25519-dalek has breaking changes and it needs to be replaced in a few crates (schnorrkel, ed25519-zebra, sp-io), so it's not just a simple job of bumping the deps. I don't know whether you actually want to invest your time into this right now. (Hopefully there will be a new official release of curve25519-dalek soon-ish and then I'll be making the relevant PRs myself to pull it in.)

If you're interested here are some performance numbers from curve25519-dalek's own benchmarks for its two available SIMD backends. I took these on a GCP VM.

Scalar vs AVX2
testcase scalar avx2 %
multiscalar benches/Constant-time variable-base multiscalar multiplication/384 5896.6 3886.29 -34.09%
multiscalar benches/Constant-time variable-base multiscalar multiplication/768 11792.0 7778.8 -34.03%
multiscalar benches/Constant-time variable-base multiscalar multiplication/512 7843.0 5194.8 -33.77%
multiscalar benches/Constant-time variable-base multiscalar multiplication/16 271.05 179.74 -33.69%
multiscalar benches/Constant-time variable-base multiscalar multiplication/1024 15724.0 10432.0 -33.66%
multiscalar benches/Constant-time variable-base multiscalar multiplication/64 1005.10 669.14 -33.43%
multiscalar benches/Constant-time variable-base multiscalar multiplication/128 1973.4 1315.19 -33.35%
multiscalar benches/Constant-time variable-base multiscalar multiplication/256 3902.8 2601.79 -33.34%
multiscalar benches/Constant-time variable-base multiscalar multiplication/32 512.99 342.09 -33.31%
multiscalar benches/Constant-time variable-base multiscalar multiplication/8 151.03 102.31 -32.26%
multiscalar benches/Constant-time variable-base multiscalar multiplication/4 91.36 63.62 -30.36%
multiscalar benches/Constant-time variable-base multiscalar multiplication/2 61.51 44.24 -28.07%
multiscalar benches/Variable-time variable-base multiscalar multiplication/1024 6677.2 4906.0 -26.53%
multiscalar benches/Variable-time variable-base multiscalar multiplication/512 3775.3 2794.1 -25.99%
multiscalar benches/Variable-time variable-base multiscalar multiplication/384 3034.5 2246.9 -25.95%
multiscalar benches/Constant-time variable-base multiscalar multiplication/1 46.62 34.53 -25.92%
multiscalar benches/Variable-time variable-base multiscalar multiplication/256 2184.5 1621.3 -25.78%
multiscalar benches/Variable-time variable-base multiscalar multiplication/768 5258.20 3903.9 -25.76%
multiscalar benches/Variable-time mixed-base/(size: 8) (50pct dyn) 99.18 77.43 -21.93%
multiscalar benches/Variable-time variable-base multiscalar multiplication/64 616.72 481.67 -21.90%
multiscalar benches/Variable-time variable-base multiscalar multiplication/128 1201.6 938.67 -21.88%
multiscalar benches/Variable-time variable-base multiscalar multiplication/16 177.74 139.12 -21.73%
multiscalar benches/Variable-time variable-base multiscalar multiplication/32 324.31 253.95 -21.70%
multiscalar benches/Variable-time mixed-base/(size: 4) (50pct dyn) 65.02 50.92 -21.68%
multiscalar benches/Variable-time mixed-base/(size: 16) (50pct dyn) 166.7 130.68 -21.61%
multiscalar benches/Variable-time variable-base multiscalar multiplication/8 104.01 81.73 -21.42%
multiscalar benches/Variable-time mixed-base/(size: 256) (50pct dyn) 2172.5 1712.1 -21.19%
multiscalar benches/Variable-time mixed-base/(size: 64) (50pct dyn) 566.45 446.53 -21.17%
multiscalar benches/Variable-time mixed-base/(size: 32) (50pct dyn) 301.09 237.38 -21.16%
multiscalar benches/Variable-time mixed-base/(size: 384) (50pct dyn) 3249.5 2562.60 -21.14%
multiscalar benches/Variable-time mixed-base/(size: 512) (50pct dyn) 4320.2 3409.60 -21.08%
multiscalar benches/Variable-time mixed-base/(size: 128) (50pct dyn) 1097.6 866.8 -21.03%
multiscalar benches/Variable-time variable-base multiscalar multiplication/4 67.03 52.98 -20.96%
multiscalar benches/Variable-time mixed-base/(size: 2) (50pct dyn) 47.33 37.52 -20.73%
multiscalar benches/Variable-time variable-base multiscalar multiplication/2 48.53 38.59 -20.49%
multiscalar benches/Variable-time mixed-base/(size: 8) (20pct dyn) 92.14 73.28 -20.47%
edwards benches/Constant-time variable-base scalar mul 44.45 35.43 -20.27%
multiscalar benches/Variable-time mixed-base/(size: 16) (20pct dyn) 155.31 124.24 -20.01%
multiscalar benches/Variable-time mixed-base/(size: 4) (20pct dyn) 59.81 47.89 -19.92%
multiscalar benches/Variable-time mixed-base/(size: 4) (0pct dyn) 59.79 47.89 -19.89%
multiscalar benches/Variable-time mixed-base/(size: 8) (0pct dyn) 89.27 71.62 -19.77%
multiscalar benches/Variable-time mixed-base/(size: 2) (20pct dyn) 44.73 35.99 -19.54%
multiscalar benches/Variable-time mixed-base/(size: 2) (0pct dyn) 44.72 35.98 -19.52%
multiscalar benches/Variable-time variable-base multiscalar multiplication/1 39.15 31.54 -19.44%
multiscalar benches/Variable-time mixed-base/(size: 512) (20pct dyn) 3986.4 3218.6 -19.26%
multiscalar benches/Variable-time mixed-base/(size: 256) (20pct dyn) 2007.3 1620.9 -19.25%
multiscalar benches/Variable-time mixed-base/(size: 384) (20pct dyn) 2995.6 2419.79 -19.22%
multiscalar benches/Variable-time mixed-base/(size: 64) (20pct dyn) 521.72 421.92 -19.13%
multiscalar benches/Variable-time mixed-base/(size: 128) (20pct dyn) 1015.09 821.04 -19.12%
multiscalar benches/Variable-time mixed-base/(size: 32) (20pct dyn) 277.22 224.61 -18.98%
multiscalar benches/Variable-time mixed-base/(size: 768) (50pct dyn) 6329.90 5141.4 -18.78%
multiscalar benches/Variable-time mixed-base/(size: 16) (0pct dyn) 146.78 119.67 -18.47%
multiscalar benches/Variable-time mixed-base/(size: 1) (20pct dyn) 37.08 30.24 -18.45%
multiscalar benches/Variable-time mixed-base/(size: 1) (50pct dyn) 37.08 30.24 -18.44%
multiscalar benches/Variable-time mixed-base/(size: 1) (0pct dyn) 37.07 30.24 -18.42%
multiscalar benches/Variable-time mixed-base/(size: 1024) (50pct dyn) 8438.0 6894.3 -18.29%
multiscalar benches/Variable-time fixed-base multiscalar multiplication/2 43.61 35.93 -17.60%
multiscalar benches/Variable-time mixed-base/(size: 384) (0pct dyn) 2808.39 2320.3 -17.38%
multiscalar benches/Variable-time mixed-base/(size: 256) (0pct dyn) 1879.1 1553.39 -17.33%
multiscalar benches/Variable-time mixed-base/(size: 512) (0pct dyn) 3736.60 3090.6 -17.29%
multiscalar benches/Variable-time mixed-base/(size: 32) (0pct dyn) 260.52 215.51 -17.28%
multiscalar benches/Variable-time fixed-base multiscalar multiplication/4 57.79 47.82 -17.25%
multiscalar benches/Variable-time mixed-base/(size: 128) (0pct dyn) 949.81 786.7 -17.17%
multiscalar benches/Variable-time fixed-base multiscalar multiplication/1 36.41 30.16 -17.16%
multiscalar benches/Variable-time mixed-base/(size: 64) (0pct dyn) 488.12 404.41 -17.15%
multiscalar benches/Variable-time mixed-base/(size: 768) (20pct dyn) 5834.1 4864.0 -16.63%
multiscalar benches/Variable-time fixed-base multiscalar multiplication/8 85.91 71.62 -16.63%
multiscalar benches/Variable-time fixed-base multiscalar multiplication/16 141.72 119.42 -15.74%
multiscalar benches/Variable-time mixed-base/(size: 1024) (20pct dyn) 7773.79 6565.1 -15.55%
edwards benches/Variable-time aA+bB A variable B fixed 40.69 34.62 -14.91%
multiscalar benches/Variable-time fixed-base multiscalar multiplication/32 252.32 215.53 -14.58%
multiscalar benches/Variable-time fixed-base multiscalar multiplication/64 473.06 405.01 -14.39%
multiscalar benches/Variable-time fixed-base multiscalar multiplication/256 1819.19 1558.6 -14.32%
multiscalar benches/Variable-time fixed-base multiscalar multiplication/384 2715.5 2328.6 -14.25%
multiscalar benches/Variable-time fixed-base multiscalar multiplication/512 3614.5 3099.8 -14.24%
multiscalar benches/Variable-time fixed-base multiscalar multiplication/128 918.51 788.72 -14.13%
multiscalar benches/Variable-time mixed-base/(size: 768) (0pct dyn) 5413.2 4667.6 -13.77%
multiscalar benches/Variable-time fixed-base multiscalar multiplication/768 5424.29 4683.5 -13.66%
multiscalar benches/Variable-time fixed-base multiscalar multiplication/1024 7260.2 6339.5 -12.68%
multiscalar benches/Variable-time mixed-base/(size: 1024) (0pct dyn) 7231.8 6320.5 -12.60%
edwards benches/Constant-time fixed-base scalar mul 11.56 12.72 9.99%
montgomery benches/Montgomery pseudomultiplication 50.93 55.64 9.25%
montgomery benches/Constant-time fixed-base scalar mul 15.57 16.56 6.36%
scalar benches/Scalar addition 23511000.0 22964000.0 -2.33%
scalar benches/Batch scalar inversion/1 13.10 13.21 0.89%
scalar benches/Scalar inversion 12.81 12.92 0.88%
scalar benches/Batch scalar inversion/4 13.65 13.76 0.85%
scalar benches/Batch scalar inversion/8 14.36 14.48 0.81%
scalar benches/Batch scalar inversion/16 15.80 15.92 0.75%
scalar benches/Batch scalar inversion/2 13.28 13.38 0.74%
ristretto benches/Batch Ristretto double-and-encode/16 13.30 13.21 -0.68%
ristretto benches/Batch Ristretto double-and-encode/8 8.70 8.66 -0.41%
scalar benches/Scalar subtraction 21408000.0 21485000.0 0.36%
scalar benches/Scalar multiplication 91928000.0 91662000.0 -0.29%
ristretto benches/Batch Ristretto double-and-encode/4 6.35 6.34 -0.23%
edwards benches/EdwardsPoint compression 3.95 3.95 0.12%
ristretto benches/RistrettoPoint decompression 4.61 4.61 0.07%
ristretto benches/Batch Ristretto double-and-encode/1 4.58 4.58 -0.06%
edwards benches/EdwardsPoint decompression 4.31 4.32 0.04%
ristretto benches/Batch Ristretto double-and-encode/2 5.18 5.18 -0.01%
ristretto benches/RistrettoPoint compression 4.57 4.57 0.00%
Scalar vs AVX512
testcase scalar avx512 %
multiscalar benches/Constant-time variable-base multiscalar multiplication/768 11792.0 5123.8 -56.55%
multiscalar benches/Constant-time variable-base multiscalar multiplication/384 5896.6 2567.0 -56.47%
multiscalar benches/Constant-time variable-base multiscalar multiplication/512 7843.0 3427.9 -56.29%
multiscalar benches/Constant-time variable-base multiscalar multiplication/1024 15724.0 6908.5 -56.06%
multiscalar benches/Constant-time variable-base multiscalar multiplication/256 3902.8 1718.6 -55.96%
multiscalar benches/Constant-time variable-base multiscalar multiplication/128 1973.4 872.49 -55.79%
multiscalar benches/Constant-time variable-base multiscalar multiplication/64 1005.10 449.67 -55.26%
multiscalar benches/Constant-time variable-base multiscalar multiplication/32 512.99 236.33 -53.93%
multiscalar benches/Constant-time variable-base multiscalar multiplication/16 271.05 128.61 -52.55%
multiscalar benches/Variable-time variable-base multiscalar multiplication/768 5258.20 2529.79 -51.89%
multiscalar benches/Variable-time variable-base multiscalar multiplication/384 3034.5 1475.7 -51.37%
multiscalar benches/Variable-time variable-base multiscalar multiplication/1024 6677.2 3251.79 -51.30%
multiscalar benches/Variable-time variable-base multiscalar multiplication/512 3775.3 1854.60 -50.88%
multiscalar benches/Constant-time variable-base multiscalar multiplication/8 151.03 75.82 -49.79%
multiscalar benches/Variable-time variable-base multiscalar multiplication/256 2184.5 1104.0 -49.46%
multiscalar benches/Constant-time variable-base multiscalar multiplication/4 91.36 49.58 -45.72%
multiscalar benches/Constant-time variable-base multiscalar multiplication/2 61.51 35.94 -41.56%
multiscalar benches/Constant-time variable-base multiscalar multiplication/1 46.62 29.46 -36.80%
edwards benches/Constant-time variable-base scalar mul 44.45 29.97 -32.56%
multiscalar benches/Variable-time mixed-base/(size: 512) (50pct dyn) 4320.2 2954.0 -31.62%
multiscalar benches/Variable-time mixed-base/(size: 384) (50pct dyn) 3249.5 2223.6 -31.57%
multiscalar benches/Variable-time mixed-base/(size: 256) (50pct dyn) 2172.5 1490.89 -31.37%
multiscalar benches/Variable-time mixed-base/(size: 128) (50pct dyn) 1097.6 763.06 -30.48%
multiscalar benches/Variable-time variable-base multiscalar multiplication/128 1201.6 835.7 -30.45%
multiscalar benches/Variable-time mixed-base/(size: 512) (20pct dyn) 3986.4 2785.29 -30.13%
multiscalar benches/Variable-time mixed-base/(size: 384) (20pct dyn) 2995.6 2095.1 -30.06%
multiscalar benches/Variable-time mixed-base/(size: 64) (50pct dyn) 566.45 396.4 -30.02%
multiscalar benches/Variable-time variable-base multiscalar multiplication/64 616.72 431.84 -29.98%
multiscalar benches/Variable-time mixed-base/(size: 768) (50pct dyn) 6329.90 4435.2 -29.93%
multiscalar benches/Variable-time mixed-base/(size: 256) (20pct dyn) 2007.3 1406.9 -29.91%
multiscalar benches/Variable-time mixed-base/(size: 1024) (50pct dyn) 8438.0 5921.59 -29.82%
multiscalar benches/Variable-time mixed-base/(size: 32) (50pct dyn) 301.09 212.54 -29.41%
multiscalar benches/Variable-time mixed-base/(size: 128) (20pct dyn) 1015.09 719.19 -29.15%
multiscalar benches/Variable-time variable-base multiscalar multiplication/32 324.31 230.12 -29.04%
multiscalar benches/Variable-time mixed-base/(size: 512) (0pct dyn) 3736.60 2666.6 -28.64%
multiscalar benches/Variable-time mixed-base/(size: 384) (0pct dyn) 2808.39 2007.1 -28.53%
multiscalar benches/Variable-time mixed-base/(size: 16) (50pct dyn) 166.7 119.31 -28.43%
multiscalar benches/Variable-time mixed-base/(size: 768) (20pct dyn) 5834.1 4180.5 -28.34%
multiscalar benches/Variable-time mixed-base/(size: 256) (0pct dyn) 1879.1 1347.39 -28.30%
multiscalar benches/Variable-time mixed-base/(size: 64) (20pct dyn) 521.72 374.48 -28.22%
multiscalar benches/Variable-time mixed-base/(size: 1024) (20pct dyn) 7773.79 5599.5 -27.97%
multiscalar benches/Variable-time variable-base multiscalar multiplication/16 177.74 129.05 -27.39%
multiscalar benches/Variable-time mixed-base/(size: 128) (0pct dyn) 949.81 689.65 -27.39%
multiscalar benches/Variable-time mixed-base/(size: 32) (20pct dyn) 277.22 201.88 -27.18%
multiscalar benches/Variable-time mixed-base/(size: 8) (50pct dyn) 99.18 72.67 -26.73%
multiscalar benches/Variable-time mixed-base/(size: 768) (0pct dyn) 5413.2 3998.3 -26.14%
multiscalar benches/Variable-time mixed-base/(size: 16) (20pct dyn) 155.31 114.76 -26.11%
multiscalar benches/Variable-time mixed-base/(size: 64) (0pct dyn) 488.12 360.78 -26.09%
multiscalar benches/Variable-time mixed-base/(size: 1024) (0pct dyn) 7231.8 5371.8 -25.72%
multiscalar benches/Variable-time fixed-base multiscalar multiplication/1024 7260.2 5398.5 -25.64%
multiscalar benches/Variable-time variable-base multiscalar multiplication/8 104.01 77.50 -25.48%
multiscalar benches/Variable-time mixed-base/(size: 32) (0pct dyn) 260.52 194.85 -25.21%
multiscalar benches/Variable-time fixed-base multiscalar multiplication/768 5424.29 4059.99 -25.15%
multiscalar benches/Variable-time fixed-base multiscalar multiplication/512 3614.5 2708.5 -25.07%
multiscalar benches/Variable-time fixed-base multiscalar multiplication/384 2715.5 2039.8 -24.88%
multiscalar benches/Variable-time fixed-base multiscalar multiplication/256 1819.19 1368.4 -24.78%
multiscalar benches/Variable-time mixed-base/(size: 4) (50pct dyn) 65.02 48.93 -24.75%
multiscalar benches/Variable-time mixed-base/(size: 8) (20pct dyn) 92.14 69.44 -24.64%
multiscalar benches/Variable-time mixed-base/(size: 16) (0pct dyn) 146.78 111.45 -24.07%
multiscalar benches/Variable-time fixed-base multiscalar multiplication/128 918.51 699.73 -23.82%
multiscalar benches/Variable-time variable-base multiscalar multiplication/4 67.03 51.08 -23.79%
multiscalar benches/Variable-time mixed-base/(size: 8) (0pct dyn) 89.27 68.24 -23.56%
multiscalar benches/Variable-time fixed-base multiscalar multiplication/64 473.06 365.51 -22.73%
multiscalar benches/Variable-time mixed-base/(size: 4) (20pct dyn) 59.81 46.43 -22.36%
multiscalar benches/Variable-time mixed-base/(size: 4) (0pct dyn) 59.79 46.43 -22.34%
multiscalar benches/Variable-time variable-base multiscalar multiplication/2 48.53 37.73 -22.25%
multiscalar benches/Variable-time fixed-base multiscalar multiplication/32 252.32 196.43 -22.15%
multiscalar benches/Variable-time mixed-base/(size: 2) (50pct dyn) 47.33 36.85 -22.14%
multiscalar benches/Variable-time variable-base multiscalar multiplication/1 39.15 30.78 -21.38%
multiscalar benches/Variable-time fixed-base multiscalar multiplication/16 141.72 111.78 -21.13%
scalar benches/Scalar inversion 12.81 16.17 26.21%
multiscalar benches/Variable-time fixed-base multiscalar multiplication/8 85.91 68.35 -20.44%
multiscalar benches/Variable-time mixed-base/(size: 2) (20pct dyn) 44.73 35.68 -20.23%
multiscalar benches/Variable-time mixed-base/(size: 2) (0pct dyn) 44.72 35.68 -20.20%
multiscalar benches/Variable-time fixed-base multiscalar multiplication/4 57.79 46.42 -19.68%
multiscalar benches/Variable-time mixed-base/(size: 1) (0pct dyn) 37.07 29.93 -19.27%
multiscalar benches/Variable-time mixed-base/(size: 1) (20pct dyn) 37.08 29.94 -19.27%
multiscalar benches/Variable-time mixed-base/(size: 1) (50pct dyn) 37.08 29.95 -19.23%
multiscalar benches/Variable-time fixed-base multiscalar multiplication/2 43.61 35.52 -18.54%
multiscalar benches/Variable-time fixed-base multiscalar multiplication/1 36.41 29.77 -18.25%
edwards benches/Variable-time aA+bB A variable B fixed 40.69 33.53 -17.60%
montgomery benches/Constant-time fixed-base scalar mul 15.57 17.22 10.62%
scalar benches/Scalar multiplication 91928000.0 99047000.0 7.74%
edwards benches/Constant-time fixed-base scalar mul 11.56 12.40 7.29%
ristretto benches/Batch Ristretto double-and-encode/16 13.30 14.25 7.14%
montgomery benches/Montgomery pseudomultiplication 50.93 54.34 6.68%
scalar benches/Scalar addition 23511000.0 22191000.0 -5.61%
ristretto benches/Batch Ristretto double-and-encode/8 8.70 9.21 5.90%
ristretto benches/Batch Ristretto double-and-encode/4 6.35 6.60 3.86%
scalar benches/Batch scalar inversion/16 15.80 16.32 3.29%
ristretto benches/RistrettoPoint compression 4.57 4.69 2.67%
scalar benches/Batch scalar inversion/8 14.36 14.72 2.50%
ristretto benches/Batch Ristretto double-and-encode/2 5.18 5.31 2.47%
ristretto benches/RistrettoPoint decompression 4.61 4.71 2.23%
scalar benches/Batch scalar inversion/4 13.65 13.95 2.18%
edwards benches/EdwardsPoint decompression 4.31 4.39 1.72%
scalar benches/Batch scalar inversion/1 13.10 13.31 1.61%
scalar benches/Batch scalar inversion/2 13.28 13.47 1.45%
ristretto benches/Batch Ristretto double-and-encode/1 4.58 4.64 1.35%
scalar benches/Scalar subtraction 21408000.0 21139000.0 -1.26%
edwards benches/EdwardsPoint compression 3.95 3.95 0.05%
AVX2 vs AVX512
testcase avx2 avx512 %
multiscalar benches/Variable-time variable-base multiscalar multiplication/768 3903.9 2529.79 -35.20%
multiscalar benches/Variable-time variable-base multiscalar multiplication/384 2246.9 1475.7 -34.32%
multiscalar benches/Constant-time variable-base multiscalar multiplication/768 7778.8 5123.8 -34.13%
multiscalar benches/Constant-time variable-base multiscalar multiplication/512 5194.8 3427.9 -34.01%
multiscalar benches/Constant-time variable-base multiscalar multiplication/384 3886.29 2567.0 -33.95%
multiscalar benches/Constant-time variable-base multiscalar multiplication/256 2601.79 1718.6 -33.95%
multiscalar benches/Constant-time variable-base multiscalar multiplication/1024 10432.0 6908.5 -33.78%
multiscalar benches/Variable-time variable-base multiscalar multiplication/1024 4906.0 3251.79 -33.72%
multiscalar benches/Constant-time variable-base multiscalar multiplication/128 1315.19 872.49 -33.66%
multiscalar benches/Variable-time variable-base multiscalar multiplication/512 2794.1 1854.60 -33.62%
multiscalar benches/Constant-time variable-base multiscalar multiplication/64 669.14 449.67 -32.80%
multiscalar benches/Variable-time variable-base multiscalar multiplication/256 1621.3 1104.0 -31.91%
multiscalar benches/Constant-time variable-base multiscalar multiplication/32 342.09 236.33 -30.92%
multiscalar benches/Constant-time variable-base multiscalar multiplication/16 179.74 128.61 -28.45%
multiscalar benches/Constant-time variable-base multiscalar multiplication/8 102.31 75.82 -25.88%
multiscalar benches/Constant-time variable-base multiscalar multiplication/4 63.62 49.58 -22.06%
scalar benches/Scalar inversion 12.92 16.17 25.11%
multiscalar benches/Constant-time variable-base multiscalar multiplication/2 44.24 35.94 -18.76%
edwards benches/Constant-time variable-base scalar mul 35.43 29.97 -15.41%
multiscalar benches/Variable-time mixed-base/(size: 1024) (0pct dyn) 6320.5 5371.8 -15.01%
multiscalar benches/Variable-time fixed-base multiscalar multiplication/1024 6339.5 5398.5 -14.84%
multiscalar benches/Variable-time mixed-base/(size: 1024) (20pct dyn) 6565.1 5599.5 -14.71%
multiscalar benches/Constant-time variable-base multiscalar multiplication/1 34.53 29.46 -14.68%
multiscalar benches/Variable-time mixed-base/(size: 768) (0pct dyn) 4667.6 3998.3 -14.34%
multiscalar benches/Variable-time mixed-base/(size: 1024) (50pct dyn) 6894.3 5921.59 -14.11%
multiscalar benches/Variable-time mixed-base/(size: 768) (20pct dyn) 4864.0 4180.5 -14.05%
multiscalar benches/Variable-time mixed-base/(size: 768) (50pct dyn) 5141.4 4435.2 -13.74%
multiscalar benches/Variable-time mixed-base/(size: 512) (0pct dyn) 3090.6 2666.6 -13.72%
multiscalar benches/Variable-time mixed-base/(size: 384) (0pct dyn) 2320.3 2007.1 -13.50%
multiscalar benches/Variable-time mixed-base/(size: 512) (20pct dyn) 3218.6 2785.29 -13.46%
multiscalar benches/Variable-time mixed-base/(size: 384) (20pct dyn) 2419.79 2095.1 -13.42%
multiscalar benches/Variable-time mixed-base/(size: 512) (50pct dyn) 3409.60 2954.0 -13.36%
multiscalar benches/Variable-time fixed-base multiscalar multiplication/768 4683.5 4059.99 -13.31%
multiscalar benches/Variable-time mixed-base/(size: 256) (0pct dyn) 1553.39 1347.39 -13.26%
multiscalar benches/Variable-time mixed-base/(size: 384) (50pct dyn) 2562.60 2223.6 -13.23%
multiscalar benches/Variable-time mixed-base/(size: 256) (20pct dyn) 1620.9 1406.9 -13.20%
multiscalar benches/Variable-time mixed-base/(size: 256) (50pct dyn) 1712.1 1490.89 -12.92%
multiscalar benches/Variable-time fixed-base multiscalar multiplication/512 3099.8 2708.5 -12.62%
multiscalar benches/Variable-time mixed-base/(size: 128) (20pct dyn) 821.04 719.19 -12.40%
multiscalar benches/Variable-time fixed-base multiscalar multiplication/384 2328.6 2039.8 -12.40%
multiscalar benches/Variable-time mixed-base/(size: 128) (0pct dyn) 786.7 689.65 -12.34%
multiscalar benches/Variable-time fixed-base multiscalar multiplication/256 1558.6 1368.4 -12.20%
multiscalar benches/Variable-time mixed-base/(size: 128) (50pct dyn) 866.8 763.06 -11.97%
multiscalar benches/Variable-time fixed-base multiscalar multiplication/128 788.72 699.73 -11.28%
multiscalar benches/Variable-time mixed-base/(size: 64) (20pct dyn) 421.92 374.48 -11.24%
multiscalar benches/Variable-time mixed-base/(size: 64) (50pct dyn) 446.53 396.4 -11.23%
multiscalar benches/Variable-time variable-base multiscalar multiplication/128 938.67 835.7 -10.97%
multiscalar benches/Variable-time mixed-base/(size: 64) (0pct dyn) 404.41 360.78 -10.79%
multiscalar benches/Variable-time mixed-base/(size: 32) (50pct dyn) 237.38 212.54 -10.46%
multiscalar benches/Variable-time variable-base multiscalar multiplication/64 481.67 431.84 -10.35%
multiscalar benches/Variable-time mixed-base/(size: 32) (20pct dyn) 224.61 201.88 -10.12%
multiscalar benches/Variable-time fixed-base multiscalar multiplication/64 405.01 365.51 -9.75%
multiscalar benches/Variable-time mixed-base/(size: 32) (0pct dyn) 215.51 194.85 -9.59%
multiscalar benches/Variable-time variable-base multiscalar multiplication/32 253.95 230.12 -9.38%
multiscalar benches/Variable-time fixed-base multiscalar multiplication/32 215.53 196.43 -8.86%
multiscalar benches/Variable-time mixed-base/(size: 16) (50pct dyn) 130.68 119.31 -8.70%
multiscalar benches/Variable-time mixed-base/(size: 16) (20pct dyn) 124.24 114.76 -7.63%
scalar benches/Scalar multiplication 91662000.0 99047000.0 8.06%
ristretto benches/Batch Ristretto double-and-encode/16 13.21 14.25 7.88%
multiscalar benches/Variable-time variable-base multiscalar multiplication/16 139.12 129.05 -7.24%
multiscalar benches/Variable-time mixed-base/(size: 16) (0pct dyn) 119.67 111.45 -6.87%
multiscalar benches/Variable-time fixed-base multiscalar multiplication/16 119.42 111.78 -6.40%
multiscalar benches/Variable-time mixed-base/(size: 8) (50pct dyn) 77.43 72.67 -6.15%
ristretto benches/Batch Ristretto double-and-encode/8 8.66 9.21 6.33%
multiscalar benches/Variable-time mixed-base/(size: 8) (20pct dyn) 73.28 69.44 -5.24%
multiscalar benches/Variable-time variable-base multiscalar multiplication/8 81.73 77.50 -5.17%
multiscalar benches/Variable-time mixed-base/(size: 8) (0pct dyn) 71.62 68.24 -4.72%
multiscalar benches/Variable-time fixed-base multiscalar multiplication/8 71.62 68.35 -4.58%
ristretto benches/Batch Ristretto double-and-encode/4 6.34 6.60 4.10%
multiscalar benches/Variable-time mixed-base/(size: 4) (50pct dyn) 50.92 48.93 -3.92%
montgomery benches/Constant-time fixed-base scalar mul 16.56 17.22 4.01%
multiscalar benches/Variable-time variable-base multiscalar multiplication/4 52.98 51.08 -3.58%
scalar benches/Scalar addition 22964000.0 22191000.0 -3.37%
edwards benches/Variable-time aA+bB A variable B fixed 34.62 33.53 -3.16%
multiscalar benches/Variable-time mixed-base/(size: 4) (0pct dyn) 47.89 46.43 -3.06%
multiscalar benches/Variable-time mixed-base/(size: 4) (20pct dyn) 47.89 46.43 -3.05%
multiscalar benches/Variable-time fixed-base multiscalar multiplication/4 47.82 46.42 -2.93%
ristretto benches/RistrettoPoint compression 4.57 4.69 2.67%
scalar benches/Batch scalar inversion/16 15.92 16.32 2.52%
edwards benches/Constant-time fixed-base scalar mul 12.72 12.40 -2.45%
ristretto benches/Batch Ristretto double-and-encode/2 5.18 5.31 2.48%
multiscalar benches/Variable-time variable-base multiscalar multiplication/1 31.54 30.78 -2.40%
montgomery benches/Montgomery pseudomultiplication 55.64 54.34 -2.35%
multiscalar benches/Variable-time variable-base multiscalar multiplication/2 38.59 37.73 -2.22%
ristretto benches/RistrettoPoint decompression 4.61 4.71 2.16%
multiscalar benches/Variable-time mixed-base/(size: 2) (50pct dyn) 37.52 36.85 -1.78%
edwards benches/EdwardsPoint decompression 4.32 4.39 1.69%
scalar benches/Batch scalar inversion/8 14.48 14.72 1.67%
scalar benches/Scalar subtraction 21485000.0 21139000.0 -1.61%
ristretto benches/Batch Ristretto double-and-encode/1 4.58 4.64 1.41%
multiscalar benches/Variable-time fixed-base multiscalar multiplication/1 30.16 29.77 -1.32%
scalar benches/Batch scalar inversion/4 13.76 13.95 1.32%
multiscalar benches/Variable-time fixed-base multiscalar multiplication/2 35.93 35.52 -1.14%
multiscalar benches/Variable-time mixed-base/(size: 1) (0pct dyn) 30.24 29.93 -1.04%
multiscalar benches/Variable-time mixed-base/(size: 1) (20pct dyn) 30.24 29.94 -1.00%
multiscalar benches/Variable-time mixed-base/(size: 1) (50pct dyn) 30.24 29.95 -0.96%
multiscalar benches/Variable-time mixed-base/(size: 2) (20pct dyn) 35.99 35.68 -0.86%
multiscalar benches/Variable-time mixed-base/(size: 2) (0pct dyn) 35.98 35.68 -0.84%
scalar benches/Batch scalar inversion/1 13.21 13.31 0.72%
scalar benches/Batch scalar inversion/2 13.38 13.47 0.70%
edwards benches/EdwardsPoint compression 3.95 3.95 -0.07%

@burdges
Copy link

burdges commented Jun 13, 2023

I think the lower level crypto optimizations should be somewhat orthogonal. It's clear they'll happen over the next couple months regardless, so..

If you want to focus on our real protocol optimization that's fine too. Ask @eskimor what he thinks. We do need to know when the crypto optimizations merge of course, so that we do not mistakenly conclude some protocol optimization did stuff it did not do.

We can otoh quickly cut a schnorrkel version without the syscall if you want to see if that's causing threads to be bumped or whatever.

@alexggh
Copy link
Contributor

alexggh commented Jun 13, 2023

I think the lower level crypto optimizations should be somewhat orthogonal. It's clear they'll happen over the next couple months regardless, so..

Yes, I agree they are orthogonal. I just went this route because I wanted to understand where we spent most of the time. With that in mind, I think in the order of impact we should do the following:

  1. Implement approval-voting: coalesce multiple approvals in a single signed message #701 because it is clear that if we do less signature checks we should gain some performance.
  2. After we get 1) implemented measure it together with the crypto improvements we have in the pipeline and decide if parallelization of the work is actually what will give use the desired gaines.

Don't you agree @sandreim @eskimor @burdges ?

@sandreim
Copy link
Contributor Author

sandreim commented Jun 13, 2023

In principle it sounds good, but do we know how much time the signature check takes vs the internal book keeping that is done by approval voting ?

At a scale of 1k paravalidators and 200 parachains, under normal operation with 30 needed approvals and 6 vrf modulo samples, the question we are asking is basically - Can we check 1k approvals faster in a single thread than 6k on multiple threads (assuming we implement parallel processing) ?

@burdges
Copy link

burdges commented Jun 13, 2023

In principle it sounds good, but do we know how much time the signature check takes vs the internal book keeping that is done by approval voting ?

Yes this was always my question. :)

I suspect this syscall in schnorrkel is a major cost since it lets linux bump the thread.

Can we check 1k approvals faster in a single thread than 6k on multiple threads (assuming we implement parallel processing) ?

Not really, but we could use those other threads for checking more parachain blocks or whatever.

At a high level, we're not exactly in the shoes of a web2 company who once they parallelize then they can simply throw more hardware at the problem. We have the hardware we told validators they could buy, and upgrading that requires community work.

@sandreim
Copy link
Contributor Author

TBH, not sure that the syscall would result in the CPU being given to another thread, but we could actually test the 2 scenarios and see what the result is.

Totally agree with you, the thing is that with current architecture we cannot dynamically scale to process stuff on more than 1 CPU to handle bursty workloads. Assuming 1k para valdiators and 200 paras, I think we need to worry about the specs since that means at most 6-7 candidates to check (1 to back, 6 to approve) and 2x times that with async backing. We could add more validators there so the amount of checks per validator goes down, but that means we need to be able to handle more gossip traffic still.

@alexggh
Copy link
Contributor

alexggh commented Jun 13, 2023

In principle it sounds good, but do we know how much time the signature check takes vs the internal book keeping that is done by approval voting ?

At least in approval-voting subsystem from 1 second of useful work it spends around 700 millis doing just the crypto operations for either assignment or approval.

I'll get back when I have some number splits for approval-distribution as well.

@alexggh
Copy link
Contributor

alexggh commented Jun 14, 2023

Some measurements regarding where the time is spent in approval-distribution, in average from 1 second of work 750ms is spent waiting for the approval-voting to check the signatures. So, I would say our book-keeping is not where the main part of the work is happening. Example of 1s breakdown at peak :

  • 15K messages coming in from peers
  • 938 assignments checked in 379ms
  • 1373 approvals checked in 338 ms
  • the rest(~250ms) book-keeping and forwarding message to the network bridge.

One thing I noticed while performing this measurements is that we've got cases where we get multiple assignments & approvals from peers in the same message, here https://github.com/paritytech/polkadot/blob/master/node/network/approval-distribution/src/lib.rs#L568 and here https://github.com/paritytech/polkadot/blob/master/node/network/approval-distribution/src/lib.rs#L607 but because we process those assignments one by one we are going to fan-out those message into multiple messages sent to the same peer. In average we have 10-15 approvals in those messages, but I see some case where we even get 328 approvals in it all those might translate to a single message for each approval.

I think this inefficiency, is a low-hanging fruit, and should help us with at least reducing the network traffic/load.

@sandreim
Copy link
Contributor Author

Some measurements regarding where the time is spent in approval-distribution, in average from 1 second of work 750ms is spent waiting for the approval-voting to check the signatures. So, I would say our book-keeping is not where the main part of the work is happening. Example of 1s breakdown at peak :

  • 15K messages coming in from peers
  • 938 assignments checked in 379ms
  • 1373 approvals checked in 338 ms
  • the rest(~250ms) book-keeping and forwarding message to the network bridge.

Thanks for the numbers @alexggh, it appears that book keeping is not eating a lot, but we might be able to improve there as well.

One thing I noticed while performing this measurements is that we've got cases where we get multiple assignments & approvals from peers in the same message, here https://github.com/paritytech/polkadot/blob/master/node/network/approval-distribution/src/lib.rs#L568 and here https://github.com/paritytech/polkadot/blob/master/node/network/approval-distribution/src/lib.rs#L607 but because we process those assignments one by one we are going to fan-out those message into multiple messages sent to the same peer. In average we have 10-15 approvals in those messages, but I see some case where we even get 328 approvals in it all those might translate to a single message for each approval.

I think this inefficiency, is a low-hanging fruit, and should help us with at least reducing the network traffic/load.

Sounds like a good idea, it should clearly reduce the notification counts.

alexggh referenced this issue in alexggh/polkadot Jun 19, 2023
…er approval-voting

In, the current implementation every time we process an assignment or an
approval that needs checking after the approval voting, we will wait till
approval-voting answers the message.

Given that approval-voting will execute some signatures checks that take
significant time(between 400us and 1 millis) per message, that's where most of
the time in the approval-distribution, see
https://github.com/paritytech/polkadot/issues/6608#issuecomment-1590942235 for
the numbers.

So, modify approval-distribution, so that it picks another message from the
queue while the approval-voting is busy doing it's work.

This will have a few benefits:
1. Better pipelinening of the messages, approval-voting will always have work to
do and it won't have to wait for the approval-distribution to send it a message.
Additionally, some of the works of the approval-distribution will be executed in
parallel with work in approval-voting instead of serially.
2. By allowing approval-distribution to process messages from it's queue while
approval-voting confirms that a message is valid we give the
approval-distribution the ability to build a better view about what messages
other peers already know, so it won't decide to gossip messages to some of it's
peers once we confirm that message as being correct.
3. It opens the door for other optimizations in approval-voting subsystem, which
would still be the bottleneck.

Note!
I still expect the amount of work the combo of this two systems can do, to still
be bounded by the numbers of signatures checks it has to do, so we would have to
stack this with other optimizations we have in the queue.
- https://github.com/paritytech/polkadot/issues/6608
- https://github.com/paritytech/polkadot/issues/6831

[] Evaluate impact in versi
[] Cleanup code an make CI happy to make the PR meargeable.

Signed-off-by: Alexandru Gheorghe <[email protected]>
alexggh referenced this issue in alexggh/polkadot Jun 19, 2023
…voting

In, the current implementation every time we process an assignment or an
approval that needs checking in the approval voting, we will wait till
approval-voting answers the message.

Given that approval-voting will execute some signatures checks that take
significant time(between 400us and 1 millis) per message, that's where most of
the time in the approval-distribution, see
https://github.com/paritytech/polkadot/issues/6608#issuecomment-1590942235 for
the numbers.

So, modify approval-distribution, so that it picks another message from the
queue while the approval-voting is busy doing it's work.

This will have a few benefits:
1. Better pipelinening of the messages, approval-voting will always have work to
do and it won't have to wait for the approval-distribution to send it a message.
Additionally, some of the works of the approval-distribution will be executed in
parallel with work in approval-voting instead of serially.
2. By allowing approval-distribution to process messages from it's queue while
approval-voting confirms that a message is valid we give the
approval-distribution the ability to build a better view about what messages
other peers already know, so it won't decide to gossip messages to some of it's
peers once we confirm that message as being correct.
3. It opens the door for other optimizations in approval-voting subsystem, which
would still be the bottleneck.

Note!
I still expect the amount of work the combo of this two systems can do, to still
be bounded by the numbers of signatures checks it has to do, so we would have to
stack this with other optimizations we have in the queue.
- https://github.com/paritytech/polkadot/issues/6608
- https://github.com/paritytech/polkadot/issues/6831

[] Evaluate impact in versi
[] Cleanup code an make CI happy to make the PR meargeable.

Signed-off-by: Alexandru Gheorghe <[email protected]>
@alexggh
Copy link
Contributor

alexggh commented Jul 25, 2023

Putting this on-hold in favor of: #701

@Sophia-Gold Sophia-Gold transferred this issue from paritytech/polkadot Aug 24, 2023
claravanstaden pushed a commit to Snowfork/polkadot-sdk that referenced this issue Dec 8, 2023
* chore: format code

* format code

* add format hook

* add lint when commit
@alexggh
Copy link
Contributor

alexggh commented May 31, 2024

Will be addressed with #1617

@alexggh alexggh closed this as completed Aug 28, 2024
@github-project-automation github-project-automation bot moved this from Backlog to Completed in parachains team board Aug 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
I4-refactor Code needs refactoring.
Projects
Status: Completed
Development

No branches or pull requests

7 participants