tracking issue (brain dump): unified scheduler in block-production #3832

Open
ryoqun opened this issue Nov 28, 2024 · 2 comments

@ryoqun
Member

ryoqun commented Nov 28, 2024

Problem

my working style is messy.

Proposed Solution

create a sensible place for team collaboration.

and here it is.

Overall status

design: 99% done
impl: 95% done
clean up: 90% done (ci is green)
perf eval: 80% done
write tests for new code: 20% done
code review: 0% done

Code

the all-in-one messy PR: #2325

list of reviewed (and upcoming) PRs:

Proposition/justification

tldr: unbatched scheduler is a thing.

In its final state, simple tx throughput is roughly the same as the central scheduler's (see the early bench results below). I think its competitiveness will remain even after the central scheduler is improved, because it has been shown that the unbatched style of the unified scheduler can be optimized extensively, to the point where its inherent overhead compared to batching can well be offset by its advantages: low latency, maximum parallelism, and local-fee-market adherence.

All of these advantages contribute to profit maximization for leaders, which is the ultimate utility of any block-production method from the operator's standpoint. Low latency means higher-paying transactions arriving in the middle of a leader's slots can be included into blocks in a timely manner. Maximum parallelism means buffered non-conflicting higher-paying transactions can be cleared as fast as possible. Finally, local-fee-market adherence means denser blocks, simply by not idling worker threads. These advantages are reflected in the results of simulate-block-production, as charted in the Google Sheet. All that said, these claims still need to be proven on mainnet-beta...
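
To make the unbatched idea concrete, here's a minimal, hypothetical sketch of the dispatch policy (toy types and names, not the actual unified-scheduler code): every transaction is its own task and is handed to an idle worker the moment none of its writable accounts are locked, instead of first being grouped into a batch.

```rust
use std::collections::HashSet;

// Toy stand-ins (hypothetical; not agave's actual types).
type AccountKey = u64;

struct Tx {
    id: usize,
    writable: Vec<AccountKey>,
}

/// Unbatched dispatch: a tx is handed to a worker as soon as none of its
/// writable accounts are locked, rather than waiting for a batch to be
/// assembled and for the whole batch to finish.
fn schedule_unbatched(pending: &mut Vec<Tx>, locked: &mut HashSet<AccountKey>) -> Vec<Tx> {
    let mut runnable = Vec::new();
    let mut i = 0;
    while i < pending.len() {
        if pending[i].writable.iter().any(|k| locked.contains(k)) {
            i += 1; // stays buffered; retried when some locks are released
        } else {
            let tx = pending.remove(i);
            locked.extend(tx.writable.iter().copied()); // take this tx's write locks only
            runnable.push(tx); // would be sent to an idle worker immediately
        }
    }
    runnable
}

fn main() {
    // txs 1 and 2 both write account 7; tx 3 is independent.
    let mut pending = vec![
        Tx { id: 1, writable: vec![7] },
        Tx { id: 2, writable: vec![7] },
        Tx { id: 3, writable: vec![9] },
    ];
    let mut locked = HashSet::new();
    let dispatched = schedule_unbatched(&mut pending, &mut locked);
    // dispatched = [tx 1, tx 3]; tx 2 waits only until tx 1 releases account 7.
    println!("dispatched: {:?}", dispatched.iter().map(|t| t.id).collect::<Vec<_>>());
}
```

This is of course not how the real scheduler is structured; it just shows why, in the unbatched style, a conflicting tx delays only itself and nothing else.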

Finally, sharing the scheduling logic between block-verification and block-production opens up the possibility of wall-time based block metering instead of CUs, without introducing unacceptable variance in block-processing duration.
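
As a rough illustration of the wall-time metering idea (purely a sketch; the budget constant and types below are made up, not a proposed implementation), the block would be considered full when the accumulated measured execution time of packed transactions crosses a fixed wall-time budget, instead of when accumulated CUs hit the CU limit:

```rust
use std::time::{Duration, Instant};

// Hypothetical budget: how much cumulative execution wall time a block may contain.
const BLOCK_WALL_TIME_BUDGET: Duration = Duration::from_millis(400);

struct BlockMeter {
    consumed: Duration,
}

impl BlockMeter {
    fn new() -> Self {
        Self { consumed: Duration::ZERO }
    }

    /// Execute one transaction and charge its measured wall time against the
    /// block budget; returns false once the block should be considered full.
    fn execute_and_meter<F: FnOnce()>(&mut self, execute_tx: F) -> bool {
        let started = Instant::now();
        execute_tx(); // stand-in for actual transaction execution on a worker
        self.consumed += started.elapsed();
        self.consumed < BLOCK_WALL_TIME_BUDGET
    }
}

fn main() {
    let mut meter = BlockMeter::new();
    // Simulate transactions that each take ~5ms to "execute".
    while meter.execute_and_meter(|| std::thread::sleep(Duration::from_millis(5))) {
        // keep packing transactions into the block
    }
    println!("block full after {:?} of execution time", meter.consumed);
}
```

Presumably this only becomes viable with shared scheduling logic because replay then executes the block with the same parallelism the leader used, keeping the replayed wall time close to the metered one.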

Perf eval

All runs are on the same machine (AMD EPYC 7513, 32 cores). Some numbers are taken from not-yet-pushed commits.

note: there's an upcoming improvement for central-scheduler with TransactionView, so the numbers are tentative.

solana-banking-bench (4 non-vote threads)

--write-lock-contention none:

# unified scheduler
$ RUST_LOG=off target/x86_64-unknown-linux-gnu/release/solana-banking-bench --iterations 1000 --batches-per-iteration 10 --num-chunks 10 --packets-per-batch 10 --block-production-method unified-scheduler --num-banking-threads 6 --write-lock-contention none |& grep -A3 "total_sent"
[total_sent: 100000, base_tx_count: 3000, txs_processed: 102790, txs_landed: 99790, total_us: 5268421, tx_total_us: 5200363]
{'name': 'banking_bench_total', 'median': '18981.02'}
{'name': 'banking_bench_tx_total', 'median': '19229.43'}
{'name': 'banking_bench_success_tx_total', 'median': '18941.16'}

# central scheduler
$ RUST_LOG=off target/x86_64-unknown-linux-gnu/release/solana-banking-bench --iterations 1000 --batches-per-iteration 10 --num-chunks 10 --packets-per-batch 10 --block-production-method central-scheduler --num-banking-threads 6 --write-lock-contention none |& grep -A3 "total_sent" 
[total_sent: 100000, base_tx_count: 3000, txs_processed: 102880, txs_landed: 99880, total_us: 5663711, tx_total_us: 5596809]
{'name': 'banking_bench_total', 'median': '17656.27'}
{'name': 'banking_bench_tx_total', 'median': '17867.32'}
{'name': 'banking_bench_success_tx_total', 'median': '17635.08'}

--write-lock-contention none (large batch):

# unified scheduler
$ RUST_LOG=off target/x86_64-unknown-linux-gnu/release/solana-banking-bench --iterations 1000 --batches-per-iteration 100 --num-chunks 100 --packets-per-batch 100 --block-production-method unified-scheduler --num-banking-threads 6 --write-lock-contention none |& grep -A3 "total_sent"
[total_sent: 10000000, base_tx_count: 3000000, txs_processed: 12839100, txs_landed: 9839100, total_us: 125929705, tx_total_us: 115334744]
{'name': 'banking_bench_total', 'median': '79409.38'}
{'name': 'banking_bench_tx_total', 'median': '86704.14'}
{'name': 'banking_bench_success_tx_total', 'median': '78131.68'}

# central scheduler
$ RUST_LOG=off target/x86_64-unknown-linux-gnu/release/solana-banking-bench --iterations 1000 --batches-per-iteration 100 --num-chunks 100 --packets-per-batch 100 --block-production-method central-scheduler --num-banking-threads 6 --write-lock-contention none |& grep -A3 "total_sent" 
[total_sent: 10000000, base_tx_count: 3000000, txs_processed: 12935724, txs_landed: 9935724, total_us: 132445903, tx_total_us: 121626859]
{'name': 'banking_bench_total', 'median': '75502.52'}
{'name': 'banking_bench_tx_total', 'median': '82218.68'}
{'name': 'banking_bench_success_tx_total', 'median': '75017.22'}

--write-lock-contention full:

# unified scheduler
$ RUST_LOG=off target/x86_64-unknown-linux-gnu/release/solana-banking-bench --iterations 1000 --batches-per-iteration 10 --num-chunks 10 --packets-per-batch 10 --block-production-method unified-scheduler --num-banking-threads 6 --write-lock-contention full |& grep -A3 "total_sent"
[total_sent: 100000, base_tx_count: 2000, txs_processed: 102000, txs_landed: 100000, total_us: 5267873, tx_total_us: 5214671]
{'name': 'banking_bench_total', 'median': '18982.99'}
{'name': 'banking_bench_tx_total', 'median': '19176.67'}
{'name': 'banking_bench_success_tx_total', 'median': '18982.99'}

# central scheduler (note results had some large variance)
$ RUST_LOG=off target/x86_64-unknown-linux-gnu/release/solana-banking-bench --iterations 1000 --batches-per-iteration 10 --num-chunks 10 --packets-per-batch 10 --block-production-method central-scheduler --num-banking-threads 6 --write-lock-contention full |& grep -A3 "total_sent" 
[total_sent: 100000, base_tx_count: 2000, txs_processed: 101780, txs_landed: 99780, total_us: 5720672, tx_total_us: 5657796]
{'name': 'banking_bench_total', 'median': '17480.46'}
{'name': 'banking_bench_tx_total', 'median': '17674.73'}
{'name': 'banking_bench_success_tx_total', 'median': '17442.01'}

solana-banking-bench (16 non-vote threads)

--write-lock-contention none:

# unified scheduler
$ RUST_LOG=off target/x86_64-unknown-linux-gnu/release/solana-banking-bench --iterations 1000 --batches-per-iteration 10 --num-chunks 10 --packets-per-batch 10 --block-production-method unified-scheduler --num-banking-threads 18 --write-lock-contention none |& grep -A3 "total_sent" 
[total_sent: 100000, base_tx_count: 3000, txs_processed: 103000, txs_landed: 100000, total_us: 5365205, tx_total_us: 5285013]
{'name': 'banking_bench_total', 'median': '18638.62'}
{'name': 'banking_bench_tx_total', 'median': '18921.43'}
{'name': 'banking_bench_success_tx_total', 'median': '18638.62'}

# central scheduler
$ RUST_LOG=off target/x86_64-unknown-linux-gnu/release/solana-banking-bench --iterations 1000 --batches-per-iteration 10 --num-chunks 10 --packets-per-batch 10 --block-production-method central-scheduler --num-banking-threads 18 --write-lock-contention none |& grep -A3 "total_sent" 
[total_sent: 100000, base_tx_count: 3000, txs_processed: 103000, txs_landed: 100000, total_us: 5338780, tx_total_us: 5262498]
{'name': 'banking_bench_total', 'median': '18730.87'}
{'name': 'banking_bench_tx_total', 'median': '19002.38'}
{'name': 'banking_bench_success_tx_total', 'median': '18730.87'}

--write-lock-contention none (large batches):

# unified scheduler
$ RUST_LOG=off target/x86_64-unknown-linux-gnu/release/solana-banking-bench --iterations 1000 --batches-per-iteration 100 --num-chunks 100 --packets-per-batch 100 --block-production-method unified-scheduler --num-banking-threads 18 --write-lock-contention none |& grep -A3 "total_sent" 
[total_sent: 10000000, base_tx_count: 3000000, txs_processed: 12618585, txs_landed: 9618585, total_us: 80325909, tx_total_us: 69471761]
{'name': 'banking_bench_total', 'median': '124492.83'}
{'name': 'banking_bench_tx_total', 'median': '143943.38'}
{'name': 'banking_bench_success_tx_total', 'median': '119744.49'}

# central scheduler
$ RUST_LOG=off target/x86_64-unknown-linux-gnu/release/solana-banking-bench --iterations 1000 --batches-per-iteration 100 --num-chunks 100 --packets-per-batch 100 --block-production-method central-scheduler --num-banking-threads 18 --write-lock-contention none |& grep -A3 "total_sent"
[total_sent: 10000000, base_tx_count: 3000000, txs_processed: 12905656, txs_landed: 9905656, total_us: 121703143, tx_total_us: 111000776]
{'name': 'banking_bench_total', 'median': '82167.15'}
{'name': 'banking_bench_tx_total', 'median': '90089.46'}
{'name': 'banking_bench_success_tx_total', 'median': '81391.95'}

--write-lock-contention full:

# unified scheduler
$ RUST_LOG=off target/x86_64-unknown-linux-gnu/release/solana-banking-bench --iterations 1000 --batches-per-iteration 10 --num-chunks 10 --packets-per-batch 10 --block-production-method unified-scheduler --num-banking-threads 18 --write-lock-contention full |& grep -A3 "total_sent" 
[total_sent: 100000, base_tx_count: 2000, txs_processed: 102000, txs_landed: 100000, total_us: 5305080, tx_total_us: 5246760]
{'name': 'banking_bench_total', 'median': '18849.86'}
{'name': 'banking_bench_tx_total', 'median': '19059.38'}
{'name': 'banking_bench_success_tx_total', 'median': '18849.86'}

# central scheduler
$ RUST_LOG=off target/x86_64-unknown-linux-gnu/release/solana-banking-bench --iterations 1000 --batches-per-iteration 10 --num-chunks 10 --packets-per-batch 10 --block-production-method central-scheduler --num-banking-threads 18 --write-lock-contention full |& grep -A3 "total_sent" 
[total_sent: 100000, base_tx_count: 2000, txs_processed: 101770, txs_landed: 99770, total_us: 6133888, tx_total_us: 6062903]
{'name': 'banking_bench_total', 'median': '16302.87'}
{'name': 'banking_bench_tx_total', 'median': '16493.75'}
{'name': 'banking_bench_success_tx_total', 'median': '16265.38'}

./multinode-demo/bench-tps.sh (4 non-vote threads)
# unified scheduler
$ SOLANA_BANKING_THREADS=6 NDEBUG=1 ./multinode-demo/bootstrap-validator.sh --block-production-method unified-scheduler
...
[2024-11-28T15:04:40.911576103Z INFO  solana_bench_tps::bench] 
    Average max TPS: 93788.65, 0 nodes had 0 TPS
[2024-11-28T15:04:40.911580231Z INFO  solana_bench_tps::bench] 
    Highest TPS: 93788.65 sampling period 1s max transactions: 4156856 clients: 1 drop rate: 0.79
[2024-11-28T15:04:40.911584940Z INFO  solana_bench_tps::bench]  Average TPS: 45707.926

# central scheduler
$ SOLANA_BANKING_THREADS=6 NDEBUG=1 ./multinode-demo/bootstrap-validator.sh --block-production-method central-scheduler
...
[2024-11-28T15:08:13.261073692Z INFO  solana_bench_tps::bench] 
    Average max TPS: 74200.45, 0 nodes had 0 TPS
[2024-11-28T15:08:13.261080075Z INFO  solana_bench_tps::bench] 
    Highest TPS: 74200.45 sampling period 1s max transactions: 4487726 clients: 1 drop rate: 0.78
[2024-11-28T15:08:13.261088260Z INFO  solana_bench_tps::bench]  Average TPS: 49243.36

./multinode-demo/bench-tps.sh (16 non-vote threads)

(note: i commented out some code to disable block cost limits)

# unified scheduler
$ SOLANA_BANKING_THREADS=18 NDEBUG=1 ./multinode-demo/bootstrap-validator.sh --block-production-method unified-scheduler
...
[2024-11-28T15:00:12.910576223Z INFO  solana_bench_tps::bench] 
    Average max TPS: 125672.52, 0 nodes had 0 TPS
[2024-11-28T15:00:12.910582395Z INFO  solana_bench_tps::bench] 
    Highest TPS: 125672.52 sampling period 1s max transactions: 8325590 clients: 1 drop rate: 0.57
[2024-11-28T15:00:12.910587254Z INFO  solana_bench_tps::bench]  Average TPS: 92330.64

# central scheduler
$ SOLANA_BANKING_THREADS=18 NDEBUG=1 ./multinode-demo/bootstrap-validator.sh --block-production-method central-scheduler
...
[2024-11-28T14:56:20.503528843Z INFO  solana_bench_tps::bench]
    Average max TPS: 86596.78, 0 nodes had 0 TPS
[2024-11-28T14:56:20.503534875Z INFO  solana_bench_tps::bench]
    Highest TPS: 86596.78 sampling period 1s max transactions: 4359165 clients: 1 drop rate: 0.78
[2024-11-28T14:56:20.503540596Z INFO  solana_bench_tps::bench]  Average TPS: 48222.74

simulate-block-production

google sheets

@apfitzge

Awesome. I'm excited to see this in mainnet, and generally more variability in scheduling is good for network health & security.

Some initial thoughts about trade-offs between unified (unbatched) and central (batched) scheduler variants wrt block-production:

  • Unbatched transaction execution will likely see more overhead relative to transaction execution time, due to our inefficient locking approach in several spots in tx execution
  • Upcoming TransactionView stuff will likely give central-scheduler a leg up until replay/unified are converted to using TransactionView (not as simple to make generic, as discussed privately). This may falsely skew results against unified-scheduler in the short term.
  • SIMD-0083 - batching may become even more efficient by bypassing the shared accounts-cache to perform write-access on the same account. Unified-scheduler in its unbatched approach will not benefit from this.
  • Several conflict patterns can be handled better by the unbatched approach; these patterns DO appear in mainnet traffic, so it is definitely possible that unified-scheduler can collect more fees.

@ryoqun
Member Author

ryoqun commented Dec 11, 2024

Some initial thoughts about trade-offs between unified (unbatched) and central (batched) scheduler variants wrt block-production:

thanks for the input.

* Unbatched transaction execution will likely see more overhead relative to transaction execution time, due to our inefficient locking approach in several spots in tx execution

certainly, there will be some extra overhead left, no matter how much I optimize the runtime. However, I think this overhead isn't actually that large, as I demoed with a hacky bench in the past. In the future, I'll introduce a fast path for unbatched tx execution; in it, there will be less cpu-dcache thrashing, fewer allocations, and fewer instructions.

Also, in principle, most of the cpu time should be spent spinning on jitted on-chain program code, which doesn't benefit much from batching. With all this personal analysis, I'm betting the overhead can be offset.
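
to sketch the general direction of that fast path (a hypothetical illustration only; the types and names below are made up and this is not the planned agave change): reuse a preallocated, worker-local scratch context across transactions, so the unbatched hot path does no per-tx heap allocation for its own bookkeeping.

```rust
// Hypothetical worker-local scratch space, reused across transactions so the
// unbatched path doesn't allocate per transaction (names are made up).
struct ExecScratch {
    account_indexes: Vec<usize>,
    compute_log: Vec<u64>,
}

impl ExecScratch {
    fn new() -> Self {
        // Allocate once per worker thread, with generous capacity.
        Self {
            account_indexes: Vec::with_capacity(256),
            compute_log: Vec::with_capacity(1024),
        }
    }
}

// Stand-in transaction type for the sketch.
struct Tx {
    accounts: Vec<usize>,
}

fn execute_one(tx: &Tx, scratch: &mut ExecScratch) {
    // Reuse the buffers: clear() keeps the capacity, so no new allocation
    // happens on this hot path as long as the capacity suffices.
    scratch.account_indexes.clear();
    scratch.compute_log.clear();
    scratch.account_indexes.extend_from_slice(&tx.accounts);
    // ... actual execution would go here ...
}

fn worker_loop(txs: &[Tx]) {
    let mut scratch = ExecScratch::new(); // one allocation up front
    for tx in txs {
        execute_one(tx, &mut scratch);
    }
}

fn main() {
    let txs = vec![Tx { accounts: vec![1, 2, 3] }, Tx { accounts: vec![4] }];
    worker_loop(&txs);
}
```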

* Upcoming TransactionView stuff will likely give central-scheduler a leg up until replay/unified are converted to using TransactionView (not as simple to make generic, as discussed privately). This may falsely skew results against unified-scheduler in the short term.

Thanks for working on TransactionView by the way. :) I think it benefits both schedulers to roughly the same extent. Until the unified scheduler migration is complete, I can bench with some local changes once #3820 is merged.

* SIMD-0083 - batching may become even more efficient by bypassing the shared accounts-cache to perform write-access on the same account. Unified-scheduler in its unbatched approach will not benefit from this.

I'm interested in seeing the improved perf numbers of the central scheduler after SIMD-0083, which will lift the artificial limitation currently imposed only on the central scheduler. Regarding the mentioned shared accounts-cache, the unified scheduler will also bypass it as part of the unbatched fast-path impl.

* Several conflict patterns can be handled better by the unbatched approach; these patterns DO appear in mainnet traffic, so it is definitely possible that unified-scheduler can collect more fees.

yeah, I hope this assumption will hold.
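
as a toy illustration of such a pattern (completely made-up numbers and deliberately simplified lock semantics, not mainnet data): if batching happens to interleave txs that contend on one hot account with independent txs, each mixed batch holds the hot-account lock for its whole duration and the batches serialize, while the unbatched approach only delays the contending txs themselves.

```rust
// Toy model under loudly simplified assumptions: every tx takes 1 time unit,
// a batch write-locks the union of its txs' accounts for its whole duration,
// and an unbatched scheduler locks per tx only while that tx runs.
// Pattern: k txs contending on one hot account, mixed with k independent txs,
// on 2 worker threads.

fn batched_makespan(k: u32) -> u32 {
    // Worst-case interleaving: k batches of (1 hot tx + 1 independent tx).
    // Every batch holds the hot-account lock for its full 2-unit duration,
    // so the batches serialize end to end.
    2 * k
}

fn unbatched_makespan(k: u32) -> u32 {
    // Hot txs serialize on one worker (k units) while the independent txs
    // clear in parallel on the other worker, so the makespan is just k.
    k
}

fn main() {
    let k = 8;
    println!(
        "toy pattern ({k} hot + {k} independent txs): batched ~{} units, unbatched ~{} units",
        batched_makespan(k),
        unbatched_makespan(k),
    );
}
```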
