
Proposal: Cross-Implementation Benchmarking Dataset for Plutus Performance #6626

Open · sierkov opened this issue Nov 2, 2024 · 12 comments
Labels: Benchmarks, Low priority, status: triaged

Comments

sierkov commented Nov 2, 2024

Describe the feature you'd like

I'm working on a C++ implementation of Plutus aimed at optimizing batch synchronization. We'd like to benchmark our implementation against the existing open-source Plutus implementations to foster cross-learning and to understand their relative performance. This issue requests feedback on the proposed benchmark dataset, as well as approved code samples representing your implementation for inclusion in our benchmarks. Detailed information is provided below.

The proposed benchmark dataset is driven by the following considerations:

  1. Predictive Power: Benchmark results should allow us to predict the time required for a given implementation to validate all script witnesses on Cardano’s mainnet.
  2. Efficient Runtime: The benchmark should complete quickly to enable rapid experimentation and performance evaluation.
  3. Parallelization Awareness: It must assess both single-threaded and multi-threaded performance to identify implementation approaches that influence the parallel efficiency of script witness validation.
  4. Sufficient Sample Size: The dataset should contain enough samples to allow computing reasonable sub-splits for further analysis, such as by Plutus version or by Cardano era.

The procedure for creating the proposed benchmark dataset is as follows:

  1. Transaction Sampling: Randomly select, without replacement, a sample of 256,000 mainnet transactions containing Plutus script witnesses. This sample size balances speed, sufficient data for analysis, and compatibility with high-end server hardware with up to 256 execution threads. The randomness of the sample allows for generalizable predictions of the validation time of all transactions with script witnesses.
  2. Script Preparation: For each script witness in the selected transactions, prepare the required arguments and script context data. Save each as a Plutus script in Flat format, with all arguments pre-applied.
  3. File Organization: For easier debugging, organize all extracted scripts using the following filename pattern: <mainnet-epoch>/<transaction-id>-<script-hash>-<redeemer-idx>.flat. (A sketch of this procedure follows below.)
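
To make steps 1 and 3 concrete, here is a minimal sketch (not the actual tooling); `candidate_txs` and `write_flat_script` are hypothetical placeholders for the chain-data extraction and for the Flat serialization with pre-applied arguments:

```python
import random
from pathlib import Path

SAMPLE_SIZE = 256_000

def build_dataset(candidate_txs, out_dir: Path) -> None:
    """Sample transactions without replacement and lay the files out as
    <mainnet-epoch>/<transaction-id>-<script-hash>-<redeemer-idx>.flat."""
    sample = random.sample(candidate_txs, SAMPLE_SIZE)  # without replacement
    for epoch, tx_id, script_hash, redeemer_idx in sample:
        path = out_dir / str(epoch) / f"{tx_id}-{script_hash}-{redeemer_idx}.flat"
        path.parent.mkdir(parents=True, exist_ok=True)
        # write_flat_script (hypothetical) serializes the script in Flat
        # format with all arguments and the script context pre-applied.
        path.write_bytes(write_flat_script(tx_id, script_hash, redeemer_idx))
```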

To gather performance data across open-source Plutus implementations, I am reaching out to the projects listed below. If there are other implementations not listed here, please let me know; I'd be happy to include them in the benchmark analysis. The known Plutus implementations are:

  1. https://github.com/IntersectMBO/plutus
  2. https://github.com/pragma-org/uplc
  3. https://github.com/nau/scalus
  4. https://github.com/OpShin/uplc
  5. https://github.com/HeliosLang/uplc
  6. https://github.com/HarmonicLabs/plutus-machine

I look forward to your feedback on the proposed benchmark dataset and to your support in providing code that can represent your project in this benchmark.

Describe alternatives you've considered

No response

github-actions bot added the status: needs triage label Nov 2, 2024
effectfully (Contributor) commented:

> The randomness of the sample allows for generalizable predictions of the validation time of all transactions with script witnesses.

I suppose this will give you the most popular scripts rather than the most diverse ones. But I think that's fine.

So what do you want from us? A thumbs up? That all sounds great. I think whatever code we might have provided would just be skewed towards our implementation, and who cares about our implementation when people pay for the scripts on mainnet. So just take it from the mainnet as per your plan; I think it's representative enough.

I'm not sure how to triage this issue, so I'll triage it as "Low priority".

effectfully added the Benchmarks, Low priority, and status: triaged labels and removed the status: needs triage label Nov 12, 2024
sierkov (Author) commented Nov 14, 2024

@effectfully, this task has two stages with the following needs:

  1. Stage 1 - Planning:
    • A thumbs up that the proposed methodology makes sense.
    • A thumbs up that the proposed dataset format (a directory of scripts in the flat format with arguments pre-applied) makes it easy to create a reference script.
  2. Stage 2 - Benchmarking (after the dataset and a reference implementation are shared):
    • Prepare a reference script that takes a directory path to the dataset and the number of worker threads, and outputs the run time and result for each executed script (see the sketch after this list).
    • Provide an interpretation of the benchmark results from the point of view of the Haskell implementation.
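
For illustration, here is a minimal sketch of that reference-script interface, assuming a hypothetical `evaluate_flat_script` entry point for the implementation under test; this is just the intended shape of the harness, not the Haskell reference script itself:

```python
import sys
import time
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

def run_one(path: Path):
    """Evaluate one pre-applied Flat script and time it."""
    start = time.perf_counter()
    result = evaluate_flat_script(path.read_bytes())  # hypothetical evaluator
    return path, time.perf_counter() - start, result

def main(dataset_dir: str, num_workers: int) -> None:
    # Dataset layout: <mainnet-epoch>/<tx-id>-<script-hash>-<redeemer-idx>.flat
    paths = sorted(Path(dataset_dir).glob("*/*.flat"))
    with ProcessPoolExecutor(max_workers=num_workers) as pool:
        for path, elapsed, result in pool.map(run_one, paths):
            print(f"{path}\t{elapsed:.6f}\t{result}")

if __name__ == "__main__":
    main(sys.argv[1], int(sys.argv[2]))
```

A real harness would substitute the implementation's own entry point and might prefer threads over processes, depending on how the evaluator manages memory and concurrency.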

Regarding the representation of scripts: yes, with random selection, more popular scripts have a stronger influence on the results than less popular ones. However, that is exactly how they influence the Node's run time on the mainnet.
At the same time, a dataset that did not reflect the relative frequency of scripts on the mainnet would not be representative of the actual mainnet workload. The relatively large sample size (256,000 transactions with script witnesses) should still ensure a solid representation of diverse scripts in the dataset. If you have specific characteristics in mind that you'd like to test (the number of unique scripts, etc.), please let me know. Generally, the sample size can be increased if there are clear benefits that outweigh the longer benchmark run times.

Regarding approved scripts for each implementation: one of the benchmark's goals is to measure the parallel efficiency of each implementation (i.e., whether performance scales linearly with the number of workers; see the sketch below). As I understand it, that may require some fine-tuning of the Haskell runtime parameters for optimal performance, and I don't want to risk misconfiguring it. Another reason is that you could provide several scripts representing experimental optimizations. For example, Rust and C++ implementations seem to benefit drastically from optimizations related to memory-allocation patterns. A strong impact of certain optimizations may help advocate for their earlier inclusion in the mainnet release of Cardano Node.
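
For reference, the "scales linearly" question can be quantified as parallel efficiency, i.e. speedup divided by worker count (my framing, not something specified in this thread):

```python
def parallel_efficiency(t_1: float, t_k: float, k: int) -> float:
    """Speedup (t_1 / t_k) divided by worker count k; 1.0 means perfectly
    linear scaling with the number of workers."""
    return (t_1 / t_k) / k

# E.g. 60 min on 1 worker vs. 5 min on 16 workers -> efficiency of 0.75.
print(parallel_efficiency(60.0, 5.0, 16))
```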

Unisay (Contributor) commented Nov 15, 2024

JFYI: we have a daily updated database with all script evaluation data from mainnet: db

There are 92,991 unique Plutus scripts on mainnet today.
(Those scripts were evaluated 403_665_609 times.)

effectfully (Contributor) commented:

@sierkov everything you said makes perfect sense to me.

@rvcas @MicroProofs are you folks interested in any of what's been discussed here?

sierkov (Author) commented Nov 19, 2024

@effectfully, @rvcas, here are the links:

The README includes detailed information, such as the latest performance results of the C++ Plutus implementation and step-by-step instructions for reproducing the transaction sampling and dataset creation.

The performance of the C++ implementation already meets our internal target: it validates all transaction witnesses in under an hour on a high-end laptop. However, we believe there is room for further optimization and are eager to collect feedback and exchange ideas with other implementations.

Feedback on the dataset, benchmarking script, and performance results is welcome. Let me know if you have questions or need support in preparing implementation-specific scripts.

@Unisay, thank you for sharing the statistics. To generate the benchmarking dataset, mainnet data up to epoch 521 was analyzed. At the time of generation, the number of unique observed Plutus scripts was 95,459. However, the number of observed Plutus redeemers (and therefore Plutus script evaluations) was only 40,525,056.

This figure appears significantly lower than the number you reported. Could you kindly double-check your results? If your figures are correct, I’d greatly appreciate it if you could share your methodology so we can better understand the discrepancy.

In our case, the number of redeemers was calculated as the number of (non-unique) entries of type 5 in the transaction witness sets across all blockchain blocks. This analysis was performed by directly examining the raw blockchain data, allowing us to trace each number back to a specific block and transaction. (A sketch of this counting rule follows below.)
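
As an illustration of that counting rule, here is a minimal sketch using the cbor2 library; it assumes `witness_set_cbor` holds the raw CBOR of a single transaction witness set, in which key 5 holds the redeemer entries, and it glosses over era-specific encoding differences:

```python
import cbor2

def count_redeemers(witness_set_cbor: bytes) -> int:
    witness_set = cbor2.loads(witness_set_cbor)  # a map keyed by small integers
    redeemers = witness_set.get(5, [])           # key 5: the redeemer entries
    return len(redeemers)                        # non-unique count, as above
```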

Unisay (Contributor) commented Nov 21, 2024

> This figure appears significantly lower than the number you reported. Could you kindly double-check your results? If your figures are correct, I'd greatly appreciate it if you could share your methodology so we can better understand the discrepancy.

@sierkov Interesting discrepancy, indeed.

The code I've used to extract Plutus script evaluations from mainnet is currently in a private repo. I am working on making it public; once it's ready, I'll share a link with you. I can also describe what is done there: Cardano.Api is a Haskell library that delegates tasks to the Ledger library. It contains the applyBlock function.
The indexer applies it block by block, folding over the ledger state and getting a [LedgerEvent] for each application. These LedgerEvents include SuccessfulPlutusScript plutusEventsWithCtx and FailedPlutusScript plutusEventsWithCtx. All such events are then inserted into a PostgreSQL DB table, script_evaluation_events, and there are 40929053 records in it right now.

Here is a sample of the last 20 rows:
[screenshot: last 20 rows of the script_evaluation_events table]

sierkov (Author) commented Nov 22, 2024

@Unisay, thank you for the prompt follow-up. To better understand the issue, we analyzed a small random sample of transactions with Plutus witnesses by manually comparing redeemer counts against Cexplorer.io. All analyzed transactions matched precisely.

Examples:

To help find the root cause of the discrepancy, I’m attaching two files with the following statistics:

  • The number of Plutus redeemers for each mainnet epoch: epoch-stats.txt.
  • The number of script evaluations for each unique Plutus script: script-stats.txt.

Would it be possible for you to prepare tables with the same contents for your dataset? That would allow us to trace the causes of the discrepancies down to individual epochs and scripts. Then we can confirm each case by manually analyzing the respective raw data.

P.S. The epochs table reports both unique and non-unique redeemers. That's because a small fraction of transactions contain multiple redeemers with the same id (purpose tag + reference index), while the Cardano Node evaluates only the final entry. I'm reporting this for completeness. (A sketch of the deduplication rule follows below.)
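
A minimal sketch of that deduplication rule, where `redeemers` is assumed to be a list of (purpose_tag, ref_index, payload) tuples in transaction order:

```python
def dedupe_redeemers(redeemers):
    """Keep only the final entry per (purpose tag, reference index) id,
    mirroring the node behaviour described above."""
    last = {}
    for tag, index, payload in redeemers:
        last[(tag, index)] = payload  # later entries overwrite earlier ones
    return last
```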

sierkov (Author) commented Nov 22, 2024

@Unisay, your last message mentions '40929053' records (~40 million), whereas your first message mentions '403_665_609' (~400 million, 10x more). The numbers we observe are ~40 million. Could you confirm which is correct according to your dataset?

Unisay (Contributor) commented Nov 22, 2024

I apologize for the confusion: I gave you the wrong number the first time. It's 40 million, not 400.

Unisay (Contributor) commented Nov 25, 2024

Here is the repository (as promised).
This is the place where LedgerEvents are emitted for each block.

sierkov (Author) commented Nov 27, 2024

@Unisay, thank you for sharing the code. Two quick questions:

  1. Which approach do you use to test the correctness of computed statistics? Is there a pre-generated test chain with known parameters? If so, I’d be grateful for a pointer.
  2. Could you share the expected time to populate the database with mainnet data?

Unisay (Contributor) commented Nov 28, 2024

> 1. Which approach do you use to test the correctness of computed statistics? Is there a pre-generated test chain with known parameters? If so, I'd be grateful for a pointer.

We don't test computed stats currently 🤷🏼‍♂️

> 2. Could you share the expected time to populate the database with mainnet data?

The majority of the time is spent indexing from Genesis, and we did that quite some time ago; IIRC it took very roughly 1 to 1.5 days to reach the "immutable tip". Since then we've been running a cron job to catch up daily.
