[Discussion] Benchmarking and regression testing setup #3048
Comments
I'd like to add that it's critical to test on production networks (betanet/testnet) in addition to the aforementioned simulated conditions. Otherwise we have no visibility into things that simply don't work in prod (a recent example: we couldn't easily reach the claimed 200 TPS outside of some peak blocks, and RPC performance degrades badly under load). Previously we also had situations where the chain team declared everything was working fine (because blocks were being produced) while RPC was completely overwhelmed: none of the app-level tools worked, but testing and monitoring weren't done with app-level tools, so nobody noticed. Basically, we need more end-to-end testing happening on production networks. This involves using the real tools we ship rather than some internal version optimized just for the benchmark, which doesn't make the same HTTP queries an app would.
For that we need to collect organic traffic and then replay it.
@nearmax this is very useful and we should do it, but it's an orthogonal thing. We need to test on actual production networks; the simulated workload already exposes a lot of problems.
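To make the capture-and-replay idea above concrete, here is a minimal sketch, assuming requests have been logged as one JSON-RPC body per line; the endpoint URL, capture file name, and metrics reported are illustrative assumptions, not an existing tool:

```python
# Hypothetical sketch: replay previously captured JSON-RPC traffic against a node.
# Assumes a capture file with one JSON-RPC request body per line (jsonlines);
# the endpoint and file name are placeholders.
import json
import time

import requests  # pip install requests

RPC_ENDPOINT = "https://rpc.testnet.near.org"  # or a betanet/mocknet node


def replay(capture_path: str, endpoint: str = RPC_ENDPOINT) -> None:
    latencies = []
    with open(capture_path) as f:
        for line in f:
            body = json.loads(line)  # e.g. {"jsonrpc": "2.0", "method": "query", ...}
            start = time.monotonic()
            resp = requests.post(endpoint, json=body, timeout=30)
            latencies.append(time.monotonic() - start)
            resp.raise_for_status()
    latencies.sort()
    p50 = latencies[len(latencies) // 2]
    p99 = latencies[int(len(latencies) * 0.99)]
    print(f"replayed {len(latencies)} requests, p50={p50:.3f}s p99={p99:.3f}s")


if __name__ == "__main__":
    replay("captured_rpc_traffic.jsonl")
```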
We also have an issue open already for PagerDuty alerts + Grafana setup: https://github.com/nearprotocol/near-ops/issues/61.
Perhaps another metric to consider is performance persistence over time. For example, the existing mocknet load test running in nightly passes with a fresh network (newly restarted from genesis), but fails in subsequent runs. This indicates to me there could be some performance degradation over time.
Note: mocknet is not yet automatically updated in nightly, so there is no change to the network between these three runs; the only difference is how long the network has been up.
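One way to turn this observation into a test is sketched below; `measure_tps()` is a placeholder for whatever the existing mocknet load test already reports, and the sampling interval and tolerance are assumptions for illustration:

```python
# Hypothetical degradation check: sample throughput periodically and fail if it
# drops more than a given fraction below the first sample.
import time


def measure_tps() -> float:
    """Placeholder: run the load test once and return observed transactions/sec."""
    raise NotImplementedError


def check_no_degradation(samples: int = 3, interval_s: float = 3600.0,
                         max_drop: float = 0.10) -> None:
    baseline = measure_tps()
    for i in range(1, samples):
        time.sleep(interval_s)
        tps = measure_tps()
        drop = (baseline - tps) / baseline
        assert drop <= max_drop, (
            f"run {i}: TPS dropped {drop:.0%} from {baseline:.1f} to {tps:.1f}"
        )
```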
@chefsale Nice, so you are covering some health metrics. @birchmd I agree, we should have special tests for degradation.
What would be the purpose of running it in nightly if it is not automatically updated to the new code? I thought nightly runs attempt to fetch the most recent code and run tests against it. Also, I think the current nightly setup can be the platform for benchmarking and regression testing that we build on.
@nearmax Nightly still fetches the most recent test code (the code it runs to execute tests), but it does not re-deploy a binary to the remote nodes which form the mocknet (they are all on their own machines in various gcloud regions). This is something we'll set up eventually, it's in the backlog. But for now, since I am actively working with the mocknet anyways, I manually update all the nodes once per day (by "manually", I mean "run a script" because there is already automation for this, just not integrated with nightly yet).
Motivation
We need to have a way to benchmark our nodes in various scenarios and especially track regressions. We also need to have a clean and reproducible way to measure improvements to our core metrics.
Metrics
The following metrics are important for us and need to be tracked consistently for each release:
Core metrics
Chain metrics
Runtime metrics
Health metrics
Setups
Type of traffic
Ideally we want to measure every metric mentioned above for every combination of setup x type of traffic. We can start with 9 setups and work up to 3 * 2^3 = 24 setups to cover all combinations, then measure each setup against each of the 6 types of traffic. Ideally that yields 6 * 24 = 144 combinations, but we can narrow it down to a few of the most important ones for now.
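A small sketch of how the matrix arithmetic works out; the dimension values below are placeholders, not the final list of setups or traffic types:

```python
# Illustrative enumeration of the benchmark matrix (setups x traffic types).
from itertools import product

network_sizes = ["small", "medium", "large"]        # 3 base setups
toggles = list(product([False, True], repeat=3))    # 2^3 on/off variants per setup
traffic_types = ["traffic_1", "traffic_2", "traffic_3",
                 "traffic_4", "traffic_5", "traffic_6"]  # 6 types of traffic

setups = list(product(network_sizes, toggles))      # 3 * 2^3 = 24 setups
matrix = list(product(setups, traffic_types))       # 24 * 6 = 144 combinations
print(len(setups), len(matrix))                     # 24 144
```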
We would perform this benchmarking automatically on each release and then use these numbers for the following:
Please discuss: which metrics, setups, and types of traffic we should cover first.
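For the regression-tracking part, a minimal sketch of how a per-release check could compare fresh numbers against the previous release's baseline; the JSON file layout, metric names, and 5% threshold are assumptions for illustration:

```python
# Hypothetical per-release regression gate: compare this release's metrics against
# a stored baseline and fail if any metric regresses by more than a threshold.
import json
import sys

THRESHOLD = 0.05  # allow up to 5% regression per metric


def check_regressions(baseline_path: str, current_path: str) -> int:
    with open(baseline_path) as f:
        baseline = json.load(f)   # e.g. {"tps": 200.0, "block_time_ms": 1100.0}
    with open(current_path) as f:
        current = json.load(f)

    failures = []
    for name, old in baseline.items():
        new = current.get(name)
        if new is None:
            continue
        if name.endswith("_ms"):   # latency-style metric: lower is better
            regressed = new > old * (1 + THRESHOLD)
        else:                      # throughput-style metric: higher is better
            regressed = new < old * (1 - THRESHOLD)
        if regressed:
            failures.append(f"{name}: baseline={old} current={new}")

    for line in failures:
        print("REGRESSION", line)
    return 1 if failures else 0


if __name__ == "__main__":
    sys.exit(check_regressions("baseline_metrics.json", "current_metrics.json"))
```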