chore: Greater stability at 1TPS #10981

Merged
merged 57 commits into master from pw/ssd on Jan 9, 2025

Conversation

@PhilWindle (Collaborator) commented Dec 31, 2024

This PR contains the following changes:

  1. Updated the version of lmdb-js to fix crashes in the prover broker.
  2. Added DB map size configuration for the prover broker.
  3. Added historical header validation to sequencer validations.
  4. Added heap size configurations to deployments.
  5. Introduced local SSD configurations to deployments.
  6. Configured the exp-1 test scenario with suitable memory and SSD capacity values.
  7. Added new 2- and 4-core node pools with local SSD options.
  8. Serialised operations in the prover broker and the prover node's facade interface.
  9. Fixed a bug where the prover broker was being started twice.

@PhilWindle added the e2e-all and network-all labels on Dec 31, 2024
@spalladino (Collaborator) left a comment:

Looks good, but this definitely needs @alexghr to take a look

storage: "8Gi"
archiverPollingInterval: 1000
archiverViemPollingInterval: 1000
pollInterval: 1000
viemPollingInterval: 1000
dataDir: "/data"
storageSize: "1Gi"
Collaborator:
We have a storage entry defined a few lines above, should we delete it in favor of this one?

@PhilWindle (author):
Done

memory: "5Gi"
cpu: "1.5"
ephemeral-storage: "275Gi"
maxOldSpaceSize: "5120"
Collaborator:
Shouldn't maxOldSpaceSize be slightly below the total memory available?

@PhilWindle (author):
Yeah possibly, will modify

@PhilWindle (author):
Looks like Node.js recommends around 0.5 GB of headroom: https://nodejs.org/api/cli.html#--max-old-space-sizesize-in-mib
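
For illustration, a minimal sketch of that headroom calculation (the helper name and the 512 MiB default are assumptions, not values from this PR):

// Illustrative only: derive a --max-old-space-size value (in MiB) from a pod's
// memory limit, leaving ~512 MiB of headroom for non-heap allocations
// (buffers, stacks, native memory). A 5 GiB limit would then give 4608 MiB of
// heap rather than the full 5120.
function maxOldSpaceSizeMiB(containerMemoryMiB: number, headroomMiB = 512): number {
  return Math.max(containerMemoryMiB - headroomMiB, 256);
}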

import { type ProvingOrchestrator } from '../orchestrator/orchestrator.js';
import { type BrokerCircuitProverFacade } from '../proving_broker/broker_prover_facade.js';

/** Encapsulates the proving orchestrator and the broker facade */
Collaborator:
Curious: why not just make the orchestrator start/stop the facade?

@PhilWindle (author):
Well, the orchestrator just receives a ServerCircuitProver interface, which only contains methods such as:

  getBaseParityProof(
    inputs: BaseParityInputs,
    signal?: AbortSignal,
    epochNumber?: number,
  ): Promise<PublicInputsAndRecursiveProof<ParityPublicInputs, typeof RECURSIVE_PROOF_LENGTH>>;

I felt it is something of an implementation detail that the facade (an instance of a ServerCircuitProver) needs to be started and stopped with each epoch.
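
For illustration, a rough sketch of what such an encapsulating wrapper could look like; all names below are assumptions rather than the actual class added in this PR:

// Minimal sketch: a wrapper owns the facade's per-epoch lifecycle so the
// orchestrator only ever sees a ServerCircuitProver.
interface Startable {
  start(): void;
  stop(): Promise<void>;
}

class EpochProver {
  constructor(private readonly facade: Startable) {}

  async proveEpoch(runOrchestrator: () => Promise<void>): Promise<void> {
    this.facade.start();
    try {
      await runOrchestrator(); // the orchestrator drives the actual proving
    } finally {
      await this.facade.stop(); // stop the facade even if proving fails
    }
  }
}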

Collaborator:
Makes sense. FWIW I had been lazy in these situations and used a Maybe<Service> type (meaning the dependency may be stoppable) along with a tryStop in the parent:

/** Tries to call stop on a given object and awaits it. Logs any errors and does not rethrow. */
export async function tryStop(service: Maybe<Service>, logger?: Logger): Promise<void> {
  try {
    return typeof service === 'object' && service && 'stop' in service && typeof service.stop === 'function'
      ? await service.stop()
      : Promise.resolve();
  } catch (err) {
    logger?.error(`Error stopping service ${(service as object).constructor?.name}: ${err}`);
  }
}

Less clean, but it saves having to add another object just to manage the dependencies' lifecycle.
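
A hypothetical usage sketch of tryStop; the Maybe, Service and Logger shapes below are stand-ins for the repo's types, not their real definitions:

type Service = { stop(): Promise<void> | void };
type Maybe<T> = T | undefined; // stand-in; the real type may also allow non-stoppable objects
type Logger = { error(msg: string): void };

class ParentComponent {
  constructor(private readonly dependency: Maybe<Service>, private readonly log?: Logger) {}

  async stop(): Promise<void> {
    // The parent does not need to know whether its dependency is stoppable.
    await tryStop(this.dependency, this.log);
  }
}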

Collaborator:
(just in case: I'm not advocating for one approach or the other!)

private jobs: Map<ProvingJobId, ProvingJob> = new Map();
private runningPromise?: RunningPromise;
private timeOfLastSnapshotSync = Date.now();
private queue?: SerialQueue = new SerialQueue();
Collaborator:
Why the ?, given it's initialized on construction?

@PhilWindle (author):
Yeah that was a mistake. Nice catch.

Comment on lines 543 to 554
let numCleanups = 1;
let numEnqueue = 1;
let remaining = await this.requestQueue.put(() => this.cleanupStaleJobs());
while (remaining) {
remaining = await this.requestQueue.put(() => this.cleanupStaleJobs());
numCleanups++;
}
remaining = await this.requestQueue.put(() => this.reEnqueueExpiredJobs());
while (remaining) {
remaining = await this.requestQueue.put(() => this.reEnqueueExpiredJobs());
numEnqueue++;
}
Collaborator:
Not new on this PR, but shouldn't we re-enqueue jobs before cleaning up stale ones? Seems like re-enqueuing is more time-pressing.

@PhilWindle (author):
This has now changed. Cleaning up stale jobs no longer touches the DB, so it should not incur any latency. Neither does re-enqueueing jobs.

The DB cleanup is now performed after both of these operations.
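
A minimal sketch of that ordering; cleanupDatabase is an assumed name for the deferred DB-touching step, while the other two method names come from the snippet above:

// In-memory passes run first; the single DB-touching step runs once at the end.
async function runMaintenance(broker: {
  cleanupStaleJobs(): Promise<void>;      // in-memory only
  reEnqueueExpiredJobs(): Promise<void>;  // in-memory only
  cleanupDatabase(): Promise<void>;       // assumed name for the deferred DB cleanup
}): Promise<void> {
  await broker.cleanupStaleJobs();
  await broker.reEnqueueExpiredJobs();
  await broker.cleanupDatabase();
}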

*/
private epochHeight = 0;
private maxEpochsToKeepResultsFor = 1;

private requestQueue: SerialQueue = new SerialQueue();
Collaborator:
Curious: what prompted using a serial queue here?

@PhilWindle (author):
Stability. Without the queue we effectively have an unbounded number of concurrent write transactions against the database. It's limited by the number of prover agents, but that could be thousands. LMDB only allows one write transaction at a time; it's the JS wrapper that does a lot of work behind the scenes to give the illusion of concurrency, and I worry about its stability under such heavy concurrent access.

It may be that as we try to scale to larger epochs, writing one job/result at a time is insufficient. However, I think a better strategy there would be to still perform writes sequentially but write batches of updates instead of just one.
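
A minimal sketch of the serialisation idea, assuming a promise-chaining SerialQueue (the real class almost certainly differs): every write funnels through one queue, so LMDB only ever sees a single write transaction at a time.

class SerialQueue {
  private tail: Promise<unknown> = Promise.resolve();

  // Runs fn after everything previously put on the queue has settled.
  put<T>(fn: () => Promise<T>): Promise<T> {
    const result = this.tail.then(fn);
    this.tail = result.catch(() => undefined); // keep the chain alive on errors
    return result;
  }
}

class JobStore {
  private readonly requestQueue = new SerialQueue();

  constructor(private readonly db: { put(id: string, value: unknown): Promise<void> }) {}

  // Thousands of agents can call this concurrently; writes still hit the DB one at a time.
  saveResult(id: string, result: unknown): Promise<void> {
    return this.requestQueue.put(() => this.db.put(id, result));
  }
}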

Comment on lines +156 to +157
// Job was not enqueued. It must be completed already, add to our set of already completed jobs
this.jobsToRetrieve.add(id);
Collaborator:
Is it possible the job was just enqueued twice, so it is not yet complete? IIUC the broker will return false on enqueue if it already has a job with the same id, regardless of whether that job is finished or not.

@PhilWindle (author):
You are correct. This is now refactored to return better information as to whether the job was enqueued or not.
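
A hypothetical shape for that richer return value; the names below are assumptions, not the actual refactor:

type EnqueueResult =
  | { status: 'enqueued' }   // accepted and queued
  | { status: 'duplicate' }  // a job with this id is already queued or running
  | { status: 'completed' }; // a job with this id already finished

async function trackJob(
  id: string,
  enqueue: (id: string) => Promise<EnqueueResult>,
  jobsToRetrieve: Set<string>,
): Promise<void> {
  const { status } = await enqueue(id);
  // Only treat the job as done when the broker says so, not merely because
  // the enqueue was rejected as a duplicate.
  if (status === 'completed') {
    jobsToRetrieve.add(id);
  }
}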

if (output.type === type) {
return output.result as ProvingJobResultsMap[T];
if (output.type === jobType) {
return { result: output.result as ProvingJobResultsMap[T], success: true, reason: '' };
Collaborator:
Suggested change
return { result: output.result as ProvingJobResultsMap[T], success: true, reason: '' };
return { result: output.result as ProvingJobResultsMap[T], success: true };

Nit: let's not use empty strings instead of null/undefined.

@PhilWindle (author):
Done

@alexghr (Contributor) left a comment:
LGTM


export const getEpochFromProvingJobId = (id: ProvingJobId) => {
  const components = id.split(':');
  return +components[0];
Contributor:
Not for now, but it might be worth throwing here if this is not a number (the only benefit would be catching tests that use old-style IDs).
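
A possible stricter version along those lines (the guard and error message are assumptions, and ProvingJobId is treated as a plain string here):

type ProvingJobId = string; // assumed shape: `${epochNumber}:${rest}`

export const getEpochFromProvingJobId = (id: ProvingJobId): number => {
  const [epoch] = id.split(':');
  const value = Number(epoch);
  if (!Number.isInteger(value) || value < 0) {
    throw new Error(`Proving job id "${id}" does not start with an epoch number`);
  }
  return value;
};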

// keep retrying until we time out
}
// Job was not enqueued. It must be completed already, add to our set of already completed jobs
this.jobsToRetrieve.add(id);
Contributor:
On the new else branch (now that we have a status), we technically have the result in the status object and could resolve immediately, but that's an optimization for another time :)

@PhilWindle PhilWindle merged commit 1c23662 into master Jan 9, 2025
78 checks passed
@PhilWindle PhilWindle deleted the pw/ssd branch January 9, 2025 16:14
alexghr added a commit that referenced this pull request Jan 9, 2025
This PR refactors the types of the prover broker and agent config to
reuse more of the existing helpers.

Built on top of #10981 

Fix #10267
TomAFrench added a commit that referenced this pull request Jan 9, 2025
* master: (287 commits)
  feat: Sync from noir (#11051)
  chore(docs): Update tx concepts page (#10947)
  chore(docs): Edit Aztec.nr Guide section (#10866)
  chore: test:e2e defaults to no-docker (#10966)
  chore(avm): improve column stats (#11135)
  chore: Sanity checking of proving job IDs (#11134)
  feat: permutation argument optimizations  (#10960)
  feat: single tx block root rollup (#11096)
  refactor: prover db config (#11126)
  feat: monitor event loop lag (#11127)
  chore: Greater stability at 1TPS (#10981)
  chore: Jest reporters for CI (#11125)
  fix: Sequencer times out L1 tx at end of L2 slot (#11112)
  feat: browser chunking (#11102)
  fix: Added start/stop guards to running promise and serial queue (#11120)
  fix: Don't retransmit txs upon node restart (#11123)
  fix: Prover node aborts execution at epoch end (#11111)
  feat: blob sink in sandbox without extra process (#11032)
  chore: log number of instructions executed for call in AVM. Misc fix. (#11110)
  git subrepo push --branch=master noir-projects/aztec-nr
  ...
Labels: e2e-all (CI: Enables this CI job.), network-all (Run this CI job.)

Successfully merging this pull request may close these issues: Re-work interface between prover node and prover broker

3 participants