chore: Greater stability at 1TPS #10981

Merged
merged 57 commits into master from pw/ssd on Jan 9, 2025

Conversation

@PhilWindle (Collaborator) commented Dec 31, 2024

This PR contains the following changes:

  1. Updated the version of lmdb-js to fix crashes in the prover broker.
  2. Added DB map size configuration for the prover broker.
  3. Added historical header validation to sequencer validations.
  4. Added heap size configurations to deployments.
  5. Introduced local SSD configurations to deployments.
  6. Configured the exp-1 test scenario with suitable memory and SSD capacity values.
  7. Added new 2- and 4-core node pools with local SSD options.
  8. Serialised operations in the prover broker and the prover node's facade interface.
  9. Fixed a bug where the prover broker was being started twice.

@PhilWindle added the e2e-all and network-all labels on Dec 31, 2024
@spalladino (Collaborator) left a comment:

Looks good, but this definitely needs @alexghr to take a look

storage: "8Gi"
archiverPollingInterval: 1000
archiverViemPollingInterval: 1000
pollInterval: 1000
viemPollingInterval: 1000
dataDir: "/data"
storageSize: "1Gi"
Collaborator:
We have a storage entry defined a few lines above, should we delete it in favor of this one?

@PhilWindle (author):
Done

memory: "5Gi"
cpu: "1.5"
ephemeral-storage: "275Gi"
maxOldSpaceSize: "5120"
Collaborator:
Shouldn't maxOldSpaceSize be slightly below the total memory available?

@PhilWindle (author):
Yeah possibly, will modify

@PhilWindle (author):
Looks like Node.js recommends around 0.5 GB of headroom: https://nodejs.org/api/cli.html#--max-old-space-sizesize-in-mib
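
For illustration, a minimal sketch of that headroom calculation (the helper name and the 512 MiB default are assumptions, not values from this PR):

// Illustrative only: derive a --max-old-space-size value (in MiB) from a pod's
// memory limit, leaving ~512 MiB of headroom for non-heap allocations
// (buffers, stacks, native memory). A 5 GiB limit would then give 4608 MiB of
// heap rather than the full 5120.
function maxOldSpaceSizeMiB(containerMemoryMiB: number, headroomMiB = 512): number {
  return Math.max(containerMemoryMiB - headroomMiB, 256);
}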

import { type ProvingOrchestrator } from '../orchestrator/orchestrator.js';
import { type BrokerCircuitProverFacade } from '../proving_broker/broker_prover_facade.js';

/** Encapsulates the proving orchestrator and the broker facade */
Collaborator:
Curious: why not just make the orchestrator start/stop the facade?

@PhilWindle (author):
Well, the orchestrator just receives a ServerCircuitProver interface, which only contains methods such as:

  getBaseParityProof(
    inputs: BaseParityInputs,
    signal?: AbortSignal,
    epochNumber?: number,
  ): Promise<PublicInputsAndRecursiveProof<ParityPublicInputs, typeof RECURSIVE_PROOF_LENGTH>>;

I felt it is something of an implementation detail that the facade (an instance of a ServerCircuitProver) needs to be started and stopped with each epoch.
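
For illustration, a rough sketch of what such an encapsulating wrapper could look like; all names below are assumptions rather than the actual class added in this PR:

// Minimal sketch: a wrapper owns the facade's per-epoch lifecycle so the
// orchestrator only ever sees a ServerCircuitProver.
interface Startable {
  start(): void;
  stop(): Promise<void>;
}

class EpochProver {
  constructor(private readonly facade: Startable) {}

  async proveEpoch(runOrchestrator: () => Promise<void>): Promise<void> {
    this.facade.start();
    try {
      await runOrchestrator(); // the orchestrator drives the actual proving
    } finally {
      await this.facade.stop(); // stop the facade even if proving fails
    }
  }
}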

Collaborator:
Makes sense. FWIW I had been lazy in these situations and used a Maybe<Service> type (meaning the dependency may be stoppable) along with a tryStop in the parent:

/** Tries to call stop on a given object and awaits it. Logs any errors and does not rethrow. */
export async function tryStop(service: Maybe<Service>, logger?: Logger): Promise<void> {
  try {
    return typeof service === 'object' && service && 'stop' in service && typeof service.stop === 'function'
      ? await service.stop()
      : Promise.resolve();
  } catch (err) {
    logger?.error(`Error stopping service ${(service as object).constructor?.name}: ${err}`);
  }
}

Less clean, but it saves having to add another object just to manage the dependencies' lifecycle.
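
A hypothetical usage sketch of tryStop; the Maybe, Service and Logger shapes below are stand-ins for the repo's types, not their real definitions:

type Service = { stop(): Promise<void> | void };
type Maybe<T> = T | undefined; // stand-in; the real type may also allow non-stoppable objects
type Logger = { error(msg: string): void };

class ParentComponent {
  constructor(private readonly dependency: Maybe<Service>, private readonly log?: Logger) {}

  async stop(): Promise<void> {
    // The parent does not need to know whether its dependency is stoppable.
    await tryStop(this.dependency, this.log);
  }
}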

Collaborator:
(just in case: I'm not advocating for one approach or the other!)

private jobs: Map<ProvingJobId, ProvingJob> = new Map();
private runningPromise?: RunningPromise;
private timeOfLastSnapshotSync = Date.now();
private queue?: SerialQueue = new SerialQueue();
Collaborator:
Why the ?, given it's initialized on construction?

@PhilWindle (author):
Yeah that was a mistake. Nice catch.

Comment on lines 543 to 554
let numCleanups = 1;
let numEnqueue = 1;
let remaining = await this.requestQueue.put(() => this.cleanupStaleJobs());
while (remaining) {
remaining = await this.requestQueue.put(() => this.cleanupStaleJobs());
numCleanups++;
}
remaining = await this.requestQueue.put(() => this.reEnqueueExpiredJobs());
while (remaining) {
remaining = await this.requestQueue.put(() => this.reEnqueueExpiredJobs());
numEnqueue++;
}
Collaborator:
Not new on this PR, but shouldn't we re-enqueue jobs before cleaning up stale ones? Seems like re-enqueuing is more time-pressing.

@PhilWindle (author):
This has now changed. Cleaning up stale jobs no longer touches the DB, so it should not incur any latency. Neither does re-enqueueing jobs.

The DB cleanup is now performed after both of these operations.
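
A minimal sketch of that ordering; cleanupDatabase is an assumed name for the deferred DB-touching step, while the other two method names come from the snippet above:

// In-memory passes run first; the single DB-touching step runs once at the end.
async function runMaintenance(broker: {
  cleanupStaleJobs(): Promise<void>;      // in-memory only
  reEnqueueExpiredJobs(): Promise<void>;  // in-memory only
  cleanupDatabase(): Promise<void>;       // assumed name for the deferred DB cleanup
}): Promise<void> {
  await broker.cleanupStaleJobs();
  await broker.reEnqueueExpiredJobs();
  await broker.cleanupDatabase();
}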

*/
private epochHeight = 0;
private maxEpochsToKeepResultsFor = 1;

private requestQueue: SerialQueue = new SerialQueue();
Collaborator:
Curious: what prompted using a serial queue here?

@PhilWindle (author):
Stability. Without the queue we effectively have an unbounded number of concurrent write transactions against the database. It's limited by the number of prover agents, but that could be thousands. LMDB only allows one write transaction at a time; it's the JS wrapper that does a lot of work behind the scenes to give the illusion of concurrency, and I worry about its stability under such heavy concurrent access.

It may be that as we try to scale to larger epochs, writing one job/result at a time is insufficient. However, I think a better strategy there would be to still perform writes sequentially but write batches of updates instead of just one.
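
A minimal sketch of the serialisation idea, assuming a promise-chaining SerialQueue (the real class almost certainly differs): every write funnels through one queue, so LMDB only ever sees a single write transaction at a time.

class SerialQueue {
  private tail: Promise<unknown> = Promise.resolve();

  // Runs fn after everything previously put on the queue has settled.
  put<T>(fn: () => Promise<T>): Promise<T> {
    const result = this.tail.then(fn);
    this.tail = result.catch(() => undefined); // keep the chain alive on errors
    return result;
  }
}

class JobStore {
  private readonly requestQueue = new SerialQueue();

  constructor(private readonly db: { put(id: string, value: unknown): Promise<void> }) {}

  // Thousands of agents can call this concurrently; writes still hit the DB one at a time.
  saveResult(id: string, result: unknown): Promise<void> {
    return this.requestQueue.put(() => this.db.put(id, result));
  }
}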

Comment on lines +156 to +157
// Job was not enqueued. It must be completed already, add to our set of already completed jobs
this.jobsToRetrieve.add(id);
Collaborator:
Is it possible the job was just enqueued twice, so it is not yet complete? IIUC the broker will return false on enqueue if it already has a job with the same id, regardless of whether that job is finished or not.

@PhilWindle (author):
You are correct. This is now refactored to return better information as to whether the job was enqueued or not.
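
A hypothetical shape for that richer return value; the names below are assumptions, not the actual refactor:

type EnqueueResult =
  | { status: 'enqueued' }   // accepted and queued
  | { status: 'duplicate' }  // a job with this id is already queued or running
  | { status: 'completed' }; // a job with this id already finished

async function trackJob(
  id: string,
  enqueue: (id: string) => Promise<EnqueueResult>,
  jobsToRetrieve: Set<string>,
): Promise<void> {
  const { status } = await enqueue(id);
  // Only treat the job as done when the broker says so, not merely because
  // the enqueue was rejected as a duplicate.
  if (status === 'completed') {
    jobsToRetrieve.add(id);
  }
}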

if (output.type === type) {
return output.result as ProvingJobResultsMap[T];
if (output.type === jobType) {
return { result: output.result as ProvingJobResultsMap[T], success: true, reason: '' };
Collaborator:
Suggested change
return { result: output.result as ProvingJobResultsMap[T], success: true, reason: '' };
return { result: output.result as ProvingJobResultsMap[T], success: true };

Nit: let's not use empty strings instead of null/undefined.

@PhilWindle (author):
Done

@alexghr (Contributor) left a comment:
LGTM


export const getEpochFromProvingJobId = (id: ProvingJobId) => {
  const components = id.split(':');
  return +components[0];
Contributor:
Not for now, but it might be worth throwing here if this is not a number (the only benefit would be catching tests that use old-style IDs).
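
A possible stricter version along those lines (the guard and error message are assumptions, and ProvingJobId is treated as a plain string here):

type ProvingJobId = string; // assumed shape: `${epochNumber}:${rest}`

export const getEpochFromProvingJobId = (id: ProvingJobId): number => {
  const [epoch] = id.split(':');
  const value = Number(epoch);
  if (!Number.isInteger(value) || value < 0) {
    throw new Error(`Proving job id "${id}" does not start with an epoch number`);
  }
  return value;
};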

// keep retrying until we time out
}
// Job was not enqueued. It must be completed already, add to our set of already completed jobs
this.jobsToRetrieve.add(id);
Contributor:
On the new else branch (now that we have a status), we technically have the result in the status object and could resolve immediately, but that's an optimization for another time :)

@PhilWindle PhilWindle merged commit 1c23662 into master Jan 9, 2025
78 checks passed
@PhilWindle PhilWindle deleted the pw/ssd branch January 9, 2025 16:14
alexghr added a commit that referenced this pull request Jan 9, 2025
This PR refactors the types of the prover broker and agent config to
reuse more of the existing helpers.

Built on top of #10981 

Fix #10267
TomAFrench added a commit that referenced this pull request Jan 9, 2025
* master: (287 commits)
  feat: Sync from noir (#11051)
  chore(docs): Update tx concepts page (#10947)
  chore(docs): Edit Aztec.nr Guide section (#10866)
  chore: test:e2e defaults to no-docker (#10966)
  chore(avm): improve column stats (#11135)
  chore: Sanity checking of proving job IDs (#11134)
  feat: permutation argument optimizations  (#10960)
  feat: single tx block root rollup (#11096)
  refactor: prover db config (#11126)
  feat: monitor event loop lag (#11127)
  chore: Greater stability at 1TPS (#10981)
  chore: Jest reporters for CI (#11125)
  fix: Sequencer times out L1 tx at end of L2 slot (#11112)
  feat: browser chunking (#11102)
  fix: Added start/stop guards to running promise and serial queue (#11120)
  fix: Don't retransmit txs upon node restart (#11123)
  fix: Prover node aborts execution at epoch end (#11111)
  feat: blob sink in sandbox without extra process (#11032)
  chore: log number of instructions executed for call in AVM. Misc fix. (#11110)
  git subrepo push --branch=master noir-projects/aztec-nr
  ...
Labels: e2e-all (CI: Enables this CI job.), network-all (Run this CI job.)

Successfully merging this pull request may close these issues: Re-work interface between prover node and prover broker

3 participants