chore: Greater stability at 1TPS #10981
Conversation
Looks good, but this definitely needs @alexghr to take a look
storage: "8Gi" | ||
archiverPollingInterval: 1000 | ||
archiverViemPollingInterval: 1000 | ||
pollInterval: 1000 | ||
viemPollingInterval: 1000 | ||
dataDir: "/data" | ||
storageSize: "1Gi" |
We have a storage entry defined a few lines above, should we delete it in favor of this one?
Done
memory: "5Gi" | ||
cpu: "1.5" | ||
ephemeral-storage: "275Gi" | ||
maxOldSpaceSize: "5120" |
Shouldn't maxOldSpaceSize be slightly below the total memory available?
Yeah possibly, will modify
Looks like Node.js recommends leaving 0.5GB or so of headroom. https://nodejs.org/api/cli.html#--max-old-space-sizesize-in-mib
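For illustration only, a minimal sketch of picking the flag value with that headroom; the helper name and the 512 MiB default are made up for this example and are not part of the chart:

// Hypothetical helper, not from this PR: derive a --max-old-space-size value
// (in MiB) that leaves roughly 512 MiB of headroom below the container limit.
function maxOldSpaceSizeMib(containerMemoryGi: number, headroomMib = 512): number {
  return containerMemoryGi * 1024 - headroomMib;
}

// A 5Gi container would then get --max-old-space-size=4608 instead of 5120.
console.log(maxOldSpaceSizeMib(5)); // 4608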
import { type ProvingOrchestrator } from '../orchestrator/orchestrator.js';
import { type BrokerCircuitProverFacade } from '../proving_broker/broker_prover_facade.js';

/** Encapsulates the proving orchestrator and the broker facade */
Curious: why not just make the orchestrator start/stop the facade?
Well, the orchestrator just receives a ServerCircuitProver interface. This just contains methods such as:
getBaseParityProof(
  inputs: BaseParityInputs,
  signal?: AbortSignal,
  epochNumber?: number,
): Promise<PublicInputsAndRecursiveProof<ParityPublicInputs, typeof RECURSIVE_PROOF_LENGTH>>;
I felt that it is something of an implementation detail that the facade (an instance of a ServerCircuitProver) needs to be started and stopped with each epoch.
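A hedged sketch of the encapsulation approach being described; the class name, the stand-in interface, and the start/stop shapes are assumptions for illustration, not the PR's actual code:

// Rough sketch: a thin wrapper owns both pieces, so the facade's per-epoch
// start/stop never leaks into the orchestrator, which only ever sees a
// ServerCircuitProver.

// Minimal stand-in interface, for illustration only.
interface FacadeLike {
  start(): void;
  stop(): Promise<void>;
}

class ProverPipeline /* hypothetical name */ {
  constructor(
    private orchestrator: unknown, // e.g. the ProvingOrchestrator
    private facade: FacadeLike,    // e.g. the BrokerCircuitProverFacade
  ) {}

  public start(): void {
    this.facade.start();
  }

  public async stop(): Promise<void> {
    await this.facade.stop();
  }
}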
Makes sense. FWIW I had been lazy in these situations and used a Maybe<Service> type (meaning the dependency may be stoppable) along with a tryStop in the parent:
aztec-packages/yarn-project/circuit-types/src/interfaces/service.ts
Lines 16 to 25 in 4600f54
/** Tries to call stop on a given object and awaits it. Logs any errors and does not rethrow. */
export async function tryStop(service: Maybe<Service>, logger?: Logger): Promise<void> {
  try {
    return typeof service === 'object' && service && 'stop' in service && typeof service.stop === 'function'
      ? await service.stop()
      : Promise.resolve();
  } catch (err) {
    logger?.error(`Error stopping service ${(service as object).constructor?.name}: ${err}`);
  }
}
Less clean, but it saves us from having to add another object just to manage the dependency's lifecycle.
(just in case: I'm not advocating for one approach or the other!)
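For reference, a hedged sketch of how that reads from the parent's side; the import path is an assumption based on the file referenced above:

// Illustrative usage only: a parent that may hold a stoppable dependency just
// calls tryStop from its own stop(), rather than introducing a dedicated
// object to manage the dependency's lifecycle.
import { tryStop, type Maybe, type Service } from '@aztec/circuit-types'; // path assumed

class ParentComponent {
  constructor(private prover: Maybe<Service>) {}

  public async stop(): Promise<void> {
    await tryStop(this.prover); // no-op (and no throw) if the dependency has no stop()
  }
}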
private jobs: Map<ProvingJobId, ProvingJob> = new Map();
private runningPromise?: RunningPromise;
private timeOfLastSnapshotSync = Date.now();
private queue?: SerialQueue = new SerialQueue();
Why the ?, given it's initialized on construction?
Yeah that was a mistake. Nice catch.
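Presumably the fix is simply to drop the optional marker:

private queue: SerialQueue = new SerialQueue();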
let numCleanups = 1;
let numEnqueue = 1;
let remaining = await this.requestQueue.put(() => this.cleanupStaleJobs());
while (remaining) {
  remaining = await this.requestQueue.put(() => this.cleanupStaleJobs());
  numCleanups++;
}
remaining = await this.requestQueue.put(() => this.reEnqueueExpiredJobs());
while (remaining) {
  remaining = await this.requestQueue.put(() => this.reEnqueueExpiredJobs());
  numEnqueue++;
}
Not new in this PR, but shouldn't we re-enqueue jobs before cleaning up stale ones? Re-enqueuing seems more time-sensitive.
This has now changed. Cleaning up stale jobs no longer touches the DB, so it should not incur any latency; neither does re-enqueueing jobs.
The DB cleanup is now performed after both of these operations.
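A simplified sketch of that ordering, under stated assumptions: the method names mirror the snippet above, and cleanupDb is a made-up name for the deferred database pass.

// Simplified, hypothetical sketch: both in-memory passes run first, and the
// database cleanup only happens once they have finished.
async function runMaintenance(broker: {
  cleanupStaleJobs(): Promise<void>;
  reEnqueueExpiredJobs(): Promise<void>;
  cleanupDb(): Promise<void>;
}): Promise<void> {
  await broker.cleanupStaleJobs();     // in-memory only, no DB latency
  await broker.reEnqueueExpiredJobs(); // in-memory only
  await broker.cleanupDb();            // DB writes deferred until both are done
}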
 */
private epochHeight = 0;
private maxEpochsToKeepResultsFor = 1;

private requestQueue: SerialQueue = new SerialQueue();
Curious: what prompted using a serial queue here?
Stability. Without the queue we effectively have an unbounded number of concurrent write transactions against the database. It's limited by the number of prover agents, but that could be thousands. LMDB only allows one write transaction at a time; it's the JS wrapper that does a lot of work behind the scenes to give the illusion of concurrency, and I worry about its stability under such heavy concurrent access.
It may be that as we try to scale to larger epochs, writing one job/result at a time becomes insufficient. However, I think a better strategy there would be to still perform writes sequentially but write batches of updates instead of just one.
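To make the write-serialisation concrete, a sketch under the assumption that every database write is funnelled through the one SerialQueue; the class, the local stand-in interface, and the saveResult method are invented for this example.

// Local stand-in for the SerialQueue API used in the snippets above.
interface SerialQueueLike {
  put<T>(fn: () => Promise<T>): Promise<T>;
}

class BrokerWriter {
  constructor(
    private requestQueue: SerialQueueLike,
    private db: { saveResult(id: string, result: unknown): Promise<void> },
  ) {}

  // Any number of prover agents may call this concurrently; the queue ensures
  // the underlying LMDB store only ever sees one write transaction at a time.
  public reportResult(id: string, result: unknown): Promise<void> {
    return this.requestQueue.put(() => this.db.saveResult(id, result));
  }
}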
// Job was not enqueued. It must be completed already, add to our set of already completed jobs
this.jobsToRetrieve.add(id);
Is it possible the job was just enqueued twice, so it is not yet complete? IIUC the broker will return false if it already has a job with the same id, regardless of whether that job has finished or not.
You are correct. This is now refactored to return better information as to whether the job was enqueued or not.
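A hypothetical shape for that richer return value; the type, status names, and helper below are assumptions for illustration, not the refactored code itself.

// Instead of a bare boolean, the broker could report whether the job was newly
// enqueued, is already queued under the same id, or has already completed.
type EnqueueResult =
  | { status: 'enqueued' }
  | { status: 'duplicate' }   // same id already queued, not yet finished
  | { status: 'completed' };  // result is already available for retrieval

function onEnqueue(id: string, res: EnqueueResult, jobsToRetrieve: Set<string>): void {
  // Only mark the job as ready to retrieve when the broker says it finished,
  // not merely because it rejected a duplicate enqueue.
  if (res.status === 'completed') {
    jobsToRetrieve.add(id);
  }
}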
-      if (output.type === type) {
-        return output.result as ProvingJobResultsMap[T];
+      if (output.type === jobType) {
+        return { result: output.result as ProvingJobResultsMap[T], success: true, reason: '' };
-      return { result: output.result as ProvingJobResultsMap[T], success: true, reason: '' };
+      return { result: output.result as ProvingJobResultsMap[T], success: true };
Nit: let's not use empty strings instead of null/undefined.
Done
LGTM
export const getEpochFromProvingJobId = (id: ProvingJobId) => {
  const components = id.split(':');
  return +components[0];
Not for now, but it might be worth throwing here if this is not a number (the only benefit would be catching tests that use old-style IDs).
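One possible shape of that guard, sketched with a plain string id rather than the ProvingJobId type so it stands alone:

// Throw instead of silently returning NaN when the id prefix is not an epoch
// number, e.g. for old-style ids still used in tests.
export const getEpochFromProvingJobId = (id: string) => {
  const epoch = +id.split(':')[0];
  if (Number.isNaN(epoch)) {
    throw new Error(`Invalid proving job id: ${id}`);
  }
  return epoch;
};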
  // keep retrying until we time out
}
// Job was not enqueued. It must be completed already, add to our set of already completed jobs
this.jobsToRetrieve.add(id);
On the new else branch (now that we have status) we technically have the result in the status object and can resolve immediately, but that's an optimization for another time :)
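Sketched out, that optimisation might look something like the following; the status type and the function are assumptions, not code from this PR.

// If the status returned alongside a rejected enqueue already carries the
// result, resolve the caller immediately instead of going through the usual
// retrieval pass.
type JobStatus = { state: 'pending' } | { state: 'completed'; result: unknown };

function onEnqueueRejected(
  id: string,
  status: JobStatus,
  resolve: (result: unknown) => void,
  jobsToRetrieve: Set<string>,
): void {
  if (status.state === 'completed') {
    resolve(status.result); // result already in hand, no extra round trip
  } else {
    jobsToRetrieve.add(id); // fall back to the normal retrieval path
  }
}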
* master: (287 commits)
  feat: Sync from noir (#11051)
  chore(docs): Update tx concepts page (#10947)
  chore(docs): Edit Aztec.nr Guide section (#10866)
  chore: test:e2e defaults to no-docker (#10966)
  chore(avm): improve column stats (#11135)
  chore: Sanity checking of proving job IDs (#11134)
  feat: permutation argument optimizations (#10960)
  feat: single tx block root rollup (#11096)
  refactor: prover db config (#11126)
  feat: monitor event loop lag (#11127)
  chore: Greater stability at 1TPS (#10981)
  chore: Jest reporters for CI (#11125)
  fix: Sequencer times out L1 tx at end of L2 slot (#11112)
  feat: browser chunking (#11102)
  fix: Added start/stop guards to running promise and serial queue (#11120)
  fix: Don't retransmit txs upon node restart (#11123)
  fix: Prover node aborts execution at epoch end (#11111)
  feat: blob sink in sandbox without extra process (#11032)
  chore: log number of instructions executed for call in AVM. Misc fix. (#11110)
  git subrepo push --branch=master noir-projects/aztec-nr
  ...
This PR contains the following changes: