Tenure extends for increasing tenure budgets #5476
-
Can you expand on this - specifically, why would the timer only reset when a block with contract calls is processed? Wouldn't that encourage spikiness? My assumption would be that this timer resets when a block includes a TenureExtend.
-
Yes, I think you’re right, the timer should start when a tenure begins (or the extension begins: basically the timer should start whenever there's a tenure change payload). However, the signers must also do some metering according to the block evaluation time. Right now, it's often the case that the tenure budget is expended with just a few seconds of evaluation, but the cost tracker is an imperfect (and pessimistic) estimator of runtime. Things like cache locality in the MARF definitely impact block evaluation time, and so the signers should take that into account. This is simple enough for them to do naively by just tracking the wall clock time of processing the block proposals. I think the way to do this is to "bump" the budget timer by the amount of time they spend processing proposals during the tenure: so if a proposal takes them 1 minute to evaluate, they bump the budget timer by 1 minute (so if they would have allowed the budget to be reset at time T, they would now allow it at T plus 1 minute).
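To make the bookkeeping concrete, here's a minimal Rust sketch of that "bump the budget timer" idea. All names here (`BudgetTimer`, `bump`, `allows_extend`) are made up for illustration, not the actual signer code:

```rust
use std::time::{Duration, Instant};

/// Hypothetical signer-side budget timer that is "bumped" by the wall-clock
/// time spent evaluating block proposals during the tenure.
struct BudgetTimer {
    /// When the current tenure (or tenure extension) began.
    tenure_start: Instant,
    /// Base timeout before the signer will approve a budget-refreshing extend.
    extend_after: Duration,
    /// Extra time accumulated from evaluating block proposals.
    evaluation_overhead: Duration,
}

impl BudgetTimer {
    fn new(extend_after: Duration) -> Self {
        Self {
            tenure_start: Instant::now(),
            extend_after,
            evaluation_overhead: Duration::ZERO,
        }
    }

    /// Called after each proposal is validated, with the time it took.
    /// A proposal that takes 1 minute to evaluate pushes the allowed
    /// extend time out by 1 minute.
    fn bump(&mut self, evaluation_time: Duration) {
        self.evaluation_overhead += evaluation_time;
    }

    /// Allow a tenure extend only once the base timeout plus the
    /// accumulated evaluation time has elapsed.
    fn allows_extend(&self) -> bool {
        self.tenure_start.elapsed() >= self.extend_after + self.evaluation_overhead
    }
}
```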
-
Could you have signers track based on the last time they saw a tenure change payload rather than a contract call? I ask because the tenure change payload is always guaranteed to be the very first transaction in the block, so it might be easier to track that rather than the last block with a contract call.
-
I had not thought about it this way, but I like the idea of factoring in the actual block processing time. To put it another way, we can think of it as the signer saying, "once I have seen X minutes of downtime, I will allow a tenure extend." So when the tenure starts or extends, I start a countdown at X minutes. When I get a block proposal, I pause the countdown, process the block, send my signature, and resume the countdown. When my countdown reaches 0, I will allow a tenure extension in the next block I process. One simple way to synchronize between miners and signers could be for the signer to include a flag in its block signature message indicating to the miner that it is ready for a tenure extend. When the miner sees that enough signers have set this flag, it can go ahead and issue one in its next block.
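A rough Rust sketch of that pausable countdown, just to make the idea concrete; all names (`IdleCountdown`, `pause`, `resume`, `expired`) are hypothetical, not the actual signer implementation:

```rust
use std::time::{Duration, Instant};

/// The clock only runs while the signer is idle, so time spent validating
/// proposals does not count toward the tenure-extend timeout.
struct IdleCountdown {
    /// Time remaining as of the last pause.
    remaining: Duration,
    /// Set while the countdown is running.
    running_since: Option<Instant>,
}

impl IdleCountdown {
    /// Start the countdown when a tenure begins or is extended.
    fn start(timeout: Duration) -> Self {
        Self { remaining: timeout, running_since: Some(Instant::now()) }
    }

    /// Pause while processing a block proposal.
    fn pause(&mut self) {
        if let Some(since) = self.running_since.take() {
            self.remaining = self.remaining.saturating_sub(since.elapsed());
        }
    }

    /// Resume after the proposal has been validated and signed.
    fn resume(&mut self) {
        if self.running_since.is_none() {
            self.running_since = Some(Instant::now());
        }
    }

    /// Once this returns true, the signer would allow a tenure extension
    /// in the next block it processes.
    fn expired(&self) -> bool {
        match self.running_since {
            Some(since) => since.elapsed() >= self.remaining,
            None => self.remaining.is_zero(),
        }
    }
}
```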
-
Assuming we do some measurement of wall-clock time on block processing, I just want to note that we should have the node track this and return it to the signer in the HTTP response to the block proposal. If the signer tries to track this itself, we can end up with too much variance from other sources of latency.
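As a sketch of what that could look like, assuming a hypothetical response shape (the actual stacks-node proposal-validation response type and field names may differ):

```rust
use std::time::Instant;

// Hypothetical shape of the proposal-validation response; `validation_time_ms`
// is the piece being proposed here.
struct BlockValidateOkSketch {
    signer_signature_hash: [u8; 32],
    /// Wall-clock time the node spent validating the proposal, reported back
    /// to the signer so its idle-time accounting isn't skewed by network or
    /// queuing latency on the signer's side.
    validation_time_ms: u64,
}

fn validate_proposal_sketch(signer_signature_hash: [u8; 32]) -> BlockValidateOkSketch {
    let start = Instant::now();
    // ... the node runs full block validation here ...
    BlockValidateOkSketch {
        signer_signature_hash,
        validation_time_ms: start.elapsed().as_millis() as u64,
    }
}
```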
-
@obycode, to enlist the help of @hstove and @jferrant, can you split this into smaller issues that they can handle in parallel?
-
EDIT (by @aldur): See below for an updated design; this is left here for historical reference. Here are my initial thoughts for the design:

**Overview**

The task here is to allow a miner to extend its tenure based on time since the last tenure extension. The signers decide when a miner is allowed to extend, so we need some mechanism to communicate this between the miner and the signers. I propose adding a field,

**Signer Details**

The signer configuration will specify a tenure extend time period. The first version of this to go live on mainnet should start off with this value defaulting to 10 minutes, to ensure minimal impact on the network. As we validate that these tenure extends do not cause any problems, we can spread the word to signers to iteratively lower this number.

When a new burn block arrives, record the current time,

If a block proposal arrives that contains a

**Miner Details**

The miner now needs to keep track of the signers' current idle time countdowns and decide when it can refresh its budget with an

**Testing**

Designing good integration tests for this new behavior is important. We will need to test several different scenarios:
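For illustration, a minimal sketch of what the signer-side configuration knob could look like; the field name and the conservative 10-minute default are assumptions made for this sketch, not the actual signer config keys:

```rust
use std::time::Duration;

/// Hypothetical signer configuration for time-based tenure extends.
struct SignerConfig {
    /// How long a tenure must run before this signer will approve a
    /// time-based tenure extension.
    tenure_extend_timeout: Duration,
}

impl Default for SignerConfig {
    fn default() -> Self {
        Self {
            // Start conservatively at 10 minutes; operators can lower this as
            // time-based tenure extends prove themselves on mainnet.
            tenure_extend_timeout: Duration::from_secs(10 * 60),
        }
    }
}
```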
-
Updated design after discussion with @jferrant and @hstove:

**Overview**

The task here is to allow a miner to extend its tenure based on time since the last tenure extension. The signers decide when a miner is allowed to extend, so we need some mechanism to communicate this between the miner and the signers. I propose adding a field,

**Signer Details**

The signer configuration will specify a tenure extend time period. The first version of this to go live on mainnet should start off with this value defaulting to something like 5 minutes. As we validate that these tenure extends do not cause any problems, we can spread the word to signers to iteratively lower this number.

When a new burn block arrives, record the current time,

In the

We keep track of "idle" time instead of just flat wall time because it allows the signers to factor in how long it actually takes to process the blocks. This will flatten out the total processing time in scenarios where the cost budgeting is overly pessimistic, which today causes some blocks that spend the entire budget to be processed in 3 seconds, while others that spend the entire budget take 3 minutes to process.

If a block proposal arrives that contains a

**Miner Details**

The miner now needs to keep track of the signers' current idle timestamps and decide when it can refresh its budget with a tenure extension.

A new component will process the StackerDB messages as they arrive, rather than handling them directly in the sign coordinator. This is important because the sign coordinator stops listening for block responses from signers as soon as it hits the 70% threshold, but the miner needs to track the idle timestamps of all signers that report them. This component will be responsible for keeping track of the signers' latest idle timestamps, queryable from the miner. It will also provide the sign coordinator with block signatures.

After each round of signing, the miner should record its estimated time to extend. It can compute this by ordering the signers' reported timestamps in ascending order and selecting the time at which > 70% of the signing power will have passed their timestamp. Before each attempt to mine a block, the miner checks whether this timestamp has passed and, if so, issues the tenure extension.

**Testing**

Designing good integration tests for this new behavior is important. We will need to test several different scenarios:
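To make the "Miner Details" calculation concrete, here is a rough Rust sketch of ordering the signers' reported timestamps and picking the point where more than 70% of the signing power has passed its own timestamp; types and names are illustrative, not the actual miner code:

```rust
/// Given each signer's reported (timestamp in epoch seconds, signer weight),
/// find the earliest time at which more than 70% of the total signing power
/// will allow a tenure extension.
fn estimated_tenure_extend_time(
    mut reports: Vec<(u64, u64)>,
    total_weight: u64,
) -> Option<u64> {
    // Order signers by the time at which each will allow a tenure extend.
    reports.sort_by_key(|(timestamp, _)| *timestamp);

    let threshold = total_weight * 7 / 10; // 70% of signing power
    let mut accumulated = 0u64;
    for (timestamp, weight) in reports {
        accumulated += weight;
        if accumulated > threshold {
            // Once this timestamp passes, > 70% of the weight is past its
            // own timestamp, so the miner can issue the tenure extension.
            return Some(timestamp);
        }
    }
    None
}

/// Checked before each attempt to mine a block.
fn should_extend_now(now_epoch_secs: u64, extend_at: Option<u64>) -> bool {
    extend_at.map_or(false, |t| now_epoch_secs >= t)
}
```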
-
The goal is that the budget makes the block take roughly 30s to process, but in practice, some full blocks can process in 2 seconds while others take 2 minutes to process. By measuring the idle time and allowing the tenure extends based on that, we counteract this discrepancy in actual processing time.
-
Those are fair points. Definitely agree with the idea of treating cache hits differently from cache misses when it can be done. And, that when-condition is where the gaps in my understanding lie.

Not all nodes' disk caches are created equal. Signer A could get 100% cache hits while signer B gets 0% on the same block, and the observed cache hit/miss ratio is determined by a vast set of local configuration parameters we can't control. A signer's locally-measured idle time may not reflect what a bootstrapping node encounters when it replays the block, or even what the signer itself sees if it replays the block at a later date.

Fundamentally, the idle time approach assumes that signers' 30th percentile idle time measurements are a good enough predictor of how much validation wall-clock time other nodes will encounter on a block. This may well be the case, and I'm trying to achieve an understanding of how well-founded this assumption is. In particular, I'm trying to understand how much error there can be in the measurement (and what the bounds on that error will be). The advantage of using a globally-observed resource consumption rate in this case is that nodes at least learn the worst-case processing time.

@kantai and I have spoken about this problem in the past regarding how to discount reads that are cache hits. The conclusion we reached was that the only way to do so safely would be to make the caching strategy part of consensus, so nodes are guaranteed to avoid thrashing for want of a bigger disk cache.
On Mon, Nov 18, 2024, 10:15 PM Brice ***@***.***> wrote:

> I like the idea, but there is still a *lot* of variability in the read/write access times, so the cost value that we report doesn't really tell us much about actual processing time. I'm a bit concerned about counting a block with 15,000 cache misses the same as another block with 15,000 hits. Tracking idle time solves that for us.
>
> It could be that the technique you're suggesting is good enough and the fact that it makes the computation more deterministic could be worth the loss of accuracy.
-
Per the sprint call today, I'm satisfied and 100% onboard with this design. Thank you all for catching me up!
-
The goal of the performance improvements (#5430, #5431, #5432) is to make the `stacks-node` more performant right now, and in particular, to free up CPU time in the `stacks-node` so that even if it is spending more time performing block processing, nodes will continue to be able to stay in sync and responsive on their network interfaces (this is important for stackerdb messages to propagate, signers and miners to stay in sync, etc.).

Once these improvements are in place, the tenure budget can be safely increased. This can (and should) be done without consensus changes, simply by having the miner issue a tenure extend and the signer set approve it.
The basic idea is to have the miner thread time the length of its tenure, with a configuration setting that tells it when it should try to perform a tenure extend. The signer set will similarly hold timers (measured from when they last signed off on a block proposal that spent some amount of block budget: the timer isn’t reset when they sign a block with just transfers, but it is reset if they process a block with, e.g., contract calls), and when a signer's timer expires, it would allow a tenure extend.
This would still allow the “spikiness” in budget consumption that we have today (the spikiness issue is somewhat orthogonal, and would be treated by #5433), but the budget would itself be higher, and the timing of extends would enforce some metering of the spikes (so that contract call budgets would be reset at, e.g., every 5 minutes rather than every bitcoin block, or whatever the timeout is set to). During initial rollout, this timeout will need to be set conservatively, but could be made more aggressive through configuration changes in miners and signers.
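As a rough Rust sketch of that signer-side timer (reset only by blocks that actually spend budget, not by transfer-only blocks); all names here are hypothetical, not the actual signer code:

```rust
use std::time::{Duration, Instant};

/// Hypothetical signer timer for time-based tenure extends.
struct SignerExtendTimer {
    /// Last time this signer signed off on a block that spent block budget.
    last_budget_spend: Instant,
    /// Configured timeout (e.g. 5 minutes) before allowing an extend.
    timeout: Duration,
}

impl SignerExtendTimer {
    fn new(timeout: Duration) -> Self {
        Self { last_budget_spend: Instant::now(), timeout }
    }

    /// Called after signing a block proposal; `spent_budget` is true when the
    /// block consumed execution budget (e.g. contract calls), false for
    /// transfer-only blocks, which do not reset the timer.
    fn on_block_signed(&mut self, spent_budget: bool) {
        if spent_budget {
            self.last_budget_spend = Instant::now();
        }
    }

    /// When this returns true, the signer would approve a tenure extend,
    /// allowing the budget to be refreshed.
    fn allows_extend(&self) -> bool {
        self.last_budget_spend.elapsed() >= self.timeout
    }
}
```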