Update Scheduler to Support Relay Chain Block Number Provider #6362
base: master
Conversation
```rust
@@ -1157,24 +1181,30 @@ impl<T: Config> Pallet<T> {
			return
		}

		let mut incomplete_since = now + One::one();
		let mut when = IncompleteSince::<T>::take().unwrap_or(now);
```
Can you explain why it would not work with `IncompleteSince`, without the block `Queue`?

How do we determine the `MaxScheduledBlocks` bound?

With `IncompleteSince` we iterate over blocks that might have no task to execute, and this might make a situation with many incomplete blocks even worse. But probably not by too much? One more read?

Both solutions need a strategy for the situation when there are too many tasks that cannot be completed and the task queue only grows, if such a strategy is not yet in place.
> With `IncompleteSince` we iterate over blocks that might have no task to execute, and this might make a situation with many incomplete blocks even worse. But probably not by too much? One more read?
Yes, but then this becomes unbounded in case too many blocks are skipped. The idea behind using the `Queue` is to bound this to a sufficient number.
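To make the unboundedness concrete, here is a simplified stand-alone sketch of the catch-up loop. This is not the pallet's exact code: `service_agenda` is a stand-in for per-block agenda servicing, block numbers are assumed `u32`, and `max` is a simplified execution budget.

```rust
fn service_agenda(_when: u32, executed: &mut u32) -> bool {
    // In the pallet this would read `Agenda` for the block and execute its
    // due tasks; here it just consumes one unit of budget and "succeeds".
    *executed += 1;
    true
}

/// Returns the new `IncompleteSince` value, if any work remains.
fn service_agendas(stored_incomplete_since: Option<u32>, now: u32, max: u32) -> Option<u32> {
    let mut when = stored_incomplete_since.unwrap_or(now);
    let mut incomplete_since = now + 1;
    let mut executed = 0;
    // Walks EVERY block from `when` up to `now`, including blocks that have
    // no agenda at all: after a long gap, the number of iterations (and
    // storage reads) grows with the gap, limited only by the `max` budget.
    while when <= now && executed < max {
        if !service_agenda(when, &mut executed) {
            incomplete_since = incomplete_since.min(when);
        }
        when += 1;
    }
    // Anything not reached stays incomplete for the next block.
    (when <= now || incomplete_since <= now).then_some(incomplete_since.min(when))
}

fn main() {
    // A chain that skipped 1_000 blocks must still walk them all, a budget
    // of 64 per block at a time.
    assert_eq!(service_agendas(Some(0), 1_000, 64), Some(64));
}
```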
> How do we determine the `MaxScheduledBlocks` bound?
This should be determined similarly to the existing `MaxScheduledPerBlock`?
> Both solutions need a strategy for the situation when there are too many tasks that cannot be completed and the task queue only grows, if such a strategy is not yet in place.
There is already a retry mechanism and the task is purged if the retry count is exceeded (even if failed).
The `Queue` not only bounds how many blocks will be processed from the past; it also bounds how many blocks we can schedule for. If the number is 50, we can schedule only 50 jobs with distinct schedule times.

`MaxScheduledPerBlock` seems simpler to define to me, because block size is an existing constraint the system has. But how many distinct schedule time points you can have is something new.

Retries work in the case where a certain task fails while its function call is being executed (not when the scheduler fails). I meant a case where there are many (or few but too heavy) overdue tasks (`task_block < now`), so that the scheduler never completes them (or needs too much time) and cannot exit such an overdue state to start processing tasks in time. Do we handle such a case?
> The `Queue` not only bounds how many blocks will be processed from the past; it also bounds how many blocks we can schedule for. If the number is 50, we can schedule only 50 jobs with distinct schedule times.
Indeed, I do not find it quite comfortable to run a `for` loop with `IncompleteSince` when there could be an unknown number of blocks passed between successive runs. You could always keep `MaxScheduledBlocks` on the higher side, which would give you a similar experience?
> I meant a case where there are many (or few but too heavy) overdue tasks (`task_block < now`), so that the scheduler never completes them (or needs too much time) and cannot exit such an overdue state to start processing tasks in time. Do we handle such a case?
But this stays as an issue even in the current implementation? The change here just makes it bounded, so that the scheduling itself is blocked in such a case.
Maybe we can put quite a big bound on `MaxScheduledBlocks`; it is just a vec of block numbers.
Playing devil's advocate here: there could be parachains that only produce one block every two hours, which would get stuck without ever catching up the `IncompleteSince`.

I suggest focusing primarily on our use case, where Asset Hub sets up a scheduler with a Relay Chain block provider. If we can solve this problem with minimal code changes, or better, no code changes, I would prefer that approach. We can document our expectations for the `BlockProvider` type next to its declaration, and if we or someone else encounter the use case you described in the future, we can address it then.

I wouldn't complicate this pallet for theoretically possible use cases. Instead, we should target this issue for resolution in the next SDK release.
If we constrain our use case, then we can:

1. Keep it as it was, iterating on every block.
2. Bring it one level faster: a new storage map that maps from `block_number / 256` to a bit-array of size 256, and we iterate on this map. That means we iterate once per 256 blocks, i.e. 25.6 minutes, so we only have to iterate 56 times to cover one day, ~1700 for one month. PoV cost is only 256 bits (+ keys) for our Asset Hub use case. PoV gets bad if we read many empty values, but I guess it is ok.
3. (Implement a priority queue on top of chunks in a storage map.)

I suggest 1 or 2; it is good for almost everything IMO.
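As a rough, hypothetical sketch of option 2: the names (`ScheduledChunks`, `mark`, `next_from`) are made up for illustration, a `HashMap` stands in for what would be a FRAME storage map from a `u32` chunk index to a 32-byte value, and block numbers are assumed `u32`.

```rust
use std::collections::HashMap;

/// Stand-in for the proposed storage map: `block_number / 256` maps to a
/// 256-bit bitmap of which blocks in that chunk have a scheduled agenda.
#[derive(Default)]
struct ScheduledChunks(HashMap<u32, [u8; 32]>);

impl ScheduledChunks {
    /// Mark `block` as having a scheduled agenda.
    fn mark(&mut self, block: u32) {
        let bits = self.0.entry(block / 256).or_insert([0u8; 32]);
        let bit = (block % 256) as usize;
        bits[bit / 8] |= 1 << (bit % 8);
    }

    /// Next scheduled block >= `from`, scanning at most `max_chunks` chunks.
    /// Each chunk covers 256 blocks (~25.6 min of 6s blocks), so one day of
    /// history costs ~56 reads of 32 bytes each.
    fn next_from(&self, from: u32, max_chunks: u32) -> Option<u32> {
        (0..max_chunks).find_map(|i| {
            let chunk = from / 256 + i;
            let bits = self.0.get(&chunk)?; // empty chunk: skip to the next
            (0..256usize).find_map(|bit| {
                let block = chunk * 256 + bit as u32;
                let set = (bits[bit / 8] & (1u8 << (bit % 8))) != 0;
                (set && block >= from).then_some(block)
            })
        })
    }
}

fn main() {
    let mut chunks = ScheduledChunks::default();
    chunks.mark(300);
    chunks.mark(1000);
    assert_eq!(chunks.next_from(0, 56), Some(300));
    assert_eq!(chunks.next_from(301, 56), Some(1000));
}
```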
> I suggest focusing primarily on our use case, where Asset Hub sets up a scheduler with a Relay Chain block provider.
But even in that case, the thing I mentioned can happen. There just is no guarantee as to how much history AH needs to check per block. So we either bound it with a queue or not. Maybe (2) from Gui is a reasonable middle ground, but it still removes the property of the scheduler that every scheduled agenda will eventually be serviced, since it could miss some.

If we only want to change it for AHM, then maybe forking the pallet into the runtimes repo would work; otherwise we may screw parachain teams by cutting corners here.
> Maybe (2) from Gui is a reasonable middle ground, but it still removes the property of the scheduler that every scheduled agenda will eventually be serviced, since it could miss some.
I missed this point, what can be missed?
I agree we should remove `MaxScheduledBlocks` and `MaxStaleTaskAge` and note in the pallet description that performance can significantly decrease if the chain's blocks are too far apart.

As a note, I think a priority queue on top of chunks should not be so hard to implement, and is probably the best solution.
substrate/frame/scheduler/src/lib.rs (Outdated)

```rust
#[pallet::storage]
pub type IncompleteSince<T: Config> = StorageValue<_, BlockNumberFor<T>>;

/// Provider for the block number. Normally this is the `frame_system` pallet.
```
Normally in what case? Parachain or relay/solo?
substrate/frame/scheduler/src/lib.rs (Outdated)

```rust
/// Provider for the block number. Normally this is the `frame_system` pallet.
type BlockNumberProvider: BlockNumberProvider;

/// The maximum number of blocks that can be scheduled.
```
Any hints on how to configure this? Parachain teams will read this and not know what number to put.
substrate/frame/scheduler/src/lib.rs (Outdated)

```rust
#[pallet::constant]
type MaxScheduledBlocks: Get<u32>;

/// The maximum number of blocks that a task can be stale for.
```
Also maybe a hint for a sane default value.
substrate/frame/scheduler/src/lib.rs (Outdated)

```rust
/// The queue of block numbers that have scheduled agendas.
#[pallet::storage]
pub(crate) type Queue<T: Config> =
	StorageValue<_, BoundedVec<BlockNumberFor<T>, T::MaxScheduledBlocks>, ValueQuery>;
```
Do we know if one vector is enough? I think the referenda pallet creates an alarm for each ref...
Not sure if I get this, can you elaborate more?
Is it okay if I convert it to a vector of vectors?
A vector of vectors would be the same as a vector in regards to PoV. I think if we need to make a lot of schedules we will have to use multiple storage items.
For our use case (Governance) it is probably fine to have just one vector, I guess, since we only put a single block number in there. So we could reasonably have 1000 in there or so. But yes, if someone uses it for other things, then it could cause high PoV on a parachain.
```rust
@@ -1157,24 +1181,30 @@ impl<T: Config> Pallet<T> {
			return
		}

		let mut incomplete_since = now + One::one();
		let mut when = IncompleteSince::<T>::take().unwrap_or(now);
```
Yes, the referenda pallet creates an alarm for every ref to check the voting turnout.

We have a problem with the (3.2) case only. In the current version (without `Queue`) it will eventually handle the overdue blocks (we can even calculate how many blocks it will take, let's say if there are no tasks scheduled in that period).

Depends on how many blocks are produced. I guess when we assume that the parachain will produce blocks at least as fast as it can advance the scheduler, then yes.

> Playing devil's advocate here: there could be parachains that only produce one block every two hours, which would get stuck without ever catching up the `IncompleteSince`.

Conceptually, I believe that a priority queue is the right data structure. We try to evaluate an ordered list of tasks by their order; that is exactly what a priority queue is good at. The issue with implementing this as a vector is obviously the PoV.

Maybe we can implement the `Queue` as a B-tree? Then we can get the next task in log reads and insert in log writes. And it allows us to do exactly what we want: get the next pending task. It could be PoV-optimized by using chunks as well.

To me it just seems that most of the pain here is that we are using the wrong data structure for the job.
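For what it is worth, std's `BTreeSet` already models the two operations such a priority queue needs, both in logarithmic time; a minimal in-memory illustration (an on-chain version would still have to chunk nodes into storage items to keep PoV bounded):

```rust
use std::collections::BTreeSet;

fn main() {
    let mut queue: BTreeSet<u32> = [120u32, 45, 300, 80].into_iter().collect();
    let now = 100u32;

    // Next pending (due) task: the smallest scheduled block <= now.
    let next_due = queue.range(..=now).next().copied();
    assert_eq!(next_due, Some(45));

    // Next upcoming task: the smallest scheduled block > now.
    let next_upcoming = queue.range(now + 1..).next().copied();
    assert_eq!(next_upcoming, Some(120));

    // Insert is O(log n) as well.
    queue.insert(99);
    assert_eq!(queue.range(..=now).next().copied(), Some(45));
}
```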
substrate/frame/scheduler/src/lib.rs (Outdated)

```rust
let mut iter = queue.iter();
let mut to_remove = Vec::new(); // Collect items to remove

// Iterate and collect items to remove
for _ in 0..index {
	iter.next();
}
for item in iter {
	to_remove.push(*item);
}

// Now remove the collected items
for item in to_remove {
	queue.remove(&item);
}
```
Not sure what is going on here, but maybe something like `queue.iter().drain().take(index).map(drop);` would work.
That gave me an error:

```
error[E0599]: no method named `drain` found for mutable reference `&mut BTreeSet<<... as BlockNumberProvider>::BlockNumber>` in the current scope
```
If the interface for bounded vec is too annoying, you can convert to a vec and convert back, because the code is difficult to read.
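Assuming the queue is indeed a `BTreeSet` (as the error above suggests), `BTreeSet::split_off` expresses "take everything before `now`, keep the rest" without any manual index bookkeeping; a minimal sketch:

```rust
use std::collections::BTreeSet;

fn main() {
    // Stand-in for the pallet's queue of block numbers with agendas.
    let mut queue: BTreeSet<u32> = [10u32, 20, 30, 40, 50].into_iter().collect();
    let now = 35u32;

    // `split_off(&now)` returns everything >= now; everything < now
    // (the overdue blocks) stays behind in `queue`.
    let still_scheduled = queue.split_off(&now);
    let overdue = queue;

    for when in &overdue {
        // In the pallet: service (or drop) the agenda at `when` here.
        println!("processing overdue agenda at block {when}");
    }

    // `still_scheduled` would be written back to storage; converting to a
    // BoundedVec would need a fallible `try_into` in the real code.
    assert_eq!(still_scheduled.into_iter().collect::<Vec<_>>(), vec![40, 50]);
}
```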
substrate/frame/scheduler/src/lib.rs (Outdated)

```rust
incomplete_since = incomplete_since.min(when);
let mut index = 0;
let queue_len = queue.len();
for when in queue.iter().skip(index) {
```
You can store the iterator and make it mutable to call `next` on it. Something like `while let Some(value) = iter.next() {`.
The error was:

```
error[E0608]: cannot index into a value of type `BTreeSet<<<T as pallet::Config>::BlockNumberProvider as sp_runtime::traits::BlockNumberProvider>::BlockNumber>`
```

But I will try something similar again and see if that works.
substrate/frame/scheduler/src/lib.rs (Outdated)

```rust
let mut when = IncompleteSince::<T>::take().unwrap_or(now);
let mut executed = 0;
let queue = Queue::<T>::get();
let end_index = match queue.iter().position(|&x| x >= now) {
```
Why is this needed? There are only two cases: either the first element is `>= now` or it is not. We can just check the first element in the loop below and then take it from there.
substrate/frame/scheduler/src/lib.rs (Outdated)

```rust
		return;
	},
};
if end_index == 0 {
```
The +1 above prevents this from ever happening.
substrate/frame/scheduler/src/lib.rs (Outdated)

```rust
let queue_len = queue.len();
for when in queue.iter().skip(index) {
	if *when < now.saturating_sub(T::MaxStaleTaskAge::get()) {
		Agenda::<T>::remove(*when);
```
Why do we need this `MaxStaleTaskAge`? I think it should maybe rather be checked when inserting into an agenda instead of when servicing it, since in the service step we have nothing to lose from executing it. But I am not sure we need it at all.

I am a bit torn here. I often think that just forking the pallets would make our lives easier, since it is not a breaking change anymore... but it could mean increased maintenance. @seadanda @kianenigma any opinion?
```rust
@@ -32,7 +33,10 @@ type SystemCall<T> = frame_system::Call<T>;
type SystemOrigin<T> = <T as frame_system::Config>::RuntimeOrigin;

const SEED: u32 = 0;
const BLOCK_NUMBER: u32 = 2;

fn block_number<T: Config>() -> u32 {
```
Naming it `max_scheduled_blocks` sounds more appropriate.
From this thread: #6362 (comment). I think this structure shouldn't add too much complexity and will solve most parachain use cases. WDYT?
With async backing/elastic scaling, we break the assumption made in the scheduler and elsewhere that the parachain block number represents a regular monotonic clock which can be used as a proxy for a wall clock. IMO the pallet should have had this feature from the start, but since this is the contract we released under, I understand why making changes which decrease the benchmarked performance seems like a regression. In terms of functionality, nothing changes for people who set the block provider to `frame_system`.

Maybe we could introduce a feature flag which removes the extra data structure and introduced complexity - not sure how disgusting that would be. Since the benchmarking depends on how it's configured, are we even sure there will be a weight impact if it's configured to use `frame_system`?
No feature flag plz, it is guaranteed to introduce disaster. What are the downsides of making a new pallet? I see nothing.
I propose that we iterate on this issue and split it into a few separate ones, prioritizing the most time-sensitive part, which I believe we can close soon enough to include in the next SDK release.
Next, let's first see if we agree on (1). If so, we could ask @seemantaggarwal to create a separate PR for it. For (2), we could open a separate issue and discuss the solution for both (2) and (3), whether that ends up being a single solution or two different ones.

Why not complete the current solution? There's a lot of disagreement, and we might not make it into the next release. Personally, I'm not happy with the solution either. The `Queue`, with its overhead, impacts scheduler setups using a local block provider, even those that don't have issue (2). Additionally, it's unclear how to determine a good `MaxScheduledBlocks` value. We can discuss this in a separate issue.

@ggwpez @gui1117 @kianenigma @seadanda @xlc @seemantaggarwal
All GitHub workflows were cancelled due to the failure of one of the required jobs.
@muharem I agree with the disagreement parts. What I will do is have a word with @gui1117 and potentially @seadanda; I am not sure where the line can be drawn for each follow-up. In the meantime, I will leave the current PR in a slightly better shape for anyone to be able to pick it up later easily.
How about just copy & paste the scheduler pallet in this PR and call it something else? And if time allows, abstract duplicated logic into some helper functions.
If we do (1) as Muharem suggests, it only modifies an associated type and nothing else; I think it is fine.
From discussion with @gui1117: I will create a separate issue for (1) for now, and then we can circle back to (2) at a later stage.
Issue created here: #7434
Follow up from #6362 (comment)

The goal of this PR is to have the scheduler pallet work on a parachain which does not produce blocks on a regular schedule, and thus can use the relay chain as a block provider. Because blocks are not produced regularly, we cannot make the assumption that the block number increases monotonically, and thus have new logic to handle multiple spend periods passing between blocks.

Requirement: instead of using the hard-coded system block number, we add an associated type `BlockNumberProvider`.

Signed-off-by: Oliver Tale-Yazdi <[email protected]>
Co-authored-by: Oliver Tale-Yazdi <[email protected]>
Step in #6297
This PR adds the ability for the Scheduler pallet to specify its source of the block number. This is needed for the scheduler pallet to work on a parachain which does not produce blocks on a regular schedule, and thus can use the relay chain as a block provider. Because blocks are not produced regularly, we cannot make the assumption that the block number increases monotonically, and thus have new logic via a `Queue` to handle multiple blocks with valid agendas passed between them.

This change only needs a migration for the `Queue`, whether the `BlockNumberProvider` continues to use the system pallet's block number or the relay chain's block number. However, we would need separate migrations if the deployed pallets are upgraded on an existing parachain and the `BlockNumberProvider` uses the relay chain block number.

Todo