fix: EXC-1838 Run hook after CanisterWasmMemoryLimitExceeded error is fixed #3631

54 changes: 34 additions & 20 deletions rs/execution_environment/src/execution/update.rs
@@ -51,7 +51,7 @@ pub fn execute_update(
log_dirty_pages: FlagStatus,
deallocation_sender: &DeallocationSender,
) -> ExecuteMessageResult {
-    let (clean_canister, prepaid_execution_cycles, resuming_aborted) =
+    let (mut clean_canister, prepaid_execution_cycles, resuming_aborted) =
match prepaid_execution_cycles {
Some(prepaid_execution_cycles) => (clean_canister, prepaid_execution_cycles, true),
None => {
@@ -147,13 +147,31 @@ pub fn execute_update(
let helper = match UpdateHelper::new(&clean_canister, &original, deallocation_sender) {
Ok(helper) => helper,
Err(err) => {
if err.code() == ErrorCode::CanisterWasmMemoryLimitExceeded
&& original.call_or_task == CanisterCallOrTask::Task(CanisterTask::OnLowWasmMemory)
Comment on lines +150 to +151
Contributor:

@dragoljub-duric Wouldn't it be better to not perform any execution in this case (instead of spending cycles on an execution that fails)? Isn't it possible to check the limits in advance before running the execution?

Contributor Author:

UpdateHelper::new immediately checks the limit, at line 371 below in this file. We could move this check into UpdateHelper::new, but that would require refactoring UpdateHelper::new because, in the error case, it would have to return a modified state (the state where we put the hook back on the task queue). Does that answer your question?

Contributor:

> Does that answer your question?

Not really, because it is not clear to me (without looking into the code, which I'm not super familiar with) at what point in time the failure happens and whether cycles are charged. Could you please clarify that a bit more?

Contributor Author:

Context: we concluded that we're charging the base fee of 5M cycles per execution nonetheless, as update_message_execution_fee in prepay_execution_cycles, which is not refunded in this case.

Contributor Author:

The quick fix I see is that in this case, we can refund an additional 5M. @mraszyk what do you think?

Contributor:

> The quick fix I see is that in this case

As a quick fix it makes sense, but I wonder whether the code becomes fragile due to such a fix. Do you see a way to avoid the refund by not preparing the execution (prepaying, etc.) at all?

Contributor Author:

I think it is doable; I am trying to add a check in execute_call_or_task before calling prepay_execution_cycles.

Contributor:

This check could then ideally also apply to global timer etc.

Contributor Author:

Yes, it will apply to all updates/tasks.
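
For concreteness, here is a minimal sketch of what such a pre-flight check could look like if it ran in execute_call_or_task before prepay_execution_cycles. The function name check_wasm_memory_limit and its exact signature are hypothetical; this is not the code that landed in this PR:

// Hypothetical pre-flight check: fail the call/task before any cycles are
// prepaid, so the 5M base fee is never charged and nothing needs refunding.
fn check_wasm_memory_limit(canister: &CanisterState) -> Result<(), UserError> {
    let limit = match canister.system_state.wasm_memory_limit {
        // No limit configured, or a limit of 0, means unlimited.
        None => return Ok(()),
        Some(limit) if limit.get() == 0 => return Ok(()),
        Some(limit) => limit,
    };
    let usage = canister
        .execution_state
        .as_ref()
        .map_or(NumBytes::new(0), |es| {
            num_bytes_try_from(es.wasm_memory.size).unwrap()
        });
    if usage > limit {
        // Same error that UpdateHelper::new raises today, just raised earlier.
        let err = HypervisorError::WasmMemoryLimitExceeded {
            bytes: usage,
            limit,
        };
        return Err(err.into_user_error(&canister.canister_id()));
    }
    Ok(())
}

execute_call_or_task would then return early on Err, skipping prepay_execution_cycles entirely, which sidesteps the refund question altogether.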

{
// `OnLowWasmMemoryHook` is taken from the task_queue (i.e. `OnLowWasmMemoryHookStatus` is `Executed`),
// but it was not executed due to the `WasmMemoryLimitExceeded` error. To ensure that the hook is executed
// when the error is resolved, we need to set `OnLowWasmMemoryHookStatus` to `Ready`. Because of
// the way `OnLowWasmMemoryHookStatus::update` is implemented, we first need to remove it from the
// task_queue (which calls `OnLowWasmMemoryHookStatus::update(false)`) followed by `enqueue`
// (which calls `OnLowWasmMemoryHookStatus::update(true)`) to ensure the desired behavior.
clean_canister
.system_state
.task_queue
.remove(ic_replicated_state::ExecutionTask::OnLowWasmMemory);
clean_canister
.system_state
.task_queue
.enqueue(ic_replicated_state::ExecutionTask::OnLowWasmMemory);
}
return finish_err(
clean_canister,
original.execution_parameters.instruction_limits.message(),
err,
original,
round,
);
}
};
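
To see why the remove-then-enqueue sequence above is needed, here is a hedged sketch of the status transitions the code comment describes. The actual OnLowWasmMemoryHookStatus::update lives in ic_replicated_state; the ConditionNotSatisfied variant name and the exact match arms below are assumptions:

// Hypothetical model of the hook status transitions; not the actual
// ic_replicated_state implementation.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum OnLowWasmMemoryHookStatus {
    ConditionNotSatisfied,
    Ready,
    Executed,
}

impl OnLowWasmMemoryHookStatus {
    // `update(true)` runs when the hook is enqueued,
    // `update(false)` when it is removed from the task queue.
    fn update(&mut self, is_enqueued: bool) {
        *self = match (*self, is_enqueued) {
            // An `Executed` hook stays `Executed` on enqueue, which is why
            // enqueueing alone would not re-arm the hook.
            (Self::Executed, true) => Self::Executed,
            (_, true) => Self::Ready,
            // Removal resets the status, so remove-then-enqueue yields
            // `Ready`, re-arming the hook for when the error is resolved.
            (_, false) => Self::ConditionNotSatisfied,
        };
    }
}

Under this model, remove (update(false)) takes Executed to ConditionNotSatisfied, and the subsequent enqueue (update(true)) takes it to Ready, which is exactly the state the fix needs.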

@@ -341,26 +359,22 @@ impl UpdateHelper {

validate_message(&canister, &original.method)?;

if let CanisterCallOrTask::Call(_) = original.call_or_task {
Contributor:

Before this PR, we weren't enforcing the limit for system tasks here, so I'm not sure why this PR is needed at all; its current effect seems to be as follows:

  • global timers and heartbeats fail if the wasm memory limit is exceeded initially (although they succeed if the wasm memory limit is exceeded during their execution): this behavior seems surprising to me
  • low on wasm memory hooks are retried if the wasm memory limit is exceeded initially (although they succeed if the wasm memory limit is exceeded during their execution): this behavior might be undesirable since the hook might be crucial in resolving the exceeded wasm memory limit and it wouldn't run due to this PR.

Contributor Author (@dragoljub-duric, Feb 6, 2025):

> global timers and heartbeats fail if the wasm memory limit is exceeded initially (although they succeed if the wasm memory limit is exceeded during their execution): this behavior seems surprising to me

This sounds expected to me, and it will behave the same way as in the update case. In my opinion, having homogeneous behavior of tasks/updates is a plus.

> low on wasm memory hooks are retried if the wasm memory limit is exceeded initially (although they succeed if the wasm memory limit is exceeded during their execution): this behavior might be undesirable since the hook might be crucial in resolving the exceeded wasm memory limit and it wouldn't run due to this PR.

I can see the point in this one; maybe you are right. If the developer uses the hook to be notified that memory is below the threshold, having the hook stopped in this case may be unexpected.

// TODO(RUN-957): Enforce the limit in heartbeat and timer after
Contributor:

There's still one more TODO(RUN-957) in the code to be resolved. CC @dragoljub-duric

Contributor (@mraszyk, Feb 6, 2025):

But in my opinion, it seems safer to not enforce the limit during a system task, i.e., simply drop the other TODO(RUN-957), instead of trapping during a system task.

// canister logging ships by removing the `if` above.
let wasm_memory_usage = canister
.execution_state
.as_ref()
.map_or(NumBytes::new(0), |es| {
num_bytes_try_from(es.wasm_memory.size).unwrap()
});

if let Some(wasm_memory_limit) = clean_canister.system_state.wasm_memory_limit {
// A Wasm memory limit of 0 means unlimited.
if wasm_memory_limit.get() != 0 && wasm_memory_usage > wasm_memory_limit {
let err = HypervisorError::WasmMemoryLimitExceeded {
bytes: wasm_memory_usage,
limit: wasm_memory_limit,
};
return Err(err.into_user_error(&canister.canister_id()));
}
}
}

@@ -1443,3 +1443,103 @@ fn on_low_wasm_memory_is_executed_after_growing_stable_memory() {
NumWasmPages::new(6)
);
}

#[test]
fn on_low_wasm_memory_hook_is_run_after_memory_surpass_limit() {
let mut test = ExecutionTestBuilder::new().with_manual_execution().build();

let update_grow_mem_size = 10;
let hook_grow_mem_size = 5;

let wat: String =
get_wat_with_update_and_hook_mem_grow(update_grow_mem_size, hook_grow_mem_size, true);

let canister_id = test.canister_from_wat(wat.as_str()).unwrap();

// Initially wasm_memory.size = 1
assert_eq!(
test.execution_state(canister_id).wasm_memory.size,
NumWasmPages::new(1)
);

test.ingress_raw(canister_id, "grow_mem", vec![]);

// The first ingress message gets executed.
// wasm_memory.size = 1 + 10 = 11
test.execute_slice(canister_id);

assert_eq!(
test.execution_state(canister_id).wasm_memory.size,
NumWasmPages::new(11)
);

// We update `wasm_memory_limit` to be smaller than `used_wasm_memory`.
test.canister_update_wasm_memory_limit_and_wasm_memory_threshold(
canister_id,
(10 * WASM_PAGE_SIZE_IN_BYTES as u64).into(),
(5 * WASM_PAGE_SIZE_IN_BYTES as u64).into(),
)
.unwrap();

// The update will also trigger the `low_wasm_memory` hook.
assert_eq!(
test.state()
.canister_states
.get(&canister_id)
.unwrap()
.system_state
.task_queue
.peek_hook_status(),
OnLowWasmMemoryHookStatus::Ready
);

// Hook execution will not succeed since `used_wasm_memory` > `wasm_memory_limit`.
test.execute_slice(canister_id);

assert_eq!(
test.execution_state(canister_id).wasm_memory.size,
NumWasmPages::new(11)
);

// After the hook execution fails, the hook status remains `Ready`.
assert_eq!(
test.state()
.canister_states
.get(&canister_id)
.unwrap()
.system_state
.task_queue
.peek_hook_status(),
OnLowWasmMemoryHookStatus::Ready
);

// We fix the error by setting `wasm_memory_limit` > `used_wasm_memory`.
// At the same time, since
// `wasm_memory_limit` - `used_wasm_memory` < `wasm_memory_threshold`,
// the condition for the `low_wasm_memory` hook remains satisfied.
// Hence, the `low_wasm_memory` hook execution will follow.
test.canister_update_wasm_memory_limit_and_wasm_memory_threshold(
canister_id,
(20 * WASM_PAGE_SIZE_IN_BYTES as u64).into(),
(10 * WASM_PAGE_SIZE_IN_BYTES as u64).into(),
)
.unwrap();

test.execute_slice(canister_id);

assert_eq!(
test.execution_state(canister_id).wasm_memory.size,
NumWasmPages::new(16)
);

assert_eq!(
test.state()
.canister_states
.get(&canister_id)
.unwrap()
.system_state
.task_queue
.peek_hook_status(),
OnLowWasmMemoryHookStatus::Executed
);
}
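
As a quick sanity check of the hook condition the comments above rely on (wasm_memory_limit - used_wasm_memory < wasm_memory_threshold), here is the arithmetic for the final phase of the test, with all quantities in Wasm pages:

// All quantities in Wasm pages, matching the numbers used in the test above.
let (limit, usage, threshold) = (20u64, 11u64, 10u64);
// 20 - 11 = 9 < 10, so the hook condition holds and the hook is scheduled.
assert!(limit - usage < threshold);
// The hook grows memory by hook_grow_mem_size = 5 pages: 11 + 5 = 16.
assert_eq!(usage + 5, 16);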