
fix: EXC-1838 Run hook after CanisterWasmMemoryLimitExceeded error is fixed #3631

Conversation

@dragoljub-duric dragoljub-duric commented Jan 27, 2025

Problem:
As previously observed by @berestovskyy in #3455 (comment), it may happen that execution of the low_wasm_memory hook is stopped when wasm_memory_limit < used_wasm_memory.
Solution:
If that happens, run the hook after the error is fixed if the hook condition remains satisfied.

@dragoljub-duric dragoljub-duric changed the title Exc 1838 revisit hook status behavior when hook execution is stopped because wasm memory usage wasm memory limit fix: EXC-1838 Run hook after CanisterWasmMemoryLimitExceeded error is fixed Feb 3, 2025
@github-actions github-actions bot added the fix label Feb 3, 2025
@dragoljub-duric dragoljub-duric marked this pull request as ready for review February 3, 2025 13:02
@dragoljub-duric dragoljub-duric requested a review from a team as a code owner February 3, 2025 13:02

@berestovskyy berestovskyy left a comment

LGTM, thanks!

@dragoljub-duric dragoljub-duric added this pull request to the merge queue Feb 3, 2025
Merged via the queue into master with commit 773b035 Feb 3, 2025
25 checks passed
@dragoljub-duric dragoljub-duric deleted the EXC-1838-revisit-hook-status-behavior-when-hook-execution-is-stopped-because-wasm-memory-usage-wasm-memory-limit branch February 3, 2025 14:35
Comment on lines +150 to +151
if err.code() == ErrorCode::CanisterWasmMemoryLimitExceeded
&& original.call_or_task == CanisterCallOrTask::Task(CanisterTask::OnLowWasmMemory)
Contributor

@dragoljub-duric Wouldn't it be better to not perform any execution in this case (instead of spending cycles on an execution that fails)? Isn't it possible to check the limits in advance before running the execution?

Contributor Author

UpdateHelper::new immediately checks the limit, in this file on line 371 below. So we could move this check into UpdateHelper::new, but that would require refactoring UpdateHelper::new because, in the case of the error, it would need to return a modified state (the state where we put the hook back on the task queue). Does that answer your question?
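
A minimal, self-contained sketch of the idea in this thread, using simplified stand-in types (the enums, the Vec task queue, and maybe_requeue_hook below are assumptions for illustration, not the actual replica API): when the low-Wasm-memory hook itself fails with CanisterWasmMemoryLimitExceeded, the hook is put back on the task queue so it can run once the limit error is fixed.

```rust
// Sketch only: these are simplified stand-ins, not the actual replica types.
#[derive(Debug, PartialEq)]
enum ErrorCode {
    CanisterWasmMemoryLimitExceeded,
}

#[derive(Debug, PartialEq)]
enum CanisterTask {
    OnLowWasmMemory,
}

#[derive(Debug, PartialEq)]
enum CanisterCallOrTask {
    Task(CanisterTask),
}

#[derive(Debug, PartialEq)]
enum ExecutionTask {
    OnLowWasmMemory,
}

/// If the low-Wasm-memory hook failed because used Wasm memory already
/// exceeds wasm_memory_limit, put the hook back on the task queue so it
/// runs again once the limit error is fixed (provided the hook condition
/// is still satisfied at that point).
fn maybe_requeue_hook(
    err_code: &ErrorCode,
    call_or_task: &CanisterCallOrTask,
    task_queue: &mut Vec<ExecutionTask>,
) {
    if *err_code == ErrorCode::CanisterWasmMemoryLimitExceeded
        && *call_or_task == CanisterCallOrTask::Task(CanisterTask::OnLowWasmMemory)
    {
        task_queue.push(ExecutionTask::OnLowWasmMemory);
    }
}

fn main() {
    let mut queue = Vec::new();
    maybe_requeue_hook(
        &ErrorCode::CanisterWasmMemoryLimitExceeded,
        &CanisterCallOrTask::Task(CanisterTask::OnLowWasmMemory),
        &mut queue,
    );
    assert_eq!(queue, vec![ExecutionTask::OnLowWasmMemory]);
}
```

In the real code, this handling sits in the execution error path shown in the diff above, and per the PR description the re-queued hook only runs if the hook condition remains satisfied.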

Contributor

> Does that answer your question?

Not really, because it is not clear to me (without looking into code that I'm not super familiar with) at what point in time the failure happens and whether cycles are charged. Could you please clarify that a bit more?

Contributor Author

Context: we concluded that we're charging the base fee of 5M cycles per execution nonetheless, as update_message_execution_fee in prepay_execution_cycles, which is not refunded in this case.

Contributor Author

The quick fix I see is that in this case, we can refund an additional 5M. @mraszyk what do you think?

Contributor

> The quick fix I see is that in this case

As a quick fix, it makes sense, but I wonder if the code doesn't become fragile due to such a fix. Do you see a way to avoid the refund by not preparing the execution (prepaying etc.) at all?

Contributor Author

I think it is doable; I am trying to add a check in execute_call_or_task before calling prepay_execution_cycles.

Contributor

This check could then ideally also apply to global timer etc.

Contributor Author

Yes, it will apply to all updates/tasks.
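
A rough sketch of that follow-up idea, with simplified stand-in types and field names (Canister, ExecError, and check_wasm_memory_limit below are assumptions for illustration, not the replica's actual execute_call_or_task signature): the limit is checked before prepay_execution_cycles, so a doomed execution is rejected before any base fee is charged, uniformly for updates and tasks.

```rust
// Sketch only: simplified stand-ins for the replica's canister state and
// error types; the real execute_call_or_task has a much richer signature.
struct Canister {
    used_wasm_memory: u64,
    wasm_memory_limit: u64,
}

#[derive(Debug, PartialEq)]
enum ExecError {
    WasmMemoryLimitExceeded,
}

/// Early check applied uniformly to updates and tasks (calls, heartbeat,
/// global timer, low-Wasm-memory hook) before any cycles are prepaid.
fn check_wasm_memory_limit(canister: &Canister) -> Result<(), ExecError> {
    if canister.used_wasm_memory > canister.wasm_memory_limit {
        return Err(ExecError::WasmMemoryLimitExceeded);
    }
    Ok(())
}

fn execute_call_or_task(canister: &Canister) -> Result<(), ExecError> {
    // Reject up front, before prepaying the base execution fee ...
    check_wasm_memory_limit(canister)?;
    // ... then prepay cycles and run the execution (elided in this sketch).
    Ok(())
}

fn main() {
    let over_limit = Canister {
        used_wasm_memory: 6_000_000,
        wasm_memory_limit: 5_000_000,
    };
    assert_eq!(
        execute_call_or_task(&over_limit),
        Err(ExecError::WasmMemoryLimitExceeded)
    );
}
```

Checking before prepaying sidesteps the refund question entirely, which addresses the concern above that a refund-based quick fix could make the code fragile.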

@@ -341,26 +359,22 @@ impl UpdateHelper {

validate_message(&canister, &original.method)?;

if let CanisterCallOrTask::Call(_) = original.call_or_task {
// TODO(RUN-957): Enforce the limit in heartbeat and timer after
Contributor

There's still one more TODO(RUN-957) in the code to be resolved. CC @dragoljub-duric

Contributor

@mraszyk mraszyk Feb 6, 2025

But in my opinion, it seems safer not to enforce the limit during a system task, i.e., to simply drop the other TODO(RUN-957), instead of trapping during a system task.

@@ -341,26 +359,22 @@ impl UpdateHelper {

validate_message(&canister, &original.method)?;

if let CanisterCallOrTask::Call(_) = original.call_or_task {
Contributor

Before this PR, we wouldn't be enforcing the limit for system tasks here, so I'm not sure why this PR is needed at all; the current effect of this PR seems to be as follows:

  • global timers and heartbeats fail if the wasm memory limit is exceeded initially (although they succeed if the wasm memory limit is exceeded during their execution): this behavior seems surprising to me
  • low on wasm memory hooks are retried if the wasm memory limit is exceeded initially (although they succeed if the wasm memory limit is exceeded during their execution): this behavior might be undesirable since the hook might be crucial in resolving the exceeded wasm memory limit and it wouldn't run due to this PR.

Contributor Author

@dragoljub-duric dragoljub-duric Feb 6, 2025

> global timers and heartbeats fail if the wasm memory limit is exceeded initially (although they succeed if the wasm memory limit is exceeded during their execution): this behavior seems surprising to me

This sounds expected to me, and it will behave the same way as in the update case. In my opinion, having homogeneous behavior for tasks/updates is a plus.

> low on wasm memory hooks are retried if the wasm memory limit is exceeded initially (although they succeed if the wasm memory limit is exceeded during their execution): this behavior might be undesirable since the hook might be crucial in resolving the exceeded wasm memory limit and it wouldn't run due to this PR.

I can see the point in this one; maybe you are right. If the developer uses the hook to get notified that memory is below the threshold, having the hook stopped in this case may be unexpected.
