-
Notifications
You must be signed in to change notification settings - Fork 632
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: rework contract runtime error processing #4181
Conversation
runtime/near-vm-errors/src/lib.rs
Outdated
@@ -389,6 +383,9 @@ impl fmt::Display for WasmTrap { | |||
WasmTrap::GenericTrap => write!(f, "Generic trap."), | |||
WasmTrap::BreakpointTrap => write!(f, "Breakpoint trap."), | |||
WasmTrap::StackOverflow => write!(f, "Stack overflow."), | |||
WasmTrap::NondeterministicTrap(message) => { | |||
write!(f, "Non-deterministic trap: {}", message) | |||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Undelated to PR, but
The error message given by the Display representation of an error type should be lowercase without trailing punctuation, and typically concise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reworked, but requires looking on that in separate PR.
/// It can be convert back by wasmer_vm::TrapCode::from_str | ||
Wasmer1Trap(String), | ||
/// Non-deterministic error. | ||
Nondeterministic(String), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this error deterministic, i.e, is it guaranteed that a nondeterministic error will be wrapped in this variant?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, now it is.
VMError::FunctionCallError(FunctionCallError::WasmerRuntimeError(error_msg)) | ||
} | ||
} | ||
Ok(e) => return (&e).into(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can probably just do an if let
here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
InvokeError::TrapCode { code, srcloc } => { | ||
VMError::FunctionCallError(FunctionCallError::Nondeterministic(format!( | ||
"Wasmer trap {} at {}", | ||
*code as u32, srcloc | ||
))) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, from the light reading of the code, I am not sure that this should always be nondeterministic. It seems that, eg, division by zero generates this trap, no? Do we have a test for division by zero?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct, reworked that piece.
runtime/runtime/src/actions.rs
Outdated
FunctionCallError::CompilationError(_) | ||
| FunctionCallError::LinkError { msg: _ } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unrelated to the PR, but:
- LinkError should probably be a tuple -- it's the only one with
msg
field, which feels inconsistent - I am not entirely sure we want to distinguish between CompilationError and LinkError, they feel sufficiently similar to me in the context of wasm execution.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Guess so, but this PR already become bigger than I want it to be.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting, I, actually, prefer it to be structured (and update the rest of variants as well if necessary) since String
does not capture the meaning of the text value 😐
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The intention here is to have error separated between contract executor and the rest of logic. We can make it a enum, but the only purpose of this error is to show error to humans, that's why I made this change.
/// It can be convert back by wasmer_vm::TrapCode::from_str | ||
Wasmer1Trap(String), | ||
/// Non-deterministic error. | ||
Nondeterministic(String), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I doubt of still keep it WasmUnkownError is a good for wasmer 1. The internal of it is really unknown in wasmer 0.x (UnknownTrap etc.) but it's indeed known in wasmer 1. I suggest name it as WasmerRuntimeError or WasmerInternalError
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reworked, so that Nondeterministic
is properly returned.
Some(TrapCode::VMOutOfMemory) => { | ||
FunctionCallError::Nondeterministic("Wasmer out of memory".to_string()) | ||
} | ||
None => FunctionCallError::Nondeterministic(error_msg), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is incorrect, None only happens when wasmer returns a Generic Error or User Error that failed to downcast to VMLogicError: https://github.com/wasmerio/wasmer/blob/9e0122e1f7bdb6fedf5fc17cc9a3a0fda30031a0/lib/engine/src/trap/error.rs#L18
And Generic Error is constructed by wasmer RuntimeError::new: https://github.com/wasmerio/wasmer/blob/9e0122e1f7bdb6fedf5fc17cc9a3a0fda30031a0/lib/engine/src/trap/error.rs#L61
User Error is any struct that has Error trait.
But both of them must come from constructing such structs in wasm file manually. We only consturct VMLogicError in near-vm-logic/src/logic.rs so other User Error and GenericError should not happen. So it is not an Nondeterministic, should be debug_assert!(false) or unreachable!()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reworked code around this place.
Overall looks good, this specific one looks questionable: #4181 (comment) |
Adding @volovyk-s to check on error reporting, since it affects near-api-js |
// TODO: shall we abort on unknown errors also? | ||
| FunctionCallError::WasmUnknownError(_) | ||
| FunctionCallError::HostError(_) | ||
| FunctionCallError::EvmError(_) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@artob @frankbraun @joshuajbouw Given that EVM execution now does not differ from any other Wasm contract execution, why do we still have a separate EvmError
?
result.result = Err(ActionErrorKind::FunctionCallError(err).into()); | ||
false | ||
} | ||
Some(VMError::FunctionCallError(err)) => match err { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is unexpected that we have structured FunctionCallError
information, but then we convert it to string here. What is the reason for doing it? This structure propagates structured errors all the way into RPC (CC @frol ) and we prefer to have error in RPC as structured as possible, so that we avoid bad JS code that parses error strings to understand whether there is a compilation error (which is something devtools would want to differentiate from execution errors).
We would need to cut a release for near-api-js when Wasm error structure changes due to Wasmer upgrade independently on whether we are returning it as string or strictly-typed structure. With strictly typed structure we just going to have less glue-code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's intentional, and for that particular distinction we can introduce CompilationError and ExecutionError, if this difference is indeed important. Currently, the only use of the error is showing it to developers in logs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently, the only use of the error is showing it to developers in logs.
This is exactly what we want to improve by introducing sane granularity.
That's intentional, and for that particular distinction we can introduce CompilationError and ExecutionError, if this difference is indeed important.
Yes, that would be ideal. I think end-users care about three types of errors: UnknownMethod, CompilationError, ExecutionError.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good, please introduce these 3 levels of errors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool, let's do that!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@volovyk-s BadArgument
is a contract-specific error (because each contract decides to interpret it differently). We cannot differentiate it on a blockchain-level.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@olonho It seems like we are still combining all three kinds of errors into ActionErrorKind::FunctionCallError
with string. Did you forget to push the commit?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and we prefer to have error in RPC as structured as possible
I want to push back a little bit against "as structured as possible errors". This is not strictly related to the PR, just the pattern I observer in Rust's error handling.
The issue with maximally stuctured errors is that they increase coupling, and sometimes it's better to stop at the goldilocks errors, which are structured just enough.
In the context of this PR, if we were to make error maximally structured, than we would need to expose wasmer 0.17 and wasmer 1.0 as is. And when we remove support for wasmer 0.17 or upgrade to (hypothetical) wasmer 2.0, we'd have to change the definition of the error type as well, and update the call-sites. This PR in some sense is the consequence of the fact that we did change error definition when upgrading to wasmer 1.0. This is a code smell, pointing out that the original error definition probably wasn't encapsulated enough.
But, clearly, exposing just the error message goes too far in the opposite direction, as it forces the clients to parse out the messages.
There's a goldilocks error pattern -- ErrorKind
. The best example of it is std::io::Error:
The Error
type itself is an opaque struct, which can be converted to string via Display
. There's a .kind()
method, which exposes non-exhaustive, field-less enum. This structure allows the client to behave differently for different sorts of errors, but allows the implementor to modify the errors without changes to the client.
To re-iterate, this is definitely not something we should do in the scope of this PR, and it's not necessary something we want to do even eventually (serializability of errors add an extra design dimension here which might change the tradeoffs). But it's useful to keep this bit of info in mind -- just exposing errors as is is the path of least resistance, but often not something you want to have at the API boundary.
References:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Exactly, it was intentional "right" coercion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@matklad thanks for the detailed overview. I believe after some discussions we already arrived at the middle ground of introducing exactly the necessary variants that the consumer cares about 👍
core/primitives/src/errors.rs
Outdated
/// An error occurred during a `FunctionCall` Action, parameter is debug message. | ||
FunctionCallError(String), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@olonho We try to bring more structured errors into the codebase, and this change goes in an opposite direction, so I wonder about the motivation behind this change 🤔
@bowenwang1996 I assume changes to this enum require a protocol upgrade (since the errors are part of the proof), a db migration (since the values are serialized and stored as borsh values), and a network upgrade (?)
P.S. If we still want to change it, I suggest capturing the logic in the code instead of comments:
/// An error occurred during a `FunctionCall` Action, parameter is debug message. | |
FunctionCallError(String), | |
/// An error occurred during a `FunctionCall` Action. | |
FunctionCallError { debug_message: String }, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do not think, that protocol changes in any way, as for protocol it's important presence of FunctionCallError error, not exact type of it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think for proof it is important to have an identical hash for the execution outcome, so it will require a protocol change unless I misunderstand something.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIRC, when computing hash we explicitly ignore error messages, @matklad shown me the place where it's done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@matklad shown me the place where it's done.
Over here:
nearcore/core/primitives/src/transaction.rs
Line 324 in b95eedf
ExecutionStatus::Failure(_) => PartialExecutionStatus::Failure, |
Naturally, I might be missing something, but it seems resonable that, from protocol's point of view, we remeber only a single bit: was there some error?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great, we don't need a protocol upgrade!
What about database migration?
I also see various ExecutionOutcome types used in chain/network. (ExecutionOutcome uses TxExecutionError which uses ActionError which uses ActionErrorKind and all of that gets borsh-serialized and should be compatible with the existing running nodes, right?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also do not think we shall store anything related to exact error kind in DB. If we do - guess it's a bug.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@olonho we definitely store the errors in the storage since we receive them from RPC: https://explorer.near.org/transactions/GcRiQb5PFwMAogtu8te6wDKMRg9pefbWV4nJx3iqijny
http post https://archival-rpc.mainnet.near.org/ jsonrpc=2.0 method=tx params:='["GcRiQb5PFwMAogtu8te6wDKMRg9pefbWV4nJx3iqijny", "tom.zest.near"]' id=dontcare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do store them in database and if we've changed how the error serializes to borsh we should do a migration
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, so could we do migration in the followup CL? And would be amazing, if we could cooperate on the migration effort.
runtime/near-vm-errors/src/lib.rs
Outdated
@@ -31,15 +29,11 @@ pub enum FunctionCallError { | |||
MethodResolveError(MethodResolveError), | |||
/// A trap happened during execution of a binary | |||
WasmTrap(WasmTrap), | |||
WasmUnknownError, | |||
WasmUnknownError(String), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider adding a description the the string value:
WasmUnknownError(String), | |
WasmUnknownError { debug_message: String }, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
runtime/runtime/src/actions.rs
Outdated
FunctionCallError::CompilationError(_) | ||
| FunctionCallError::LinkError { msg: _ } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting, I, actually, prefer it to be structured (and update the rest of variants as well if necessary) since String
does not capture the meaning of the text value 😐
PTAL, merged suggestions. |
result.result = Err(ActionErrorKind::FunctionCallError(err).into()); | ||
false | ||
} | ||
Some(VMError::FunctionCallError(err)) => match err { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@olonho It seems like we are still combining all three kinds of errors into ActionErrorKind::FunctionCallError
with string. Did you forget to push the commit?
Addresses Max's request. |
Initially, we treated this error case as deterministic (so, we stored this error in our state, etc.) Then, in #4181 (comment) we reasoned that this error actually happens non-deterministically, so it's better to panic in this case. However, when rolling this out, we noticed that this error happens deterministically for at least one contract. So here we roll this back to a previous behavior and emit some deterministic error, which won't cause the node to panic.
Make sure errors processed in the similar manner between Wasmer 0.x and Wasmer 1.x.
Introduced explicit non-deterministic error.
Test plan
Reworked error cases tests to make them uniform, then
cargo test
.