Fully switch to wasmer 1 by default #4076
Comments
@ailisp am I correct that enabling wasmer1 is just |
@bowenwang1996 is there anything else we can do besides "deploy to beta and observe" to make sure that the transition goes smoothly?
@chefsale @mhalambek let's do the following:
This should give us ~1 month of testing in production. @matklad are there other tests that you would like to see done?
@matklad correct
If one node behaves OK, we should try to deploy half of the nodes, to make sure that a network with mixed versions works.
I don't think that is needed, because whether it works depends solely on the transactions, which means that it is sufficient to use a one-node canary that applies all the transactions. Having more nodes does not improve testing coverage in this case.
I talked to @matklad; we wanted to enable wasmer1 based on the current releases 1.18.0 and 1.18.1. They do not contain the wasmer changes. Even if we decide to create a build and run one node based off commit 79f1732, we will still need to wait for the next release which includes this change. So my proposal was to:
I think this is most likely the safer approach; not sure how pressing this is. cc: @matklad
Yes, I definitely agree that it's safer to release 1.19 without wasmer1, then test wasmer1 on top of that, and then switch to wasmer1 in a separate (minor/patch) release. That being said, deploying a node today from 79f1732 will help us discover and fix bugs earlier. It won't affect when we ship, but it would improve the quality of what we ship, as more time will pass between the last bug fixes and the actual deployment. So, if that's not too difficult, I would prefer to have nodes from 79f1732 as additional help.
Yes, in that case I will create a separate branch from 79f1732 which enables wasmer1 in the Makefile, and we will deploy this.
Created a branch for the wasmer build: https://github.com/near/nearcore/tree/wasmer1-enabled |
Please review: https://github.com/near/near-ops/pull/445. This creates the wasmer1-enabled canaries. @bowenwang1996 @mhalambek @matklad
Machines are running, but they die:
@chefsale can you post the reason for the crash? What you posted is just a stack trace from memory stats, which does not indicate why the node crashed.
I mean, the log is long; you can take a look at the full log on the machine: with
I assume this is what you were looking for. The stack trace above is from this error, I assume.
No. As I said, it is the result of enabling memory_stats.
Yes. This indicates that there is a database version mismatch, which is likely caused by you trying to run the node with a commit that has an older database version than what was run before. This has nothing to do with wasmer1. I'd suggest that you try a more recent commit. |
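The mismatch described here is the usual startup guard: a binary refuses to open a database written with a newer schema than it knows about. A minimal sketch of that check, with invented version numbers and names (nearcore's actual migration logic differs):

```rust
// Illustrative only: the enum and function names are made up, not
// nearcore's real API.
#[derive(Debug, PartialEq)]
enum DbCheck {
    UpToDate,
    NeedsMigration, // on-disk schema older than the binary: migrate forward
    BinaryTooOld,   // on-disk schema newer than the binary: refuse to start
}

fn check_db_version(on_disk: u32, binary_expects: u32) -> DbCheck {
    if on_disk == binary_expects {
        DbCheck::UpToDate
    } else if on_disk < binary_expects {
        DbCheck::NeedsMigration
    } else {
        DbCheck::BinaryTooOld
    }
}

fn main() {
    // Running an older commit against a database created by a newer build
    // hits the third case: exactly the crash seen above.
    assert_eq!(check_db_version(28, 27), DbCheck::BinaryTooOld);
}
```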
The error is generated by |
I just confirmed, from a build of the previous commit a637519, that this failure is not related to wasmer1. So there are two errors:
Machine on mainnet: I migrated testnet to the current master (rebased): https://github.com/near/nearcore/tree/wasmer1-enabled. It is running on the machine: Let me know if you need anything else and when we plan to start migrating testnet first. Probably give it a few days running first. cc: @matklad @ailisp @bowenwang1996
One important thing we need to do before switching to wasmer 1.0 is auditing the wasmer 1 error paths, to make sure that non-deterministic errors such as OOM crash the node instead of being treated as a deterministic failure to execute the contract. We have this logic for wasmer 0.17 here; it needs to be updated to cover wasmer 1.0 as well. @olonho volunteered to look into this, and maybe even write some fault-injection tests for this logic.
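The goal of that audit can be sketched as an error split like the following (the types are placeholders, not the real near-vm-runner enums): deterministic failures become part of the contract's recorded outcome, while non-deterministic ones must abort the node.

```rust
// Placeholder types: the real classification in nearcore is more
// fine-grained and lives in the VM runner crates.
#[derive(Debug, PartialEq)]
enum VmError {
    Deterministic(&'static str),    // wasm trap, gas exceeded, bad contract
    NonDeterministic(&'static str), // host OOM, I/O failure, allocator bug
}

/// Deterministic errors are charged to the contract; non-deterministic
/// ones are propagated so the node can abort instead of committing a
/// result that other validators might not reproduce.
fn handle(err: VmError) -> Result<&'static str, VmError> {
    match err {
        VmError::Deterministic(_) => Ok("contract failed, outcome recorded"),
        e @ VmError::NonDeterministic(_) => Err(e),
    }
}

fn main() {
    assert!(handle(VmError::Deterministic("gas exceeded")).is_ok());
    assert!(handle(VmError::NonDeterministic("OOM")).is_err());
}
```

Misclassifying an OOM as deterministic would bake a machine-local failure into the chain state, which is why the audit matters before the default flips.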
@matklad @chefsale I thought about this again and realized that what I said before didn't make sense. We should run more nodes with wasmer1 since this will increase the chance of nondeterministic errors happening and can help better test wasmer1. I suggest the following:
What do you think? |
How hard would it be to switch only half of the nodes at a time? I fear a case where we need to roll back wasmer 1.0 due to some yet-unknown issue and, after the rollback, realize that wasmer 0.17 is broken as well because we didn't use it on testnet for some time. Other than that, good idea. We probably need to merge #4181 first, though (it might even be done today?).
Sure we can do that |
One more thing to do here: re-run the params estimator to make sure that wasmer1 estimated fees are not worse than the 0.17 fees. Note that we don't necessarily want to lower the fees together with wasmer1. We can do that separately, as long as the estimates are lower, and we might want to batch fee lowering with other changes, like the addition of fees for contract compilation on deployment.
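That acceptance criterion can be pictured as a comparison of two cost tables: every wasmer1 fee must be at most its wasmer 0.17 counterpart. A small sketch with made-up parameter names (not nearcore's real cost table):

```rust
use std::collections::HashMap;

// Returns true if no fee regressed: every parameter present in the old
// table is present in the new one with an equal or lower estimate.
fn no_regressions(old: &HashMap<&str, u64>, new: &HashMap<&str, u64>) -> bool {
    old.iter()
        .all(|(param, &old_fee)| new.get(param).map_or(false, |&new_fee| new_fee <= old_fee))
}

fn main() {
    // Invented numbers for illustration.
    let wasmer0 = HashMap::from([("wasm_instr", 10u64), ("contract_compile_base", 100)]);
    let wasmer1 = HashMap::from([("wasm_instr", 8u64), ("contract_compile_base", 100)]);
    // Lower or equal on every parameter: safe to switch without touching fees.
    assert!(no_regressions(&wasmer0, &wasmer1));
}
```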
@ailisp @olonho @bowenwang1996 I realized there's an unresolved question about the wasmer 1 rollout. Namely: should we switch to wasmer 1 in lockstep with enabling compilation on deploy, removing the memory cache, and upgrading costs? Summary of the current state:
Cost measurements show that wasmer1 is not slower on any metric (ref). Note that the costs do not reflect the in-memory cache -- during estimation, the cache is disabled. I see three ways forward:
To me, option 2 seems the safest, as it minimizes the diff with the current state and doesn't require a protocol upgrade, since it doesn't touch costs. So, do you agree that we should do the following?
checkbox list to make sure we have consensus: |
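One way to picture a rollout that is decoupled from the cost changes is gating the VM choice on the protocol version, so old blocks replay on the old VM. The constant and names below are invented for illustration; nearcore's real feature gating works differently.

```rust
// Hypothetical sketch: the version number is made up.
#[derive(Debug, PartialEq, Clone, Copy)]
enum VMKind {
    Wasmer0,
    Wasmer1,
}

const WASMER1_PROTOCOL_VERSION: u32 = 45; // invented for illustration

fn vm_kind_for(protocol_version: u32) -> VMKind {
    if protocol_version >= WASMER1_PROTOCOL_VERSION {
        VMKind::Wasmer1
    } else {
        VMKind::Wasmer0
    }
}

fn main() {
    // Blocks produced before the switch keep executing on wasmer 0.17;
    // blocks at or after the feature version use wasmer 1.
    assert_eq!(vm_kind_for(44), VMKind::Wasmer0);
    assert_eq!(vm_kind_for(45), VMKind::Wasmer1);
}
```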
Fine, let's make 4231 less intrusive. Also attached the latest costs update there.
This PR ports the wasmer0 implementation of the in-memory cache to wasmer1 as well. The goal here is to keep the behavior of the two wasmers as close as possible. We absolutely do want to get rid of the in-memory cache, but it's safer not to do that while also doing a major wasmer upgrade! See near#4076 (comment) for context
Test Plan
-----------
```
$ cargo test -p near-vm-runner --features wasmer0_vm,wasmer1_vm,no_cache
$ cargo test -p near-vm-runner --features wasmer0_vm,wasmer1_vm
```
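The ported cache can be pictured as a map from code hash to compiled module, so each contract is compiled at most once per process. This is a simplified sketch with placeholder types; the real implementation also bounds the cache and handles compilation errors.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

// Simplified in-memory cache of compiled modules, keyed by code hash.
// `M` stands in for a compiled wasmer module.
struct ModuleCache<M> {
    map: Mutex<HashMap<[u8; 32], M>>,
}

impl<M: Clone> ModuleCache<M> {
    fn new() -> Self {
        Self { map: Mutex::new(HashMap::new()) }
    }

    /// Return the cached module, invoking `compile` only on a miss.
    fn get_or_compile(&self, code_hash: [u8; 32], compile: impl FnOnce() -> M) -> M {
        let mut map = self.map.lock().unwrap();
        map.entry(code_hash).or_insert_with(compile).clone()
    }
}

fn main() {
    let cache = ModuleCache::new();
    let mut compilations = 0;
    cache.get_or_compile([0; 32], || { compilations += 1; "module" });
    cache.get_or_compile([0; 32], || { compilations += 1; "module" });
    // The second call is served from the cache: compiled exactly once.
    assert_eq!(compilations, 1);
}
```

Keeping this behavior identical across both VMs is exactly why removing the cache is deferred to a separate change.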
@matklad should we close this one given that we are now planning to move to wasmer2? |
Yeah, let's close this for now: it's pretty clear that we need to do significant changes on the wasmer side before we can reconsider this, and that'll require re-starting the process here. In particular, wasmerio/wasmer#2329 is still outstanding. |
In #3799, @ailisp implemented support for wasmer 1.0 🎉
However, we are still using wasmer 0.17 by default -- a major upgrade of the VM is not a laughing matter; we must make sure that the two behave identically. This issue exists to track our progress on switching to wasmer 1.0 by default, and the eventual removal of support for wasmer 0.17 from the tree.
The next step in the transition is to deploy a fraction of nodes with wasmer1 to the beta net.