-
Notifications
You must be signed in to change notification settings - Fork 326
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Unreliable Language Server behavior in the new GUI #7898
Comments
It's make my recent testing quite hard, as I must try several times until I get updates (and reading updates I implement). |
This looks like a race-condition in the initialization and new GUI not waiting for the startup to complete?
|
There never was any message about "initialization completness" sent from engine, except first "execution status" (which we use to hide "compiling standard library") - but that requires creating execution context first, which often fails in GUI2. |
It's definitely related to that exception since in both cases that failed you were getting this NPE. I will dig into the sequence in more detail. |
The response to |
I've been analyzing the logs again and something isn't right. I can see 2 messages "RPC session initialized for client" for different clients in close succession.
|
|
OK. Then I'm not surprised of this ticket anymore. In theory, we are able to support multiple clients, especially with collaborative buffer. In practice we probably never exercised initialization enough, nor needed to, to be confident that this will not cause problems. |
Hubert Plociniczak reports a new STANDUP for yesterday (2023-10-04): Progress: Found the potential race condition when two clients initialize instrumentation. Not yet sure how to fix it as I'm not so familiar with that part. It should be finished by 2023-10-05. Next Day: Next day I will be working on the #7898 task. Come up with a fix. |
Hubert has "walked me thru the code" at enso/engine/runtime-instrument-common/src/main/scala/org/enso/interpreter/instrument/Handler.scala Line 108 in f135637
The value of Possibly, the set value of |
I feel like this more LS clients feature has opened a pandora's box. Here is another random failure:
(once I fixed the previous problem) |
Hubert Plociniczak reports a new STANDUP for yesterday (2023-10-05): Progress: Fixed two locations where we had potential race-condition. PR is up but needs more testing. I think I also saw some random failures still with new GUI. Trying to figure out what is with the failures on CI. It should be finished by 2023-10-05. Next Day: Next day I will be working on the #7898 task. More testing with new GUI. |
Here are the relevant locks that are in the deadlock state:
Essentially the bottleneck is in the endpoint handling of messages which is synchronous. |
Since it is probably not so obvious here is a sequence of events that triggers the problem. I'm pretty sure we might have experienced it occasionally with the old GUI except we never got a stracktrace and circumstances were less race-condition friendly:
In fact, |
Hubert Plociniczak reports a new 🔴 DELAY for the provided date (2023-10-06): Summary: There is 6 days delay in implementation of the Unreliable Language Server behavior in the new GUI (#7898) task. Delay Cause: Fixing one problem in the initialization with multiple clients, revealed more bugs that were lurking around. |
Hubert Plociniczak reports a new STANDUP for the provided date (2023-10-06): Progress: Discovered the problem with messages not being unstashed while waiting for initialization. But the fix revealed more transient failures (less frequent than originally reported and at a later stage). Got distracted by investigating CI problems. In the end turned out to be a regression and added a revert in #7953 PR. It should be finished by 2023-10-11. Next Day: Next day I will be working on the #7898 task. Discover the source of new timeouts in a PR for #7898. |
Hubert Plociniczak reports a new STANDUP for yesterday (2023-10-09): Progress: The source of sporadic timeouts is a rare but still possible deadlock between two types of locks (context locks and lock manager). Details are in the ticket. Still unsure how to solve this properly. Will sync with Radek the next today since he wrote the the latter. It should be finished by 2023-10-11. Next Day: Next day I will be working on the #7898 task. Find a fix for deadlock. |
This change partially reverts #6998 which introduced synchronous commands to a) avoid deadlocks b) ensure the correct execution of close/open commands c) avoid data corruption when applying edits. Sadly, it now turns out that the possibility of synchronous commands is at odds with how commands are expected to execute. More importantly, there are commands comming from Language Server but there are commands comming from Runtime as well. When combined with context locks it creates an environment for deadlocks as described in #7898. This change ensures that commands are always executed asynchronously but also we can also mark some commands to preserve the order of execution wrt to their submission. That way no deadlocks arise and there is do data corruption of the enso code.
This change partially reverts #6998 which introduced synchronous commands to a) avoid deadlocks b) ensure the correct execution of close/open commands c) avoid data corruption when applying edits. Sadly, it now turns out that the possibility of synchronous commands is at odds with how commands are expected to execute. More importantly, there are commands comming from Language Server but there are commands comming from Runtime as well. When combined with context locks it creates an environment for deadlocks as described in #7898. This change ensures that commands are always executed asynchronously but also we can also mark some commands to preserve the order of execution wrt to their submission. That way no deadlocks arise and there is do data corruption of the enso code.
This change partially reverts #6998 which introduced synchronous commands to a) avoid deadlocks b) ensure the correct execution of close/open commands c) avoid data corruption when applying edits. Sadly, it now turns out that the possibility of synchronous commands is at odds with how commands are expected to execute. More importantly, there are commands comming from Language Server but there are commands comming from Runtime as well. When combined with context locks it creates an environment for deadlocks as described in #7898. This change ensures that commands are always executed asynchronously but also we can also mark some commands to preserve the order of execution wrt to their submission. That way no deadlocks arise and there is do data corruption of the enso code.
Rather than using an overly complicated locking mechanism switched to a `BlockingQueue` and a separate thread pool quaranteeing that commands are executed in order but they are also non-blocking. This change partially reverts #6998 which introduced synchronous commands to a) avoid deadlocks b) ensure the correct execution of close/open commands c) avoid data corruption when applying edits. Sadly, it now turns out that the possibility of synchronous commands is at odds with how commands are expected to execute. More importantly, there are commands comming from Language Server but there are commands comming from Runtime as well. When combined with context locks it creates an environment for deadlocks as described in #7898.
Hubert Plociniczak reports a new STANDUP for yesterday (2023-10-10): Progress: Added ordering for a selected number of commands, thus eliminating a deadlock and still maintaining a fix for previous problems. It should be finished by 2023-10-11. Next Day: Next day I will be working on the #7898 task. More testing, address PR review. |
Hubert Plociniczak reports a new STANDUP for yesterday (2023-10-11): Progress: The previous fix was unnecessarily over-engineered. Re-written it based on the feedback and now the fix is much easier to maintain and review. It should be finished by 2023-10-11. Next Day: Next day I will be working on the #7978 task. See if my fix also fixed problems with edits people ocassionally see in the old GUI. |
It seems that Runtime Connector wasn't respecting the protocol it defined itself. The connector should be waiting on the `Api.InitializedNotification` message and only then start forwarding messages. So far it seems this hasn't been a problem, or at least wasn't reported as such, because initialization was fast enough. Modified `Handler` so that we are certain that its fields hold initialized values when being accessed by different threads. Should fix problems mentioned in #7898.
Created a follow up ticket for suggestions' db problems #8033. Other than that I no longer see timeouts. |
The new GUI sends its request faster than the old one, and it triggers issues sometimes.
To reproduce:
wip/farmaazon/create-execution-ctx
.npm install
thennpm --workspace=enso-gui2 run dev
Sometimes the execution context creation returns "request timeout" error
Sometimes the RPC methods are successful, but no update arrives. Sometimes it works properly.
Here are logs from my 3 attempts: first one was without updates, next two returned "request timeout" error.
enso-project-manager-2023-09-26.0.log
The text was updated successfully, but these errors were encountered: