Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unreliable Language Server behavior in the new GUI #7898

Closed
farmaazon opened this issue Sep 26, 2023 · 20 comments
Closed

Unreliable Language Server behavior in the new GUI #7898

farmaazon opened this issue Sep 26, 2023 · 20 comments
Assignees
Labels
--bug Type: bug -language-server p-highest Should be completed ASAP

Comments

@farmaazon
Copy link
Contributor

The new GUI sends its request faster than the old one, and it triggers issues sometimes.

To reproduce:

  1. Switch to wip/farmaazon/create-execution-ctx.
  2. run npm install then npm --workspace=enso-gui2 run dev
  3. run project manager (I tested on 09-26 nightly)
  4. open chrome and go to http://localhost:5173/
  5. open any project and check 127.0.0.1 connection in network tab

Sometimes the execution context creation returns "request timeout" error

{"jsonrpc":"2.0","id":"0","method":"session/initProtocolConnection","params":{"clientId":"6fbfded4-8884-45c5-849a-4694604ca355"}}	129	
13:48:13.194
{"jsonrpc":"2.0","id":"0","result":{"contentRoots":[{"type":"Project","id":"57592eda-e01f-493b-a864-e48aaa03a5fe"},{"type":"Home","id":"5e74e8d2-f18d-4e5a-acee-30b8f7567225"},{"type":"FileSystemRoot","id":"1d24271b-7eb7-4e66-b9b3-fb948f538a6c","path":"/"}]}}	258	
13:48:15.316
{"jsonrpc":"2.0","id":"1","method":"executionContext/create","params":{"contextId":"11cb9d39-d85e-49de-b401-c81624ca6f0b"}}	123	
13:48:15.316
{"jsonrpc":"2.0","id":"1","error":{"code":11,"message":"Request timeout"}}

Sometimes the RPC methods are successful, but no update arrives. Sometimes it works properly.

Here are logs from my 3 attempts: first one was without updates, next two returned "request timeout" error.

enso-project-manager-2023-09-26.0.log

@farmaazon
Copy link
Contributor Author

It's make my recent testing quite hard, as I must try several times until I get updates (and reading updates I implement).

@farmaazon farmaazon added the p-highest Should be completed ASAP label Sep 26, 2023
@hubertp
Copy link
Collaborator

hubertp commented Sep 26, 2023

This looks like a race-condition in the initialization and new GUI not waiting for the startup to complete?

[ERROR] [2023-09-26T14:41:03+02:00] [akka.actor.OneForOneStrategy] Cannot invoke "org.enso.interpreter.instrument.execution.CommandProcessor.invoke(org.enso.interpreter.instrument.command.Command)" because the return value of "org.enso.interpreter.instrument.Handler.commandProcessor()" is null
java.lang.NullPointerException: Cannot invoke "org.enso.interpreter.instrument.execution.CommandProcessor.invoke(org.enso.interpreter.instrument.command.Command)" because the return value of "org.enso.interpreter.instrument.Handler.commandProcessor()" is null
	at org.enso.interpreter.instrument.Handler.onMessage(Handler.scala:128)
	at org.enso.interpreter.instrument.Endpoint.$anonfun$sendBinary$1(Handler.scala:66)
	at org.enso.interpreter.instrument.Endpoint.$anonfun$sendBinary$1$adapted(Handler.scala:64)
	at scala.Option.foreach(Option.scala:437)
	at org.enso.interpreter.instrument.Endpoint.sendBinary(Handler.scala:64)
	at org.graalvm.truffle/com.oracle.truffle.api.instrumentation.TruffleInstrument$Env$MessageTransportProxy$MessageEndpointProxy.sendBinary(TruffleInstrument.java:1090)
	at org.enso.languageserver.runtime.RuntimeConnector$$anonfun$initialized$1.applyOrElse(RuntimeConnector.scala:79)
	at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:35)
	at org.enso.logger.akka.ActorLoggingReceive.apply(ActorLoggingReceive.scala:35)
	at org.enso.logger.akka.ActorLoggingReceive.apply(ActorLoggingReceive.scala:14)
	at scala.PartialFunction.applyOrElse(PartialFunction.scala:214)
	at scala.PartialFunction.applyOrElse$(PartialFunction.scala:213)
	at org.enso.logger.akka.ActorLoggingReceive.applyOrElse(ActorLoggingReceive.scala:14)
	at scala.PartialFunction$Combined.applyOrElse(PartialFunction.scala:305)
	at akka.actor.Actor.aroundReceive(Actor.scala:537)
	at akka.actor.Actor.aroundReceive$(Actor.scala:535)
	at org.enso.languageserver.runtime.RuntimeConnector.aroundReceive(RuntimeConnector.scala:20)
	at akka.actor.ActorCell.receiveMessage(ActorCell.scala:579)
	at akka.actor.ActorCell.invoke(ActorCell.scala:547)
	at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:270)
	at akka.dispatch.Mailbox.run(Mailbox.scala:231)
	at akka.dispatch.Mailbox.exec(Mailbox.scala:243)
	at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:373)
	at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1182)
	at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1655)
	at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1622)
	at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:165)

@farmaazon
Copy link
Contributor Author

And new GUI not waiting for the startup to complete?

There never was any message about "initialization completness" sent from engine, except first "execution status" (which we use to hide "compiling standard library") - but that requires creating execution context first, which often fails in GUI2.

@hubertp
Copy link
Collaborator

hubertp commented Sep 27, 2023

And new GUI not waiting for the startup to complete?

There never was any message about "initialization completness" sent from engine, except first "execution status" (which we use to hide "compiling standard library") - but that requires creating execution context first, which often fails in GUI2.

It's definitely related to that exception since in both cases that failed you were getting this NPE. I will dig into the sequence in more detail.

@4e6
Copy link
Contributor

4e6 commented Sep 27, 2023

There never was any message about "initialization completeness"

The response to session/initProtocolConnection should indicate that the language server is ready. We may need to wait for the Truffle context initialization before sending the response.

@hubertp
Copy link
Collaborator

hubertp commented Sep 27, 2023

I've been analyzing the logs again and something isn't right.

I can see 2 messages "RPC session initialized for client" for different clients in close succession.

  1. Is that normal for the new IDE?
  2. Isn't the new IDE trying to proceed with regular messages (open file etc) for the second client while only the first one finished the initialization and received the session/initProtocolConnection response?

@farmaazon
Copy link
Contributor Author

  1. Yes, it is. There are two clients, (we have a separate service which synchronizes y-js documents with the language sever),.
  2. I doubt, they are separate processes. You see the copied network log from browser in the issue's description - there is a clear response to initProtocol before proceeding to creating execution context.

@hubertp
Copy link
Collaborator

hubertp commented Sep 27, 2023

  1. Yes, it is. There are two clients, (we have a separate service which synchronizes y-js documents with the language sever),.

OK. Then I'm not surprised of this ticket anymore. In theory, we are able to support multiple clients, especially with collaborative buffer. In practice we probably never exercised initialization enough, nor needed to, to be confident that this will not cause problems.

@jdunkerley jdunkerley removed the triage label Sep 29, 2023
@jdunkerley jdunkerley moved this from ❓New to 📤 Backlog in Issues Board Oct 3, 2023
@hubertp hubertp moved this from 📤 Backlog to 🔧 Implementation in Issues Board Oct 4, 2023
@enso-bot
Copy link

enso-bot bot commented Oct 5, 2023

Hubert Plociniczak reports a new STANDUP for yesterday (2023-10-04):

Progress: Found the potential race condition when two clients initialize instrumentation. Not yet sure how to fix it as I'm not so familiar with that part. It should be finished by 2023-10-05.

Next Day: Next day I will be working on the #7898 task. Come up with a fix.

@JaroslavTulach
Copy link
Member

This looks like a race-condition in the initialization and new GUI not waiting for the startup to complete?

[ERROR] [2023-09-26T14:41:03+02:00] [akka.actor.OneForOneStrategy] Cannot invoke "org.enso.interpreter.instrument.execution.CommandProcessor.invoke(org.enso.interpreter.instrument.command.Command)" because the return value of "org.enso.interpreter.instrument.Handler.commandProcessor()" is null
java.lang.NullPointerException: Cannot invoke "org.enso.interpreter.instrument.execution.CommandProcessor.invoke(org.enso.interpreter.instrument.command.Command)" because the return value of "org.enso.interpreter.instrument.Handler.commandProcessor()" is null
	at org.enso.interpreter.instrument.Handler.onMessage(Handler.scala:128)

Hubert has "walked me thru the code" at

commandProcessor = new CommandExecutionEngine(interpreterCtx)
& co.

The value of commandProcessor is assigned on line 108 and only then the message InitializedNotification is sent. Still the field commandProcessor is null - this could happen because of violation of Java memory model.

Possibly, the set value of commandProcessor hasn't yet been stored into global memory when the message receiver tries to access it. Some for of synchronization, transient field or final field is needed.

@hubertp
Copy link
Collaborator

hubertp commented Oct 6, 2023

I feel like this more LS clients feature has opened a pandora's box. Here is another random failure:

[WARN] [2023-10-06T12:54:32+02:00] [enso] Execution of function main failed (Cannot invoke "com.oracle.truffle.api.source.SourceSection.getEndLine()" because the return value of "org.enso.interpreter.node.ClosureRootNode.getSourceSection()" is null).
java.lang.NullPointerException: Cannot invoke "com.oracle.truffle.api.source.SourceSection.getEndLine()" because the return value of "org.enso.interpreter.node.ClosureRootNode.getSourceSection()" is null
	at org.enso.interpreter.instrument.IdExecutionInstrument.bind(IdExecutionInstrument.java:255)
	at org.enso.interpreter.service.ExecutionService.lambda$execute$1(ExecutionService.java:170)
	at java.base/java.util.Optional.map(Optional.java:260)
	at org.enso.interpreter.service.ExecutionService.execute(ExecutionService.java:170)
	at org.enso.interpreter.service.ExecutionService.execute(ExecutionService.java:219)
	at org.enso.interpreter.instrument.job.ProgramExecutionSupport$.executeProgram(ProgramExecutionSupport.scala:111)
	at org.enso.interpreter.instrument.job.ProgramExecutionSupport$.$anonfun$runProgram$3(ProgramExecutionSupport.scala:220)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
	at cats.syntax.EitherObjectOps$.catchNonFatal$extension(either.scala:391)
	at org.enso.interpreter.instrument.job.ProgramExecutionSupport$.$anonfun$runProgram$2(ProgramExecutionSupport.scala:220)
	at scala.util.Either.flatMap(Either.scala:352)
	at org.enso.interpreter.instrument.job.ProgramExecutionSupport$.runProgram(ProgramExecutionSupport.scala:214)
	at org.enso.interpreter.instrument.job.ExecuteJob.run(ExecuteJob.scala:42)
	at org.enso.interpreter.instrument.job.ExecuteJob.run(ExecuteJob.scala:18)
	at org.enso.interpreter.instrument.execution.JobExecutionEngine.$anonfun$runInternal$1(JobExecutionEngine.scala:113)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:833)
	at org.graalvm.truffle/com.oracle.truffle.polyglot.PolyglotThread.access$001(PolyglotThread.java:53)

(once I fixed the previous problem)

@enso-bot
Copy link

enso-bot bot commented Oct 6, 2023

Hubert Plociniczak reports a new STANDUP for yesterday (2023-10-05):

Progress: Fixed two locations where we had potential race-condition. PR is up but needs more testing. I think I also saw some random failures still with new GUI. Trying to figure out what is with the failures on CI. It should be finished by 2023-10-05.

Next Day: Next day I will be working on the #7898 task. More testing with new GUI.

@hubertp
Copy link
Collaborator

hubertp commented Oct 9, 2023

Here are the relevant locks that are in the deadlock state:

  • 2 compilation read locks
  • 1 ensure compiled job that is holding a compilation write lock

Essentially the bottleneck is in the endpoint handling of messages which is synchronous.

deadlock.log

@hubertp
Copy link
Collaborator

hubertp commented Oct 9, 2023

Since it is probably not so obvious here is a sequence of events that triggers the problem. I'm pretty sure we might have experienced it occasionally with the old GUI except we never got a stracktrace and circumstances were less race-condition friendly:

  1. PushContextCmd sends a message to Endpoint and starts a background EnsureCompiledJob
  2. OpenFileCmd is started around the same time
  3. EnsureCompiledJob acquires write lock
  4. OpenFileCmd attempts to acquire read lock but waits on (3) to finish first to get the lock
  5. EnsureCompiledJob in its implementation finds appropriate libraries and sends the message to endpoint
  6. (5) waits infinitely for a reply for its message from endpoint
  7. The reply will never arrive because endpoint is busy serving the OpenFileCmd which is stuck waiting on a lock
  8. Deadlock

In fact, EnsureCompiledJob might not necessarily be started by PushContextCmd. There might be just one started in the background some time ago by one of the edits. Also, any command requesting read compilation lock can experience this problem e.g., OpenFileCmd could be replaced by CloseFileCmd.

@enso-bot
Copy link

enso-bot bot commented Oct 9, 2023

Hubert Plociniczak reports a new 🔴 DELAY for the provided date (2023-10-06):

Summary: There is 6 days delay in implementation of the Unreliable Language Server behavior in the new GUI (#7898) task.
It will cause 0 days delay for the delivery of this weekly plan.

Delay Cause: Fixing one problem in the initialization with multiple clients, revealed more bugs that were lurking around.

@enso-bot
Copy link

enso-bot bot commented Oct 9, 2023

Hubert Plociniczak reports a new STANDUP for the provided date (2023-10-06):

Progress: Discovered the problem with messages not being unstashed while waiting for initialization. But the fix revealed more transient failures (less frequent than originally reported and at a later stage). Got distracted by investigating CI problems. In the end turned out to be a regression and added a revert in #7953 PR. It should be finished by 2023-10-11.

Next Day: Next day I will be working on the #7898 task. Discover the source of new timeouts in a PR for #7898.

@enso-bot
Copy link

enso-bot bot commented Oct 10, 2023

Hubert Plociniczak reports a new STANDUP for yesterday (2023-10-09):

Progress: The source of sporadic timeouts is a rare but still possible deadlock between two types of locks (context locks and lock manager). Details are in the ticket. Still unsure how to solve this properly. Will sync with Radek the next today since he wrote the the latter. It should be finished by 2023-10-11.

Next Day: Next day I will be working on the #7898 task. Find a fix for deadlock.

hubertp added a commit that referenced this issue Oct 11, 2023
This change partially reverts #6998 which introduced synchronous
commands to a) avoid deadlocks b) ensure the correct execution of
close/open commands c) avoid data corruption when applying edits.

Sadly, it now turns out that the possibility of synchronous commands is
at odds with how commands are expected to execute. More importantly,
there are commands comming from Language Server but there are commands
comming from Runtime as well. When combined with context locks it
creates an environment for deadlocks as described in #7898.

This change ensures that commands are always executed asynchronously but
also we can also mark some commands to preserve the order of execution
wrt to their submission. That way no deadlocks arise and there is do
data corruption of the enso code.
hubertp added a commit that referenced this issue Oct 11, 2023
This change partially reverts #6998 which introduced synchronous
commands to a) avoid deadlocks b) ensure the correct execution of
close/open commands c) avoid data corruption when applying edits.

Sadly, it now turns out that the possibility of synchronous commands is
at odds with how commands are expected to execute. More importantly,
there are commands comming from Language Server but there are commands
comming from Runtime as well. When combined with context locks it
creates an environment for deadlocks as described in #7898.

This change ensures that commands are always executed asynchronously but
also we can also mark some commands to preserve the order of execution
wrt to their submission. That way no deadlocks arise and there is do
data corruption of the enso code.
hubertp added a commit that referenced this issue Oct 11, 2023
This change partially reverts #6998 which introduced synchronous
commands to a) avoid deadlocks b) ensure the correct execution of
close/open commands c) avoid data corruption when applying edits.

Sadly, it now turns out that the possibility of synchronous commands is
at odds with how commands are expected to execute. More importantly,
there are commands comming from Language Server but there are commands
comming from Runtime as well. When combined with context locks it
creates an environment for deadlocks as described in #7898.

This change ensures that commands are always executed asynchronously but
also we can also mark some commands to preserve the order of execution
wrt to their submission. That way no deadlocks arise and there is do
data corruption of the enso code.
hubertp added a commit that referenced this issue Oct 11, 2023
Rather than using an overly complicated locking mechanism switched to a
`BlockingQueue` and a separate thread pool quaranteeing that commands
are executed in order but they are also non-blocking.

This change partially reverts #6998 which introduced synchronous
commands to a) avoid deadlocks b) ensure the correct execution of
close/open commands c) avoid data corruption when applying edits.

Sadly, it now turns out that the possibility of synchronous commands is
at odds with how commands are expected to execute. More importantly,
there are commands comming from Language Server but there are commands
comming from Runtime as well. When combined with context locks it
creates an environment for deadlocks as described in #7898.
@enso-bot
Copy link

enso-bot bot commented Oct 11, 2023

Hubert Plociniczak reports a new STANDUP for yesterday (2023-10-10):

Progress: Added ordering for a selected number of commands, thus eliminating a deadlock and still maintaining a fix for previous problems. It should be finished by 2023-10-11.

Next Day: Next day I will be working on the #7898 task. More testing, address PR review.

@hubertp hubertp moved this from 🔧 Implementation to 👁️ Code review in Issues Board Oct 12, 2023
@enso-bot
Copy link

enso-bot bot commented Oct 12, 2023

Hubert Plociniczak reports a new STANDUP for yesterday (2023-10-11):

Progress: The previous fix was unnecessarily over-engineered. Re-written it based on the feedback and now the fix is much easier to maintain and review. It should be finished by 2023-10-11.

Next Day: Next day I will be working on the #7978 task. See if my fix also fixed problems with edits people ocassionally see in the old GUI.

mergify bot pushed a commit that referenced this issue Oct 13, 2023
It seems that Runtime Connector wasn't respecting the protocol it defined itself. The connector should be waiting on the `Api.InitializedNotification` message and only then start forwarding messages. So far it seems this hasn't been a problem, or at least wasn't reported as such, because initialization was fast enough.

Modified `Handler` so that we are certain that its fields hold initialized values when being accessed by different threads.

Should fix problems mentioned in #7898.
@hubertp
Copy link
Collaborator

hubertp commented Oct 16, 2023

Created a follow up ticket for suggestions' db problems #8033. Other than that I no longer see timeouts.

@hubertp hubertp closed this as completed Oct 16, 2023
@github-project-automation github-project-automation bot moved this from 👁️ Code review to 🟢 Accepted in Issues Board Oct 16, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
--bug Type: bug -language-server p-highest Should be completed ASAP
Projects
Archived in project
Development

No branches or pull requests

5 participants