Run LSP handlers consecutively by default #361
Conversation
Force-pushed from 70a53ef to 096e23e
    },
    Err(err) => log::error!("Can't retrieve console inputs: {err:?}"),
}
self.send_lsp(LspEvent::RefreshAllDiagnostics());
I am noting that this has disappeared, but presumably it's added back elsewhere in some other way?
Looks like it has been absorbed into this:
pub(crate) fn did_change_console_inputs(
    inputs: ConsoleInputs,
    state: &mut WorldState,
) -> anyhow::Result<()> {
    state.console_scopes = inputs.console_scopes;
    state.installed_packages = inputs.installed_packages;
    // We currently rely on global console scopes for diagnostics, in particular
    // during package development in conjunction with `devtools::load_all()`.
    // Ideally diagnostics would not rely on these though, and we wouldn't need
    // to refresh from here.
    lsp::spawn_diagnostics_refresh_all(state.clone());
    Ok(())
}
});
}
// Wait for response from main loop
response_rx.recv().await.unwrap()
IIUC, this await allows tower-lsp to switch away and start another handler, but all that can really do is "relay" an additional event to the main loop, which is in charge of keeping the requests and their responses in order.
That's exactly right.
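For illustration, a minimal sketch of that relay pattern. The names (Event, main_loop_tx, handle_request) are hypothetical, and a tokio oneshot channel stands in for the response channel awaited in the excerpt above:

use tokio::sync::{mpsc, oneshot};

// A message relayed from a handler to the main loop. `Event` and the `String`
// payload are placeholders, not the actual types in this PR.
enum Event {
    Request {
        // The main loop sends the result back on this channel once it has
        // processed everything queued before this request.
        response_tx: oneshot::Sender<String>,
    },
}

async fn handle_request(main_loop_tx: mpsc::UnboundedSender<Event>) -> String {
    let (response_tx, response_rx) = oneshot::channel();

    // Relaying the event is synchronous and never blocks (unbounded channel).
    main_loop_tx.send(Event::Request { response_tx }).unwrap();

    // The only await point: tower-lsp may switch to another handler here, but
    // that handler can only relay its own event, and the main loop processes
    // events in the order they were sent.
    response_rx.await.unwrap()
}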
fn new_jsonrpc_error(message: String) -> jsonrpc::Error {
    jsonrpc::Error {
        code: jsonrpc::ErrorCode::ServerError(-1),
Do you know where these errors end up getting logged?
I would imagine they end up in VS Code's LSP client, where they are probably turned into notifications or maybe console messages.
lsp::spawn_blocking(|| {
    indexer::start(folders);
Adding a reminder here about doing the diagnostic refresh after each indexer finishes.
I ended up doing the refresh on did_open, just like we do in did_update.
Note that there's a race condition between the initial indexer and our state handlers. There was already one for did_update, and now there is one for did_open too. Nothing bad can happen AFAICS, but it's possible we get the indexer in a weird state in rare cases. I expect that we'll completely review the way this works for RC though (move the index to the world state and compute it on demand with salsa caching).
I've confirmed that the failure case you brought up in the last PR is now fixed.
// The global instance of the auxiliary event channel, used for sending log
// messages or spawning threads from free functions. Since this is an unbounded
// channel, sending a log message is not async nor blocking. Tokio senders are
// Send and Sync so this global variable can be safely shared across threads.
static mut AUXILIARY_EVENT_TX: std::cell::OnceCell<TokioUnboundedSender<AuxiliaryEvent>> =
    std::cell::OnceCell::new();
IIUC, it would be possible, but annoying, to instead have auxiliary_event_tx and auxiliary_event_rx live in GlobalState as well, rather than being a global object?
Like, every call to log_error!() would have to pass through auxiliary_event_tx, which would get passed through to main_loop::log() so that it could then send the message along. And to be able to do that you'd also have to pass auxiliary_event_tx through any handlers that log.
As annoying as that is, it might be worth considering it if it removes this global state. This is the only thing left that I'm a little uncomfortable with.
This would force all function calls that log an event to have a state parameter, which would add a lot of friction. I think this is the sort of thing for which the ergonomic gain from a global context is worth it.
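For illustration, a minimal sketch of the global-channel approach being discussed. This is not the actual implementation (the code above uses static mut with std::cell::OnceCell); it uses the thread-safe std::sync::OnceLock (Rust 1.70+) and simplified names purely to show why free functions can log without threading a state parameter through their callers:

use std::sync::OnceLock;
use tokio::sync::mpsc::UnboundedSender;

// Placeholder event type; the real enum carries more variants.
enum AuxiliaryEvent {
    Log(String),
}

static AUXILIARY_EVENT_TX: OnceLock<UnboundedSender<AuxiliaryEvent>> = OnceLock::new();

// Called once at startup, after the auxiliary loop has been created.
fn init_auxiliary_tx(tx: UnboundedSender<AuxiliaryEvent>) {
    AUXILIARY_EVENT_TX.set(tx).ok();
}

// Any free function can log without a `state` parameter; sending on an
// unbounded channel is neither async nor blocking.
fn log_error(message: String) {
    if let Some(tx) = AUXILIARY_EVENT_TX.get() {
        tx.send(AuxiliaryEvent::Log(message)).ok();
    }
}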
Force-pushed from fba54fa to f01c670
To avoid write handlers blocking for too long when there are shared refs (e.g. snapshots on a background thread)
Co-authored-by: Davis Vaughan <[email protected]>
Branched from #360 (Follow this link to see preparations for this PR)
Addresses posit-dev/positron#2692.
Closes posit-dev/positron#2999.
Supersedes and closes #340.
This PR refactors our LSP server to solve persistent issues with message ordering and corruption of internal state that have caused Ark to crash periodically (posit-dev/positron#271, posit-dev/positron#340). We've implemented a number of workarounds over the last year (ffd8b27, #45), but we still observe crashes caused by stale cursors and ranges (posit-dev/positron#2692). This has also been brought up by beta testers.
@DavisVaughan found out that the ruff project recently switched from tower-lsp to lsp-server (from the rust-analyzer project) for reasons relevant to us here: astral-sh/ruff#10158. See also this discussion on the tower-lsp repo: ebkalderon/tower-lsp#284. The gist is that while tower-lsp does call our message handlers in a single task (and thus on a single thread) in the correct order, any await point within a handler will cause a transfer of control to the next handler. In particular, we used to send log messages on handler entry, and this was an await point, so the ordering of our handlers was doomed from the get-go.

My first attempt to fix this was #340, which takes the approach of synchronising the handlers with a read-write lock. This lock allowed concurrent read handlers but forced write handlers that update the state to wait for the read handlers to finish before running. Any incoming requests or notifications at that point would also be queued by the RwLock until the write handler was done. However, I ended up deciding against this approach for these reasons:

1. To allow Ark to scale to more complex code analysis, we need to preserve the ability to spawn longer-running tasks. And we can't wait for these to complete every time we get a write request on our lock, as that would overly reduce our throughput. (This reason is no longer relevant now that WorldState is safely clonable though.)

2. I think it's safer to turn off all concurrency between handlers by default and enable it on a case-by-case basis after giving appropriate consideration. While the LSP protocol allows for out-of-order responses, it vaguely stipulates that doing so should not affect the correctness of the responses. I interpret this as meaning that requests responding with text edits (formatting, refactoring, but also some special completions) should not be reordered. Handling messages sequentially by default is a safer stance, and the new setup is easier to reason about as a result.
In this PR, we now handle each message in turn. The handlers can still be async (though almost all of them are now synchronous), but they must resolve completely before the next handler can run. This is supported by a "main loop" to which we relay messages from the client (and the Jupyter kernel, see #359) via a channel. Handlers must return an anyhow::Result, and errors are automatically logged (and propagated as a jsonrpc error response). The loop is very close to the one running in rust-analyzer: https://github.com/rust-lang/rust-analyzer/blob/83ba42043166948db91fcfcfe30e0b7eac10b3d5/crates/rust-analyzer/src/main_loop.rs#L155-L164. This loop owns the world state (see discussion in #358) and dispatches it to handlers.

The second part of fixing the integrity issues is that the world state has become a pure value. All synchronisation and interior mutability (e.g. through the dash map of documents) has been removed. This means that we can clone the state to create a snapshot, and for handlers running on long blocking tasks. If a document update arrives concurrently, it will not affect the integrity of these background tasks.
Long-running handlers on spawned tasks will respond to the client in arbitrary order. In the future we could synchronise these responses if needed, for all tasks or a subset of them.
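A sketch of how such a long-running handler can work on a snapshot now that the state is a pure value. It reuses the same hypothetical WorldState shape as above; spawn_completions and expensive_completions are made-up names:

use tokio::{sync::oneshot, task};

#[derive(Clone, Default)]
struct WorldState {
    documents: std::collections::HashMap<String, String>,
}

fn spawn_completions(
    state: &WorldState,
    uri: String,
    response_tx: oneshot::Sender<anyhow::Result<Vec<String>>>,
) {
    // The state is a plain value, so the task works on its own snapshot: a
    // document update arriving while the task runs cannot corrupt its inputs.
    let snapshot = state.clone();
    let _handle = task::spawn_blocking(move || {
        let result = expensive_completions(&snapshot, &uri);
        // Responses from spawned tasks may reach the client in arbitrary order.
        let _ = response_tx.send(result);
    });
    // In the real setup, the join handle would be relayed to the auxiliary loop
    // so that errors and panics are surfaced in the logs.
}

fn expensive_completions(state: &WorldState, uri: &str) -> anyhow::Result<Vec<String>> {
    Ok(state
        .documents
        .get(uri)
        .map(|text| text.split_whitespace().map(String::from).collect())
        .unwrap_or_default())
}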
There is also a separate auxiliary loop for latency-sensitive tasks, in particular logging, but also things like diagnostics publication. The main loop is not appropriate for these because each tick might take milliseconds. We don't want log messages to be queued there, as the latency would make it harder to understand the causality of events. This loop is also in charge of joining background tasks so that any errors or panics are logged immediately.
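A sketch of such an auxiliary loop, again with hypothetical names. It assumes that spawned task handles are joined with FuturesUnordered from the futures crate, and logs to stderr purely as a placeholder; the real implementation may differ:

use futures::stream::{FuturesUnordered, StreamExt};
use tokio::{sync::mpsc, task::JoinHandle};

enum AuxiliaryEvent {
    Log(String),
    SpawnedTask(JoinHandle<anyhow::Result<()>>),
}

async fn auxiliary_loop(mut events_rx: mpsc::UnboundedReceiver<AuxiliaryEvent>) {
    // Background tasks currently being joined.
    let mut tasks: FuturesUnordered<JoinHandle<anyhow::Result<()>>> = FuturesUnordered::new();

    loop {
        tokio::select! {
            event = events_rx.recv() => {
                match event {
                    // Log messages are handled here rather than in the main loop
                    // so they are not queued behind potentially slow handlers.
                    Some(AuxiliaryEvent::Log(message)) => eprintln!("{message}"),
                    Some(AuxiliaryEvent::SpawnedTask(handle)) => tasks.push(handle),
                    None => break,
                }
            }
            // Join finished tasks promptly so errors and panics are logged as
            // soon as they occur.
            Some(joined) = tasks.next(), if !tasks.is_empty() => {
                match joined {
                    Ok(Ok(())) => {}
                    Ok(Err(err)) => eprintln!("Task failed: {err:?}"),
                    Err(join_err) => eprintln!("Task panicked: {join_err:?}"),
                }
            }
        }
    }
}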
Logging is no longer async or blocking, and no longer requires a reference to the backend. I've added new macros such as lsp::log_error!() that can now be used anywhere in the LSP, including in synchronous contexts. I propose that we now consistently use these to log messages from the LSP. This will unclutter the Jupyter kernel log and allow us to see the messages in their context (logged requests).

I've also added some utils to the lsp:: module, like lsp::spawn_blocking() or lsp::publish_diagnostics(). All of these are intended to be called with the lsp:: prefix.
I've removed the workarounds we implemented in Document::on_did_update(). These should no longer be necessary, and they were also making the incorrect assumption that document versions increase consecutively, whereas the LSP protocol allows clients to skip versions. The only guarantee is that the versions are monotonically increasing. We still check for this invariant and panic if it doesn't hold. I think there is no way to keep running with out-of-sync state. If this panic comes up in practice and is not the result of a synchronisation bug, we could replace the panic with an orderly shutdown to start over.

Orderly shutdowns should be easy to implement, as both async loops, and all their associated tasks and state, are automatically dropped when the tower-lsp backend is dropped.
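A sketch of that invariant check; this is not the exact code in Document::on_did_update(), just an illustration of the assumption described above:

struct Document {
    version: Option<i32>,
    // ...other fields elided
}

impl Document {
    fn on_did_update(&mut self, new_version: Option<i32>) {
        if let (Some(old), Some(new)) = (self.version, new_version) {
            // Clients may skip versions, so we must not assume `new == old + 1`.
            // The only invariant is that versions keep increasing.
            if new <= old {
                panic!("Document versions must be monotonically increasing, got {new} after {old}");
            }
        }
        self.version = new_version;
    }
}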
I've organised files in the following way:

- main_loop.rs: Implements the main and auxiliary loops.
- state.rs: Defines WorldState, the source of inputs for LSP handlers.
- state_handlers.rs: Implements handlers for state-altering notifications. These require an exclusive reference to the world state.
- handlers.rs: Implements read-only handlers. These take LSP inputs and prepare them before calling other entry points of the LSP.