Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Run LSP handlers consecutively by default #361

Merged
merged 25 commits into from
May 30, 2024
Merged

Run LSP handlers consecutively by default #361

merged 25 commits into from
May 30, 2024

Conversation

lionel-
Copy link
Contributor

@lionel- lionel- commented May 17, 2024

Branched from #360 (Follow this link to see preparations for this PR)

Addresses posit-dev/positron#2692.
Closes posit-dev/positron#2999.
Supersedes and closes #340.

This PR refactors our LSP server to solve persistent ordering issues of message ordering and corruption of internal state that have caused Ark to crash periodically (posit-dev/positron#271, posit-dev/positron#340). We've implemented a number of workarounds over the last year (ffd8b27, #45) but we still observe crashes caused by stale cursors and ranges (posit-dev/positron#2692). This has also been brought up by beta testers.

@DavisVaughan found out that the ruff project recently switched from tower-lsp to lsp-server (from the rust-analyzer project) for reasons relevant to us here: astral-sh/ruff#10158. See also this discussion on the tower-lsp repo: ebkalderon/tower-lsp#284. The gist is that while tower-lsp does call our message handlers in a single task (and thus on a single thread) in the correct order, any await point within the handler will cause a transfer of control to the next handler. In particular, we used to send log messages on handler entry and this was an await point so the ordering of our handlers was doomed from the get-go.

My first attempt to fix this was #340 which takes the approach of synchronising the handlers with a read-write lock. This lock allowed concurrent read handlers but forced write handlers that update the state to wait for the read handlers to finish before running. Any incoming requests or notifications at that point would also be queued by the RwLock until the write handler was done. However I ended up deciding against this approach for these reasons:

  • To allow Ark to scale to more complex code analysis, we need to preserve the ability to spawn longer-running tasks. And we can't wait for these to complete everytime we get a write request on our lock, as that would overly reduce our throughput. (This reason is no longer relevant now that WorldState is safely clonable though.)

  • I think it's safer to turn off all concurrency between handlers by default and enable it on a case by case basis after giving appropriate consideration. While the LSP protocol allows for out of order responses, it vaguely stipulates that doing so should not affect the correctness of the responses. I interpret this as meaning that requests responding with text edits (formatting, refactoring, but also some special completions) should not be reordered. Handling messages sequentially by default is a safer stance. The new setup is easier to reason about as a result.

In this PR, we now handle each message in turn. The handlers can still be async (though almost all of them are now synchronous) but they must resolve completely before the next handler can run. This is supported by a "main loop" to which we relay messages from the client (and the Jupyter kernel, see #359) via a channel. Handlers must return an anyhow::Result and errors are automatically logged (and propagated as a jsonrpc error response). The loop is very close to the one running in rust-analyzer: https://github.com/rust-lang/rust-analyzer/blob/83ba42043166948db91fcfcfe30e0b7eac10b3d5/crates/rust-analyzer/src/main_loop.rs#L155-L164. This loop owns the world state (see discussion in #358) and dispatches it to handlers.

The second part of fixing integrity issues is that the world state has become a pure value. All synchronisation and interior mutability (e.g. through the dash map of documents) has been removed. This means that we can clone the state to create a snapshot and for handlers running on long blocking tasks. If a document update arrives concurrently, it will not affect the integrity of these background tasks.

Long-running handlers on spawned tasks will respond to the client in arbitrary order. In the future we could synchronise these responses if needed, for all tasks or a subset of them.

There is also a separate auxiliary loop for latency sensitive tasks, in particular logging, but also things like diagnostics publication. The main loop is not appropriate for these because each tick might take milliseconds. We don't want log messages to be queued there as the latency would make it harder to understand the causality of events. This loop is also in charge of joining background tasks to immediately log any errors or panics that might have occurred.

Logging is no longer async nor blocking, and no longer requires a reference to the backend. I've added new macros such as lsp::log_error!() that can now be used anywhere in the LSP, including in synchronous contexts. I propose that we now consistently use these to log messages from the LSP. This will unclutter the Jupyter kernel log and allow to see the messages in their context (logged requests).

I've also added some utils to the lsp:: module, like lsp::spawn_blocking() or lsp::publish_diagnostics(). All of these are intended to be called with the lsp:: prefix.

I've removed the workarounds we implemented in Document::on_did_update(). These should no longer be necessary and were also making the incorrect assumption that document versions were consecutively increasing, whereas the LSP protocol allows clients to skip versions. The only guarantee is that the versions are monotonically increasing. We still check for this invariant and panic if that's not the case. I think there is no way to keep running with out of sync state. If this panic comes up in practice and is not the result of a synchronisation bug, we could replace the panic with an orderly shutdown to start over.

Orderly shutdowns should be easy to implement as both async loops, and all their associated tasks and state, are automatically dropped when the tower-lsp backend is dropped.

I've organised files in the following way:

  • main_loop.rs: Implements the main and auxiliary loops.
  • state.rs: Defines WorldState, the source of inputs for LSP handlers.
  • state_handlers.rs: Implements handlers for state-altering notifications. These require an exclusive reference to the world state.
  • handlers.rs: Implements read-only handlers. These take LSP inputs and prepare them before calling other entry points of the LSP.

@lionel- lionel- force-pushed the feature/diagnostics-event-loop branch from 70a53ef to 096e23e Compare May 29, 2024 12:11
@lionel- lionel- force-pushed the bugfix/lsp-sync branch from 919645e to 9763c7b Compare May 29, 2024 13:36
},
Err(err) => log::error!("Can't retrieve console inputs: {err:?}"),
}
self.send_lsp(LspEvent::RefreshAllDiagnostics());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am noting that this has disappeared, but presumably is added back in elsewhere some other way?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like it has been absorbed into this

pub(crate) fn did_change_console_inputs(
    inputs: ConsoleInputs,
    state: &mut WorldState,
) -> anyhow::Result<()> {
    state.console_scopes = inputs.console_scopes;
    state.installed_packages = inputs.installed_packages;

    // We currently rely on global console scopes for diagnostics, in particular
    // during package development in conjunction with `devtools::load_all()`.
    // Ideally diagnostics would not rely on these though, and we wouldn't need
    // to refresh from here.
    lsp::spawn_diagnostics_refresh_all(state.clone());

    Ok(())
}

});
}
// Wait for response from main loop
response_rx.recv().await.unwrap()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC, this await allows tower-lsp to switch away and start another handler, but all that can really do is "relay" an additional event to the main loop, which is in charge of keeping the requests and their responses in order

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's exactly right.

crates/ark/src/lsp/backend.rs Outdated Show resolved Hide resolved
Comment on lines +397 to +395
fn new_jsonrpc_error(message: String) -> jsonrpc::Error {
jsonrpc::Error {
code: jsonrpc::ErrorCode::ServerError(-1),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you know where these errors end up getting logged?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would imagine in VS Code's LSP client, they are probably turned into notifications or maybe console messages.

crates/ark/src/lsp/diagnostics.rs Show resolved Hide resolved
crates/ark/src/lsp/main_loop.rs Outdated Show resolved Hide resolved
crates/ark/src/lsp/main_loop.rs Outdated Show resolved Hide resolved
crates/ark/src/lsp/main_loop.rs Outdated Show resolved Hide resolved
Comment on lines +74 to +77
lsp::spawn_blocking(|| {
indexer::start(folders);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding a reminder here about doing the diagnostic refresh after each indexer finishes

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ended up doing the refresh on did_open, just like we do in did_update.

Note that there's a race condition between the initial indexer and our state handlers. There was already one for did_update and now there is one too for did_open. Nothing bad can happen AFAICS but it's possible we get the indexer in a weird state in rare cases. I expect that we'll completely review the way this works for RC though (move the index to the world state and compute it on demand with salsa caching).

I've confirmed that the the failure case you brought up in the last PR is now fixed.

Comment on lines +36 to +42
// The global instance of the auxiliary event channel, used for sending log
// messages or spawning threads from free functions. Since this is an unbounded
// channel, sending a log message is not async nor blocking. Tokio senders are
// Send and Sync so this global variable can be safely shared across threads.
static mut AUXILIARY_EVENT_TX: std::cell::OnceCell<TokioUnboundedSender<AuxiliaryEvent>> =
std::cell::OnceCell::new();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC, it would be possible, but annoying, to instead have auxiliary_event_tx and auxiliary_event_rx live in GlobalState as well, rather than being a global object?

Like, every call to log_error!() would have to pass through auxiliary_event_tx, which would get passed through to main_loop::log() so that it could then send the message along. And to be able to do that you'd also have to pass auxiliary_event_tx through any handlers that log.

As annoying as that is, it might be worth considering it if it removes this global state. This is the only thing left that I'm a little uncomfortable with.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would force all function calls that log an event to have a state parameter which would add a lot of friction. I think this is the sort of things for which the ergonomic gain from a global context is worth it.

@lionel- lionel- force-pushed the feature/diagnostics-event-loop branch 2 times, most recently from fba54fa to f01c670 Compare May 30, 2024 08:07
Base automatically changed from feature/diagnostics-event-loop to main May 30, 2024 08:07
@lionel- lionel- force-pushed the bugfix/lsp-sync branch from b125d2b to f70c460 Compare May 30, 2024 08:08
@lionel- lionel- merged commit fecf690 into main May 30, 2024
1 check passed
@lionel- lionel- deleted the bugfix/lsp-sync branch May 30, 2024 08:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Ark: Synchronise handling of LSP messages
2 participants