Skip to content
This repository has been archived by the owner on Dec 29, 2022. It is now read-only.

Managed concurrency #923

Merged
merged 1 commit into from
Jul 18, 2018
Merged

Managed concurrency #923

merged 1 commit into from
Jul 18, 2018

Conversation

matklad
Copy link
Member

@matklad matklad commented Jun 30, 2018

Hi!

In tests, we currently do synchronization via thread::sleep. It is rather problematic by itself, but it also shows that we don't have a very good control of concurrency inside RLS :(

I think this is a big problem: RLS is going to be a highly concurrent application, so it's better to establish reliable concurrency practices from the start. I don't have too much experience with designing concurrent though, so I am not sure if the proposed approach of dealing with the problem is not insane :D

In my experience, a lot of problems with managing concurrency stem from "fire an forget" approach: if a function schedules some work to be executed outside the dynamic extent of the function call itself (for example, by spawning a thread, or by submitting the work to some queue), and doesn't return a handle to the yet-to-be-computed result, then such function will be impossible to test without resorting to busy-waiting.

Current PR proposes to use a future-like ConcurrentJob object to keep track of tasks. Any background operation returns this object, which can be used to wait for that operation to finish. Dropping ConcurrentJob without waiting for the result in general results in a panic, to protect from accidentally forgetting about a background operation.

The end result is that LsService now has a wait_for_background_jobs which can be used to make sure that all background ops are finished.

@matklad matklad changed the title rManaged concurrency Managed concurrency Jul 1, 2018
@matklad matklad force-pushed the managed-concurrency branch from ba80d25 to 8d33e48 Compare July 1, 2018 07:17
@alexheretic
Copy link
Member

alexheretic commented Jul 1, 2018

Thread sleep is inelegant, but is it a big problem in tests? You wait a bit for what you expect and fail if it never arrives, or arrives wrongly. The absolute worst case is the tests can take a while to fail, and the current timeout seems a little high.

But I'm not sure it shows that we lack good control of concurrency in RLS.

Here I'd be tempted to define "good" as sufficient. Shouldn't we add this ability only when we have a production code need for it?

The other issue is I think we'd generally want to move the feature tests out into integration, or whatever you want to call it, style tests. Tests that start rls as a process and talk to it like a lsp client. If that's still true then in-process hooks won't be of any use to the tests.

@matklad
Copy link
Member Author

matklad commented Jul 1, 2018

Thread sleep is inelegant, but is it a big problem in tests?

It certainly is a problem: in #918 (comment), quite some cycles were spent on figuring out that test fails on CI but not locally due to the timing issue, and not because this is some windows specific behavior. In other projects I've worked on unrestricted concurrency also posed challenges to writing robust and fast tests.

In general, I strongly believe that logical clocks and causality should be used instead of timeouts if possible, and that APIs should be designed in such a way that using logical clocks is possible (i.e, if something spawns a concurrent thread of execution, it should return a handle which can be used to wait for that thread to finish). I don't have concrete strong arguments here except that it,... feels right I guess? :)

The other issue is I think we'd generally want to move the feature tests out into integration, or whatever you want to call it, style tests. Tests that start rls as a process and talk to it like a lsp client. If that's still true then in-process hooks won't be of any use to the tests.

I agree that we should add tests that literally spawn the RLS process and communicate with it via standard streams: currently this layer seems untested? However, while I in general a strong believer in integration tests, I personally don't think that moving all feature tests to integration testing will be a good idea, for two reasons:

  • Concurrency is easier to test from withing the project. For batch programs like cargo or rustc, just running the binary and asserting the output gives perfect feature test coverage, but for stateful servers you also need to check possible interactions with requests ordering and timing, and that is usually easier to do internally.
  • I expect that there'll be ton of test in RLS, because the amount of features you can add to IDE is enormous. As such, it'll be important to keep tests fast and to avoid spawning a process per test.

Long-term, I think we should split RLS into the "rls-library" layer, which is more or less oblivious of the details of the LSP and "rls-lsp binary", which wraps the API of the library into a specific protocol. That way, we'll have a smallish number of integration tests for binary, which spawn process and check that output indeed conforms to LSP, and most feature tests would be done on the library level.

@alexheretic
Copy link
Member

Well I do agree that eliminating the messaging timeout issue in the tests would be really nice. I'm just not sure keeping track of the jobs with this much extra prod-code is worth it.

For me the test code is there to make the prod code better, this feels like the other way around.

On the other hand perhaps it will serve the greater good by allowing better testing, and I do feel rls test code isn't really good enough at the moment.

@matklad
Copy link
Member Author

matklad commented Jul 2, 2018

I'm just not sure keeping track of the jobs with this much extra prod-code is worth it.

Agree that this brings-in a significant amount of "clever" code with tricks like uninhabited enums and drop-bombs, which is a big drawback.

A simpler alternative is possible: instead of keeping track of jobs, we can just drain all the work queues explicitly. That is, in wait_for_background_jobs we can call dispatcher.await_all_dispatched(), something similar for AnalysisQueue (the method needs to be introduce), and BuildQueue.block_on_build.

I think that explicit tracking of all concurrent tasks is better, long term, than tracking of potential sources of concurrency, although it's more code:

  • it is a generalization of the pattern we already use in Dispatcher. In this PR, in_flight_requests is removed and jobs are used instead.

  • JobToken could be turned into cancellation token

  • Threading of jobs through API signals to the reader where concurrency happens.

@matklad
Copy link
Member Author

matklad commented Jul 4, 2018

@nrc curious what do you think about the problem in general and this specific approach? Is it something reasonable, or should we just stick with thread::sleep for the time being?

@nrc
Copy link
Member

nrc commented Jul 15, 2018

Sorry for the delay. My initial opinions (I haven't looked at the code yet):

  • I agree that the tests running long (and often non-deterministically) is a big problem.
  • I am somewhat paranoid about the concurrency in the RLS. Historically this has caused some subtle and hard to fix bugs and generally been difficult to work on. I would love to do better (safer) here, even if it means some more code. On the other hand, we are very close to releasing a 1.0 version and I am frightened of changing this area very much until after the release.
  • I've been particularly unhappy about the 'fire and forget' stuff, it's often felt like there should be a better way, so the approach as described sounds appealing.

Here I'd be tempted to define "good" as sufficient.

I do agree with this sentiment too - there is something to be said for 'if it ain't broke, don't fix it'.

@@ -73,7 +74,7 @@ impl BlockingNotificationAction for DidOpenTextDocument {
}

impl BlockingNotificationAction for DidChangeTextDocument {
fn handle<O: Output>(params: Self::Params, ctx: &mut InitActionContext, out: O) -> Result<(), ()> {
fn handle<O: Output>(params: Self::Params, jobs: &mut Jobs, ctx: &mut InitActionContext, out: O) -> Result<(), ()> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could jobs be part of the context here, rather than adding a variable?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved jobs to the Context. Initially I've avoided that because Context is clone, and so Arc<Mutex> is required. And having explicit &mut Jobs argument and -> ConcurrentJob return types gives better idea about when concurrency happens. However, that is indeed a lot of extra typing compared to just stuffing jobs into the ctx.

src/test/lens.rs Outdated
expect_messages(
results.clone(),
&[
ExpectedMessage::new(Some(0)).expect_contains(r#""codeLensProvider":{"resolveProvider":false}"#),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should probably check for this message

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed! Restored other messages as well to try to debug #925


impl Drop for ConcurrentJob {
fn drop(&mut self) {
if self.is_abandoned || self.is_completed() || thread::panicking() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why don't we panic on abandoned jobs? Seems like a bad thing if Jobs is dropped without completing the jobs

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, indeed, for normal termination process it is reasonable to require that all the jobs are awaited for!

Still not sure what's the best choice for abnormal termination (some thoughts in af2534f), but probably just abandoning jobs is fine, as things have gone south anyway.

@matklad matklad force-pushed the managed-concurrency branch from 8d33e48 to ad66472 Compare July 16, 2018 08:48
@nrc
Copy link
Member

nrc commented Jul 16, 2018

I think we should do this. I'm pretty excited to improve our flaky tests. Once we land it I'll publish a new version and that should show up any bugs pretty soon.

@matklad could you rebase and address the review comments, then I'll merge. Thanks!

@matklad
Copy link
Member Author

matklad commented Jul 16, 2018

@nrc rebased before your comment :) Fixing the comments now!

@matklad matklad force-pushed the managed-concurrency branch 3 times, most recently from ee1b647 to 4962bb7 Compare July 16, 2018 11:34
@matklad
Copy link
Member Author

matklad commented Jul 16, 2018

I think I've addressed all comments.

@matklad
Copy link
Member Author

matklad commented Jul 16, 2018

CI should be fixed by #944

@Xanewok
Copy link
Member

Xanewok commented Jul 16, 2018

@matklad #944 was merged, but there's a lockfile conflict now

@matklad matklad force-pushed the managed-concurrency branch from 4962bb7 to dd1f805 Compare July 16, 2018 17:00
@matklad
Copy link
Member Author

matklad commented Jul 16, 2018

Rebased!

@matklad
Copy link
Member Author

matklad commented Jul 16, 2018

bors try

We've tried to setup bors here, let's see if is working...

@bors-voyager
Copy link
Contributor

bors-voyager bot commented Jul 16, 2018

🔒 Permission denied

Existing reviewers: click here to make matklad a reviewer

@Xanewok
Copy link
Member

Xanewok commented Jul 16, 2018

I get a Permission denied when trying to add you as a reviewer; let's see if I'll have more luck...

bors try

cc @nrc

@bors-voyager
Copy link
Contributor

bors-voyager bot commented Jul 16, 2018

🔒 Permission denied

Existing reviewers: click here to make Xanewok a reviewer

@carols10cents
Copy link
Member

Sorry about that. I updated my bors-ng instance's code and synced reviewer+members with people on this repo who have push permission. Try again?

@nrc
Copy link
Member

nrc commented Jul 16, 2018

bors try (I don't need the @?)

bors-voyager bot added a commit that referenced this pull request Jul 16, 2018
@bors-voyager
Copy link
Contributor

bors-voyager bot commented Jul 16, 2018

@matklad
Copy link
Member Author

matklad commented Jul 16, 2018

@nrc yep, this the another bors, it does’t need an @. And you need bors r+ to do the merge. I’ve tried try just to check if bors is listening :-)

@nrc
Copy link
Member

nrc commented Jul 17, 2018

bors r+

bors-voyager bot added a commit that referenced this pull request Jul 17, 2018
923: Managed concurrency r=nrc a=matklad

Hi!

In tests, we currently do synchronization via `thread::sleep`. It is rather problematic by itself, but it also shows that we don't have a very good control of concurrency inside RLS :( 

I think this is a big problem: RLS is going to be a highly concurrent application, so it's better to establish reliable concurrency practices from the start. I don't have too much experience with designing concurrent though, so I am not sure if the proposed approach of dealing with the problem is not insane :D 

In my experience, a lot of problems with managing concurrency stem from "fire an forget" approach: if a function schedules some work to be executed outside the dynamic extent of the function call itself (for example, by spawning a thread, or by submitting the work to some queue), and doesn't return a handle to the yet-to-be-computed result, then such function will be impossible to test without resorting to busy-waiting. 

Current PR proposes to use a future-like `ConcurrentJob` object to keep track of tasks. Any background operation returns this object, which can be used to wait for that operation to finish. Dropping `ConcurrentJob` without waiting for the result in general results in a panic, to protect from accidentally forgetting about a background operation. 

The end result is that `LsService` now has a `wait_for_background_jobs` which can be used to make sure that all background ops are finished. 

Co-authored-by: Aleksey Kladov <[email protected]>
@bors-voyager
Copy link
Contributor

bors-voyager bot commented Jul 17, 2018

Build failed

@matklad
Copy link
Member Author

matklad commented Jul 17, 2018

Ok, one more round of clippy bumping!

#945

RLS is going to be a highly concurrent application, with three main
concurrency sources:

* user input/requests from the client editor
* changes from the file system
* internal parallelisation of CPU-heavy tasks

Because developing, debugging and testing concurrent applications is
hard, it makes sense to try to establish good concurrency patterns
from the start.

One concurrency anti-pattern is "fire & forget" spawning of background
tasks. This is problematic: without the ability to check if the task
is finished, you have to resort to busy-waiting in tests.

`concurrency` module introduces a `ConcurrentJob` object, which is a
handle to background job, which can be used to check if the job is
running and to wait for it to finish.

The idea is that any function, which schedules some job off the
current thread, should return a `ConcurrentJob`. Jobs are stored
inside the `Jobs` table, which gives a nice overview of all currently
active tasks and can be used to wait for all active tasks to finish.

All `ConcurrentJob`s must be stored in a `Jobs` table, it's and error
to drop a job on the floor.
@matklad matklad force-pushed the managed-concurrency branch from dd1f805 to 833a2a9 Compare July 17, 2018 21:30
@matklad
Copy link
Member Author

matklad commented Jul 17, 2018

Resolved conflict, squashing changes to a single commit in the process.

@nrc nrc merged commit f7b4a9b into rust-lang:master Jul 18, 2018
@Xanewok
Copy link
Member

Xanewok commented Jul 18, 2018

Nice! Didn’t mean to add you extra work with that 2018 edition commit... 😅

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants