refactor: use `papaya` instead of DashMap for workspace documents #4624

arendjr · 2024-11-24T10:06:15Z

Summary

This PR is a proof-of-concept to show that we can use papaya instead of DashMap for workspace documents. It doesn't need immediate merging, although if all tests are green and the team is on board, I'd be happy to merge this as a stepping stone.

Why papaya?

While papaya offers great read performance, the main reason I want to make this change is that papaya is lock-free. Because of this, it cannot create deadlocks. If we want to implement our own caching and file watching in the workspace, I would not want to run the risk of creating deadlocks. They can be hard to catch, hard to debug, and generally drain contributors' energy. Let's try to avoid that :)
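To make the deadlock risk concrete, here is a minimal sketch that uses only the standard library, not DashMap or papaya themselves. `LockedDocuments` is a hypothetical type in which a single `Mutex` guards the whole map; DashMap shards its locks, but its documentation carries similar warnings that some methods may deadlock when called while holding a reference into the map. The sketch uses `try_lock` so that it reports the problem instead of hanging.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Hypothetical stand-in for a lock-based document map (not DashMap's API).
pub struct LockedDocuments {
    inner: Mutex<HashMap<String, String>>,
}

impl LockedDocuments {
    pub fn new() -> Self {
        Self {
            inner: Mutex::new(HashMap::new()),
        }
    }

    pub fn insert(&self, path: &str, content: &str) {
        self.inner
            .lock()
            .unwrap()
            .insert(path.to_string(), content.to_string());
    }

    /// Returns `true` if a second access on the same thread would have to
    /// wait while the first guard is still alive.
    pub fn would_block_on_reentry(&self) -> bool {
        let _guard = self.inner.lock().unwrap();
        // A blocking `lock()` here could never succeed, because this thread
        // already holds `_guard`. `try_lock` lets us observe that safely.
        let blocked = self.inner.try_lock().is_err();
        blocked
    }
}
```

In real code the two accesses are rarely this close together; they are usually separated by several function calls, which is what makes this class of bug hard to spot in review.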

Tradeoffs

Because papaya is lock-free, it cannot offer a get_mut() method, meaning we cannot update values within a map in-place. This leads to the following trade-offs:

It is no longer easy to reuse a node cache that's already in the map. If we really care about reusing node caches, we can work around this using Cell<NodeCache>, but for this PR I've taken an even simpler route: Just don't persist the NodeCache anymore. If we implement long-term persistence and caching, node caches can theoretically grow without bounds over time, effectively creating a memory leak. Maybe in practice this isn't much of an issue, so we need to decide how much we care either way. Let me know which way you are leaning :)
Persisting parsed syntax separately from the documents means we need coordination to protect against the situation where separate threads try to access the syntax simultaneously. Previously this resulted in lock contention, which is already not ideal for performance, but in a lock-free environment we need to resolve this situation on our own. I think the simplest solution is to store the syntax with the documents and parse optimistically. After all, there are few situations in which we put documents in the workspace if we don't intend to parse them later anyway.
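The two trade-offs above can be sketched together as follows. This is an illustration under stated assumptions, not the code in this PR: `Documents`, `Document`, `Parsed`, and `parse` are hypothetical names, a token count stands in for a syntax tree, and the internal `RwLock` exists only to keep the sketch free of dependencies. The point is the API shape: the map never hands out a `&mut Document`, so an update builds a complete replacement value, and the parse result is stored together with the content.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

/// Hypothetical parse result; a token count stands in for a syntax tree.
#[derive(Debug, Clone, PartialEq)]
pub struct Parsed {
    pub token_count: usize,
}

/// Immutable document: content and its parse result travel together.
#[derive(Debug)]
pub struct Document {
    pub version: u32,
    pub content: String,
    pub parsed: Parsed,
}

/// Stand-in parser (assumption): counts whitespace-separated tokens.
fn parse(content: &str) -> Parsed {
    Parsed {
        token_count: content.split_whitespace().count(),
    }
}

/// Map that only exposes whole-value reads and writes, never `get_mut()`.
pub struct Documents {
    inner: RwLock<HashMap<String, Arc<Document>>>,
}

impl Documents {
    pub fn new() -> Self {
        Self {
            inner: RwLock::new(HashMap::new()),
        }
    }

    /// Returns a shared handle to the current value, if any.
    pub fn get(&self, path: &str) -> Option<Arc<Document>> {
        self.inner.read().unwrap().get(path).cloned()
    }

    /// Parses eagerly and stores content and syntax as one value.
    pub fn open_file(&self, path: &str, content: &str) {
        let doc = Document {
            version: 0,
            content: content.to_string(),
            parsed: parse(content),
        };
        self.inner
            .write()
            .unwrap()
            .insert(path.to_string(), Arc::new(doc));
    }

    /// Builds a full replacement instead of mutating in place.
    /// Returns `false` if the path is unknown. Two concurrent calls could
    /// race here; a real lock-free map would use an atomic update instead.
    pub fn change_file(&self, path: &str, content: &str) -> bool {
        let Some(old) = self.get(path) else {
            return false;
        };
        let doc = Document {
            version: old.version + 1,
            content: content.to_string(),
            parsed: parse(content),
        };
        self.inner
            .write()
            .unwrap()
            .insert(path.to_string(), Arc::new(doc));
        true
    }
}
```

A reader that fetched a document before `change_file()` keeps seeing the old, internally consistent content and parse result, which is the property that makes separate coordination unnecessary.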

Future Work

I suspect we can also get rid of our DenseSlotMap in favor of papaya, in order to save on dependencies.
If we decide that the node cache doesn't need persistence, we can simplify our parser APIs a bit further.
Continue with file caching and watching.

Test Plan

CI should remain green.

codspeed-hq · 2024-11-24T10:47:32Z

CodSpeed Performance Report

Merging #4624 will not alter performance

Comparing arendjr:papaya (2917a35) with next (aa9582f)

Summary

✅ 97 untouched benchmarks

ematipico · 2024-11-24T18:27:54Z

It seems that one of the dependencies of papaya doesn't support WASM :(

https://github.com/biomejs/biome/actions/runs/11998204835/job/33444689330?pr=4624#step:10:253

arendjr · 2024-11-24T21:24:57Z

That's unfortunate. I've opened an issue about it: ibraheemdev/papaya#32

I'm afraid for now I need to think of a different approach. Using papaya was never a requirement for the rest of the plan, although it would've been nice if we could guarantee the absence of deadlocks. Oh well...

arendjr · 2024-11-26T18:49:15Z

Let's see if it works with a WASM shim...

ematipico

The refactor showed a shortcoming in how we parse files that I didn't notice before. There's no need to block the PR, but we would need to fix it somehow.

ematipico · 2024-11-27T15:49:13Z

crates/biome_service/src/workspace/server.rs

+    /// Returns an error if no file exists in the workspace with this path.
+    fn get_parse(&self, biome_path: &BiomePath) -> Result<AnyParse, WorkspaceError> {
+        self.documents
+            .pin()


Do we need to use pin with every papaya hashmap?

Yes, it is the equivalent of calling .read() or .write() on an RwLock. The pin is like a guard, but it’s named differently because it doesn’t lock the map. What it does is give you mutable access to the map and prevent any references you take from being cleaned up for as long as you keep the map pinned.

ematipico · 2024-11-27T15:57:05Z

crates/biome_service/src/workspace/server.rs

+        let parsed = self.parse(&params.path, &params.content, index)?;
+
+        if let Some(language) = parsed.language {
+            index = self.set_source(language);
+        }


I'm not sure we should go in this direction... we are now parsing any file regardless. This means that we parse even files that might be ignored. Before, get_parse was called by functions like format and pull_actions, but now we call it every time we open a file.

But I suppose the previous logic was also doing the same; it's just more evident now that we are making a mistake 🤔

Yeah, I agree we need to carefully think this through. In fact, with multi-file analysis we need to decide what “ignoring” a file even means. Does it mean that we should not show any diagnostics about the file, or does it mean we cannot even extract any information from it? For example:

A generated file may import non-generated files. If in turn a non-generated file imports the generated one, that may lead to a cycle. If we don’t analyze the generated file, we would fail to detect the cycle and cannot show the diagnostic on the non-generated file.

node_modules will typically be ignored even though we’ll want to extract type information from it.

These use cases make me think we should probably parse even the ignored files.

Note this doesn’t imply we need to parse every file inside node_modules. I would propose we start traversing from the included files and expand traversal to those that get imported from them. For those we have good reason to parse them, even if we never need to get diagnostics from them.
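The proposed traversal could be sketched roughly like this. It is only an illustration: `files_to_parse` and the pre-resolved `imports` map are hypothetical, and a real implementation would discover imports by parsing each file as it is visited.

```rust
use std::collections::{BTreeSet, HashMap, VecDeque};

/// Returns every file reachable from `included` by following imports.
/// `imports` maps a file to the files it imports (assumed to be resolved
/// already, which a real workspace would have to do while parsing).
pub fn files_to_parse<'a>(
    included: &[&'a str],
    imports: &HashMap<&'a str, Vec<&'a str>>,
) -> BTreeSet<String> {
    let mut seen = BTreeSet::new();
    let mut queue: VecDeque<&'a str> = included.iter().copied().collect();

    while let Some(file) = queue.pop_front() {
        // `insert` returns false for files we already visited, which also
        // keeps import cycles from looping forever.
        if !seen.insert(file.to_string()) {
            continue;
        }
        if let Some(deps) = imports.get(file) {
            queue.extend(deps.iter().copied());
        }
    }

    seen
}
```

With this shape, a generated file or a package in node_modules is parsed only when something in the included set imports it, directly or transitively.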

Finally, of course there are other reasons why users may want to exclude files (I’ve intentionally avoided the word “ignore” here). A file may be too big, or it might contain invalid syntax. That’s a very different reason, and indeed we would want to skip parsing altogether in such cases. I’m not sure yet what’s the best way to distinguish between these cases however…

Oh, maybe one more point that we should consider. This code is inside the open_file() function. If the file should truly be excluded, I think we should not even call open_file() to begin with. So the intention should be, if we call open_file() we want to at least get some information out of it, meaning that parsing wouldn't be unnecessary.

Great. The CLI already does that, but not the LSP. So we should update the LSP code to check if the file is ignored.
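As a rough sketch of that ordering, assuming a heavily simplified workspace: `Workspace`, `is_ignored`, and `did_open` below are hypothetical names, and the prefix-based ignore check is an assumption made purely for illustration. The only point is that the check runs before `open_file()`, so `open_file()` can parse unconditionally.

```rust
use std::collections::HashMap;

/// Hypothetical, heavily simplified workspace.
pub struct Workspace {
    ignored_prefixes: Vec<String>,
    documents: HashMap<String, String>,
}

impl Workspace {
    pub fn with_ignored(prefixes: &[&str]) -> Self {
        Self {
            ignored_prefixes: prefixes.iter().map(|p| p.to_string()).collect(),
            documents: HashMap::new(),
        }
    }

    /// Assumption: ignoring is a plain path-prefix check.
    pub fn is_ignored(&self, path: &str) -> bool {
        self.ignored_prefixes
            .iter()
            .any(|prefix| path.starts_with(prefix.as_str()))
    }

    /// Opening a file implies we want at least some information from it.
    pub fn open_file(&mut self, path: &str, content: &str) {
        self.documents.insert(path.to_string(), content.to_string());
    }

    pub fn is_open(&self, path: &str) -> bool {
        self.documents.contains_key(path)
    }
}

/// Hypothetical LSP "did open" handler: the ignore check happens first.
/// Returns `true` if the file was handed to the workspace.
pub fn did_open(workspace: &mut Workspace, path: &str, content: &str) -> bool {
    if workspace.is_ignored(path) {
        return false;
    }
    workspace.open_file(path, content);
    true
}
```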

ematipico · 2024-11-27T16:01:52Z

crates/biome_service/src/workspace/server.rs

+            })
+            .ok_or_else(WorkspaceError::not_found)?;
+
+        let parsed = self.parse(&params.path, &params.content, index)?;


Same here. If the file is ignored, we should not parse it.

arendjr · 2024-11-28T08:20:13Z

I removed the shim again, since the latest Git version of papaya works on WASM too. (I confirmed locally, but CI should remain green.)

arendjr requested a review from a team November 24, 2024 10:06

github-actions bot added the A-Project Area: project label Nov 24, 2024

arendjr changed the title ~~Use papaya instead of DashMap for workspace documents~~ refactor: use papaya instead of DashMap for workspace documents Nov 24, 2024

arendjr force-pushed the papaya branch from 9aaf659 to 45a9b27 Compare November 24, 2024 17:33

arendjr force-pushed the papaya branch from 45a9b27 to fe8d8c0 Compare November 24, 2024 18:53

arendjr closed this Nov 24, 2024

arendjr reopened this Nov 26, 2024

arendjr force-pushed the papaya branch from f90fccb to 4ee4f29 Compare November 26, 2024 18:52

ematipico approved these changes Nov 27, 2024

ematipico force-pushed the next branch from 2d7c0fb to ab5f0cd Compare November 27, 2024 16:37

arendjr force-pushed the papaya branch from 4ee4f29 to 3e350f9 Compare November 27, 2024 19:06

Use papaya instead of DashMap for workspace documents

2917a35

arendjr force-pushed the papaya branch from 3e350f9 to 2917a35 Compare November 28, 2024 08:18
