[WIP] Support for indexing the codebase #6762
Conversation
I'm worried about how this design produces a lot of intermediate artifacts that aren't always going to be useful. For example, the user might trigger a few recompilations of the project before actually using the "symbol occurrences" feature. Did you explore a design where these indices can be produced on demand? How big are these index files?
Does that mean this syntax extension must be enabled in every project where this feature is to be used? That seems like a painful UI to me. IMO it should be a workspace or a dune-config option.
Yes, that's "slow", but given the proposed design, I don't see a better way to do this.
Force-pushed from 3ade745 to 06c8984.
I did not explore any "on demand" design yet; I wanted to try and see whether it would be sustainable to keep the index up-to-date in watch mode for small / medium size projects, and I think it is right now. A hybrid model would probably work best:
On Dune itself, the final index is
Will we need the entire index to do any queries? E.g. if I want to search for uses of a function that is private to a library, do I need an index for the entire workspace?
The issue isn't only about size and performance, although those matter too. Suppose that the user makes an occurrence request while the index is still building: what's the failure mode? Do we tell the user to retry later? Do we delay the request until all the indexes are built? It would seem more sensible to me for merlin to ask dune to build whatever index file it needs as it's serving a query. In any case, this is already good. I'd just like us to try and improve on the UX we already have with merlin/lsp rather than rely on our usual reliable but relatively awkward patterns. By the way, where can I find the corresponding merlin code that looks up these indices?
With the current implementation the entire index is needed. But it is an interesting perspective! Not sure how we should do it, however.
Yes, I agree that we need to think about the UX; the current implementation is just laying the groundwork.
Just pushed my latest changes here, but it is still very wip: And the small ocaml-lsp changes (also uncleaned):
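As an aside on the "ask dune on demand" idea floated above, here is a minimal sketch of what the editor-side call could look like, assuming the index is an ordinary build target; the function name and target handling are assumptions made for illustration, not anything from this PR.

```ocaml
(* Hypothetical sketch of the "build the index on demand" idea discussed
   above: before answering an occurrences query, the editor-side tool
   asks dune to (re)build the index target it needs. The function name
   and the idea of the index being an ordinary build target are
   assumptions made for illustration. *)
let ensure_index_built ~workspace_root ~index_target =
  let cmd =
    Filename.quote_command "dune"
      [ "build"; "--root"; workspace_root; index_target ]
  in
  match Sys.command cmd with
  | 0 -> Ok ()
  | n -> Error (Printf.sprintf "dune build exited with code %d" n)
```

Whether such a request should block, or fail with a "retry later" answer, is exactly the UX question raised above.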
Tiny observation: I've never used a project-wide-occurrences feature on any sizable project that didn't produce a loading indicator. :P

I suspect this is the type of feature that doesn't necessarily need blazing-fast perf upon invocation to really be fantastically useful, especially if some/most occurrences can be produced quickly while the rest are populated in the background / asynchronously. VSCode even includes a 'refresh' button on the "Find All References" interface, which is very telling:

Correspondingly, as someone who'd use a project-wide reference index at least weekly, I wouldn't personally want the regular build/watch feature slowed down by it, unless it was nearly imperceptible. Plenty of my coworkers would not use such a feature much-if-at-all, and we already have enough build-time issues. 😓

(Hopefully this is useful feedback; this work looks super cool, and I'm planning to check out your branches and give it a spin against our codebase later today!)
Thanks for working on this. This feature is eagerly awaited. If a project has hundreds of compilation units, that also means hundreds of individual index files. One way to potentially save space (and perhaps get some more robustness) is to use a single file which is a database.

Why not use a single SQLite db for all the indexing operations and store all your project-wide-occurrences data in it? This db would store the content of the individual index files. Depending on how you design the schema, the SQLite db can greatly help with updating the index, querying, etc. It may even help you with aggregation and reduce disk space by reducing the amount of redundant data that needs to be stored in each file.

Since there is only one source of truth for the project, everything becomes simpler. Since updates to the db can be made transactional, there will never be a scenario in which the project index is in some inconsistent state while it is being updated. In other words, the index is always queryable even when it is slightly outdated.

This is admittedly a large change, and this design approach may have already been considered in the past. Just mentioning it in case this is appealing.
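To make the suggestion concrete, here is a minimal sketch of what such a single-database layout could look like. It is purely illustrative: the table and column names are invented and do not reflect the PR's actual data model, and the SQL is kept as plain strings so that any OCaml SQLite binding could run it.

```ocaml
(* Illustrative only: one possible SQLite layout for project-wide
   occurrences, kept as plain SQL strings so that any OCaml SQLite
   binding could execute them. Table and column names are invented and
   do not reflect the PR's actual data model. *)
let schema =
  {| CREATE TABLE IF NOT EXISTS definitions (
       uid  TEXT PRIMARY KEY,   -- stable identifier of the definition
       file TEXT NOT NULL,
       line INTEGER NOT NULL,
       col  INTEGER NOT NULL
     );
     CREATE TABLE IF NOT EXISTS occurrences (
       uid  TEXT NOT NULL REFERENCES definitions (uid),
       file TEXT NOT NULL,      -- file containing the use site
       line INTEGER NOT NULL,
       col  INTEGER NOT NULL
     );
     CREATE INDEX IF NOT EXISTS occurrences_by_uid ON occurrences (uid); |}

(* Finding every use of a definition is then a single indexed query. *)
let lookup_occurrences =
  {| SELECT file, line, col FROM occurrences WHERE uid = ? |}
```

Re-indexing one compilation unit would then amount to deleting and re-inserting that unit's rows inside a transaction, which is what would keep the index queryable even when slightly outdated.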
Sorry for the naïve question, but could you explain what you mean by "project-wide occurrences"? As a data point, we have a typed

Have you done any benchmarks to compare performance with and without the indexing step? (As it seems to add some complexity and performance cost.) Could it make sense to do a first implementation of this feature without the indexing step and, later, if performance is an issue, add the indexing step?
A hybrid model would probably be the best in very large codebases: an asynchronously generated index for the complete codebase and a live re-indexing of the part of the project being worked on. Being aware of the visibility of the definition would also be a way to cut the search tree, as @rgrinberg suggested.
I agree that it probably should not be a part of the
The main pro of the "individual files" method is that it plays well with Dune's (and a lot of other build systems') file-based model: this allows the build system to maximize parallelism and incrementality, at the cost of bigger artifacts. It also fits OCaml's separate compilation model closely, so indexing can start as soon as individual cmt files are available.
By that I mean the ability to select a value in the buffer and display all the usages of that same value across the entire project. So far this only works in the active buffer.
That sounds like a great tool! Do you call it on every
Currently the … An incremental rebuild of the index after a one-file change takes around …
I find the current implementation quite satisfying (right now I am cleaning things up and about to un-draft the PR) and it works quite well in terms of timings. It would require quite a lot of changes to have Merlin directly build the index, and it would probably be much slower without Dune's incredible parallelism and incrementality.
Yes, that's indeed why we usually lean towards files. But I find that to be a crutch and a sorely missing feature in dune. With a proper database, we would lose incrementality and parallelism only at the stage where we serialize the write into the database; the actual indexing operations would retain all those good properties. Nevertheless, using a database isn't realistic until we have a database that supports something like dune's "stale artifact deletion" for records.
It seems concerning that merlin is so tied to a particular implementation of the database index. Would it be possible to make sure that the db is a separate component with a well-defined API?
The "satisfying" part was the Dune rules :-) So far the format (basically a marshalled map) of the index is defined with some magic numbers and a shared module between Merlin and the indexing binary. One way to decouple this would be to define a small rpc (that doesn't need to run in the background) for Merlin to query the indexing binary for occurrences. It would be easier to extend the protocol if we want to index more thing. Is that the kind of API you had in mind ? |
It doesn't need to be RPC; a shared module should be enough. It just needs to encapsulate the details of marshalling and IO so that they only live in one place.
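As an illustration of that idea, a shared module could look roughly like the sketch below. Everything in it (the module name, the types, the magic string) is hypothetical and is not the actual Merlin or Dune code; the point is only that the marshalled format and the file IO sit behind one small API that both the indexer and Merlin use.

```ocaml
(* Hypothetical sketch of a shared module: the on-disk format
   (magic number + marshalled map) is owned by one component that both
   the indexing binary and Merlin link against. Not the actual code. *)
module Index_format : sig
  type loc = { file : string; line : int; col : int }
  type t

  val empty : t
  val add : t -> uid:string -> loc -> t
  val lookup : t -> uid:string -> loc list

  (* The only two functions that know about magic numbers and Marshal. *)
  val write : path:string -> t -> unit
  val read : path:string -> t
end = struct
  type loc = { file : string; line : int; col : int }

  module M = Map.Make (String)

  type t = loc list M.t

  let empty = M.empty

  let add t ~uid loc =
    M.update uid (function None -> Some [ loc ] | Some l -> Some (loc :: l)) t

  let lookup t ~uid = Option.value ~default:[] (M.find_opt uid t)

  let magic = "HYPOTHETICAL-INDEX-FORMAT-001"

  let write ~path t =
    let oc = open_out_bin path in
    output_string oc magic;
    Marshal.to_channel oc t [];
    close_out oc

  let read ~path =
    let ic = open_in_bin path in
    let m = really_input_string ic (String.length magic) in
    if not (String.equal m magic) then failwith "unexpected index format";
    let t : t = Marshal.from_channel ic in
    close_in ic;
    t
end
```

Swapping Marshal for another serialization, or moving to an rpc later, would then only touch this one module.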
Index generation opt-in with workspace option: this means that monorepos are fully indexed if the host workspace has indexation activated. Signed-off-by: Ulysse Gérard <[email protected]>
The remaining work to get this work merged, if I understand correctly, seems to be:
Is there any way I can contribute? I truly believe this is one of the big things holding back the 'newbie experience' in the OCaml ecosystem, and I would love to help push it forward! 😅
Closing in favor of #10422, which updates these rules to the new process that involves the compiler pre-indexing cmt files.
This PR is still a work in progress and is part of an effort to support project-wide occurrences when working with OCaml.
The current plan to support project-wide occurrences has three distinct parts:
1. information stored in the compiler's cmt files;
2. an indexing tool: ocaml-uideps;
3. Dune rules driving ocaml-uideps to generate an index of an entire project (this PR).

The heart of this PR is the uideps.ml module, which exposes two functions, cctx_rules and project_rule, that respectively provide rules to generate indexes for each compilation context and for the global project.

The generation of the indexes can be summarized as: rules in each compilation context produce a partial index from that context's compilation units, and a project-level rule aggregates these partial indexes into a single project-wide index.
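To make that two-stage shape concrete, here is a hedged sketch in plain OCaml; it is not Dune's rule API and not the real uideps.ml code, and every name in it is invented for illustration.

```ocaml
(* A hedged illustration of the two-stage shape described above (not the
   actual uideps.ml rules): per-compilation-context partial indexes are
   plain maps from a definition's uid to its use sites, and the project
   rule only has to take their union. All names are hypothetical. *)
module String_map = Map.Make (String)

type 'loc partial_index = 'loc list String_map.t

(* Stage 1 (one per compilation context): fold the occurrences found in
   a context's compilation units into a map keyed by definition uid. *)
let index_unit (occurrences : (string * 'loc) list) : 'loc partial_index =
  List.fold_left
    (fun acc (uid, loc) ->
      String_map.update uid
        (function None -> Some [ loc ] | Some locs -> Some (loc :: locs))
        acc)
    String_map.empty occurrences

(* Stage 2 (the project rule): the global index is the union of all the
   partial indexes produced by stage 1. *)
let aggregate (partials : 'loc partial_index list) : 'loc partial_index =
  List.fold_left
    (String_map.union (fun _uid a b -> Some (a @ b)))
    String_map.empty partials
```

Since each partial index is a separate file target in the actual rules, Dune can build them in parallel and only re-run the aggregation when one of them changes, which matches the incrementality argument made earlier in the thread.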
Here are a few of the points that I think would be worth investigating / discussing:

- Whether the index should be attached to the @all alias: right now the user must knowingly trigger the index build.
- The use of Dune_file.fold_stanzas: @emillon told me that it might be problematic. @rgrinberg, do you have more knowledge about this?

Maybe we could have that discussion during one of Dune's dev meetings when people are back from holidays.