Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Support for indexing the codebase #6762

Closed
wants to merge 14 commits into from
Closed

Conversation

voodoos
Copy link
Collaborator

@voodoos voodoos commented Dec 22, 2022

This PR is still a work in progress and is part of an effort to support project-wide occurrences when working with OCaml.

The current plan to support project-wide occurrences has three distinct parts:

  • A new tool that can index values in a cmt: ocaml-uideps
  • New Dune rules that orchestrate the calls to ocaml-uideps to generate an index of an entire project (this PR)
  • Merlin / Editor mode support to query that index

The heart of this PR is the uideps.ml module, that exposes two functions, cctx_rules and project_rule, that respectively provide rules to generate indexes for each compilation context and for the global project.

The generation of the indexes can be summarized as:

  • Index every CMT file in the project
  • For each lib and exe stanza, aggregate their cmt's indexes
  • Aggregate all these indexes into the projects's global index
                                        ──┐
              project.uideps              │
                    ▲                     │
                    │                     │
          ┌─────────┴─────────┐           │
          │                   │           │ "project rule"
          │                   │           │
          │                   │           │
                                          │
     cctx.uideps         cctx.uideps    ──┤
                                          │
          ▲                   ▲           │
          │                   │           │
     ┌────┴───┐               │           │
     │        │               │           │
     │        │               │           │
 a.uideps  b.uideps      main.uideps      │   "ctx rules"
                                          │
     ▲       ▲                ▲           │
     │       │                │           │
┌────┼───────┼────┐     ┌─────┼────┐      │
│    │       │    │     │     │    │      │
│  a.cmt    b.cmt │     │ main.cmt │      │
│  a.ml     b.ml  │     │ main.ml  │      │
│                 │     │          │    ──┘
│       Lib       │     │    Exe   │
└─────────────────┘     └──────────┘

Here a a few of the points that I think would be worth investigating / discussing:

  1. Indexing is, in essence, a slow process. There is a lot of room for performance improvement right now, and that is the main reason this PR is a draft. However it is to be expected that even after optimization, the indexing process might remains too slow to be considered a standard Dune target, especially on big projects. I can see several options:
    • We could remove this target from the @all alias. That means the user must knowingly trigger the index build.
    • Dune could treat progress distinctly: tell that the actual build is over but indexing is still in progress.
  2. I used a new "syntax extension" (with no actual syntax) to make the feature opt-in, but a dune-project option might be more suitable.
  3. To generate the "project" rule that merges indexes from all cctxes, I rely on Dune_file.fold_stanzas. @emillon told me that it might be problematic. @rgrinberg do you have more knowledge about this ?

Maybe we could have that discussing during on of Dune's dev meeting when people will be back from holidays.

@rgrinberg rgrinberg added requires-team-discussion This topic requires a team discussion merlin labels Dec 22, 2022
@rgrinberg
Copy link
Member

I'm worried about how this design produces a lot of intermediate artifacts that aren't always going to be useful. For example, the user might trigger a few recompilation of the project before actually using the symbols occurrences" feature. Did you explore a design where theses indices can be produced on demand?

How big are these index files?

I used a new "syntax extension" (with no actual syntax) to make the feature opt-in, but a dune-project option might be more suitable.

Does that mean this syntax extension msut be enabled in every project where this featured is to be used? That seems like a painful UI to me. IMO it should be be a workspace or a dune-config option.

To generate the "project" rule that merges indexes from all cctxes, I rely on Dune_file.fold_stanzas. @emillon told me that it might be problematic. @rgrinberg do you have more knowledge about this ?

Yes, that's "slow" but given the proposed design, I don't see a better way to do this.

@voodoos
Copy link
Collaborator Author

voodoos commented Dec 29, 2022

I'm worried about how this design produces a lot of intermediate artifacts that aren't always going to be useful. For example, the user might trigger a few recompilation of the project before actually using the symbols occurrences" feature. Did you explore a design where theses indices can be produced on demand?

I did not explore any "on demand" design yet, I wanted to try and see if it would be sustainable to keep the index up-to-date in watch mode for small / medium size projects. And I think it is right now.

An hybrid model would probably work best:

  • Allow users to enable always-on indexing if they want (via an option or simply the use of the correct alias)
  • Add a mechanism so that the editors plugins know if the index is up to date and can ask for a rebuild if it is not (probably via the RPC for OCaml-LSP).

How big are these index files?

On Dune itself, the final index is 5.0M but the total size with all the intermediaries is 53M. (for reference, the total size of the cmt files is 50M).
It is expected that the total size is higher than the final index because there is some redundancy and sharing is worst on individual files. This redundancy is necessary to get fast incremental builds.

@rgrinberg
Copy link
Member

Add a mechanism so that the editors plugins know if the index is up to date and can ask for a rebuild if it is not (probably via the RPC for OCaml-LSP).

Will we need the entire index to do any queries? E.g. if I want to search for uses of a function that is private to a library, do I need an index for the entire workspace?

I wanted to try and see if it would be sustainable to keep the index up-to-date in watch mode for small / medium size projects. And I think it is right now.

The issue isn't only about size and performance, although those matter too. Suppose that the user makes an occurrence request while the index is still building, what's the failure mode? Do we tell the user to retry later? Do we delay his request until all the indexes are built?

It would seem more sensible to me for merlin to ask dune to build whatever index file it needs as it's serving a query.

In any case, this is already good. I'd just like to us to try and improve on the UX we already have with merlin/lsp rather than rely on our usual reliably but relatively awkward patterns.

By the way, where can I find the corresponding merlin code that looks up these indices?

@voodoos
Copy link
Collaborator Author

voodoos commented Dec 30, 2022

Will we need the entire index to do any queries? E.g. if I want to search for uses of a function that is private to a library, do I need an index for the entire workspace?

With the current implementation the entire index is needed. But it is an interesting perspective! Not sure how we should do it however.

The issue isn't only about size and performance, although those matter too. Suppose that the user makes an occurrence request while the index is still building, what's the failure mode? Do we tell the user to retry later? Do we delay his request until all the indexes are built?

It would seem more sensible to me for merlin to ask dune to build whatever index file it needs as it's serving a query.

In any case, this is already good. I'd just like to us to try and improve on the UX we already have with merlin/lsp rather than rely on our usual reliably but relatively awkward patterns.

Yes I agree that we need to think about the UX, the current implementation is just laying the basis.

By the way, where can I find the corresponding merlin code that looks up these indices?

Just pushed my latest changes here, but it is still very wip:
https://github.com/ocaml/merlin/compare/master...voodoos:merlin:project-wide-occurrences?expand=1

And the small ocaml-lsp changes (also uncleaned):
https://github.com/ocaml/ocaml-lsp/compare/master...voodoos:ocaml-lsp:project-wide-occurrences?expand=1

@ELLIOTTCABLE
Copy link
Contributor

ELLIOTTCABLE commented Jan 10, 2023

Do we tell the user to retry later? Do we delay his request until all the indexes are built?

Tiny observation: I've never used a project-wide-occurrences feature on any sizable project that didn't produce a loading indicator. :P

I suspect this is the type of feature that doesn't necessarily need blazing-fast perf upon invocation to really be fantastically useful; especially if some/most occurrences can be produced quickly while the rest are populated in the background / asynchronously. VSCode even includes a 'refresh' button on the "Find All References" interface, which is very telling:

image

Correspondingly, as someone who'd use a project-wide reference index at least weekly, I wouldn't personally want the regular build/watch feature slowed down by it, unless it was nearly imperceptible. Plenty of my coworkers would not use such a feature much-if-at-all, and we already have enough build-time issues. 😓

(Hopefully this is useful feedback; this work looks super cool, and I'm planning to check out your branches and give it a spin it against our codebase later today!)

@sidkshatriya
Copy link

Thanks for working on this. This feature is eagerly awaited.

If a project has hundreds of cmt files, there will be a lot of uideps artifacts. Even if an artifact is quite small, it will occupy a minimum space in the filesystem. Then there is a problem of combining each of these files into an aggregate uideps file. When there are so many files to handle, there is a potential for more things to go wrong.

One way to potentially save space (and perhaps get some more robustness) is to use a single file which is a database. Why not use a single SQLITE db for all the indexing operations and store all your project-wide-occurences data in it? This db would store the content of individual uideps and aggregate uideps.

Depending on how you design the schema, the sqlite db can greatly help with updating the index, querying etc. It may even help you with aggregation and reduce disk space by reducing the amount of redundant data that needs to be stored in each uideps file.

Since there is only one source of truth for the project everything becomes simpler. Since updates to the db can be made transactional, there will never be a scenario in which the project index will be in some inconsistent state when it is being updated. In other words the index is always queryable even when it is slightly outdated.

This is admittedly a large change. This design approach may have already been considered in the past. Just mentioning in case this is appealing.

@nojb
Copy link
Collaborator

nojb commented Jan 11, 2023

The current plan to support project-wide occurrences has three distinct parts:

Sorry for the naïve question, but could you explain what you mean by project-wide occurrences concretely?

As a data point, we have a typed .cmt-based "grep" tool at LexiFi which one can use to search for structural OCaml terms (eg if (__1 : float) <> __1 then __ else List.length __ matches anything of the form if e1 <> e1 then e2 else List.length e3 where e1 is of type float), and we don't index them, we just scan them for every query. We use this quite a lot (on both Linux and Windows), and is quite usable in terms of performance.

Have you done any benchmark to compare performance with and without the indexing step? (As it seems to add some complexitiy and performance cost.)

Could it make sense to do a first implementation of this feature without the indexing step, and later, if performance is an issue, add the indexing step?

@voodoos
Copy link
Collaborator Author

voodoos commented Jan 11, 2023

@ELLIOTTCABLE

I suspect this is the type of feature that doesn't necessarily need blazing-fast perf upon invocation to really be fantastically useful

An hybrid model would probably be the best in very large codebases: an asynchronously generated index for the complete codebase and a live re-indexing of the part of the project being worked on. Being aware of the visibility of the definition would also be way to cut the search tree as @rgrinberg suggested.

Correspondingly, as someone who'd use a project-wide reference index at least weekly, I wouldn't personally want the regular build/watch feature slowed down by it, unless it was nearly imperceptible.

I agree that it probably should not be a part of the all target, even when the feature is activated manually via an option in the dune-project file.

@sidkshatriya

Why not use a single SQLITE db for all the indexing operations and store all your project-wide-occurences data in it?

The main pro of the "individual files" method is that it plays well with Dune's (and lot of other build systems) file-based model: this allows the build system to maximize parallelism and incrementality at the cost of bigger artifacts. It also fits closely OCaml separate compilation model so indexing can start when individual cmt files are ready. For cases when the simplicity+time vs size compromise is not viable I plan to add support for different indexing strategies in the indexing tool itself (like calling the tool once for every cmt, or using a database as a store).

@nojb

Sorry for the naïve question, but could you explain what you mean by project-wide occurrences concretely?

By that I mean the ability to select a value in the buffer and display all the usages of that same value across the entire project. So far this only works in the active buffer.

As a data point, we have a typed .cmt-based "grep" tool at LexiFi which one can use to search for structural OCaml terms

That sounds like a great tool! Do you call it on every cmt file in the code-base and it iterates on the whole typed-tree?
I guess it is done in parallel ? I can see two possible differences with value indexing:

  • We need to iterate on the typedtree and identify the exact definition of every value we found. This is done by reducing "shapes" and is quite expensive even if we memoïze a lot of the computation. So I would guess that it is more expensive than the computation you do.
  • Iterating on the cmt to look for structural terms has no dependency on other cmts, right? In our case when we reduce shapes we might need to load a lot of other cmt files to find the definition of a value, which is also quite expensive and makes parallelization more difficult to achieve (in fact in the current version I reduce shapes partially in individual cmt files and only perform full reduction when aggregating indexes to take advantage of caching).

Have you done any benchmark to compare performance with and without the indexing step?

Currently the ocaml-lsp project (80K lines) it takes 2.5s to build the index on my (M1 pro) machine when the project has already been built without the index. The difference when comparing a completely from-scratch build with and without the index is smaller: around 1.5s of a total of 10s.

An incremental rebuild of the index after a one-file change takes around 0.2s more than a rebuild without the indexing.

Could it make sense to do a first implementation of this feature without the indexing step

I find the current implementation quite satisfying (right now I am cleaning things up and about to un-draft the PR) and it works quite well in term of timings. It would require quite a lot of changes to have Merlin directly build the index and would probably be much slower without Dune's incredible parallelism and incrementality.

@rgrinberg
Copy link
Member

The main pro of the "individual files" method is that it plays well with Dune's (and lot of other build systems) file-based model: this allows the build system to maximize parallelism and incrementality at the cost of bigger artifacts. It also fits closely OCaml separate compilation model so indexing can start when individual cmt files are ready. For cases when the simplicity+time vs size compromise is not viable I plan to add support for different indexing strategies in the indexing tool itself (like calling the tool once for every cmt, or using a database as a store).

Yes, that's indeed the case why we usually lean towards files. But I find that to be a crutch and a sorely missing feature in dune. With a proper database, we would be losing incrementality and parallelism only at the stage when we serialize the write into the database. The actual indexing operations would be retaining all those good properties. Nevertheless, using a database isn't realistic until we have a database that supports something like dune's "stale artifact deletion" for records.

I find the current implementation quite satisfying (right now I am cleaning things up and about to un-draft the PR) and it works quite well in term of timings. It would require quite a lot of changes to have Merlin directly build the index and would probably be much slower without Dune's incredible parallelism and incrementality.

It seems concerning that merlin is so tied to a particular implementation of the database index. Would it be possible to make sure that the db is a separate component with a well defined API?

@voodoos
Copy link
Collaborator Author

voodoos commented Jan 12, 2023

The "satisfying" part was the Dune rules :-)

So far the format (basically a marshalled map) of the index is defined with some magic numbers and a shared module between Merlin and the indexing binary.

One way to decouple this would be to define a small rpc (that doesn't need to run in the background) for Merlin to query the indexing binary for occurrences. It would be easier to extend the protocol if we want to index more thing. Is that the kind of API you had in mind ?

@rgrinberg
Copy link
Member

It doesn't need to be RPC, a shared module should be enough. It just needs to encapsulate the details of marshalling and IO so that they only live in one place.

@ELLIOTTCABLE
Copy link
Contributor

I find the current implementation quite satisfying (right now I am cleaning things up and about to un-draft the PR) and it works quite well in term of timings.

The remaining work to get this work merged, if I understand correctly, seems to be:

  1. Abstract the index-access,
  2. rebase/merge up to the latest of the various involved projects,
  3. then get these PRs opened for review/merge?

Is there any way I can contribute? I truly believe this is one of the big things holding back the 'newbie experience' in the OCaml ecosystem, would love to help push it forward! 😅

@voodoos
Copy link
Collaborator Author

voodoos commented Apr 15, 2024

Closing in favor of #10422 that update these rules to the new process which involve the compiler pre-indexing cmt files.

@voodoos voodoos closed this Apr 15, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
merlin requires-team-discussion This topic requires a team discussion
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants