Collect reverse dependencies #802

pepeiborra · 2021-03-16T10:46:40Z

This PR makes the following changes:

extends Shake to collect the reverse dependencies of every 'Result' by performing some bookkeeping on every evaluation. The amount of work done is O(E) on the number of edges in the graph and I have measured it to be around 3% for a very dense graph with 150.000 nodes and 1e8 edges, so it's not bad at all, and could be put behind a flag if desired. Note that in this PR the set of reverse dependencies is overestimated since ephemeral reverse edges are never deleted.
adds a new function shakeRunDatabaseFromKeys which extends shakeRunDatabase by accepting, in addition to a set of actions (the goals), a set D of keys that have changed since the last run. The transitive closure of D is the set of all keys that have potentially changed since the last run - all keys that do not belong to this set are guaranteed unchanged and can be excluded from the buildRunDependenciesChanged query.

When the set D is small compared to the size of the build graph, this optimisation renders performance gains of one order of magnitude, up to ~~70X~~ 20X in some cases. I have extended[1] HLS to keep track of what has changed since the last full Shake run and performed several experiments:

Applied the ghcide benchmark suite to the standard Cabal-3.0.0.0 example and obtained the results in the comments below. The edit experiment runs in 55% of the time . Full results
Applied the ghcide benchmark suite to the standard lsp-1.0 example and obtained the results in the comments below. The edit experiment runs in 73% of the time. Full results
Applied the ghcide benchmark suite to a very large example involving a module with >17.000 transitive dependencies and measured the performance of the edit benchmark, which runs in 4.75% of the time i.e. >21X faster.

[1] - https://github.com/pepeiborra/ide/tree/keysChanged

Closes #800

EDIT: updated performance numbers

pepeiborra · 2021-03-21T12:03:01Z

There is a tricky issue with async exceptions that I am struggling to solve.

To me, shakeRunDatabaseForKeys is an unsound approximation since it bypasses the version number checks. It is possible that a key K is not up to date and yet, because it doesn't belong to the dirty set, gets reused without updating.

In what cases can this happen? The most common one is when a previous call to shakeRunDatabaseForKeys marked K dirty, but the evaluation didn't touch K so it didn't get recomputed. If the current call to shakeRunDatabaseForKeys isn't aware of this, and if the new dirty set doesn't include K but the evaluation uses it, the stale value will be reused.

To avoid this, I have added state to the database to track the dirty set across runs. A key is removed from the persisted dirty set as it is refreshed. The tricky bit with async exceptions is that a call to shakeRunDatabaseForKeys can get interrupted before it has a chance to add the keys to the dirty set.

I'm not sure how to solve this. The obvious solution is to mask, but it seems ugly and out of place in the Shake codebase. We could put the mask in the caller, by changing shakeRunDatabaseForKeys to return an IO (IO ()), but this probably requires some major changes in Shake. An alternative solution is to track the dirty set in the caller (ghcide) - the problem is that the caller doesn't know a) the transitive closure, and b) when to undirty elements. While we could expose shakeDatabase functions to compute the transitive closure and to inspect the keys that did evaluate in the last run, the end result will probably not be a nice, user friendly API.

pepeiborra · 2021-03-21T23:43:58Z

There is a tricky issue with async exceptions that I am struggling to solve.

To me, shakeRunDatabaseForKeys is an unsound approximation since it bypasses the version number checks. It is possible that a key K is not up to date and yet, because it doesn't belong to the dirty set, gets reused without updating.

In what cases can this happen? The most common one is when a previous call to shakeRunDatabaseForKeys marked K dirty, but the evaluation didn't touch K so it didn't get recomputed. If the current call to shakeRunDatabaseForKeys isn't aware of this, and if the new dirty set doesn't include K but the evaluation uses it, the stale value will be reused.

To avoid this, I have added state to the database to track the dirty set across runs. A key is removed from the persisted dirty set as it is refreshed. The tricky bit with async exceptions is that a call to shakeRunDatabaseForKeys can get interrupted before it has a chance to add the keys to the dirty set.

I'm not sure how to solve this. The obvious solution is to mask, but it seems ugly and out of place in the Shake codebase. We could put the mask in the caller, by changing shakeRunDatabaseForKeys to return an IO (IO ()), but this probably requires some major changes in Shake. An alternative solution is to track the dirty set in the caller (ghcide) - the problem is that the caller doesn't know a) the transitive closure, and b) when to undirty elements. While we could expose shakeDatabase functions to compute the transitive closure and to inspect the keys that did evaluate in the last run, the end result will probably not be a nice, user friendly API.

I opted for masking in Shake. The masking minimal and happens in shakeRunDatabaseForKeys.

rayshih · 2021-03-22T16:24:27Z

src/Development/Shake/Database.hs

+--   of actions to run after the database was closed, as added with
+--   'Development.Shake.runAfter' and 'Development.Shake.removeFilesAfter'.
+shakeRunDatabaseForKeys
+    :: Maybe [SomeShakeValue]       -- ^ Set of keys changed since last run


Can't find the caller where Just value is used. Am I missing something?

This is an external API. See for example how it is used in HLS:

https://github.com/pepeiborra/ide/blob/keysChanged/ghcide/src/Development/IDE/Core/Shake.hs#L758

pepeiborra · 2021-05-23T11:48:00Z

There is a tricky issue with async exceptions that I am struggling to solve.
To me, shakeRunDatabaseForKeys is an unsound approximation since it bypasses the version number checks. It is possible that a key K is not up to date and yet, because it doesn't belong to the dirty set, gets reused without updating.
In what cases can this happen? The most common one is when a previous call to shakeRunDatabaseForKeys marked K dirty, but the evaluation didn't touch K so it didn't get recomputed. If the current call to shakeRunDatabaseForKeys isn't aware of this, and if the new dirty set doesn't include K but the evaluation uses it, the stale value will be reused.
To avoid this, I have added state to the database to track the dirty set across runs. A key is removed from the persisted dirty set as it is refreshed. The tricky bit with async exceptions is that a call to shakeRunDatabaseForKeys can get interrupted before it has a chance to add the keys to the dirty set.
I'm not sure how to solve this. The obvious solution is to mask, but it seems ugly and out of place in the Shake codebase. We could put the mask in the caller, by changing shakeRunDatabaseForKeys to return an IO (IO ()), but this probably requires some major changes in Shake. An alternative solution is to track the dirty set in the caller (ghcide) - the problem is that the caller doesn't know a) the transitive closure, and b) when to undirty elements. While we could expose shakeDatabase functions to compute the transitive closure and to inspect the keys that did evaluate in the last run, the end result will probably not be a nice, user friendly API.

I opted for masking in Shake. The masking minimal and happens in shakeRunDatabaseForKeys.

I think this masking is not essential and can be removed, since the caller is always able to find out whether a rule was actually evaluated and use this information to keep an up-to-date state of the dirty keys on the caller side.

If the caller provides us with a set of keys that have changed, the reverse dependencies are used to compute the transitive dependencies changed avoiding a call to firstJustWaitUnordered

SomeShakeValue is a solution for heterogeneous lists of keys

shakeRunDatabaseForKeys is an unsound approximation since it bypasses the version number checks. It is possible that a key K is not up to date and yet, because it doesn't belong to the dirty set, gets reused without updating. In what cases can this happen? The most common one is when a previous call to shakeRunDatabaseForKeys marked K dirty, but the evaluation didn't touch K so it didn't get recomputed. The current call to shakeRunDatabaseForKeys doesn't know this, and if the new dirty set doesn't include K but the evaluation uses it, the stale value will be reused. To avoid this, we persist (in memory only) the dirty set and remember it across runs. A key is removed from the persisted dirty set as it is refreshed.

Add an option to enable the tracking of reverse dependencies. Zero overhead for clients who do not care about them

When I ported reverse dependencies from Shake[1] I missed an important detail. While Shake models alwaysRerun as a dependency on an actual rule (AlwaysRerun), hls-graph models alwaysRerun by setting actionDeps to Nothing. This is important because dependencies are not computed for these rules, and therefore reverse dependency tracking doesn't do anything, which breaks correctness of dirty rebuilds This commit adds dependency tracking for alwaysRerun rules, and fixes reverse dependency tracking. The alternative would be following the Shake approach but I'm not sure what other implications this might have. [1] - ndmitchell/shake#802

pepeiborra · 2021-12-15T23:56:05Z

Closing this since the work has moved on to HLS-graph

ndmitchell · 2022-01-02T16:06:21Z

Thanks for all the hard work doing this @pepeiborra - it's very nice work. I think the end conclusion is that HLS-graph and Shake have diverged significantly enough that doing one for both isn't all that helpful. I shared some more thoughts about the size of build systems and where Shake fits in https://neilmitchell.blogspot.com/2021/09/reflecting-on-shake-build-system.html

pepeiborra force-pushed the reverse-deps branch 3 times, most recently from d31fc6d to d48eddc Compare March 17, 2021 20:08

pepeiborra changed the title ~~RFC: Collect reverse dependencies~~ Collect reverse dependencies Mar 18, 2021

pepeiborra force-pushed the reverse-deps branch 3 times, most recently from 58be020 to b96ee3b Compare March 21, 2021 11:58

pepeiborra force-pushed the reverse-deps branch 5 times, most recently from 43b94b8 to e811813 Compare March 21, 2021 23:42

pepeiborra force-pushed the reverse-deps branch 4 times, most recently from 5971acc to c0b281e Compare March 22, 2021 07:52

rayshih reviewed Mar 22, 2021

View reviewed changes

pepeiborra mentioned this pull request Apr 18, 2021

Add hls-graph abstracting over shake haskell/haskell-language-server#1748

Merged

This was referenced May 15, 2021

Track changes and inform Shake haskell/haskell-language-server#1834

Closed

Minimal Shake rebuilds pepeiborra/ide#9

Draft

pepeiborra force-pushed the reverse-deps branch 2 times, most recently from 09e5396 to 857044d Compare May 23, 2021 10:08

pepeiborra mentioned this pull request May 23, 2021

Technology preview: Keep track of changes to minimize rebuilds haskell/haskell-language-server#1862

Merged

1 task

pepeiborra force-pushed the reverse-deps branch from 857044d to 862deae Compare June 21, 2021 20:57

pepeiborra added 3 commits July 4, 2021 21:58

Collect reverse dependencies

4bf8dbb

If the caller provides us with a set of keys that have changed, the reverse dependencies are used to compute the transitive dependencies changed avoiding a call to firstJustWaitUnordered

don't use return

7cd0112

backwards compat.

c150aea

pepeiborra added 18 commits July 4, 2021 21:58

avoid copying Result records

ba0fbe1

do not mark AlwaysRerun as dirty by default

32a4f6c

Existential wrapper for ShakeValue

d01c40c

SomeShakeValue is a solution for heterogeneous lists of keys

fix typo

0dfbd0a

microopt: store the RDeps in an IORef

834a764

use getResult

d357f9f

Increase diagnostic output

67d2501

mask until the dirty set is updated

0f29380

redundant import

52ed87e

comments

e0ee243

Store the reverse dependencies in an Ids array

4270b2e

imports

0313892

Compat. with GHC 8.0

e6c0199

shakeReverseDependencies

6669768

Add an option to enable the tracking of reverse dependencies. Zero overhead for clients who do not care about them

Special handling of alwaysRerun

0a7ce83

Visit only the dirty dependencies

ec620ac

fix imports

35cdf3a

pepeiborra force-pushed the reverse-deps branch from 862deae to 35cdf3a Compare July 4, 2021 20:58

pepeiborra mentioned this pull request Aug 17, 2021

Fix filepath identity in cradle dependencies when using reactive change tracking haskell/haskell-language-server#2106

Merged

pepeiborra mentioned this pull request Oct 24, 2021

Fix reverse dep. tracking for alwaysRerun rules haskell/haskell-language-server#2298

Merged

pepeiborra closed this Dec 15, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Collect reverse dependencies #802

Collect reverse dependencies #802

pepeiborra commented Mar 16, 2021 •

edited

Loading

pepeiborra commented Mar 21, 2021 •

edited

Loading

pepeiborra commented Mar 21, 2021

rayshih Mar 22, 2021

pepeiborra Mar 22, 2021

pepeiborra commented May 23, 2021

pepeiborra commented Dec 15, 2021

ndmitchell commented Jan 2, 2022

Collect reverse dependencies #802

Collect reverse dependencies #802

Conversation

pepeiborra commented Mar 16, 2021 • edited Loading

pepeiborra commented Mar 21, 2021 • edited Loading

pepeiborra commented Mar 21, 2021

rayshih Mar 22, 2021

Choose a reason for hiding this comment

pepeiborra Mar 22, 2021

Choose a reason for hiding this comment

pepeiborra commented May 23, 2021

pepeiborra commented Dec 15, 2021

ndmitchell commented Jan 2, 2022

pepeiborra commented Mar 16, 2021 •

edited

Loading

pepeiborra commented Mar 21, 2021 •

edited

Loading