Add "workspace invalidation" sources support for shell / adhoc backends #21051
Conversation
This PR supersedes #20996.
Interesting approach 👍 I've not read the code in detail yet, but just asking some questions to help set my context when I find a moment:

IIUC this is entirely a performance thing? It would do no harm, from a correctness perspective, to materialize these files into the sandbox; it would just be wasted work?
So IIUC there is an inherent race condition here:
- We hash these sources to generate a cache key
- The user edits the sources.
- We run some adhoc tool on these sources.
We're now caching the result against the wrong key, no?
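The race described above can be simulated in a few lines. This is a hypothetical stand-in (the `digest` helper and `cache` dict are illustrative, not Pants code) showing how a result produced from edited sources gets stored under the stale key:

```python
import hashlib
import os
import tempfile

def digest(path: str) -> str:
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

# Hypothetical cache: input digest -> tool output.
cache: dict[str, str] = {}

workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "input.txt")

# 1. Hash the sources to generate a cache key (state A).
with open(src, "w") as f:
    f.write("state A")
key = digest(src)

# 2. The user edits the sources before the tool runs.
with open(src, "w") as f:
    f.write("state B")

# 3. The tool runs against state B, but the result is cached under A's key.
with open(src) as f:
    cache[key] = "output for " + f.read()

# 4. The user reverts to state A: the key matches, so the stale
#    state-B result is served from the cache.
with open(src, "w") as f:
    f.write("state A")
assert cache[digest(src)] == "output for state B"
```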
```
@@ -253,6 +253,22 @@ class AdhocToolOutputRootDirField(StringField):
    )


class AdhocToolHashOnlySourcesGlobsField(StringSequenceField):
    alias: ClassVar[str] = "hash_only_sources_globs"
```
Maybe this should be named just `hash_only_sources`? The regular `sources` field can take globs, but we don't put "_globs" in its name.
I may also want to bikeshed the name further, but not yet...
> Maybe this should be named just `hash_only_sources`? The regular `sources` field can take globs, but we don't put "_globs" in its name.
Fine by me. I am not wedded to any particular name.
What about `unmaterialized_sources`?
Or `indirect_sources`?
I like `unmaterialized_sources`, although `unmaterialized` might be opaque jargon to many users (i.e. require them to read the docs to make any reasonable guess at what it means).

Just brainstorming some other ideas:
- `extra_invalidation_sources`
- `non_sandbox_sources`
- `workspace_invalidation_sources`
- `workspace_only_sources`
I like your suggestions of `workspace_invalidation_sources` and `workspace_only_sources` because they incorporate "workspace", since this feature is intended to be paired with use of `workspace_environment`. Indeed, it points to restricting this feature to only be enabled when the workspace environment is set on the `adhoc_tool` / `shell_command` target. If the user is using a regular environment, then they should be relying on ordinary dependencies to express inputs.

I will modify the PR to go with `workspace_invalidation_sources` for now and error if the field is set for non-workspace environments.
For now, I am just going to document that it should only be used with `workspace_environment`. No need to error or ignore just yet.
Not related to this change, but the above scenario is already true during pantsd bootstrapping. I've observed getting cached results where the result differed from what should've been produced given the current source contents, caused by a "late" save during the initial moments of a Pants run. Killing pantsd has been the way to get out of that situation (as editing the file back and forth just returns the "wrong" result).
This seems like a problem due to inotify triggering an invalidation at a certain point in time and then Pants actually reading the disk contents at a later time. Given this PR uses the same capture mechanism, it should not be any worse than the existing state in that regard.
It would be wasted work. Such dependencies would be materialized into the temporary directory made even for workspace environment executions. It is also a DX issue: the client of mine who is paying for this work expressed a preference not to have to duplicate in Pants any targets which are already defined in Bazel (which is their motivating use case). Their glob would likely just be `["subdir/**/*"]`.
Looks good, other than some minor doc details and the naming.

Also: maybe we could expand the `environments.mdx` section about `workspace_environment`, so that it mentions/links to this as a breadcrumb for people to follow. Just noting that that section currently has the `:::caution` about caching that this would fit into well!
```
    help = help_text(
        """
        Path globs for source files on which this target depends indirectly, but which should not be
        materlized into the execution sandbox. Pants will compute the hash of all of the files
```
```diff
- materlized into the execution sandbox. Pants will compute the hash of all of the files
+ materialized into the execution sandbox. Pants will compute the hash of all of the files
```
```
        Path globs for source files on which this target depends indirectly, but which should not be
        materlized into the execution sandbox. Pants will compute the hash of all of the files
        references by the globs and include that hash as part of the cache key for the
        process to be executed (as an environment variable).
```
I think it'd be good to be explicit about the consequence/purpose of this, e.g. something like:

```diff
- process to be executed (as an environment variable).
+ process to be executed (as an environment variable), so that the command re-runs if these files change.
```
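A rough sketch of the mechanism this help text describes — hash everything the globs match and expose the combined digest to the process — might look like this. The function and environment variable name here are illustrative assumptions, not Pants' actual implementation (which goes through `PathGlobs -> Digest -> Snapshot`):

```python
import glob
import hashlib
import os

def workspace_invalidation_hash(patterns: list[str]) -> str:
    # Combine the path and content of every matched file into one digest.
    # Sorting the matches makes the result deterministic across runs.
    h = hashlib.sha256()
    for pattern in patterns:
        for path in sorted(glob.glob(pattern, recursive=True)):
            if os.path.isfile(path):
                h.update(path.encode())
                with open(path, "rb") as f:
                    h.update(f.read())
    return h.hexdigest()

# The digest would be injected into the process's environment, making it
# part of the cache key without materializing the files into the sandbox.
env = {"WORKSPACE_INVALIDATION_SOURCES_HASH": workspace_invalidation_hash(["subdir/**/*"])}
```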
(force-pushed from b1d7740 to 74233d5)
Will do.
(force-pushed from 185ab2a to 06d432f, then from 06d432f to 0f638c5)
Looks good, thanks for iterating!
> It is also a DX issue: The client of mine who is paying for this work expressed the preference to not have to duplicate any targets in Pants which are already defined in Bazel (which is their motivating use case). Their glob would likely just be `["subdir/**/*"]`.

Just thinking on this a bit more: on the surface, it seems like it wouldn't be a major imposition to define a single catch-all `files(name="bazel-files", sources=["subdir/**/*"])` target. But even if that was acceptable, just doing this as a performance optimisation seems perfectly appropriate!
…ds (pantsbuild#21051) Add support for "workspace invalidation" sources for the `adhoc_tool` and `shell_command` target types. This support allows those targets to depend on the content of files in the repository without materializing those sources in the execution sandbox. It is intended to be used in conjunction with the workspace environment, where execution does not take place in a sandbox. The new field `workspace_invalidation_sources` on both target types is a list of globs into the repository. The digest of the referenced files will be inserted as an environment variable in the executed process (which makes it part of the process's cache key).
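For illustration, usage on a `shell_command` target might look roughly like this sketch. The target name, command, and `environment` value are hypothetical; the `environment` field would need to point at a workspace environment defined elsewhere in the repository:

```python
# BUILD — illustrative sketch only, not from the PR
shell_command(
    name="bazel_codegen",
    command="bazel build //subdir:gen",
    environment="workspace",
    # Re-run the command when any of these files change, without
    # materializing them into the execution sandbox:
    workspace_invalidation_sources=["subdir/**/*"],
)
```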
Manually cherry picked to
I don't think this is true. A Digest is fingerprinted after it's captured. We are guaranteed that the SHA in the Digest correctly hashes the contents we're materializing in the sandbox, working on, and caching against. It may not be what's in the workspace at a later time, but it is guaranteed that the inputs fingerprinted by the Digest are what we operated on to produce the cached result. But AFAICT that is not true here, when we're not working in a sandbox. This is why invalidating via an mtime would be fine, but invalidating via a content hash is not. I think this is still a substantial concern.
Why did we move away from using mtimes for this? IIUC this is worse than the issue @kaos raises, since we'd not just be badly memoizing, but polluting a persistent cache.
The existing code uses the same mechanism (pants/src/python/pants/engine/internals/graph.py, line 1204 at e5aff16): `HydratedSources` is captured using `PathGlobs -> Digest -> Snapshot`. If capturing the "workspace invalidation" globs from the repository has a race condition, wouldn't regular source file capture from the repository also have a problem?
As for why I switched: I sensed lots of pushback on #20996, which adds an intrinsic for obtaining mtime (and other metadata) on repository paths. Given I want to get this project done in a timely manner, I chose to switch to what seems to me to be a simpler and hopefully more acceptable solution, since it would not need a new intrinsic rule.
Regarding the concern that a workspace file could be overwritten after execution of a workspace process has already started (but cached under the digest of the prior copy of the file): I agree an mtime in the cache key would be less bad than that sort of cache poisoning. In which case I would really like to discuss in real time with someone what the pushback on #20996 is. The DX for mtime versus content would be the same.
To clarify - you are correct that the race condition still exists in terms of which "point in time" is captured (in fact we may be sweeping through multiple files just as they are being changed, so the captured state may never have existed at any specific point in time!). BUT once captured, we only operate on that state in a sandbox, with no further edits. So we know with certainty that the cached output of a process is exactly the reproducible result of running that process on the inputs with the given Digest.

Real world scenario: the user is editing during snapshotting, specifically changing state from A to B and then back to A. In the sandbox world, if we happened to snapshot at state B we will run the process at state B, which may not be what the user intended. But if the user then re-runs at state A, we will re-run the process, because we see different inputs.

In the workspace world, however, we might snapshot at state B, then run at state A. Therefore we will cache the result at A against the Digest at B. The user will see a wrong result, re-run, but we will retrieve the incorrect, cached result. The only cure will be to nuke the cache. Unless I'm missing something?

This sort of thing is presumably why Make and Cargo and friends use mtimes and have no caching. Absent sandboxed execution of captured inputs, you can't know that the workspace state you're operating against is the one you digested.

cc @huonw and @kaos to check my logic and discuss the pushback on #20996.
This sounds plausible. Just spelling out a specific sequence:
(I think in particular steps 6 and 7 need to happen in that order: if file watching is notified first, the process will be killed and not cached.) AIUI, the thinking is that mtimes solve this because (assuming the user doesn't play games with
For me, for the bigger picture of #20996, I was more feeling like I didn't fully understand the background, rather than necessarily pushing back (I didn't know enough to push forward or back). Sorry if it came across as more negative than intended! After all this discussion I'm definitely understanding it better! (I was also diving into the code improvements, which I'm sure could come across as negative!)
Yes, mtimes solve this because there is no caching at all, effectively, just invalidation. Technically stuff is written to the cache, but it can never be retrieved once a file changes. The downside is that remote caching for CI becomes useless - the newly cloned repo in CI will have fresh mtimes.
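A small sketch of why metadata keys invalidate without caching (the `mtime_key` helper is a stand-in for illustration, not Pants code):

```python
import hashlib
import os
import tempfile

def mtime_key(path: str) -> str:
    # Fingerprint by metadata rather than content.
    st = os.stat(path)
    return f"{path}:{st.st_mtime_ns}:{st.st_size}"

workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "f.txt")
with open(src, "w") as f:
    f.write("hello")
key = mtime_key(src)

# Any rewrite (or a fresh clone in CI) bumps the mtime, so a cache entry
# stored under the old key can never be retrieved again...
os.utime(src, ns=(123, 456))
assert mtime_key(src) != key

# ...even though a content digest would be unchanged, which is why
# mtime-based keys give safe invalidation but make remote caching useless.
with open(src, "rb") as f:
    assert hashlib.sha256(f.read()).hexdigest() == hashlib.sha256(b"hello").hexdigest()
```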
#21092 switches `workspace_invalidation_sources` to metadata-based invalidation.
…kspace_invalidation_sources` (#21092) Switch to using metadata-based invalidation instead of content-based invalidation for any sources referenced by the `workspace_invalidation_sources` field. This is necessary because `workspace_invalidation_sources` is intended to be used with the `experimental_workspace_environment`, and content-based invalidation has a cache poisoning problem in that scenario: Pants may compute a content digest, the user may then overwrite the digested sources before Pants has executed the applicable `adhoc_tool` / `shell_command` process, and the cache will then store the result under the digest of the original file version even though the file content changed. See #21051 (comment) for expanded discussion.