
Add "workspace invalidation" sources support for shell / adhoc backends #21051

Merged: @tdyas merged 5 commits into pantsbuild:main on Jun 17, 2024

Conversation

@tdyas (Contributor) commented Jun 11, 2024

Add support for "workspace invalidation" sources for the `adhoc_tool` and `shell_command` target types. This support allows those targets to depend on the content of files in the repository without materializing those sources into the execution sandbox. It is intended to be used in conjunction with the workspace environment, where execution does not take place in a sandbox.

The new field `workspace_invalidation_sources` on both target types is a list of globs into the repository. The digest of the referenced files will be inserted as an environment variable in the executed process (which makes it part of the process's cache key).
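As a rough illustration, usage in a BUILD file might look like the following sketch (the target name, command, environment name, and globs are all hypothetical, not taken from this PR):

```python
# BUILD file: illustrative sketch only; names and paths are hypothetical.
shell_command(
    name="bazel-build",
    command="bazel build //subdir/...",
    # Assumes a workspace environment target named "workspace".
    environment="workspace",
    # Re-run the command when any matching file changes, without
    # materializing those files into the execution sandbox.
    workspace_invalidation_sources=["subdir/**/*"],
)
```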

@tdyas (Contributor, Author) commented Jun 11, 2024

This PR supersedes #20996.

@huonw (Contributor) commented Jun 12, 2024

Interesting approach 👍

I've not read the code in detail yet, but just asking some questions to help set my context when I find a moment:

  1. What happens if a shell_command using a workspace environment specifies some execution_dependencies? How/if are those incorporated into the cache key/invalidation? (This isn't directly related to this PR... but helps set the context for the next question)
  2. With the glob approach, it seems like this is particularly targeted at files that aren't tracked by Pants (i.e. don't have a target, even files(...)). Is that a correct understanding?

@benjyw (Contributor) commented Jun 12, 2024

IIUC this is entirely a performance thing? It would do no harm, from a correctness perspective, to materialize these files into the sandbox, it would just be wasted work?

@benjyw (Contributor) left a comment

So IIUC there is an inherent race condition here:

  1. We hash these sources to generate a cache key
  2. The user edits the sources.
  3. We run some adhoc tool on these sources.

We're now caching the result against the wrong key, no?

@@ -253,6 +253,22 @@ class AdhocToolOutputRootDirField(StringField):
    )


class AdhocToolHashOnlySourcesGlobsField(StringSequenceField):
    alias: ClassVar[str] = "hash_only_sources_globs"
Contributor:

Maybe this should be named just hash_only_sources? The regular sources field can take globs but we don't put "_globs" in its name.

Contributor:

I may also want to bikeshed the name further, but not yet...

Contributor (Author):

> Maybe this should be named just `hash_only_sources`? The regular `sources` field can take globs but we don't put "_globs" in its name.

Fine by me. I am not wedded to any particular name.

Contributor (Author):

What about unmaterialized_sources?

Contributor (Author):

Or indirect_sources?

Contributor:

I like unmaterialized_sources, although unmaterialized might be opaque jargon to many users (i.e. require them to read the docs to make any reasonable guess at what it means).

Just brainstorming some other ideas:

  • extra_invalidation_sources
  • non_sandbox_sources
  • workspace_invalidation_sources
  • workspace_only_sources

@tdyas (Contributor, Author), Jun 15, 2024:

I like your suggestions of workspace_invalidation_sources and workspace_only_sources because they incorporate "workspace", since this feature is intended to be paired with use of workspace_environment. Indeed, it suggests restricting this feature to be enabled only when the workspace environment is set on the adhoc_tool / shell_command target. If the user is using a regular environment, they should be relying on ordinary dependencies to express inputs.

I will modify the PR to go with workspace_invalidation_sources for now and error if the field is set for non-workspace environments.

Contributor (Author):

For now, I am just going to document that it should only be used with workspace_environment. No need to error or ignore just yet.

@kaos (Member) commented Jun 12, 2024

> So IIUC there is an inherent race condition here:
>
>   1. We hash these sources to generate a cache key
>   2. The user edits the sources.
>   3. We run some adhoc tool on these sources.
>
> We're now caching the result against the wrong key, no?

Not related to this change, but the above scenario is already true during pantsd bootstrapping. I've observed getting cached results where the result != what should've been given with the current source contents, caused by a "late" save during the initial moments of a pants run. Killing pantsd has been the way to get out of that situation (as editing the file back and forth results in getting the "wrong" result back).

@tdyas (Contributor, Author) commented Jun 12, 2024

> Not related to this change, but the above scenario is already true during pantsd bootstrapping. I've observed getting cached results where the result != what should've been given with the current source contents, caused by a "late" save during the initial moments of a pants run. Killing pantsd has been the way to get out of that situation (as editing the file back and forth results in getting the "wrong" result back).

This seems like a problem due to inotify triggering an invalidation at a certain point in time and then Pants actually reading the disk contents at a later time. Given this PR uses the same Digest capturing logic as used for sources on other targets, I would expect that problem to exist for regular sources too depending on the timing of the inotify watch and subsequent writes to the file.

This PR should not be any worse than the existing state in that regard.

@tdyas (Contributor, Author) commented Jun 12, 2024

> IIUC this is entirely a performance thing? It would do no harm, from a correctness perspective, to materialize these files into the sandbox, it would just be wasted work?

It would be wasted work. Such dependencies would be materialized into the temporary directory that is created even for workspace environment executions.

It is also a DX issue: The client of mine who is paying for this work expressed the preference to not have to duplicate any targets in Pants which are already defined in Bazel (which is their motivating use case). Their glob would likely just be ["subdir/**/*"].

@tdyas (Contributor, Author) commented Jun 12, 2024

> Interesting approach 👍
>
> I've not read the code in detail yet, but just asking some questions to help set my context when I find a moment:
>
> 1. What happens if a `shell_command` using a workspace environment specifies some `execution_dependencies`? How/if are those incorporated into the cache key/invalidation? (This isn't directly related to this PR... but helps set the context for the next question)
>
> 2. With the glob approach, it seems like this is particularly targeted at files that _aren't_ tracked by Pants (i.e. don't have a target, even `files(...)`). Is that a correct understanding?

1. Regular dependencies will be materialized in the execution sandbox for the shell command and are part of the "input digest" of the `Process` representing the execution. Any change to the input digest changes the cache key. For workspace environment executions, the input digest is still materialized in a temporary directory available during the workspace execution, so regular dependencies remain part of the input digest.

2. Correct. The motivating use case is executing Bazel in the workspace environment. The user should not need to repeat in Pants anything related to Bazel targets and configuration. The goal is to tell Pants: if any change happens to these paths, then re-invoke Bazel to actually figure out what to rebuild.
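To make those cache-key mechanics concrete, here is a rough sketch of how a digest of the globbed files can ride into the cache key as an environment variable (all names below are invented for illustration; this is not the PR's actual code):

```python
import hashlib

# Illustrative only: helper, variable, and env var names are invented;
# the real implementation lives in Pants' adhoc/shell backends.
def invalidation_hash(file_contents: dict[str, bytes]) -> str:
    """Hash the paths and contents of all files matched by the globs."""
    h = hashlib.sha256()
    for path in sorted(file_contents):
        h.update(path.encode("utf-8"))
        h.update(file_contents[path])
    return h.hexdigest()

files = {
    "subdir/BUILD.bazel": b"cc_binary(name = 'app', ...)",
    "subdir/app/main.c": b"int main(void) { return 0; }",
}

# The hash rides along as an env var on the executed process. Env vars are
# part of the process's cache key, so any content change forces a re-run.
env = {"__WORKSPACE_INVALIDATION_SOURCES_HASH": invalidation_hash(files)}
```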

@huonw (Contributor) left a comment

Looks good, other than some minor doc details and the naming.

Also: maybe we could expand the environments.mdx section about workspace_environment, so that it mentions/links to this as a breadcrumb for people to follow. Just noting that that section currently has the :::caution about caching that this would fit into well!

@@ -253,6 +253,22 @@ class AdhocToolOutputRootDirField(StringField):
    )


class AdhocToolHashOnlySourcesGlobsField(StringSequenceField):
    alias: ClassVar[str] = "hash_only_sources_globs"

    help = help_text(
        """
        Path globs for source files on which this target depends indirectly, but which should not be
        materlized into the execution sandbox. Pants will compute the hash of all of the files
Contributor:

Suggested change:
- materlized into the execution sandbox. Pants will compute the hash of all of the files
+ materialized into the execution sandbox. Pants will compute the hash of all of the files

        references by the globs and include that hash as part of the cache key for the
        process to be executed (as an environment variable).
Contributor:

I think it'd be good to be explicit about the consequence/purpose of this, e.g. something like:

Suggested change:
- process to be executed (as an environment variable).
+ process to be executed (as an environment variable), so that the command re-runs if these files change.

@tdyas changed the title from 'Add "hash-only" sources support for shell / adhoc backends' to 'Add "workspace invalidation" sources support for shell / adhoc backends' on Jun 15, 2024
@tdyas force-pushed the shell_hash_only_sources branch from b1d7740 to 74233d5 on June 15, 2024
@tdyas (Contributor, Author) commented Jun 15, 2024

> Also: maybe we could expand the environments.mdx section about workspace_environment, so that it mentions/links to this as a breadcrumb for people to follow. Just noting that that section currently has the :::caution about caching that this would fit into well!

Will do.

@tdyas force-pushed the shell_hash_only_sources branch from 185ab2a to 06d432f on June 15, 2024
@tdyas force-pushed the shell_hash_only_sources branch from 06d432f to 0f638c5 on June 15, 2024
@huonw (Contributor) left a comment

Looks good, thanks for iterating!

> It is also a DX issue: The client of mine who is paying for this work expressed the preference to not have to duplicate any targets in Pants which are already defined in Bazel (which is their motivating use case). Their glob would likely just be ["subdir/**/*"].

Just thinking on this a bit more: on the surface, it seems like it wouldn't be a major imposition to define a single catch-all files(name="bazel-files", sources=["subdir/**/*"]) target. But, even if that was acceptable, just doing this as a performance optimisation seems perfectly appropriate!

@tdyas merged commit fcac40f into pantsbuild:main on Jun 17, 2024 (25 checks passed)
@tdyas deleted the shell_hash_only_sources branch on June 17, 2024
tdyas added a commit to tdyas/pants that referenced this pull request on Jun 17, 2024: …ds (pantsbuild#21051)
@tdyas (Contributor, Author) commented Jun 17, 2024

Manually cherry picked to 2.22.x branch by #21075. This support is necessary for proper use of the workspace environment with shell_command and adhoc_tool.

@benjyw (Contributor) commented Jun 17, 2024

> Not related to this change, but the above scenario is already true during pantsd bootstrapping. I've observed getting cached results where the result != what should've been given with the current source contents, caused by a "late" save during the initial moments of a pants run. Killing pantsd has been the way to get out of that situation (as editing the file back and forth results in getting the "wrong" result back).

> This seems like a problem due to inotify triggering an invalidation at a certain point in time and then Pants actually reading the disk contents at a later time. Given this PR uses the same Digest capturing logic as used for sources on other targets, I would expect that problem to exist for regular sources too depending on the timing of the inotify watch and subsequent writes to the file.
>
> This PR should not be any worse than the existing state in that regard.

I don't think this is true. A Digest is fingerprinted after it's captured. We are guaranteed that the SHA in the Digest correctly hashes the contents we're materializing in the sandbox, working on, and caching against. It may not be what's in the workspace at a later time, but it is guaranteed that the inputs fingerprinted by the Digest are what we operated on to produce the cached result.

But AFAICT that is not true here, when we're not working in a sandbox.

This is why invalidating via an mtime would be fine, but invalidating via a content hash is not. I think this is still a substantial concern.

@benjyw (Contributor) commented Jun 17, 2024

> But AFAICT that is not true here, when we're not working in a sandbox.
>
> This is why invalidating via an mtime would be fine, but invalidating via a content hash is not. I think this is still a substantial concern.

Why did we move away from using mtimes for this? IIUC this is worse than the issue @kaos raises, since we'd not just be memoizing badly, but polluting a persistent cache.

@tdyas (Contributor, Author) commented Jun 18, 2024

> I don't think this is true. A Digest is fingerprinted after it's captured. We are guaranteed that the SHA in the Digest correctly hashes the contents we're materializing in the sandbox, working on, and caching against. It may not be what's in the workspace at a later time, but it is guaranteed that the inputs fingerprinted by the Digest are what we operated on to produce the cached result.
>
> But AFAICT that is not true here, when we're not working in a sandbox.
>
> This is why invalidating via an mtime would be fine, but invalidating via a content hash is not. I think this is still a substantial concern.

The existing code uses the same PathGlobs -> Digest intrinsic used to capture sources. See:

    snapshot = await Get(Snapshot, PathGlobs, path_globs)

(where HydratedSources is captured using PathGlobs -> Digest -> Snapshot). If capturing the "workspace invalidation" globs from the repository has a race condition, wouldn't regular source file capture from the repository also have a problem?

As for why I switched, I sensed lots of push back on #20996 which adds an intrinsic for obtaining mtime (and other metadata) on repository paths. Given I want to get this project done in a timely manner, I chose to switch to what seems to me to be a simpler and hopefully more acceptable solution since it would not need a new intrinsic rule.
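For readers unfamiliar with the rule API being referenced, here is a simplified sketch of that capture path (the wrapper function is hypothetical; the `Get(Snapshot, PathGlobs(...))` call is the intrinsic discussed above):

```python
from pants.engine.fs import PathGlobs, Snapshot
from pants.engine.rules import Get

# Hypothetical wrapper around the intrinsic referenced above; meant to be
# awaited from inside a @rule.
async def workspace_invalidation_fingerprint(globs: tuple[str, ...]) -> str:
    # Same PathGlobs -> Snapshot capture used to hydrate regular `sources`
    # fields; the snapshot's digest fingerprints the matched files' paths
    # and contents.
    snapshot = await Get(Snapshot, PathGlobs(globs))
    return snapshot.digest.fingerprint
```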

@tdyas (Contributor, Author) commented Jun 18, 2024

Regarding the concern that a workspace file could be overwritten after execution of a workspace process has already started (but cached under the digest of the prior copy of the file): I agree that mtime in the cache key would be less bad than that sort of cache poisoning. In which case, I would really like to discuss in real time with someone what the pushback on #20996 is.

The DX for mtime versus content would be the same workspace_invalidation_sources field already proposed. It is very easy to switch over once something like #20996 is available.

@benjyw (Contributor) commented Jun 18, 2024

To clarify - you are correct that the race condition still exists in terms of which "point in time" is captured (in fact we may be sweeping through multiple files just as they are being changed, so the captured state may never have existed at a specific point in time!)

BUT once captured, we only operate on that state in a sandbox, with no further edits. So we know with certainty that the cached output of a process is exactly the reproducible result of running that process on the inputs with the given Digest.

Real world scenario: The user is editing during snapshotting, specifically: changing state from A to B and then back to A. In the sandbox world, if we happened to snapshot at state B we will run the process at state B. Which may not be what the user intended. But if the user then re-runs at state A, we will re-run the process, because we see different inputs.

But in the workspace world, we might snapshot at state B, then run at state A. Therefore we will cache the result at A against the Digest at B. The user will see a wrong result, re-run, but we will retrieve the incorrect, cached result. The only cure will be to nuke the cache.

Unless I'm missing something?

This sort of thing is presumably why Make and Cargo and friends use mtimes and have no caching. Absent sandboxed execution of captured inputs, you can't know that the workspace state you're operating against is the one you digested.

cc @huonw and @kaos to check my logic and discuss the pushback on #20996 .

@huonw (Contributor) commented Jun 18, 2024

> But in the workspace world, we might snapshot at state B, then run at state A. Therefore we will cache the result at A against the Digest at B. The user will see a wrong result, re-run, but we will retrieve the incorrect, cached result. The only cure will be to nuke the cache.

This sounds plausible. Just spelling out a specific sequence:

  1. Files in state A
  2. Execute workspace pants command that runs some fast-running process
  3. Pants reads file system/hashes the sources in state A
  4. Edit files to state B
  5. Process starts, running on the files in state B
  6. Process ends and is cached (using key for state A)
  7. (OS tells pants file watcher about edit of files to state B)
  8. Edit files back to state A
  9. Re-run process, which is served from incorrect cache of execution on state B

(I think in particular step 6 and 7 need to happen in that order: if file watching is notified first, the process will be killed and not cached.)

AIUI, the thinking is that mtimes solve this because (assuming the user doesn't play games with touch) they'll only ever increase, so the A -> B -> A problem cannot happen? Even if the file contents are identical, the timestamps will not be.

> discuss the pushback on #20996

For me, for the bigger picture of #20996, I was more feeling like I didn't fully understand the background, rather than necessarily pushing back (I didn't know enough to push forward or back). Sorry if it came across as more negative than intended!

After all this discussion I'm definitely understanding it better!

(I was also diving into the code improvements, which I'm sure could come across as negative!)

@benjyw (Contributor) commented Jun 18, 2024

Yes, mtimes solve this because there is no caching at all, effectively, just invalidation. Technically stuff is written to the cache, but it can never be retrieved once a file changes.

The downside is that remote caching for CI becomes useless - the newly cloned repo in CI will have fresh mtimes.
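A self-contained illustration of the A -> B -> A hazard from this thread, in plain Python with no Pants APIs: content-derived keys collide across the round trip, while mtime-derived keys only move forward.

```python
import hashlib
import os
import tempfile
import time

def content_key(path: str) -> str:
    """Cache key derived from file contents (what a Digest fingerprints)."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def mtime_key(path: str) -> str:
    """Cache key derived from file metadata; monotonic under normal edits."""
    return str(os.stat(path).st_mtime_ns)

path = os.path.join(tempfile.mkdtemp(), "example.txt")

def write(text: str) -> None:
    with open(path, "w") as f:
        f.write(text)
    # Let the filesystem clock tick between writes; filesystems with coarse
    # timestamp resolution may need a longer pause.
    time.sleep(0.05)

write("state A")
key_a, stamp_a = content_key(path), mtime_key(path)

write("state B")  # user edits to B...
write("state A")  # ...and back to A

# Content keys collide across the round trip, so a result produced while the
# file was at B can later be served from cache under A's key.
assert content_key(path) == key_a
# mtime keys only move forward, so the stale entry can never be retrieved.
assert mtime_key(path) != stamp_a
```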

@tdyas (Contributor, Author) commented Jun 20, 2024

#21092 switches adhoc_tool / shell_command to using metadata-based invalidation for workspace_invalidation_sources.

tdyas added a commit that referenced this pull request on Jun 25, 2024: …kspace_invalidation_sources` (#21092)

Switch to using metadata-based invalidation instead of content-based
invalidation for any sources referenced by the
`workspace_invalidation_sources` field.

This is necessary because `workspace_invalidation_sources` is intended
to be used with the `experimental_workspace_environment` and there is a
problem with content-based invalidation in that scenario: There is a
potential cache poisoning scenario where Pants computes a content digest
but then the user overwrites the digested sources before Pants has
executed the applicable `adhoc_tool` / `shell_command` process. The
cache will now have a result stored under the digest of the original
file version even though the file content changed. See
#21051 (comment)
for expanded discussion.