
Add "workspace invalidation" sources support for shell / adhoc backends #21051

Merged: @tdyas merged 5 commits into pantsbuild:main on Jun 17, 2024

Conversation

@tdyas (Contributor) commented Jun 11, 2024

Add support for "workspace invalidation" sources for the `adhoc_tool` and `shell_command` target types. This support allows those targets to depend on the content of files in the repository without materializing those sources into the execution sandbox. It is intended to be used in conjunction with the workspace environment, where execution does not take place in a sandbox.

The new field `workspace_invalidation_sources` on both target types is a list of globs into the repository. The digest of the referenced files will be inserted as an environment variable in the executed process (which makes it part of the process's cache key).
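As a rough illustration, usage in a BUILD file might look like the following sketch (the target name, command, environment name, and globs are all hypothetical, not taken from this PR):

```python
# BUILD file: illustrative sketch only; names and paths are hypothetical.
shell_command(
    name="bazel-build",
    command="bazel build //subdir/...",
    # Assumes a workspace environment target named "workspace".
    environment="workspace",
    # Re-run the command when any matching file changes, without
    # materializing those files into the execution sandbox.
    workspace_invalidation_sources=["subdir/**/*"],
)
```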

@tdyas (Contributor, Author) commented Jun 11, 2024

This PR supersedes #20996.

@huonw (Contributor) commented Jun 12, 2024

Interesting approach 👍

I've not read the code in detail yet, but just asking some questions to help set my context when I find a moment:

  1. What happens if a shell_command using a workspace environment specifies some execution_dependencies? How/if are those incorporated into the cache key/invalidation? (This isn't directly related to this PR... but helps set the context for the next question)
  2. With the glob approach, it seems like this is particularly targeted at files that aren't tracked by Pants (i.e. don't have a target, even files(...)). Is that a correct understanding?

@benjyw (Contributor) commented Jun 12, 2024

IIUC this is entirely a performance thing? It would do no harm, from a correctness perspective, to materialize these files into the sandbox, it would just be wasted work?

@benjyw (Contributor) left a comment

So IIUC there is an inherent race condition here:

  1. We hash these sources to generate a cache key
  2. The user edits the sources.
  3. We run some adhoc tool on these sources.

We're now caching the result against the wrong key, no?

@@ -253,6 +253,22 @@ class AdhocToolOutputRootDirField(StringField):
    )


class AdhocToolHashOnlySourcesGlobsField(StringSequenceField):
    alias: ClassVar[str] = "hash_only_sources_globs"
Contributor:

Maybe this should be named just hash_only_sources? The regular sources field can take globs but we don't put "_globs" in its name.

Contributor:

I may also want to bikeshed the name further, but not yet...

Contributor (Author):

> Maybe this should be named just `hash_only_sources`? The regular `sources` field can take globs but we don't put "_globs" in its name.

Fine by me. I am not wedded to any particular name.

Contributor (Author):

What about unmaterialized_sources?

Contributor (Author):

Or indirect_sources?

Contributor:

I like unmaterialized_sources, although unmaterialized might be opaque jargon to many users (i.e. require them to read the docs to make any reasonable guess at what it means).

Just brainstorming some other ideas:

  • extra_invalidation_sources
  • non_sandbox_sources
  • workspace_invalidation_sources
  • workspace_only_sources

@tdyas (Contributor, Author), Jun 15, 2024:

I like your suggestions of workspace_invalidation_sources and workspace_only_sources because they incorporate "workspace", since this feature is intended to be paired with use of workspace_environment. Indeed, it suggests restricting this feature to be enabled only when the workspace environment is set on the adhoc_tool / shell_command target. If the user is using a regular environment, they should be relying on ordinary dependencies to express inputs.

I will modify the PR to go with workspace_invalidation_sources for now and error if the field is set for non-workspace environments.

Contributor (Author):

For now, I am just going to document that it should only be used with workspace_environment. No need to error or ignore just yet.

@kaos (Member) commented Jun 12, 2024

> So IIUC there is an inherent race condition here:
>
>   1. We hash these sources to generate a cache key
>   2. The user edits the sources.
>   3. We run some adhoc tool on these sources.
>
> We're now caching the result against the wrong key, no?

Not related to this change, but the above scenario is already true during pantsd bootstrapping. I've observed getting cached results where the result != what should've been given with the current source contents, caused by a "late" save during the initial moments of a pants run. Killing pantsd has been the way to get out of that situation (as editing the file back and forth results in getting the "wrong" result back).

@tdyas (Contributor, Author) commented Jun 12, 2024

> Not related to this change, but the above scenario is already true during pantsd bootstrapping. I've observed getting cached results where the result != what should've been given with the current source contents, caused by a "late" save during the initial moments of a pants run. Killing pantsd has been the way to get out of that situation (as editing the file back and forth results in getting the "wrong" result back).

This seems like a problem due to inotify triggering an invalidation at a certain point in time and then Pants actually reading the disk contents at a later time. Given this PR uses the same Digest capturing logic as used for sources on other targets, I would expect that problem to exist for regular sources too depending on the timing of the inotify watch and subsequent writes to the file.

This PR should not be any worse than the existing state in that regard.

@tdyas (Contributor, Author) commented Jun 12, 2024

> IIUC this is entirely a performance thing? It would do no harm, from a correctness perspective, to materialize these files into the sandbox, it would just be wasted work?

It would be wasted work. Such dependencies would be materialized into the temporary directory that is created even for workspace environment executions.

It is also a DX issue: The client of mine who is paying for this work expressed the preference to not have to duplicate any targets in Pants which are already defined in Bazel (which is their motivating use case). Their glob would likely just be ["subdir/**/*"].

@tdyas (Contributor, Author) commented Jun 12, 2024

> Interesting approach 👍
>
> I've not read the code in detail yet, but just asking some questions to help set my context when I find a moment:
>
> 1. What happens if a `shell_command` using a workspace environment specifies some `execution_dependencies`? How/if are those incorporated into the cache key/invalidation? (This isn't directly related to this PR... but helps set the context for the next question)
>
> 2. With the glob approach, it seems like this is particularly targeted at files that _aren't_ tracked by Pants (i.e. don't have a target, even `files(...)`). Is that a correct understanding?

1. Regular dependencies will be materialized in the execution sandbox for the shell command and are part of the "input digest" of the `Process` representing the execution. Any change to the input digest changes the cache key. For workspace environment executions, the input digest is still materialized in a temporary directory available during the workspace execution, so regular dependencies remain part of the input digest.

2. Correct. The motivating use case is executing Bazel in the workspace environment. The user should not need to repeat in Pants anything related to Bazel targets and configuration. The goal is to tell Pants: if any change happens to these paths, then re-invoke Bazel to actually figure out what to rebuild.
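To make those cache-key mechanics concrete, here is a rough sketch of how a digest of the globbed files can ride into the cache key as an environment variable (all names below are invented for illustration; this is not the PR's actual code):

```python
import hashlib

# Illustrative only: helper, variable, and env var names are invented;
# the real implementation lives in Pants' adhoc/shell backends.
def invalidation_hash(file_contents: dict[str, bytes]) -> str:
    """Hash the paths and contents of all files matched by the globs."""
    h = hashlib.sha256()
    for path in sorted(file_contents):
        h.update(path.encode("utf-8"))
        h.update(file_contents[path])
    return h.hexdigest()

files = {
    "subdir/BUILD.bazel": b"cc_binary(name = 'app', ...)",
    "subdir/app/main.c": b"int main(void) { return 0; }",
}

# The hash rides along as an env var on the executed process. Env vars are
# part of the process's cache key, so any content change forces a re-run.
env = {"__WORKSPACE_INVALIDATION_SOURCES_HASH": invalidation_hash(files)}
```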

@huonw (Contributor) left a comment

Looks good, other than some minor doc details and the naming.

Also: maybe we could expand the environments.mdx section about workspace_environment, so that it mentions/links to this as a breadcrumb for people to follow. Just noting that that section currently has the :::caution about caching that this would fit into well!

@@ -253,6 +253,22 @@ class AdhocToolOutputRootDirField(StringField):
    )


class AdhocToolHashOnlySourcesGlobsField(StringSequenceField):
    alias: ClassVar[str] = "hash_only_sources_globs"

    help = help_text(
        """
        Path globs for source files on which this target depends indirectly, but which should not be
        materlized into the execution sandbox. Pants will compute the hash of all of the files
Contributor:

Suggested change:
- materlized into the execution sandbox. Pants will compute the hash of all of the files
+ materialized into the execution sandbox. Pants will compute the hash of all of the files

        references by the globs and include that hash as part of the cache key for the
        process to be executed (as an environment variable).
Contributor:

I think it'd be good to be explicit about the consequence/purpose of this, e.g. something like:

Suggested change:
- process to be executed (as an environment variable).
+ process to be executed (as an environment variable), so that the command re-runs if these files change.

@tdyas changed the title from 'Add "hash-only" sources support for shell / adhoc backends' to 'Add "workspace invalidation" sources support for shell / adhoc backends' on Jun 15, 2024
@tdyas force-pushed the shell_hash_only_sources branch from b1d7740 to 74233d5 on June 15, 2024
@tdyas (Contributor, Author) commented Jun 15, 2024

> Also: maybe we could expand the environments.mdx section about workspace_environment, so that it mentions/links to this as a breadcrumb for people to follow. Just noting that that section currently has the :::caution about caching that this would fit into well!

Will do.

@tdyas force-pushed the shell_hash_only_sources branch from 185ab2a to 06d432f on June 15, 2024
@tdyas force-pushed the shell_hash_only_sources branch from 06d432f to 0f638c5 on June 15, 2024
@huonw (Contributor) left a comment

Looks good, thanks for iterating!

> It is also a DX issue: The client of mine who is paying for this work expressed the preference to not have to duplicate any targets in Pants which are already defined in Bazel (which is their motivating use case). Their glob would likely just be ["subdir/**/*"].

Just thinking on this a bit more: on the surface, it seems like it wouldn't be a major imposition to define a single catch-all files(name="bazel-files", sources=["subdir/**/*"]) target. But, even if that was acceptable, just doing this as a performance optimisation seems perfectly appropriate!

@tdyas merged commit fcac40f into pantsbuild:main on Jun 17, 2024 (25 checks passed)
@tdyas deleted the shell_hash_only_sources branch on June 17, 2024
tdyas added a commit to tdyas/pants that referenced this pull request on Jun 17, 2024: …ds (pantsbuild#21051)
@tdyas (Contributor, Author) commented Jun 17, 2024

Manually cherry picked to 2.22.x branch by #21075. This support is necessary for proper use of the workspace environment with shell_command and adhoc_tool.

@benjyw (Contributor) commented Jun 17, 2024

> Not related to this change, but the above scenario is already true during pantsd bootstrapping. I've observed getting cached results where the result != what should've been given with the current source contents, caused by a "late" save during the initial moments of a pants run. Killing pantsd has been the way to get out of that situation (as editing the file back and forth results in getting the "wrong" result back).

> This seems like a problem due to inotify triggering an invalidation at a certain point in time and then Pants actually reading the disk contents at a later time. Given this PR uses the same Digest capturing logic as used for sources on other targets, I would expect that problem to exist for regular sources too depending on the timing of the inotify watch and subsequent writes to the file.
>
> This PR should not be any worse than the existing state in that regard.

I don't think this is true. A Digest is fingerprinted after it's captured. We are guaranteed that the SHA in the Digest correctly hashes the contents we're materializing in the sandbox, working on, and caching against. It may not be what's in the workspace at a later time, but it is guaranteed that the inputs fingerprinted by the Digest are what we operated on to produce the cached result.

But AFAICT that is not true here, when we're not working in a sandbox.

This is why invalidating via an mtime would be fine, but invalidating via a content hash is not. I think this is still a substantial concern.

@benjyw (Contributor) commented Jun 17, 2024

> But AFAICT that is not true here, when we're not working in a sandbox.
>
> This is why invalidating via an mtime would be fine, but invalidating via a content hash is not. I think this is still a substantial concern.

Why did we move away from using mtimes for this? IIUC this is worse than the issue @kaos raises, since we'd not just be memoizing badly, but polluting a persistent cache.

@tdyas (Contributor, Author) commented Jun 18, 2024

> I don't think this is true. A Digest is fingerprinted after it's captured. We are guaranteed that the SHA in the Digest correctly hashes the contents we're materializing in the sandbox, working on, and caching against. It may not be what's in the workspace at a later time, but it is guaranteed that the inputs fingerprinted by the Digest are what we operated on to produce the cached result.
>
> But AFAICT that is not true here, when we're not working in a sandbox.
>
> This is why invalidating via an mtime would be fine, but invalidating via a content hash is not. I think this is still a substantial concern.

The existing code uses the same PathGlobs -> Digest intrinsic used to capture sources. See:

    snapshot = await Get(Snapshot, PathGlobs, path_globs)

(where HydratedSources is captured using PathGlobs -> Digest -> Snapshot). If capturing the "workspace invalidation" globs from the repository has a race condition, wouldn't regular source file capture from the repository also have a problem?

As for why I switched, I sensed lots of push back on #20996 which adds an intrinsic for obtaining mtime (and other metadata) on repository paths. Given I want to get this project done in a timely manner, I chose to switch to what seems to me to be a simpler and hopefully more acceptable solution since it would not need a new intrinsic rule.
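For readers unfamiliar with the rule API being referenced, here is a simplified sketch of that capture path (the wrapper function is hypothetical; the `Get(Snapshot, PathGlobs(...))` call is the intrinsic discussed above):

```python
from pants.engine.fs import PathGlobs, Snapshot
from pants.engine.rules import Get

# Hypothetical wrapper around the intrinsic referenced above; meant to be
# awaited from inside a @rule.
async def workspace_invalidation_fingerprint(globs: tuple[str, ...]) -> str:
    # Same PathGlobs -> Snapshot capture used to hydrate regular `sources`
    # fields; the snapshot's digest fingerprints the matched files' paths
    # and contents.
    snapshot = await Get(Snapshot, PathGlobs(globs))
    return snapshot.digest.fingerprint
```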

@tdyas (Contributor, Author) commented Jun 18, 2024

Regarding the concern that a workspace file could be overwritten after execution of a workspace process has already started (but cached under the digest of the prior copy of the file): I agree that mtime in the cache key would be less bad than that sort of cache poisoning. In which case, I would really like to discuss in real time with someone what the pushback on #20996 is.

The DX for mtime versus content would be the same workspace_invalidation_sources field already proposed. It is very easy to switch over once something like #20996 is available.

@benjyw (Contributor) commented Jun 18, 2024

To clarify - you are correct that the race condition still exists in terms of which "point in time" is captured (in fact we may be sweeping through multiple files just as they are being changed, so the captured state may never have existed at a specific point in time!)

BUT once captured, we only operate on that state in a sandbox, with no further edits. So we know with certainty that the cached output of a process is exactly the reproducible result of running that process on the inputs with the given Digest.

Real world scenario: The user is editing during snapshotting, specifically: changing state from A to B and then back to A. In the sandbox world, if we happened to snapshot at state B we will run the process at state B. Which may not be what the user intended. But if the user then re-runs at state A, we will re-run the process, because we see different inputs.

But in the workspace world, we might snapshot at state B, then run at state A. Therefore we will cache the result at A against the Digest at B. The user will see a wrong result, re-run, but we will retrieve the incorrect, cached result. The only cure will be to nuke the cache.

Unless I'm missing something?

This sort of thing is presumably why Make and Cargo and friends use mtimes and have no caching. Absent sandboxed execution of captured inputs, you can't know that the workspace state you're operating against is the one you digested.

cc @huonw and @kaos to check my logic and discuss the pushback on #20996 .

@huonw (Contributor) commented Jun 18, 2024

> But in the workspace world, we might snapshot at state B, then run at state A. Therefore we will cache the result at A against the Digest at B. The user will see a wrong result, re-run, but we will retrieve the incorrect, cached result. The only cure will be to nuke the cache.

This sounds plausible. Just spelling out a specific sequence:

  1. Files in state A
  2. Execute workspace pants command that runs some fast-running process
  3. Pants reads file system/hashes the sources in state A
  4. Edit files to state B
  5. Process starts, running on the files in state B
  6. Process ends and is cached (using key for state A)
  7. (OS tells pants file watcher about edit of files to state B)
  8. Edit files back to state A
  9. Re-run process, which is served from incorrect cache of execution on state B

(I think in particular step 6 and 7 need to happen in that order: if file watching is notified first, the process will be killed and not cached.)

AIUI, the thinking is that mtimes solve this because (assuming the user doesn't play games with touch) they'll only ever increase, so the A -> B -> A problem cannot happen? Even if the file contents are identical, the timestamps will not be.

> discuss the pushback on #20996

For me, for the bigger picture of #20996, I was more feeling like I didn't fully understand the background, rather than necessarily pushing back (I didn't know enough to push forward or back). Sorry if it came across as more negative than intended!

After all this discussion I'm definitely understanding it better!

(I was also diving into the code improvements, which I'm sure could come across as negative!)

@benjyw (Contributor) commented Jun 18, 2024

Yes, mtimes solve this because there is no caching at all, effectively, just invalidation. Technically stuff is written to the cache, but it can never be retrieved once a file changes.

The downside is that remote caching for CI becomes useless - the newly cloned repo in CI will have fresh mtimes.
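A self-contained illustration of the A -> B -> A hazard from this thread, in plain Python with no Pants APIs: content-derived keys collide across the round trip, while mtime-derived keys only move forward.

```python
import hashlib
import os
import tempfile
import time

def content_key(path: str) -> str:
    """Cache key derived from file contents (what a Digest fingerprints)."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def mtime_key(path: str) -> str:
    """Cache key derived from file metadata; monotonic under normal edits."""
    return str(os.stat(path).st_mtime_ns)

path = os.path.join(tempfile.mkdtemp(), "example.txt")

def write(text: str) -> None:
    with open(path, "w") as f:
        f.write(text)
    # Let the filesystem clock tick between writes; filesystems with coarse
    # timestamp resolution may need a longer pause.
    time.sleep(0.05)

write("state A")
key_a, stamp_a = content_key(path), mtime_key(path)

write("state B")  # user edits to B...
write("state A")  # ...and back to A

# Content keys collide across the round trip, so a result produced while the
# file was at B can later be served from cache under A's key.
assert content_key(path) == key_a
# mtime keys only move forward, so the stale entry can never be retrieved.
assert mtime_key(path) != stamp_a
```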

@tdyas (Contributor, Author) commented Jun 20, 2024

#21092 switches adhoc_tool / shell_command to using metadata-based invalidation for workspace_invalidation_sources.

tdyas added a commit that referenced this pull request on Jun 25, 2024: …kspace_invalidation_sources` (#21092)

Switch to using metadata-based invalidation instead of content-based
invalidation for any sources referenced by the
`workspace_invalidation_sources` field.

This is necessary because `workspace_invalidation_sources` is intended
to be used with the `experimental_workspace_environment` and there is a
problem with content-based invalidation in that scenario: There is a
potential cache poisoning scenario where Pants computes a content digest
but then the user overwrites the digested sources before Pants has
executed the applicable `adhoc_tool` / `shell_command` process. The
cache will now have a result stored under the digest of the original
file version even though the file content changed. See
#21051 (comment)
for expanded discussion.