Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add intrinsic rule to request metadata about paths in filesystem #20996

Merged
merged 20 commits into from
Jun 20, 2024

Conversation

tdyas
Copy link
Contributor

@tdyas tdyas commented Jun 4, 2024

Add a new path_metadata_request intrinsic rule to allow rule code to request metadata about paths in the filesystem.

The intended uses are to support:

  • Some form of metadata-based invalidation for adhoc_tool and shell_command targets, as explored in the proof of concept in PoC for metadata-based invalidation #20914. (This PR in fact is split out from that other PR.)
  • Future potential work to avoid indefinite negative caching of PATH-style lookups (which is a problem of the existing BinaryPath lookups) by switching to direct monitoring of system-level paths. This API would likely be extended to that use case. (But I am not yet committing to any particular solution. Still, some form of metadata API will be useful.)

Metadata is represented by the PathMetadata dataclass. PathMetadataRequest and PathMetadataResult are the input/output types, respectively, for the intrinsic.

Note: NodeKey::fs_path_to_watch is introduced to allow NodeKey::PathMetadata to watch the parent directory even though fs_subject remains its configured path. This is necessary because the watch code will error with "file not found" if the path to be watched does not exist. The solution is to watch the parent directory and wait for creation / removal events related to the fs_subject.

@tdyas tdyas requested review from kaos and benjyw June 4, 2024 04:17
@tdyas tdyas marked this pull request as ready for review June 4, 2024 04:17
@tdyas tdyas changed the title add intrinsic rule to request metdat about paths in filesystem add intrinsic rule to request metdata about paths in filesystem Jun 4, 2024
@tdyas tdyas requested a review from stuhood June 4, 2024 04:18
class PathMetadataRequest:
"""Request the full metadata of a single path in the filesystem.

Note: This API is symlink aware and will distinguish between symlinks and regular files.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we allow the rule code to configure symlink oblivious mode?

@tdyas tdyas requested a review from huonw June 4, 2024 04:24
@tdyas tdyas changed the title add intrinsic rule to request metdata about paths in filesystem add intrinsic rule to request metadata about paths in filesystem Jun 4, 2024
@tdyas tdyas force-pushed the path_metadata_request_intrinsic branch from 50d96c2 to e888535 Compare June 4, 2024 05:10
(e.g., the `is_executable` flag).

Some of the fields are optional because not all fields may be supported by the underlying system (e.g.,
the repository filesystem or an REAPI directopry tree).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
the repository filesystem or an REAPI directopry tree).
the repository filesystem or an REAPI directory tree).

@huonw
Copy link
Contributor

huonw commented Jun 5, 2024

We've just branched for 2.22, so merging this pull request now will come out in 2.23, please move the release notes updates to docs/notes/2.23.x.md. Thank you!

Copy link
Contributor

@huonw huonw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good in principle, but lots of questions 😄

Copy link
Member

@kaos kaos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm happy to see others reviewing this, so I'll bow out :)

@tdyas tdyas force-pushed the path_metadata_request_intrinsic branch from e888535 to 5377ace Compare June 5, 2024 13:33
@tdyas tdyas requested review from huonw and benjyw June 6, 2024 23:05
@tdyas tdyas force-pushed the path_metadata_request_intrinsic branch from eaa884d to dd24198 Compare June 7, 2024 03:36
Copy link
Contributor

@huonw huonw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for iterating.

I was thinking about the workspace_environment use-case for this: I only vaguely understand the moving parts so far... but a potential other way to handle the cache invalidation for that might be loading a digest/snapshot of the dependencies and/or globs? That is, even though they're not an input to any sandbox construction, we could still be comparing hashes?

I feel strongly like I'm missing something, though 😄

(Also, minor process thing, but the force-pushs made it hard to tell what's changed for a re-review. At least while the PR is under-discussion I personally find it helpful to just have commits appended (potentially including merges with main of course, annoyingly).)

@@ -22,6 +22,7 @@ lazy_static = { workspace = true }
log = { workspace = true }
parking_lot = { workspace = true }
protos = { path = "../protos" }
pyo3 = { workspace = true }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's a bit unfortunate to be pushing the pyo3 dep down into engine/fs here. It feels a bit like a layering violation, where this is theoretically Python-independent... except for the two types that currently make sense to be exposed directly and so now need annotation.

I think this in particular may make builds slower, by forcing "big" dependencies like pyo3 to be used earlier in the build process, and thus making the critical path longer (I haven't verified this for this particular case, though).

The alternatives I know of are a bit boilerplate-y although likely tolerable. Define pyo3 types in engine itself (i.e. engine/src/...), that either:

  • wraps the Rust one and exposing the properties as getters, e.g. similar to PyDigest
    #[pyclass(name = "Digest")]
    #[derive(Clone, Debug, PartialEq, Eq)]
    pub struct PyDigest(pub DirectoryDigest);
    impl fmt::Display for PyDigest {
    fn fmt(&self, f: &mut fmt::Formatter) -> std::fmt::Result {
    let digest = self.0.as_digest();
    write!(
    f,
    "Digest('{}', {})",
    digest.hash.to_hex(),
    digest.size_bytes,
    )
    }
    }
    #[pymethods]
    impl PyDigest {
    /// NB: This constructor is only safe for use in testing, or when there is some other guarantee
    /// that the Digest has been persisted.
    #[new]
    fn __new__(fingerprint: &str, serialized_bytes_length: usize) -> PyResult<Self> {
    let fingerprint = Fingerprint::from_hex_string(fingerprint)
    .map_err(|e| PyValueError::new_err(format!("Invalid digest hex: {e}")))?;
    Ok(Self(DirectoryDigest::from_persisted_digest(Digest::new(
    fingerprint,
    serialized_bytes_length,
    ))))
    }
    fn __hash__(&self) -> u64 {
    self.0.as_digest().hash.prefix_hash()
    }
    fn __repr__(&self) -> String {
    format!("{self}")
    }
    fn __richcmp__(&self, other: &PyDigest, op: CompareOp, py: Python) -> PyObject {
    match op {
    CompareOp::Eq => (self == other).into_py(py),
    CompareOp::Ne => (self != other).into_py(py),
    _ => py.NotImplemented(),
    }
    }
    #[getter]
    fn fingerprint(&self) -> String {
    self.0.as_digest().hash.to_hex()
    }
    #[getter]
    fn serialized_bytes_length(&self) -> usize {
    self.0.as_digest().size_bytes
    }
    }
  • duplicates the fields and just copies them across (this might be the only option for the enum version)

What do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this is also what we do for the native options parser. There is some repetition, but it keeps the Rust side free of pyo3. This is especially important if we end up using the fs code in other rust code.

Copy link
Contributor Author

@tdyas tdyas Jun 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems to me like this would not work for enums though because annotating an outer PyPathMetadataLind type with #[pyclass] would not generate the enum variants visible in Python which are generated currently by pyo3 when the pyclass annoation is directly on the PathMetadataKind.

pyo3 does not appear to have an equivalent to the serde crate's #[serde(flatten)] attribute which removes a layer of indirection with nested types.

I could define the enumeration in Python instead of it being a Rust type or expose some other type. Indeed, an earlier iteration of this PR did in fact expose the Rust enumeration to Python as a string, but the earlier review comments asked me to use pyo3's enum functionality ...

So I can restore my earlier approach but we should decide on some acceptable approach to avoid development going in circles.

Copy link
Contributor

@huonw huonw Jun 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the enum, I'd suggest going for the duplication approach (i.e. two Rust enums)

It's annoying, but if we use (exhaustive) match to convert between them python version and Rust version, it's not error prone: adding a variant in one place will trigger a compiler error at the appropriate conversion function, to remind us to update the other place.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I am fine with some duplication/boilerplate, since this will be changed rarely, if ever, and even then only by people who really know what they're doing.

@tdyas tdyas force-pushed the path_metadata_request_intrinsic branch from dd24198 to 3c42342 Compare June 11, 2024 13:58
@tdyas
Copy link
Contributor Author

tdyas commented Jun 11, 2024

Thanks for iterating.

I was thinking about the workspace_environment use-case for this: I only vaguely understand the moving parts so far... but a potential other way to handle the cache invalidation for that might be loading a digest/snapshot of the dependencies and/or globs? That is, even though they're not an input to any sandbox construction, we could still be comparing hashes?

I feel strongly like I'm missing something, though 😄

(Also, minor process thing, but the force-pushs made it hard to tell what's changed for a re-review. At least while the PR is under-discussion I personally find it helpful to just have commits appended (potentially including merges with main of course, annoyingly).)

There is another use case for this intrinsic which is to allow Pants to eventually monitor local filesystem paths for changes to tools on a PATH. (This other use case was flagged to me by one of my clients.) I'll need to write up that issue. (And I say "eventually" since I would also need to teach our inotify wrapper to handle absolute paths properly.)

That said, my thought was to build this support to enable both the invalidation issue with workspace environment and updating tool detection in the local environment when system paths change. Your point bring up that that may be premature and I should just solve the cache invalidation problem separately.

I wrote #21051 for that. So I will be pursuing that other PR instead of this one in the meantime.

@tdyas tdyas marked this pull request as draft June 11, 2024 18:19
@tdyas
Copy link
Contributor Author

tdyas commented Jun 11, 2024

#21051 is an alternate solution and is probably the preferred solution.

@tdyas
Copy link
Contributor Author

tdyas commented Jun 20, 2024

@huonw / @benjyw: What are your remaining open issues with this PR?

Copy link
Contributor

@huonw huonw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me, other than:

  • the discussion about pyo3
  • retry_assertions seeming like it doesn't work due to Python limitations

@tdyas
Copy link
Contributor Author

tdyas commented Jun 20, 2024

This looks good to me, other than:

* the discussion about `pyo3`

* `retry_assertions` seeming like it doesn't work due to Python limitations

The latest push adds the pyo3 wrapper types and keeps the fs crate free of pyo3.

@tdyas
Copy link
Contributor Author

tdyas commented Jun 20, 2024

I also renamed the time-related fields in PathMetadata to not have _time suffixes. I like the naming on the Rust std::fs::Metadata type and went with that naming. See https://doc.rust-lang.org/std/fs/struct.Metadata.html#method.accessed for the inspiration.

@tdyas tdyas marked this pull request as ready for review June 20, 2024 03:51
Copy link
Contributor

@huonw huonw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice one! I like where this ended up a lot.

Sorry for the process frustration with this work.

Copy link
Contributor

@benjyw benjyw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for all the teeth-pulling this endured.

@tdyas
Copy link
Contributor Author

tdyas commented Jun 20, 2024

I removed the read_only property for now to simplify the API. The permission-related attributes for now will be is_executable (from the REAPI) and unix_mode which is technically a superset of read_only. (I didn't like the fact that I hadn't named it with a is_ prefix to match is_executable. Easier to just remove and add later if need be.)

@tdyas tdyas merged commit 925cfb6 into pantsbuild:main Jun 20, 2024
25 checks passed
@tdyas tdyas deleted the path_metadata_request_intrinsic branch June 20, 2024 15:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants