Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support files that are lazy loaded at runtime. #33

Merged
merged 4 commits into from
Nov 21, 2022

Conversation

jsirois
Copy link
Collaborator

@jsirois jsirois commented Nov 16, 2022

Files can now either be embedded in a scie or else loaded at runtime via
a binding command that takes the filename to load as an argument and
produces the file content on stdout. Hashing and size checking are still
performed by the installer.

An example is added that assumes curl, jq and bash are present on
the host for doing loading from the internet.

Closes #19

Files can now either be embedded in a scie or else loaded at runtime via
a binding cmd that takes the filename to load as an agument and produces
the file content on stdout. Hashing and size checking are still
performed by the installer.

And example is added that assumes `curl`, `jq` and `bash` are present on
the host for doing loading from the internet.

Closes a-scie#19
@jsirois jsirois marked this pull request as ready for review November 16, 2022 22:44
}
file.size
}
FileEntry::LoadAndInstall((binding, file, dst)) => {
let buffer_source = || {
Copy link
Collaborator Author

@jsirois jsirois Nov 16, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the core feature: if a file in the lift manifest has a "source" field, it's expected to be the name of a binding and that binding command is expected to take the name of the file to load as its sole argument and then emit the contents of the file on stdout. The extra lift manifest metadata is the bridge from file names to urls (or whatever extra metadata the load binding needs to know to load the file).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@benjyw and @sureshjoshi - this is the key interface I'm hoping for feedback on.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@benjyw and @sureshjoshi I can proceed without you, but I was hoping for a sanity check on at least this key interface. The driver here is not bloating the scie-jump but still allowing opt-in flexibility in scie construction by those willing to take the hit of embedding a utility to materialize files with. I'll likely proceed getting this PR in by evening.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was indeed just looking at this, sorry for the delay! See comments on interface.

Copy link
Collaborator

@sureshjoshi sureshjoshi Nov 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I spent the last couple days of OSS time landing some other PRs in Pants and others. I can look at this, this evening briefly - but I think Benjy's covering a lot of my thoughts as well already.

For me, it's more about understanding the use cases - and if/where/how I would use a feature like this. Instinctively seems contrary to what I would use scie for, but thats an anecdote of 1 who doesn't fully grok what's put in front of him.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I do appreciate both of your time looking at this. I debate man in bus vs stealing attention. There is nothing more valuable than attention and stealing it should be considered seriously.

So the motivating use case mentioned in #19 was a Twitter use case where shipping around very very large binaries was a hindrance on the dev cycle. It was much faster to ship an app descriptor from a laptop and then have the elements assembled "on-site" in the fast network environment of the production datacenter. This serves that case but also generally allows you to avoid waste. For example, if Pants ships as a scie, every new version of Pants will include a copy of the same ~64MB embedded Python interpreter. That interpreter will only extract once from the 1st Pants scie using that interpreter - say Python 3.10. The rest of the Pants binaries will just lug it along for no real reason. Certainly a small sin by today's standards, but one nonetheless. You could imagine instead shipping a thin scie instead without the Python interpreter embedded but instead fetched if needed (not already in the local machine ~/.nce cache).

Now that's roughly the motivating set of use cases here. The mechanism is a bit more general though - I'm not sure what people might do with it in terms of loading files. You could imagine pulling them them from other sources like a secured vault on the local machine allowing you to avoid distributing sensitive content.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, thanks great context for this.

I can think up some use cases where I could use it, but practically speaking - I wouldn't just because of all of the edge cases that would inevitably kick me in the ass, because I'm the epitome of Murphy's Law.

Very useful from an app-update point of view though - especially in the embedded world, where disk space is genuinely at a premium.

// handling not installed, signals of various sorts); so, with the lock in hand, we clean up any
// stray work path before proceeding. Since we need to do this up front anyway, we do not attach
// cleanup to the work error path.
clean(&work_path)?;
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was a pre-existing bug that the new load example exposed.

if let Some(permissions) =
unpack(file.file_type, scie_tote_src, file.hash.as_str(), dst)?
{
// We let archives handle their internally stored permission bits.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was a pre-existing bug the load example exposed. In that case, the perms of the get.sh script were not preserved from the scie-tote. A similar problem exists for non-scie-tote scies with blobs that should be executable. That is tracked in #34.

@@ -0,0 +1,8 @@
#!/usr/bin/env bash
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

N.B.: The example uses this shell script to load files at runtime but the idea is to provide a set of ptex binaries for all supported platforms that can be included in a scie instead. That will free the scie from having dependencies on the target system providing bash, curl and jq.

Copy link
Collaborator

@benjyw benjyw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple of comments on the interface, not looked at the code yet.

examples/load/lift.macos-aarch64.json Outdated Show resolved Hide resolved
"get": {
"exe": "{get.sh}",
"args": [
"{scie.lift}"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the name arg be explicit here, to give more control over where it appears in the argslist, and also for less magic? E.g., via "{scie.file.name}" or something?

In fact, that could generalize to things like embedding the URL as a custom field inside the file's JSON object, rather than in that separate "get" field at the bottom, e.g.,

"args": [
  "{scie.file.url}"
]

Which removes the need for jq in the example.

Copy link
Collaborator Author

@jsirois jsirois Nov 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I follow the 1st paragraph. The {scie.lift} placeholder is a placeholder that is expanded to the absolute path of the lift manifest itself, after having been extracted from the scie file itself. In other words, the presence of a {scie.lift} placeholder triggers that extraction. It happens to be done in ~/.nce/<lift manifest hash>/lift.json. This is documented as a bit of an aside here: https://github.com/a-scie/jump/blob/main/docs/packaging.md#advanced-placeholders

My confusion over the 1st paragraph likely makes the 2nd paragraph track less well than it should so I'll stop there to hear what you're thinking in response.

A thing to keep in mind though is that the ultimate idea here is definitely not to use bash + jq + curl to fetch file contents - that depends way too much on the host OS to work at all (See the variance in the windows example for a highlight of this fragility). Instead the a-scie/ptex repo will provide a static binary for all platforms supported by a-scie/jump the does all this. No bash, jq or curl required.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, let me be clearer.

As I understand it, the value of the name field is implicitly tacked on to the end of these args. I was proposing that this be explicit, e.g., using "{scie.file.name}".

And then in paragraph 2 I expanded that to the realization we could have "{scie.file.XXX}" be a placeholder for any field in the file's object. And if we then allow custom fields in there, you could, for example, stick a url field in there, and reference it in args as "{scie.file.url}".

I got that {scie.lift} is a placeholder to the manifest itself, but right now we pass in the entire manifest, so you must use json parsing (whether jq, or in ptex implementation, or in whatever other mechanism) to pull out the single URL you care about, based on the name. Which is cumbersome, and also means that the binding binary must know about the shape of the manifest. What I'm suggesting is that we could instead do the more obvious thing, and just pass the URL (and maybe other standard or custom fields, as needed) right in via args.

Of course one could still use {scie.lift} in the above manner if one wished, if the binary did need the entire manifest for some reason.

Is that clearer? I'll post an example next.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

{
  "scie": {
    "lift": {
      "name": "cowsay",
      "base": "~/.nce",
      "files": [
        {
          "name": "get.sh"
        },
        {
          "name": "cowsay-1.1.0.jar",
          "key": "cowsay.jar",
          "size": 1724250,
          "hash": "212ee64546eb8b5074572fed4107d850eb90bc462aa3099c0ac8ea63fdca7811",
          "type": "blob",
          "url": "https://repo1.maven.org/maven2/com/github/ricksbrown/cowsay/1.1.0/cowsay-1.1.0.jar",
          "source": "get"
        },
        {
          "name": "openjdk-19.0.1_linux-x64_bin.tar.gz",
          "key": "jdk",
          "size": 195925792,
          "hash": "7a466882c7adfa369319fe4adeb197ee5d7f79e75d641e9ef94abee1fc22b1fa",
          "type": "tar.gz",
          "url": "https://download.java.net/java/GA/jdk19.0.1/afdd2e245b014143b62ccb916125e3ce/10/GPL/openjdk-19.0.1_linux-x64_bin.tar.gz",
          "source": "get"
        }
      ],
      "boot": {
        "commands": {
          "": {
            "exe": "{jdk}/jdk-19.0.1/bin/java",
            "args": [
              "-jar",
              "{cowsay.jar}"
            ],
            "env": {
              "=JAVA_HOME": "{jdk}/jdk-19.0.1",
              "=PATH": "{jdk}/jdk-19.0.1/bin:{scie.env.PATH}"
            }
          }
        },
        "bindings": {
          "get": {
            "exe": "{get.sh}",
            "args": [
              "{scie.file.url}"
            ]
          }
        }
      }
    }
  }
}

And get.sh is:

#!/usr/bin/env bash
set -euo pipefail

url="$1"
exec curl -fL "${url}"

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And of course there is nothing magical about the url key - it's a custom field that scie ignores, other than to look it up for placeholder replacement. Its name could be anything.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The intention is definitely general purpose. I thought the fact the existing {scie.lift} placeholder made this all work was a good proof of that. The speed bump of parsing your own custom metadata did not raise itself as a burden in my eyes, but you're pointing out it is. I do agree it would be nicer to streamline that away if possible. It does get tricky though when you get down to details. If a file accepts arbitrary extra keys, your example is only streamlined for the case when the key values are scalars. In this example, "url" -> str. So either the JSON handling of that would need to enforce scalars only (str, number or bool with a convention around bool - the JSON convention), or else if the value is an array or object, revert to passing that value as a serialized json blob I guess that the consumer must then parse.

I'll think on this some more.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realize I may not share intuitions about boilerplate. Perhaps to clear that up, but perhaps not, I was viewing lift manifests as something no human would write. That would be done by other tools. It's intended that https://github.com/a-scie/lift would be the actual CLI interface humans would use to assemble scies if not leaning on Pants or Pex to do the dirty work. So I have not been considering too much about boilerplate reduction, etc in the lift manifest as of yet. As such, JSON was an easy choice since serde handles it so well and its resulting parser / emitter is so small. Editing these by hand is not fun, I continuously mess up trailing commas.

So, from my perspective, your boilerplate concern is a good one, but only in regards to the burdens placed on the binding command implementations; i.e.: the fact they currently almost without fail will need to have code to parse json to get at extra metadata. Its that, in my mind, that would be good to make easier.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, that is good context. I was definitely imagining humans writing manifests.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With "humans will not commonly write these" in mind, I think it's OK as-is. I just want to make sure the "fetching a URL" case is relatively straightforward, and it sounds like ptex will handle that. So the need to use a custom binding is pretty rare, and at that point we can say "you have to parse JSON".

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, thanks for thinking through this with me. I did try a bit here this morning to implement the arbitrary field attachments to File and its not trivial at least. Although adding support for passing parameters in this way will be backward compatible, the actual call API of a source=<binding command> will not, since the file name argument that's currently passed will either need to disappear or remain redundant and ignored. I'm going to proceed with the existing structure of all custom metadata is accessed through top level keys and see how reality plays out. Hopefully I can feel confident about all this prior to 1.0.

let mut scie_tote = vec![];
let mut file_entries = vec![];
for (index, file) in self.lift.files.iter().enumerate() {
if self.replacements.contains(&file) {
let path = self.get_path(file);
if file.size == 0 {
scie_tote.push((file.clone(), path));
} else {
} else if Source::Scie == file.source {
Copy link
Collaborator

@sureshjoshi sureshjoshi Nov 19, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just confirming here that there is no longer a valid else if the file is non-zero size, non-source

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's right. The remaining case is Source::LoadBinding(binding_name) and all those were handled just above in the new load_entries population loop.

@sureshjoshi
Copy link
Collaborator

From a UX perspective, could the keys of the lift.json be validated at jump-time for the stuff that will be part of the lazy init?

Motivating example:

rm -r ~/.nce

../../scie-jump-macos-x86_64 lift.v1.macos-x86_64.json
./cowsay-v1 Woot! Problem free and awesome!

#  ___________________________________
# < Woot! Problem free and awesome! >
#  -----------------------------------
#         \   ^__^
#          \  (oo)\_______
#             (__)\       )\/\
#                 ||----w |
#                 ||     ||

# --> In the new lift, re-name one of the "cowsay-1.1.0.jar" to "cowsay-1.1.0.ja" to cause a future key problem

../../scie-jump-macos-x86_64 lift.v2.macos-x86_64.json
./cowsay-v2 This works, because the ~/.nce is already populated!

#  ________________________________________
# / This works, because the /Users/sj/.nce \
# \ is already populated!                  /
#  ----------------------------------------
#         \   ^__^
#          \  (oo)\_______
#             (__)\       )\/\
#                 ||----w |
#                 ||     ||

rm -r ~/.nce
./cowsay-v2 Womp Womp...

#  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
#                                 Dload  Upload   Total   Spent    Left  Speed
#  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0curl: (6) Could not resolve host: null
# Error: Failed to prepare a scie jump action: Failed to establish atomic directory /Users/sj/.nce/212ee64546eb8b5074572fed4107d850eb90bc462aa3099c0ac8ea63fdca7811/cowsay-1.1.0.jar. 
# Population of work directory failed: The blob destination /Users/sj/.nce/212ee64546eb8b5074572fed4107d850eb90bc462aa3099c0ac8ea63fdca7811/cowsay-1.1.0.jar of size 0 had unexpected 
# hash: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855

This makes sense, and the same problem will happen if you mis-type the URLs or something, but key errors seem like an easy win.

I still say this knowing full well from the convos above that the lift.json will/should be generated.

Aside: This class of "weird" (for a user) errors is why I try not to use lazy deployments. I find that I'm more often saying "works on my machine" when I do :) ... Admittedly, this is a HUGE pain when I'm shipping binaries across the world through multiple VPNs connected to hardware over WiMax or something equally crazy 🤦🏽

@jsirois
Copy link
Collaborator Author

jsirois commented Nov 19, 2022

From a UX perspective, could the keys of the lift.json be validated at jump-time for the stuff that will be part of the lazy init?

In the current scheme no, not by scie-jump infrastructure; only by the binding command doing the load (see how the 1st draft of ptex does this here: https://github.com/a-scie/ptex/blob/a5d0a3ebf37f51ff3d416ab3385a31859b4e2d26/src/main.rs#L67-L72).

Right now, the file name is passed to the binding command and the scie-jump really has no clue what the binding command does with it from there to get to the point of producing bytes on stdout. That said, this problem is really pervasive. If you mis-type the values of exe or args or env, that could be the death of a binding command or a normal command. It's true the keys are validated since they are a part of the lift manifest schema, but the same goes for the keys in the load binding schema - the load binding command can validate the keys it expects are present.

The only real way to defend against these typos in general is to run the command as you did as part of your testing. Perhaps that is the answer here too. To test a scie you actually need to test it; same goes for thin scies.

@sureshjoshi
Copy link
Collaborator

Okay, thanks - I didn't think so, but just verifying.

And yeah, none of this would negate the need for pretty extensive testing, it's also just one of those principle of least surprise things for those deploying scies.

However, if this is taken care of at a higher level (e.g. the lift.json generator) or by the loader, then that's perfectly legit too.

Copy link
Collaborator

@sureshjoshi sureshjoshi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think there is more I can add to this, outside of the discussions rendered.

Tested on my machine, and worked on some "breaking" cases as per the discussions.

@jsirois jsirois merged commit e3eb1ef into a-scie:main Nov 21, 2022
@jsirois jsirois deleted the issues/19 branch November 21, 2022 19:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support placeholders for files.
3 participants