
Stream data more granularly #4

Open
tonistiigi opened this issue Jul 28, 2022 · 4 comments

@tonistiigi
Member

Getting the full result takes lots of requests from different services.

How do we refactor the library so that basic information can be retrieved quickly, with the result updated as more info becomes available?

Some info is always needed. For example, signatures need to be validated before anything can be shown about the image. On the other hand, the full SBOM does not need to be loaded before more basic data can be shown.

Usually only one architecture is shown at a time, but we still need to validate the other architectures for some fields, so that the user doesn't make wrong assumptions when the other architectures differ significantly.

@jedevc
Contributor

jedevc commented Jan 19, 2023

I wonder how granular we want to go 🤔

For example, imagine multiple build-time SBOMs attached to an image -- if the user asks for all Go packages used in the build environments, should we download all of the SBOMs (which are likely to be large 🎉) before returning anything? Or should we start returning data as soon as it's available?


I think we probably want to make use of the flightcontrol package, so that downloads of SBOMs etc. can run in parallel and be deduplicated when multiple requests share the same data source. For the simplest API surface, we could return an interface from the loader, where all operations on it are blocking, thread-safe functions.

Then we could do something like:

img := loader.Load("moby/buildkit:latest")
eg.Go(func() error {
    name, err := img.Name(ctx)
    if err != nil {
        return err
    }
    // display name
    return nil
})
eg.Go(func() error {
    sbom, err := img.SBOM(ctx)
    if err != nil {
        return err
    }
    // display sbom
    return nil
})

Alternatively, we could go for something channel-based, so the client doesn't need a bunch of goroutines of its own:

img := loader.Load(ctx, "moby/buildkit:latest")
select {
case sbom := <-img.SBOM():
    // display sbom
case name := <-img.Name():
    // display name
case err := <-img.Error():
    // handle error
    return err
}

I think I'd prefer something channel-based like the second, since it requires less thread management from the client, and also lets us progressively load packages, so we could return results from multiple SBOMs as they become available:

case pkg := <-img.Packages():
    // display a package

Any thoughts?

@tonistiigi
Member Author

tonistiigi commented Jan 23, 2023

The initial idea for this library was that, given an image source over any transport (registry, content store, OCI layout), it parses the image and returns a structured representation of it. The current type is https://github.com/docker/go-imageinspect/blob/main/types.go#L27 . While doing that, the library may use other APIs, like the GitHub or Hub API, as additional data sources. The caller doesn't need to know how many APIs there are or whether certain values are loaded from a specific object in the registry. For example, the image description could be loaded from an annotation on the index, the manifest, the descriptor (2x), GitHub, or Hub.

If we just add a bunch of helper functions for loading specific items, then this goal is not really achieved. The caller would need to know what separate objects exist in the registry and the other APIs, query them individually, and then try to combine their results. They would need to set up a bunch of errgroup goroutines, know which functions should be called in parallel, parallelize loading of separate platforms, and then safely synchronize all these results together again.

An alternative is that we define a bunch of data states that can be requested separately. They can be thought of as selectors or capabilities. In principle, one could exist for every field the image structure defines, but we probably don't need that level of granularity.

selectors = ["name", "sbom", "dependencies", "env", "all"]

l, err := loader.New("imagename")

img, err := l.Wait(ctx, selectors...)
// this is a blocking function that returns image structure. The fields that are tracked by selectors are guaranteed to be filled. The result also contains a list of all the selectors that are present (if the result is passed to another function that may not know the initial request parameters).

// default "all" selector will block until everything is loaded (like the current library)

// multiple `Wait` can be called at the same time with any combination of selectors. The library will do the efficient synchronization internally.

// If some users prefer the syntax in the comment above, that can be achieved with a simple alias function
func (l *Loader) SBOM(ctx context.Context) ([]Package, error) {
  img, err := l.Wait(ctx, "sbom")
  if err != nil {
    return nil, err
  }
  return img.Packages, nil
}

Whether the user wants properties for a specific platform (the current behavior in most cases) or a map of all platforms can also be added as a parameter, either to loader.New or to Wait.

@jedevc
Contributor

jedevc commented Feb 10, 2023

So, as mentioned by @tonistiigi earlier, we could consider using GraphQL as the API interface for this.

A couple of benefits:

  • It looks like the suggestion directly above is very similar to what a GraphQL API would look like
  • It means that users can direct what gets loaded, what doesn't, and can perform optimization themselves without us trying to guess at what to do
  • This can also be an API to expose as a server, which could be useful

Some potential downsides:

  • Even library callers would need to embed lots of GraphQL in their code -- though we could pre-write a few queries, e.g. an SBOM query string that loads all the SBOM data. I think this is alright though.

@tonistiigi
Member Author

GraphQL doesn't really fix the streaming aspect. The client would still need to know to ask for some things earlier because it predicts that data could be loaded faster. It works for custom requests, but not for the case of "give me all the data, but start sending as soon as you have even some of it". As this is a library, there is also the question of how easy it would be for clients to call. If we do something like #4 (comment), it would probably be easy to wrap it with GraphQL for a service that wants to expose that API.
