Syft command restructure #516

Closed
spiffcs opened this issue Sep 27, 2021 · 8 comments · Fixed by #2446
Labels: changelog-ignore, enhancement

spiffcs commented Sep 27, 2021

What would you like to be added:

Currently syft's root and packages commands produce the same package specific output:
syft packages node:latest > /dev/null
syft node:latest > /dev/null


syft also has a power-user command, which produces more verbose output covering packages, secrets, file metadata, and file digests:
syft power-user node:latest > /dev/null


I believe the syft [noun] pattern is not the space where we want to be focusing development moving forward.

Given that syft is concerned with sbom generation, I propose we look to move our command structure towards syft [verb], starting with syft create or syft describe in order to generate an sbom.

This command API change aims to improve a few key points.

  1. Our current path of coupling presenter/noun logic and structures will lead to a good bit of unmanageable code sprawl. We can already see here that the poweruser config struct has a host of VERY specific and useful information that only it has access to. Rather than reimplement/confuse these structures for each noun we should look to shift them lower in the program so presenters interpret all possible SBOM entities the same way.

    type JSONDocumentConfig struct {
        ApplicationConfig   config.Application
        PackageCatalog      *pkg.Catalog
        FileMetadata        map[source.Location]source.FileMetadata
        FileDigests         map[source.Location][]file.Digest
        FileClassifications map[source.Location][]file.Classification
        FileContents        map[source.Location]string
        Secrets             map[source.Location][]file.SearchResult
        Distro              *distro.Distro
        SourceMetadata      source.Metadata
    }

  2. As we move toward describing more SBOM entities, code paths like presenter/secrets, presenter/files, and presenter/relationships do not scale to the user behavior or usefulness we want for the tool. syft create should already analyze the different entities with sane defaults, plus a config input that can increase or decrease the set of entities analyzed by the command.

  3. Refocus presenter to be presenter/spdx, presenter/json, presenter/cyclonedx. Formats as described previously should not be children of nouns like packages. The current program architecture does just this: syft/internal/presenter/packages/spdx.go,json_pacakge.go,table,etc. Refactoring to syft [verb] -o [presenter] gives us much more room to design common logic/structures when dealing with different SBOM entities.
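
To make point 3 concrete, here is a minimal sketch of presenters organized by output format rather than by noun. Everything in it is hypothetical: the simplified SBOM struct, the Presenter interface, the presenterFor helper, and the sample package names are illustrations, not syft's actual types or data. The idea it shows is that every format receives the same document and decides for itself how to render the entities it contains.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"os"
	"strings"
)

// SBOM is a hypothetical, simplified stand-in for a single document that
// holds every cataloged entity, independent of the output format.
type SBOM struct {
	Source   string            `json:"source"`
	Packages []string          `json:"packages"`
	Secrets  map[string]string `json:"secrets,omitempty"`
}

// Presenter renders a whole SBOM in one output format.
type Presenter interface {
	Present(w io.Writer, doc SBOM) error
}

type jsonPresenter struct{}

func (jsonPresenter) Present(w io.Writer, doc SBOM) error {
	return json.NewEncoder(w).Encode(doc)
}

type tablePresenter struct{}

// Present prints one package per row; a table has no place for secrets,
// so this format simply ignores them.
func (tablePresenter) Present(w io.Writer, doc SBOM) error {
	if _, err := fmt.Fprintln(w, "NAME"); err != nil {
		return err
	}
	for _, p := range doc.Packages {
		if _, err := fmt.Fprintln(w, p); err != nil {
			return err
		}
	}
	return nil
}

// presenterFor maps an "-o" value to a presenter (format names are illustrative).
func presenterFor(format string) (Presenter, error) {
	switch strings.ToLower(format) {
	case "json":
		return jsonPresenter{}, nil
	case "table":
		return tablePresenter{}, nil
	default:
		return nil, fmt.Errorf("unknown output format %q", format)
	}
}

func main() {
	// Sample data only; these package names are made up for the demo.
	doc := SBOM{Source: "node:latest", Packages: []string{"express", "lodash"}}
	p, err := presenterFor("table")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	if err := p.Present(os.Stdout, doc); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```

With this shape, adding a new entity type means extending the shared document once, rather than adding a new noun-specific presenter per format.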

Why is this needed:
Remove formal SBOM entity semantics from the command API to refocus the program on its main directive of generating SBOMs.

Additional context:
TODO

@spiffcs spiffcs added the enhancement New feature or request label Sep 27, 2021
wagoodman commented Sep 28, 2021

To add to this: the change implies that we need better ways to express cataloging configurability. Today we catalog "packages" under the packages command; a "create" command would let us catalog more than packages (e.g. file metadata, file digests, etc). We could leave all catalogers on by default, but it would be better to allow enabling catalogers selectively.

Today we change the set of catalogers based on the source type (image vs filesystem). When the input is an image we use catalogers that look specifically for installed packages (e.g. RPM dbs, python wheel and egg metadata files, etc). When the input is a filesystem we use catalogers that look for any ecosystem index files, even if they are not indicative of an installation (e.g. python requirements.txt files).

Here's a possible path forward for this:

syft create --catalog NAME[,NAME,...]

Where NAME would be cataloger names or a name of a group of catalogers. Let's assume that package cataloger names take the form:

<language-or-ecosystem>[-<package-manager-name>]-<type>

In which case our current set of catalogers would be named:

  • ruby-gem-installation (specification/gem.spec)
  • ruby-gem-manifest (gemfile.lock)
  • python-installation (wheel/egg)
  • python-manifest (requirements.txt, poetry.lock)
  • javascript-npm-installation (package.json)
  • javascript-npm-manifest (package-lock.json, yarn.lock [not nested in node_modules dir])
  • os-dpkg-installation
  • os-rpmdb-installation
  • os-apkdb-installation
  • java-installation (jar,war,ear,jpi,hpi)
  • go-modules-manifest (go.mod)
  • rust-cargo-manifest (cargo.lock)

And you can specify partial names to select on:

  • ruby: matches the pattern ruby-* which selects: ruby-gem-installation, ruby-gem-manifest
  • ruby-manifest: matches the pattern ruby[-*]-manifest which selects: ruby-gem-manifest
  • python: matches the pattern python-* which selects: python-installation, python-manifest
  • os: matches the pattern os-* which selects: os-dpkg-installation, os-rpmdb-installation, os-apkdb-installation
  • installations: matches the pattern *-installation ...
  • manifests: matches the pattern *-manifest ...
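
As a rough illustration of how the partial-name selection above might be evaluated, here is a small sketch. The function names (matches, selectCatalogers) are hypothetical, and the matching rules are one reading of the patterns listed in this comment, not an existing syft feature.

```go
package main

import (
	"fmt"
	"strings"
)

// catalogerNames mirrors the cataloger names proposed in this comment.
var catalogerNames = []string{
	"ruby-gem-installation", "ruby-gem-manifest",
	"python-installation", "python-manifest",
	"javascript-npm-installation", "javascript-npm-manifest",
	"os-dpkg-installation", "os-rpmdb-installation", "os-apkdb-installation",
	"java-installation", "go-modules-manifest", "rust-cargo-manifest",
}

// matches reports whether a cataloger name is selected by a (possibly
// partial) selector such as "ruby", "ruby-manifest", or "installations".
func matches(name, selector string) bool {
	// Plural group names select on the type suffix.
	switch selector {
	case "installations":
		return strings.HasSuffix(name, "-installation")
	case "manifests":
		return strings.HasSuffix(name, "-manifest")
	}
	// A full name, or an ecosystem prefix such as "ruby" or "os".
	if name == selector || strings.HasPrefix(name, selector+"-") {
		return true
	}
	// "<ecosystem>-<type>" with the package manager left out, e.g. "ruby-manifest".
	if i := strings.LastIndex(selector, "-"); i > 0 {
		prefix, kind := selector[:i], selector[i+1:]
		if kind == "installation" || kind == "manifest" {
			return strings.HasPrefix(name, prefix+"-") && strings.HasSuffix(name, "-"+kind)
		}
	}
	return false
}

// selectCatalogers returns the names matched by selector, in input order.
func selectCatalogers(selector string, names []string) []string {
	var selected []string
	for _, n := range names {
		if matches(n, selector) {
			selected = append(selected, n)
		}
	}
	return selected
}

func main() {
	for _, s := range []string{"ruby", "ruby-manifest", "os", "manifests"} {
		fmt.Printf("%-14s -> %v\n", s, selectCatalogers(s, catalogerNames))
	}
}
```

Treating "installations" and "manifests" as fixed group names sidesteps general plural handling, at the cost of hard-coding the two types.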

Some example usage:

# look only for python installations
syft --catalog python-installation ...

# look for python installations and manifests
syft --catalog python

# look for all installations (default for image source types)
syft --catalog installations

# look for all manifests (default for filesystem source types)
syft --catalog manifests

We could add semantics that allow for keeping the default catalogers and adding or removing a few:

# use the default catalogers + look for python and ruby gem manifests
syft --catalog +python-manifest,ruby-gem-manifest <my image>

# use the default catalogers with the exception of any python catalogers
syft --catalog -python
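
One way the add/remove semantics could be parsed is sketched below. This is a sketch under stated assumptions, not a proposal for the final flag grammar: the Selection type and both functions are hypothetical, an unsigned entry is assumed to inherit the sign of the entry before it (so `+a,b` adds both), and names are compared exactly, leaving partial-name expansion out of scope.

```go
package main

import (
	"fmt"
	"strings"
)

// Selection is the parsed form of a --catalog value.
type Selection struct {
	Explicit []string // unsigned names: replace the defaults
	Add      []string // "+name": keep the defaults and add these
	Remove   []string // "-name": drop these from the result
}

// parseCatalogFlag splits values such as "a,b", "+a,b", or "-a".
// Assumption: an unsigned entry inherits the sign of the entry before it.
func parseCatalogFlag(value string) Selection {
	var sel Selection
	sign := byte(0) // 0 means "no sign seen yet"
	for _, entry := range strings.Split(value, ",") {
		entry = strings.TrimSpace(entry)
		if entry == "" {
			continue
		}
		if entry[0] == '+' || entry[0] == '-' {
			sign, entry = entry[0], entry[1:]
		}
		if entry == "" {
			continue // a bare "+" or "-" carries no name
		}
		switch sign {
		case '+':
			sel.Add = append(sel.Add, entry)
		case '-':
			sel.Remove = append(sel.Remove, entry)
		default:
			sel.Explicit = append(sel.Explicit, entry)
		}
	}
	return sel
}

// resolve applies a Selection to the default cataloger list.
// Names are matched exactly; duplicates are dropped.
func resolve(defaults []string, sel Selection) []string {
	base := defaults
	if len(sel.Explicit) > 0 {
		base = sel.Explicit
	}
	removed := make(map[string]bool, len(sel.Remove))
	for _, n := range sel.Remove {
		removed[n] = true
	}
	seen := make(map[string]bool)
	var out []string
	for _, n := range append(append([]string{}, base...), sel.Add...) {
		if removed[n] || seen[n] {
			continue
		}
		seen[n] = true
		out = append(out, n)
	}
	return out
}

func main() {
	defaults := []string{"python-installation", "ruby-gem-installation"}
	fmt.Println(resolve(defaults, parseCatalogFlag("+python-manifest,ruby-gem-manifest")))
	fmt.Println(resolve(defaults, parseCatalogFlag("-python-installation")))
}
```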

One problem I see with this approach is that it's really easy to want to specify plurals for items here, but that would be more difficult to parse.

Another thing: if we go with this idea (or some variant of it), that implies we should remove the *_ENABLED environment variables / configuration options for all catalogers and instead add a single configurable option for the list of enabled catalogers.

Open to thoughts/comments/suggestions on this!

luhring commented Oct 1, 2021

(Relates to #465)

luhring commented Oct 1, 2021

I like this direction a lot!

Re: moving from syft [noun] to syft [verb]

I love it. Very excited about this!

Re: adding to cataloger configuration

We've known for a while that we'd need to allow for more granular configuration of cataloging, and it's exciting to see us begin to tackle this challenge 🎉 🎉 🎉

Here's my two cents...

  1. I think we should be mindful of the kinds of (A) behavioral tweaks that users have in mind (including the terminology in their heads) vs. (B) the architectural constructs within Syft that can be adjusted, and how we're mapping between (A) and (B). Just as an example, perhaps there's actually an advantage to user comprehensibility if we use ecosystem-specific terms instead of generalizing all ecosystems with nouns like "manifests" vs. "installation"; e.g., package-lock-json instead of javascript-npm-manifest (I'm not sure NPM teams would immediately know what the latter means).
  2. Similar to the path we took with Grype's ignore rules, it could be worthwhile not to figure out CLI exposure of config just yet, and instead focus on the raw configuration itself, at least in the beginning...
  3. One tough question here is "how much configurability is the right amount"? Just from my personal observation from customer requests and community Slack messages, most of what users want to do that they can't do today comes down to: determining what's installed using a directory scan instead of an image scan — this is huge for users that scan within running containers, or on VMs/servers, or even local machines. I'm not arguing that no one wants to do any more than this, but I think as we iterate, it will be important to avoid "over-solving", especially when it comes to the "API" we expose.

wagoodman commented Oct 1, 2021

Responding about the cataloger configuration (tackling the points in an odd order here):

  1. I do have hesitations about the specific names I suggested above (with "manifest" and "installation"). It was the closest distinction I could get to categorizing the behavior of "dir" scanning and "image" scanning today that stuck with semantics. Let me recreate the list relative to your suggestion (e.g. package-lock-json names):
  • ruby-gem-spec (specification/gem.spec)
  • ruby-gem-file-lock (gemfile.lock)
  • python-package (wheel/egg)
  • python-requirements (requirements.txt) # note: this is split from today's cataloger
  • python-poetry-lock (poetry.lock) # note: this is split from today's cataloger
  • javascript-package-json (package.json)
  • javascript-package-lock (package-lock.json) # note: this is split from today's cataloger
  • javascript-yarn-lock (yarn.lock [not nested in node_modules dir]) # note: this is split from today's cataloger
  • os-dpkg
  • os-rpmdb
  • os-apkdb
  • java-archive (jar,war,ear) # note: this is split from today's cataloger
  • jenkins-plugin (jpi,hpi) # note: this is split from today's cataloger
  • go-mod (go.mod)
  • rust-cargo-lock (cargo.lock)

With the above list I still left organization prefixes so you could still specify python instead of python-package,python-requirements,python-poetry-lock or os instead of having to know which distro+cataloger pairing you need.

  3. I'm with you on the "just right amount" of configuration. This attempts to do two things at the same time:
    • expose the existing _ENABLED cataloger options from the config today on the CLI, migrating from individual "options" to "a list of names" instead
    • allow for finer control over which catalogers should be invoked (an ask in #465, "Add ability to enable/disable package catalogers") while providing an alternative to the "image" vs "directory" automatic cataloger selection, which some users have found non-obvious (but still important to keep).
      I think the automatic decision of which catalogers to use depending on the input source type (image vs dir) has not exposed enough configurability. If we were to implement point 1 (a list of cataloger names) as configuration is organized today, then "packages" represents all of the package catalogers that should be run given the input (image vs dir).
  2. Agreed! I think the configuration we have today is mostly alright; the main alteration is swapping "switches" for "lists". For instance, take today's (amended) config:
...
package:
  cataloger:
    # enable/disable cataloging of packages
    # SYFT_PACKAGE_CATALOGER_ENABLED env var
    enabled: true

    # the search space to look for packages (options: all-layers, squashed)
    # same as -s ; SYFT_PACKAGE_CATALOGER_SCOPE env var
    scope: "squashed"

# cataloging file classifications is exposed through the power-user subcommand
file-classification:
  cataloger:
    # enable/disable cataloging of file classifications
    # SYFT_FILE_CLASSIFICATION_CATALOGER_ENABLED env var
    enabled: true

    # the search space to look for file classifications (options: all-layers, squashed)
    # SYFT_FILE_CLASSIFICATION_CATALOGER_SCOPE env var
    scope: "squashed"

# cataloging file contents is exposed through the power-user subcommand
file-contents:
  cataloger:
    # enable/disable cataloging of file contents
    # SYFT_FILE_CONTENTS_CATALOGER_ENABLED env var
    enabled: true

    # the search space to look for file contents (options: all-layers, squashed)
    # SYFT_FILE_CONTENTS_CATALOGER_SCOPE env var
    scope: "squashed"

  # skip searching a file entirely if it is above the given size (default = 1MB; unit = bytes)
  # SYFT_FILE_CONTENTS_SKIP_FILES_ABOVE_SIZE env var
  skip-files-above-size: 1048576

  # file globs for the cataloger to match on
  # SYFT_FILE_CONTENTS_GLOBS env var
  globs: []
...

Here's the same config selection, but with the suggested alterations:

...
##### this is the new section 
catalogers:
  enabled:
    - package
    - file-classification
    - file-contents

##### below is mostly the same, but a little smaller and flatter...
package:
  # the search space to look for packages (options: all-layers, squashed)
  # same as -s ; SYFT_PACKAGE_SCOPE env var
  scope: "squashed"

# cataloging file classifications is exposed through the power-user subcommand
file-classification:
  # the search space to look for file classifications (options: all-layers, squashed)
  # SYFT_FILE_CLASSIFICATION_SCOPE env var
  scope: "squashed"

# cataloging file contents is exposed through the power-user subcommand
file-contents:
  # the search space to look for file contents (options: all-layers, squashed)
  # SYFT_FILE_CONTENTS_SCOPE env var
  scope: "squashed"

  # skip searching a file entirely if it is above the given size (default = 1MB; unit = bytes)
  # SYFT_FILE_CONTENTS_SKIP_FILES_ABOVE_SIZE env var
  skip-files-above-size: 1048576

  # file globs for the cataloger to match on
  # SYFT_FILE_CONTENTS_GLOBS env var
  globs: []
...

Note: I'm only illustrating the config changes idea, so I'm leaving today's cataloger names in this example (e.g. package, instead of all of the catalogers).
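
To illustrate the "switches to lists" change, here is a small sketch that converts per-cataloger enabled flags into a single list and then queries it. The legacyConfig type and both helper functions are hypothetical and exist only for this example; they are not part of syft's configuration code.

```go
package main

import (
	"fmt"
	"sort"
)

// legacyConfig models today's per-cataloger "enabled" switches, keyed by
// config section name (for example "package" or "file-contents").
type legacyConfig map[string]bool

// enabledList converts the switches into the proposed list of enabled catalogers.
func enabledList(legacy legacyConfig) []string {
	names := make([]string, 0, len(legacy))
	for name, on := range legacy {
		if on {
			names = append(names, name)
		}
	}
	sort.Strings(names) // map iteration order is random; sort for stable output
	return names
}

// isEnabled is the lookup a cataloger runner would perform against the list.
func isEnabled(enabled []string, name string) bool {
	for _, n := range enabled {
		if n == name {
			return true
		}
	}
	return false
}

func main() {
	legacy := legacyConfig{
		"package":             true,
		"file-classification": true,
		"file-contents":       false,
	}
	enabled := enabledList(legacy)
	fmt.Println(enabled)                             // [file-classification package]
	fmt.Println(isEnabled(enabled, "file-contents")) // false
}
```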

A list of enabled catalogers seems to be the simplest approach to start with, but I also think this would be a small but meaningful lift to add:

cataloger:
  # could take cataloger names or group names
  enabled:
    - my-awesome-group-name

  # allow for user-defined groups.
  groups:
    my-awesome-group-name:
      - file-metadata
      - file-classification
      - package
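
Resolving user-defined groups could look roughly like the sketch below. The CatalogerConfig struct and its Expand method are hypothetical names that mirror the YAML keys above, and the sketch assumes groups are not nested.

```go
package main

import "fmt"

// CatalogerConfig mirrors the proposed "cataloger" config section.
type CatalogerConfig struct {
	Enabled []string            // cataloger names or group names
	Groups  map[string][]string // user-defined groups
}

// Expand replaces group names in Enabled with their members, preserving
// order and dropping duplicates. Assumption: groups are not nested.
func (c CatalogerConfig) Expand() []string {
	seen := make(map[string]bool)
	var out []string
	add := func(name string) {
		if !seen[name] {
			seen[name] = true
			out = append(out, name)
		}
	}
	for _, entry := range c.Enabled {
		if members, ok := c.Groups[entry]; ok {
			for _, m := range members {
				add(m)
			}
			continue
		}
		add(entry) // not a group: treat as a plain cataloger name
	}
	return out
}

func main() {
	cfg := CatalogerConfig{
		Enabled: []string{"my-awesome-group-name", "package", "secrets"},
		Groups: map[string][]string{
			"my-awesome-group-name": {"file-metadata", "file-classification", "package"},
		},
	}
	fmt.Println(cfg.Expand()) // [file-metadata file-classification package secrets]
}
```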

Adding back in fine-grained control over the individual package catalogers... We could encode our "image" and "directory" distinctions as cataloger groups:

cataloger:
  # empty list (the default) = auto... syft selects for me
  enabled: []
  groups:
    # a default group...
    image:
      - ruby-gem-spec
      - python-package
      - ...
    # a default group...
    directory:
      - ruby-gem-file-lock
      - python-requirements
      - python-poetry-lock
      - ...

Note: this is mixing the feedback from point 1 (package cataloger names) , but is attempting to illustrate the usefulness of cataloger groups as a config option in general.
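
The "empty list means auto" behavior could be sketched as follows. SourceType, defaultGroups, and catalogersFor are hypothetical names, and the group members are only the ones spelled out in the example above; the elided "..." entries are deliberately left out.

```go
package main

import "fmt"

// SourceType is a hypothetical label for the kind of input being scanned.
type SourceType string

const (
	ImageSource     SourceType = "image"
	DirectorySource SourceType = "directory"
)

// defaultGroups holds only the members named in the example config;
// the entries elided with "..." are not reproduced here.
var defaultGroups = map[SourceType][]string{
	ImageSource:     {"ruby-gem-spec", "python-package"},
	DirectorySource: {"ruby-gem-file-lock", "python-requirements", "python-poetry-lock"},
}

// catalogersFor returns the user's explicit list when one is set, and
// otherwise falls back to the default group for the source type.
func catalogersFor(enabled []string, src SourceType) []string {
	if len(enabled) > 0 {
		return enabled
	}
	return defaultGroups[src]
}

func main() {
	fmt.Println(catalogersFor(nil, ImageSource))
	fmt.Println(catalogersFor(nil, DirectorySource))
	fmt.Println(catalogersFor([]string{"os-dpkg"}, ImageSource))
}
```

Keeping the image/directory defaults as ordinary groups means the automatic selection stays inspectable and overridable from the same config section.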

@wagoodman

@luhring, I think you're right: if we expose per-cataloger enable/disable functionality, it's probably a good idea to keep that in configuration for the meantime and work out the CLI expression later.

@wagoodman wagoodman added the breaking-change Change is not backwards compatible label Oct 19, 2021
wagoodman commented Oct 19, 2021

One thing that wasn't mentioned explicitly in this thread is which commands would be added to the syft CLI and which would be removed.

syft ...     # default command, runs "create"
syft create  # creates an SBOM

Commands that will be removed:

  • power-user: the create command should be able to replicate all of the same functionality via configuration (not necessarily via CLI)
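
The "default command runs create" behavior can be sketched with plain argument handling. This is only an illustration: knownCommands and resolveCommand are hypothetical names, and a real CLI would more likely wire this up through its command framework.

```go
package main

import (
	"fmt"
	"os"
)

// knownCommands is a hypothetical set of subcommands for this sketch.
var knownCommands = map[string]bool{"create": true}

// resolveCommand returns the subcommand to run and its remaining arguments.
// Anything that is not a known subcommand is treated as input to "create",
// so "syft node:latest" behaves like "syft create node:latest".
func resolveCommand(args []string) (string, []string) {
	if len(args) > 0 && knownCommands[args[0]] {
		return args[0], args[1:]
	}
	return "create", args
}

func main() {
	cmd, rest := resolveCommand(os.Args[1:])
	fmt.Printf("command=%s args=%v\n", cmd, rest)
}
```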

wagoodman commented Dec 17, 2021

I'm going to split out the configuration suggestions into their own issue. That means this issue is only about deprecating the packages and power-user commands.

Suggested work:

  • Remove power-user command
  • We probably should NOT remove the packages command
  • We should add a deprecation notice to the packages command (suggest using the root command)
  • Remove packages command from documentation
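
A deprecation notice for the packages command might be as simple as the sketch below. The deprecated map, the deprecationNotice function, and the wording of the message are all hypothetical; the sketch only shows the command continuing to work while pointing users at the root command.

```go
package main

import (
	"fmt"
	"os"
)

// deprecated maps a deprecated subcommand to a hint about its replacement.
// The hint text is illustrative wording, not syft's actual message.
var deprecated = map[string]string{
	"packages": "use the root command instead (syft <source>)",
}

// deprecationNotice returns a warning for deprecated commands and reports
// whether the command is deprecated at all.
func deprecationNotice(cmd string) (string, bool) {
	hint, ok := deprecated[cmd]
	if !ok {
		return "", false
	}
	return fmt.Sprintf("warning: the %q command is deprecated; %s", cmd, hint), true
}

func main() {
	if len(os.Args) < 2 {
		return
	}
	// Warn on stderr so that SBOM output on stdout stays clean.
	if msg, ok := deprecationNotice(os.Args[1]); ok {
		fmt.Fprintln(os.Stderr, msg)
	}
}
```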

This implies that syft [noun] should not be used going forward, in favor of syft [verb] ... (this says nothing about nested subcommands, such as syft create sbom for example).

televi commented Aug 2, 2022

I've been following this issue since I use the power-user command. Do you have a proposed set of config settings that mimic the output currently generated by syft power-user? What I was looking for was something along the lines of:
"To mimic the power-user command, the following config settings may/should be used:".

I looked in https://github.com/anchore/syft#configuration based on the output of "syft power-user --help". A search there for "power-user" did give me some possible pointers, but it wasn't clear to me that setting those config items to true would be all that was needed. The settings I saw that seemed to relate to the output produced by the power-user command were:

  • package.cataloger.enabled
  • file-classification.cataloger.enabled
  • file-contents.cataloger.enabled
  • file-metadata.cataloger.enabled
  • secrets.cataloger.enabled

I really just want to be ready for when power-user is deprecated, so no rush here (unless v1.0 is coming out tomorrow 😄 ).

@spiffcs spiffcs removed their assignment May 23, 2023
@wagoodman wagoodman self-assigned this Dec 19, 2023
@wagoodman wagoodman added changelog-ignore Don't include this issue in the release changelog and removed breaking-change Change is not backwards compatible labels Dec 19, 2023