Report Packages at the codebase-level #2098

Closed
6 of 9 tasks
JonoYang opened this issue Jul 1, 2020 · 7 comments

Comments

@JonoYang
Member

JonoYang commented Jul 1, 2020

We should make the package consolidation logic from the consolidation plugin a default function of the Package scanning option. The consolidation plugin would then be focused on files that are not part of a package so we can perform logical groupings on them.

Some changes that would have to be made to the Package model and the Package scanning process:

  • Remove root_path from Package. There is no universal root path for all Packages.
  • Tag a resource if it is part of one or more Packages during a Package scan. This would be similar to the consolidated_to field: it would be a list of purls. A possible name for this field is for_packages.
  • Return detected Packages in a new top level codebase attribute. This would be a set of all detected packages from a codebase.
  • Add primary license expression/copyright/holders to Package models, which is populated from top-level key package files (manifests, etc.)
  • Add secondary license expression/copyrights/holders to Package. This is populated from the detected license expressions/copyrights of Package resources, excluding top-level key files.
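
As a rough illustration of the for_packages tagging and the primary/secondary split proposed above, here is a minimal sketch. The `Resource` class, the `is_key_file` flag, and both function names are hypothetical and are not part of the scancode-toolkit API; only the `for_packages` field name comes from the proposal.

```python
from dataclasses import dataclass, field


@dataclass
class Resource:
    # Hypothetical, simplified stand-in for a scanned file.
    path: str
    license_expressions: list = field(default_factory=list)
    is_key_file: bool = False  # assumed flag: True for top-level manifests and the like
    for_packages: list = field(default_factory=list)  # list of purls, as proposed


def tag_resources(resources, purl, package_paths):
    """Append purl to for_packages of every resource whose path is in package_paths."""
    for resource in resources:
        if resource.path in package_paths and purl not in resource.for_packages:
            resource.for_packages.append(purl)


def split_license_expressions(resources, purl):
    """Return (primary, secondary) sorted license expressions for one package.

    Primary comes from key files only; secondary comes from every other
    resource tagged with this purl.
    """
    primary, secondary = set(), set()
    for resource in resources:
        if purl not in resource.for_packages:
            continue
        target = primary if resource.is_key_file else secondary
        target.update(resource.license_expressions)
    return sorted(primary), sorted(secondary)
```

A resource that belongs to several packages simply carries several purls in for_packages, so the same file can contribute to the secondary data of more than one package.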

Here is an updated design:

The key elements are to:

  • report packages as top-level. The data structure is the same as the one at the file level but will be the merged data from possibly several manifests and lock files.
  • track which files are part of a given package instance

Package model updates

Files model updates

This could look like this:

packages (aka. package instances)
  package
    package_manifest_paths
    ... data...

files:
   package_manifests: [...] (formerly named packages)
   for_packages: [ list of 
     { package_url: pkg:foo/bar@12, package_instance: UUID}
   ]
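
To make the outline above concrete, the following sketch builds that shape as plain Python data. The function name and the choice to store `package_instance` on the top-level package entry are assumptions made for illustration; the key names `packages`, `files`, `package_manifest_paths`, `package_manifests`, `for_packages`, `package_url` and `package_instance` are taken from the outline.

```python
import uuid


def build_scan_output(purl, manifest_paths, file_paths):
    """Build a results dict with one top-level package and its tagged files."""
    # One UUID per package instance, shared by the package and its files.
    instance_id = str(uuid.uuid4())
    package = {
        "package_url": purl,
        "package_instance": instance_id,  # assumption: also stored on the package
        "package_manifest_paths": sorted(manifest_paths),
    }
    files = []
    for path in sorted(file_paths):
        files.append({
            "path": path,
            # File-level manifest data (formerly named "packages").
            "package_manifests": [{"package_url": purl}] if path in manifest_paths else [],
            # Back-reference from the file to the package instance it belongs to.
            "for_packages": [{"package_url": purl, "package_instance": instance_id}],
        })
    return {"packages": [package], "files": files}
```

Each file points back to its package through the purl plus UUID pair, so two copies of the same package in one codebase stay distinguishable.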

For later:

  • sub-packages/embedded could be 1) listed directly as packages of their own 2) related to their parent (or the parent related to them)
  • we could also track the file_paths under each top level package instance
  • get all packages to have actual files
  • migrate system packages (Alpine, RPM, Debian, Windows) to new model
@JonoYang JonoYang self-assigned this Jul 1, 2020
JonoYang added a commit that referenced this issue Jul 2, 2020
JonoYang added a commit that referenced this issue Jul 2, 2020
JonoYang added a commit that referenced this issue Jul 2, 2020
JonoYang added a commit that referenced this issue Jul 6, 2020
JonoYang added a commit that referenced this issue Jul 6, 2020
@pombredanne
Member

Something to consider if we were to track packages at the top level:

  • should we track each instance of a package (and its files) separately?
  • what happens if different package instances have slightly different metadata?

@pombredanne
Member

@JonoYang I reckon this was never merged. And we will need this but IMHO we should go one more step:

  1. the things we detect today are incorrectly reported as packages when they really are package manifests
  2. some packages may have multiple manifests (e.g. manifest proper and a lock file). For instance .gemspec/Gemfile/Gemfile.lock or package.json/package-lock.json/yarn.lock or setup.py/setup.cfg/requirements.txt ... etc.

Packages can be nested too: a package can have sub-packages or have multiple personalities, such as a bower.json and a package.json. Other examples are node_modules nested under an npm package, or the wheels bundled in scancode-toolkit.

Therefore I think we should:

  • rename the current file-level packages to package_manifests. There the data structure is that of a Package, but with no constraints about which fields are mandatory.

  • report packages only as top-level. The data structure is the same but holds the merged data from possibly several manifests and lock files.

  • we want to track unique instances of each package (say you have two identical wheels in the same code tree) and we could simply use a UUID for each instance. Files of a package would point to the purl + UUID, i.e. an instance of a package.

  • sub-packages/embedded could be 1) listed directly as packages of their own 2) related to their parent (or the parent related to them)
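
The merge-then-identify idea in the list above could be sketched as follows. This is only an illustration under simple assumptions: manifests are plain dicts, the first non-empty value for a field wins, and the `merge_manifests` name is hypothetical. The thread does not specify how conflicting values should actually be resolved.

```python
import uuid


def merge_manifests(manifests):
    """Merge manifest dicts into one package instance dict.

    Assumed policy: manifests are visited in order and the first non-empty
    value seen for a field is kept. Every call yields a new instance UUID.
    """
    merged = {}
    for manifest in manifests:
        for key, value in manifest.items():
            if value in (None, "", [], {}):
                continue  # skip empty values so a later manifest can fill the field
            merged.setdefault(key, value)
    # A fresh UUID distinguishes otherwise identical package instances.
    merged["package_instance"] = str(uuid.uuid4())
    return merged
```

Calling it twice on the same manifests, as with two identical wheels in one code tree, gives two instances with equal data but different `package_instance` values.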

@pombredanne pombredanne changed the title Improve Package models wrt consolidation Report Packages at the codebase-level Sep 6, 2021
@pombredanne
Member

After a chat with @tdruez , we should list right away the file_paths under each top level package instance as it becomes much clearer and simpler to handle in scancode.io.

@pombredanne
Member

From #1554

Packages often contain other nested packages from other origins. For instance, an installed npm-based application will have a node_modules directory of the provisioned dependencies once an npm install command has been issued. The same would apply to PyPI packages in a virtualenv, RubyGems bundled in a vendor dir with bundler, nested Maven projects, etc.

Therefore it would be useful to attach the list of files that are part of a detected package in the packages list of returned results. In many cases this list of files is exactly the same as the descendants files of the package root directory ... but not always as explained above.
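
One way to picture the difference between "all descendants of the package root" and "files that are part of the package" is the sketch below. It assumes the roots of nested packages are already known, which is a simplification; the function names are hypothetical.

```python
from pathlib import PurePosixPath


def is_under(path, root):
    """True if path is root itself or a descendant of root."""
    return path == root or root in path.parents


def package_file_paths(all_paths, package_root, nested_roots):
    """Return paths under package_root that are not under any nested package root."""
    root = PurePosixPath(package_root)
    nested = [PurePosixPath(n) for n in nested_roots]
    selected = []
    for raw in all_paths:
        path = PurePosixPath(raw)
        if not is_under(path, root):
            continue  # outside this package entirely
        if any(is_under(path, n) for n in nested):
            continue  # belongs to a nested package, e.g. under node_modules
        selected.append(raw)
    return selected
```

With no nested roots the result is exactly the descendants of the package root, which matches the common case described above.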

These file lists would be used to properly summarize the license, origin, etc. of a package's files at the package level. This will also support #377.

@pombredanne pombredanne added this to the v31.0 milestone Sep 24, 2021
@AyanSinhaMahapatra AyanSinhaMahapatra self-assigned this Oct 13, 2021
@AyanSinhaMahapatra
Member

Also see improvements to this in #2843

pombredanne added a commit that referenced this issue Mar 5, 2022
Add Package Instances #2691

This PR adds the PackageInstance class and functions to group package
manifests and package data as top level package instances.

Existing package data are ported to this new approach.

Reference: #2098 
Reference: #2691 
Reference: #2692
Reference: #2693
Reference: #2843 
Reference: #2652 
Signed-off-by: Ayan Sinha Mahapatra <[email protected]>
Signed-off-by: Philippe Ombredanne <[email protected]>
@pombredanne
Member

This is mostly done but there are some smaller issues pending.

@pombredanne
Member

I consider this done now. @JonoYang @AyanSinhaMahapatra Thanks!
