-
-
Notifications
You must be signed in to change notification settings - Fork 67
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(environment): support collecting license evidence from installed distributions #674
base: main
Are you sure you want to change the base?
Conversation
… distributions Signed-off-by: Nejc Habjan <[email protected]>
if not (dist_files := dist.files): | ||
return None | ||
|
||
if not (license_files := [f for f in dist_files if LICENSE_FILE_REGEX.match(f.name)]): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need to guess the name of the file.
You might be able to access all names via dist.metadata.get_all('License-File')
this is similar to how declared component licenses are collected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jkowalleck maybe the scope of this PR differs slightly from the issue description, but one of my main goals with this was to also collect licenses that the maintainers do not declare. For example, with dual licenses or in bundled and vendored code or where we can't trust maintainers to provide all the fields.
So when we build SBOMs internally at my employer I'd then get complaints that SBOMs are incomplete/inaccurate. This tries to address parts of that as well.
Take this example of the popular mkdocs-material theme (https://github.com/squidfunk/mkdocs-material):
import re
from importlib.metadata import distribution
dist = distribution("mkdocs-material")
LICENSE_FILE_REGEX = re.compile('LICEN[CS]E.*|COPYING.*')
dist.metadata.get_all('License-File')
> ['LICENSE']
[f for f in dist.files if LICENSE_FILE_REGEX.match(f.name)]
> [PackagePath('material/templates/.icons/fontawesome/LICENSE.txt'), PackagePath('material/templates/.icons/material/LICENSE'), PackagePath('material/templates/.icons/octicons/LICENSE'), PackagePath('material/templates/.icons/simple/LICENSE.md'), PackagePath('mkdocs_material-9.4.7.dist-info/licenses/LICENSE')]
Or just tried another one, https://pypi.org/project/packaging/ with a dual license:
dist = distribution("packaging")
dist.metadata.get_all('License-File')
>
[f for f in dist.files if LICENSE_FILE_REGEX.match(f.name)]
> [PackagePath('packaging-23.2.dist-info/LICENSE'), PackagePath('packaging-23.2.dist-info/LICENSE.APACHE'), PackagePath('packaging-23.2.dist-info/LICENSE.BSD')]
Maybe it's slightly different scopes but I think it's worth adding all those licenses, WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i see.
then maybe do your auto-detection, and add the declared license files, too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, thanks - so if I understand correctly we should try make a set out of the collected and maintainer-declared files right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you might consider preventing duplicates in certain means, yes.
To what extent? I'd leave the answer to you, as you are probably a user of this feature. I am certain you will make the right decision.
@@ -47,6 +50,8 @@ | |||
|
|||
T_AllComponents = Dict[str, Tuple['Component', Iterable[Requirement]]] | |||
|
|||
LICENSE_FILE_REGEX = re.compile('LICEN[CS]E.*|COPYING.*') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
files to consider:
LICENCE
/LICENSE
- obviouslyNOTICE
- some license texts, like Apache-2.0, explicitly state this exact file (name) as a optional source of custom license addendumCOPYING
- obviously
read also: https://wheel.readthedocs.io/en/stable/user_guide.html?#including-license-files-in-the-generated-wheel-file
They describe this already, and they also allow arbitrary file suffixes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah nice catch! I think I got the original idea for the pattern from another license finder.
I'm thinking with AUTHORS
and NOTICE
it's a bit less clear if some of the results should be considered license files in the sense that I'd expect in a CycloneDX SBOM, but perhaps it still makes sense. See e.g.
https://github.com/CycloneDX/cyclonedx-python/blob/main/NOTICE (maybe it counts.. not sure)
https://github.com/python-gitlab/python-gitlab/blob/main/AUTHORS
Suffixes should already be covered but I'll make sure to add some tests for this!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
regarding NOTICE
file:
such a file is a case of an addendum to standard license texts.
example:
the https://github.com/CycloneDX/cyclonedx-python/blob/main/NOTICE counts, since it is anchored in the apache-2.0 license - see https://github.com/CycloneDX/cyclonedx-python/blob/main/LICENSE#L106-L121
the AUTHORS
is unclear, this is true.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jkowalleck thanks! Ok makes sense. WDYT should we skip AUTHORS
or add it but make the pattern configurable? (this would make more sense if there was a tiny high-level public API for programmatic usage perhaps, I'll open a separate issue to discuss).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
skip authors for now. if it is needed, somebody can add the feature later.
PS: Oftern i've seen AUTHORS
being just a huge collection of email addresses and names with not much benefit. If it had any legal meaning, it would be where it belonged, instead of being an extra file. I am no lawyer and i did not read all license texts, maybe it is similar to the NOTICE
file ... IDK
added PEP 639 compliant license gathering via #755 |
Closes #570
Todo: still in draft as I need to add tests