Improve license detection of declared RPM licenses #2412

Open
pombredanne opened this issue Feb 23, 2021 · 10 comments

Comments

@pombredanne
Member

Description

We should create a map of license symbols for RPMs and feed it to expression detection first, before running any further detection.
Otherwise we get too many inconsistencies.
A recent set of CentOS RPM licenses, with detected and declared values, is attached for reference:
rpms-licenses.csv.txt

@pombredanne pombredanne changed the title Improve license detecion of declared RPM licenses Improve license detection of declared RPM licenses Feb 23, 2021
@akugarg
Collaborator

akugarg commented Apr 8, 2021

@pombredanne Can you please explain this a bit more?

@pombredanne
Member Author

An RPM can declare a license, which is one of the tags we collect.
It can also have special files tagged as license-related.
Our license detection on these is limited, and we should improve it so that:

  1. we have a license mapping based on an analysis of a large number of RPMs and spec files, so we can better detect these licenses and the compute_normalized_license method for RPMs is more efficient. This may be in https://github.com/nexB/scancode-toolkit/blob/75b95c51ba977e13cca0c716cbb0be2773f93d9a/src/packagedcode/rpm_licenses.txt
    (FWIW, this is the RPM tag we collect for this: https://github.com/rpm-software-management/rpm/blob/0b75075a8d006c8f792d33a57eae7da6b66a4591/lib/rpmtag.h#L91 )
  2. we have additional rules as needed
  3. we can also collect and detect the files tagged as license-related (files are tagged this way: https://github.com/rpm-software-management/rpm/blob/da55723907418bfb3939cd6ddd941b3fdb4f6887/lib/rpmfiles.h#L58 )

Check https://github.com/nexB/scancode-toolkit/blob/develop/src/packagedcode/rpm_installed.py and https://github.com/nexB/scancode-toolkit/blob/develop/src/packagedcode/rpm.py
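
As a rough illustration of item 3, here is a minimal sketch of selecting the license-tagged files from a package's file list. It assumes the per-file flags are available as integers and that RPMFILE_LICENSE is bit 7, as in the rpmfiles.h header linked above. The function name and the sample paths are hypothetical and are not part of scancode-toolkit.

```python
# Assumption: RPMFILE_DOC is bit 1 and RPMFILE_LICENSE is bit 7 of the
# per-file flags, as in the rpmfiles.h header linked above.
RPMFILE_DOC = 1 << 1
RPMFILE_LICENSE = 1 << 7


def license_tagged_files(paths_and_flags):
    """Return the paths whose RPM file flags include RPMFILE_LICENSE."""
    return [path for path, flags in paths_and_flags if flags & RPMFILE_LICENSE]


# Hypothetical (path, flags) pairs, as they might be read from an RPM header.
files = [
    ("/usr/bin/mmcli", 0),
    ("/usr/share/doc/ModemManager/README", RPMFILE_DOC),
    ("/usr/share/licenses/ModemManager/COPYING", RPMFILE_LICENSE),
]
print(license_tagged_files(files))
```

The selected files could then go through regular license detection, separately from the declared license tag.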

@pombredanne
Member Author

To better explain the context: a list of the licenses used can be found in each RPM's metadata. These are also in the repomd data, such as in this one from CentOS: https://archive.kernel.org/centos-vault/8.0.1905/BaseOS/x86_64/os/repodata/087f260d0243e74021b5bde5d0091b2ba15998ffdcf336471bbe3e97ffafbb7b-primary.xml.gz

<package type="rpm">
  <name>ModemManager-glib</name>
  <arch>i686</arch>
  <version epoch="0" ver="1.8.0" rel="1.el8"/>
  <checksum type="sha256" pkgid="YES">b27635edf4ece5cff60f231f8a578da14d300d98b9da4b5b52d43d4b4c43ba31</checksum>
  <summary>Libraries for adding ModemManager support to applications that use glib.</summary>
  <description>This package contains the libraries that make it easier to use some ModemManager
functionality from applications that use glib.</description>
  <packager>CentOS Buildsys &lt;[email protected]&gt;</packager>
  <url>http://www.freedesktop.org/wiki/Software/ModemManager/</url>
  <time file="1562077020" build="1557586982"/>
  <size package="258216" installed="1184560" archive="1185624"/>
  <location href="Packages/ModemManager-glib-1.8.0-1.el8.i686.rpm"/>
  <format>
    <rpm:license>GPLv2+</rpm:license>
    <rpm:vendor>CentOS</rpm:vendor>
...

Such primary repomd data could be used to assemble a list of license symbols (such as GPLv2+ here) and then map each of them to an actual scancode license. These mapped symbols would then be used to detect license expressions in the RPM license tag, or used as-is as a mapping for things that cannot be mapped.

wget https://archive.kernel.org/centos-vault/8.0.1905/BaseOS/x86_64/os/repodata/087f260d0243e74021b5bde5d0091b2ba15998ffdcf336471bbe3e97ffafbb7b-primary.xml.gz
gunzip 087f260d0243e74021b5bde5d0091b2ba15998ffdcf336471bbe3e97ffafbb7b-primary.xml.gz
# use a Perl regex to print only the text inside each rpm:license tag
grep -oP "(?<=license\>)(.*)(?=</rpm)"  087f260d0243e74021b5bde5d0091b2ba15998ffdcf336471bbe3e97ffafbb7b-primary.xml | sort -u

This yields these:

AFL and GPLv2+
ASL 2.0
ASL 2.0 or BSD
Bitstream Vera and Public Domain
BSD
BSD and GPLv2+
BSD and GPLv2 and GPLv2+
BSD and ISC
BSD and LGPLv2+
BSD and LGPLv2+ and GPLv2 and GPLv2+
BSD and LGPLv2+ and GPLv2+ and Public Domain
BSD and LGPLv2 and Sleepycat
BSD and MIT
BSD and Python and Unicode
BSD or GPL+
BSD or GPLv2
BSD or GPLv2+
BSD with advertising
BSD with advertising and MPLv1.1
CC0 and Redistributable, no modification permitted
CDDL
Copyright only
CPL
EPL
(FTL or GPLv2+) and BSD and MIT and Public Domain and zlib with acknowledgement
GPL+
GPL+ and GPLv2+ and BSD and MIT and Copyright only and IEEE
.....
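
For reference, the same extraction can be sketched in Python with the standard library only. This is a minimal sketch, not existing scancode code: the function name and the inline sample are made up for illustration, and it assumes the `rpm:` prefix is bound to the usual repomd RPM namespace.

```python
import io
import xml.etree.ElementTree as ET

# Assumption: namespace bound to the rpm: prefix in repomd primary.xml files.
RPM_NS = "http://linux.duke.edu/metadata/rpm"


def collect_declared_licenses(xml_file):
    """Return the sorted unique <rpm:license> values from a primary.xml stream."""
    licenses = set()
    # iterparse yields each element once it is complete, so large files
    # do not need to be loaded in memory as a whole tree.
    for _event, elem in ET.iterparse(xml_file):
        if elem.tag == f"{{{RPM_NS}}}license" and elem.text:
            licenses.add(elem.text.strip())
        elem.clear()  # release the element content once processed
    return sorted(licenses)


# Tiny made-up sample in the same shape as the primary.xml excerpt above.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<metadata xmlns="http://linux.duke.edu/metadata/common"
          xmlns:rpm="http://linux.duke.edu/metadata/rpm">
  <package type="rpm"><name>a</name>
    <format><rpm:license>GPLv2+</rpm:license></format></package>
  <package type="rpm"><name>b</name>
    <format><rpm:license>BSD and MIT</rpm:license></format></package>
  <package type="rpm"><name>c</name>
    <format><rpm:license>GPLv2+</rpm:license></format></package>
</metadata>
"""
print(collect_declared_licenses(io.BytesIO(SAMPLE)))
```

For a downloaded file, the stream could come from `gzip.open(path, "rb")` instead of `io.BytesIO`.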

Then this would need to be massaged to get a list of symbols, such as Public Domain or GPLv2+, to be used later in a custom license expression parser for RPMs, which is yet to be written.
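
A minimal sketch of that massaging step, assuming the declared strings only combine symbols with lowercase `and`/`or` and parentheses, as in the list above. The function name is hypothetical, and the splitting rule is a simplification rather than a real expression parser.

```python
import re
from collections import Counter

# Split on the lowercase "and"/"or" keywords and on parentheses.
# "with" is deliberately not a separator, so that symbols such as
# "BSD with advertising" stay in one piece.
_SPLITTER = re.compile(r"\s+(?:and|or)\s+|[()]")


def license_symbols(expressions):
    """Count candidate license symbols across declared RPM license strings."""
    counts = Counter()
    for expression in expressions:
        for part in _SPLITTER.split(expression):
            symbol = part.strip()
            if symbol:
                counts[symbol] += 1
    return counts


expressions = [
    "BSD and GPLv2+",
    "BSD with advertising",
    "(FTL or GPLv2+) and BSD and MIT and Public Domain and zlib with acknowledgement",
]
for symbol, count in sorted(license_symbols(expressions).items()):
    print(count, symbol)
```

The counts give a rough idea of which symbols are worth mapping first.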

This small list from CentOS is one of many and is only an illustration. It could be used more or less this way:

from license_expression import Licensing, LicenseSymbol
symbols = (
    LicenseSymbol(key='bsd-new', aliases=('BSD',)),
    LicenseSymbol(key='bsd-original', aliases=('BSD with advertising',)),
    LicenseSymbol(key='unknown-license-reference', aliases=('Copyright only',)),
    LicenseSymbol(key='gpl-1.0-plus', aliases=('GPL', 'GPL+',)),
    LicenseSymbol(key='gpl-2.0-plus', aliases=('GPLv2+',)),
    LicenseSymbol(key='mit', aliases=('MIT',)),
    LicenseSymbol(key='zlib-acknowledgement', aliases=('zlib with acknowledgement',)),
)

licensing = Licensing(symbols=symbols)

>>> e='GPL+ and GPLv2+ and BSD and MIT and Copyright only or zlib with acknowledgement'
>>> str(licensing.parse(e))
'(gpl-1.0-plus AND gpl-2.0-plus AND bsd-new AND mit AND unknown-license-reference) OR zlib-acknowledgement'


@pombredanne
Member Author

This is not simple, but a good first issue for a skilled aspiring contributor.

@pombredanne
Member Author

Merging this other issue to consolidate things in one place for RPMs:

See https://github.com/hughsie/PackageKit/blob/fcd5b8a693cdc3135b91f473cb3f860af9295cec/backends/yum/licenses.txt
See also http://fedoraproject.org/wiki/Licensing:Main

@pombredanne
Member Author

pombredanne commented Apr 21, 2022

These other issues are closely related:

@pombredanne
Member Author

@ivanayov ping, following up on this comment aboutcode-org/license-expression#70 (comment)

@pombredanne
Member Author

@sutula @qduanmu @richardfontana @jlovejoy ping FYI

@xsuchy
Contributor

xsuchy commented Jan 18, 2025

Fedora has moved to SPDX ids in license tags (https://fedoraproject.org/wiki/Changes/SPDX_Licenses_Phase_4), and RHEL 10 will have SPDX ids too.
So this issue has become obsolete, and I am not sure it is worth the time to do this for older Fedoras and RHELs.
