Proposal: Scan deduction and summarization #377

pombredanne · 2016-11-25T08:52:26Z

Context

Scanning operates at the file level. This is good, but in many cases a scan reports too much data at too detailed a level. This happens when related clues are detected across files or inside the same file.

Problem

Multiple related clues in different files

For instance, if every file in a directory tree has the same license and copyright statements, then the license and origin information could be rolled up at the level of this directory and the file details could be omitted.
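As a rough illustration of this first case, here is a minimal sketch in Python. It assumes a hypothetical input shape (a mapping of file path to a `(license, copyright)` tuple) rather than the actual ScanCode data model, and the function name is made up for this example.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def roll_up_uniform_dirs(file_clues):
    """Return {directory: clues} for directories whose files all share the same clues.

    `file_clues` maps a file path to a (license, copyright) tuple.
    This input shape is an assumption made for this sketch.
    """
    seen_by_dir = defaultdict(set)
    for path, clues in file_clues.items():
        # Register the clues of this file with every ancestor directory.
        for parent in PurePosixPath(path).parents:
            seen_by_dir[str(parent)].add(clues)
    # A directory is uniform when exactly one distinct clue tuple was seen under it.
    return {d: next(iter(seen)) for d, seen in seen_by_dir.items() if len(seen) == 1}
```

A fuller version would probably report only the topmost uniform directory and keep the per-file details available rather than discarding them.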

Or say that a scanned directory only contains a COPYING file with a license and notice and none of the files in that directory have a license or copyright. Then the license and origin information could be extended from the COPYING to all the files in that tree.

Or say that a scanned directory only contains a README file with a license and notice and that all the files in that directory have a comment See README for licensing. Then the license and origin information could be extended from the README to all the files in that tree that carry this comment.

Or say that a package is detected (such as a Maven JAR, an npm package or similar), that the package-level metadata accurately describes the licensing of all the files in this package, and that the scan of the files in this package does not bring new details. Then only the license and origin information from the package could be kept and the file details omitted.

Or say that a directory contains code in a mix of programming languages: the primary or main language or language stats could be rolled up at the directory level.

Or say that a directory contains both code and build scripts and that the license for the build scripts is different from that of the code (say this is some autotools MIT or FSF notice). Then the licenses for the directory could be summarized based on a classification of the code files, and the build scripts and the build script licenses would not be reported as the directory or package license.

Multiple related clues in the same file

Some scans operate on the same data in a given file. This may trigger reporting extra or spurious clues that could instead be considered together.

For instance, a license text may contain a copyright statement for the text of the license, as well as URLs and emails. Detecting licenses, copyrights, emails and URLs could then report four different clues in the same scanned file and text region, when a single clue for the license should be reported instead.
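A minimal sketch of such a filter could look like the following. The `Clue` structure and the `is_full_text` flag are hypothetical names invented for this example, and the sketch assumes that each clue carries the start and end lines of its matched region.

```python
from typing import NamedTuple


class Clue(NamedTuple):
    kind: str                   # "license", "copyright", "email" or "url"
    value: str
    start_line: int
    end_line: int
    is_full_text: bool = False  # hypothetical flag: the clue matched a whole license text


def drop_clues_inside_license_text(clues):
    """Drop non-license clues located entirely inside a full license text."""
    texts = [c for c in clues if c.kind == "license" and c.is_full_text]

    def is_covered(clue):
        return any(
            t.start_line <= clue.start_line and clue.end_line <= t.end_line
            for t in texts
        )

    # License clues are always kept; other clues are kept only outside license texts.
    return [c for c in clues if c.kind == "license" or not is_covered(c)]
```

The flag matters: a copyright statement sitting inside a short license notice at the top of a source file is a real clue and is not dropped by this sketch.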

Or a package metadata file would typically contain origin and license information, and these would end up reported twice: both as package attributes and as individual detections for license, copyright and URLs.

Solution elements

A comprehensive solution may cover some or all of these:

determine where to summarize and roll up clues. For instance, rolling everything up at the root directory level would rarely make sense; instead, rolling things up at a package level and finding a good directory level to use as a break point would be important
implement some classification of files such as code proper, build scripts, test code, etc.
implement some statistics, rules and/or machine learning to summarize and deduce the proper higher-level origin and license.
scan all the clues together in order to combine (and filter) them properly
combine package detection with license and copyright detection

yahalom5776 · 2016-11-30T16:28:51Z

@pombredanne Another case I see quite often is a detection of a generic clue for e.g. LGPL (with no further version info) and then another clue in the same file with the specific license information, e.g. LGPL 2.1 or later. It would help to have some logic to roll these up to the "better" result which is LGPL 2.1 or later. Could be based on the "distance" clues are away from each other in the file and the knowledge that LGPL and LGPL 2.1 are related (this would have to be set in the license meta data/detection definitions). Another topic where such a roll up would be helpful are the typical GPL 2 or later with Autoconf exception headers.

I thought about this for some time and I am still a little bit worried about "auto-resolutions" if I do not know that this resolution even happened. So perhaps we could preserve the raw data of all clues found somehow to be able to retrace the finding?
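One way to picture both points, sketched here under assumptions: the `REFINED_BY` table, the detection tuples and the `max_gap` threshold are hypothetical and not part of ScanCode. A generic clue is dropped from the summary only when a related, more specific clue is close by, and the raw detections are always returned alongside.

```python
# Hypothetical relation table: generic key -> specific keys that refine it.
REFINED_BY = {"lgpl": {"lgpl-2.1", "lgpl-2.1-plus", "lgpl-3.0", "lgpl-3.0-plus"}}


def line_gap(start_a, end_a, start_b, end_b):
    """Distance in lines between two regions, 0 if they overlap."""
    return max(0, max(start_a, start_b) - min(end_a, end_b))


def roll_up_generic(detections, max_gap=10):
    """Drop a generic clue when a more specific, related clue is close by.

    `detections` is a list of (license_key, start_line, end_line) tuples.
    The raw detections are returned untouched next to the summary.
    """
    summary = []
    for key, start, end in detections:
        specifics = REFINED_BY.get(key, set())
        refined_nearby = any(
            other in specifics and line_gap(start, end, o_start, o_end) <= max_gap
            for other, o_start, o_end in detections
        )
        if not refined_nearby:
            summary.append((key, start, end))
    return {"summary": summary, "raw": list(detections)}
```

Keeping `raw` next to `summary` means every automatic resolution can be traced back to the clues it was derived from.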

Assuming licenses from clues on directory level to other files (perhaps with the condition they have no other clues themselves) is a possibility but I think it's a completely different ballgame from a complexity and (legal) risk level. Perhaps it makes more sense to start on file level for that matter. But that's just IMHO.

pombredanne · 2016-11-30T18:41:47Z

@yahalom5776 Thanks for the feedback. This makes 100% sense to me. I agree we should always keep the raw scans: this is more about adding smarts and summaries at the package and some directory levels, but not hiding the things below these

pombredanne · 2016-11-30T18:42:59Z

@yahalom5776 If you can provide some examples for "Another topic where such a roll up would be helpful are the typical GPL 2 or later with Autoconf exception headers", this would be great.

pkunz · 2016-11-30T21:16:34Z

On "For instance, if every file in a directory tree has the same license and copyright statements, then the license and origin information could be rolled up at the level of this directory and the file details could be omitted.":

Yes, if not done automatically, it has to be done manually during an audit.

One has to be careful with the COPYING file. It may be the text for gpl-2.0 or lgpl-2.1, but in the head of files one may find gpl-2.0-plus or lgpl-2.1-plus. Or the 'or later' might be found in a NOTICE or README file. Also there may be a few files with licenses other than the one stated in the COPYING file.

If autotools are used (quite common), then the same set of licenses shows up in a scan and could/should be ignored, because the autotools files are copied verbatim or generated from a template. Perhaps we could make a list of such files that can be ignored.

I don't always trust the license info in the metadata of an rpm because this is put in by hand by the author of the rpm spec file who is not necessarily the author of the package.

yahalom5776 · 2016-12-09T13:19:00Z

@pombredanne Sorry for the late reply but here is a similar example. It's from glibc 2.19:
glibc.zip-extract/glibc-2.19/wcsmbs/isoc99_vswscanf.c

License header:

/* Copyright (C) 1993-2014 Free Software Foundation, Inc.
   This file is part of the GNU C Library.

   The GNU C Library is free software; you can redistribute it and/or
   modify it under the terms of the GNU Lesser General Public
   License as published by the Free Software Foundation; either
   version 2.1 of the License, or (at your option) any later version.

   The GNU C Library is distributed in the hope that it will be useful,
   but WITHOUT ANY WARRANTY; without even the implied warranty of
   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
   Lesser General Public License for more details.

   You should have received a copy of the GNU Lesser General Public
   License along with the GNU C Library; if not, see
   <http://www.gnu.org/licenses/>.

   As a special exception, if you link the code in this file with
   files compiled with a GNU compiler to produce an executable,
   that does not cause the resulting executable to be covered by
   the GNU Lesser General Public License.  This exception does not
   however invalidate any other reasons why the executable file
   might be covered by the GNU Lesser General Public License.
   This exception applies to code released by its copyright holders
   in files containing the exception.  */

That's the ScanCode result according to the HTML output for that file:

glibc.zip-extract/glibc-2.19/wcsmbs/isoc99_vswscanf.c 	2 	25 	license 	lgpl-2.1-plus
glibc.zip-extract/glibc-2.19/wcsmbs/isoc99_vswscanf.c 	2 	25 	license 	lgpl-2.1-plus-linking

Correct roll-up would be

lgpl-2.1-plus-linking

in this case. Perhaps you can have a look. Thank you!

Edit: Another one from glibc 2.19, this time it is an autoconf clue:

License header of glibc.zip-extract/glibc-2.19/scripts/config.guess (Lines 1 - 32):

#! /bin/sh
# Attempt to guess a canonical system name.
#   Copyright 1992-2013 Free Software Foundation, Inc.

timestamp='2013-11-29'

# This file is free software; you can redistribute it and/or modify it
# under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful, but
# WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
# General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, see <http://www.gnu.org/licenses/>.
#
# As a special exception to the GNU General Public License, if you
# distribute this file as part of a program that contains a
# configuration script generated by Autoconf, you may include it under
# the same distribution terms that you use for the rest of that
# program.  This Exception is an additional permission under section 7
# of the GNU General Public License, version 3 ("GPLv3").
#
# Originally written by Per Bothner.
#
# You can get the latest version of this script from:
# http://git.savannah.gnu.org/gitweb/?p=config.git;a=blob_plain;f=config.guess;hb=HEAD
#
# Please send patches with a ChangeLog entry to [email protected].

ScanCode detection:

glibc.zip-extract/glibc-2.19/scripts/config.guess 	7 	18 	license 	gpl-3.0-plus
glibc.zip-extract/glibc-2.19/scripts/config.guess 	20 	25 	license 	gpl-3.0-autoconf
glibc.zip-extract/glibc-2.19/scripts/config.guess 	55 	56 	license 	unknown

The "unknown" detection is further down in the file and should be reviewed and handled independently IMO:

Originally written by Per Bothner.
Copyright 1992-2013 Free Software Foundation, Inc.

This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE."

pombredanne · 2016-12-09T14:46:26Z

@yahalom5776 Thanks!

For the glibc case, this is something that will be dealt with using license expressions in #74. In this case, it would be an expression like: lgpl-2.1-plus with lgpl-2.1-plus-linking. This is because the two licenses need to be reported and are detected together.
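To make the idea concrete, here is a small sketch that combines detections sharing the same line range into a single "with" expression. It is not the design from #74: the `EXCEPTION_KEYS` set and the tuple layout are assumptions for illustration only.

```python
from collections import defaultdict

# Hypothetical set of keys that act as exceptions to a base license.
EXCEPTION_KEYS = {"lgpl-2.1-plus-linking"}


def expressions_by_region(detections):
    """Group (license_key, start_line, end_line) detections by line range.

    A region holding exactly one base license and one exception is reported
    as a single "base with exception" expression. Any other combination is
    left as the plain list of detected keys.
    """
    regions = defaultdict(list)
    for key, start, end in detections:
        regions[(start, end)].append(key)

    result = {}
    for region, keys in regions.items():
        bases = [k for k in keys if k not in EXCEPTION_KEYS]
        exceptions = [k for k in keys if k in EXCEPTION_KEYS]
        if len(bases) == 1 and len(exceptions) == 1:
            result[region] = [f"{bases[0]} with {exceptions[0]}"]
        else:
            result[region] = list(keys)
    return result
```

With the two detections reported above for isoc99_vswscanf.c (both on lines 2 to 25), this yields one entry for the region instead of two separate licenses.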

pombredanne · 2016-12-09T14:53:59Z

@yahalom5776 For the config.guess case (and in general when several licenses are detected in a single file), we have various possibilities:

you have several repeated licenses with the same copyright (e.g. as is the case in most detections in a config.guess) and these are good candidates for a summarization
you have several licenses (such as in a top-level notice that would recap all the embedded third-party licenses), in which case it becomes hard to summarize anything

In the case of the unknown detection, we have this interesting text: "see the source for copying conditions", which could be something we could detect on its own (along with the many variations on the same theme of "see this other file for licensing..."), and we could be smart about that. Is there a "See LICENSE" and did we detect a license in a LICENSE file nearby? And if so, could we infer what this "unknown" license is instead?
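A very rough sketch of this kind of inference follows. The regular expression covers only the simplest "See FILE for ..." phrasing, and the function name and the `licenses_by_path` mapping are hypothetical.

```python
import re
from pathlib import PurePosixPath

# Rough pattern for phrases such as "See COPYING for details".
# Only names starting with an uppercase letter are treated as file references.
REFERENCE = re.compile(r"(?i:\bsee)\s+(?:the\s+)?(?:file\s+)?([A-Z][\w.\-]*)\s+for\b")


def resolve_reference(path, text, licenses_by_path):
    """Return the licenses detected in the file referenced from `text`, or None.

    `licenses_by_path` maps a scanned file path to its list of license keys.
    """
    match = REFERENCE.search(text)
    if not match:
        return None
    # Walk up from the directory of the file, looking for the referenced name.
    for directory in PurePosixPath(path).parents:
        candidate = str(directory / match.group(1))
        if candidate in licenses_by_path:
            return licenses_by_path[candidate]
    return None
```

Text such as "see the source for copying conditions" names no file, so this sketch leaves it unresolved and the "unknown" detection would still need a review.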

Finally, in the case of common build scripts such as config.guess and related autotools scripts, having them classified automatically as build scripts could offer a way to deduce further what the license is and what the relative importance of these licenses is. For example, the license of the build scripts is not as important as the license of the main code proper and usually has little or no impact on the resulting license: I can build an MIT-licensed package with autotools or a GPL-licensed build script and my package will still be MIT-licensed; neither the built binaries nor the source proper will inherit the build script licensing.
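As a sketch of what such a classification could enable, the following keeps build script licenses in a separate bucket from the licenses of the code proper. The pattern list is a hypothetical and incomplete sample of autotools file names, and the "code" and "build" bucket names are made up for this example.

```python
from collections import Counter
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Hypothetical, incomplete list of autotools file name patterns.
BUILD_SCRIPT_PATTERNS = (
    "config.guess", "config.sub", "configure", "install-sh",
    "ltmain.sh", "Makefile.in", "*.m4",
)


def is_build_script(path):
    """Classify a file as a build script based on its name alone."""
    name = PurePosixPath(path).name
    return any(fnmatchcase(name, pattern) for pattern in BUILD_SCRIPT_PATTERNS)


def summarize_licenses(file_licenses):
    """Count licenses of code files and of build scripts separately.

    `file_licenses` maps a file path to its list of detected license keys.
    """
    summary = {"code": Counter(), "build": Counter()}
    for path, licenses in file_licenses.items():
        bucket = "build" if is_build_script(path) else "code"
        summary[bucket].update(licenses)
    return summary
```

The "code" counter could then drive the directory or package level license, while the "build" counter stays available for review.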

- there is now a single summary option that summarizes whichever scan is available from the copyrights, licenses and programming language
- the summary is reported either as a new codebase-level attribute or at both the codebase level and the file/directory level when using --summary-with-details
- only JSON output supports summaries for now

Signed-off-by: Philippe Ombredanne <[email protected]>

When doing aggregations for key files or grouping by facet, we need to recompute value summaries for each summarized attribute to get correct summaries. Signed-off-by: Philippe Ombredanne <[email protected]>

When computing summaries for #377 empty values (e.g. summaries of None) and attributes without a summary should not be the cause of crashes. Same for empty directories. Signed-off-by: Philippe Ombredanne <[email protected]>

pombredanne added the new feature label Nov 25, 2016

pombredanne added a commit that referenced this issue Dec 13, 2016

#377 Experimental new license and rules for external file references

ae2e269

* detect license references such as "See COPYING for details" Signed-off-by: Philippe Ombredanne <[email protected]>

pombredanne mentioned this issue Jan 4, 2017

Proposal: high level file classification #426

Open

pombredanne mentioned this issue Jan 23, 2017

Option to only report first occurrence of a license of a kind #464

Closed

This was referenced Feb 6, 2017

Omit files without findings #476

Merged

WIP: Use a generator to reduce memory consumption when filtering results #491

Closed

pombredanne mentioned this issue Feb 17, 2017

License (with FSF copyright) is not properly detected in re2 project #496

Closed

pombredanne added a commit that referenced this issue Feb 17, 2017

#377 Experimental new license and rules for external file references

7018e94

* detect license references such as "See COPYING for details" Signed-off-by: Philippe Ombredanne <[email protected]>

pombredanne mentioned this issue Feb 19, 2017

The --license-text does not work as expected (for JSON output) #504

Closed

pombredanne mentioned this issue Feb 26, 2017

Add File-based user preferences and scan configuration #520

Closed

11 tasks

pombredanne mentioned this issue Mar 10, 2017

Plugins #552

Closed

4 tasks

pombredanne mentioned this issue Mar 24, 2017

Essential tickets to fix for v2.0 release #568

Closed

yashdsaraf mentioned this issue Jul 22, 2017

Post scan plugins for plugin architecture #699

Closed

pombredanne mentioned this issue Aug 22, 2017

false positive: ScanSoft Public License #734

Closed

pombredanne added this to the v3.0 milestone Oct 20, 2017

This was referenced Mar 28, 2018

Post-scan plugin for license expression(s) summarization #772

Closed

Proposal: Introduce configurable Rules for scan classification, summarization, refinements and inference #1012

Open

This was referenced Apr 17, 2018

Define patterns of file names and types for classification #1036

Open

Improve copyright summary #1043

Closed

pombredanne mentioned this issue Apr 27, 2018

RFC: introduce new "autoscan" mode #1049

Closed

pombredanne added a commit that referenced this issue May 7, 2018

Add count to copyright summary #1043 #377

87ca90e

* also rename CLI option * add tests Signed-off-by: Philippe Ombredanne <[email protected]>

pombredanne added a commit that referenced this issue May 7, 2018

Do not require the copyright option #1043 #377

42d83e1

* this way this can run from a virtual codebase too Signed-off-by: Philippe Ombredanne <[email protected]>

pombredanne added a commit that referenced this issue May 7, 2018

Add new license summary option #377

95832d4

This is very basic at the moment. Signed-off-by: Philippe Ombredanne <[email protected]>

pombredanne mentioned this issue Jun 13, 2018

Suggest making more use of "author" statements in scan results #1107

Closed

pombredanne added a commit that referenced this issue Jun 18, 2018

Rename internal summary to counts #377

3339b61

The counters are not a summary Signed-off-by: Philippe Ombredanne <[email protected]>

pombredanne mentioned this issue Jul 2, 2018

for license references can we somehow include 2-3 words after and before the detected keywords? #1122

Open

pombredanne added a commit that referenced this issue Jul 11, 2018

Remove unused summarizers #377

bb58d7d

Signed-off-by: Philippe Ombredanne <[email protected]>

pombredanne added a commit that referenced this issue Jul 16, 2018

Improve summary computation #377 #1043

f40700d

* Fix test failures (from unstable sort order) * Refactor common code where relevant * Other minor refinements Signed-off-by: Philippe Ombredanne <[email protected]>

pombredanne mentioned this issue Jul 16, 2018

Add summaries, facets and classification #1130

Merged

pombredanne added a commit that referenced this issue Jul 18, 2018

Add tests for facets CLI option #377

0b3f119

Signed-off-by: Philippe Ombredanne <[email protected]>

pombredanne added a commit that referenced this issue Jul 18, 2018

Add CLI option to detect generated code #377

12ba2c5

Signed-off-by: Philippe Ombredanne <[email protected]>

pombredanne added a commit that referenced this issue Jul 18, 2018

Clarify and simplify facet code #377

2e98b41

A path pattern must be matched or not. For instance matching a directory does not mean the children are matched. Signed-off-by: Philippe Ombredanne <[email protected]>

pombredanne added a commit that referenced this issue Jul 18, 2018

Add --summary-by-facet CLI option #377

92fef95

Signed-off-by: Philippe Ombredanne <[email protected]>

pombredanne mentioned this issue Jul 23, 2018

Add license rule importance flags #1140

Merged

pombredanne added a commit that referenced this issue Oct 30, 2018

Add new generated keywords and use lowercase #377

44bb9aa

Signed-off-by: Philippe Ombredanne <[email protected]>

pombredanne mentioned this issue Nov 5, 2018

Improve summary and score #1253

Merged

pombredanne modified the milestones: v3.0, v3.1 Nov 5, 2018

pombredanne added a commit that referenced this issue Nov 8, 2018

Add new generated keywords and use lowercase #377

7440831

Signed-off-by: Philippe Ombredanne <[email protected]>

pombredanne added a commit that referenced this issue Nov 8, 2018

Add new generated keywords and use lowercase #377

fc7599d

Signed-off-by: Philippe Ombredanne <[email protected]>

pombredanne mentioned this issue Nov 12, 2018

Override license detection by checksum #1281

Open

pombredanne modified the milestones: v3.1 Documentation, documentation, documentation, v3.2 Feb 16, 2019

pombredanne mentioned this issue May 8, 2019

Add to a detected package the list of files that it contains #1554

Closed

pombredanne mentioned this issue Jun 7, 2019

Group related files together (such as the files of a package, build scripts, etc) #1524

Closed

pombredanne added the summaries label Jun 11, 2019

pombredanne mentioned this issue Oct 1, 2020

Scan detects Apache-1.1 instead of/in addition to Apache-2.0 in notice files by Apache foundation. #2266

Open

pombredanne removed this from the v3.3 milestone Sep 24, 2021

pombredanne mentioned this issue Sep 24, 2021

Report Packages at the codebase-level #2098

Closed

9 tasks

pombredanne mentioned this issue Jan 13, 2024

Meta Issue: File classification and categorization #3639

Open
