Proposal: Scan deduction and summarization #377
Comments
@pombredanne Another case I see quite often is the detection of a generic clue, e.g. LGPL with no further version info, and then another clue in the same file with the specific license information, e.g. LGPL 2.1 or later. It would help to have some logic to roll these up to the "better" result, which is LGPL 2.1 or later. This could be based on the "distance" between the clues in the file and the knowledge that LGPL and LGPL 2.1 are related (this would have to be set in the license metadata/detection definitions). Another area where such a roll-up would be helpful is the typical GPL 2 or later with Autoconf exception headers.
I thought about this for some time and I am still a little bit worried about "auto-resolutions" if I do not know that the resolution even happened. So perhaps we could preserve the raw data of all clues found, to be able to retrace the finding?
Assuming licenses from clues at the directory level for other files (perhaps on the condition that they have no other clues themselves) is a possibility, but I think it's a completely different ballgame in terms of complexity and (legal) risk. Perhaps it makes more sense to start at the file level. But that's just IMHO.
@yahalom5776 Thanks for the feedback. This makes 100% sense to me. I agree we should always keep the raw scans: this is more about adding smarts and summaries at the package and some directory levels, but not hiding the things below these.
@yahalom5776 If you can provide some examples for |
On
One has to be careful with the COPYING file. It may be the text for gpl-2.0 or lgpl-2.1, but in the headers of files one may find gpl-2.0-plus or lgpl-2.1-plus. Or the "or later" might be found in a NOTICE or README file. There may also be a few files with licenses other than the one stated in the COPYING file. If autotools are used (quite common), then the same set of licenses shows up in a scan and could/should be ignored, because the autotools files are copied verbatim or generated from a template. Perhaps we could make a list of such files that can be ignored. I don't always trust the license info in the metadata of an RPM, because it is entered by hand by the author of the RPM spec file, who is not necessarily the author of the package.
@pombredanne Sorry for the late reply but here is a similar example. It's from glibc 2.19: License header:
That's the ScanCode result according to the HTML output for that file:
Correct roll-up would be
in this case. Perhaps you can have a look. Thank you! Edit: Another one from glibc 2.19, this time it is an autoconf clue: License header of
ScanCode detection:
The "unknown" detection is further down in the file and should be reviewed and handled independently IMO:
@yahalom5776 Thanks! For the GLibc case, this is something that will be dealt with using license expressions with #74, e.g. in this case, it would be an expression like:
@yahalom5776 For config.guess case, (and in general when several licenses are detected in a single file) we have various possibilities:
In the case of the unknown detection, we have this interesting text: Finally, in the case of common build scripts such as config.guess and related autotools scripts, having them classified automatically as build scripts could offer a way to further deduce what the license is and what the relative importance of these licenses is. For example, the license of the build scripts is not as important as the license of the main code proper and usually has little or no impact on the resulting license: I can build an MIT-licensed package with autotools or a GPL-licensed build script and my package will still be MIT-licensed; neither the built binaries nor the source proper will inherit the build script licensing.
* detect license references such as "See COPYING for details" Signed-off-by: Philippe Ombredanne <[email protected]>
* also rename CLI option * add tests Signed-off-by: Philippe Ombredanne <[email protected]>
* this way this can run from a virtual codebase too Signed-off-by: Philippe Ombredanne <[email protected]>
This is very basic at the moment. Signed-off-by: Philippe Ombredanne <[email protected]>
The counters are not a summary Signed-off-by: Philippe Ombredanne <[email protected]>
- there is now a single summary option that summarizes whichever scan is available from the copyrights, licenses, programming language - the summary is reported either as a new codebase-level attribute or as both codebase-level and file/directory level when using --summary-with-details - only json output supports summaries for now Signed-off-by: Philippe Ombredanne <[email protected]>
* Fix test failures (from unstable sort order) * Refactor common code where relevant * Other minor refinements Signed-off-by: Philippe Ombredanne <[email protected]>
A path pattern must be matched or not. For instance matching a directory does not mean the children are matched. Signed-off-by: Philippe Ombredanne <[email protected]>
When doing aggregations for key files or grouping by facet, we need to recompute value summaries for each summarized attribute to get correct summaries. Signed-off-by: Philippe Ombredanne <[email protected]>
When computing summaries for #377, empty values (e.g. summaries of None) and attributes without a summary should not be the cause of crashes. Same for empty directories. Signed-off-by: Philippe Ombredanne <[email protected]>
Context
Scanning operates at the file level. This is good, but in many cases a scan reports too much data at too fine a level of detail. This happens when related clues are detected across files or inside the same file.
Problem
Multiple related clues in different files
For instance, if every file in a directory tree has the same license and copyright statements, then the license and origin information could be rolled up at the level of this directory and the file details could be omitted.
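A minimal sketch of this first roll-up, assuming the scan results have already been reduced to a plain mapping of file path to detected license keys. The function name and the input shape are hypothetical and are not part of ScanCode. The raw per-file data is left untouched, so a reviewer can still retrace how a directory-level result was reached.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def roll_up_uniform_licenses(file_licenses):
    """Return {directory: license_set} for every directory whose files
    (at any depth) all carry exactly the same non-empty license set.

    `file_licenses` is a hypothetical {path: [license_key, ...]} mapping.
    """
    by_dir = defaultdict(list)
    for path, licenses in file_licenses.items():
        # Register the file under each of its ancestor directories.
        for parent in PurePosixPath(path).parents:
            if str(parent) != ".":
                by_dir[str(parent)].append(frozenset(licenses))
    return {
        directory: set(found[0])
        for directory, found in by_dir.items()
        if found[0] and all(item == found[0] for item in found)
    }


scan = {
    "src/a.c": ["mit"],
    "src/lib/b.c": ["mit"],
    "docs/x.txt": ["mit"],
    "docs/y.txt": ["gpl-2.0"],
}
rolled_up = roll_up_uniform_licenses(scan)
```

Here `src` and `src/lib` qualify for a roll-up, while `docs` does not because its files disagree.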
Or say that a scanned directory only contains a COPYING file with a license and notice and none of the files in that directory have a license or copyright. Then the license and origin information could be extended from the COPYING to all the files in that tree.
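This deduction could be sketched as follows, again over a hypothetical mapping of path to license keys. The list of legal file names is an illustrative assumption, and the sketch only extends a license to files that have no clue of their own. Inherited values are returned separately from the raw scan so that they can be flagged as deduced rather than detected.

```python
from pathlib import PurePosixPath

# Assumption: an illustrative list of legal file names, not a ScanCode setting.
LEGAL_FILE_NAMES = {"COPYING", "LICENSE", "NOTICE"}


def inherit_from_legal_files(file_licenses):
    """Return {path: inherited_license_set} for files without any license
    clue, using the nearest ancestor directory that holds a legal file."""
    legal_by_dir = {}
    for path, licenses in file_licenses.items():
        pure = PurePosixPath(path)
        if pure.name in LEGAL_FILE_NAMES and licenses:
            legal_by_dir.setdefault(str(pure.parent), set()).update(licenses)

    inherited = {}
    for path, licenses in file_licenses.items():
        if licenses:
            continue  # files with their own clues are never overridden
        for parent in PurePosixPath(path).parents:  # nearest directory first
            if str(parent) in legal_by_dir:
                inherited[path] = set(legal_by_dir[str(parent)])
                break
    return inherited


scan = {
    "pkg/COPYING": ["gpl-2.0"],
    "pkg/main.c": [],
    "pkg/sub/util.c": [],
    "pkg/sub/other.c": ["mit"],
}
deduced = inherit_from_legal_files(scan)
```

Keeping the deduced values apart from the detected ones matters because a COPYING text alone cannot tell, for example, gpl-2.0 from gpl-2.0-plus.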
Or say that a scanned directory only contains a README file with a license and notice and that all the files in that directory have a comment such as "See README for licensing". Then the license and origin information could be extended from the README to all the files in that tree that carry this comment.
Or say that a Package is detected (such as a Maven Jar, an NPM package or something else) and that the package-level metadata accurately describes the licensing of all the files for this package and that the scan of the files in this package does not bring new details. Then only the license and origin information from the package could be kept and the file details omitted.
Or say that a directory contains code in a mix of programming languages: the primary or main language or language stats could be rolled up at the directory level.
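As an illustration, language stats per directory could be computed along these lines. The input is a hypothetical mapping of path to detected programming language (None when no language was detected), and counting files is only one possible measure; lines or bytes of code would be alternatives.

```python
from collections import Counter
from pathlib import PurePosixPath


def language_stats(file_languages):
    """Return {directory: Counter({language: file_count})} where each file
    is counted in all of its ancestor directories."""
    stats = {}
    for path, language in file_languages.items():
        if not language:
            continue  # skip files with no detected language
        for parent in PurePosixPath(path).parents:
            stats.setdefault(str(parent), Counter())[language] += 1
    return stats


def primary_language(counter):
    """Return the most frequent language of a directory-level Counter."""
    return counter.most_common(1)[0][0]


files = {
    "src/a.py": "Python",
    "src/b.py": "Python",
    "src/ext/c.c": "C",
    "src/README": None,
}
stats = language_stats(files)
main_language = primary_language(stats["src"])
```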
Or say that a directory contains both code and build scripts and that the license for the build scripts is different from that of the code (say this is some autotools MIT or FSF notice). Then the licenses for the directory could be summarized based on a classification of the code files, and the build scripts and the build script licenses would not be reported as the directory or package license.
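One way to picture this classification is sketched below. The set of build script names is an illustrative assumption and would need to be a curated list in practice; matching on the file name alone is also a simplification. Build script licenses are kept in their own bucket rather than discarded.

```python
from pathlib import PurePosixPath

# Assumption: a small illustrative set of autotools file names.
BUILD_SCRIPT_NAMES = {
    "config.guess", "config.sub", "configure",
    "install-sh", "ltmain.sh", "missing", "depcomp",
}


def is_build_script(path):
    """Classify a file as a build script based on its name only."""
    return PurePosixPath(path).name in BUILD_SCRIPT_NAMES


def summarize_licenses_by_class(file_licenses):
    """Split detected licenses into 'code' and 'build' buckets so that only
    the 'code' bucket is reported as the directory or package license."""
    summary = {"code": set(), "build": set()}
    for path, licenses in file_licenses.items():
        bucket = "build" if is_build_script(path) else "code"
        summary[bucket].update(licenses)
    return summary


scan = {
    "pkg/src/main.c": ["mit"],
    "pkg/config.guess": ["gpl-3.0-plus"],
    "pkg/install-sh": ["x11"],
}
summary = summarize_licenses_by_class(scan)
```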
Multiple related clues in the same file
Some scans operate on the same data in a given file. This may trigger the reporting of extra or spurious clues that could instead be considered together.
For instance, a license text may contain a copyright statement for the text of the license, as well as URLs and emails. Detecting licenses, copyrights, emails and URLs could report four different clues in the same scanned file and scanned text region, when only a single clue for the license should be reported.
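A rough sketch of how such overlapping clues could be merged, assuming each clue carries a kind and a line range. The `Clue` type and the function name are hypothetical. The sketch also assumes that the license clues passed in are matches of full license texts: applying it to short license notices would wrongly drop the copyright of the actual code owner.

```python
from typing import NamedTuple


class Clue(NamedTuple):
    kind: str        # "license", "copyright", "email" or "url"
    value: str
    start_line: int
    end_line: int


def drop_clues_inside_license_text(clues):
    """Keep license clues and any other clue that is not fully contained
    in the line range of a license text clue."""
    spans = [(c.start_line, c.end_line) for c in clues if c.kind == "license"]
    kept = []
    for clue in clues:
        inside = clue.kind != "license" and any(
            start <= clue.start_line and clue.end_line <= end
            for start, end in spans
        )
        if not inside:
            kept.append(clue)
    return kept


clues = [
    Clue("license", "gpl-2.0", 1, 300),
    Clue("copyright", "Copyright (C) Free Software Foundation, Inc.", 4, 4),
    Clue("url", "http://www.gnu.org/", 290, 290),
    Clue("copyright", "Copyright (c) Example Corp", 310, 310),
]
kept = drop_clues_inside_license_text(clues)
```

Only the license clue and the copyright on line 310 remain; the function returns a new list, so the full raw list stays available.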
Or a package metadata file would typically contain origin and license information, and these would end up reported twice: both as package attributes and as individual detections for license, copyright and URLs.
Solution elements
A comprehensive solution may cover some or all of these: