-
-
Notifications
You must be signed in to change notification settings - Fork 576
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Scanning NPM packages that list SPDX IDs results in false positives #1032
Comments
@sschuberth this is a thorny issue indeed. It has been discussed a bit in the past in #270 with @spf13 and @yahalom5776 .... the problem is that if we scan for licensing a package that itself is a licensing-related tool and contains a lot of license references and texts (such as golang's cobra or the npms you mentioned here, or if you were to scan scancode itself), we end up getting a lot of licenses detected... in some cases some false positive'ish looking detections and these licenses are typically NOT the licenses of the package being scanned. Imagine you scan scancode... you will be getting literally about 10K+ license matches all of which would likely be correct and at the same time completely useful and meaningless. IMHO the way out is this: as part of the general theme of classification/summarization #426 we should eventually store a set of patterns and names of these packages (there is a finite number of these such as scancode proper, the SPDX and licensing-related npms, and possibly less than a 100 overall) such that these packages are clearly identified and that either
|
Note that we could also --and right away-- create a bunch of license detection rules (typically "negative" rule)s that would avoid the detection there. This would be a lot of rules, though they are fairly easy to generate. Let me push a quickie branch to show you this approach. These rules would end up being very specific, though in the case of a long list of SPDX ids, they may be a tad more universal. |
Signed-off-by: Philippe Ombredanne <[email protected]>
Signed-off-by: Philippe Ombredanne <[email protected]>
* some npms contains a lot of SPDX licenses: these rules are crafted to avoid reporting these. The bulk of these has been generated based on chunks and ngrams from full lists of licenses ids found in these packages Signed-off-by: Philippe Ombredanne <[email protected]>
@sschuberth the branch https://github.com/nexB/scancode-toolkit/tree/1032-license-tools-false-positive shows you the negative rules approach. This is a quick shot with mostly generated rules using ngrams, probably a few too many ;).... The tests show that you tow npm above would be detected fullyu accurately with these rules.... Though I am not 100% sure this is the right way to go as this creates a lot of rules for only a few not so common cases IMHO. |
* some npms contains a lot of SPDX licenses: these rules are crafted to avoid reporting these. The bulk of these has been generated based on chunks of 8 or 10 lines from the license id lists found in these packages Signed-off-by: Philippe Ombredanne <[email protected]>
@sschuberth any feedback? |
That's also what I'm asking myself. Not sure how maintainable this approach is. But probably the only way to tell is to merge this (now that you have it) and see how many more of these we would need over time? |
How many common packages do we see with these issues beyond the common SPDX-related npms? |
Hard to say, these are the first I saw in about a thousand scans. |
@sschuberth OK, I will wait a bit, because adding 100+ new "negative" detection rules may not be the most scalable way to handle this. |
* Add negative rules to avoid incorrect license detection in some license-related tools : some npms contains a lot of SPDX licenses. These rules are crafted to avoid reporting these. The bulk of these has been generated based on chunks of 8 or 10 lines from the license id lists found in these packages * Add rules for correct license detection in certain package.json * Add test for correct detection in license tools Signed-off-by: Philippe Ombredanne <[email protected]>
Signed-off-by: Philippe Ombredanne <[email protected]>
And I merged it after all! |
I am doing some cleanup and review as a prep for the 3.0 release... and since this is fixed in develop, I am closing this! |
Scanning the source code of NPM packages like
that contain lists of SPDX license IDs trigger a bunch of false positive license findings. I wonder whether there's something smart we could do about that, like not reporting licenses if you have many short identifiers very close together.
The text was updated successfully, but these errors were encountered: