Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Scanning NPM packages that list SPDX IDs results in false positives #1032

Closed
sschuberth opened this issue Apr 13, 2018 · 10 comments
Closed

Scanning NPM packages that list SPDX IDs results in false positives #1032

sschuberth opened this issue Apr 13, 2018 · 10 comments

Comments

@sschuberth
Copy link
Collaborator

sschuberth commented Apr 13, 2018

Scanning the source code of NPM packages like

that contain lists of SPDX license IDs trigger a bunch of false positive license findings. I wonder whether there's something smart we could do about that, like not reporting licenses if you have many short identifiers very close together.

@pombredanne
Copy link
Member

@sschuberth this is a thorny issue indeed. It has been discussed a bit in the past in #270 with @spf13 and @yahalom5776 .... the problem is that if we scan for licensing a package that itself is a licensing-related tool and contains a lot of license references and texts (such as golang's cobra or the npms you mentioned here, or if you were to scan scancode itself), we end up getting a lot of licenses detected... in some cases some false positive'ish looking detections and these licenses are typically NOT the licenses of the package being scanned. Imagine you scan scancode... you will be getting literally about 10K+ license matches all of which would likely be correct and at the same time completely useful and meaningless.

IMHO the way out is this: as part of the general theme of classification/summarization #426 we should eventually store a set of patterns and names of these packages (there is a finite number of these such as scancode proper, the SPDX and licensing-related npms, and possibly less than a 100 overall) such that these packages are clearly identified and that either

  1. specific filtering patterns are applied in these cases
  2. OR a clear message is returned when these are found to indicate that the detected licenses are not meaningful and explain why.

@pombredanne
Copy link
Member

Note that we could also --and right away-- create a bunch of license detection rules (typically "negative" rule)s that would avoid the detection there. This would be a lot of rules, though they are fairly easy to generate. Let me push a quickie branch to show you this approach. These rules would end up being very specific, though in the case of a long list of SPDX ids, they may be a tad more universal.

pombredanne added a commit that referenced this issue Apr 13, 2018
pombredanne added a commit that referenced this issue Apr 13, 2018
pombredanne added a commit that referenced this issue Apr 13, 2018
 * some npms contains a lot of SPDX licenses: these rules are crafted
   to avoid reporting these. The bulk of these has been generated based
   on chunks and ngrams from full lists of licenses ids found in these
   packages

Signed-off-by: Philippe Ombredanne <[email protected]>
@pombredanne
Copy link
Member

@sschuberth the branch https://github.com/nexB/scancode-toolkit/tree/1032-license-tools-false-positive shows you the negative rules approach. This is a quick shot with mostly generated rules using ngrams, probably a few too many ;).... The tests show that you tow npm above would be detected fullyu accurately with these rules.... Though I am not 100% sure this is the right way to go as this creates a lot of rules for only a few not so common cases IMHO.

pombredanne added a commit that referenced this issue Apr 14, 2018
 * some npms contains a lot of SPDX licenses: these rules are crafted
   to avoid reporting these. The bulk of these has been generated based
   on chunks of 8 or 10 lines from the license id lists found in these
   packages

Signed-off-by: Philippe Ombredanne <[email protected]>
@pombredanne
Copy link
Member

@sschuberth any feedback?

@sschuberth
Copy link
Collaborator Author

Though I am not 100% sure this is the right way to go as this creates a lot of rules for only a few not so common cases IMHO.

That's also what I'm asking myself. Not sure how maintainable this approach is. But probably the only way to tell is to merge this (now that you have it) and see how many more of these we would need over time?

@pombredanne
Copy link
Member

How many common packages do we see with these issues beyond the common SPDX-related npms?

@sschuberth
Copy link
Collaborator Author

Hard to say, these are the first I saw in about a thousand scans.

@pombredanne
Copy link
Member

@sschuberth OK, I will wait a bit, because adding 100+ new "negative" detection rules may not be the most scalable way to handle this.

pombredanne added a commit that referenced this issue May 6, 2018
 * Add negative rules to avoid incorrect license detection in some
   license-related tools : some npms contains a lot of SPDX licenses.
   These rules are crafted to avoid reporting these. The bulk of these
   has been generated based on chunks of 8 or 10 lines from the license
   id lists found in these packages
 * Add rules for correct license detection in certain package.json
 * Add test for correct detection in license tools

Signed-off-by: Philippe Ombredanne <[email protected]>
pombredanne added a commit that referenced this issue May 6, 2018
Signed-off-by: Philippe Ombredanne <[email protected]>
@pombredanne
Copy link
Member

And I merged it after all!

@pombredanne
Copy link
Member

I am doing some cleanup and review as a prep for the 3.0 release... and since this is fixed in develop, I am closing this!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants