Consider detected copyrights when determining a declared holder from a package manifest in summary plugin #2972

Open
JonoYang opened this issue May 19, 2022 · 6 comments

@JonoYang
Member

When scanning the package atheris v2.0.11 (https://github.com/google/atheris/archive/refs/tags/2.0.11.tar.gz) with the --summary plugin, the declared_holder value in the scan summary is Bitshift, which is the author of the package. This value comes from the package data parsed from atheris's setup.py file. However, setup.py also contains a comment with a copyright statement that names the actual copyright holders. The summary plugin should be updated to also consider copyrights detected by the copyright scanner, and a detected holder should take precedence over authors.
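To make the proposed precedence concrete, here is a minimal sketch, not the summary plugin's actual code. The function name `pick_declared_holder`, its parameters, and the `fall_back_to_authors` flag are all hypothetical; the flag exists only because whether authors should be used at all is an open question in this issue.

```python
from typing import Iterable, Optional


def _first_non_empty(values: Iterable[str]) -> Optional[str]:
    """Return the first non-blank value, stripped, or None if there is none."""
    for value in values:
        if value and value.strip():
            return value.strip()
    return None


def pick_declared_holder(
    detected_holders: Iterable[str],
    package_authors: Iterable[str],
    fall_back_to_authors: bool = False,
) -> Optional[str]:
    """
    Hypothetical helper: prefer a holder detected by the copyright scanner in
    the package manifest over the authors parsed from the package data.
    """
    holder = _first_non_empty(detected_holders)
    if holder is not None:
        return holder
    if fall_back_to_authors:
        # Assumption: only reached when no copyright statement was detected.
        return _first_non_empty(package_authors)
    return None
```

With illustrative inputs, `pick_declared_holder(["Google LLC"], ["Bitshift"])` returns the detected holder rather than the author.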

@JonoYang
Member Author

It may also be unwise to use the package authors as the copyright holder when we do not detect an explicit copyright statement in the package data.

JonoYang added a commit that referenced this issue May 19, 2022
    * Update expected test results
    * TODO: consider not converting common company names to a canonical form

Signed-off-by: Jono Yang <[email protected]>
@JonoYang
Member Author

JonoYang commented May 19, 2022

@DennisClark @tdruez @pombredanne

When removing the code that assigns the author or other detected parties from a Package as the declared holder, I noticed that the tallies plugin normalizes the holders detected from Resources in the codebase. The majority of the files have Google LLC as the copyright holder, but in the summary only Google, Inc. shows up as the declared holder. This normalization lets us group different forms of the same holder together. For example, at https://github.com/nexB/scancode-toolkit/blob/2972-summary-consider-copyrights/src/summarycode/copyright_tallies.py#L487, we normalize google, google llc, and google inc to Google, Inc.

Should we remove this normalization of holders to a canonical form? Normalizing and grouping the related holders together helps with getting a good count of how many times a particular holder shows up, especially when there are many different forms of copyright statements for that holder. However, it can become confusing when someone wants to verify the summary results and they cannot find the declared holder in files because the detected holder value was changed.
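The trade-off is easier to see with a small sketch of this kind of normalization. This is an illustration only and not the code in copyright_tallies.py: the three mapping entries come from the example above, while `canonical_key`, `tally_holders`, and the punctuation handling are assumptions.

```python
from collections import Counter

# Mapping taken from the example above; the real table is larger.
CANONICAL_HOLDERS = {
    "google": "Google, Inc.",
    "google llc": "Google, Inc.",
    "google inc": "Google, Inc.",
}


def canonical_key(holder):
    """Lowercase the holder and drop commas and periods (assumed key scheme)."""
    cleaned = holder.lower().replace(",", " ").replace(".", " ")
    return " ".join(cleaned.split())


def tally_holders(holders):
    """Count holders after mapping known forms to their canonical name."""
    counts = Counter()
    for holder in holders:
        canonical = CANONICAL_HOLDERS.get(canonical_key(holder), holder)
        counts[canonical] += 1
    return counts.most_common()
```

Note that the original form "Google LLC" no longer appears anywhere in the result, which is exactly what makes the summary hard to verify against the files.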

@DennisClark
Member

@JonoYang I am not convinced that using an author value for Holder when there is no copyright detected is a good thing, although I don't feel strongly about it. However, I vaguely recall some community discussion on this topic, where someone strongly asserted that author is NOT equivalent to copyright, so there is definitely a case for not using it at all for a Holder.

As far as "normalizing" the holder goes, it is a nice feature if we can still point back to the original somehow.

@JonoYang
Member Author

JonoYang commented May 20, 2022

@DennisClark

> @JonoYang I am not convinced that using an author value for Holder when there is no copyright detected is a good thing, although I don't feel strongly about it. However, I vaguely recall some community discussion on this topic, where someone strongly asserted that author is NOT equivalent to copyright, so there is definitely a case for not using it at all for a Holder.

I've removed the code that uses the Package authors/maintainers as a holder when no copyright is detected.

> As far as "normalizing" the holder goes, it is a nice feature if we can still point back to the original somehow.

Maybe we can have a list of the original holder values when we present the tallies of holders?

    ...
    "declared_holder": {
        "holder": "Google, Inc.",
        "holder_forms": [
            "Google LLC",
            "Google, Inc."
        ]
    },
    "other_holders": [
        {
            "value": "Fraunhofer FKIE",
            "holder_forms": [
                "Fraunhofer FKIE"
            ],
            "count": 21
        }
    ],
    ...

I'm not sure what the best name for that field would be.
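A tally that keeps the original forms could be built roughly like this. It is a sketch under assumptions: `ALIASES`, `canonicalize`, and `tally_with_forms` are hypothetical names, and the output keys simply mirror the proposed "value", "holder_forms", and "count" fields.

```python
# Hypothetical alias table; it stands in for whatever normalization is used.
ALIASES = {"Google LLC": "Google, Inc."}


def canonicalize(holder):
    """Map a detected holder to its canonical form, or keep it unchanged."""
    return ALIASES.get(holder, holder)


def tally_with_forms(holders):
    """Group holders by canonical form and remember each original form seen."""
    groups = {}
    for holder in holders:
        value = canonicalize(holder)
        entry = groups.setdefault(
            value, {"value": value, "holder_forms": [], "count": 0}
        )
        if holder not in entry["holder_forms"]:
            entry["holder_forms"].append(holder)
        entry["count"] += 1
    # Most frequent first; ties are broken alphabetically for stable output.
    return sorted(groups.values(), key=lambda e: (-e["count"], e["value"]))
```

Keeping the forms in first-seen order means a reader can search the scanned files for any listed form and find a match.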

JonoYang added a commit that referenced this issue May 24, 2022
    * If no explicit copyright was detected from Package datafiles, then we take the holder detections from those files instead

Signed-off-by: Jono Yang <[email protected]>
@JonoYang
Member Author

After discussing this with @pombredanne, it makes sense to use just the company/organization name itself, without any suffixes. Google, Inc., Google LLC, etc. should all become Google.
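One way this suffix stripping could look is sketched below. The suffix list and the name `strip_legal_suffix` are illustrative assumptions, not what was committed.

```python
import re

# Illustrative, incomplete list of legal-entity suffixes (an assumption).
LEGAL_SUFFIXES = ("inc", "llc", "ltd", "corp", "corporation", "gmbh")

# Match a separator, one suffix, and an optional trailing period at the end.
_SUFFIX_RE = re.compile(
    r"[\s,]+(?:" + "|".join(LEGAL_SUFFIXES) + r")\.?\s*$",
    re.IGNORECASE,
)


def strip_legal_suffix(holder):
    """Remove one trailing legal suffix, e.g. 'Google, Inc.' -> 'Google'."""
    holder = holder.strip()
    stripped = _SUFFIX_RE.sub("", holder)
    # Never return an empty name; fall back to the original holder.
    return stripped or holder
```

Because the pattern requires a separator before the suffix, a name that merely ends in those letters (for example a single word ending in "inc") is left untouched.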

@mjherzog
Member

That does make sense for this case, but this Google example seems to be a relatively easy one. There will be many other cases where the relationship among holders is not evident in the names. There is really no way for us to figure this out from a set of copyright holder names beyond these simple cases. What would be interesting is to know the holder best associated with the primary license.
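As a rough illustration of that last idea, one could count holders only in files whose detected licenses include the primary license. This is a sketch under assumptions: the per-file dictionaries with "licenses" and "holders" keys are a simplified, hypothetical shape rather than the scan output format, and `holder_for_primary_license` is a made-up name.

```python
from collections import Counter


def holder_for_primary_license(files, primary_license):
    """
    Return the holder seen most often in files that carry the primary
    license, or None when no such file has a detected holder.
    """
    counts = Counter()
    for scanned_file in files:
        if primary_license in scanned_file.get("licenses", ()):
            counts.update(scanned_file.get("holders", ()))
    if not counts:
        return None
    return counts.most_common(1)[0][0]
```

This only associates holders and licenses by co-occurrence in a file; it says nothing about how differently named holders relate to each other.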

JonoYang added a commit that referenced this issue May 24, 2022
    * Update expected test results

Signed-off-by: Jono Yang <[email protected]>
JonoYang added a commit that referenced this issue May 25, 2022
pombredanne added a commit that referenced this issue Jun 10, 2022
KevinJi22 pushed a commit to KevinJi22/scancode-toolkit that referenced this issue Jun 14, 2022