Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update Warehouse's models to accomodate OIDC verification #10617

Closed
woodruffw opened this issue Jan 19, 2022 · 10 comments · Fixed by #10753
Closed

Update Warehouse's models to accomodate OIDC verification #10617

woodruffw opened this issue Jan 19, 2022 · 10 comments · Fixed by #10753
Assignees

Comments

@woodruffw
Copy link
Member

Original thought: Warehouse's User model, at minimum, will probably need one or more new columns (possibly relations) for each supported provider.

Upon conversation with @di: we don't actually need the AuthN step, so we can probably get away without any changes to the User model. Instead, each project model will need a GitHub repository URL and workflow name to check incoming JWTs against.

@woodruffw
Copy link
Member Author

woodruffw commented Jan 19, 2022

Open questions about the data model:

  • The relationship between GitHub repositories and PyPI projects is potentially many-to-one (one repo hosting many Python packages), so it probably doesn't make sense to have the tuple of (repo_url, workflow_name) be unique. It might even make sense to have it be a relation to a separate table, so that PyPI admins can perform more direct administration/triage (e.g., deleting the pairing for every project that references a particular repo in the event of a compromise.)

  • Because of the above many-to-one relationship, we probably need to use the aud field in the JWT after all, since all other default claims won't allow us to disambiguate between different projects being published in the same workflow.

    • Alternatively: we could require GitHub repos to use per-Python-project workflows, e.g. (https://github.com/foo/bar, release-bar-api) and (https://github.com/foo/bar, release-bar-cli) for bar-api and bar-cli respectively. But this might be imposing too much structure and might hinder adoption.

TL;DR of the above: if we make the (repo_url, workflow_name) tuple unique for each Project model, then we don't have any ambiguities when it comes to creating the right temporary token for the workflow. Otherwise, we need some ambiguity resolution technique, whether that's setting aud or something else.

@woodruffw woodruffw self-assigned this Jan 19, 2022
@di
Copy link
Member

di commented Jan 21, 2022

The relationship between GitHub repositories and PyPI projects is potentially many-to-one (one repo hosting many Python packages), so it probably doesn't make sense to have the tuple of (repo_url, workflow_name) be unique. It might even make sense to have it be a relation to a separate table, so that PyPI admins can perform more direct administration/triage (e.g., deleting the pairing for every project that references a particular repo in the event of a compromise.)

Agreed. Not just multiple projects per repo, but that a single workflow in a single repo might publish multiple discrete projects.

Considering that it might someday be useful to determine the inverse of the relationship ("which projects correspond to repo X?") I think this could be a separate SourceRepository table, and an association table between that and Projects.

Because of the above many-to-one relationship, we probably need to use the aud field in the JWT after all, since all other default claims won't allow us to disambiguate between different projects being published in the same workflow.

We should be able to distinguish based on the artifact being uploaded (filename, metadata), no?

Overall, we should think about this in a generic way and not a GitHub-specific way, e.g. as a relationship between a project and a source repository with a "type" of GitHub, which is the default type for now.

@woodruffw
Copy link
Member Author

Considering that it might someday be useful to determine the inverse of the relationship ("which projects correspond to repo X?") I think this could be a separate SourceRepository table, and an association table between that and Projects.

Sounds good! Just to clarify: are you thinking that SourceRepository should be GitHub only, or should it have some kind of column identifying the source host? If the latter, we should also give some thought to how we store the workflow name, since other source hosts won't necessarily have the same "workflow" concept for cross-checking an OIDC JWT against.

We should be able to distinguish based on the artifact being uploaded (filename, metadata), no?

Oh, that's a good point -- I forgot that package distributions have well-structured filenames 😅

That does indeed solve the problem without needing the aud, which is nice. We'll probably want to still use some of GitHub's non default claims, but that shouldn't either complicate or impede our ability to support other providers.

@di
Copy link
Member

di commented Jan 21, 2022

The latter, and I think SourceRepository is probably not ideal because we're conflating two things which happen to be the almost the same with GitHub: the source location and the thing that can publish.

So maybe this is a Publisher or DelegatedPublisher or PublisherClaim or OIDCPublisher or Claimant or something? And the actual values are claims, so we could store the key/value pairs that we expect to validate for the claim, so this could be something like:

{
    "iss": "https://token.actions.githubusercontent.com",
    "sub": "repo:octo-org/octo-repo:environment:prod",
    "additonal_claims": {
        "whatever": "whatever"
        ...
    }
}

Since per https://openid.net/specs/openid-connect-core-1_0.html

The sub (subject) and iss (issuer) Claims, used together, are the only Claims that an RP can rely upon as a stable identifier for the End-User

@di
Copy link
Member

di commented Jan 21, 2022

Also... this is probably actually a many:many relationship. I can't see a reason why a project would be limited to a single delegated publisher. However, in practice, I imagine most users will only want a single publisher, so we can stick to 1 publisher per project for now, but keep in mind that we might make it many:many in the future.

@woodruffw
Copy link
Member Author

#10645 (comment) gave us a clearer picture of what claims we want for the GitHub case. What that in mind, here's the data/relationship model I have in mind:

Project -> [Publisher] # a Project can have multiple Publishers
Publisher = {name, issuer, [PublisherClaim]} # name is a "pretty" name like "github"
PublisherClaim = {key, value} # key and value are unstructured strings

PublisherClaim is the murkiest part -- we need to be able to support things that aren't "claims" in the OAuth/OIDC sense, but are data needed to verify claims. For example, for a github publisher, the [PublisherClaim] table slice might look like:

key value
repository foobar
owner woodruffw
owner_id 1122334455
workflow_name release

As a result, this might not be the best terminology or the best way to store the "claims." Another option is to store them as a claims/claim_verification JSON blob on each Publisher.

@woodruffw
Copy link
Member Author

Also N.B.: We might not want to store the issuer with each Publisher, since it's essentially duplicated en masse. We could instead key it based on name, which should be a relatively small set of supported OIDC providers. That's what I had in mind in part with the OIDC_PROVIDERS dict originally in #10628.

@woodruffw
Copy link
Member Author

Similar to #10617 (comment): we'll want to verify the aud, but we don't need to store it (since it'll always be pypi or some other fixed string we pick).

@di
Copy link
Member

di commented Feb 4, 2022

As a result, this might not be the best terminology or the best way to store the "claims."

I think I agree. What we really want is something that can store the necessary key/value details, and that for each claim to be verified, can generate a string from them.

A very rough sketch:

class PublisherThingy:
  
   def __init__(self, repository, owner, owner_id, workflow_name):
     self.repository = repository
     self.owner = owner
     ...
      
  @claim
  def repository(self):
    return f"{self.owner}/{self.repository}"
      
  def verify_claims(self, identity_token):
    for claim in self.claims:
      if claim != identity_token[claim]:
        return False
    return True

@woodruffw
Copy link
Member Author

Follow on: #11096.

@di di closed this as completed in #10753 Apr 5, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants