Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix STM error when fetching a container needs multiple different tokens #1370

Merged
merged 20 commits into from
Feb 8, 2024

Conversation

csasarak
Copy link
Contributor

@csasarak csasarak commented Jan 25, 2024

Overview

I was investigating STM errors and found one in the way that we were handling auth tokens with STM. The idea is that for container scanning we want to share an auth token between a few workers. However, this code originally used putTMVar for updating the variable holding the token. For registries that use only one token this is fine, the RegistryCtx starts empty, is filled, and then that token works for every layer archive download. In the case of some registries the token for downloading blobs may be different than the one for fetching the manifest.

When that happens:

  1. A thread gets a new auth token from the authentication server.
  2. It goes to write the new token into the RegistryCtx using putTMVar so other threads can use it.
  3. The write blocks because putTMVar will wait for the RegistryCtx TMVar to be emptied (using some variation of takeTMVar) which will never happen.
  4. Every thread fetching a blob does steps 1-3.
  5. Haskell's RTS notices that all these threads are waiting for the TMVar to change, but also notices that there isn't an unblocked thread with a reference to the TMVar. Realizing that the TMVar will never be released, it kills all the blocked threads with a thread blocked indefinitely in an STM transaction exception.

This PR:

  • Changes to using writeTMVar but also adds some logic so that when one thread is working to get a new auth token, the other ones will wait for it.
  • Ads some logging that I wrote that I think might be generally useful.
  • Adds hlint rules to warn about using operations in STM that may potentially block.
  • Adds an integration test for registry image scanning.

Acceptance criteria

  • We can download container layers from registries that may use different auth tokens for different parts of the fetch.

Testing plan

A simple interactive test involves using our currently released fossa to get the alpine image from AWS's public registry:

$ fossa container analyze public.ecr.aws/docker/library/alpine:edge -o

It should fail with an STM exception.

Running the same command with the code on this branch should succeed. If you run it with --debug, you can also observe the coordination of threads fetching tokens in the debug bundle:

[
  "[ 0 Waiting / 3 Running / 0 Completed ]",
  "[ 0 Waiting / 2 Running / 1 Completed ]",
  {
    "duration": "0.409973000",
    "events": [
      "Requesting: \"HEAD\" https://public.ecr.aws/v2/docker/library/alpine/blobs/sha256:9198849dd7f6d0e6e7d483145c7d50e5cc9f3138d544a46b76baed8dfd9c157d",
      "Received: Status {statusCode = 401, statusMessage = \"Unauthorized\"} (Content-Type:Just \"application/json; charset=utf-8\")",
      "Got auth challenge: BearerAuthChallenge (RegistryBearerChallenge {authChallengeBearerRealm = \"https://public.ecr.aws/token/\", authChallengeService = \"public.ecr.aws\", authChallengeScope = \"aws\"})",
      "Requesting: \"GET\" https://public.ecr.aws/token/?service=public.ecr.aws&scope=aws",
      "Received: Status {statusCode = 200, statusMessage = \"OK\"} (Content-Type:Just \"application/json; charset=utf-8\")",
      "Downloaded: 9198849dd7f6d0e6e7d483145c7d50e5cc9f3138d544a46b76baed8dfd9c157d.json"
    ],
    "scope": "Export job ID: d8dc4f2e-fd18-45c4-a947-2c5d4b4638e8, Export thread ID: ThreadId 51"
  },
  "[ 0 Waiting / 1 Running / 2 Completed ]",
  {
    "duration": "0.626634000",
    "events": [
      "Requesting: \"HEAD\" https://public.ecr.aws/v2/docker/library/alpine/blobs/sha256:dcccee43ad5d95c556da9df1c1d859fd9864643786d8c2c323ca9886c51b07b9",
      "Received: Status {statusCode = 401, statusMessage = \"Unauthorized\"} (Content-Type:Just \"application/json; charset=utf-8\")",
      "Token is already being updated. Waiting...",
      "Gzip extracted & downloaded: dcccee43ad5d95c556da9df1c1d859fd9864643786d8c2c323ca9886c51b07b9.tar"
    ],
    "scope": "Export job ID: 0333d963-cb16-4a23-bac5-9fad13bee987, Export thread ID: ThreadId 48"
  }
]

Risks

The thread that's fetching the token could receive a malformed challenge and fail, leaving the RegistryCtx in a bad state that would result in the same STM error as before.

This PR still represents an improvement in the number of cases where that could happen so I still think it's worth merging. I'd like to see if there's a more general way to handle situations like that as part of continuing work on ANE-883.

Metrics

References

Checklist

  • I added tests for this PR's change (or explained in the PR description why tests don't make sense).
  • If this PR introduced a user-visible change, I added documentation into docs/.
  • If this PR added docs, I added links as appropriate to the user manual's ToC in docs/README.ms and gave consideration to how discoverable or not my documentation is.
  • If this change is externally visible, I updated Changelog.md. If this PR did not mark a release, I added my changes into an # Unreleased section at the top.
  • If I made changes to .fossa.yml or fossa-deps.{json.yml}, I updated docs/references/files/*.schema.json AND I have updated example files used by fossa init command. You may also need to update these if you have added/removed new dependency type (e.g. pip) or analysis target type (e.g. poetry).
  • If I made changes to a subcommand's options, I updated docs/references/subcommands/<subcommand>.md.

@csasarak csasarak marked this pull request as ready for review January 26, 2024 02:15
@csasarak csasarak requested a review from a team as a code owner January 26, 2024 02:15
@csasarak csasarak changed the title Registryctx stm fix Fix STM error when fetching a container needs multiple different tokens Jan 26, 2024
@csasarak csasarak merged commit 19cdf82 into master Feb 8, 2024
16 checks passed
@csasarak csasarak deleted the registryctx-stm-fix branch February 8, 2024 16:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants