Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

va: log MPIC summaries #7817

Closed
wants to merge 1 commit into from
Closed

va: log MPIC summaries #7817

wants to merge 1 commit into from

Conversation

jsha
Copy link
Contributor

@jsha jsha commented Nov 15, 2024

Add "Perspective" and "RIR" configuration at the primary VA for each of its backends. This information is passed through to the VA as part of the RemoteVA struct.

Remove the "Hostname" field from the RemoteVA struct. It was previously used in logging errors, but Perspective will be a clearer way to indicate those errors.

Hoist the logic for "am I primary?" from performRemoteValidation to PerformValidation. This allows performRemoteValidation to more consistently return a set of succeeded and failed results.

Note: this is a stacked change on top of #7815. DO NOT MERGE until #7815 is merged. Also, this change pulls in the change from good++ / bad++ to slices of passed / failed from #7814.

@jsha jsha requested a review from a team as a code owner November 15, 2024 17:56
@jsha jsha requested review from beautifulentropy and removed request for a team November 15, 2024 17:56
Copy link
Contributor

@jsha, this PR appears to contain configuration and/or SQL schema changes. Please ensure that a corresponding deployment ticket has been filed with the new values.

@jsha
Copy link
Contributor Author

jsha commented Nov 15, 2024

Pulling in some feedback from #7815 (comment):

I previously considered this approach but deemed it too risky. The remote VA backends are configured using SRV records from an in-DC Consul, populated by cloud automation. If the primary VA is set to treat certain backends as "perspective A," but their DNS actually points to "perspective B" backends, it could take a long time to discover the mismatch. The resulting revocation event would include every certificate using an authorization validated by "perspective A".

Instead, we should rely on the remote VAs to report their perspective as part of the ValidationResult. The result of a DNS misconfiguration or a primary VA configured to to lookup the wrong SRV record would be an increase in validations which now fail because they cannot reach quorum.

These are good points, and MPIC is important enough compliance wise that we should do what we can to programmatically detect config problems. There are big advantages to the primary knowing what its perspectives are too, like being able to abort at startup if the perspectives are wrong. And the primary can report failures by perspective even when that perspective is down.

What do you think about this: primary and remotes both know about perspectives, and in each RPC to a remote, the primary asserts what it believes the perspective and RIR are. If the remote disagrees, it returns an error.

@beautifulentropy
Copy link
Member

What do you think about this: primary and remotes both know about perspectives, and in each RPC to a remote, the primary asserts what it believes the perspective and RIR are. If the remote disagrees, it returns an error.

Agreed, comparing the results of both is probably the best call.

@beautifulentropy
Copy link
Member

I'll hold off on reviewing this until main is merged.

@aarongable
Copy link
Contributor

What do you think about this: primary and remotes both know about perspectives, and in each RPC to a remote, the primary asserts what it believes the perspective and RIR are. If the remote disagrees, it returns an error.

I'm fine with this approach in the long term. But the purpose of these smaller PRs was to refactor Samantha's big PR into a collection of more-easily-reviewable changes, not to change the approach taken in that PR. For now, let's stick with having each remote know (and report) its own perspective/RIR, and leave the gRPC request to those remotes alone.

We can add having the primary VA know about its remotes' perspectives/RIRs and add checks that they match up later.

@jsha
Copy link
Contributor Author

jsha commented Nov 15, 2024

Sounds good, I'll split that part out (and fix up merge conflicts).

InternalError string `json:",omitempty"`
Perspective string `json:",omitempty"`
RIR string `json:",omitempty"`
MPICSummary *mpicSummary `json:",omitempty"`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MPIC summary must always be logged, even if it's empty because we caught a local failure. Our logs don't distinguish between a local problem and a remote problem so it's useful to positively state that we did not attempt MPIC.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, will change this to not omitempty.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we document why it's not omitempty?

va/va.go Outdated
Comment on lines 633 to 635
// Note: len(va.remoteVAs) is greater than len(passed) + len(failed) because some
// are canceled on reaching quorum.
QuorumResult: fmt.Sprintf("%d/%d", len(passed), len(va.remoteVAs)),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we make an "attempt" I believe that we have to log the outcome of that attempt according to the BRs:

5.4.1 Types of events recorded

...

The CA SHALL record at least the following events:

...

Multi-Perspective Issuance Corroboration attempts from each Network Perspective, minimally recording the following information:

  • a. an identifier that uniquely identifies the Network Perspective used;
  • b. the attempted domain name and/or IP address; and
  • c. the result of the attempt (e.g., "domain validation pass/fail", "CAA permission/prohibition").>
    Multi-Perspective Issuance Corroboration quorum results for each attempted domain name or IP address represented in a Certificate request (i.e., "3/4" which should be interpreted as "Three (3) out of four (4) attempted Network Perspectives corroborated the determinations made by the Primary Network Perspective).

va/va.go Outdated
Comment on lines 633 to 634
// Note: len(va.remoteVAs) is greater than len(passed) + len(failed) because some
// are canceled on reaching quorum.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not believe these are cancelled, they simply hit timeout after we've already returned the ValidationResult.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In grpc/interceptors.go we apply a deadline, and then defer cancel(). The cancel gets run after the handler returns.

localCtx, cancel := context.WithDeadline(ctx, deadline)
defer cancel()

va/va.go Show resolved Hide resolved
return &mpicSummary{
PassedPerspectives: passedPerspectives,
FailedPerspectives: failedPerspectives,
PassedRIRs: passedRIRs,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're collecting up the RIRs but we're not ensuring that passing validations came from at least two unique RIRs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is true. My goal here was to add logging, not enforcement.

va/caa.go Outdated
@@ -255,17 +256,17 @@ func (va *ValidationAuthorityImpl) performRemoteCAACheck(
result.Problem = probs.ServerInternal("Remote VA IsCAAValid RPC cancelled")
} else {
// Handle validation error.
va.log.Errf("Remote VA %q.IsCAAValid failed: %s", rva.Address, err)
va.log.Errf("Remote VA %q.IsCAAValid failed: %s", perspective, err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rva.Perspective is probably less useful than rva.Address when trying to find a remote VA instance that's acting up so you can terminate it.

va/caa.go Outdated
result := &remoteVAResult{
VAHostname: rva.Address,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was removed, if you add it back as I have suggested in other comments it should be Address as this is the [IP Address]:Port of the remote VA.

va/caa_test.go Outdated
Comment on lines 720 to 722
{brokenVA, testPerspective1, testRIR1},
{remoteVA, testPerspective2, testRIR2},
{remoteVA, testPerspective3, testRIR3},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All of the test cases here stop at 3 perspectives, but the number of allowed failures increases to 2 at 6+

Table: Quorum Requirements

# of Distinct Remote Network Perspectives Used # of Allowed non-Corroborations
2-5 1
6+ 2

We should add a test that checks this at exactly below and above that threshold.

va/va.go Outdated
Comment on lines 289 to 295
for i, va1 := range remoteVAs {
for j, va2 := range remoteVAs {
if i != j && va1.Perspective == va2.Perspective && va1.Perspective != "" {
return nil, fmt.Errorf("duplicate remote VA perspective %q", va1.Perspective)
}
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a check like this could give someone a false sense of confidence, when in reality the backends it starts with will be steadily replaced as remote VAs scale up and down daily. It probably doesn't need to go away, but we should comment that these are also evaluated when receiving each validation/check RPC.

@aarongable
Copy link
Contributor

aarongable commented Nov 15, 2024

Given the complexity of the requirements here (as exemplified by the comments that Samantha just left) I'm unsure if the "lots of small refactors" plan should go this far. Right now I'm thinking it may be best for the two biggest pieces of Samantha's PR -- MPIC enforcement, and dropping CAA -- to remain in her original PR, rebased on top of the great small cleanups landed this week. Do we think that the changes made so far have improved the reviewability of that PR sufficiently?

Maybe I'm totally wrong, and this is the right approach to be taking. But it worries me to be going over MPIC requirements (like "what's the denominator in the quorum fraction") again, when that PR had already resolved those.

@jsha jsha changed the base branch from min-3-rvas to main November 15, 2024 22:54
@jsha
Copy link
Contributor Author

jsha commented Nov 15, 2024

You both make the good point that this PR got too big! I noticed the issue of "oops sometimes we log hostnames instead of perspectives," and fixing it, well... snowballed.

I've update this to only add the logging, and leave the hostnames alone for a future PR. I still need to add some test cases that check for the logging.

I've also changed the base to main.

@aarongable aarongable marked this pull request as draft November 25, 2024 19:48
@jsha
Copy link
Contributor Author

jsha commented Dec 6, 2024

Closing in favor of #7870

@jsha jsha closed this Dec 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants