Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DRIVERS-2972: add message requirement to ServerDescription.error #1729

Merged
merged 22 commits into from
Feb 4, 2025

Conversation

W-A-James
Copy link
Contributor

@W-A-James W-A-James commented Nov 12, 2024

Please complete the following before merging:

  • Update changelog.
  • Test changes in at least one language driver.
  • Test these changes against all server versions and topologies (including standalone, replica set, sharded
    clusters, and serverless).

the `error` field of the new `ServerDescription` object MUST include a descriptive error explaining that it was
invalidated because the primary was determined to be stale. Drivers MAY additionally specify whether this was due to an
electionId or setVersion mismatch. A multi-threaded client MUST
[request an immediate check](server-monitoring.md#requesting-an-immediate-check) for that server as soon as possible.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we expand the scope of this ticket to include more info in other cases where we reset a server to "unknown" besides stale primary?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah I think that makes sense to do here. Seems like the only other place we do that is in the handleError function, so I'll take a look at that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So there are two places we do this in our handleError logic, but in both cases, the handleError function takes in an error as a parameter. Do we still want to be testing against the errors here?

should be none or one) and replaces its description with a default ServerDescription of type "Unknown". Additionally,
the `error` field of the new `ServerDescription` object MUST include a descriptive error explaining that it was
invalidated because the primary was determined to be stale. Drivers MAY additionally specify whether this was due to an
electionId or setVersion mismatch. A multi-threaded client MUST
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's get more specific about what the error should look like so that we can add tests.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I purposely left this a little more open-ended to give drivers leeway to use the Error/Exception API most native to their language. I can say that more explicitly in the ServerDescription.error section.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to cross-reference the ServerDescription.error section

- (=) `error`: information about the last error related to this server. Default null.
- (=) `error`: information about the last error related to this server. Default null. MUST contain or be able to produce
a string describing the error. The name of the field containing the string describing the error SHOULD be what is
most idiomatic for each driver.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What was the motivation for adding "The name of the field containing the string describing the error SHOULD be what is most idiomatic for each driver."?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking that the error field would be an object representing an error and since different languages have different ways of doing that, I think it's better to defer to driver engineers on what is the best way to communicate that to users. In Node we'd likely do that with a new MongoError subclass (which itself subclasses the native Node Error type) that uses a message field to hold human readable information about the error. We currently have it typed as a generic MongoError

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. I think we can remove that sentence since it generally always true of any spec prescribed api.

Comment on lines 495 to 496
| RSPrimary with a stale electionId is discovered | `'primary marked stale due to electionId mismatch'` |
| RSPrimary with a stale setVersion is discovered | `'primary marked stale due to setVersion mismatch'` |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we combine these two cases and include both the stale and max election tuples?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that's fair

| -------------------------------------------------------------------- | -------------------------------------------------------- |
| RSPrimary with a stale electionId is discovered | `'primary marked stale due to electionId mismatch'` |
| RSPrimary with a stale setVersion is discovered | `'primary marked stale due to setVersion mismatch'` |
| A more current RSPrimary is discovered alongside an existing primary | `'primary marked stale due to discovery of new primary'` |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we combine this case as well? IIUC the case is the same as the above but just from the other server's point of view.

@W-A-James W-A-James requested a review from ShaneHarvey December 3, 2024 18:44
@W-A-James
Copy link
Contributor Author

Node POC: mongodb/node-mongodb-native#4340

@W-A-James W-A-James changed the title add requirement to error field DRIVERS-2972: add message requirement to ServerDescription.error Dec 3, 2024
@W-A-James W-A-James marked this pull request as ready for review December 3, 2024 18:57
@W-A-James W-A-James requested a review from a team as a code owner December 3, 2024 18:57
@W-A-James W-A-James requested a review from ShaneHarvey January 15, 2025 21:53
else:
# Stale primary.
# replace serverDescription with a default ServerDescription of type "Unknown"
if serverDescription.electionId < topologyDescription.maxElectionId:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see we the pseudocode is a bit more verbose now that the logic needs to distinguish between the electionId vs setVersion stale cases. What do you think about one error message that include the two tuples:

"primary marked stale due to electionId/setVersion mismatch, <stale tuple> is stale compared to <max tuple>"

That way we don't need to refactor this code and more importantly drivers don't need to refactor their comparison logic.

@W-A-James W-A-James requested a review from ShaneHarvey January 23, 2025 19:33
@@ -906,7 +915,8 @@ for each server in topologyDescription.servers:
if server.address != serverDescription.address:
if server.type is RSPrimary:
# See note below about invalidating an old primary.
replace the server with a default ServerDescription of type "Unknown"
# replace the server with a default ServerDescription of type "Unknown"
# and an error field with a message containing the subsring "primary marked stale due to discovery of newer primary"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks a bit off to have an if-statement code block with only comments. How about:

        if server.type is RSPrimary:
            # See note below about invalidating an old primary.
            # The error field MUST include the substring "primary marked stale due to discovery of newer primary"
            replace the server with a default ServerDescription of type "Unknown"

@@ -71,6 +71,8 @@ following keys:

- type: A ServerType name, like "RSSecondary". See [ServerType](../server-discovery-and-monitoring.md#servertype) for
details pertaining to async and multi-threaded drivers.
- error: An optional object with a with a string field containing a string that must be a substring of the message on
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An optional object with a with a string field containing a string that must be...
->
An optional string that must be...


- 2024-11-04: Make the description of `TopologyDescription.servers` consistent with the spec tests.

- 2025-01-22: Add error messages when a new primary is elected or a primary with a stale electionID or setVersion is
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

electionID -> electionId

client MUST [request an immediate check](server-monitoring.md#requesting-an-immediate-check) for that server as soon as
should be none or one) and replaces its description with a default ServerDescription of type "Unknown". Additionally,
the `error` field of the new `ServerDescription` object MUST include a descriptive error explaining that it was
invalidated because the primary was determined to be stale. Drivers MAY additionally specify whether this was due to an
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This paragraph is "a note on invalidating the old primary" so I don't think it applies to all the "electionId/setVersion mismatch" case and this sentence should be moved:

Drivers MAY additionally specify whether this was due to an electionId or setVersion mismatch as described in the ServerDescripion.error section.

It should be moved to the paragraph below that starts with "If the server is primary with an obsolete electionId or setVersion,"


| circumstance | error substring |
| ---------------------------------------------------------- | -------------------------------------------------------------- |
| RSPrimary with a stale electionId/setVersion is discovered | `'primary marked stale due to electionId/setVersion mismatch'` |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this comment #1729 (comment) I had suggested adding new and old election tuples to the error message. Did you decide not to do that? Or could you add an example of the suggested error message here?

Copy link
Contributor Author

@W-A-James W-A-James Jan 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope, just missed that. Here's what the message looks like in the Node driver currently (implemented before we finalized this):

MongoStalePrimaryError: primary marked stale due to electionId/setVersion mismatch: server setVersion: 1, server electionId: 000000000000000000000001, topology setVersion: 1, topology electionId: 000000000000000000000002

After implementing this it could look like

MongoStalePrimaryError: primary marked stale due to electionId/setVersion mismatch: stale electionId/setVersion: (000000000000000000000001,1), new electionId/setVersion: (000000000000000000000002, 1)

@W-A-James W-A-James requested a review from ShaneHarvey January 24, 2025 22:18
Copy link
Member

@ShaneHarvey ShaneHarvey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just implemented this in Python and found that some of these tests are expecting an incorrect error message:

$ pytest -s -v -k 'new_primary or electionid or setversion' -- test/test_discovery_and_monitoring.py
                                                                               
test/test_discovery_and_monitoring.py::TestAllScenarios::test_rs_electionId_precedence_setVersion PASSED
test/test_discovery_and_monitoring.py::TestAllScenarios::test_rs_equal_electionids PASSED
test/test_discovery_and_monitoring.py::TestAllScenarios::test_rs_new_primary PASSED
test/test_discovery_and_monitoring.py::TestAllScenarios::test_rs_new_primary_new_electionid FAILED
test/test_discovery_and_monitoring.py::TestAllScenarios::test_rs_new_primary_new_setversion FAILED
test/test_discovery_and_monitoring.py::TestAllScenarios::test_rs_new_primary_wrong_set_name PASSED
test/test_discovery_and_monitoring.py::TestAllScenarios::test_rs_primary_disconnect_electionid FAILED
test/test_discovery_and_monitoring.py::TestAllScenarios::test_rs_primary_disconnect_setversion FAILED
test/test_discovery_and_monitoring.py::TestAllScenarios::test_rs_setversion_equal_max_without_electionid PASSED
test/test_discovery_and_monitoring.py::TestAllScenarios::test_rs_setversion_greaterthan_max_without_electionid FAILED
test/test_discovery_and_monitoring.py::TestAllScenarios::test_rs_setversion_without_electionid PASSED
test/test_discovery_and_monitoring.py::TestAllScenarios::test_rs_setversion_without_electionid-pre-6.0 FAILED
test/test_discovery_and_monitoring.py::TestAllScenarios::test_rs_use_setversion_without_electionid PASSED
test/test_discovery_and_monitoring.py::TestAllScenarios::test_rs_use_setversion_without_electionid-pre-6.0 FAILED

FAILED test/test_discovery_and_monitoring.py::TestAllScenarios::test_rs_new_primary_new_electionid - AssertionError: phase: 1: 'primary marked stale due to electionId/setVersion mismatch' not found in 'primary marked stale due to discovery of newer...
FAILED test/test_discovery_and_monitoring.py::TestAllScenarios::test_rs_new_primary_new_setversion - AssertionError: phase: 1: 'primary marked stale due to electionId/setVersion mismatch' not found in 'primary marked stale due to discovery of newer...
FAILED test/test_discovery_and_monitoring.py::TestAllScenarios::test_rs_primary_disconnect_electionid - AssertionError: phase: 0: 'primary marked stale due to electionId/setVersion mismatch' not found in 'primary marked stale due to discovery of newer...
FAILED test/test_discovery_and_monitoring.py::TestAllScenarios::test_rs_primary_disconnect_setversion - AssertionError: phase: 0: 'primary marked stale due to electionId/setVersion mismatch' not found in 'primary marked stale due to discovery of newer...
FAILED test/test_discovery_and_monitoring.py::TestAllScenarios::test_rs_setversion_greaterthan_max_without_electionid - AssertionError: phase: 1: 'primary marked stale due to electionId/setVersion mismatch' not found in 'primary marked stale due to discovery of newer...
FAILED test/test_discovery_and_monitoring.py::TestAllScenarios::test_rs_setversion_without_electionid-pre-6.0 - AssertionError: phase: 1: 'primary marked stale due to electionId/setVersion mismatch' not found in 'primary marked stale due to discovery of newer...
FAILED test/test_discovery_and_monitoring.py::TestAllScenarios::test_rs_use_setversion_without_electionid-pre-6.0 - AssertionError: phase: 1: 'primary marked stale due to electionId/setVersion mismatch' not found in 'primary marked stale due to discovery of newer...
===================================================== 7 failed, 7 passed, 219 deselected in 0.48s ======================================================

@W-A-James
Copy link
Contributor Author

@ShaneHarvey updated Node POC and tests mongodb/node-mongodb-native#4397

@W-A-James W-A-James requested a review from ShaneHarvey February 4, 2025 21:05
Copy link
Member

@ShaneHarvey ShaneHarvey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, tests pass in Python.

@W-A-James W-A-James merged commit a6897dc into mongodb:master Feb 4, 2025
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants