Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

regionliveness: add prober for detecting live regions #113231

Merged
merged 1 commit into from
Nov 9, 2023

Conversation

fqazi
Copy link
Collaborator

@fqazi fqazi commented Oct 27, 2023

Previously, queries into system tables had no way of tracking the liveness of regions when query system tables. As a result, system tables for tracking state information could not be configured as SURVIVE ZONE FAILURE, since a full region outage, will cause schema changes, jobs, and other infrastructure to get stuck on lost regions. To address this, this patch starts initial work with a region-liveness Prober interface, allowing these different subsystems to detect live regions and query by region. This patch also adds a cluster setting for gating the probing interface (sql.region_liveness.enabled).

EPIC: CRDB-28158
Fixes: #113160
Release note: None

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@fqazi fqazi force-pushed the regionLivenessProber branch 2 times, most recently from a0e84cd to 811be18 Compare October 27, 2023 17:47
@fqazi fqazi requested a review from jeffswenson October 27, 2023 17:48
@fqazi fqazi marked this pull request as ready for review October 27, 2023 17:48
@fqazi fqazi requested a review from a team as a code owner October 27, 2023 17:48
@fqazi fqazi force-pushed the regionLivenessProber branch 4 times, most recently from 9323c38 to 94039fa Compare October 30, 2023 13:33
const fetchDbRegions = `
SELECT
regions
FROM system.crdb_internal.databases
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of using the crdb_internal virtual table, I think this should use regions/region_provider.go. It's a bit more efficient and makes it easier to reason about the behavior of fetchSystemDBRegions while the sql server is bootstrapping.

if l.settings != nil {
defaultTTL = slinstance.DefaultTTL.Get(&l.settings.SV)
}
txnTS := txn.KV().ReadTimestamp().GoTime()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Setting the unavailable_at timestamp must fix the transaction's commit timestamp. Otherwise an internal retry could push the transaction's timestamp and violate the protocol.

// QueryLiveness implements Prober.
func (l *livenessProber) QueryLiveness(ctx context.Context, txn *kv.Txn) (LiveRegions, error) {
// If region liveness is disabled, return nil.
if !RegionLivenessEnabled.Get(&l.settings.SV) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think of having RegionLivenessEnabled -> LiveRegions contains every region? That allows us to disable the liveness mechanism without needing to insert every checks into every consumer of region liveness.

@fqazi fqazi force-pushed the regionLivenessProber branch from 94039fa to 5e8949d Compare November 3, 2023 15:36
Copy link
Collaborator Author

@fqazi fqazi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jeffswenson)


pkg/sql/regionliveness/prober.go line 68 at r1 (raw file):

Previously, JeffSwenson (Jeff Swenson) wrote…

Instead of using the crdb_internal virtual table, I think this should use regions/region_provider.go. It's a bit more efficient and makes it easier to reason about the behavior of fetchSystemDBRegions while the sql server is bootstrapping.

Done.


pkg/sql/regionliveness/prober.go line 113 at r1 (raw file):

Previously, JeffSwenson (Jeff Swenson) wrote…

Setting the unavailable_at timestamp must fix the transaction's commit timestamp. Otherwise an internal retry could push the transaction's timestamp and violate the protocol.

Done.


pkg/sql/regionliveness/prober.go line 127 at r1 (raw file):

Previously, JeffSwenson (Jeff Swenson) wrote…

What do you think of having RegionLivenessEnabled -> LiveRegions contains every region? That allows us to disable the liveness mechanism without needing to insert every checks into every consumer of region liveness.

Done.

@fqazi fqazi requested a review from jeffswenson November 3, 2023 15:37
@fqazi fqazi force-pushed the regionLivenessProber branch 2 times, most recently from bb0e6fe to cd45ced Compare November 6, 2023 21:57
@fqazi
Copy link
Collaborator Author

fqazi commented Nov 6, 2023

@jeffswenson RFAL

Copy link
Collaborator

@jeffswenson jeffswenson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM


// Region has gone down, set the unavailable_at time on it
return l.db.Txn(ctx, func(ctx context.Context, txn isql.Txn) error {
defaultTTL := slinstance.DefaultTTL.Get(&l.settings.SV)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something to consider for the future: we could create a sql function that wraps this (e.g crdb_internal.set_region_unavailable(region)). That allows us to write a roachprod test that randomly marks regions as unavailable and checks to make sure the sql_liveness expires_at timestamp is always less than the unavailable_at timestamp.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good idea and I think I'll pick this up in the next PR for lease table

Previously, queries into system tables had no way of tracking the
liveness of regions when query system tables. As a result, system tables
for tracking state information could not be configured as SURVIVE ZONE
FAILURE, since a full region outage, will cause schema changes, jobs,
and other infrastructure to get stuck on lost regions. To address this,
this patch starts initial work with a region-liveness Prober interface,
allowing these different subsystems to detect *live* regions and query
by region. This patch also adds a cluster setting for gating the probing
interface (sql.region_liveness.enabled).

EPIC: CRDB-28158
Fixes: cockroachdb#113160
Release note: None
@fqazi fqazi force-pushed the regionLivenessProber branch from cd45ced to 21ec385 Compare November 9, 2023 19:13
@fqazi
Copy link
Collaborator Author

fqazi commented Nov 9, 2023

@jeffswenson TFTR

bors r+

@craig
Copy link
Contributor

craig bot commented Nov 9, 2023

Build succeeded:

@craig craig bot merged commit 14eb6cb into cockroachdb:master Nov 9, 2023
craig bot pushed a commit that referenced this pull request Nov 13, 2023
114264: regionliveness: minor nits for initial prober implementation r=rafiss a=rafiss

- Use QueryRowEx so that the prober always runs as a privilged node user.
- Avoid using SELECT *, since that can break if the columns change, and it makes the code harder to understand.
- Fix broken errors.Is usage and use HasType instead. Also changed the check to avoid an allocation.

Epic: CRDB-28158
informs: #113231
fixes: #114245
Release note: None

114375: sql: add issue numbers and clean up todos r=rharding6373 a=rharding6373

sql: add issue numbers and clean up todos

Epic: None
Release Note: None

114381: roachtest: bump timeout for tpcc/headroom/n4cpu16 r=renatolabs a=DarrylWong

When using metamorphic builds, this test would very rarely timeout or almost timeout. This change raises the timeout to 4 hours.

Epic: none
Release note: none
Fixes: #114281

Co-authored-by: Rafi Shamim <[email protected]>
Co-authored-by: rharding6373 <[email protected]>
Co-authored-by: DarrylWong <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

sql: add prober for region liveness
3 participants