regionliveness: add prober for detecting live regions #113231
Conversation
Force-pushed from a0e84cd to 811be18.
Force-pushed from 9323c38 to 94039fa.
pkg/sql/regionliveness/prober.go
Outdated
const fetchDbRegions = `
SELECT
  regions
FROM system.crdb_internal.databases
Instead of using the crdb_internal virtual table, I think this should use regions/region_provider.go. It's a bit more efficient and makes it easier to reason about the behavior of fetchSystemDBRegions while the sql server is bootstrapping.
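A minimal sketch of what the suggested approach might look like, assuming a provider-style lookup instead of a virtual-table scan; the RegionProvider interface and its GetRegions method below are illustrative stand-ins, not the actual API in pkg/sql/regions/region_provider.go:

```go
// Hypothetical sketch only: the interface and method names are assumptions
// made for illustration, not CockroachDB's real region-provider API.
package regionliveness

import "context"

// RegionProvider stands in for the component that already knows the system
// database's regions, so the prober can avoid a crdb_internal virtual-table
// scan while the SQL server is bootstrapping.
type RegionProvider interface {
	GetRegions(ctx context.Context) ([]string, error)
}

// fetchSystemDBRegions resolves the system database's regions through the
// provider and returns them as a set.
func fetchSystemDBRegions(ctx context.Context, p RegionProvider) (map[string]struct{}, error) {
	regions, err := p.GetRegions(ctx)
	if err != nil {
		return nil, err
	}
	out := make(map[string]struct{}, len(regions))
	for _, r := range regions {
		out[r] = struct{}{}
	}
	return out, nil
}
```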
pkg/sql/regionliveness/prober.go
Outdated
if l.settings != nil {
	defaultTTL = slinstance.DefaultTTL.Get(&l.settings.SV)
}
txnTS := txn.KV().ReadTimestamp().GoTime()
Setting the unavailable_at timestamp must fix the transaction's commit timestamp. Otherwise an internal retry could push the transaction's timestamp and violate the protocol.
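To make the concern concrete, here is a self-contained toy illustration (not the prober code) of why a pushed commit timestamp is dangerous: unavailable_at is derived from the read timestamp plus the session TTL, so committing any later than that read timestamp reopens a window in which a session expiration can land beyond unavailable_at. The timestamps and TTL below are invented.

```go
// Toy illustration of the ordering argument; all values are invented.
package main

import (
	"fmt"
	"time"
)

func main() {
	readTS := time.Date(2023, 11, 1, 12, 0, 0, 0, time.UTC)
	defaultTTL := 40 * time.Second

	// unavailable_at is computed from the transaction's read timestamp.
	unavailableAt := readTS.Add(defaultTTL)

	// Fixing the commit timestamp to the read timestamp keeps every session
	// expiration written before the commit at or below unavailable_at.
	fixedCommitTS := readTS
	fmt.Println("fixed commit respects unavailable_at:",
		!fixedCommitTS.Add(defaultTTL).After(unavailableAt))

	// If an internal retry pushes the commit by 10s, a heartbeat ordered
	// before that pushed commit can carry an expiration of commit + TTL,
	// which lands beyond unavailable_at and violates the protocol.
	pushedCommitTS := readTS.Add(10 * time.Second)
	fmt.Println("pushed commit can violate unavailable_at:",
		pushedCommitTS.Add(defaultTTL).After(unavailableAt))
}
```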
pkg/sql/regionliveness/prober.go
Outdated
// QueryLiveness implements Prober.
func (l *livenessProber) QueryLiveness(ctx context.Context, txn *kv.Txn) (LiveRegions, error) {
	// If region liveness is disabled, return nil.
	if !RegionLivenessEnabled.Get(&l.settings.SV) {
What do you think of having RegionLivenessEnabled off mean that LiveRegions contains every region? That allows us to disable the liveness mechanism without needing to insert checks into every consumer of region liveness.
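A rough sketch of the suggested shape, with simplified stand-in types (the real QueryLiveness also takes a transaction, as quoted above): when the setting is off, the prober reports every configured region as live, so consumers never need their own enabled/disabled branch.

```go
// Simplified sketch; types and helpers are stand-ins for the real prober.
package regionliveness

import "context"

// LiveRegions is the set of regions considered live.
type LiveRegions map[string]struct{}

type livenessProber struct {
	enabled     func() bool                                 // reads the cluster setting
	allRegions  func(ctx context.Context) ([]string, error) // every configured region
	liveRegions func(ctx context.Context) ([]string, error) // regions with no unavailable_at
}

// QueryLiveness returns the live region set. With the feature disabled it
// degenerates to "every region is live", so callers can always filter by the
// returned set without checking the setting themselves.
func (l *livenessProber) QueryLiveness(ctx context.Context) (LiveRegions, error) {
	fetch := l.liveRegions
	if !l.enabled() {
		fetch = l.allRegions
	}
	regions, err := fetch(ctx)
	if err != nil {
		return nil, err
	}
	result := make(LiveRegions, len(regions))
	for _, r := range regions {
		result[r] = struct{}{}
	}
	return result, nil
}
```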
Force-pushed from 94039fa to 5e8949d.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @jeffswenson)
pkg/sql/regionliveness/prober.go
line 68 at r1 (raw file):
Previously, JeffSwenson (Jeff Swenson) wrote…
Instead of using the crdb_internal virtual table, I think this should use regions/region_provider.go. It's a bit more efficient and makes it easier to reason about the behavior of fetchSystemDBRegions while the sql server is bootstrapping.
Done.
pkg/sql/regionliveness/prober.go
line 113 at r1 (raw file):
Previously, JeffSwenson (Jeff Swenson) wrote…
Setting the unavailable_at timestamp must fix the transaction's commit timestamp. Otherwise an internal retry could push the transaction's timestamp and violate the protocol.
Done.
pkg/sql/regionliveness/prober.go
line 127 at r1 (raw file):
Previously, JeffSwenson (Jeff Swenson) wrote…
What do you think of having RegionLivenessEnabled off mean that LiveRegions contains every region? That allows us to disable the liveness mechanism without needing to insert checks into every consumer of region liveness.
Done.
Force-pushed from bb0e6fe to cd45ced.
@jeffswenson RFAL
LGTM
// Region has gone down, set the unavailable_at time on it
return l.db.Txn(ctx, func(ctx context.Context, txn isql.Txn) error {
	defaultTTL := slinstance.DefaultTTL.Get(&l.settings.SV)
Something to consider for the future: we could create a sql function that wraps this (e.g. crdb_internal.set_region_unavailable(region)). That allows us to write a roachprod test that randomly marks regions as unavailable and checks to make sure the sql_liveness expires_at timestamp is always less than the unavailable_at timestamp.
This is a good idea, and I think I'll pick this up in the next PR for the lease table.
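For a sense of what such a test could do, here is a rough sketch; note that crdb_internal.set_region_unavailable is only the reviewer's proposal and does not exist, and the connection string, region name, and column names are assumptions for illustration.

```go
// Rough sketch of a test using the proposed (not yet existing) builtin.
// Connection details, region name, and column names are assumptions.
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq"
)

func main() {
	db, err := sql.Open("postgres",
		"postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	region := "us-east1"

	// Hypothetical builtin: mark the region unavailable the way the prober would.
	if _, err := db.Exec(
		"SELECT crdb_internal.set_region_unavailable($1)", region); err != nil {
		log.Fatal(err)
	}

	// The test invariant would be that every sqlliveness expiration in the
	// region stays below this unavailable_at; here we only read it back.
	var unavailableAt string
	if err := db.QueryRow(
		"SELECT unavailable_at FROM system.region_liveness WHERE crdb_region = $1",
		region).Scan(&unavailableAt); err != nil {
		log.Fatal(err)
	}
	fmt.Println("region", region, "marked unavailable at", unavailableAt)
}
```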
Previously, queries into system tables had no way of tracking the liveness of regions. As a result, system tables for tracking state information could not be configured as SURVIVE ZONE FAILURE, since a full region outage would cause schema changes, jobs, and other infrastructure to get stuck on lost regions. To address this, this patch starts initial work on a region-liveness Prober interface, allowing these different subsystems to detect *live* regions and query by region. This patch also adds a cluster setting for gating the probing interface (sql.region_liveness.enabled).

EPIC: CRDB-28158
Fixes: cockroachdb#113160
Release note: None
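As a rough picture of the shape this describes (signatures simplified; the real QueryLiveness also takes a transaction, and the second method's name here is an assumption):

```go
// Simplified sketch of the Prober shape described above; not the exact API.
package regionliveness

import "context"

// LiveRegions is the set of regions currently considered live.
type LiveRegions map[string]struct{}

// Prober lets subsystems such as schema changes and jobs discover which
// regions are live so their system-table queries can be scoped per region.
type Prober interface {
	// QueryLiveness returns the regions that are currently live.
	QueryLiveness(ctx context.Context) (LiveRegions, error)
	// ProbeLiveness (name assumed here) checks a single region and, if it has
	// been unreachable past the configured TTL, records an unavailable_at
	// timestamp for it.
	ProbeLiveness(ctx context.Context, region string) error
}
```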
Force-pushed from cd45ced to 21ec385.
@jeffswenson TFTR

bors r+
Build succeeded.
114264: regionliveness: minor nits for initial prober implementation r=rafiss a=rafiss

- Use QueryRowEx so that the prober always runs as a privileged node user.
- Avoid using SELECT *, since that can break if the columns change, and it makes the code harder to understand.
- Fix broken errors.Is usage and use HasType instead. Also changed the check to avoid an allocation.

Epic: CRDB-28158
informs: #113231
fixes: #114245
Release note: None

114375: sql: add issue numbers and clean up todos r=rharding6373 a=rharding6373

Epic: None
Release Note: None

114381: roachtest: bump timeout for tpcc/headroom/n4cpu16 r=renatolabs a=DarrylWong

When using metamorphic builds, this test would very rarely time out or almost time out. This change raises the timeout to 4 hours.

Epic: none
Release note: none
Fixes: #114281

Co-authored-by: Rafi Shamim <[email protected]>
Co-authored-by: rharding6373 <[email protected]>
Co-authored-by: DarrylWong <[email protected]>
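For context on the errors.Is vs HasType point in the first follow-up, a small hedged example using the cockroachdb/errors package; QueryTimeoutError is a hypothetical type used only for illustration.

```go
// errors.Is matches a specific error value, so it is the wrong tool when the
// intent is "does the chain contain an error of this type"; errors.HasType
// matches by type, and passing a typed nil pointer avoids allocating a
// throwaway reference value. QueryTimeoutError is hypothetical.
package main

import (
	"fmt"

	"github.com/cockroachdb/errors"
)

// QueryTimeoutError is a stand-in error type for the example.
type QueryTimeoutError struct{ Region string }

func (e *QueryTimeoutError) Error() string {
	return fmt.Sprintf("query timed out probing region %s", e.Region)
}

func main() {
	err := errors.Wrap(&QueryTimeoutError{Region: "us-east1"}, "probe failed")

	// Type-based check, no allocation for the reference value.
	fmt.Println(errors.HasType(err, (*QueryTimeoutError)(nil))) // true

	// Value comparison: a freshly constructed reference does not match.
	fmt.Println(errors.Is(err, &QueryTimeoutError{})) // false
}
```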