kvserver,spanconfig: add requisite version gates
In an internal support issue[1] we observed a mixed-version cluster
straddling 21.2 and 22.1 (where span configs were first introduced),
running with a 5x replication factor, in which the 22.1 nodes were
incorrectly down-replicating ranges towards 3x replication. This
happened because we were missing requisite version gates in KV code
when deciding which "source" to use for span configs (choosing between
the gossiped system config span and the span configs infrastructure
introduced in cockroachdb#67679). The 22.1 nodes were instead using a
fallback, statically hard-coded config with a 3x replication factor.
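
For illustration, the missing gate amounts to consulting the cluster
version alongside the existing cluster setting before trusting the new
source. A minimal sketch of the combined check, assuming the kvserver
package context (the helper name is hypothetical; in the diff below the
checks are inlined at each call site):

    // useSpanConfigStore is a hypothetical helper sketching the gating
    // pattern this commit adds; the commit itself inlines these checks at
    // each call site in pkg/kv/kvserver/store.go.
    func (s *Store) useSpanConfigStore(ctx context.Context) bool {
        return !s.cfg.SpanConfigsDisabled &&
            // Operators can toggle the infrastructure via cluster setting...
            spanconfigstore.EnabledSetting.Get(&s.ClusterSettings().SV) &&
            // ...but it only takes effect once the version gate is active,
            // i.e. every node in the cluster understands span configs.
            s.cfg.Settings.Version.IsActive(ctx, clusterversion.EnableSpanConfigStore)
    }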

Release note (bug fix): In mixed-version clusters running 21.2 and 22.1
nodes, it was possible for CockroachDB to not respect zone configs. This
manifested in a few ways:
- If num_replicas was set to something other than 3, we would still
  add or remove replicas to get to 3x replication.
- If num_voters was set explicitly to get a mix of voting and
  non-voting replicas, it would be ignored. CockroachDB could possibly
  remove non-voting replicas.
- If range_min_bytes or range_max_bytes were changed from 128 MiB and
  512 MiB respectively, we would instead try to size ranges to be within
  [128 MiB, 512 MiB]. This could appear as an excess of range splits or
  merges, as visible in the Replication Dashboard under "Range
  Operations".
- If gc.ttlseconds was set to something other than 90000 seconds, we
  would still GC data only older than 90000s/25h. If the GC TTL was set
  to something larger than 25h, AOST queries going further back may now
  start failing. For GC TTLs less than the 25h default, clusters would
  observe increased disk usage due to more retained garbage.
- If constraints, lease_preferences or voter_constraints were set, they
  would be ignored. Range data and leases could end up placed outside of
  what those settings prescribed.
This issue only lasted while the cluster was in the mixed-version state,
i.e. before the upgrade was finalized. Finalization to 22.1 happens
automatically once all nodes in the cluster are running 22.1. The only
exception is if users have set cluster.preserve_downgrade_option, in
which case auto-finalization proceeds once they issue 'RESET CLUSTER
SETTING cluster.preserve_downgrade_option', or they can run 'SET
CLUSTER SETTING version = 22.1' explicitly. Cluster version
finalization is documented at
https://www.cockroachlabs.com/docs/v22.1/upgrade-cockroach-version.
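
As a hedged illustration (not part of this commit), the finalization
statements quoted above can be driven through the same SQL-runner
helpers the new test below uses; the function name, connection handling
and imports here are assumptions made for the sketch:

    // finalizeTo221 is a hypothetical helper mirroring the statements quoted
    // in the release note. It assumes imports of "testing", gosql
    // "database/sql", and the cockroach testutils/sqlutils package.
    func finalizeTo221(t *testing.T, conn *gosql.DB) {
        tdb := sqlutils.MakeSQLRunner(conn)
        // Clearing the setting lets auto-finalization proceed once every node
        // in the cluster is running 22.1. Alternatively, running
        // `SET CLUSTER SETTING version = '22.1'` finalizes explicitly.
        tdb.Exec(t, `RESET CLUSTER SETTING cluster.preserve_downgrade_option`)
    }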

[1]: https://github.com/cockroachlabs/support/issues/2104
irfansharif committed Mar 7, 2023
1 parent 97d9ad7 commit 910aff4
Showing 2 changed files with 53 additions and 5 deletions.
pkg/kv/kvserver/store.go (10 additions & 5 deletions)
@@ -2018,7 +2018,8 @@ func (s *Store) Start(ctx context.Context, stopper *stop.Stopper) error {
 	// configs infrastructure, we want to re-apply configs on all
 	// replicas from whatever the new source is.
 	spanconfigstore.EnabledSetting.SetOnChange(&s.ClusterSettings().SV, func(ctx context.Context) {
-		enabled := spanconfigstore.EnabledSetting.Get(&s.ClusterSettings().SV)
+		enabled := spanconfigstore.EnabledSetting.Get(&s.ClusterSettings().SV) &&
+			s.cfg.Settings.Version.IsActive(ctx, clusterversion.EnableSpanConfigStore)
 		if enabled {
 			s.applyAllFromSpanConfigStore(ctx)
 		} else if scp := s.cfg.SystemConfigProvider; scp != nil {
@@ -2348,11 +2349,13 @@ func (s *Store) removeReplicaWithRangefeed(rangeID roachpb.RangeID) {
 // systemGossipUpdate is a callback for gossip updates to
 // the system config which affect range split boundaries.
 func (s *Store) systemGossipUpdate(sysCfg *config.SystemConfig) {
-	if !s.cfg.SpanConfigsDisabled && spanconfigstore.EnabledSetting.Get(&s.ClusterSettings().SV) {
+	ctx := s.AnnotateCtx(context.Background())
+	if !s.cfg.SpanConfigsDisabled &&
+		spanconfigstore.EnabledSetting.Get(&s.ClusterSettings().SV) &&
+		s.cfg.Settings.Version.IsActive(ctx, clusterversion.EnableSpanConfigStore) {
 		return // nothing to do
 	}
 
-	ctx := s.AnnotateCtx(context.Background())
 	s.computeInitialMetrics.Do(func() {
 		// Metrics depend in part on the system config. Compute them as soon as we
 		// get the first system config, then periodically in the background
@@ -2401,8 +2404,10 @@ func (s *Store) systemGossipUpdate(sysCfg *config.SystemConfig) {
 // onSpanConfigUpdate is the callback invoked whenever this store learns of a
 // span config update.
 func (s *Store) onSpanConfigUpdate(ctx context.Context, updated roachpb.Span) {
-	if !spanconfigstore.EnabledSetting.Get(&s.ClusterSettings().SV) {
-		return
+	if s.cfg.SpanConfigsDisabled ||
+		!spanconfigstore.EnabledSetting.Get(&s.ClusterSettings().SV) ||
+		!s.cfg.Settings.Version.IsActive(ctx, clusterversion.EnableSpanConfigStore) {
+		return // nothing to do
 	}
 
 	sp, err := keys.SpanAddr(updated)
pkg/migration/migrations/migrate_span_configs_test.go (43 additions & 0 deletions)
@@ -27,6 +27,7 @@ import (
 	"github.com/cockroachdb/cockroach/pkg/spanconfig/spanconfigsqlwatcher"
 	"github.com/cockroachdb/cockroach/pkg/sql/catalog/descpb"
 	"github.com/cockroachdb/cockroach/pkg/testutils"
+	"github.com/cockroachdb/cockroach/pkg/testutils/skip"
 	"github.com/cockroachdb/cockroach/pkg/testutils/sqlutils"
 	"github.com/cockroachdb/cockroach/pkg/testutils/testcluster"
 	"github.com/cockroachdb/cockroach/pkg/util/hlc"
@@ -354,3 +355,45 @@ func TestEnsureSpanConfigSubscription(t *testing.T) {
 	)
 	require.False(t, scKVSubscriber.LastUpdated().IsEmpty())
 }
+
+// TestEnablingSpanConfigs verifies that we only use the span config
+// infrastructure once the right version gate has been bumped. It's a regression
+// test for #98094.
+func TestEnablingSpanConfigs(t *testing.T) {
+	defer leaktest.AfterTest(t)()
+	defer log.Scope(t).Close(t)
+	skip.UnderRace(t, "may time out due to multiple servers")
+	skip.UnderStress(t, "too many nodes")
+
+	ctx := context.Background()
+	tc := testcluster.StartTestCluster(t, 6, base.TestClusterArgs{
+		ServerArgs: base.TestServerArgs{
+			Knobs: base.TestingKnobs{
+				Server: &server.TestingKnobs{
+					DisableAutomaticVersionUpgrade: make(chan struct{}),
+					BinaryVersionOverride: clusterversion.ByKey(
+						clusterversion.EnableSpanConfigStore - 1,
+					),
+				},
+				SpanConfig: &spanconfig.TestingKnobs{
+					ManagerDisableJobCreation: true,
+				},
+			},
+			ScanInterval: 100 * time.Millisecond, // speed up test
+		},
+	})
+
+	defer tc.Stopper().Stop(ctx)
+
+	tdb := sqlutils.MakeSQLRunner(tc.ServerConn(0))
+	tdb.Exec(t, `ALTER RANGE DEFAULT CONFIGURE ZONE USING num_replicas = 4`)
+
+	key := tc.ScratchRange(t)
+	testutils.SucceedsSoon(t, func() error {
+		desc := tc.LookupRangeOrFatal(t, key)
+		if got := len(desc.Replicas().Descriptors()); got != 4 {
+			return errors.Newf("expected 4 replicas for scratch range, found %d", got)
+		}
+		return nil
+	})
+}
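
One additional assertion that could be layered onto the test above
(sketched here as an assumption, not part of the commit) is to confirm
that the cluster version is still below the gate, i.e. that the 4x
replication observed above came from the gossiped system config span
rather than the span config infrastructure:

    // Hypothetical addition to TestEnablingSpanConfigs: the span config
    // version gate should still be inactive, given the BinaryVersionOverride
    // used when starting the cluster.
    require.False(t, tc.Server(0).ClusterSettings().Version.IsActive(
        ctx, clusterversion.EnableSpanConfigStore))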
