Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug Report: v19 tablet server config parsing incorrect #16392

Closed
shanth96 opened this issue Jul 15, 2024 · 2 comments · Fixed by #16393
Closed

Bug Report: v19 tablet server config parsing incorrect #16392

shanth96 opened this issue Jul 15, 2024 · 2 comments · Fixed by #16393

Comments

@shanth96
Copy link
Contributor

shanth96 commented Jul 15, 2024

Overview of the Issue

We're trying to validate v19 to upgrade our clusters and we noticed that Vitess is unable to boot up and gets stuck in a loop with the following errors:

E0711 13:57:16.790079       1 state_manager.go:294] Error transitioning to the desired state: REPLICA, Serving, will keep retrying: Code: DEADLINE_EXCEEDED
resource pool context already expired
I0711 13:57:16.790091       1 state_manager.go:745] State: exiting lameduck
E0711 13:57:16.790096       1 tm_state.go:331] Cannot start query service: Code: DEADLINE_EXCEEDED

While digging into this, we noticed that the errors were happening because schemaChangeReloadTimeout was being set to 0s (even though default is 30s and we don't explicitly configure it). This was causing the DEADLINE_EXCEEDED errors when trying to open the schema engine here

Upon further investigation, we noticed that the initConfig section of vttablet had some weird behaviour:

Reproduction Steps

The code below is a simple reproduction of the initConfig code here.

Running this in v18 vs v19 produces two different results:

package main

import (
	"fmt"
	"os"

	"vitess.io/vitess/go/vt/vttablet/tabletserver/tabletenv"
	"vitess.io/vitess/go/yaml2"
)

var tabletConfig = "<path_to_sample_config>"

func main() {
	config := tabletenv.NewDefaultConfig()
	gotBytes, _ := yaml2.Marshal(config)
	fmt.Printf("Config before unmarshaling:\n%s", gotBytes)

	bytes, err := os.ReadFile(tabletConfig)
	if err != nil {
		panic(err)
	}
	if err := yaml2.Unmarshal(bytes, config); err != nil {
		panic(err)
	}

	gotBytes, _ = yaml2.Marshal(config)
	fmt.Printf("Loaded config file %s successfully:\n%s", tabletConfig, gotBytes)
}

where tablet-config.yaml is a simple yaml file with any config:

consolidator: enable

on v19, we get the following output:

Config before unmarshaling:
consolidator: enable
consolidatorStreamQuerySize: 2097152
consolidatorStreamTotalSize: 134217728
gracePeriods: {}
healthcheck:
  degradedThresholdSeconds: 30s
  intervalSeconds: 20s
  unhealthyThresholdSeconds: 2h0m0s
hotRowProtection:
  maxConcurrency: 5
  maxGlobalQueueSize: 1000
  maxQueueSize: 20
  mode: disable
messagePostponeParallelism: 4
olap:
  txTimeoutSeconds: 30s
olapReadPool:
  idleTimeoutSeconds: 30m0s
  size: 200
oltp:
  maxRows: 10000
  queryTimeoutSeconds: 30s
  txTimeoutSeconds: 30s
oltpReadPool:
  idleTimeoutSeconds: 30m0s
  size: 16
queryCacheDoorkeeper: true
queryCacheMemory: 33554432
replicationTracker:
  heartbeatIntervalSeconds: 250ms
  mode: disable
rowStreamer:
  maxInnoDBTrxHistLen: 1000000
  maxMySQLReplLagSecs: 43200
schemaChangeReloadTimeout: 30s
schemaReloadIntervalSeconds: 30m0s
signalWhenSchemaChange: true
streamBufferSize: 32768
txPool:
  idleTimeoutSeconds: 30m0s
  size: 20
  timeoutSeconds: 1s

Loaded config file /Users/shanth/src/github.com/Shopify/vitess/go/cmd/vttablet/main/tablet-config.yaml successfully:
consolidator: enable
consolidatorStreamQuerySize: 2097152
consolidatorStreamTotalSize: 134217728
gracePeriods: {}
healthcheck:
  degradedThresholdSeconds: 30s
  intervalSeconds: 20s
  unhealthyThresholdSeconds: 2h0m0s
hotRowProtection:
  maxConcurrency: 5
  maxGlobalQueueSize: 1000
  maxQueueSize: 20
  mode: disable
messagePostponeParallelism: 4
olap:
  txTimeoutSeconds: 30s
olapReadPool:
  idleTimeoutSeconds: 30m0s
  size: 200
oltp:
  maxRows: 10000
  queryTimeoutSeconds: 30s
  txTimeoutSeconds: 30s
oltpReadPool:
  idleTimeoutSeconds: 30m0s
  size: 16
queryCacheDoorkeeper: true
queryCacheMemory: 33554432
replicationTracker:
  heartbeatIntervalSeconds: 250ms
  mode: disable
rowStreamer:
  maxInnoDBTrxHistLen: 1000000
  maxMySQLReplLagSecs: 43200
signalWhenSchemaChange: true
streamBufferSize: 32768
txPool:
  idleTimeoutSeconds: 30m0s
  size: 20
  timeoutSeconds: 1s

vs. on v18, we get:

Config before unmarshaling:
consolidator: enable
consolidatorStreamQuerySize: 2097152
consolidatorStreamTotalSize: 134217728
gracePeriods: {}
healthcheck:
  degradedThresholdSeconds: 30s
  intervalSeconds: 20s
  unhealthyThresholdSeconds: 2h0m0s
hotRowProtection:
  maxConcurrency: 5
  maxGlobalQueueSize: 1000
  maxQueueSize: 20
  mode: disable
messagePostponeParallelism: 4
olap:
  txTimeoutSeconds: 30s
olapReadPool:
  idleTimeoutSeconds: 30m0s
  size: 200
oltp:
  maxRows: 10000
  queryTimeoutSeconds: 30s
  txTimeoutSeconds: 30s
oltpReadPool:
  idleTimeoutSeconds: 30m0s
  maxWaiters: 5000
  size: 16
queryCacheDoorkeeper: true
queryCacheMemory: 33554432
replicationTracker:
  heartbeatIntervalSeconds: 250ms
  mode: disable
rowStreamer:
  maxInnoDBTrxHistLen: 1000000
  maxMySQLReplLagSecs: 43200
schemaChangeReloadTimeout: 30s
schemaReloadIntervalSeconds: 30m0s
signalWhenSchemaChange: true
streamBufferSize: 32768
txPool:
  idleTimeoutSeconds: 30m0s
  maxWaiters: 5000
  size: 20
  timeoutSeconds: 1s

Loaded config file /Users/shanth/src/github.com/Shopify/vitess/go/cmd/vttablet/main/tablet-config.yaml successfully:
consolidator: enable
consolidatorStreamQuerySize: 2097152
consolidatorStreamTotalSize: 134217728
gracePeriods: {}
healthcheck:
  degradedThresholdSeconds: 30s
  intervalSeconds: 20s
  unhealthyThresholdSeconds: 2h0m0s
hotRowProtection:
  maxConcurrency: 5
  maxGlobalQueueSize: 1000
  maxQueueSize: 20
  mode: disable
messagePostponeParallelism: 4
olap:
  txTimeoutSeconds: 30s
olapReadPool:
  idleTimeoutSeconds: 30m0s
  size: 200
oltp:
  maxRows: 10000
  queryTimeoutSeconds: 30s
  txTimeoutSeconds: 30s
oltpReadPool:
  idleTimeoutSeconds: 30m0s
  maxWaiters: 5000
  size: 16
queryCacheDoorkeeper: true
queryCacheMemory: 33554432
replicationTracker:
  heartbeatIntervalSeconds: 250ms
  mode: disable
rowStreamer:
  maxInnoDBTrxHistLen: 1000000
  maxMySQLReplLagSecs: 43200
schemaChangeReloadTimeout: 30s
schemaReloadIntervalSeconds: 30m0s
signalWhenSchemaChange: true
streamBufferSize: 32768
txPool:
  idleTimeoutSeconds: 30m0s
  maxWaiters: 5000
  size: 20
  timeoutSeconds: 1s

Particularly, note how schemaChangeReloadTimeout is not present as a config variable in v19 after unmarshaling.

Binary Version

Testing on v19.0.4 and v18.0.5

Operating System and Environment details

N/A

Log Fragments

No response

@shanth96 shanth96 added Needs Triage This issue needs to be correctly labelled and triaged Type: Bug labels Jul 15, 2024
@mattlord
Copy link
Contributor

mattlord commented Jul 15, 2024

Hi @shanth96!

I believe this behavior change is coming from: #14733

Specifically here, as the value is an empty string after reading the empty config file: https://github.com/vitessio/vitess/pull/14733/files#diff-ee5ffc675b719e69de7147068ab4ecf03f6d5db91a000d212a30f8ce61544691R424-R431

That is still in place on main:

if tmp.SchemaChangeReloadTimeout != "" {
cfg.SchemaChangeReloadTimeout, err = time.ParseDuration(tmp.SchemaChangeReloadTimeout)
if err != nil {
return err
}
} else {
cfg.SchemaChangeReloadTimeout = 0
}

on main:

❯ vttablet --version
vttablet version Version: 21.0.0-SNAPSHOT (Git revision 8a59865817029578efcac98abf1a502150d572c9 branch 'main') built on Mon Jul 15 19:04:54 EDT 2024 by [email protected] using go1.22.5 darwin/arm64

❯ cat /tmp/tablet-config.yml
consolidator: enable

❯ cat bugtest.go
package main

import (
	"fmt"
	"os"

	"vitess.io/vitess/go/vt/vttablet/tabletserver/tabletenv"
	"vitess.io/vitess/go/yaml2"
)

var tabletConfig = "/tmp/tablet-config.yml"

func main() {
	config := tabletenv.NewDefaultConfig()
	gotBytes, _ := yaml2.Marshal(config)
	fmt.Printf("Config before unmarshaling:\n%s", gotBytes)

	bytes, err := os.ReadFile(tabletConfig)
	if err != nil {
		panic(err)
	}
	if err := yaml2.Unmarshal(bytes, config); err != nil {
		panic(err)
	}

	gotBytes, _ = yaml2.Marshal(config)
	fmt.Printf("Loaded config file %s successfully:\n%s", tabletConfig, gotBytes)
}

❯ go run bugtest.go | grep -i reload
schemaChangeReloadTimeout: 30s
schemaReloadIntervalSeconds: 30m0s

On v18:

git checkout release-18.0
make build

❯ go run bugtest.go | grep -i reload
schemaChangeReloadTimeout: 30s
schemaReloadIntervalSeconds: 30m0s
schemaChangeReloadTimeout: 30s
schemaReloadIntervalSeconds: 30m0s

@mattlord mattlord self-assigned this Jul 15, 2024
@mattlord mattlord added Component: Cluster management and removed Needs Triage This issue needs to be correctly labelled and triaged labels Jul 15, 2024
@shanth96
Copy link
Contributor Author

Hi @mattlord! Thank you for the quick fix!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants