roachtest: acceptance/gossip/peerings failed #48005

Closed
cockroach-teamcity opened this issue Apr 24, 2020 · 7 comments · Fixed by #55166
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot.

@cockroach-teamcity
Member

(roachtest).acceptance/gossip/peerings failed on master@0e16cc15f139b816b8e46fe6571691a8ec0e6937:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/acceptance/gossip/peerings/run_1
	gossip.go:259,acceptance.go:91,test_runner.go:753: status: 403 Forbidden, content-type: application/json, body: {
		  "error": "not allowed (due to the 'server.remote_debugging.mode' setting)",
		  "message": "not allowed (due to the 'server.remote_debugging.mode' setting)",
		  "code": 7,
		  "details": [
		  ]
		}, error: <nil>
		github.com/cockroachdb/cockroach/pkg/util/httputil.doJSONRequest
			/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/util/httputil/http.go:116
		github.com/cockroachdb/cockroach/pkg/util/httputil.GetJSON
			/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/util/httputil/http.go:55
		main.(*gossipUtil).check.func1
			/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/gossip.go:157
		github.com/cockroachdb/cockroach/pkg/util/retry.ForDuration
			/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/util/retry/retry.go:188
		main.(*gossipUtil).check
			/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/gossip.go:153
		main.runGossipPeerings
			/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/gossip.go:258
		main.registerAcceptance.func2
			/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/acceptance.go:91
		main.(*testRunner).runTest.func2
			/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:753
		runtime.goexit
			/usr/local/go/src/runtime/asm_amd64.s:1357
		failed to get gossip status from node 1
		main.(*gossipUtil).check.func1
			/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/gossip.go:158
		github.com/cockroachdb/cockroach/pkg/util/retry.ForDuration
			/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/util/retry/retry.go:188
		main.(*gossipUtil).check
			/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/gossip.go:153
		main.runGossipPeerings
			/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/gossip.go:258
		main.registerAcceptance.func2
			/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/acceptance.go:91
		main.(*testRunner).runTest.func2
			/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:753
		runtime.goexit
			/usr/local/go/src/runtime/asm_amd64.s:1357

Artifacts: /acceptance/gossip/peerings

See this test on roachdash
powered by pkg/cmd/internal/issues

cockroach-teamcity added the branch-master, C-test-failure, O-roachtest, O-robot, and release-blocker labels Apr 24, 2020
cockroach-teamcity added this to the 20.1 milestone Apr 24, 2020
andreimatei removed the release-blocker label Apr 28, 2020
@nvanbenschoten
Member

07:42:13 test.go:325: test failure: 	gossip.go:259,acceptance.go:91,test_runner.go:753: status: 403 Forbidden, content-type: application/json, body: {
		  "error": "not allowed (due to the 'server.remote_debugging.mode' setting)",
		  "message": "not allowed (due to the 'server.remote_debugging.mode' setting)",
		  "code": 7,
		  "details": [
		  ]
		}, error: <nil>
		github.com/cockroachdb/cockroach/pkg/util/httputil.doJSONRequest
			/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/util/httputil/http.go:116
		github.com/cockroachdb/cockroach/pkg/util/httputil.GetJSON
			/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/util/httputil/http.go:55
		main.(*gossipUtil).check.func1
			/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/gossip.go:157
		github.com/cockroachdb/cockroach/pkg/util/retry.ForDuration
			/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/util/retry/retry.go:188
		main.(*gossipUtil).check
			/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/gossip.go:153
		main.runGossipPeerings
			/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/gossip.go:258
		main.registerAcceptance.func2
			/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/acceptance.go:91
		main.(*testRunner).runTest.func2
			/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:753
		runtime.goexit
			/usr/local/go/src/runtime/asm_amd64.s:1357

Looks like we're trying to hit the admin UI (/_status/gossip/local) shortly after the cluster starts up and we hit this error. This is an indication that the server has not yet received an updated version of the cluster settings; otherwise it would be aware that server.remote_debugging.mode was set to 'any' here.

nvanbenschoten self-assigned this Apr 28, 2020
@nvanbenschoten
Member

One thing we see in the logs from this node is that:

W200424 07:42:12.970613 143 server/node.go:670  [n1] [n1,s1]: unable to compute metrics: [n1,s1]: system config not yet available

fires a few times after:

I200424 07:41:42.969732 24 server/server.go:1419  [n1] starting http server at [::]:26258 (use: 10.128.0.76:26258)
I200424 07:41:49.029579 99 gossip/gossip.go:1538  [n1] node has connected to cluster via gossip

@tbg
Member

tbg commented Apr 29, 2020

Yes, that sounds reasonable. The settings are not persisted on the node, so we're always running with the default settings for a little bit of time even after signaling readiness.

Something we could do here is to wait until a system config has been ingested before returning from .Start (in the non-init case). But we're also phasing out this use of Gossip, so this will rot quickly, and the new system will use higher-level primitives. For now, seems best to add a retry or sleep to the test.

@cockroach-teamcity
Member Author

(roachtest).acceptance/gossip/peerings failed on master@61f18db7dd9a054d9a4648f67546202f760b5000:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/acceptance/gossip/peerings/run_1
	gossip.go:259,acceptance.go:91,test_runner.go:753: status: 403 Forbidden, content-type: application/json, body: {
		  "error": "not allowed (due to the 'server.remote_debugging.mode' setting)",
		  "message": "not allowed (due to the 'server.remote_debugging.mode' setting)",
		  "code": 7,
		  "details": [
		  ]
		}, error: <nil>
		github.com/cockroachdb/cockroach/pkg/util/httputil.doJSONRequest
			/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/util/httputil/http.go:116
		github.com/cockroachdb/cockroach/pkg/util/httputil.GetJSON
			/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/util/httputil/http.go:55
		main.(*gossipUtil).check.func1
			/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/gossip.go:157
		github.com/cockroachdb/cockroach/pkg/util/retry.ForDuration
			/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/util/retry/retry.go:188
		main.(*gossipUtil).check
			/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/gossip.go:153
		main.runGossipPeerings
			/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/gossip.go:258
		main.registerAcceptance.func2
			/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/acceptance.go:91
		main.(*testRunner).runTest.func2
			/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:753
		runtime.goexit
			/usr/local/go/src/runtime/asm_amd64.s:1357
		failed to get gossip status from node 1
		main.(*gossipUtil).check.func1
			/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/gossip.go:158
		github.com/cockroachdb/cockroach/pkg/util/retry.ForDuration
			/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/util/retry/retry.go:188
		main.(*gossipUtil).check
			/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/gossip.go:153
		main.runGossipPeerings
			/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/gossip.go:258
		main.registerAcceptance.func2
			/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/acceptance.go:91
		main.(*testRunner).runTest.func2
			/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:753
		runtime.goexit
			/usr/local/go/src/runtime/asm_amd64.s:1357

Artifacts: /acceptance/gossip/peerings

See this test on roachdash
powered by pkg/cmd/internal/issues

@knz
Contributor

knz commented May 1, 2020

> Yes, that sounds reasonable. The settings are not persisted on the node, so we're always running with the default settings for a little bit of time even after signaling readiness.

omg that explains so much about many other test failures I've investigated in the past. It also explains why certain SQL clients which should get some defaults initialized by settings don't get them when they connect immediately after a node starts.

I'm going to file this as a separate issue under the "rolling restarts" project.

@cockroach-teamcity
Member Author

(roachtest).acceptance/gossip/peerings failed on master@20916a30cf9356683c973f8653e8b69613a75fe4:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/acceptance/gossip/peerings/run_1
	gossip.go:259,acceptance.go:94,test_runner.go:753: status: 403 Forbidden, content-type: application/json, body: {
		  "error": "not allowed (due to the 'server.remote_debugging.mode' setting)",
		  "message": "not allowed (due to the 'server.remote_debugging.mode' setting)",
		  "code": 7,
		  "details": [
		  ]
		}, error: <nil>
		github.com/cockroachdb/cockroach/pkg/util/httputil.doJSONRequest
			/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/util/httputil/http.go:116
		github.com/cockroachdb/cockroach/pkg/util/httputil.GetJSON
			/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/util/httputil/http.go:55
		main.(*gossipUtil).check.func1
			/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/gossip.go:157
		github.com/cockroachdb/cockroach/pkg/util/retry.ForDuration
			/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/util/retry/retry.go:188
		main.(*gossipUtil).check
			/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/gossip.go:153
		main.runGossipPeerings
			/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/gossip.go:258
		main.registerAcceptance.func2
			/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/acceptance.go:94
		main.(*testRunner).runTest.func2
			/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:753
		runtime.goexit
			/usr/local/go/src/runtime/asm_amd64.s:1357
		failed to get gossip status from node 1
		main.(*gossipUtil).check.func1
			/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/gossip.go:158
		github.com/cockroachdb/cockroach/pkg/util/retry.ForDuration
			/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/util/retry/retry.go:188
		main.(*gossipUtil).check
			/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/gossip.go:153
		main.runGossipPeerings
			/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/gossip.go:258
		main.registerAcceptance.func2
			/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/acceptance.go:94
		main.(*testRunner).runTest.func2
			/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:753
		runtime.goexit
			/usr/local/go/src/runtime/asm_amd64.s:1357

Artifacts: /acceptance/gossip/peerings

See this test on roachdash
powered by pkg/cmd/internal/issues

@tbg
Member

tbg commented Jun 16, 2020

We cannot indiscriminately block on receiving the settings before signaling ready because of the chicken-and-egg problem that comes up when the node currently starting is needed for quorum on the system config range. For example, if a three node cluster is completely down, at least two of the nodes must be online before current settings are received.
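
The arithmetic behind the three-node example, as a small sketch. It assumes the system config range has one replica on each of the three nodes and that availability requires a simple majority; `quorum` and `canServe` are illustrative helpers, not existing functions.

```go
package main

import "fmt"

// quorum returns the smallest majority of n replicas.
func quorum(n int) int { return n/2 + 1 }

// canServe reports whether a range with n replicas is available when only
// live of them are up.
func canServe(n, live int) bool { return live >= quorum(n) }

func main() {
	for live := 0; live <= 3; live++ {
		fmt.Printf("replicas=3 live=%d available=%t\n", live, canServe(3, live))
	}
}
```

With a single node up, the range is unavailable, so a node that refused to become ready until it had read the settings could never help form the majority needed to serve them.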

I think we should do two things here:

  1. persist the settings locally on the first store, so that restarting nodes can come up with settings that are no staler than the ones they had when they went down.
  2. if a new node joins the cluster, wait until it has gotten the settings from gossip (and applied them) before continuing to ready status.

tbg added a commit to tbg/cockroach that referenced this issue Jun 16, 2020
In an ideal world a KV node would not declare itself as ready until
it has received the current cluster settings.

However, we cannot indiscriminately block on that because of the
chicken-and-egg problem that comes up when the node currently starting
is needed for quorum on the system config range. For example, if a three
node cluster is completely down, at least two of the nodes must be
online before current settings are received.

Instead do the following:

1. persist the settings locally on the first store whenever they are
received, so that restarting nodes can come up with settings that are no
staler than the ones they had when they went down.
2. if a new node joins the cluster, we can wait for the settings to show
up (since the node is just joining, it is not required for any quorum).

Fixes cockroachdb#48005.

Release note: None
craig bot closed this as completed in 5343d56 Aug 27, 2020
vrongmeal added a commit to vrongmeal/cockroach that referenced this issue Oct 2, 2020
Add functions to persist settings key values with the local store
prefix so restarting nodes can come up with settings that are no
staler than the ones they had when they went down.

Fixes cockroachdb#48005.

Release note: None

Signed-off-by: Vaibhav <[email protected]>
vrongmeal added a commit to vrongmeal/cockroach that referenced this issue Oct 20, 2020
In an ideal world a KV node would not declare itself as ready until
it has received the current cluster settings.

However, we cannot indiscriminately block on that because of the
chicken-and-egg problem that comes up when the node currently starting
is needed for quorum on the system config range. For example, if a three
node cluster is completely down, at least two of the nodes must be
online before current settings are received.

Instead do the following:

1. persist the settings locally on the first store whenever they are
received, so that restarting nodes can come up with settings that are no
staler than the ones they had when they went down.
2. if a new node joins the cluster, we can wait for the settings to show
up (since the node is just joining, it is not required for any quorum).

Fixes cockroachdb#48005.

Release note: None

Signed-off-by: Vaibhav <[email protected]>
vrongmeal added a commit to vrongmeal/cockroach that referenced this issue Nov 3, 2020
vrongmeal added a commit to vrongmeal/cockroach that referenced this issue Nov 9, 2020
vrongmeal added a commit to vrongmeal/cockroach that referenced this issue Nov 10, 2020
vrongmeal added a commit to vrongmeal/cockroach that referenced this issue Nov 18, 2020
craig bot pushed a commit that referenced this issue Nov 18, 2020
55166: server: ensure settings are up-to-date. r=tbg a=vrongmeal

[WIP]

Context: #50271

Fixes #48005.

Release note: None

Signed-off-by: Vaibhav <[email protected]>

Co-authored-by: Vaibhav <[email protected]>
5 participants