acceptance: TestDockerCLI failed #61896

Closed
cockroach-teamcity opened this issue Mar 12, 2021 · 16 comments
Labels
C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions)

Comments

@cockroach-teamcity
Member

cockroach-teamcity commented Mar 12, 2021

(acceptance).TestDockerCLI failed on release-21.1@050385adeb3094405ce52e73f6deb03f75a96f1a:

=== RUN   TestDockerCLI
    test_log_scope.go:73: test logs captured to: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/acceptance/logTestDockerCLI255973864
    test_log_scope.go:74: use -show-logs to present logs inline
=== CONT  TestDockerCLI
    cli_test.go:93: -- test log scope end --
test logs left over in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/acceptance/logTestDockerCLI255973864
--- FAIL: TestDockerCLI (426.64s)
=== RUN   TestDockerCLI/test_demo_node_cmds.tcl
=== CONT  TestDockerCLI/test_demo_node_cmds.tcl
    cli_test.go:89: non-zero exit code: 1
    --- FAIL: TestDockerCLI/test_demo_node_cmds.tcl (37.62s)


Parameters:

  • GOFLAGS=-json
make stressrace TESTS=TestDockerCLI PKG=./pkg/acceptance TESTTIMEOUT=5m STRESSFLAGS='-timeout 5m' 2>&1

Related:

See this test on roachdash
powered by pkg/cmd/internal/issues

Jira issue: CRDB-6260

@cockroach-teamcity cockroach-teamcity added branch-release-21.1 C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. labels Mar 12, 2021
@knz knz unassigned tbg Mar 15, 2021
@knz
Contributor

knz commented Mar 15, 2021

related: #61834

@otan could you have a look?

@otan
Contributor

otan commented Mar 15, 2021

seems like something changed with node recommissioning:

[email protected]:26257/movr> \demo decommission 4
node 4 has been decommissioned
select node_id, draining, decommissioning, membership from crdb_internal.gossip_liveness ORDER BY node_id;
[email protected]:26257/movr> select node_id, draining, decommissioning, membership from crdb_internal.gossip_liveness ORDER BY node_id;
  node_id | draining | decommissioning |   membership
----------+----------+-----------------+-----------------
        1 |  false   |      false      | active
        2 |  false   |      false      | active
        3 |  false   |      false      | active
        4 |  false   |      true       | decommissioned
        5 |  false   |      false      | active
(5 rows)

Time: 0ms total (execution 0ms / network 0ms)

\demo recommission 4
[email protected]:26257/movr> \demo recommission 4
internal server error: while trying to mark as decommissioning: rpc error: code = FailedPrecondition desc = can only recommission a decommissioning node; n4 found to be decommissioned
\demo add blah
[email protected]:26257/movr> \demo add blah
internal server error: tier must be in the form "key=value" not "blah"
\demo add region=ca-central,zone=a
[email protected]:26257/movr> \demo add region=ca-central,zone=a
internal server error: error recording initial status summaries: failed to connect to n4 at 127.108.193.241:41497: initial connection heartbeat failed: rpc error: code = PermissionDenied desc = n4 was permanently removed from the cluster at 2021-03-11 06:13:31.482096845 +0000 UTC; it is not allowed to rejoin the cluster
[email protected]:26257/movr> 
.210311 06:13:56.512951636 EXPECT TEST: TIMEOUT WAITING FOR "node 6 has been added with locality "region=ca-central,zone=a""

Maybe we're not allowed to recommission a node that has been fully decommissioned anymore? I can remove the test, but I'd rather have someone who has been working on this confirm that that's the case.

@knz
Contributor

knz commented Mar 15, 2021

maybe we're not allowed to recommission a node that has been fully decommissioned anymore?

Yes that is exactly correct. We changed that behavior a couple weeks ago. I am surprised this wasn't detected by the test when the corresponding PR was merged.
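
For context, a minimal illustrative sketch (toy Go, not the actual CockroachDB code) of the membership rule being described here: a decommissioning node can still be recommissioned, but fully decommissioned is terminal.

package main

import (
	"errors"
	"fmt"
)

// MembershipStatus is a toy stand-in for the liveness membership states
// discussed in this thread; the real types live in livenesspb.
type MembershipStatus int

const (
	Active MembershipStatus = iota
	Decommissioning
	Decommissioned
)

// validateTransition mirrors the rule described above: a decommissioning
// node can be recommissioned (rolled back to active), but a decommissioned
// node is in a terminal state and may never rejoin the cluster.
func validateTransition(from, to MembershipStatus) error {
	if from == to {
		return nil
	}
	if to == Active && from == Decommissioned {
		return errors.New("can only recommission a decommissioning node; found to be decommissioned")
	}
	if from == Decommissioned {
		return errors.New("node was permanently removed from the cluster; it is not allowed to rejoin")
	}
	return nil
}

func main() {
	fmt.Println(validateTransition(Decommissioning, Active)) // <nil>: rollback allowed
	fmt.Println(validateTransition(Decommissioned, Active))  // rejected: terminal state
}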

@otan
Contributor

otan commented Mar 15, 2021

On second thoughts, the eexpect "can only recommission a decommissioning node" step worked as expected.

However, we're failing to start node 6 with "internal server error: error recording initial status summaries: failed to connect to n4 at 127.108.193.241:41497: initial connection heartbeat failed: rpc error: code = PermissionDenied desc = n4 was permanently removed from the cluster at 2021-03-11 06:13:31.482096845 +0000 UTC; it is not allowed to rejoin the cluster". It feels like, even though the recommission should have failed, the failed recommission still affects adding new nodes?

@knz
Contributor

knz commented Mar 15, 2021

I am confused - this later error suggests that the same (virtual) store was used to add a new node. That would be an error. We want new/fresh stores for new nodes.

Is that error also happening for a fresh store?

If so probably @irfansharif or @lunevalex can help.

@otan
Contributor

otan commented Mar 15, 2021

This is failing at the step that adds a fresh store, AFAICT.

@lunevalex
Collaborator

I wonder if @erikgrinaker has an opinion about this, as someone who recently worked in this space?

@tbg
Member

tbg commented Mar 16, 2021

I took a look, and I can't put blame on either the decommissioning mechanism or the transient cluster (though the transient cluster code is a bit jumbled; why do we have both of these?):

s *server.TestServer
servers []*server.TestServer

And we certainly don't decommission properly: we're never waiting for the replicas to drain, we just yank the node out right away:

// This (cumbersome) two step process is due to the allowed state
// transitions for membership status. To mark a node as fully
// decommissioned, it has to be marked as decommissioning first.
{
	req := &serverpb.DecommissionRequest{
		NodeIDs:          []roachpb.NodeID{nodeID},
		TargetMembership: livenesspb.MembershipStatus_DECOMMISSIONING,
	}
	_, err = adminClient.Decommission(ctx, req)
	if err != nil {
		return errors.Wrap(err, "while trying to mark as decommissioning")
	}
}
{
	req := &serverpb.DecommissionRequest{
		NodeIDs:          []roachpb.NodeID{nodeID},
		TargetMembership: livenesspb.MembershipStatus_DECOMMISSIONED,
	}
	_, err = adminClient.Decommission(ctx, req)
	if err != nil {
		return errors.Wrap(err, "while trying to mark as decommissioned")
	}
}
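
For illustration only, a rough sketch of what waiting for replicas to drain before the second step might look like, assuming the admin DecommissionStatus RPC exposes a per-node ReplicaCount (treat those names as assumptions rather than a verified API on this branch):

import (
	"context"
	"time"

	"github.com/cockroachdb/cockroach/pkg/roachpb"
	"github.com/cockroachdb/cockroach/pkg/server/serverpb"
	"github.com/cockroachdb/errors"
)

// waitForZeroReplicas is a hypothetical helper: poll the decommissioning
// node until it reports no replicas left, at which point it should be safe
// to flip the target membership to DECOMMISSIONED. The DecommissionStatus
// RPC and the ReplicaCount field are assumptions here, not verified against
// the proto on this branch.
func waitForZeroReplicas(
	ctx context.Context, adminClient serverpb.AdminClient, nodeID roachpb.NodeID,
) error {
	for {
		resp, err := adminClient.DecommissionStatus(ctx, &serverpb.DecommissionStatusRequest{
			NodeIDs: []roachpb.NodeID{nodeID},
		})
		if err != nil {
			return errors.Wrap(err, "while polling decommission status")
		}
		if len(resp.Status) == 1 && resp.Status[0].ReplicaCount == 0 {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(time.Second):
		}
	}
}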

Another weird thing is that we seem to be using the s field to get a client to decommission/recommission nodes, which I assume is fine if that server is the first one (as I assume it is), at least in this test.

The thing I would really like to see, which doesn't seem to be available in the artifacts, is the server logs.

@knz
Contributor

knz commented Mar 16, 2021

Demo server logs are disabled, unfortunately.
(@otan it might be worth adding a --log-dir=logs arg to the demo command in these tests.)

@knz
Contributor

knz commented Mar 16, 2021

we can probably repro locally too

@tbg
Member

tbg commented Mar 16, 2021 via email

@knz
Contributor

knz commented Mar 16, 2021

It's not failing all the time, otherwise it would have failed in the PR that caused the failure :)

I'm going to stress this locally and report results.

@knz
Contributor

knz commented Mar 16, 2021

After 100 runs locally I am not able to repro either...

However, Tobias's remarks above about how the decommissioning process is carried out still apply. I think the code should be massaged to mimic what the node commands do.

@tbg
Member

tbg commented Mar 16, 2021

However, Tobias's remarks above about how the decommissioning process is carried out still apply. I think the code should be massaged to mimic what the node commands do.

Or rather, the work should be done on the server side, so that we don't have two identical-but-different implementations on the client side. (Or the client-side impls should be made shareable.)
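
As a sketch of the client-side sharing idea (purely illustrative; the helper name and import paths are guesses for this branch), the two-step dance quoted above could live in one shared function that both the node commands and \demo decommission call:

import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/kv/kvserver/liveness/livenesspb"
	"github.com/cockroachdb/cockroach/pkg/roachpb"
	"github.com/cockroachdb/cockroach/pkg/server/serverpb"
	"github.com/cockroachdb/errors"
)

// decommissionNode is a hypothetical shared helper wrapping the two-step
// membership transition quoted above, so that the node commands and the
// demo shell would not each carry their own copy of this logic.
func decommissionNode(
	ctx context.Context, adminClient serverpb.AdminClient, nodeID roachpb.NodeID,
) error {
	for _, target := range []livenesspb.MembershipStatus{
		livenesspb.MembershipStatus_DECOMMISSIONING,
		livenesspb.MembershipStatus_DECOMMISSIONED,
	} {
		req := &serverpb.DecommissionRequest{
			NodeIDs:          []roachpb.NodeID{nodeID},
			TargetMembership: target,
		}
		if _, err := adminClient.Decommission(ctx, req); err != nil {
			return errors.Wrapf(err, "while trying to mark as %s", target)
		}
	}
	return nil
}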

@cockroach-teamcity
Member Author

(acceptance).TestDockerCLI failed on release-21.1@8a4e9de1cc8c150f9c95b505a4f748056974bba7:

=== RUN   TestDockerCLI
    test_log_scope.go:73: test logs captured to: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/acceptance/logTestDockerCLI416694678
    test_log_scope.go:74: use -show-logs to present logs inline
=== CONT  TestDockerCLI
    cli_test.go:93: -- test log scope end --
test logs left over in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/acceptance/logTestDockerCLI416694678
--- FAIL: TestDockerCLI (414.10s)
=== RUN   TestDockerCLI/test_demo_node_cmds.tcl
=== CONT  TestDockerCLI/test_demo_node_cmds.tcl
    cli_test.go:89: non-zero exit code: 1
    --- FAIL: TestDockerCLI/test_demo_node_cmds.tcl (37.43s)


Parameters:

  • GOFLAGS=-json
make stressrace TESTS=TestDockerCLI PKG=./pkg/acceptance TESTTIMEOUT=5m STRESSFLAGS='-timeout 5m' 2>&1

Related:

See this test on roachdash
powered by pkg/cmd/internal/issues

@rafiss
Collaborator

rafiss commented Jul 6, 2023

Closing since this test failure is from an old branch

@rafiss rafiss closed this as completed Jul 6, 2023