acceptance: TestDockerCLI failed #61896

Closed
cockroach-teamcity opened this issue Mar 12, 2021 · 16 comments
Labels
C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions)

Comments

@cockroach-teamcity
Member

cockroach-teamcity commented Mar 12, 2021

(acceptance).TestDockerCLI failed on release-21.1@050385adeb3094405ce52e73f6deb03f75a96f1a:

=== RUN   TestDockerCLI
    test_log_scope.go:73: test logs captured to: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/acceptance/logTestDockerCLI255973864
    test_log_scope.go:74: use -show-logs to present logs inline
=== CONT  TestDockerCLI
    cli_test.go:93: -- test log scope end --
test logs left over in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/acceptance/logTestDockerCLI255973864
--- FAIL: TestDockerCLI (426.64s)
=== RUN   TestDockerCLI/test_demo_node_cmds.tcl
=== CONT  TestDockerCLI/test_demo_node_cmds.tcl
    cli_test.go:89: non-zero exit code: 1
    --- FAIL: TestDockerCLI/test_demo_node_cmds.tcl (37.62s)


Parameters:

  • GOFLAGS=-json
make stressrace TESTS=TestDockerCLI PKG=./pkg/acceptance TESTTIMEOUT=5m STRESSFLAGS='-timeout 5m' 2>&1

Related:

See this test on roachdash
powered by pkg/cmd/internal/issues

Jira issue: CRDB-6260

@cockroach-teamcity cockroach-teamcity added branch-release-21.1 C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. labels Mar 12, 2021
@knz knz unassigned tbg Mar 15, 2021
@knz
Contributor

knz commented Mar 15, 2021

related: #61834

@otan could you have a look?

@otan
Contributor

otan commented Mar 15, 2021

seems like something changed with node recommissioning:

[email protected]:26257/movr> \demo decommission 4
node 4 has been decommissioned
select node_id, draining, decommissioning, membership from crdb_internal.gossip_liveness ORDER BY node_id;
[email protected]:26257/movr> select node_id, draining, decommissioning, membership from crdb_internal.gossip_liveness ORDER BY node_id;
  node_id | draining | decommissioning |   membership
----------+----------+-----------------+-----------------
        1 |  false   |      false      | active
        2 |  false   |      false      | active
        3 |  false   |      false      | active
        4 |  false   |      true       | decommissioned
        5 |  false   |      false      | active
(5 rows)

Time: 0ms total (execution 0ms / network 0ms)

\demo recommission 4
[email protected]:26257/movr> \demo recommission 4
internal server error: while trying to mark as decommissioning: rpc error: code = FailedPrecondition desc = can only recommission a decommissioning node; n4 found to be decommissioned
\demo add blah
[email protected]:26257/movr> \demo add blah
internal server error: tier must be in the form "key=value" not "blah"
\demo add region=ca-central,zone=a
[email protected]:26257/movr> \demo add region=ca-central,zone=a
internal server error: error recording initial status summaries: failed to connect to n4 at 127.108.193.241:41497: initial connection heartbeat failed: rpc error: code = PermissionDenied desc = n4 was permanently removed from the cluster at 2021-03-11 06:13:31.482096845 +0000 UTC; it is not allowed to rejoin the cluster
[email protected]:26257/movr> 
.210311 06:13:56.512951636 EXPECT TEST: TIMEOUT WAITING FOR "node 6 has been added with locality "region=ca-central,zone=a""

Maybe we're not allowed to recommission a node that has been fully decommissioned anymore? I can remove the test, but I'd rather have someone who has been working on this confirm that that's the case.

@knz
Contributor

knz commented Mar 15, 2021

maybe we're not allowed to recommission a node that has been fully decommissioned anymore?

Yes that is exactly correct. We changed that behavior a couple weeks ago. I am surprised this wasn't detected by the test when the corresponding PR was merged.
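
For context, a minimal illustrative sketch (toy Go, not the actual CockroachDB code) of the membership rule being described here: a decommissioning node can still be recommissioned, but fully decommissioned is terminal.

package main

import (
	"errors"
	"fmt"
)

// MembershipStatus is a toy stand-in for the liveness membership states
// discussed in this thread; the real types live in livenesspb.
type MembershipStatus int

const (
	Active MembershipStatus = iota
	Decommissioning
	Decommissioned
)

// validateTransition mirrors the rule described above: a decommissioning
// node can be recommissioned (rolled back to active), but a decommissioned
// node is in a terminal state and may never rejoin the cluster.
func validateTransition(from, to MembershipStatus) error {
	if from == to {
		return nil
	}
	if to == Active && from == Decommissioned {
		return errors.New("can only recommission a decommissioning node; found to be decommissioned")
	}
	if from == Decommissioned {
		return errors.New("node was permanently removed from the cluster; it is not allowed to rejoin")
	}
	return nil
}

func main() {
	fmt.Println(validateTransition(Decommissioning, Active)) // <nil>: rollback allowed
	fmt.Println(validateTransition(Decommissioned, Active))  // rejected: terminal state
}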

@otan
Contributor

otan commented Mar 15, 2021

On second thoughts, the eexpect "can only recommission a decommissioning node" step worked as expected.

However, we're failing to start node 6 with "internal server error: error recording initial status summaries: failed to connect to n4 at 127.108.193.241:41497: initial connection heartbeat failed: rpc error: code = PermissionDenied desc = n4 was permanently removed from the cluster at 2021-03-11 06:13:31.482096845 +0000 UTC; it is not allowed to rejoin the cluster". It feels like, even though the recommission should have failed, the failed recommission still affects adding new nodes?

@knz
Contributor

knz commented Mar 15, 2021

I am confused - this later error suggests that the same (virtual) store was used to add a new node. That would be an error. We want new/fresh stores for new nodes.

Is that error also happening for a fresh store?

If so probably @irfansharif or @lunevalex can help.

@otan
Contributor

otan commented Mar 15, 2021

This is failing at the step that adds a fresh store, AFAICT.

@lunevalex
Collaborator

I wonder if @erikgrinaker has an opinion about this, as someone who recently worked in this space?

@tbg
Member

tbg commented Mar 16, 2021

I took a look, and I can't put blame on either the decommissioning mechanism or the transient cluster (though the transient cluster code is a bit jumbled; why do we have both of these?):

s *server.TestServer
servers []*server.TestServer

And we certainly don't decommission properly: we're never waiting for the replicas to drain, we just yank the node out right away:

// This (cumbersome) two step process is due to the allowed state
// transitions for membership status. To mark a node as fully
// decommissioned, it has to be marked as decommissioning first.
{
	req := &serverpb.DecommissionRequest{
		NodeIDs:          []roachpb.NodeID{nodeID},
		TargetMembership: livenesspb.MembershipStatus_DECOMMISSIONING,
	}
	_, err = adminClient.Decommission(ctx, req)
	if err != nil {
		return errors.Wrap(err, "while trying to mark as decommissioning")
	}
}
{
	req := &serverpb.DecommissionRequest{
		NodeIDs:          []roachpb.NodeID{nodeID},
		TargetMembership: livenesspb.MembershipStatus_DECOMMISSIONED,
	}
	_, err = adminClient.Decommission(ctx, req)
	if err != nil {
		return errors.Wrap(err, "while trying to mark as decommissioned")
	}
}
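
For illustration only, a rough sketch of what waiting for replicas to drain before the second step might look like, assuming the admin DecommissionStatus RPC exposes a per-node ReplicaCount (treat those names as assumptions rather than a verified API on this branch):

import (
	"context"
	"time"

	"github.com/cockroachdb/cockroach/pkg/roachpb"
	"github.com/cockroachdb/cockroach/pkg/server/serverpb"
	"github.com/cockroachdb/errors"
)

// waitForZeroReplicas is a hypothetical helper: poll the decommissioning
// node until it reports no replicas left, at which point it should be safe
// to flip the target membership to DECOMMISSIONED. The DecommissionStatus
// RPC and the ReplicaCount field are assumptions here, not verified against
// the proto on this branch.
func waitForZeroReplicas(
	ctx context.Context, adminClient serverpb.AdminClient, nodeID roachpb.NodeID,
) error {
	for {
		resp, err := adminClient.DecommissionStatus(ctx, &serverpb.DecommissionStatusRequest{
			NodeIDs: []roachpb.NodeID{nodeID},
		})
		if err != nil {
			return errors.Wrap(err, "while polling decommission status")
		}
		if len(resp.Status) == 1 && resp.Status[0].ReplicaCount == 0 {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(time.Second):
		}
	}
}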

Another weird thing is that we seem to be using the s field to get a client to decommission/recommission nodes, which I assume is fine if that server is the first one (as I assume it is), at least in this test.

The thing I would really like to see, which doesn't seem to be available in the artifacts, is the server logs.

@knz
Contributor

knz commented Mar 16, 2021

Demo server logs are disabled, unfortunately.
(@otan it might be worth adding a --log-dir=logs arg to the demo command in these tests.)

@knz
Contributor

knz commented Mar 16, 2021

we can probably repro locally too

@tbg
Member

tbg commented Mar 16, 2021 via email

@knz
Contributor

knz commented Mar 16, 2021

It's not failing all the time, otherwise it would have failed in the PR that caused the failure :)

I'm going to stress this locally and report results.

@knz
Contributor

knz commented Mar 16, 2021

After 100 runs locally I am not able to repro either...

However, Tobias's remarks above about how the decommissioning process is carried out still apply. I think the code should be massaged to mimic what the node commands do.

@tbg
Member

tbg commented Mar 16, 2021

However, Tobias's remarks above about how the decommissioning process is carried out still apply. I think the code should be massaged to mimic what the node commands do.

Or rather, the work should be done on the server side, so that we don't have two identical-but-different implementations on the client side. (Or the client-side impls should be made shareable.)
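
As a sketch of the client-side sharing idea (purely illustrative; the helper name and import paths are guesses for this branch), the two-step dance quoted above could live in one shared function that both the node commands and \demo decommission call:

import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/kv/kvserver/liveness/livenesspb"
	"github.com/cockroachdb/cockroach/pkg/roachpb"
	"github.com/cockroachdb/cockroach/pkg/server/serverpb"
	"github.com/cockroachdb/errors"
)

// decommissionNode is a hypothetical shared helper wrapping the two-step
// membership transition quoted above, so that the node commands and the
// demo shell would not each carry their own copy of this logic.
func decommissionNode(
	ctx context.Context, adminClient serverpb.AdminClient, nodeID roachpb.NodeID,
) error {
	for _, target := range []livenesspb.MembershipStatus{
		livenesspb.MembershipStatus_DECOMMISSIONING,
		livenesspb.MembershipStatus_DECOMMISSIONED,
	} {
		req := &serverpb.DecommissionRequest{
			NodeIDs:          []roachpb.NodeID{nodeID},
			TargetMembership: target,
		}
		if _, err := adminClient.Decommission(ctx, req); err != nil {
			return errors.Wrapf(err, "while trying to mark as %s", target)
		}
	}
	return nil
}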

@cockroach-teamcity
Member Author

(acceptance).TestDockerCLI failed on release-21.1@8a4e9de1cc8c150f9c95b505a4f748056974bba7:

=== RUN   TestDockerCLI
    test_log_scope.go:73: test logs captured to: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/acceptance/logTestDockerCLI416694678
    test_log_scope.go:74: use -show-logs to present logs inline
=== CONT  TestDockerCLI
    cli_test.go:93: -- test log scope end --
test logs left over in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/acceptance/logTestDockerCLI416694678
--- FAIL: TestDockerCLI (414.10s)
=== RUN   TestDockerCLI/test_demo_node_cmds.tcl
=== CONT  TestDockerCLI/test_demo_node_cmds.tcl
    cli_test.go:89: non-zero exit code: 1
    --- FAIL: TestDockerCLI/test_demo_node_cmds.tcl (37.43s)


Parameters:

  • GOFLAGS=-json
make stressrace TESTS=TestDockerCLI PKG=./pkg/acceptance TESTTIMEOUT=5m STRESSFLAGS='-timeout 5m' 2>&1

Related:

See this test on roachdash
powered by pkg/cmd/internal/issues

@rafiss
Collaborator

rafiss commented Jul 6, 2023

Closing since this test failure is from an old branch

@rafiss rafiss closed this as completed Jul 6, 2023