Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tidb & tiflash instance crash the same time #29944

Closed
zouyixiong opened this issue Nov 19, 2021 · 5 comments
Closed

tidb & tiflash instance crash the same time #29944

zouyixiong opened this issue Nov 19, 2021 · 5 comments
Assignees
Labels
component/tiflash severity/critical sig/execution SIG execution type/bug The issue is confirmed as a bug.

Comments

@zouyixiong
Copy link

cluster info

Cluster type:       tidb
Cluster name:       tidb-bi
Cluster version:    v5.2.1
Deploy user:        tidb
SSH type:           builtin
Dashboard URL:      http://10.90.1.22:2379/dashboard
ID                Role          Host        Ports                            OS/Arch       Status  Since         Data Dir                   Deploy Dir
--                ----          ----        -----                            -------       ------  -----         --------                   ----------
10.90.1.22:9093   alertmanager  10.90.1.22  9093/9094                        linux/x86_64  Up      52d23h6m42s   /data02/alertmanager-9093  /home/tidb/tidb-deploy/alertmanager-9093
10.90.1.22:3000   grafana       10.90.1.22  3000                             linux/x86_64  Up      52d23h6m46s   -                          /home/tidb/tidb-deploy/grafana-3000
10.90.1.21:2379   pd            10.90.1.21  2379/2380                        linux/x86_64  Up|L    52d23h7m      /data01/pd-2379            /home/tidb/tidb-deploy/pd-2379
10.90.1.22:2379   pd            10.90.1.22  2379/2380                        linux/x86_64  Up|UI   52d23h7m      /data01/pd-2379            /home/tidb/tidb-deploy/pd-2379
10.90.1.23:2379   pd            10.90.1.23  2379/2380                        linux/x86_64  Up      52d23h7m      /data01/pd-2379            /home/tidb/tidb-deploy/pd-2379
10.90.1.21:9090   prometheus    10.90.1.21  9090                             linux/x86_64  Up      52d23h6m47s   /data02/prometheus-8249    /home/tidb/tidb-deploy/prometheus-8249
10.90.1.21:4000   tidb          10.90.1.21  4000/10080                       linux/x86_64  Up      51d21h37m59s  -                          /home/tidb/tidb-deploy/tidb-4000
10.90.1.22:4000   tidb          10.90.1.22  4000/10080                       linux/x86_64  Up      31m21s        -                          /home/tidb/tidb-deploy/tidb-4000
10.90.1.23:9000   tiflash       10.90.1.23  9000/8123/3930/20170/20292/8234  linux/x86_64  Up      34m21s        /data02/tiflash-9000       /home/tidb/tidb-deploy/tiflash-9000
10.90.1.24:9000   tiflash       10.90.1.24  9000/8123/3930/20170/20292/8234  linux/x86_64  Up      31m36s        /data02/tiflash-9000       /home/tidb/tidb-deploy/tiflash-9000
10.90.1.27:9000   tiflash       10.90.1.27  9000/8123/3930/20170/20292/8234  linux/x86_64  Up      34m38s        /data02/tiflash-9000       /home/tidb/tidb-deploy/tiflash-9000
10.90.1.23:20160  tikv          10.90.1.23  20160/20180                      linux/x86_64  Up      52d23h6m58s   /data01/tikv-20160         /home/tidb/tidb-deploy/tikv-20160
10.90.1.24:20160  tikv          10.90.1.24  20160/20180                      linux/x86_64  Up      52d55m34s     /data01/tikv-20160         /home/tidb/tidb-deploy/tikv-20160
10.90.1.27:20160  tikv          10.90.1.27  20160/20180                      linux/x86_64  Up      52d23h6m59s   /data01/tikv-20160         /home/tidb/tidb-deploy/tikv-20160
Total nodes: 14

tidb error log

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x23bf367]

goroutine 1935367953 [running]:
github.com/pingcap/tidb/store/copr.balanceBatchCopTask.func1(0xc02ece1990, 0xc0a457b6e0, 0x3, 0x4, 0xc02ece0598, 0xc0213b29c0, 0xc05df9b450281bf7, 0xd31cdb99adc71, 0x5d62880, 0xc00011f770, ...)
	/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/store/copr/batch_coprocessor.go:161 +0x3a7
created by github.com/pingcap/tidb/store/copr.balanceBatchCopTask
	/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/store/copr/batch_coprocessor.go:139 +0x16a5

tiflash error log

2021.11.19 17:30:17.197113 [ 53 ] <Warning> CoprocessorHandler: grpc::Status DB::CoprocessorHandler::execute(): RegionException: region 21135, message: NOT_FOUND
2021.11.19 17:30:17.197711 [ 26 ] <Warning> pingcap.tikv: region {21135,8,385} find error:
2021.11.19 17:30:17.518351 [ 89 ] <Warning> CoprocessorHandler: grpc::Status DB::CoprocessorHandler::execute(): RegionException: region 21536, message: NOT_FOUND
2021.11.19 17:30:17.518848 [ 41 ] <Warning> pingcap.tikv: region {21536,8,402} find error:
2021.11.19 17:30:17.588358 [ 111 ] <Warning> CoprocessorHandler: grpc::Status DB::CoprocessorHandler::execute(): RegionException: region 21501, message: NOT_FOUND
2021.11.19 17:30:17.588835 [ 35 ] <Warning> pingcap.tikv: region {21501,8,399} find error:
2021.11.19 17:30:17.673640 [ 115 ] <Warning> CoprocessorHandler: grpc::Status DB::CoprocessorHandler::execute(): RegionException: region 21388, message: NOT_FOUND
2021.11.19 17:30:17.674027 [ 34 ] <Warning> pingcap.tikv: region {21388,8,397} find error:
2021.11.19 17:30:17.866837 [ 55 ] <Warning> CoprocessorHandler: grpc::Status DB::CoprocessorHandler::execute(): RegionException: region 21812, message: NOT_FOUND
2021.11.19 17:30:17.866904 [ 97 ] <Warning> CoprocessorHandler: grpc::Status DB::CoprocessorHandler::execute(): RegionException: region 21086, message: NOT_FOUND
2021.11.19 17:30:17.866965 [ 95 ] <Warning> CoprocessorHandler: grpc::Status DB::CoprocessorHandler::execute(): RegionException: region 18047, message: NOT_FOUND
2021.11.19 17:30:17.867445 [ 48 ] <Warning> pingcap.tikv: region {21812,8,417} find error:
2021.11.19 17:30:17.867644 [ 27 ] <Warning> pingcap.tikv: region {21086,8,383} find error:
2021.11.19 17:30:17.867811 [ 28 ] <Warning> pingcap.tikv: region {18047,8,380} find error:
2021.11.19 17:30:17.956507 [ 63 ] <Warning> CoprocessorHandler: grpc::Status DB::CoprocessorHandler::execute(): RegionException: region 19204, message: NOT_FOUND
2021.11.19 17:30:17.957006 [ 29 ] <Warning> pingcap.tikv: region {19204,8,385} find error:
2021.11.19 17:30:18.286719 [ 21 ] <Warning> pingcap.tikv: region {10908,8,228} find error:
2021.11.19 17:30:18.295061 [ 86 ] <Warning> CoprocessorHandler: grpc::Status DB::CoprocessorHandler::execute(): RegionException: region 10908, message: NOT_FOUND
2021.11.19 17:30:18.295567 [ 21 ] <Warning> pingcap.tikv: region {10908,8,228} find error:
2021.11.19 17:30:18.964232 [ 40 ] <Warning> CoprocessorHandler: grpc::Status DB::CoprocessorHandler::execute(): RegionException: region 21865, message: NOT_FOUND
2021.11.19 17:30:18.964667 [ 51 ] <Warning> pingcap.tikv: region {21865,8,416} find error:
2021.11.19 17:30:19.068749 [ 71 ] <Warning> CoprocessorHandler: grpc::Status DB::CoprocessorHandler::execute(): RegionException: region 21247, message: NOT_FOUND
2021.11.19 17:30:19.069487 [ 33 ] <Warning> pingcap.tikv: region {21247,8,392} find error:
2021.11.19 17:30:19.205673 [ 37 ] <Warning> CoprocessorHandler: grpc::Status DB::CoprocessorHandler::execute(): RegionException: region 16073, message: NOT_FOUND
2021.11.19 17:30:19.206106 [ 24 ] <Warning> pingcap.tikv: region {16073,8,373} find error:
2021.11.19 17:30:19.301921 [ 59 ] <Warning> CoprocessorHandler: grpc::Status DB::CoprocessorHandler::execute(): RegionException: region 21121, message: NOT_FOUND
2021.11.19 17:30:19.302368 [ 31 ] <Warning> pingcap.tikv: region {21121,8,386} find error:
2021.11.19 17:30:19.673579 [ 91 ] <Warning> CoprocessorHandler: grpc::Status DB::CoprocessorHandler::execute(): RegionException: region 21597, message: NOT_FOUND
2021.11.19 17:30:19.674113 [ 42 ] <Warning> pingcap.tikv: region {21597,8,402} find error:
2021.11.19 17:30:19.832922 [ 87 ] <Warning> CoprocessorHandler: grpc::Status DB::CoprocessorHandler::execute(): RegionException: region 21886, message: NOT_FOUND
2021.11.19 17:30:19.833470 [ 47 ] <Warning> pingcap.tikv: region {21886,8,411} find error:
2021.11.19 17:30:20.092732 [ 140 ] <Error> task 9: task running meets error DB::Exception: Failed to write data Stack Trace : 0. bin/tiflash/tiflash(StackTrace::StackTrace()+0x16) [0x3921fe6]
1. bin/tiflash/tiflash(DB::Exception::Exception(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int)+0x26) [0x3915e66]
2. bin/tiflash/tiflash(DB::MPPTunnel::write(mpp::MPPDataPacket const&, bool)+0x17ff) [0x805917f]
3. bin/tiflash/tiflash(DB::MPPTunnelSet::write(tipb::SelectResponse&)+0x5d) [0x805ad5d]
4. bin/tiflash/tiflash(DB::StreamingDAGResponseWriter<std::shared_ptr<DB::MPPTunnelSet> >::getEncodeTask(std::vector<DB::Block, std::allocator<DB::Block> >&, tipb::SelectResponse&) const::{lambda()#1}::operator()()+0x226) [0x8033e66]
5. bin/tiflash/tiflash(ThreadPool::worker()+0x167) [0x81b1d77]
6. bin/tiflash/tiflash() [0x8e571bf]
7. /lib64/libpthread.so.0(+0x7ea5) [0x7fb221620ea5]
8. /lib64/libc.so.6(clone+0x6d) [0x7fb2210479fd]

2021.11.19 17:30:20.127291 [ 140 ] <Error> tunnel9+-1: Failed to close tunnel: tunnel9+-1: Code: 0, e.displayText() = DB::Exception: Failed to write err, e.what() = DB::Exception, Stack trace:

0. bin/tiflash/tiflash(StackTrace::StackTrace()+0x16) [0x3921fe6]
1. bin/tiflash/tiflash(DB::Exception::Exception(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int)+0x26) [0x3915e66]
2. bin/tiflash/tiflash(DB::MPPTunnel::close(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x1a5) [0x805a9c5]
3. bin/tiflash/tiflash(DB::MPPTask::writeErrToAllTunnel(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xf9) [0x8051909]
4. bin/tiflash/tiflash(DB::MPPTask::runImpl()+0x11db) [0x8052c8b]
5. bin/tiflash/tiflash() [0x8e571bf]
6. /lib64/libpthread.so.0(+0x7ea5) [0x7fb221620ea5]
7. /lib64/libc.so.6(clone+0x6d) [0x7fb2210479fd]

2021.11.19 17:30:20.151589 [ 140 ] <Error> task 9: Failed to write error DB::Exception: Failed to write data to tunnel: tunnel9+-1: Code: 0, e.displayText() = DB::Exception: Failed to write data, e.what() = DB::Exception, Stack trace:

0. bin/tiflash/tiflash(StackTrace::StackTrace()+0x16) [0x3921fe6]
1. bin/tiflash/tiflash(DB::Exception::Exception(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int)+0x26) [0x3915e66]
2. bin/tiflash/tiflash(DB::MPPTunnel::write(mpp::MPPDataPacket const&, bool)+0x17ff) [0x805917f]
3. bin/tiflash/tiflash(DB::MPPTask::writeErrToAllTunnel(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x92) [0x80518a2]
4. bin/tiflash/tiflash(DB::MPPTask::runImpl()+0x11db) [0x8052c8b]
5. bin/tiflash/tiflash() [0x8e571bf]
6. /lib64/libpthread.so.0(+0x7ea5) [0x7fb221620ea5]
7. /lib64/libc.so.6(clone+0x6d) [0x7fb2210479fd]

2021.11.19 17:33:36.655373 [ 1487 ] <Warning> TaskManager: Begin cancel query: 429212179412811804
2021.11.19 17:33:36.657085 [ 1487 ] <Warning> TaskManager: Remaining task in query 429212179412811804 are: [429212179412811804,2]
2021.11.19 17:33:36.658763 [ 1488 ] <Warning> task 2: Begin cancel task: [429212179412811804,2]
2021.11.19 17:33:37.496000 [ 1488 ] <Error> tunnel2+-1: Failed to close tunnel: tunnel2+-1: Code: 0, e.displayText() = DB::Exception: Failed to write err, e.what() = DB::Exception, Stack trace:

0. bin/tiflash/tiflash(StackTrace::StackTrace()+0x16) [0x3921fe6]
1. bin/tiflash/tiflash(DB::Exception::Exception(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int)+0x26) [0x3915e66]
2. bin/tiflash/tiflash(DB::MPPTunnel::close(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x1a5) [0x805a9c5]
3. bin/tiflash/tiflash(DB::MPPTask::cancel(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x9d) [0x804b40d]
4. bin/tiflash/tiflash() [0x8e571bf]
5. /lib64/libpthread.so.0(+0x7ea5) [0x7fb221620ea5]
6. /lib64/libc.so.6(clone+0x6d) [0x7fb2210479fd]

2021.11.19 17:33:37.508548 [ 1488 ] <Warning> task 2: Finish cancel task: [429212179412811804,2]
2021.11.19 17:33:37.512411 [ 1487 ] <Warning> TaskManager: Finish cancel query: 429212179412811804
2021.11.19 17:33:37.573214 [ 1074 ] <Error> task 2: task running meets error DB::Exception: Failed to write data Stack Trace : 0. bin/tiflash/tiflash(StackTrace::StackTrace()+0x16) [0x3921fe6]
1. bin/tiflash/tiflash(DB::Exception::Exception(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int)+0x26) [0x3915e66]
2. bin/tiflash/tiflash(DB::MPPTunnel::write(mpp::MPPDataPacket const&, bool)+0x17ff) [0x805917f]
3. bin/tiflash/tiflash(DB::MPPTunnelSet::write(tipb::SelectResponse&)+0x5d) [0x805ad5d]
4. bin/tiflash/tiflash(DB::StreamingDAGResponseWriter<std::shared_ptr<DB::MPPTunnelSet> >::getEncodeTask(std::vector<DB::Block, std::allocator<DB::Block> >&, tipb::SelectResponse&) const::{lambda()#1}::operator()()+0x226) [0x8033e66]
5. bin/tiflash/tiflash(ThreadPool::worker()+0x167) [0x81b1d77]
6. bin/tiflash/tiflash() [0x8e571bf]
7. /lib64/libpthread.so.0(+0x7ea5) [0x7fb221620ea5]
8. /lib64/libc.so.6(clone+0x6d) [0x7fb2210479fd]

2021.11.19 17:33:37.601453 [ 1074 ] <Error> task 2: Failed to write error DB::Exception: Failed to write data to tunnel: tunnel2+-1: Code: 0, e.displayText() = DB::Exception: write to tunnel which is already closed., e.what() = DB::Exception, Stack trace:

0. bin/tiflash/tiflash(StackTrace::StackTrace()+0x16) [0x3921fe6]
1. bin/tiflash/tiflash(DB::Exception::Exception(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int)+0x26) [0x3915e66]
2. bin/tiflash/tiflash(DB::MPPTunnel::write(mpp::MPPDataPacket const&, bool)+0x1788) [0x8059108]
3. bin/tiflash/tiflash(DB::MPPTask::writeErrToAllTunnel(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x92) [0x80518a2]
4. bin/tiflash/tiflash(DB::MPPTask::runImpl()+0x11db) [0x8052c8b]
5. bin/tiflash/tiflash() [0x8e571bf]
6. /lib64/libpthread.so.0(+0x7ea5) [0x7fb221620ea5]
7. /lib64/libc.so.6(clone+0x6d) [0x7fb2210479fd]

2021.11.19 17:33:37.601632 [ 1074 ] <Error> TaskManager: The task [429212179412811804,2] cannot be found and fail to unregister
@zouyixiong zouyixiong added the type/bug The issue is confirmed as a bug. label Nov 19, 2021
@fuzhe1989
Copy link
Contributor

fuzhe1989 commented Nov 23, 2021

@zouyixiong could you provide more details and logs? From the tiflash error log I can't find any evidence the tiflash server had crashed. So far all the error messages are not fatal.

@fuzhe1989
Copy link
Contributor

/assign @wshwsh12 @fuzhe1989

@wshwsh12
Copy link
Contributor

The reason of tidb carsh is same with #28096. Fixed in v5.2.2.

@jebter jebter added the sig/execution SIG execution label Dec 7, 2021
@fuzhe1989
Copy link
Contributor

/unassign @wshwsh12

@github-actions
Copy link

github-actions bot commented Jan 5, 2022

Please check whether the issue should be labeled with 'affects-x.y' or 'fixes-x.y.z', and then remove 'needs-more-info' label.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
component/tiflash severity/critical sig/execution SIG execution type/bug The issue is confirmed as a bug.
Projects
None yet
Development

No branches or pull requests

5 participants