MPPTask::runImpl is not exception safe #2322
Related logs: thread_15054_task162.zip
# According to this logging, there is a snapshot that has been held for 7399.651 seconds. In fact, the snapshot was not released for more than 20 hours during later processing, which blocks GC of old data.
[2021/07/05 00:34:37.639 +08:00] [WARN] [<unknown>] ["PageStorage: db_45.t_154.data gcApply remove 26 invalid snapshots, 2594 snapshots left, longest lifetime 7399.651 seconds, created from thread_id 15054"] [thread_id=1532]
# Tracing back by thread_id 15054, we find:
[2021/07/04 22:30:31.843 +08:00] [DEBUG] [<unknown>] ["FlashService: virtual grpc::Status DB::FlashService::DispatchMPPTask(grpc::ServerContext*, const mpp::DispatchTaskRequest*, mpp::DispatchTaskResponse*): Handling mpp dispatch request: meta {\n start_ts: 426091221432665442\n task_id: 162\n ..."] [thread_id=15054]
[2021/07/04 22:30:31.870 +08:00] [DEBUG] [<unknown>] ["task 162: begin to register the task [426091221432665442,162]"] [thread_id=15054]
[2021/07/04 22:30:31.870 +08:00] [DEBUG] [<unknown>] ["task 162: begin to register the tunnel tunnel162+185"] [thread_id=15054]
# Executes some wait index operations until they are done
[2021/07/04 22:30:31.878 +08:00] [DEBUG] [<unknown>] ["task 162: begin to register the tunnel tunnel162+210"] [thread_id=15054]
[2021/07/04 22:30:31.879 +08:00] [DEBUG] [<unknown>] ["executeQuery: (from 10.4.131.15:44025, query_id: 3fbb0844-ff45-4b13-b816-75fcbf0333a1) ... "] [thread_id=15054]
[2021/07/04 22:30:31.988 +08:00] [DEBUG] [<unknown>] ["DAGQueryBlockInterpreter: Batch read index send 5877 request got 5877 response, cost 86ms"] [thread_id=15054]
[2021/07/04 22:30:31.990 +08:00] [DEBUG] [<unknown>] ["Region: [region 264641616, applied: term 6 index 452567] need to wait learner index: 452635"] [thread_id=15054]
[2021/07/04 22:30:35.949 +08:00] [DEBUG] [<unknown>] ["Region: [region 264641616] wait learner index 452635 done"] [thread_id=15054]
...
[2021/07/04 22:31:17.480 +08:00] [DEBUG] [<unknown>] ["Region: [region 302054948] wait learner index 357795 done"] [thread_id=15054]
[2021/07/04 22:31:17.507 +08:00] [DEBUG] [<unknown>] ["DAGQueryBlockInterpreter: Finish wait index | resolve locks | check memory cache for 5876 regions, cost 175ms"] [thread_id=15054]
[2021/07/04 22:31:17.534 +08:00] [DEBUG] [<unknown>] ["DAGQueryBlockInterpreter: [Learner Read] batch read index | wait index cost 1100 ms totally, regions_num=5876, concurrency=1"] [thread_id=15054]
[2021/07/04 22:31:17.536 +08:00] [DEBUG] [<unknown>] ["DAGQueryBlockInterpreter: DB::DAGQueryBlockInterpreter::getAndLockStorageWithSchemaVersion(DB::TableID, DB::Int64)::<lambda(const String&)> Table 154 schema OK, no syncing required. Schema version [storage, global, query]: [717, 725, 725]."] [thread_id=15054]
[2021/07/04 22:31:17.560 +08:00] [DEBUG] [<unknown>] ["StorageDeltaMerge: Read with tso: 426091221432665442"] [thread_id=15054]
# Get snapshot of the Storage layer
[2021/07/04 22:31:18.550 +08:00] [DEBUG] [<unknown>] ["DeltaMergeStore[db_45.t_154]: Read create segment snapshot done"] [thread_id=15054]
[2021/07/04 22:31:18.562 +08:00] [DEBUG] [<unknown>] ["DeltaMergeStore[db_45.t_154]: Read create stream done"] [thread_id=15054]
[2021/07/04 22:31:18.904 +08:00] [DEBUG] [<unknown>] ["DAGQueryBlockInterpreter: Start to retry 4 regions ({302889825,8716,1251},{321506838,8716,1251},{280139908,12298,1297},{265243024,12831,1241},)"] [thread_id=15054]
[2021/07/04 22:31:18.948 +08:00] [DEBUG] [<unknown>] ["pingcap/coprocessor: build 4 ranges."] [thread_id=15054]
[2021/07/04 22:31:19.229 +08:00] [DEBUG] [<unknown>] ["pingcap/coprocessor: has 4 tasks."] [thread_id=15054]
...
[2021/07/04 22:31:19.835 +08:00] [INFO] [<unknown>] ["DAGQueryBlockInterpreter: execution stream size for query block(before aggregation) __QB_2_ is 36"] [thread_id=15054]
[2021/07/04 22:31:19.852 +08:00] [INFO] [<unknown>] ["DAGQueryBlockInterpreter: execution stream size for query block(before aggregation) __QB_1_ is 36"] [thread_id=15054]
[2021/07/04 22:31:19.902 +08:00] [DEBUG] [<unknown>] ["executeQuery: Query pipeline:...\n"] [thread_id=15054]
[2021/07/04 22:31:20.704 +08:00] [DEBUG] [<unknown>] ["SquashingTransform: Squashing config - min_block_size_rows: 20000 min_block_size_bytes: 0"] [thread_id=15054]
[2021/07/04 22:31:20.774 +08:00] [INFO] [<unknown>] ["MPPHandler: processing dispatch is over; the time cost is 48854 ms"] [thread_id=15054]
...
[2021/07/04 22:31:20.775 +08:00] [INFO] [<unknown>] ["task 162: task starts running"] [thread_id=28459]
[2021/07/04 22:31:20.814 +08:00] [DEBUG] [<unknown>] ["task 162: begin read "] [thread_id=28459]
[2021/07/04 22:31:57.202 +08:00] [WARN] [<unknown>] ["task 162: Begin cancel task: [426091221432665442,162]"] [thread_id=30453]
[2021/07/04 22:31:57.464 +08:00] [WARN] [<unknown>] ["task 162: Finish cancel task: [426091221432665442,162]"] [thread_id=30453]
## Exception thrown while running
[2021/07/04 22:32:00.041 +08:00] [ERROR] [<unknown>] ["task 162: task running meets error DB::Exception: Query was cancelled Stack Trace ...\n"] [thread_id=28459]
## Another exception thrown while writing the error
[2021/07/04 22:32:00.058 +08:00] [ERROR] [<unknown>] ["task 162: Failed to write error DB::Exception: Query was cancelled to all tunnels: Code: ..."] [thread_id=28459]
[2021/07/04 22:32:00.058 +08:00] [INFO] [<unknown>] ["task 162: task ends, time cost is 39350 ms."] [thread_id=28459]
[2021/07/04 22:32:00.104 +08:00] [DEBUG] [<unknown>] ["task 162: task unregistered"] [thread_id=28459]
And with the snapshot not released, we may end up in a situation like [2], where TiFlash creates lots of threads and cannot release them, eventually throwing a "Resource temporarily unavailable" exception when creating new threads, and TiFlash crashes. [1] Some threads are stuck forever.
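For illustration, here is a minimal, self-contained C++ sketch of the hazard described above. The stub functions (executeQueryPlan, writeErrToAllTunnels, closeAllTunnels, releaseTaskResources) are assumptions that only mimic the control flow seen in the logs; they are not the actual TiFlash code.

```cpp
// Illustrative sketch only; all identifiers below are assumptions, not TiFlash's real code.
#include <iostream>
#include <stdexcept>
#include <string>

void executeQueryPlan() { throw std::runtime_error("Query was cancelled"); }          // the first exception
void writeErrToAllTunnels(const std::string &) { throw std::runtime_error("tunnel already closed"); }
void closeAllTunnels() { std::cout << "tunnels closed\n"; }
void releaseTaskResources() { std::cout << "storage snapshot released\n"; }

void runImplSketch()
{
    try
    {
        executeQueryPlan();              // holds a Storage-layer snapshot while running
    }
    catch (const std::exception & e)
    {
        // Not exception safe: writing the error to the tunnels can throw again
        // ("Failed to write error ... to all tunnels" in the log above); that
        // second exception escapes runImplSketch and the cleanup below is skipped.
        writeErrToAllTunnels(e.what());
    }
    closeAllTunnels();                   // never reached in this scenario
    releaseTaskResources();              // never reached -> snapshot stays alive, GC is blocked
}

int main()
{
    try { runImplSketch(); }
    catch (const std::exception & e) { std::cout << "escaped runImplSketch: " << e.what() << "\n"; }
}
```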
@windtalker Do you have any idea how to fix this problem?
Looks like we need to make sure that ...
Steps to reproduce: check out this commit: JaySon-Huang@597dd86
I've added a failpoint exception_during_mpp_write_err_to_tunnel to mock this bug. And we can see that ~MPPTask() is not called in case 3: there is no logging like "finish MPPTask", while case 1 and case 2 have it. JaySon-Huang@c4ba52f#diff-31a1170d1da0609fde7eb7068713665415014352587f2256fd15a29fe581b3caR302-R336
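For readers unfamiliar with the failpoint approach, a generic sketch of the idea follows. This is not TiFlash's actual failpoint API; the flag and the function writeErrToAllTunnelsSketch are hypothetical and only mirror the failpoint name above.

```cpp
// Generic illustration of the failpoint idea; TiFlash's real failpoint macros are not shown here.
#include <atomic>
#include <exception>
#include <stdexcept>
#include <string>

std::atomic<bool> fail_exception_during_mpp_write_err_to_tunnel{false};

// Hypothetical error-writing path: with the flag enabled it throws, forcing the
// "second exception during error handling" case so a test can check whether the
// task destructor (the "finish MPPTask" log) still runs afterwards.
void writeErrToAllTunnelsSketch(const std::string & msg)
{
    if (fail_exception_during_mpp_write_err_to_tunnel.load())
        throw std::runtime_error("injected failure while writing error: " + msg);
    // ... normally forwards msg to every registered tunnel ...
}

int main()
{
    fail_exception_during_mpp_write_err_to_tunnel = true;           // enable the injection
    try { writeErrToAllTunnelsSketch("Query was cancelled"); }
    catch (const std::exception &) { /* the case where task cleanup is at risk */ }
}
```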
I guess these changes can resolve part of the problem, because MPPTunnel::close won't throw an exception like MPPTunnel::write does when the tunnel gets closed. However, if other exceptions are thrown in MPPTunnel::close, then MPPTask still gets stuck and cannot release resources of the Storage layer.