
fix bug that execution info for non root mpp task is not accurate enough (#1835) #1840

Merged
merged 1 commit into pingcap:release-5.0 from ti-srebot:release-5.0-eff941f2e2c3 on Apr 30, 2021

Conversation

ti-srebot
Collaborator

@ti-srebot ti-srebot commented Apr 28, 2021

cherry-pick #1835 to release-5.0
You can switch your code base to this Pull Request by using git-extras:

# In tics repo:
git pr https://github.com/pingcap/tics/pull/1840

After applying your modifications, you can push your change to this PR via:

git push [email protected]:ti-srebot/tics.git pr/1840:release-5.0-eff941f2e2c3

What problem does this PR solve?

Issue Number: close #xxx

Problem Summary:
In the current implementation, the execution info for a non-root MPP task is merged into the root MPP task. However, the merge algorithm is not good enough, so sometimes the execution time of a non-root MPP task is very misleading, for example:

id      estRows actRows task    access object   execution info  operator info   memory  disk
Projection_10   0.00    4399    root            time:26.2s, loops:7, Concurrency:OFF    jrscfour_realtime.ods_clrg_pln_cf_inf.bfr_to_stm_txn_id, jrscfour_realtime.ods_tnar_unn_inf.txn_dt, jrscfour_realtime.ods_tnar_unn_inf.rcrd_crt_dt_tm, jrscfour_realtime.ods_tnar_unn_
└─TableReader_26        0.00    4399    root            time:26.2s, loops:7, cop_task: {num: 6, max: 0s, min: 0s, avg: 0s, p95: 0s, copr_cache_hit_ratio: 0.00} data:ExchangeSender_25  N/A     N/A
  └─ExchangeSender_25   0.00    4399    cop[tiflash]            tiflash_task:{proc max:20.2s, min:766ms, p80:19.5s, p95:20.2s, iters:295, tasks:6, threads:36}  ExchangeType: PassThrough, tasks: [5, 6, 7]     N/A     N/A
    └─HashJoin_11       0.00    4399    cop[tiflash]            tiflash_task:{proc max:20.2s, min:766ms, p80:19.5s, p95:20.2s, iters:295, tasks:6, threads:36}  left outer join, equal:[eq(jrscfour_realtime.ods_clrg_pln_cf_inf.multi_tenancy_id, jrscfour_realtime.ods_std_r
      ├─ExchangeReceiver_24(Build)      30000.00        90000   cop[tiflash]            tiflash_task:{proc max:471ms, min:0s, p80:253.1ms, p95:471ms, iters:3, tasks:6, threads:60}             N/A     N/A
      │ └─ExchangeSender_23     30000.00        30000   cop[tiflash]            tiflash_task:{proc max:40.6ms, min:0s, p80:40.6ms, p95:40.6ms, iters:1, tasks:6, threads:1}     ExchangeType: Broadcast, tasks: [4]     N/A     N/A
      │   └─Selection_22        30000.00        30000   cop[tiflash]            tiflash_task:{proc max:40.6ms, min:0s, p80:40.6ms, p95:40.6ms, iters:1, tasks:6, threads:1}     eq(jrscfour_realtime.ods_std_rmtnwy_base_inf.multi_tenancy_id, "CN000"), not(isnull(jrscfour_r
      │     └─TableFullScan_21  30000.00        30000   cop[tiflash]    table:T3        tiflash_task:{proc max:39.6ms, min:0s, p80:39.6ms, p95:39.6ms, iters:1, tasks:6, threads:1}     keep order:false        N/A     N/A
      └─HashJoin_14(Probe)      0.00    4399    cop[tiflash]            tiflash_task:{proc max:20.1s, min:738ms, p80:19.2s, p95:20.1s, iters:295, tasks:6, threads:36}  inner join, equal:[eq(jrscfour_realtime.ods_clrg_pln_cf_inf.oprt_pltfrm_txn_id, jrscfour_realtime.ods_
        ├─ExchangeReceiver_18(Build)    0.00    16500   cop[tiflash]            tiflash_task:{proc max:19.4s, min:0s, p80:19.2s, p95:19.4s, iters:2424, tasks:6, threads:60}            N/A     N/A
        │ └─ExchangeSender_17   0.00    5500    cop[tiflash]            tiflash_task:{proc max:2.58s, min:0s, p80:2.4s, p95:2.58s, iters:808, tasks:6, threads:46}      ExchangeType: Broadcast, tasks: [1, 2, 3]       N/A     N/A
        │   └─Selection_16      0.00    5500    cop[tiflash]            tiflash_task:{proc max:2.58s, min:0s, p80:2.4s, p95:2.58s, iters:808, tasks:6, threads:46}      eq(jrscfour_realtime.ods_clrg_pln_cf_inf.act_rcvpy_dt, 2000-01-01 00:00:00.000000), eq(jrscfour_realti
        │     └─TableFullScan_15        50000000.00     50000000        cop[tiflash]    table:T1        tiflash_task:{proc max:2.22s, min:0s, p80:1.98s, p95:2.22s, iters:859, tasks:6, threads:46}     keep order:false        N/A     N/A
        └─Selection_20(Probe)   13402171.90     13403267        cop[tiflash]            tiflash_task:{proc max:2.01s, min:418.1ms, p80:1.54s, p95:2.01s, iters:295, tasks:6, threads:36}        eq(jrscfour_realtime.ods_tnar_unn_inf.accnm, "01"), eq(jrscfour_realtime.ods_t
          └─TableFullScan_19    16666660.00     16666660        cop[tiflash]    table:T2        tiflash_task:{proc max:1.33s, min:299ms, p80:462ms, p95:1.33s, iters:295, tasks:6, threads:36}  keep order:false        N/A     N/A

The execution time of the MPP task ending with ExchangeSender_17 is very short, which may make the user think the data transfer from ExchangeSender_17 to ExchangeReceiver_18 costs most of the time. However, after more careful debugging, I found that ExchangeSender_17 actually costs at least 18s.

What is changed and how it works?

Proposal: xxx

What's Changed:

Some background:

  1. According to the protocol, for each response returned by TiFlash, we need to return the execution_summaries; each execution summary includes execution time, processed rows, processed iters and concurrency
  2. For a unary RPC, only one response is returned at the end of the RPC, while for a streaming RPC, multiple responses can be returned during the RPC
  3. Currently, when TiDB handles the execution_summaries returned by TiFlash/TiKV, it is not aware of whether the RPC call is unary or streaming; it treats all the responses as unary call responses, which means that for processed rows, it adds up all the processed rows in the execution_summaries as the total processed rows.
  4. Obviously, for a streaming RPC, if we return the original execution_summaries in every response, the data will be amplified in TiDB, so in order to make the data more accurate on the TiDB side, for a streaming call we always return delta execution_summaries (see the sketch after this list)
  5. A non-root MPP task is always a streaming RPC, and its parent MPP task will use TiRemoteBlockInputStream to receive the responses from the non-root MPP task. After decoding a response, TiRemoteBlockInputStream will merge the execution_summaries. When merging the execution_summaries for the streaming call, it merges the delta execution_summaries explicitly.
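
To make items 4 and 5 concrete, here is a minimal sketch of how a sender can produce delta execution_summaries for a streaming call. The struct and class names are illustrative stand-ins, not the actual TiFlash or tipb definitions.

#include <cstdint>

// Illustrative stand-in for the protobuf ExecutionSummary.
struct ExecutionSummary
{
    uint64_t time_processed_ns = 0;
    uint64_t num_produced_rows = 0;
    uint64_t num_iterations = 0;
    uint64_t concurrency = 0;
};

// For a streaming RPC, each response reports only the increment since the previous
// response, so a receiver that simply sums every response still gets correct totals.
class DeltaSummaryReporter
{
public:
    ExecutionSummary nextDelta(const ExecutionSummary & current)
    {
        ExecutionSummary delta;
        delta.time_processed_ns = current.time_processed_ns - last.time_processed_ns;
        delta.num_produced_rows = current.num_produced_rows - last.num_produced_rows;
        delta.num_iterations = current.num_iterations - last.num_iterations;
        delta.concurrency = current.concurrency; // concurrency is a gauge, not a counter
        last = current;
        return delta;
    }

private:
    ExecutionSummary last;
};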

The problem is: although TiRemoteBlockInputStream merges the delta execution_summaries explicitly, it only merges part of the streaming call's responses, because when reading data from a remote task we usually use multiple TiRemoteBlockInputStream instances to speed up the query. As a result, every TiRemoteBlockInputStream only receives part of the execution info for the non-root MPP task, which makes the merged value much smaller than the original one, as the hypothetical numbers below illustrate.
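
A hypothetical illustration of the effect, assuming the task really spends 20,000 ms and its responses are spread evenly over four TiRemoteBlockInputStream instances; every number here is made up for illustration only.

#include <cstdint>
#include <cstdio>
#include <vector>

int main()
{
    // Hypothetical numbers: the non-root MPP task really spends 20,000 ms and
    // returns 800 streaming responses, each carrying a delta execution summary.
    const uint64_t total_time_ms = 20000;
    const size_t num_responses = 800;
    const size_t num_streams = 4; // several TiRemoteBlockInputStreams read in parallel

    std::vector<uint64_t> merged_time_per_stream(num_streams, 0);
    for (size_t i = 0; i < num_responses; ++i)
        // Responses are distributed across streams; each stream only sums the
        // deltas of the responses it happens to receive.
        merged_time_per_stream[i % num_streams] += total_time_ms / num_responses;

    // Every stream reports only ~5,000 ms instead of the real 20,000 ms.
    for (size_t i = 0; i < num_streams; ++i)
        std::printf("stream %zu merged time: %llu ms\n", i,
                    static_cast<unsigned long long>(merged_time_per_stream[i]));
    return 0;
}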

This PR fixes it by supporting streaming execution_summaries in TiRemoteBlockInputStream: for a non-root MPP task, TiFlash just returns the original execution_summaries, and TiRemoteBlockInputStream handles them correctly. However, since TiDB does not yet support streaming execution_summaries, TiFlash still returns delta execution_summaries if the receiver side is TiDB. In the future, after TiDB supports streaming execution_summaries, we can remove delta execution summaries entirely. A rough sketch of this decision follows.
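
Below is a rough sketch of that decision, assuming a flag that tells whether the receiver is TiDB; the types and function names are illustrative and do not match the actual TiFlash code.

#include <cstdint>

// Same illustrative stand-in for the protobuf ExecutionSummary as in the earlier sketch.
struct ExecutionSummary
{
    uint64_t time_processed_ns = 0;
    uint64_t num_produced_rows = 0;
    uint64_t num_iterations = 0;
    uint64_t concurrency = 0;
};

// Sender side: decide what to put into the next streaming response.
ExecutionSummary summaryForResponse(const ExecutionSummary & cumulative,
                                    ExecutionSummary & last_reported,
                                    bool receiver_is_tidb)
{
    if (!receiver_is_tidb)
        return cumulative; // TiFlash receiver: send the original (streaming) summaries

    // TiDB receiver: it still sums every response, so only send the delta.
    ExecutionSummary delta;
    delta.time_processed_ns = cumulative.time_processed_ns - last_reported.time_processed_ns;
    delta.num_produced_rows = cumulative.num_produced_rows - last_reported.num_produced_rows;
    delta.num_iterations = cumulative.num_iterations - last_reported.num_iterations;
    delta.concurrency = cumulative.concurrency;
    last_reported = cumulative;
    return delta;
}

// Receiver side (one plausible treatment in TiRemoteBlockInputStream): for streaming
// summaries, the latest cumulative value supersedes the previous one instead of being
// added to it, so a stream that sees only part of the responses still ends up with
// the full numbers.
void mergeStreamingSummary(ExecutionSummary & merged, const ExecutionSummary & incoming)
{
    merged = incoming;
}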

After this PR, the above example produces the following output:

id      estRows actRows task    access object   execution info  operator info   memory  disk
Projection_10   0.00    4399    root            time:27.3s, loops:7, Concurrency:OFF    jrscfour_realtime.ods_clrg_pln_cf_inf.bfr_to_stm_txn_id, jrscfour_realtime.ods_tnar_unn_inf.txn_dt, jrscfour_realtime.ods_tnar_unn_inf.rcrd_crt_dt_tm, jrscfour_realtime.ods_tnar_unn_
└─TableReader_26        0.00    4399    root            time:27.3s, loops:7, cop_task: {num: 6, max: 0s, min: 0s, avg: 0s, p95: 0s, copr_cache_hit_ratio: 0.00} data:ExchangeSender_25  N/A     N/A
  └─ExchangeSender_25   0.00    4399    cop[tiflash]            tiflash_task:{proc max:22.1s, min:350ms, p80:20.1s, p95:22.1s, iters:295, tasks:6, threads:36}  ExchangeType: PassThrough, tasks: [5, 6, 7]     N/A     N/A
    └─HashJoin_11       0.00    4399    cop[tiflash]            tiflash_task:{proc max:22.1s, min:350ms, p80:20.1s, p95:22.1s, iters:295, tasks:6, threads:36}  left outer join, equal:[eq(jrscfour_realtime.ods_clrg_pln_cf_inf.multi_tenancy_id, jrscfour_realtime.ods_std_r
      ├─ExchangeReceiver_24(Build)      30000.00        90000   cop[tiflash]            tiflash_task:{proc max:959.9ms, min:0s, p80:518.6ms, p95:959.9ms, iters:3, tasks:6, threads:60}         N/A     N/A
      │ └─ExchangeSender_23     30000.00        30000   cop[tiflash]            tiflash_task:{proc max:74.8ms, min:0s, p80:74.8ms, p95:74.8ms, iters:1, tasks:6, threads:1}     ExchangeType: Broadcast, tasks: [4]     N/A     N/A
      │   └─Selection_22        30000.00        30000   cop[tiflash]            tiflash_task:{proc max:74.8ms, min:0s, p80:74.8ms, p95:74.8ms, iters:1, tasks:6, threads:1}     eq(jrscfour_realtime.ods_std_rmtnwy_base_inf.multi_tenancy_id, "CN000"), not(isnull(jrscfour_r
      │     └─TableFullScan_21  30000.00        30000   cop[tiflash]    table:T3        tiflash_task:{proc max:72.8ms, min:0s, p80:72.8ms, p95:72.8ms, iters:1, tasks:6, threads:1}     keep order:false        N/A     N/A
      └─HashJoin_14(Probe)      0.00    4399    cop[tiflash]            tiflash_task:{proc max:21.7s, min:348ms, p80:19.7s, p95:21.7s, iters:295, tasks:6, threads:36}  inner join, equal:[eq(jrscfour_realtime.ods_clrg_pln_cf_inf.oprt_pltfrm_txn_id, jrscfour_realtime.ods_
        ├─ExchangeReceiver_18(Build)    0.00    16500   cop[tiflash]            tiflash_task:{proc max:20.4s, min:0s, p80:20.2s, p95:20.4s, iters:2424, tasks:6, threads:60}            N/A     N/A
        │ └─ExchangeSender_17   0.00    5500    cop[tiflash]            tiflash_task:{proc max:19.7s, min:0s, p80:19.7s, p95:19.7s, iters:808, tasks:6, threads:46}     ExchangeType: Broadcast, tasks: [1, 2, 3]       N/A     N/A
        │   └─Selection_16      0.00    5500    cop[tiflash]            tiflash_task:{proc max:19.7s, min:0s, p80:19.7s, p95:19.7s, iters:808, tasks:6, threads:46}     eq(jrscfour_realtime.ods_clrg_pln_cf_inf.act_rcvpy_dt, 2000-01-01 00:00:00.000000), eq(jrscfour_realti
        │     └─TableFullScan_15        50000000.00     50000000        cop[tiflash]    table:T1        tiflash_task:{proc max:14.7s, min:0s, p80:14.7s, p95:14.7s, iters:859, tasks:6, threads:46}     keep order:false        N/A     N/A
        └─Selection_20(Probe)   13402171.90     13403267        cop[tiflash]            tiflash_task:{proc max:1.89s, min:262ms, p80:1.74s, p95:1.89s, iters:295, tasks:6, threads:36}  eq(jrscfour_realtime.ods_tnar_unn_inf.accnm, "01"), eq(jrscfour_realtime.ods_tnar_unn_
          └─TableFullScan_19    16666660.00     16666660        cop[tiflash]    table:T2        tiflash_task:{proc max:863.8ms, min:131ms, p80:778ms, p95:863.8ms, iters:295, tasks:6, threads:36}      keep order:false        N/A     N/A

As we can see, the execution time of ExchangeSender_17 is now 19.7s, which is much closer to the actual execution time.

Related changes

  • PR to update pingcap/docs/pingcap/docs-cn:
  • Need to cherry-pick to the release branch:

Check List

Tests

  • Manual test (add detailed scripts or steps below)

Side effects

Release note

  • Fix the bug that the execution info for a non-root MPP task is not accurate enough

@ti-srebot ti-srebot added CHERRY-PICK cherry pick status/LGT1 Indicates that a PR has LGTM 1. type/enhancement The issue or PR belongs to an enhancement. labels Apr 28, 2021
@ti-srebot ti-srebot requested a review from fzhedu April 28, 2021 08:30
@ti-srebot ti-srebot added this to the v5.0.2 milestone Apr 28, 2021
@windtalker
Contributor

/run-all-tests

@windtalker windtalker merged commit 294c37e into pingcap:release-5.0 Apr 30, 2021
@windtalker windtalker deleted the release-5.0-eff941f2e2c3 branch April 30, 2021 08:33
JaySon-Huang pushed a commit to JaySon-Huang/tiflash that referenced this pull request Jun 8, 2021
Signed-off-by: ti-srebot <[email protected]>

Co-authored-by: Flowyi <[email protected]>

cherry pick pingcap#1821 to release-5.0 (pingcap#1827)

Signed-off-by: ti-srebot <[email protected]>

Co-authored-by: xufei <[email protected]>

Remove ReplicatedMergeTree family (pingcap#1805) (pingcap#1826)

cherry pick pingcap#1835 to release-5.0 (pingcap#1840)

Signed-off-by: ti-srebot <[email protected]>

Co-authored-by: xufei <[email protected]>

Use file path as cache key (pingcap#1852) (pingcap#1862)

Lazily initializing DeltaMergeStore  (pingcap#1423) (pingcap#1751) (pingcap#1868)

Add background gc check thread for DeltaTree storage (pingcap#1742) (pingcap#1828)

* cherry pick pingcap#1742 to release-5.0

Signed-off-by: ti-srebot <[email protected]>

* fix conflict

Co-authored-by: lidezhu <[email protected]>

Fix calculate stable property (pingcap#1839) (pingcap#1878)

* cherry pick pingcap#1839 to release-5.0

Signed-off-by: ti-srebot <[email protected]>

* fix compile and cherry pick a small fix

Co-authored-by: lidezhu <[email protected]>

Fix client-c mistake setting current_ts to 0 when resolving lock for async commit (pingcap#1856) (pingcap#1870)

support constant folding in dbg coprocessor (pingcap#1881) (pingcap#1884)

Optimize & Reduce time cost about ci test (pingcap#1849) (pingcap#1894)

* reduce time cost about ci test by parallel
* add `-DNO_WERROR=ON` to cmake config for release-darwin build
* Fix tidb_ghpr_tics_test fail (pingcap#1895)

Signed-off-by: Zhigao Tong <[email protected]>

Fix panic with feature `compaction filter` in release-5.0 (pingcap#1891)

Init store in background task. (pingcap#1843,pingcap#1896) (pingcap#1874)

Push down filter on timestamp type column to storage layer (pingcap#1875) (pingcap#1906)

* cherry pick pingcap#1875 to release-5.0

Signed-off-by: ti-srebot <[email protected]>

* remove irrelevant code

Co-authored-by: lidezhu <[email protected]>

Fix ExchangeSender: remove duplicated write stream operation (pingcap#1379) (pingcap#1883)

cherry pick pingcap#1917 to release-5.0 (pingcap#1923)

Signed-off-by: ti-srebot <[email protected]>

Co-authored-by: Fu Zhe <[email protected]>

Pre-handle SSTFiles to DTFiles when applying snapshots (pingcap#1439) (pingcap#1867)

* Add `raft.snapshot.method` in configuration file to control the method of applying snapshots
  - "block" decode snapshot data as a block
  - "file1" decode snapshot data as DTFiles (directory mode) [**By default**]
* Add a stream `SSTFilesToBlockInputStream` to decode snapshot data into blocks
* Add a stream `BoundedSSTFilesToBlockInputStream` to bound the blocks read from SSTFilesToBlockInputStream by column `_tidb_rowid`
* Add a stream `SSTFilesToDTFilesOutputStream` that accepts `BoundedSSTFilesToBlockInputStream` as its input stream to write blocks into DTFiles
* Make `STORAGE_FORMAT_CURRENT` be `STORAGE_FORMAT_V2` by default to support ingesting DTFiles into the DT engine
* Fix the bug that the block decoded from SSTFile is not sorted by primary key and version (pingcap#1888)
* Fix panic with feature compaction filter with DTFile
* Fix ingest write amplification after snapshot apply optimization (pingcap#1913)

Co-authored-by: Zhigao Tong <[email protected]>
Signed-off-by: JaySon-Huang <[email protected]>

Cast int/real as real (pingcap#1911) (pingcap#1928)

Fix the problem that old dmfile is not removed atomically (pingcap#1918) (pingcap#1925)

Eliminate the table lock between reading,writing and DDL operators for TiFlash (pingcap#1736) (pingcap#1858)

* Eliminate the table lock between reading,writing and DDL operators for TiFlash (pingcap#1736)

1. Port the latest `RWLock` for `IStorage` from Clickhouse. "phase-fair rwlocks have been shown to provide more predictable timings compared to task-fair rwlocks for certain workloads (for relatively low share of writes, <25%). Brandenburg B, Anderson J, (2010)"
2. Refactor the lock model for `IStorage` to eliminate the lock between reading, writing, and DDL operations.
3. Acquire table lock by QueryID/threadName instead of function name
4. Tombstone the database after receiving a "drop" DDL from TiDB. Physically drop the databases / tables after gc safe point.
5. Remove some outdated tests under `tests/mutable-test/txn_schema` (all those tests are ported into `tests/delta-merge-test/raft/schema`)

* Mark FIXME for conflict codes
Conflicts:
  dbms/src/Flash/Coprocessor/DAGQueryBlockInterpreter.cpp

Signed-off-by: JaySon-Huang <[email protected]>

cached tics code incase of clone same code repeatedly (pingcap#1994) (pingcap#2001)

Add row & bytes threshold config for CompactLog (pingcap#1997) (pingcap#2005)

* cherry pick pingcap#1997 to release-5.0

Signed-off-by: ti-srebot <[email protected]>
Co-authored-by: Zhigao Tong <[email protected]>

Fix the bug that incomplete write batches are not truncated (pingcap#1934) (pingcap#2003)

Revert "Push down filter on timestamp type column to storage layer (pingcap#1875) (pingcap#1906)" (pingcap#2010)

This reverts commit 630b612.

Revert Lazily Init Store (pingcap#2011)

* Revert "Init store in background task. (pingcap#1843,pingcap#1896) (pingcap#1874)"

This reverts commit 0882461.

Conflicts:
	dbms/src/Storages/StorageDeltaMerge.cpp

* format code

* Revert "Lazily initializing DeltaMergeStore  (pingcap#1423) (pingcap#1751) (pingcap#1868)"

This reverts commit bbce050.

Conflicts:
	dbms/src/Storages/DeltaMerge/tests/gtest_dm_storage_delta_merge.cpp
	dbms/src/Storages/StorageDeltaMerge.cpp
	dbms/src/Storages/StorageDeltaMerge.h

Co-authored-by: JinheLin <[email protected]>
Co-authored-by: Flowyi <[email protected]>

Revert "background gc thread" (pingcap#2015)

* Revert "Fix calculate stable property (pingcap#1839) (pingcap#1878)"

This reverts commit 6a9eb58.

* Revert "Add background gc check thread for DeltaTree storage (pingcap#1742) (pingcap#1828)"

This reverts commit cbbbd09.

* remove code

* format code

* format code

* fix test compile

* fix comment

Fix the potential concurrency problem when clone the shared delta index (pingcap#2030) (pingcap#2033)

* cherry pick pingcap#2030 to release-5.0

Signed-off-by: ti-srebot <[email protected]>

* Update dbms/src/Storages/DeltaMerge/DeltaIndex.h

Co-authored-by: JaySon <[email protected]>

Co-authored-by: lidezhu <[email protected]>
Co-authored-by: JaySon <[email protected]>

Fix bug for unexpected deleted ingest file (pingcap#2047) (pingcap#2048)

* Only enable GC for DTFiles after they get applied to all segments.
* Refine some loggings

Signed-off-by: JaySon-Huang <[email protected]>

Fix cast int as time (pingcap#1788) (pingcap#1893)

cherry pick pingcap#1903 to release-5.0 (pingcap#1915)

Signed-off-by: ti-srebot <[email protected]>

Fix ingesting file may add ref page to deleted page (pingcap#2054) (pingcap#2055)

1. Put the ingest file id into `storage_pool.data` before ingesting them into segments
2. Add ref pages to the ingest files for each segment
3. Delete the original page id after all

Signed-off-by: JaySon-Huang <[email protected]>

Fix compile error on Mac Clang 12.0.5 (pingcap#2058) (pingcap#2070)

cherry pick pingcap#2017 pingcap#2020 pingcap#2051 to release-5.0 (pingcap#2074)

Revert "cherry pick pingcap#2017 pingcap#2020 pingcap#2051 to release-5.0 (pingcap#2074)"

This reverts commit 8e0712c.

Revert "Fix compile error on Mac Clang 12.0.5 (pingcap#2058) (pingcap#2070)"

This reverts commit 3c31b92.

Revert "Fix ingesting file may add ref page to deleted page (pingcap#2054) (pingcap#2055)"

This reverts commit c773d61.

Revert "Fix bug for unexpected deleted ingest file (pingcap#2047) (pingcap#2048)"

This reverts commit b329618.

Revert "Pre-handle SSTFiles to DTFiles when applying snapshots (pingcap#1439) (pingcap#1867)"

This reverts commit c947fd6.

 Conflicts:
	dbms/src/Common/FailPoint.cpp
	dbms/src/Storages/DeltaMerge/File/DMFileBlockOutputStream.h
	dbms/src/Storages/DeltaMerge/File/DMFileWriter.h
	dbms/src/Storages/DeltaMerge/ReorganizeBlockInputStream.h
	dbms/src/Storages/DeltaMerge/SSTFilesToBlockInputStream.cpp
	dbms/src/Storages/DeltaMerge/SSTFilesToBlockInputStream.h
	dbms/src/Storages/DeltaMerge/SSTFilesToDTFilesOutputStream.cpp
	dbms/src/Storages/DeltaMerge/Segment.cpp
	dbms/src/Storages/DeltaMerge/StableValueSpace.cpp
	dbms/src/Storages/DeltaMerge/tests/gtest_dm_file.cpp
	dbms/src/Storages/DeltaMerge/tests/gtest_dm_storage_delta_merge.cpp
	dbms/src/Storages/StorageDeltaMerge.cpp
	dbms/src/Storages/StorageDeltaMerge.h
	dbms/src/Storages/Transaction/ApplySnapshot.cpp
	dbms/src/Storages/Transaction/PartitionStreams.cpp
	tests/delta-merge-test/raft/schema/drop_on_restart.test

Fix comment

Signed-off-by: JaySon-Huang <[email protected]>