improve the robust of balance region scheduler #85

bufferflies · 2021-12-28T11:57:39Z

No description provided.

Signed-off-by: bufferflies <[email protected]>

text/0083-Improve-the-robust-of-balance-scheduler.md

tonyxuqqi · 2021-12-29T05:04:04Z

text/0083-Improve-the-robust-of-balance-scheduler.md

+
+### Store pick strategy
+
+It can arrange all the store based on label, like TiKV and TiFlash and allow low score group has more chance to scheduler. But the first score region should has highest priority to be selected.


Please provide more detail about the labels and how a low-score group gets more chances to be scheduled.

tonyxuqqi · 2021-12-29T05:08:39Z

text/0083-Improve-the-robust-of-balance-scheduler.md

+#### Consider Influence to leader
+
+Normally, one operator is made of region, source store and target store, the key works finished by region leader such as snapshot generate, snapshot send. It is not friendly to the leader if majority operator is add follow.
+


"It is not friendly to the leader if majority operator is add follow": could you explain this in a bit more detail? For a single region leader, there should be at most 1 or 2 add-follower operators. Or do you mean the whole store?

bufferflies · Dec 29, 2021

I mean that generating and sending snapshots on the region leader consumes CPU and I/O resources.

tonyxuqqi · 2021-12-29T05:15:34Z

text/0083-Improve-the-robust-of-balance-scheduler.md

+
+![](https://latex.codecogs.com/gif.image?\dpi{200}&space;\bg_white&space;Influence=\sum_{i=0}^{j}step_{i}.Influence&space;\newline&space;Cost&space;=&space;200*ln{\frac{region_{size}}{100KiB}})
+
+Cost equals 200 if operator influence is 1Mb or equals 600 if operator influence is 1gb.


1Mb -> 1MB
1gb -> 1GB
Should ln be log10? Otherwise 200 * ln(1MB/100KB) won't be 200.
Why use log(region_size/100KB)? It makes little difference between a region size of 500MB and 1GB, for example, but the actual cost difference between 500MB and 1GB is much bigger.
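
To make the numbers in this comment concrete, here is a small sketch that evaluates the quoted formula with both the natural log and log10. It assumes binary units (1 MiB = 1024 KiB) and the 100KiB base from the formula; the function name `cost` is hypothetical and not part of PD.

```python
import math

BASE_KIB = 100.0               # denominator in the quoted formula (100KiB)
KIB_PER_MIB = 1024.0           # assumption: binary units
KIB_PER_GIB = 1024.0 * 1024.0


def cost(region_size_kib, log=math.log):
    """Evaluate 200 * log(region_size / 100KiB) for the given log function."""
    return 200.0 * log(region_size_kib / BASE_KIB)


if __name__ == "__main__":
    sizes = [("1MiB", KIB_PER_MIB), ("500MiB", 500 * KIB_PER_MIB), ("1GiB", KIB_PER_GIB)]
    for label, size_kib in sizes:
        # Print the cost under both candidate log bases for comparison.
        print(f"{label}: ln={cost(size_kib):.1f} log10={cost(size_kib, math.log10):.1f}")
```

With log10, 1 MiB comes out at roughly 202, close to the 200 stated in the RFC text, while ln gives roughly 465. Under either base, doubling the region size adds only a constant (200 * log(2)), which is the flattening questioned above.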

tonyxuqqi · 2021-12-29T05:19:06Z

text/0083-Improve-the-robust-of-balance-scheduler.md

+
+#### Operator life cycle
+
+The operator life cycle can divide into some stages: create, executing(started), complete. PD will check operator stage by region heart beats and cancel operator if one operator‘s running time exceed the fixed value(10m).


10m may not be enough.
Why not make it configurable?

What if TiKV crashes? The heartbeat request may no longer carry the operator info; how will PD handle that?
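
As a rough illustration of the configurable timeout suggested here, the following sketch models the stages named in the RFC text (create, executing, complete) and cancels an operator once its running time exceeds a per-operator limit. All names (`Stage`, `Operator`, `on_heartbeat`, `timeout_s`) and the extra `CANCELED` stage are assumptions for illustration, not PD's types. It only reacts to heartbeats, so it does not address the crashed-TiKV case raised above, where no heartbeat arrives.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Stage(Enum):
    CREATED = "create"
    STARTED = "executing"
    COMPLETE = "complete"
    CANCELED = "canceled"  # assumption: not one of the stages listed in the RFC text


@dataclass
class Operator:
    region_id: int
    timeout_s: float = 600.0  # configurable; the default mirrors the fixed 10m
    stage: Stage = Stage.CREATED
    started_at: Optional[float] = None

    def start(self, now: float) -> None:
        """Move the operator into the executing stage and record the start time."""
        self.stage = Stage.STARTED
        self.started_at = now

    def on_heartbeat(self, now: float, finished: bool) -> Stage:
        """Advance the stage based on a region heartbeat observed at time `now`."""
        if self.stage is not Stage.STARTED:
            return self.stage
        if finished:
            self.stage = Stage.COMPLETE
        elif now - self.started_at > self.timeout_s:
            self.stage = Stage.CANCELED
        return self.stage
```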

tonyxuqqi · 2021-12-29T05:26:16Z

text/0083-Improve-the-robust-of-balance-scheduler.md

+
+The operator life cycle can divide into some stages: create, executing(started), complete. PD will check operator stage by region heart beats and cancel operator if one operator‘s running time exceed the fixed value(10m).
+
+It will be better if we can calculate every step expecting execute duration by major factor includes region size, IO limit and operator concurrency like this:


There's no per-snapshot limit, so each snapshot's speed cannot simply be 100MB/6.
Also, snapshot generation time cannot be ignored in the single-RocksDB-instance version, because we have to scan to build the region's snapshot.
I think the total time threshold should be pretty conservative, probably 1 hour at least.

bufferflies · Dec 29, 2021

I disagree that this threshold should be conservative. If a region has two peers and needs one more, it would wait one hour before trying another target store when the original target store is down or fails for some other reason.
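
The trade-off in this thread can be sketched numerically. The RFC's own formula is not shown in the quoted hunk, so the code below is only an assumed reading: the IO limit is split evenly across concurrent operators (the very assumption questioned above), and a separate term accounts for snapshot generation. The function names and the `generate_s` and `slack` parameters are hypothetical.

```python
def estimate_transfer_seconds(region_size_mib, io_limit_mib_per_s, concurrency):
    """Naive estimate: the IO limit is shared evenly by `concurrency` operators."""
    if io_limit_mib_per_s <= 0 or concurrency <= 0:
        raise ValueError("io limit and concurrency must be positive")
    return region_size_mib * concurrency / io_limit_mib_per_s


def step_timeout_seconds(region_size_mib, io_limit_mib_per_s, concurrency,
                         generate_s=0.0, slack=2.0):
    """Timeout = slack * (snapshot generation time + estimated transfer time)."""
    transfer = estimate_transfer_seconds(region_size_mib, io_limit_mib_per_s, concurrency)
    return slack * (generate_s + transfer)
```

For example, a 100 MiB region with a 100 MiB/s limit and 6 concurrent operators gives a 6 second transfer estimate, far below both the fixed 10m and the 1 hour suggested above; a larger `slack` or `generate_s` moves the result toward the conservative end.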

tonyxuqqi · 2021-12-29T05:34:51Z

text/0083-Improve-the-robust-of-balance-scheduler.md

+
+### Sync global config
+
+There are some global config that all components need to synchronize like `region-max-size`, `io-limit`. Using ETCD api to implement global config may be a good idea like [this](<[https://github.com/pingcap/tidb/pull/31010/files](https://github.com/pingcap/tidb/pull/31010/files)>)


My concern is that such configuration is likely the same for the whole cluster. Is it worth asking every TiKV to report these values?
Even if some TiKV changes the value, which region size value will PD use for the formula above?
To me, the region size should be a cluster-level config.

bufferflies · Dec 29, 2021

In PR 31010, TiKV doesn't need to report to PD and watch this config.

tonyxuqqi · 2021-12-29T05:38:59Z

text/0083-Improve-the-robust-of-balance-scheduler.md

+
+Canceling operator can depends on TiKV not by PD, but TiKV should notify PD after canceled one operator.
+
+## Questions


We noticed that when scaling out a new node, it's much faster to move the data over if the new node does not become leader until the data has been moved. But of course in some scenarios we want the new node to act as leader ASAP. So it would be better to have an option that supports both scenarios.

When scaling in an old node, in the current implementation, is transferring leaders the first step before moving data?

bufferflies · Dec 29, 2021

In the past, scaling in a node evicted leaders first.
Whether new region peers act as leader should depend on configuration for the different scenarios.

text/0083-Improve-the-robust-of-balance-scheduler.md

rleungx · 2021-12-29T06:14:17Z

text/0083-Improve-the-robust-of-balance-scheduler.md

+
+Normally, one operator is made of region, source store and target store, the key works finished by region leader such as snapshot generate, snapshot send. It is not friendly to the leader if majority operator is add follow.
+
+It will add new store limit as new limit type to decrease leader loads of every store.


Can you provide more details about how we should use this new limit type to decrease the load?

text/0083-Improve-the-robust-of-balance-scheduler.md


disksing · 2022-01-10T05:44:16Z

text/0085-Improve-the-robust-of-balance-scheduler.md

+
+### Sync global config
+
+There are some global config that all components need to synchronize like `region-max-size`, `io-limit`. Using ETCD api to implement global config may be a good idea like [this](<[https://github.com/pingcap/tidb/pull/31010/files](https://github.com/pingcap/tidb/pull/31010/files)>)


Do you mean sync the config from TiKV to PD, or from PD to TiKV?

improve the robust of scheduler

38b6fe4

Signed-off-by: bufferflies <[email protected]>

bufferflies requested review from HunDunDM, disksing and nolouch December 29, 2021 02:25

bufferflies added 2 commits December 29, 2021 10:35

format

e951aeb

Signed-off-by: bufferflies <[email protected]>

format

6da63b8

Signed-off-by: bufferflies <[email protected]>

tonyxuqqi reviewed Dec 29, 2021


tonyxuqqi reviewed Dec 29, 2021


tonyxuqqi reviewed Dec 29, 2021


rleungx reviewed Dec 29, 2021


gramma && rename title

509e480

Signed-off-by: bufferflies <[email protected]>

bufferflies force-pushed the feature/robus branch from 89b76f4 to 509e480 Compare December 29, 2021 07:44

bufferflies changed the title ~~improve the robust of scheduler~~ improve the robust of balance region scheduler Dec 29, 2021

bufferflies added 3 commits December 29, 2021 19:26

grama

7d1a4db

Signed-off-by: bufferflies <[email protected]>

grama && rename file

47488d5

Signed-off-by: bufferflies <[email protected]>

Merge branch 'feature/robus' of github.com:bufferflies/rfcs into feature/robus

eef4f8c

bufferflies force-pushed the feature/robus branch from 31d7868 to eef4f8c Compare December 29, 2021 11:39

disksing reviewed Jan 10, 2022



