
RawKV Change Data Capture #86

Merged: 26 commits merged into tikv:master on Nov 18, 2022
Conversation

@pingyu (Contributor) commented Jan 12, 2022

This proposal introduces the technical design of RawKV Change Data Capture.
(Rendered).

Related issue: tikv/tikv#11745

Signed-off-by: pingyu <[email protected]>
Signed-off-by: pingyu <[email protected]>
@pingyu (Contributor, Author) commented Jan 12, 2022

Comment on lines 12 to 14
Customers are deploying TiKV clusters as non-transactional Key-Value (RawKV) storage. But the lack of *Cross Cluster Replication* is an obstacle to deploying disaster recovery clusters that provide more highly available services.

The ability to replicate across clusters will help TiKV be more adoptable in industries such as banking and finance.
Member

Could you list key challenges of the use case "Banking" plus "Cross Cluster Replication"? It would be helpful for understanding the design in the RFC.

Here are my initial thoughts:

  • How do we address the high latency between clusters?
  • How do we define the consistency of RawKV?

Contributor Author

@overvenus I added a chapter about "Key Challenges & Solutions", and specified that we will implement the RawKV features in TiCDC as new modules, isolated from the original TiDB-related logic.
PTAL, thanks~

@pingyu (Contributor, Author) commented Feb 10, 2022

@BusyJay @Little-Wallace PTAL, thanks~

pingyu mentioned this pull request on Feb 14, 2022

The physical part is refreshed by repeatedly acquiring TSO with a period of `500ms`, to keep it close to real-world time. It can tolerate a TSO fault of no longer than `30s`, to keep time-related metrics such as RPO reasonable.

Besides, on startup, the physical part must be initialized by a successful TSO.
Member

What happens if it fails to acquire a TS from PD?

Contributor Author

The raw write operations (raw_put, raw_del, etc.) will fail.

Contributor

Even without Raw CDC, TiKV can't start up when PD is out of service.

Member

OK. It would be better to add it to the RFC for disambiguation.

Contributor Author

Done, PTAL~


TSO is a globally monotonically increasing timestamp, which makes timestamp generation independent of the machine's local clock and immune to issues such as clock reversal across machine reboots.

The physical part is refreshed by repeatedly acquiring TSO with a period of `500ms`, to keep it close to real-world time. It can tolerate a TSO fault of no longer than `30s`, to keep time-related metrics such as RPO reasonable.
Contributor

Why 30s? And will raw writes fail if TSO is offline for over 30s?

Contributor Author

Hmm... this was not a carefully considered choice. It seems that a TSO outage doesn't break correctness, except for some time-related metrics.
Maybe we can drop this restriction, report failure metrics, and add some tests for this scenario.

Meanwhile, TiKV-CDC is highly dependent on PD. When PD is offline, replication does not work.

Contributor

TiKV-CDC is highly dependent on PD. When PD is offline, replication does not work.

Yes, but writes on the main TiKV may still work.

Maybe we can drop this restriction and report failure metrics.

Agree.

@pingyu (Contributor, Author) Feb 17, 2022

Fixed, PTAL~
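
For illustration, here is a minimal sketch of the periodic TSO refresh discussed in this thread, under the assumption (agreed above) that a TSO failure only degrades metrics and stalls replication rather than blocking writes. All names are hypothetical, not the actual TiKV or PD client API:

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::Duration;

/// Hypothetical stand-in for the PD client; not the real TiKV/PD API.
trait TsoProvider {
    fn get_tso(&self) -> Result<u64, String>; // e.g. (physical_ms << 18) | logical
}

/// Refresh the HLC's physical part from TSO every 500ms (the default period).
/// On failure, only report a metric: local raw writes keep working with the
/// last known physical part, while TiKV-CDC replication stalls until PD recovers.
fn tso_refresh_loop<P: TsoProvider>(pd: &P, hlc_physical_ms: &AtomicU64) {
    loop {
        match pd.get_tso() {
            Ok(ts) => {
                let physical_ms = ts >> 18;
                // Never move the physical part backward.
                hlc_physical_ms.fetch_max(physical_ms, Ordering::SeqCst);
            }
            Err(_err) => {
                // Increment a "TSO refresh failure" metric here instead of
                // failing raw writes outright.
            }
        }
        std::thread::sleep(Duration::from_millis(500));
    }
}
```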

Signed-off-by: Ping Yu <[email protected]>

Co-authored-by: Andy Lok <[email protected]>
Signed-off-by: pingyu <[email protected]>
@pingyu (Contributor, Author) commented Feb 15, 2022

@zz-jason @andylokandy
Comments addressed. PTAL, thanks~

Besides, on startup, the physical part must be initialized by a successful TSO.
On startup, the physical part must be initialized by a successful TSO, to ensure that the HLC is larger than in the last run. RawKV write operations will fail if the initialization has not completed. (TiKV also can't start up when PD is out of service.)

After that, the HLC can tolerate TSO faults. RawKV operations work normally, but time-related metrics such as RPO are not reliable. Meanwhile, as TiKV-CDC is highly dependent on PD (for cluster and task management, checkpoints, etc.), replication will not work.
Contributor

What exactly does "After that" refer to? From the context, it is the TSO failure on startup, but I think it should be TSO failure after startup. And I think the time-related metrics may not be affected unless the clock shift is crazy enough...

Contributor Author

It is "After successful initialization". I'm sorry for the not clear enough.

We do not advance HLC by local clock when TSO is offline, which will cause the time-related metrics problem.

Using local clock on fault of TSO need some other logics to ensure causality, such as saving maximum HLC on storage to avoid timestamp reverse between startup.

I do not include the "Falling Back to Local Clock" in this proposal for simplicity, as the PD offline is rare, and when it was offline it would cause more severe issues than the problem of time-related metrics.

Contributor

Using the local clock on TSO failure needs additional logic to ensure causality, such as persisting the maximum HLC to storage to avoid timestamp reversal across restarts.

If I understand it right, the RegionMaxTs and StoreMaxTs should contain both the physical time and the logical time, thus we always know the latest physical time and it is therefore safe to use the local clock.

Contributor Author

Yes, it is safe while running.
But it becomes unsafe on restart. If the local clock shifts backward between a TiKV stop and restart, it's possible to get a smaller timestamp than in the last run.

Contributor Author

CockroachDB has a description of this restart issue here. We can refer to it.

Contributor

Yes, it is safe while running.

Yes, that's my point. As long as TiKV doesn't restart, the RPO metric and write operations should not be affected by a PD failure.
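
To make the restart concern concrete, here is a minimal sketch (hypothetical names, not the actual implementation) of the startup rule quoted from the RFC above: RawKV writes fail until the HLC has been initialized from a successful TSO, so the new HLC is guaranteed to be larger than any timestamp issued in the last run, even if the local clock moved backward while TiKV was down:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical HLC wrapper; 0 means "not yet initialized from TSO".
struct Hlc {
    physical_ms: AtomicU64,
}

impl Hlc {
    /// Called once a TSO request succeeds on startup.
    fn init_from_tso(&self, tso_physical_ms: u64) {
        self.physical_ms.fetch_max(tso_physical_ms, Ordering::SeqCst);
    }

    /// RawKV write paths call this; they must fail before initialization,
    /// because the local clock alone cannot guarantee monotonicity across restarts.
    fn physical_for_write(&self) -> Result<u64, &'static str> {
        match self.physical_ms.load(Ordering::SeqCst) {
            0 => Err("HLC not initialized yet: waiting for a successful TSO"),
            p => Ok(p),
        }
    }
}
```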

Member

It would be better if you could explain the reason with an example. For example:

In a 3-AZ deployment:

  • The PD leader is located in AZ1
  • TiKV Region leaders are uniformly distributed across AZ1, AZ2, and AZ3
  • A partial network partition happens: AZ1 cannot communicate with AZ2, but both can communicate with AZ3, and the TiKV client (for example, client-java) can connect to all three AZs

Would write requests to TiKV Regions in AZ2 be blocked because AZ2 cannot sync the physical timestamp with the PD leader in AZ1?

Contributor Author

I improved the wording here. PTAL~

@andylokandy (Contributor) left a comment

rest LGTM


Timestamps are generated by TiKV internally, to get better overall performance and client compatibility.

We use the [HLC](https://cse.buffalo.edu/tech-reports/2014-04.pdf) method to generate timestamps, which is not only easy to use (by being close to real-world time), but also captures causality.
@zz-jason (Member) Feb 15, 2022

Could you add more info about how many bits the physical and logical parts of the timestamp occupy in this implementation?

Contributor Author

OK. Done, PTAL~
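
For readers following along, a minimal sketch of one possible physical/logical split. The exact bit widths are an assumption here (the TSO-style layout with an 18-bit logical suffix); the RFC text defines the numbers it actually settles on:

```rust
/// Assumed layout for illustration only: physical milliseconds in the high
/// bits, an 18-bit logical counter in the low bits (the split used by TSO).
const LOGICAL_BITS: u32 = 18;
const LOGICAL_MASK: u64 = (1 << LOGICAL_BITS) - 1;

fn compose(physical_ms: u64, logical: u64) -> u64 {
    (physical_ms << LOGICAL_BITS) | (logical & LOGICAL_MASK)
}

fn decompose(ts: u64) -> (u64, u64) {
    (ts >> LOGICAL_BITS, ts & LOGICAL_MASK)
}
```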


TSO is a globally monotonically increasing timestamp, which makes timestamp generation independent of the machine's local clock and immune to issues such as clock reversal across machine reboots.

The physical part is refreshed by repeatedly acquiring TSO with a period of `500ms`, to keep it close to real-world time.
Member

Is the 500ms period configurable?

Contributor Author

Yes. I added more words about this, PTAL~

pingyu changed the title from "RawKV Cross Cluster Replication" to "RawKV Change Data Capture" on Nov 14, 2022
Signed-off-by: pingyu <[email protected]>
@pingyu (Contributor, Author) commented Nov 15, 2022

@zeminzhou @haojinming PTAL, thanks~


We will use [memory locks] to track the "in-flight" RawKV writes: acquire a lock before scheduling the writes, and release the lock after (asynchronously) sending the response to the client.

Then, the CDC worker in TiKV is able to compute the **resolved-ts** as `min(timestamp)-1` over all locks. If there is no lock, the current TSO is used as the **resolved-ts**.
Contributor

As the concurrency manager locks the ts before the key write, using the lock ts as the resolved-ts is enough; there is no need for the -1.
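
For illustration, a minimal sketch of deriving the resolved-ts from the in-flight locks described above. The types and names are hypothetical (the real code uses TiKV's concurrency manager), and whether the `-1` is needed is exactly the point raised in this comment:

```rust
use std::collections::BTreeMap;

/// Hypothetical table of in-flight RawKV writes keyed by their timestamps.
struct InflightWrites {
    locks: BTreeMap<u64, usize>, // timestamp -> number of writes holding it
}

impl InflightWrites {
    /// Resolved-ts: every entry at or below it has already been observed.
    /// With in-flight writes it is min(lock ts) - 1 (or simply min(lock ts),
    /// per the comment above); with no locks, the current TSO can be used.
    fn resolved_ts(&self, current_tso: u64) -> u64 {
        match self.locks.keys().next() {
            Some(&min_ts) => min_ts.saturating_sub(1),
            None => current_tso,
        }
    }
}
```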


**3rd**, [TiKV API V2] does not guarantee that a **RawKV** entry with a smaller timestamp must be observed first. So **TiKV-CDC** will sort entries by timestamp before sending them to the downstream.

But how does **TiKV-CDC** know it will not receive an even earlier entry from TiKV after the previous batch of data has been sent downstream? So we need the **Resolved Timestamp** (abbr. **resolved-ts**). **Resolved-ts** is similar to a [Watermark], by which TiKV indicating that all earlier entries have been observed. When **TiKV-CDC** receives a `resolved-ts`, all sorted entries with `timestamp <= resolved-ts` are safe to write to the downstream.
Contributor

Suggested change
But how does **TiKV-CDC** know it will not receive an even earlier entry from TiKV after the previous batch of data has been sent downstream? So we need the **Resolved Timestamp** (abbr. **resolved-ts**). **Resolved-ts** is similar to a [Watermark], by which TiKV indicating that all earlier entries have been observed. When **TiKV-CDC** receives a `resolved-ts`, all sorted entries with `timestamp <= resolved-ts` are safe to write to the downstream.
But how does **TiKV-CDC** know it will not receive an even earlier entry from TiKV after the previous batch of data has been sent downstream? So we need the **Resolved Timestamp** (abbr. **resolved-ts**). **Resolved-ts** is similar to a [Watermark], by which TiKV indicates that all earlier entries have been observed. When **TiKV-CDC** receives a `resolved-ts`, all sorted entries with `timestamp <= resolved-ts` are safe to write to the downstream.
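
As a rough sketch of the sort-then-flush behavior described in this paragraph (hypothetical types, not the actual TiKV-CDC code): buffer incoming entries keyed by timestamp, and on each resolved-ts event emit everything at or below it, already in timestamp order:

```rust
use std::collections::BTreeMap;

/// Hypothetical per-task sorter buffering RawKV change entries by timestamp.
struct Sorter {
    buffered: BTreeMap<u64, Vec<Vec<u8>>>, // timestamp -> encoded entries
}

impl Sorter {
    fn add(&mut self, ts: u64, entry: Vec<u8>) {
        self.buffered.entry(ts).or_default().push(entry);
    }

    /// On receiving a resolved-ts, drain all entries with ts <= resolved_ts.
    /// The BTreeMap already keeps them sorted by timestamp.
    fn flush(&mut self, resolved_ts: u64) -> Vec<Vec<u8>> {
        let pending = self.buffered.split_off(&(resolved_ts + 1));
        let ready = std::mem::replace(&mut self.buffered, pending);
        ready.into_values().flatten().collect()
    }
}
```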

pingyu and others added 2 commits November 15, 2022 18:22
Signed-off-by: pingyu <[email protected]>

Co-authored-by: haojinming <[email protected]>
Signed-off-by: pingyu <[email protected]>
@pingyu (Contributor, Author) commented Nov 15, 2022

@haojinming All comments are addressed. PTAL~

@haojinming (Contributor) left a comment

LGTM

@zeminzhou left a comment

LGTM~

Signed-off-by: Ping Yu <[email protected]>
@sunxiaoguang (Member)
/merge

pingyu merged commit 4aeef16 into tikv:master on Nov 18, 2022