
RawKV Change Data Capture #86

Merged: 26 commits merged into tikv:master on Nov 18, 2022
Conversation

@pingyu (Contributor) commented Jan 12, 2022

This proposal introduces the technical design of RawKV Change Data Capture.
(Rendered).

Related issue: tikv/tikv#11745

Signed-off-by: pingyu <[email protected]>
Signed-off-by: pingyu <[email protected]>
@pingyu (Contributor, Author) commented Jan 12, 2022

Comment on lines 12 to 14
Customers are deploying TiKV clusters as non-transactional Key-Value (RawKV) storage. But the lack of *Cross Cluster Replication* is an obstacle to deploying disaster recovery clusters that provide more highly available services.

The ability to replicate across clusters will help TiKV be more adoptable in industries such as banking and finance.
Member

Could you list key challenges of the use case "Banking" plus "Cross Cluster Replication"? It would be helpful for understanding the design in the RFC.

Here are my initial thoughts:

  • How do we address the high latency between clusters?
  • How do we define the consistency of RawKV?

Contributor Author

@overvenus I added a chapter about "Key Challenges & Solutions", and specified that we will implement the RawKV features in TiCDC as new modules, isolated from the original TiDB-related logic.
PTAL, thanks~

@pingyu (Contributor, Author) commented Feb 10, 2022

@BusyJay @Little-Wallace PTAL, thanks~

pingyu mentioned this pull request on Feb 14, 2022

The physical part is refreshed by repeatedly acquiring TSO with a period of `500ms`, to keep it close to real-world time. It can tolerate a TSO fault of no longer than `30s`, to keep time-related metrics such as RPO reasonable.

Besides, on startup, the physical part must be initialized by a successful TSO.
Member

What happens if it fails to acquire a TS from PD?

Contributor Author

The raw write operations (raw_put, raw_del, etc.) will fail.

Contributor

Even without Raw CDC, TiKV can't start up when PD is out of service.

Member

OK. It would be better to add it to the RFC for disambiguation.

Contributor Author

Done, PTAL~


TSO is a globally monotonically increasing timestamp, which makes timestamp generation independent of the machine's local clock and immune to issues such as clock reversal across machine reboots.

The physical part is refreshed by repeatedly acquiring TSO with a period of `500ms`, to keep it close to real-world time. It can tolerate a TSO fault of no longer than `30s`, to keep time-related metrics such as RPO reasonable.
Contributor

Why 30s? And will raw writes fail if TSO is offline for over 30s?

Contributor Author

Hmm... this was not a carefully considered choice. It seems that a TSO outage doesn't break correctness, except for some time-related metrics.
Maybe we can drop this restriction, report failure metrics, and add some tests for this scenario.

Meanwhile, TiKV-CDC is highly dependent on PD. When PD is offline, replication does not work.

Contributor

TiKV-CDC is highly dependent on PD. When PD is offline, replication does not work.

Yes, but writes on the main TiKV may still work.

Maybe we can drop this restriction and report failure metrics.

Agree.

@pingyu (Contributor, Author) Feb 17, 2022

Fixed, PTAL~
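
For illustration, here is a minimal sketch of the periodic TSO refresh discussed in this thread, under the assumption (agreed above) that a TSO failure only degrades metrics and stalls replication rather than blocking writes. All names are hypothetical, not the actual TiKV or PD client API:

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::Duration;

/// Hypothetical stand-in for the PD client; not the real TiKV/PD API.
trait TsoProvider {
    fn get_tso(&self) -> Result<u64, String>; // e.g. (physical_ms << 18) | logical
}

/// Refresh the HLC's physical part from TSO every 500ms (the default period).
/// On failure, only report a metric: local raw writes keep working with the
/// last known physical part, while TiKV-CDC replication stalls until PD recovers.
fn tso_refresh_loop<P: TsoProvider>(pd: &P, hlc_physical_ms: &AtomicU64) {
    loop {
        match pd.get_tso() {
            Ok(ts) => {
                let physical_ms = ts >> 18;
                // Never move the physical part backward.
                hlc_physical_ms.fetch_max(physical_ms, Ordering::SeqCst);
            }
            Err(_err) => {
                // Increment a "TSO refresh failure" metric here instead of
                // failing raw writes outright.
            }
        }
        std::thread::sleep(Duration::from_millis(500));
    }
}
```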

Signed-off-by: Ping Yu <[email protected]>

Co-authored-by: Andy Lok <[email protected]>
Signed-off-by: pingyu <[email protected]>
@pingyu (Contributor, Author) commented Feb 15, 2022

@zz-jason @andylokandy
Comments addressed. PTAL, thanks~

Besides, on startup, the physical part must be initialized by a successful TSO.
On startup, the physical part must be initialized by a successful TSO, to ensure that the HLC is larger than in the last run. RawKV write operations will fail if the initialization has not completed. (TiKV also can't start up when PD is out of service.)

After that, the HLC can tolerate TSO faults. RawKV operations work normally, but time-related metrics such as RPO are not reliable. Meanwhile, as TiKV-CDC is highly dependent on PD (for cluster and task management, checkpoints, etc.), replication will not work.
Contributor

What exactly does "After that" refer to? From the context, it is the TSO failure on startup, but I think it should be TSO failure after startup. And I think the time-related metrics may not be affected unless the clock shift is crazy enough...

Contributor Author

It is "After successful initialization". I'm sorry for the not clear enough.

We do not advance HLC by local clock when TSO is offline, which will cause the time-related metrics problem.

Using local clock on fault of TSO need some other logics to ensure causality, such as saving maximum HLC on storage to avoid timestamp reverse between startup.

I do not include the "Falling Back to Local Clock" in this proposal for simplicity, as the PD offline is rare, and when it was offline it would cause more severe issues than the problem of time-related metrics.

Contributor

Using the local clock on TSO failure needs additional logic to ensure causality, such as persisting the maximum HLC to storage to avoid timestamp reversal across restarts.

If I understand it right, the RegionMaxTs and StoreMaxTs should contain both the physical time and the logical time, thus we always know the latest physical time and it is therefore safe to use the local clock.

Contributor Author

Yes, it is safe while running.
But it becomes unsafe on restart. If the local clock shifts backward between a TiKV stop and restart, it's possible to get a smaller timestamp than in the last run.

Contributor Author

CockroachDB has a description of this restart issue here. We can refer to it.

Contributor

Yes, it is safe while running.

Yes, that's my point. As long as TiKV doesn't restart, the RPO metric and write operations should not be affected by a PD failure.
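
To make the restart concern concrete, here is a minimal sketch (hypothetical names, not the actual implementation) of the startup rule quoted from the RFC above: RawKV writes fail until the HLC has been initialized from a successful TSO, so the new HLC is guaranteed to be larger than any timestamp issued in the last run, even if the local clock moved backward while TiKV was down:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical HLC wrapper; 0 means "not yet initialized from TSO".
struct Hlc {
    physical_ms: AtomicU64,
}

impl Hlc {
    /// Called once a TSO request succeeds on startup.
    fn init_from_tso(&self, tso_physical_ms: u64) {
        self.physical_ms.fetch_max(tso_physical_ms, Ordering::SeqCst);
    }

    /// RawKV write paths call this; they must fail before initialization,
    /// because the local clock alone cannot guarantee monotonicity across restarts.
    fn physical_for_write(&self) -> Result<u64, &'static str> {
        match self.physical_ms.load(Ordering::SeqCst) {
            0 => Err("HLC not initialized yet: waiting for a successful TSO"),
            p => Ok(p),
        }
    }
}
```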

Member

It would be better if you could explain the reason with an example. For example:

In a 3-AZ deployment:

  • The PD leader is located in AZ1
  • TiKV Region leaders are uniformly distributed across AZ1, AZ2, and AZ3
  • A partial network partition happens: AZ1 cannot communicate with AZ2, but both can communicate with AZ3, and the TiKV client (for example, client-java) can connect to all three AZs

Would write requests to TiKV Regions in AZ2 be blocked because AZ2 cannot sync the physical timestamp with the PD leader in AZ1?

Contributor Author

I improved the wording here. PTAL~

@andylokandy (Contributor) left a comment

rest LGTM


Timestamps are generated by TiKV internally, to get better overall performance and client compatibility.

We use the [HLC](https://cse.buffalo.edu/tech-reports/2014-04.pdf) method to generate timestamps, which is not only easy to use (by being close to real-world time), but also captures causality.
@zz-jason (Member) Feb 15, 2022

Could you add more info about how many bits the physical and logical parts of the timestamp occupy in this implementation?

Contributor Author

OK. Done, PTAL~
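
For readers following along, a minimal sketch of one possible physical/logical split. The exact bit widths are an assumption here (the TSO-style layout with an 18-bit logical suffix); the RFC text defines the numbers it actually settles on:

```rust
/// Assumed layout for illustration only: physical milliseconds in the high
/// bits, an 18-bit logical counter in the low bits (the split used by TSO).
const LOGICAL_BITS: u32 = 18;
const LOGICAL_MASK: u64 = (1 << LOGICAL_BITS) - 1;

fn compose(physical_ms: u64, logical: u64) -> u64 {
    (physical_ms << LOGICAL_BITS) | (logical & LOGICAL_MASK)
}

fn decompose(ts: u64) -> (u64, u64) {
    (ts >> LOGICAL_BITS, ts & LOGICAL_MASK)
}
```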


TSO is a globally monotonically increasing timestamp, which makes timestamp generation independent of the machine's local clock and immune to issues such as clock reversal across machine reboots.

The physical part is refreshed by repeatedly acquiring TSO with a period of `500ms`, to keep it close to real-world time.
Member

Is the 500ms period configurable?

Contributor Author

Yes. I added more words about this, PTAL~

pingyu changed the title from "RawKV Cross Cluster Replication" to "RawKV Change Data Capture" on Nov 14, 2022
Signed-off-by: pingyu <[email protected]>
@pingyu (Contributor, Author) commented Nov 15, 2022

@zeminzhou @haojinming PTAL, thanks~


We will use [memory locks] to track the "in-flight" RawKV writes: acquire a lock before scheduling the writes, and release the lock after (asynchronously) sending the response to the client.

Then, the CDC worker in TiKV is able to compute the **resolved-ts** as `min(timestamp)-1` over all locks. If there is no lock, the current TSO is used as the **resolved-ts**.
Contributor

As the concurrency manager locks the ts before the key write, using the lock ts as the resolved-ts is enough; there is no need for the -1.
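
For illustration, a minimal sketch of deriving the resolved-ts from the in-flight locks described above. The types and names are hypothetical (the real code uses TiKV's concurrency manager), and whether the `-1` is needed is exactly the point raised in this comment:

```rust
use std::collections::BTreeMap;

/// Hypothetical table of in-flight RawKV writes keyed by their timestamps.
struct InflightWrites {
    locks: BTreeMap<u64, usize>, // timestamp -> number of writes holding it
}

impl InflightWrites {
    /// Resolved-ts: every entry at or below it has already been observed.
    /// With in-flight writes it is min(lock ts) - 1 (or simply min(lock ts),
    /// per the comment above); with no locks, the current TSO can be used.
    fn resolved_ts(&self, current_tso: u64) -> u64 {
        match self.locks.keys().next() {
            Some(&min_ts) => min_ts.saturating_sub(1),
            None => current_tso,
        }
    }
}
```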


**3rd**, [TiKV API V2] does not guarantee that a **RawKV** entry with a smaller timestamp must be observed first. So **TiKV-CDC** will sort entries by timestamp before sending them to the downstream.

But how does **TiKV-CDC** know it will not receive an even earlier entry from TiKV after the previous batch of data has been sent downstream? So we need the **Resolved Timestamp** (abbr. **resolved-ts**). **Resolved-ts** is similar to a [Watermark], by which TiKV indicating that all earlier entries have been observed. When **TiKV-CDC** receives a `resolved-ts`, all sorted entries with `timestamp <= resolved-ts` are safe to write to the downstream.
Contributor

Suggested change
But how does **TiKV-CDC** know it will not receive an even earlier entry from TiKV after the previous batch of data has been sent downstream? So we need the **Resolved Timestamp** (abbr. **resolved-ts**). **Resolved-ts** is similar to a [Watermark], by which TiKV indicating that all earlier entries have been observed. When **TiKV-CDC** receives a `resolved-ts`, all sorted entries with `timestamp <= resolved-ts` are safe to write to the downstream.
But how does **TiKV-CDC** know it will not receive an even earlier entry from TiKV after the previous batch of data has been sent downstream? So we need the **Resolved Timestamp** (abbr. **resolved-ts**). **Resolved-ts** is similar to a [Watermark], by which TiKV indicates that all earlier entries have been observed. When **TiKV-CDC** receives a `resolved-ts`, all sorted entries with `timestamp <= resolved-ts` are safe to write to the downstream.
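
As a rough sketch of the sort-then-flush behavior described in this paragraph (hypothetical types, not the actual TiKV-CDC code): buffer incoming entries keyed by timestamp, and on each resolved-ts event emit everything at or below it, already in timestamp order:

```rust
use std::collections::BTreeMap;

/// Hypothetical per-task sorter buffering RawKV change entries by timestamp.
struct Sorter {
    buffered: BTreeMap<u64, Vec<Vec<u8>>>, // timestamp -> encoded entries
}

impl Sorter {
    fn add(&mut self, ts: u64, entry: Vec<u8>) {
        self.buffered.entry(ts).or_default().push(entry);
    }

    /// On receiving a resolved-ts, drain all entries with ts <= resolved_ts.
    /// The BTreeMap already keeps them sorted by timestamp.
    fn flush(&mut self, resolved_ts: u64) -> Vec<Vec<u8>> {
        let pending = self.buffered.split_off(&(resolved_ts + 1));
        let ready = std::mem::replace(&mut self.buffered, pending);
        ready.into_values().flatten().collect()
    }
}
```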

pingyu and others added 2 commits November 15, 2022 18:22
Signed-off-by: pingyu <[email protected]>

Co-authored-by: haojinming <[email protected]>
Signed-off-by: pingyu <[email protected]>
@pingyu (Contributor, Author) commented Nov 15, 2022

@haojinming All comments are addressed. PTAL~

@haojinming (Contributor) left a comment

LGTM

@zeminzhou left a comment

LGTM~

Signed-off-by: Ping Yu <[email protected]>
@sunxiaoguang (Member)
/merge

pingyu merged commit 4aeef16 into tikv:master on Nov 18, 2022