RawKV Change Data Capture #86
Conversation
Signed-off-by: pingyu <[email protected]>
Signed-off-by: pingyu <[email protected]>
Signed-off-by: pingyu <[email protected]>
Signed-off-by: pingyu <[email protected]>
Customers are deploying TiKV clusters as non-transactional Key-Value (RawKV) storage. But the lack of *Cross Cluster Replication* is an obstacle to deploying disaster recovery clusters that provide more highly available services.

The ability of cross cluster replication will help TiKV become more adoptable in industries such as Banking and Finance.
Could you list key challenges of the use case "Banking" plus "Cross Cluster Replication"? It would be helpful for understanding the design in the RFC.
Here are my initial thoughts:
- How to address high latency issue between clusters?
- How do we define the consistency of raw kv?
@overvenus I added a chapter about "Key Challenges & Solutions", and specified that we will implement the RawKV features in TiCDC as new modules, to isolate them from the original TiDB-related logic.
PTAL, thanks~
Signed-off-by: pingyu <[email protected]>
Signed-off-by: pingyu <[email protected]>
Signed-off-by: pingyu <[email protected]>
@BusyJay @Little-Wallace PTAL, thanks~
The physical part is refreshed by repeatedly acquiring TSO with a period of `500ms`, to keep it close to real world time. It can tolerate a TSO fault of no longer than `30s`, to keep time-related metrics such as RPO reasonable.

Besides, on startup, the physical part must be initialized by a successful TSO.
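For illustration, here is a minimal Rust sketch of this mechanism, not the actual TiKV implementation: the `Hlc` type, its method names, and the simple counter-based logical part are all assumptions. It shows the two properties described above: the physical part only advances through TSO refreshes, and no timestamp is handed out before the first successful TSO on startup.

```rust
use std::sync::Mutex;

/// Assumed layout: high bits are the physical part (refreshed from TSO),
/// low 18 bits are a logical counter (the split itself is an assumption).
const LOGICAL_BITS: u32 = 18;

#[derive(Default)]
struct Hlc {
    // None until the first successful TSO fetch on startup.
    last: Mutex<Option<u64>>,
}

impl Hlc {
    /// Called once on startup and then periodically (e.g. every 500ms)
    /// with a fresh TSO; the HLC never moves backwards.
    fn refresh_from_tso(&self, tso: u64) {
        let mut last = self.last.lock().unwrap();
        *last = Some((*last).map_or(tso, |cur| cur.max(tso)));
    }

    /// Returns a monotonically increasing timestamp, or an error if the
    /// physical part was never initialized (e.g. PD unreachable on startup).
    fn now(&self) -> Result<u64, &'static str> {
        let mut last = self.last.lock().unwrap();
        match *last {
            None => Err("HLC not initialized: TSO unavailable"),
            Some(cur) => {
                // Bump the logical part; overflow into the physical part is
                // ignored in this sketch for simplicity.
                let next = cur + 1;
                *last = Some(next);
                Ok(next)
            }
        }
    }
}

fn main() {
    let hlc = Hlc::default();
    assert!(hlc.now().is_err()); // raw writes would fail at this point
    hlc.refresh_from_tso(12_345 << LOGICAL_BITS); // first successful TSO
    let a = hlc.now().unwrap();
    let b = hlc.now().unwrap();
    assert!(b > a);
    println!("a = {a}, b = {b}");
}
```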
What happens if it fails to acquire a TS from PD?
The raw write operations (raw_put, raw_del, etc.) will fail.
Even without Raw CDC, TiKV can't start up when PD is out of service.
OK. It would be better to add it to the RFC for disambiguation.
Done, PTAL~
TSO is a global monotonically increasing timestamp, which makes timestamp generation independent of the machine's local clock and immune to issues such as clock reversal across machine reboots.

The physical part is refreshed by repeatedly acquiring TSO with a period of `500ms`, to keep it close to real world time. It can tolerate a TSO fault of no longer than `30s`, to keep time-related metrics such as RPO reasonable.
Why `30s`? And will raw writes fail if TSO is offline for more than 30s?
Hmm... I think this was not a carefully considered decision. It seems that a TSO outage doesn't break correctness, except for some time-related metrics.
Maybe we can drop this restriction and report failure metrics, and add some tests for this scenario.
Meanwhile, TiKV-CDC is highly dependent on PD. When PD is offline, the replication does not work.
TiKV-CDC is highly dependent on PD. When PD is offline, the replication does not work.
Yes, but writes on the main TiKV may still work.
Maybe we can drop this restriction and report failure metrics.
Agree
Fixed, PTAL~
Signed-off-by: Ping Yu <[email protected]> Co-authored-by: Andy Lok <[email protected]>
Signed-off-by: pingyu <[email protected]>
@zz-jason @andylokandy
Besides, on startup, physical part must be initialized by a successful TSO.
On startup, the physical part must be initialized by a successful TSO, to ensure that the HLC is larger than in the last run. RawKV write operations will fail if the initialization has not been completed. (TiKV also can't start up when PD is out of service.)

After that, the HLC can tolerate TSO faults. RawKV operations work normally, but time-related metrics such as RPO are not reliable. Meanwhile, as TiKV-CDC is highly dependent on PD (for cluster and task management, checkpoints, etc.), the replication will not work.
What exactly is "After that"? From the context, it reads as TSO failure on startup, but I think it should be TSO failure after startup. And I think the time-related metrics may not be affected unless the clock shift is crazy enough...
It is "After successful initialization". I'm sorry for the not clear enough.
We do not advance HLC by local clock when TSO is offline, which will cause the time-related metrics problem.
Using local clock on fault of TSO need some other logics to ensure causality, such as saving maximum HLC on storage to avoid timestamp reverse between startup.
I do not include the "Falling Back to Local Clock" in this proposal for simplicity, as the PD offline is rare, and when it was offline it would cause more severe issues than the problem of time-related metrics.
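For concreteness, a hedged sketch of the fallback idea discussed here (explicitly not part of this proposal): persist the maximum issued timestamp so that, after a restart with a backward-shifted local clock, new timestamps can never fall below what was already handed out. The function and constant names are hypothetical.

```rust
// Hypothetical fallback: advance the HLC from the local clock while TSO is
// unavailable, but never below a persisted maximum, so a backward local-clock
// shift across restarts cannot produce a smaller timestamp than the last run.
use std::time::{SystemTime, UNIX_EPOCH};

const LOGICAL_BITS: u32 = 18; // assumed TSO-style layout

fn local_physical_ms() -> u64 {
    SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .expect("clock before epoch")
        .as_millis() as u64
}

/// `persisted_max` is the largest timestamp ever issued, flushed to storage
/// (e.g. the RegionMaxTs / StoreMaxTs mentioned in this discussion) before use.
fn next_ts_with_local_clock(persisted_max: u64) -> u64 {
    let candidate = local_physical_ms() << LOGICAL_BITS;
    // If the local clock went backwards, fall back to bumping the last max.
    candidate.max(persisted_max + 1)
}

fn main() {
    let persisted_max: u64 = 1_700_000_000_000 << LOGICAL_BITS;
    let ts = next_ts_with_local_clock(persisted_max);
    assert!(ts > persisted_max);
    println!("next ts: {ts}");
}
```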
Using the local clock on TSO failure needs some other logic to ensure causality, such as saving the maximum HLC to storage to avoid timestamp reversal across restarts.
If I understand it right, the `RegionMaxTs` and `StoreMaxTs` should contain both physical time and logical time, so we always know the latest physical time and it is therefore safe to use the local clock.
Yes, it is safe while running.
But it becomes unsafe on restart: if the local clock shifts backward between TiKV stop & restart, it's possible to get a smaller timestamp than in the last run.
CockroachDB has a description of this restart issue here. We can refer to it.
Yes, it is safe while running.
Yes, that's my point. As long as TiKV doesn't restart, the RPO metrics and write operations should not be affected by PD failure.
It would be better if you could explain the reason with an example. For example:
In a 3 AZ deployment:
- PD leader is located in AZ1
- TiKV region leaders are uniformly distributed in AZ1 - AZ3
- A partial network partition happens: AZ1 cannot communicate with AZ2, but both can communicate with AZ3, and the TiKV Client (for example, client-java) can connect to all three AZs
Would the write requests to TiKV Region leaders in AZ2 be blocked because AZ2 cannot sync the physical timestamp with the PD leader in AZ1?
I improved the wording here. PTAL~
rest LGTM
Timestamps are generated by TiKV internally, to get better overall performance and client compatibility.

We use the [HLC](https://cse.buffalo.edu/tech-reports/2014-04.pdf) method to generate timestamps, which is not only easy to use (by being close to real world time), but also captures causality.
Could you add more info about how many bits the physical and logical parts of the timestamp occupy in this implementation?
OK. Done, PTAL~
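As an illustration of what such a bit layout could look like, here is a small sketch assuming the common TiDB-style TSO encoding (physical milliseconds in the high bits, an 18-bit logical counter in the low bits); the RFC's actual numbers may differ.

```rust
// Assumed layout: 64-bit timestamp = (physical milliseconds << 18) | logical.
const LOGICAL_BITS: u32 = 18;
const LOGICAL_MASK: u64 = (1 << LOGICAL_BITS) - 1;

fn compose(physical_ms: u64, logical: u64) -> u64 {
    (physical_ms << LOGICAL_BITS) | (logical & LOGICAL_MASK)
}

fn decompose(ts: u64) -> (u64, u64) {
    (ts >> LOGICAL_BITS, ts & LOGICAL_MASK)
}

fn main() {
    let ts = compose(1_700_000_000_000, 42);
    assert_eq!(decompose(ts), (1_700_000_000_000, 42));
    println!("ts = {ts:#x}");
}
```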
TSO is a global monotonically increasing timestamp, which makes timestamp generation independent of the machine's local clock and immune to issues such as clock reversal across machine reboots.

The physical part is refreshed by repeatedly acquiring TSO with a period of `500ms`, to keep it close to real world time.
Is the period `500ms` configurable?
Yes. I added more words about this, PTAL~
Signed-off-by: pingyu <[email protected]>
Signed-off-by: pingyu <[email protected]>
Signed-off-by: pingyu <[email protected]>
@zeminzhou @haojinming PTAL, thanks~
We will use [memory locks] to trace the "in-flight" RawKV writes: acquire a lock before scheduling the writes, and release the lock after (asynchronously) sending the response to the client.

Then, the CDC worker in TiKV is able to get the **resolved-ts** as `min(timestamp)-1` over all locks. If there is no lock, the current TSO is used as the **resolved-ts**.
As the concurrency manager locks the ts before the key write, using the lock ts as the `resolved-ts` is enough; no need for the `-1`.
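To make the resolved-ts computation concrete, here is a hedged sketch of a simplified in-memory lock table; the real implementation builds on TiKV's concurrency manager, and the names here (`MemoryLocks`, `lock`, `unlock`, `resolved_ts`) are made up. It keeps the `-1` from the quoted text; per the comment above, using the minimum lock ts directly would also work.

```rust
use std::collections::BTreeMap;
use std::sync::Mutex;

/// Simplified in-memory lock table keyed by write timestamp, counting
/// in-flight RawKV writes that hold each timestamp.
#[derive(Default)]
struct MemoryLocks {
    inflight: Mutex<BTreeMap<u64, usize>>,
}

impl MemoryLocks {
    /// Acquire before scheduling a raw write.
    fn lock(&self, ts: u64) {
        *self.inflight.lock().unwrap().entry(ts).or_insert(0) += 1;
    }

    /// Release after the response has been sent to the client.
    fn unlock(&self, ts: u64) {
        let mut map = self.inflight.lock().unwrap();
        if let Some(cnt) = map.get_mut(&ts) {
            *cnt -= 1;
            if *cnt == 0 {
                map.remove(&ts);
            }
        }
    }

    /// resolved-ts = min(in-flight ts) - 1, or the current TSO if idle.
    fn resolved_ts(&self, current_tso: u64) -> u64 {
        self.inflight
            .lock()
            .unwrap()
            .keys()
            .next()
            .map_or(current_tso, |min_ts| min_ts - 1)
    }
}

fn main() {
    let locks = MemoryLocks::default();
    assert_eq!(locks.resolved_ts(100), 100); // no in-flight writes
    locks.lock(90);
    locks.lock(95);
    assert_eq!(locks.resolved_ts(100), 89); // min(90, 95) - 1
    locks.unlock(90);
    assert_eq!(locks.resolved_ts(100), 94);
}
```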
**3rd**, [TiKV API V2] does not guarantee that a **RawKV** entry with a smaller timestamp must be observed first. So **TiKV-CDC** will sort entries by timestamp before sending them to the downstream.

But how does **TiKV-CDC** know that it will not receive an even earlier entry from TiKV after the previous batch of data has been sent to the downstream? So we need the **Resolved Timestamp** (abbr. **resolved-ts**). **Resolved-ts** is similar to a [Watermark], by which TiKV indicates that all earlier entries have been observed. When **TiKV-CDC** receives a `resolved-ts`, all sorted entries with `timestamp <= resolved-ts` are safe to write to the downstream.
But how **TiKV-CDC** knows it will not receive any even earlier entry from TiKV after the previous batch of data had been sent to downstream ? So we need **Resolved Timestamp** (abbr. **resolved-ts**). **Resolved-ts** is similar to [Watermark], by which TiKV indicating that all entries earlier have been observed. When **TiKV-CDC** receives a `resolved-ts`, all sorted entries with `timestamp <= resolved-ts` are safe to write to downstream.
But how **TiKV-CDC** knows it will not receive any even earlier entry from TiKV after the previous batch of data had been sent to downstream ? So we need **Resolved Timestamp** (abbr. **resolved-ts**). **Resolved-ts** is similar to [Watermark], by which TiKV indicates that all entries earlier have been observed. When **TiKV-CDC** receives a `resolved-ts`, all sorted entries with `timestamp <= resolved-ts` are safe to write to downstream.
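To make the watermark behaviour concrete, here is a hedged sketch of a sorter that buffers out-of-order entries and emits everything with `timestamp <= resolved-ts` in order; the names are assumptions and this is not TiKV-CDC's actual sorter API.

```rust
use std::collections::BTreeMap;

/// Toy change-event sorter: buffers entries keyed by timestamp and flushes
/// them in order once a resolved-ts guarantees no earlier entry can arrive.
#[derive(Default)]
struct Sorter {
    buffer: BTreeMap<u64, Vec<String>>, // ts -> entries observed at that ts
}

impl Sorter {
    fn add_entry(&mut self, ts: u64, entry: String) {
        self.buffer.entry(ts).or_default().push(entry);
    }

    /// On resolved-ts, all entries with ts <= resolved_ts are safe to emit
    /// to the downstream, in timestamp order.
    fn on_resolved_ts(&mut self, resolved_ts: u64) -> Vec<(u64, String)> {
        // Keep everything strictly above resolved_ts buffered; flush the rest.
        let pending = self.buffer.split_off(&(resolved_ts + 1));
        let ready = std::mem::replace(&mut self.buffer, pending);
        ready
            .into_iter()
            .flat_map(|(ts, entries)| entries.into_iter().map(move |e| (ts, e)))
            .collect()
    }
}

fn main() {
    let mut sorter = Sorter::default();
    // Entries may be observed out of timestamp order.
    sorter.add_entry(105, "put k2".into());
    sorter.add_entry(101, "put k1".into());
    sorter.add_entry(110, "del k1".into());

    // resolved-ts = 106: everything at or below 106 is safe to send downstream.
    let flushed = sorter.on_resolved_ts(106);
    let flushed_ts: Vec<u64> = flushed.iter().map(|(ts, _)| *ts).collect();
    assert_eq!(flushed_ts, vec![101, 105]);
    println!("flushed {} entries", flushed.len());
}
```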
Signed-off-by: pingyu <[email protected]> Co-authored-by: haojinming <[email protected]>
Signed-off-by: pingyu <[email protected]>
@haojinming All comments are addressed. PTAL~
LGTM
Signed-off-by: Ping Yu <[email protected]>
LGTM~
Signed-off-by: Ping Yu <[email protected]>
Signed-off-by: Ping Yu <[email protected]>
/merge
This proposal introduces the technical design of RawKV Change Data Capture.
(Rendered).
Related issue: tikv/tikv#11745