This repository has been archived by the owner on Dec 7, 2018. It is now read-only.

support binlog + rocksdb group commit #23

Open
mdcallag opened this issue Dec 2, 2014 · 0 comments

mdcallag commented Dec 2, 2014

This requires more discussion but many of us are in favor of it.


It might be time to use the binlog as the source of truth, which avoids the complexity and inefficiency of keeping RocksDB and the binlog synchronized via internal XA. There are two modes for this. In durable mode, fsync is done after writing the binlog and the RocksDB WAL. In non-durable mode, fsync might be done only once per second and we rely on lossless semisync to recover. Binlog as source of truth might have been discussed on a MariaDB mailing list many years ago - https://lists.launchpad.net/maria-developers/msg01998.html

Some details are at http://yoshinorimatsunobu.blogspot.com/2014/04/semi-synchronous-replication-at-facebook.html

The new protocol will be:

  1. write binlog
  2. optionally sync binlog
  3. optionally wait for semisync ack
  4. commit rocksdb - this persists the GTID of the most recent commit within RocksDB and makes the transaction's changes visible to others
  5. optionally sync rocksdb WAL
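
The ordering of the five steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `Server`, `CommitOptions` and `commit` are hypothetical names, GTIDs are modelled as plain integers, and the binlog and engine are in-memory stand-ins rather than real MySQL or RocksDB calls.

```python
from dataclasses import dataclass, field


@dataclass
class CommitOptions:
    # Each flag corresponds to one of the "optionally" steps in the list above.
    sync_binlog: bool = False        # step 2
    wait_semisync_ack: bool = False  # step 3
    sync_rocksdb_wal: bool = False   # step 5


@dataclass
class Server:
    # Hypothetical in-memory stand-ins for the binlog and the storage engine.
    binlog: list = field(default_factory=list)       # (gtid, changes) in commit order
    engine_rows: dict = field(default_factory=dict)  # committed key/value data
    engine_last_gtid: int = 0                        # GTID persisted inside the engine
    log: list = field(default_factory=list)          # records the order of steps taken


def commit(server, gtid, changes, opts):
    # Step 1: the binlog is always written first.
    server.binlog.append((gtid, changes))
    server.log.append("write binlog")
    # Step 2: optional fsync of the binlog.
    if opts.sync_binlog:
        server.log.append("sync binlog")
    # Step 3: optional wait for a semisync acknowledgement.
    if opts.wait_semisync_ack:
        server.log.append("wait for semisync ack")
    # Step 4: engine commit; the data and the GTID are recorded together.
    server.engine_rows.update(changes)
    server.engine_last_gtid = gtid
    server.log.append("commit rocksdb")
    # Step 5: optional fsync of the engine WAL.
    if opts.sync_rocksdb_wal:
        server.log.append("sync rocksdb WAL")


server = Server()
commit(server, 1, {"k": "v"}, CommitOptions(wait_semisync_ack=True))
print(server.log)
```

The point of the sketch is only the ordering: the binlog write always precedes the engine commit, so the engine never holds a GTID that was not at least written to the binlog first.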

When lossless semisync is used we skip steps 2 and 5. When lossless semisync is not used we do step 2 and skip step 3. Step 5 is optional. Recovery in this case is done by:

  1. query RocksDB to determine GTID of last commit it has
  2. extract/replay transactions from binlog >= GTID from previous step
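
A minimal sketch of this two-step recovery, assuming GTIDs are plain integers and the binlog is a list of `(gtid, changes)` pairs in commit order. The `recover` function and the dict-based engine are hypothetical. The sketch replays only transactions after the engine's last GTID, on the assumption that the transaction carrying that GTID is already applied, since the engine commit persists the GTID together with the changes.

```python
def recover(engine, binlog):
    """Replay binlog transactions the engine has not yet committed.

    engine: dict with 'last_gtid' (int) and 'rows' (dict), a hypothetical stand-in
    binlog: list of (gtid, changes) tuples in commit order
    Returns the list of GTIDs that were replayed.
    """
    # Step 1: ask the engine for the GTID of its last commit.
    last = engine["last_gtid"]
    replayed = []
    # Step 2: replay everything in the binlog past that point.
    for gtid, changes in binlog:
        if gtid <= last:
            continue  # already committed in the engine
        engine["rows"].update(changes)
        engine["last_gtid"] = gtid
        replayed.append(gtid)
    return replayed


engine = {"last_gtid": 2, "rows": {"a": 1, "b": 2}}
binlog = [(1, {"a": 1}), (2, {"b": 2}), (3, {"c": 3}), (4, {"a": 9})]
print(recover(engine, binlog))  # [3, 4]
```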

When running in non-durable mode, one of the following is true after a crash, where the relation describes which side has more commits:

  1. rocksdb > binlog
  2. binlog > rocksdb
  3. rocksdb == binlog

If you know which state the server is in, then you can reach state 3. If in state 1 then append events to the binlog without running them on RocksDB. If in state 2 then replay events to RocksDB without recording them to the binlog. If in state 3 then do nothing. Both RocksDB and the binlog can tell us the last GTID they contain, and we can compare that with the binlog archived via lossless semisync to determine the state.
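
The three-way comparison can be written down as a small decision function. This is a sketch under assumptions: GTIDs are modelled as integers so they can be compared directly, and `plan_recovery` is a hypothetical helper that only reports the state, the action, and the range of GTIDs involved. Fetching the missing events (for example from the archived binlog) is left out.

```python
def plan_recovery(engine_gtid, binlog_gtid):
    """Classify the post-crash state and name the action that reaches state 3.

    engine_gtid: last GTID the storage engine reports
    binlog_gtid: last GTID the local binlog reports
    Returns (state, action, gtids), where gtids lists the transactions to handle.
    """
    if engine_gtid > binlog_gtid:
        # State 1: the engine is ahead, so the binlog must catch up.
        missing = list(range(binlog_gtid + 1, engine_gtid + 1))
        return (1, "append to binlog without applying to engine", missing)
    if binlog_gtid > engine_gtid:
        # State 2: the binlog is ahead, so the engine must catch up.
        missing = list(range(engine_gtid + 1, binlog_gtid + 1))
        return (2, "replay to engine without recording to binlog", missing)
    # State 3: both sides agree.
    return (3, "nothing to do", [])


print(plan_recovery(7, 5))
print(plan_recovery(5, 7))
print(plan_recovery(5, 5))
```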
@maykov maykov assigned maykov and unassigned maykov Dec 8, 2014