server_uuid changes on every Pod restart, producing a large number of Executed_GTID_Set entries #335

ninehills · 2021-12-17T06:35:27Z

The current Dockerfile regenerates a random uuid every time mysql starts:

uuid=$(cat /proc/sys/kernel/random/uuid)
printf '[auto]\nserver_uuid=%s' $uuid > /var/lib/mysql/auto.cnf

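As a result, every Pod restart adds entries to Executed_GTID_Set: with 3 nodes, each round of restarts yields 3 new uuids, and three primary/replica switchovers add 3 sets.

What would be the problem with using a fixed uuid here (for example, one tied to the server ordinal)?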
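One way to keep the uuid stable across restarts is to write auto.cnf only when it is missing. The following is a minimal sketch, not the project's actual startup script: the function name `ensure_server_uuid` is hypothetical, and the `uuidgen` fallback is an assumption for systems without `/proc/sys/kernel/random/uuid`.

```shell
# Hypothetical helper: create auto.cnf only if it does not exist yet,
# so server_uuid survives restarts for as long as the data directory does.
ensure_server_uuid() {
    datadir="$1"
    cnf="$datadir/auto.cnf"
    if [ -s "$cnf" ]; then
        return 0    # keep the existing server_uuid untouched
    fi
    # Assumption: fall back to uuidgen when the /proc entry is unavailable.
    uuid=$(cat /proc/sys/kernel/random/uuid 2>/dev/null || uuidgen)
    printf '[auto]\nserver_uuid=%s\n' "$uuid" > "$cnf"
}
```

It would be called as `ensure_server_uuid /var/lib/mysql` before mysqld starts. Since mysqld generates auto.cnf by itself when the file is absent, simply not overwriting or deleting the file has the same effect.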


github-actions · 2021-12-17T06:36:08Z

Hi! thanks for your contribution! great first issue!

runkecheng · 2021-12-17T07:03:48Z

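Hi, this code belongs to the helm version, which is no longer maintained. In the operator version, server-id is bound to the ordinal. We recommend using the operator version; see the installation documentation.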

ninehills · 2021-12-17T07:35:53Z

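> Hi, this code belongs to the helm version, which is no longer maintained. In the operator version, server-id is bound to the ordinal. We recommend using the operator version; see the installation documentation.

Thanks. Where is the MySQL Dockerfile for the Operator version? I need to port it to arm.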

runkecheng · 2021-12-17T07:42:55Z

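> > Hi, this code belongs to the helm version, which is no longer maintained. In the operator version, server-id is bound to the ordinal. We recommend using the operator version; see the installation documentation.
>
> Thanks. Where is the MySQL Dockerfile for the Operator version? I need to port it to arm.

They are in the repository root: Dockerfile (operator) and Dockerfile.sidecar (sidecar), plus hack/xenon/Dockerfile (xenon; the image branch should currently be master, and no PR has changed it yet).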

ninehills · 2021-12-22T04:02:42Z

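> Hi, this code belongs to the helm version, which is no longer maintained. In the operator version, server-id is bound to the ordinal. We recommend using the operator version; see the installation documentation.

In the helm version, server-id is also bound to the ordinal. What affects the gtid set is server_uuid in /var/lib/mysql/auto.cnf. I looked into it closely: the operator version uses the official percona image, whose startup script has no logic to delete /var/lib/mysql/auto.cnf, so server_uuid stays fixed. The image used by the helm version does delete auto.cnf at startup.

The benefit of a fixed server_uuid is that the number of gtid sets is bounded: as long as the data is not wiped, it is at most the number of replicas.

The drawback is that, under continuous writes to the cluster, when the Leader Pod crashes and then restarts and rejoins the cluster, there is a high probability that its Raft state is INVALID (the old leader's gtid sequence number is ahead of the other nodes; if server_uuid were not fixed, it would not enter this state).

This can be reproduced with a 3-node cluster under sustained heavy sysbench writes by deleting the leader pod. It reproduces essentially 100% of the time.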
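To check how many source uuids have accumulated in a gtid set, the entries can be counted directly from the set string. This is a rough sketch under the assumption that the input is a GTID set in the usual `uuid:interval[:interval...]` comma-separated form; the function name `count_gtid_sources` and the sample uuids are made up for illustration.

```shell
# Count the distinct source UUIDs in a GTID set string,
# e.g. the Executed_Gtid_Set value reported by the server.
count_gtid_sources() {
    printf '%s' "$1" |
        tr -d ' \n' |     # long sets are wrapped across lines
        tr ',' '\n' |     # one "uuid:intervals" entry per line
        cut -d: -f1 |     # keep only the UUID part
        sort -u |
        grep -c .         # count the non-empty lines
}

# Sample value with two source UUIDs (illustrative only).
sample='3e11fa47-71ca-11e1-9e33-c80aa9429562:1-5,
8a94f357-aab4-11df-86ab-c80aa9429562:1-3:7-9'
count_gtid_sources "$sample"
```

For the sample above this prints 2. With a fixed server_uuid the count should stay at or below the number of replicas, while regenerating the uuid on every start lets it grow with each restart.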

ninehills · 2021-12-22T04:07:56Z

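According to the official MySQL explanation:
https://bugs.mysql.com/bug.php?id=99370

After the primary crashes, even with semi-sync enabled, its data must be wiped before it rejoins the cluster. The Operator version has added a hack method to wipe the data, but it relies on manual operation, which adds operational cost.

The alternative offered officially is Group Replication.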

runkecheng · 2021-12-22T06:25:38Z

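Hi, thanks for your interest. After auto.cnf is deleted, data consistency cannot be guaranteed. In the abnormal failover scenario you describe, the INVALID state appears because the old leader produced local transactions, which is expected behavior for MySQL semi-sync. In this situation a human needs to decide whether that data should be kept.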

ninehills · 2021-12-22T07:29:50Z

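> Hi, thanks for your interest. After auto.cnf is deleted, data consistency cannot be guaranteed. In the abnormal failover scenario you describe, the INVALID state appears because the old leader produced local transactions, which is expected behavior for MySQL semi-sync. In this situation a human needs to decide whether that data should be kept.

Have you considered implementing this with Group Replication? Would that need support from xenon?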

ninehills · 2021-12-22T07:35:13Z

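> Hi, thanks for your interest. After auto.cnf is deleted, data consistency cannot be guaranteed. In the abnormal failover scenario you describe, the INVALID state appears because the old leader produced local transactions, which is expected behavior for MySQL semi-sync. In this situation a human needs to decide whether that data should be kept.

For most cases, would you consider automatically repairing INVALID Pods, as a configuration option (Auto Repair Invalid Pod)? When the option is enabled and a Pod becomes INVALID, its data is wiped automatically and the Pod is restarted, which triggers a rebuild.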
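The proposed option amounts to a small decision gate. The sketch below only illustrates that gate and is not an existing operator feature: the function name `plan_repair`, its arguments, and the action labels are all hypothetical, and it prints the planned action without touching any data.

```shell
# Hypothetical decision gate for the proposed "Auto Repair Invalid Pod" option.
# It only prints the action a controller would take.
plan_repair() {
    raft_state="$1"     # state reported for the Pod, e.g. INVALID
    auto_repair="$2"    # value of the proposed option: true or false
    if [ "$raft_state" != "INVALID" ]; then
        echo "none"
    elif [ "$auto_repair" = "true" ]; then
        echo "wipe-and-rebuild"   # wipe data, restart the Pod, trigger a rebuild
    else
        echo "manual-review"      # leave the decision to an operator
    fi
}
```

Leaving the option off by default would keep the current behavior, where a person decides whether the old leader's local transactions need to be preserved.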

andyli029 · 2021-12-22T08:07:14Z

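> > Hi, thanks for your interest. After auto.cnf is deleted, data consistency cannot be guaranteed. In the abnormal failover scenario you describe, the INVALID state appears because the old leader produced local transactions, which is expected behavior for MySQL semi-sync. In this situation a human needs to decide whether that data should be kept.
>
> Have you considered implementing this with Group Replication? Would that need support from xenon?

There is little need to integrate MGR into Xenon, because MGR already provides basic high availability on its own. But MGR still has quite a few problems of its own, and handling abnormal situations manually is more complex and troublesome.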

ninehills added the question Further information is requested label Dec 17, 2021
