Changes current handling for replication lag in favor of setting lagging servers to SHUNNED state #3533

Merged
merged 13 commits into from
Sep 14, 2021

Collaborator

@JavierJF JavierJF commented Jul 22, 2021

This pull request introduces several changes to how lag in 'Group Replication'
is handled.

Old behavior

Servers whose lag was above the threshold determined by
'mysql-groupreplication_max_transactions_behind_count' and that had read_only=1
were set to 'OFFLINE' until replication caught up.

New behavior

Servers whose lag is above the threshold determined by
'mysql-groupreplication_max_transactions_behind_count' are 'SHUNNED', depending
on the value of the newly introduced variable:
'mysql-monitor_groupreplication_max_transaction_behind_for_read_only'.

This variable has three possible values:

  • '0': Only servers with read_only=0 are placed as 'SHUNNED'.
  • '1': Only servers with read_only=1 are placed as 'SHUNNED' (default).
  • '2': Both servers with read_only=1 and read_only=0 are placed as 'SHUNNED'.
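
The three values can be read as a small decision rule. The following Python sketch only illustrates the description above and is not ProxySQL's actual code: the function name `should_shun` and its parameters are hypothetical, and it assumes that "above the threshold" means strictly greater than.

```python
def should_shun(transactions_behind: int, max_behind: int,
                read_only: bool, mode: int) -> bool:
    """Decide whether a lagging server would be SHUNNED.

    mode mirrors the three documented values of
    'mysql-monitor_groupreplication_max_transaction_behind_for_read_only':
      0 -> only servers with read_only=0
      1 -> only servers with read_only=1 (default)
      2 -> servers with either read_only value
    """
    if mode not in (0, 1, 2):
        raise ValueError(f"unsupported mode: {mode}")
    # Assumption: the threshold itself is still acceptable lag.
    if transactions_behind <= max_behind:
        return False
    if mode == 2:
        return True
    # mode 1 targets read_only=1 servers, mode 0 targets read_only=0 servers.
    return read_only == (mode == 1)
```

With the default value '1', a lagging writer (read_only=0) is left alone, while a lagging reader is shunned.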

In addition to this change in what happens when a server exceeds
'groupreplication_max_transactions_behind_count', it is now also possible to set servers configured as writers
to the 'OFFLINE_SOFT' state while preserving them in the 'writer_hostgroup'.

To achieve this, take a server that is configured as a 'writer',
i.e. one whose 'hostgroup_id' is the 'writer_hostgroup', set its status to
'OFFLINE_SOFT', and then issue a 'LOAD MYSQL SERVERS TO RUNTIME'. The
server should be preserved in the writer hostgroup, but its status should change
to 'OFFLINE_SOFT'.

Situation description

We have 3 servers, '2' writers and '1' reader, for a MySQL Group Replication
cluster of 3 nodes. The servers are configured in ProxySQL in the following way:

mysql> select * from mysql_servers;
+--------------+-----------+------+-----------+--------+--------+-------------+-----------------+---------------------+---------+----------------+---------+
| hostgroup_id | hostname  | port | gtid_port | status | weight | compression | max_connections | max_replication_lag | use_ssl | max_latency_ms | comment |
+--------------+-----------+------+-----------+--------+--------+-------------+-----------------+---------------------+---------+----------------+---------+
| 3272         | 127.2.1.1 | 3306 | 0         | ONLINE | 1      | 0           | 1000            | 0                   | 0       | 0              |         |
| 3272         | 127.2.1.2 | 3306 | 0         | ONLINE | 1      | 0           | 1000            | 0                   | 0       | 0              |         |
| 3273         | 127.2.1.3 | 3306 | 0         | ONLINE | 1      | 0           | 1000            | 0                   | 0       | 0              |         |
+--------------+-----------+------+-----------+--------+--------+-------------+-----------------+---------------------+---------+----------------+---------+
3 rows in set (0.00 sec)

This results in the following cluster state in the 'runtime_mysql_servers' table in
ProxySQL:

mysql: [Warning] Using a password on the command line interface can be insecure.
+--------------+-----------+------+-----------+--------+-----------------+---------+
| hostgroup_id | hostname  | port | gtid_port | status | max_connections | comment |
+--------------+-----------+------+-----------+--------+-----------------+---------+
| 3272         | 127.2.1.1 | 3306 | 0         | ONLINE | 1000            |         |
| 3273         | 127.2.1.2 | 3306 | 0         | ONLINE | 1000            |         |
| 3273         | 127.2.1.1 | 3306 | 0         | ONLINE | 1000            |         |
| 3273         | 127.2.1.3 | 3306 | 0         | ONLINE | 1000            |         |
| 3272         | 127.2.1.2 | 3306 | 0         | ONLINE | 1000            |         |
+--------------+-----------+------+-----------+--------+-----------------+---------+

Now we want to set the writer '127.2.1.2' to OFFLINE_SOFT, so we simply set it
via ProxySQL Admin:

UPDATE mysql_servers SET status='OFFLINE_SOFT' WHERE hostname='127.2.1.2';

And we load mysql_servers to runtime:

LOAD MYSQL SERVERS TO RUNTIME;

The runtime_mysql_servers table should transition to the following state:

mysql: [Warning] Using a password on the command line interface can be insecure.
+--------------+-----------+------+-----------+--------------+-----------------+---------+
| hostgroup_id | hostname  | port | gtid_port | status       | max_connections | comment |
+--------------+-----------+------+-----------+--------------+-----------------+---------+
| 3272         | 127.2.1.1 | 3306 | 0         | ONLINE       | 1000            |         |
| 3273         | 127.2.1.2 | 3306 | 0         | OFFLINE_SOFT | 1000            |         |
| 3273         | 127.2.1.1 | 3306 | 0         | ONLINE       | 1000            |         |
| 3273         | 127.2.1.3 | 3306 | 0         | ONLINE       | 1000            |         |
| 3272         | 127.2.1.2 | 3306 | 0         | OFFLINE_SOFT | 1000            |         |
+--------------+-----------+------+-----------+--------------+-----------------+---------+

This change is performed without affecting any transactions currently
being executed on the server that has been placed in 'OFFLINE_SOFT'. To make the
server operational again, it only needs to be set back to the 'ONLINE' state.
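
For scripting this round trip, the two Admin statements used above can be generated by a small helper. This is a minimal sketch, not part of ProxySQL: the helper name `status_change_statements` and its input checks are hypothetical, and only the two statuses used in this description are accepted. Sending the statements to the ProxySQL Admin interface is left to whatever MySQL client is in use.

```python
import re

# Only the two statuses discussed in this description are allowed here.
ALLOWED_STATUSES = ("ONLINE", "OFFLINE_SOFT")
# Conservative hostname check so the value can be embedded in the statement.
_HOSTNAME_RE = re.compile(r"[A-Za-z0-9_.:-]+")


def status_change_statements(hostname: str, status: str) -> list:
    """Return the Admin statements that change a server's configured status
    and load the new configuration to runtime."""
    if status not in ALLOWED_STATUSES:
        raise ValueError(f"unsupported status: {status!r}")
    if not _HOSTNAME_RE.fullmatch(hostname):
        raise ValueError(f"unexpected characters in hostname: {hostname!r}")
    return [
        f"UPDATE mysql_servers SET status='{status}' WHERE hostname='{hostname}';",
        "LOAD MYSQL SERVERS TO RUNTIME;",
    ]


# Bring the writer from the example back into service.
for statement in status_change_statements("127.2.1.2", "ONLINE"):
    print(statement)
```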

JavierJF added 7 commits July 22, 2021 11:30
1. Introduced new global variable: 'monitor_groupreplication_max_transaction_behind_for_read_only',
   that modifies the behavior of 'group_replication_lag'.
2. Improved logic making use of 'MyHGC_find' instead of directly
   searching 'MyHostGroups' structure.
3. Improved 'group_replication_lag' documentation with new
   implementation updates.
4. Introduced changes to 'update_group_replication_set_writer'
   preserving writers placed in 'OFFLINE_SOFT' state.
@JavierJF
Collaborator Author

Retest this please.

@renecannao
Contributor

We need to merge this.
@JavierJF : can you please document it?

@bskllzh
Contributor

bskllzh commented Aug 19, 2021

@JavierJF @renecannao, I think there are not many scenarios for mgr with multiple master. For single master mode, there are more scenarios for using single master mode. And when the slave server is in a state where the delay exceeds the threshold, proxysql will immediately offline the slave server. I think this is inappropriate, because it will interrupt the business and cause the program to report an error. I submitted a PR with a fix that sets the server to OFFLINE_SOFT instead, so the lagging slave server is released softly. Please review PR #3473.

@renecannao
Contributor

Hi @bskllzh . Thank you for your feedback.
I think I get your point, and I absolutely agree with the problem you are pointing at.
Although, I think OFFLINE_SOFT is not the right approach.
Let me explain.

OFFLINE_SOFT is a configuration state: a server in this state is configured to not be used for new connections, among other things.
SHUNNED is instead a temporary status from which the server should automatically recover.
In other words, a server shouldn't automatically go from OFFLINE_SOFT to ONLINE, but should automatically go from SHUNNED to ONLINE.

In fact, PR #3473 would conflict with what was said previously: a server in OFFLINE_SOFT should never be returned to ONLINE automatically (and this is now implemented in PR #3533).

And when the slave server is in a state where the delay exceeds the threshold, proxysql will immediately offline the slave server. I think this is inappropriate, because it will interrupt the business and cause the program to report an error.

This is by design.
All the details are here: #774
We could set status to MYSQL_SERVER_STATUS_SHUNNED instead of MYSQL_SERVER_STATUS_SHUNNED_REPLICATION_LAG.
This would solve the issue of connections being closed immediately, but the hostgroup manager automatically tries to bring a server back online from shunned, whether or not replication lag is still present: this is why MYSQL_SERVER_STATUS_SHUNNED_REPLICATION_LAG exists and is different from MYSQL_SERVER_STATUS_SHUNNED.

Thinking about a possible solution, we could implement a mechanism in which a node is first configured as MYSQL_SERVER_STATUS_SHUNNED and then MYSQL_SERVER_STATUS_SHUNNED_REPLICATION_LAG if replication lag doesn't recover quickly.
The state MYSQL_SERVER_STATUS_SHUNNED_REPLICATION_LAG should be set within a short period of time after MYSQL_SERVER_STATUS_SHUNNED because otherwise the node could go ONLINE while it shouldn't.
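
The two-step idea can be modelled as a tiny state machine. The Python sketch below only illustrates the proposal and is not the ProxySQL implementation: `Status` and `next_status` are hypothetical names, one call stands for one monitor check, and "a short period of time" is assumed to mean "the next check that still sees lag". As a further simplification, lag is the only reason a server is shunned here.

```python
from enum import Enum


class Status(Enum):
    ONLINE = "ONLINE"
    OFFLINE_SOFT = "OFFLINE_SOFT"
    SHUNNED = "SHUNNED"
    SHUNNED_REPLICATION_LAG = "SHUNNED_REPLICATION_LAG"


def next_status(current: Status, lagging: bool) -> Status:
    """Status after one monitor check, given whether lag exceeds the threshold."""
    # OFFLINE_SOFT is a configured state: it never changes automatically.
    if current is Status.OFFLINE_SOFT:
        return Status.OFFLINE_SOFT
    # Lag is gone: the server may serve traffic again.
    if not lagging:
        return Status.ONLINE
    # First check that sees lag: back off without killing connections.
    if current is Status.ONLINE:
        return Status.SHUNNED
    # Lag persisted: escalate so the server cannot return to ONLINE while lagging.
    return Status.SHUNNED_REPLICATION_LAG


status = Status.ONLINE
for lagging in (True, True, False):
    status = next_status(status, lagging)
    print(lagging, status.name)
```

In this model a server that catches up after the first step returns to ONLINE without its connections ever being closed, while a manually configured OFFLINE_SOFT is left untouched.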

@bskllzh
Contributor

bskllzh commented Aug 25, 2021

@renecannao, in PR #3473 I added a mgr_replication_lag_status parameter (an MGR replication lag flag: true means lagging, false means not lagging) to distinguish whether the server was manually set to that state in the configuration or was changed to OFFLINE_SOFT because the MySQL slave was lagging.

When shunning a node due to replication lag in a group replication cluster,
we first shun the node as MYSQL_SERVER_STATUS_SHUNNED, then we shun it
as MYSQL_SERVER_STATUS_SHUNNED_REPLICATION_LAG.

This way we avoid (for a short time) killing connections on that backend.
This backing off from that server can give the server enough time to sync up.

See discussion in comments in #3533
@renecannao
Contributor

@bskllzh , thank you for pointing out the new flag.
I implemented what I suggested in my previous comment. See dd71fcd
What is your feedback on that?

About your comment:

I think there are not many scenarios for mgr with multiple master. For single master mode, there are more scenarios for using single master mode

Please note that the enhancements in this PR are driven by the needs of a customer who requires multiple writers, the ability to disable a node whether it is a writer or a reader (this is why we added a new variable to control this behavior), the ability to keep a configured OFFLINE_SOFT from being changed automatically, and no interference with the status of the same server in a hostgroup that is not part of the cluster.
This PR is a combination of enhancements, bug fixes, and new features.

@bskllzh
Contributor

bskllzh commented Aug 26, 2021

@renecannao, regarding dd71fcd: I think the code is too complicated and makes a simple thing complex. From the application's point of view, its connections should not be interrupted because the MySQL slave is lagging; they should be kept until the slave catches up. And it may take a long time, not just a short while, for the slave to catch up with the master, for example when a big transaction is involved.

@JavierJF JavierJF marked this pull request as ready for review September 13, 2021 11:22
@JavierJF JavierJF merged commit 4f94fd3 into v2.x Sep 14, 2021
@renecannao renecannao deleted the v2.x-gr_replication_lag_action branch April 30, 2022 16:10