Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ProxySQL incorrectly determining writer as reader from a restored AWS Aurora RDS Cluster #3962

Open
Levence opened this issue Sep 1, 2022 · 0 comments

Comments

@Levence
Copy link
Contributor

Levence commented Sep 1, 2022

ProxySQL version: 2.3.2-10-g8cd66cf

We're currently testing our AWS Aurora backup and restoration process, as part of this we have one writer and one reader in a cluster.

ProxySQL is provided with the writer and reader hostname on startup in two hostgroups (10=writer, 20=reader).

What we're seeing though is that on startup after the restore is that ProxySQL is moving the writer into the reader group immediately.

WARNING] AWS Aurora: setting host instance-one.cluster-two.ap-southeast-2.rds.amazonaws.com:3306 (part of cluster with writer_hostgroup=10) in a reader, moving from writer_hostgroup 10 to reader_hostgroup 20

mysql> select hostgroup_id, hostname, status from runtime_mysql_servers;
+--------------+---------------------------------------------------------------------------------------------------------------+--------+
| hostgroup_id | hostname                                                                                                      | status |
+--------------+---------------------------------------------------------------------------------------------------------------+--------+
| 20           | instance-two.cluster-two.ap-southeast-2.rds.amazonaws.com                                                     | ONLINE |
| 20           | instance-one.cluster-two.ap-southeast-2.rds.amazonaws.com                                                     | ONLINE |
+--------------+---------------------------------------------------------------------------------------------------------------+--------+
2 rows in set (0.03 sec)


We also noticed that in the mysql_server_aws_aurora_log table that the SESSION_ID for the master is listed as probably_former_MASTER_SESSION_ID

mysql> select hostname, error, session_id, server_id from mysql_server_aws_aurora_log order by success_time_us desc limit 10;
+---------------------------------------------------------------------------------------------------------------+-------+--------------------------------------+-----------------------------------------------------------------+
| hostname                                                                                                      | error | SESSION_ID                           | SERVER_ID                                                       |
+---------------------------------------------------------------------------------------------------------------+-------+--------------------------------------+-----------------------------------------------------------------+
| instance-two.cluster-two.ap-southeast-2.rds.amazonaws.com                                                     | NULL  | probably_former_MASTER_SESSION_ID    | instance-one                                                    |
| instance-two.cluster-two.ap-southeast-2.rds.amazonaws.com                                                     | NULL  | 2d9e2da9-220b-4ef7-a36e-6cc8a6fb090b | instance-two                                                    |
| instance-two.cluster-two.ap-southeast-2.rds.amazonaws.com                                                     | NULL  | probably_former_MASTER_SESSION_ID    | instance-one                                                    |
| instance-two.cluster-two.ap-southeast-2.rds.amazonaws.com                                                     | NULL  | 2d9e2da9-220b-4ef7-a36e-6cc8a6fb090b | instance-two                                                    |
| instance-two.cluster-two.ap-southeast-2.rds.amazonaws.com                                                     | NULL  | probably_former_MASTER_SESSION_ID    | instance-one                                                    |
| instance-two.cluster-two.ap-southeast-2.rds.amazonaws.com                                                     | NULL  | 2d9e2da9-220b-4ef7-a36e-6cc8a6fb090b | instance-two                                                    |
| instance-two.cluster-two.ap-southeast-2.rds.amazonaws.com                                                     | NULL  | probably_former_MASTER_SESSION_ID    | instance-one                                                    |
| instance-two.cluster-two.ap-southeast-2.rds.amazonaws.com                                                     | NULL  | 2d9e2da9-220b-4ef7-a36e-6cc8a6fb090b | instance-two                                                    |
| instance-two.cluster-two.ap-southeast-2.rds.amazonaws.com                                                     | NULL  | probably_former_MASTER_SESSION_ID    | instance-one                                                    |
| instance-two.cluster-two.ap-southeast-2.rds.amazonaws.com                                                     | NULL  | 2d9e2da9-220b-4ef7-a36e-6cc8a6fb090b | instance-two                                                    |
+---------------------------------------------------------------------------------------------------------------+-------+--------------------------------------+-----------------------------------------------------------------+
10 rows in set (0.32 sec)
​


After reading how the monitoring works in proxySQL we checked the INFORMATION_SCHEMA.REPLICA_HOST_STATUS and it lists both the old cluster instances and the new cluster instances, with two having MASTER_SESSION_ID set.

mysql> select server_id, session_id, LAST_UPDATE_TIMESTAMP from INFORMATION_SCHEMA.REPLICA_HOST_STATUS;
+-----------------------------------------------------------------+--------------------------------------+----------------------------+
| server_id                                                       | session_id                           | LAST_UPDATE_TIMESTAMP      |
+-----------------------------------------------------------------+--------------------------------------+----------------------------+
| previous-cluster-instance-one                                   | MASTER_SESSION_ID                    | 2022-08-31 02:05:43.000000 |
| previous-cluster-instance-two                                   | 43cf0527-e07a-441c-b8c2-c746a47e3b14 | 2022-08-31 02:05:43.000000 |
| instance-one                                                    | MASTER_SESSION_ID                    | 2022-08-31 11:27:12.032852 |
| instance-two                                                    | 2d9e2da9-220b-4ef7-a36e-6cc8a6fb090b | 2022-08-31 11:27:11.340061 |
+-----------------------------------------------------------------+--------------------------------------+----------------------------+
4 rows in set (0.03 sec)


I can see in the code that there's something to handle this situation:

proxysql/lib/MySQL_Monitor.cpp

Lines 4467 to 4471 in 301bb09

"IF("
"SESSION_ID = 'MASTER_SESSION_ID' AND "
"SERVER_ID <> (SELECT SERVER_ID FROM INFORMATION_SCHEMA.REPLICA_HOST_STATUS WHERE SESSION_ID = 'MASTER_SESSION_ID' ORDER BY LAST_UPDATE_TIMESTAMP ASC LIMIT 1), "
"'probably_former_MASTER_SESSION_ID', SESSION_ID"
") SESSION_ID, " // it seems that during a failover, the old writer can keep MASTER_SESSION_ID because not updated


However we're getting the opposite result to what you'd expect, our new writer is being labelled with the session probably_former_MASTER_SESSION_ID which I think means that the writer string check is then failing:

proxysql/lib/MySQL_Monitor.cpp

Lines 4949 to 4951 in 301bb09

if (strcmp(hse->session_id,"MASTER_SESSION_ID")==0) {
is_writer = true;
}


After reading that SQL query used to determine the session ID we think that actually conditional check is backwards, we're getting the probably_former_MASTER_SESSION_ID value when the SERVER_ID doesn't match the oldest (which should be the old serer) instead we should return that when the SERVER_ID is anything but the newest (this if course assumes only one writer, but the previous code assumes only a single previous writer).

Based on the description of b8c027e to resolve #3484 we are seeing the opposite of "is not considered a writer if it is not the last updated", where the earliest updated is always considered the writer, as it seems like the LAST_UPDATE_TIMESTAMP ASC should be a DESC.

SELECT
  SERVER_ID
FROM
  INFORMATION_SCHEMA.REPLICA_HOST_STATUS
WHERE
  SESSION_ID = 'MASTER_SESSION_ID'
ORDER BY
  LAST_UPDATE_TIMESTAMP ASC
LIMIT
  1

I've tested this change from ASC to DESC against the same aurora instance and everything works as expected now.

mysql> select hostname, error, session_id, server_id from mysql_server_aws_aurora_log order by success_time_us desc limit 10;
+---------------------------------------------------------------------------------------------------------------+-------+--------------------------------------+-----------------------------------------------------------------+
| hostname                                                                                                      | error | SESSION_ID                           | SERVER_ID                                                       |
+---------------------------------------------------------------------------------------------------------------+-------+--------------------------------------+-----------------------------------------------------------------+
| instance-one.cluster-two.ap-southeast-2.rds.amazonaws.com                                                     | NULL  | MASTER_SESSION_ID                    | instance-one                                                    |
| instance-one.cluster-two.ap-southeast-2.rds.amazonaws.com                                                     | NULL  | 2d9e2da9-220b-4ef7-a36e-6cc8a6fb090b | instance-two                                                    |
| instance-one.cluster-two.ap-southeast-2.rds.amazonaws.com                                                     | NULL  | MASTER_SESSION_ID                    | instance-one                                                    |
| instance-one.cluster-two.ap-southeast-2.rds.amazonaws.com                                                     | NULL  | 2d9e2da9-220b-4ef7-a36e-6cc8a6fb090b | instance-two                                                    |
| instance-one.cluster-two.ap-southeast-2.rds.amazonaws.com                                                     | NULL  | MASTER_SESSION_ID                    | instance-one                                                    |
| instance-one.cluster-two.ap-southeast-2.rds.amazonaws.com                                                     | NULL  | 2d9e2da9-220b-4ef7-a36e-6cc8a6fb090b | instance-two                                                    |
| instance-two.cluster-two.ap-southeast-2.rds.amazonaws.com                                                     | NULL  | MASTER_SESSION_ID                    | instance-one                                                    |
| instance-two.cluster-two.ap-southeast-2.rds.amazonaws.com                                                     | NULL  | 2d9e2da9-220b-4ef7-a36e-6cc8a6fb090b | instance-two                                                    |
| instance-two.cluster-two.ap-southeast-2.rds.amazonaws.com                                                     | NULL  | MASTER_SESSION_ID                    | instance-one                                                    |
| instance-two.cluster-two.ap-southeast-2.rds.amazonaws.com                                                     | NULL  | 2d9e2da9-220b-4ef7-a36e-6cc8a6fb090b | instance-two                                                    |
+---------------------------------------------------------------------------------------------------------------+-------+--------------------------------------+-----------------------------------------------------------------+
10 rows in set (0.03 sec)

mysql> select hostgroup_id, hostname, status from runtime_mysql_servers;
+--------------+---------------------------------------------------------------------------------------------------------------+--------+
| hostgroup_id | hostname                                                                                                      | status |
+--------------+---------------------------------------------------------------------------------------------------------------+--------+
| 10           | instance-one.cluster-two.ap-southeast-2.rds.amazonaws.com                                                     | ONLINE |
| 20           | instance-two.cluster-two.ap-southeast-2.rds.amazonaws.com                                                     | ONLINE |
+--------------+---------------------------------------------------------------------------------------------------------------+--------+
2 rows in set (0.03 sec)

This is replicatable by performing the following steps:

  • Launch an AWS Aurora RDS MySQL Cluster with Engine Version 2.07.7 (LTS)
  • Create a snapshot of the AWS Aurora RDS MySQL Cluster
  • Launch a new AWS Aurora RDS MySQL Cluster based on the snapshot

After feedback from AWS Support they've confirmed this is a known issue within their internal team and has been addressed in 2.09.1. Prior to the fixed version both information_schema.replica_host_status and mysql.ro_replica_status tables are based off of persisted data and are included in the cluster snapshot.

logs-fixed.log
logs-original.log

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant