ProxySQL incorrectly determining writer as reader from a restored AWS Aurora RDS Cluster #3962

Levence · 2022-09-01T00:10:33Z

ProxySQL version: 2.3.2-10-g8cd66cf

We're currently testing our AWS Aurora backup and restoration process, as part of this we have one writer and one reader in a cluster.

ProxySQL is provided with the writer and reader hostname on startup in two hostgroups (10=writer, 20=reader).

What we're seeing though is that on startup after the restore is that ProxySQL is moving the writer into the reader group immediately.

WARNING] AWS Aurora: setting host instance-one.cluster-two.ap-southeast-2.rds.amazonaws.com:3306 (part of cluster with writer_hostgroup=10) in a reader, moving from writer_hostgroup 10 to reader_hostgroup 20

mysql> select hostgroup_id, hostname, status from runtime_mysql_servers;
+--------------+---------------------------------------------------------------------------------------------------------------+--------+
| hostgroup_id | hostname                                                                                                      | status |
+--------------+---------------------------------------------------------------------------------------------------------------+--------+
| 20           | instance-two.cluster-two.ap-southeast-2.rds.amazonaws.com                                                     | ONLINE |
| 20           | instance-one.cluster-two.ap-southeast-2.rds.amazonaws.com                                                     | ONLINE |
+--------------+---------------------------------------------------------------------------------------------------------------+--------+
2 rows in set (0.03 sec)

We also noticed that in the mysql_server_aws_aurora_log table that the SESSION_ID for the master is listed as probably_former_MASTER_SESSION_ID

mysql> select hostname, error, session_id, server_id from mysql_server_aws_aurora_log order by success_time_us desc limit 10;
+---------------------------------------------------------------------------------------------------------------+-------+--------------------------------------+-----------------------------------------------------------------+
| hostname                                                                                                      | error | SESSION_ID                           | SERVER_ID                                                       |
+---------------------------------------------------------------------------------------------------------------+-------+--------------------------------------+-----------------------------------------------------------------+
| instance-two.cluster-two.ap-southeast-2.rds.amazonaws.com                                                     | NULL  | probably_former_MASTER_SESSION_ID    | instance-one                                                    |
| instance-two.cluster-two.ap-southeast-2.rds.amazonaws.com                                                     | NULL  | 2d9e2da9-220b-4ef7-a36e-6cc8a6fb090b | instance-two                                                    |
| instance-two.cluster-two.ap-southeast-2.rds.amazonaws.com                                                     | NULL  | probably_former_MASTER_SESSION_ID    | instance-one                                                    |
| instance-two.cluster-two.ap-southeast-2.rds.amazonaws.com                                                     | NULL  | 2d9e2da9-220b-4ef7-a36e-6cc8a6fb090b | instance-two                                                    |
| instance-two.cluster-two.ap-southeast-2.rds.amazonaws.com                                                     | NULL  | probably_former_MASTER_SESSION_ID    | instance-one                                                    |
| instance-two.cluster-two.ap-southeast-2.rds.amazonaws.com                                                     | NULL  | 2d9e2da9-220b-4ef7-a36e-6cc8a6fb090b | instance-two                                                    |
| instance-two.cluster-two.ap-southeast-2.rds.amazonaws.com                                                     | NULL  | probably_former_MASTER_SESSION_ID    | instance-one                                                    |
| instance-two.cluster-two.ap-southeast-2.rds.amazonaws.com                                                     | NULL  | 2d9e2da9-220b-4ef7-a36e-6cc8a6fb090b | instance-two                                                    |
| instance-two.cluster-two.ap-southeast-2.rds.amazonaws.com                                                     | NULL  | probably_former_MASTER_SESSION_ID    | instance-one                                                    |
| instance-two.cluster-two.ap-southeast-2.rds.amazonaws.com                                                     | NULL  | 2d9e2da9-220b-4ef7-a36e-6cc8a6fb090b | instance-two                                                    |
+---------------------------------------------------------------------------------------------------------------+-------+--------------------------------------+-----------------------------------------------------------------+
10 rows in set (0.32 sec)

After reading how the monitoring works in proxySQL we checked the INFORMATION_SCHEMA.REPLICA_HOST_STATUS and it lists both the old cluster instances and the new cluster instances, with two having MASTER_SESSION_ID set.

mysql> select server_id, session_id, LAST_UPDATE_TIMESTAMP from INFORMATION_SCHEMA.REPLICA_HOST_STATUS;
+-----------------------------------------------------------------+--------------------------------------+----------------------------+
| server_id                                                       | session_id                           | LAST_UPDATE_TIMESTAMP      |
+-----------------------------------------------------------------+--------------------------------------+----------------------------+
| previous-cluster-instance-one                                   | MASTER_SESSION_ID                    | 2022-08-31 02:05:43.000000 |
| previous-cluster-instance-two                                   | 43cf0527-e07a-441c-b8c2-c746a47e3b14 | 2022-08-31 02:05:43.000000 |
| instance-one                                                    | MASTER_SESSION_ID                    | 2022-08-31 11:27:12.032852 |
| instance-two                                                    | 2d9e2da9-220b-4ef7-a36e-6cc8a6fb090b | 2022-08-31 11:27:11.340061 |
+-----------------------------------------------------------------+--------------------------------------+----------------------------+
4 rows in set (0.03 sec)

I can see in the code that there's something to handle this situation:

proxysql/lib/MySQL_Monitor.cpp

Lines 4467 to 4471 in 301bb09

    
           "IF(" 
        
           	"SESSION_ID = 'MASTER_SESSION_ID' AND " 
        
           	"SERVER_ID <> (SELECT SERVER_ID FROM INFORMATION_SCHEMA.REPLICA_HOST_STATUS WHERE SESSION_ID = 'MASTER_SESSION_ID' ORDER BY LAST_UPDATE_TIMESTAMP ASC LIMIT 1), " 
        
           	"'probably_former_MASTER_SESSION_ID', SESSION_ID" 
        
           ") SESSION_ID, " // it seems that during a failover, the old writer can keep MASTER_SESSION_ID because not updated

However we're getting the opposite result to what you'd expect, our new writer is being labelled with the session probably_former_MASTER_SESSION_ID which I think means that the writer string check is then failing:

proxysql/lib/MySQL_Monitor.cpp

Lines 4949 to 4951 in 301bb09

    
           if (strcmp(hse->session_id,"MASTER_SESSION_ID")==0) { 
        
           	is_writer = true; 
        
           }

After reading that SQL query used to determine the session ID we think that actually conditional check is backwards, we're getting the probably_former_MASTER_SESSION_ID value when the SERVER_ID doesn't match the oldest (which should be the old serer) instead we should return that when the SERVER_ID is anything but the newest (this if course assumes only one writer, but the previous code assumes only a single previous writer).

Based on the description of b8c027e to resolve #3484 we are seeing the opposite of "is not considered a writer if it is not the last updated", where the earliest updated is always considered the writer, as it seems like the LAST_UPDATE_TIMESTAMP ASC should be a DESC.

SELECT
  SERVER_ID
FROM
  INFORMATION_SCHEMA.REPLICA_HOST_STATUS
WHERE
  SESSION_ID = 'MASTER_SESSION_ID'
ORDER BY
  LAST_UPDATE_TIMESTAMP ASC
LIMIT
  1

I've tested this change from ASC to DESC against the same aurora instance and everything works as expected now.

mysql> select hostname, error, session_id, server_id from mysql_server_aws_aurora_log order by success_time_us desc limit 10;
+---------------------------------------------------------------------------------------------------------------+-------+--------------------------------------+-----------------------------------------------------------------+
| hostname                                                                                                      | error | SESSION_ID                           | SERVER_ID                                                       |
+---------------------------------------------------------------------------------------------------------------+-------+--------------------------------------+-----------------------------------------------------------------+
| instance-one.cluster-two.ap-southeast-2.rds.amazonaws.com                                                     | NULL  | MASTER_SESSION_ID                    | instance-one                                                    |
| instance-one.cluster-two.ap-southeast-2.rds.amazonaws.com                                                     | NULL  | 2d9e2da9-220b-4ef7-a36e-6cc8a6fb090b | instance-two                                                    |
| instance-one.cluster-two.ap-southeast-2.rds.amazonaws.com                                                     | NULL  | MASTER_SESSION_ID                    | instance-one                                                    |
| instance-one.cluster-two.ap-southeast-2.rds.amazonaws.com                                                     | NULL  | 2d9e2da9-220b-4ef7-a36e-6cc8a6fb090b | instance-two                                                    |
| instance-one.cluster-two.ap-southeast-2.rds.amazonaws.com                                                     | NULL  | MASTER_SESSION_ID                    | instance-one                                                    |
| instance-one.cluster-two.ap-southeast-2.rds.amazonaws.com                                                     | NULL  | 2d9e2da9-220b-4ef7-a36e-6cc8a6fb090b | instance-two                                                    |
| instance-two.cluster-two.ap-southeast-2.rds.amazonaws.com                                                     | NULL  | MASTER_SESSION_ID                    | instance-one                                                    |
| instance-two.cluster-two.ap-southeast-2.rds.amazonaws.com                                                     | NULL  | 2d9e2da9-220b-4ef7-a36e-6cc8a6fb090b | instance-two                                                    |
| instance-two.cluster-two.ap-southeast-2.rds.amazonaws.com                                                     | NULL  | MASTER_SESSION_ID                    | instance-one                                                    |
| instance-two.cluster-two.ap-southeast-2.rds.amazonaws.com                                                     | NULL  | 2d9e2da9-220b-4ef7-a36e-6cc8a6fb090b | instance-two                                                    |
+---------------------------------------------------------------------------------------------------------------+-------+--------------------------------------+-----------------------------------------------------------------+
10 rows in set (0.03 sec)

mysql> select hostgroup_id, hostname, status from runtime_mysql_servers;
+--------------+---------------------------------------------------------------------------------------------------------------+--------+
| hostgroup_id | hostname                                                                                                      | status |
+--------------+---------------------------------------------------------------------------------------------------------------+--------+
| 10           | instance-one.cluster-two.ap-southeast-2.rds.amazonaws.com                                                     | ONLINE |
| 20           | instance-two.cluster-two.ap-southeast-2.rds.amazonaws.com                                                     | ONLINE |
+--------------+---------------------------------------------------------------------------------------------------------------+--------+
2 rows in set (0.03 sec)

This is replicatable by performing the following steps:

Launch an AWS Aurora RDS MySQL Cluster with Engine Version 2.07.7 (LTS)
Create a snapshot of the AWS Aurora RDS MySQL Cluster
Launch a new AWS Aurora RDS MySQL Cluster based on the snapshot

After feedback from AWS Support they've confirmed this is a known issue within their internal team and has been addressed in 2.09.1. Prior to the fixed version both information_schema.replica_host_status and mysql.ro_replica_status tables are based off of persisted data and are included in the cluster snapshot.

logs-fixed.log
logs-original.log

The text was updated successfully, but these errors were encountered:

… for sysown#3962

Levence added a commit to Levence/proxysql that referenced this issue Sep 1, 2022

fix(issue): ensure that most recent replica_host is treated as master…

346cb4b

… for sysown#3962

This was referenced Sep 1, 2022

fix(issue): ensure that most recent replica_host is treated as master #3963

Closed

fix(issue): ensure that most recent replica_host is treated as master Levence/proxysql#1

Merged

Levence added a commit to Levence/proxysql that referenced this issue Sep 1, 2022

fix(issue): ensure that most recent replica_host is treated as master…

0ab0757

… for sysown#3962

Levence mentioned this issue Sep 1, 2022

fix(issue): ensure that most recent replica_host is treated as master #3964

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ProxySQL incorrectly determining writer as reader from a restored AWS Aurora RDS Cluster #3962

ProxySQL incorrectly determining writer as reader from a restored AWS Aurora RDS Cluster #3962

Levence commented Sep 1, 2022

ProxySQL incorrectly determining writer as reader from a restored AWS Aurora RDS Cluster #3962

ProxySQL incorrectly determining writer as reader from a restored AWS Aurora RDS Cluster #3962

Comments

Levence commented Sep 1, 2022