Add more test cases about replication and cluster #2671

PragmaTwice · 2024-11-16T15:22:24Z

By the way, can we add a test with master removed and about 5s' connection timeout in go integration?

Originally posted by @mapleFU in #2662 (comment)

LindaSummer · 2024-11-22T02:31:01Z

Hi @PragmaTwice ,

I want to follow up on this issue and learn about our replication logic.

Could this issue be assigned to me? 😊

Best Regards,
Edward

PragmaTwice · 2024-11-22T02:51:38Z

Sure! Assigned.

LindaSummer · 2024-12-08T14:13:32Z

Hi @sryanyuan ,

Sorry to bother you in this thread.

I'm trying to create an integration test case from your description.

If the master is lost, the replication thread will block until the keepalive timer is reached when receiving full-sync SST files. At the same time, if we execute a 'clusterx setnodes' command, it will hold an exclusive lock until the replication thread is stopped. This will cause all other worker threads to block.

But I'm not sure about the configuration of the master and slave.

Could you give me some advice on reproducing the problem? 😊

Best Regards,
Edward

PragmaTwice · 2024-12-08T14:55:39Z

cc @mapleFU could you take a look?

git-hulk · 2024-12-09T02:23:40Z

The clusterx setnodes command needs to stop the existing replication thread before switching to the new master. So when the new master is added, it tries to stop the replication thread, but that thread is hanging while fetching SST files, because no connect/read timeout is set and the master is lost.

After PR #2662, it will wait for about 3s.

LindaSummer · 2024-12-09T02:47:55Z

Hi @git-hulk ,

Thanks for your detailed analysis! I will try this way today. 😊

By the way, is there a way to check the SST file fetching status, or to force an SST file sync, in cluster mode? 😊

Best Regards,
Edward

git-hulk · 2024-12-09T03:10:54Z

@LindaSummer Thanks for your efforts. For this case, I think it should be enough to reproduce with an unreachable master IP only. Before PR #2662, it's expected to hang while switching to a new master. After #2662, it should be back to normal after a few seconds (3-4). cc @PragmaTwice @sryanyuan

sryanyuan · 2024-12-09T03:21:11Z

1. Write some data into the master.
2. Add a slave.
3. Before the full sync is completed, execute the following commands on the master to simulate network problems:

service network stop
echo `date "+%y-%m-%d %H:%M:%S"`---service network stop
sleep 120s
service network start
echo `date "+%y-%m-%d %H:%M:%S"`---service network start

LindaSummer · 2024-12-09T04:12:45Z

Hi @sryanyuan ,

Thanks very much!

Got it! I will try it later today.😊

Best Regards,
Edward

sryanyuan · 2024-12-09T06:05:42Z

Glad to help

PragmaTwice assigned LindaSummer Nov 22, 2024

LindaSummer mentioned this issue Dec 11, 2024

test(integration): add integration test for master lost during syncing sst files. #2691

Merged

mapleFU closed this as completed in #2691 Dec 12, 2024
