Add more test cases about replication and cluster #2671

Closed
PragmaTwice opened this issue Nov 16, 2024 · 10 comments · Fixed by #2691

@PragmaTwice
Member

By the way, can we add a Go integration test where the master is removed and the connection timeout is about 5s?

Originally posted by @mapleFU in #2662 (comment)

@LindaSummer
Contributor

Hi @PragmaTwice ,

I want to follow up on this issue and learn about our replication logic.

Could this issue be assigned to me? 😊

Best Regards,
Edward

@PragmaTwice
Member Author

Sure! Assigned.

@LindaSummer
Contributor

Hi @sryanyuan ,

Sorry to bother you in this thread.

I'm trying to create an integration test case from your description.

If the master is lost while the replica is receiving full-sync SST files, the replication thread will block until the keepalive timer expires. If we execute a 'clusterx setnodes' command at the same time, it will hold an exclusive lock until the replication thread is stopped. This will cause all other worker threads to block.

But I'm not sure about the configuration of the master and slave.

Could you give me some advice on reproducing the problem? 😊

Best Regards,
Edward

@PragmaTwice
Member Author

cc @mapleFU could you take a look?

@git-hulk
Member

git-hulk commented Dec 9, 2024

The 'clusterx setnodes' command needs to stop the existing replication thread before switching to the new master. So it tries to stop the replication thread when adding the new master, but finds the replication thread hanging while fetching SST files, because the connect/read timeout isn't set and the master is lost.

After PR #2662, it will only wait for about 3s.

@LindaSummer
Contributor

Hi @git-hulk ,

Thanks for your detailed analysis! I will try this way today. 😊

By the way, is there a way to check the SST file fetching status, or to force SST file syncing, in cluster mode? 😊

Best Regards,
Edward

@git-hulk
Member

git-hulk commented Dec 9, 2024

@LindaSummer Thanks for your efforts. For this case, I think reproducing it with just an unreachable master IP should be good enough. Before PR #2662, switching to a new master is expected to hang. After #2662, it should return to normal after a few seconds (3-4). cc @PragmaTwice @sryanyuan

@sryanyuan
Contributor

sryanyuan commented Dec 9, 2024

  • Write some data into master
  • Add a slave
  • Before the full sync is completed, execute the following commands on the master to simulate network problems:
    service network stop
    echo `date "+%y-%m-%d %H:%M:%S"`---service network stop
    sleep 120s
    service network start
    echo `date "+%y-%m-%d %H:%M:%S"`---service network start
    

@LindaSummer
Contributor

Hi @sryanyuan ,

Thanks very much!

Got it! I will try it later today. 😊

Best Regards,
Edward

@sryanyuan
Contributor

Glad to help
