fix(replication): slave blocks until keepalive timer is reached when master is gone without fin/rst notification #2662

sryanyuan · 2024-11-14T09:41:46Z

If the master is lost without sending a FIN/RST, the replication thread blocks while receiving full-sync SST files until the keepalive timer expires. If a 'clusterx setnodes' command is executed in the meantime, it holds an exclusive lock until the replication thread has stopped, which blocks all other worker threads.

My solution is to enable a socket read timeout on the file descriptor that receives the SST files. If a timeout occurs and the replication thread is marked as stopped, the receive loop is aborted.
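As a rough illustration of the idea (not the code in this PR), the sketch below arms `SO_RCVTIMEO` on a socket and re-checks a stop flag each time a blocking `read()` times out. The names `RecvResult`, `SetRecvTimeout`, and `RecvExactly` are hypothetical and exist only for this example.

```cpp
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

#include <atomic>
#include <cerrno>
#include <cstddef>

// Hypothetical result codes for this sketch; not the kvrocks Status type.
enum class RecvResult { kDone, kStopped, kPeerClosed, kError };

// Arm a receive timeout so a blocking read() wakes up periodically even if
// the peer vanished without sending FIN/RST.
inline bool SetRecvTimeout(int fd, int timeout_ms) {
  timeval tv{};
  tv.tv_sec = timeout_ms / 1000;
  tv.tv_usec = (timeout_ms % 1000) * 1000;
  return setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv)) == 0;
}

// Read exactly `len` bytes. After every timeout the stop flag is checked, so
// the caller can abort the transfer instead of waiting for TCP keepalive.
inline RecvResult RecvExactly(int fd, char *buf, size_t len, const std::atomic<bool> &stop) {
  size_t got = 0;
  while (got < len) {
    ssize_t n = read(fd, buf + got, len - got);
    if (n > 0) {
      got += static_cast<size_t>(n);
      continue;
    }
    if (n == 0) return RecvResult::kPeerClosed;  // orderly shutdown (EOF)
    if (errno == EINTR) continue;                // interrupted, just retry
    if (errno == EAGAIN || errno == EWOULDBLOCK) {
      // The receive timeout fired: give up only if a stop was requested.
      if (stop.load()) return RecvResult::kStopped;
      continue;
    }
    return RecvResult::kError;  // real network error
  }
  return RecvResult::kDone;
}
```

With this shape, a thread that wants to stop replication sets the flag and waits at most one timeout interval for the receiver to return, rather than waiting for the keepalive timer.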

mapleFU · 2024-11-15T05:00:22Z

By the way, can we add a Go integration test in which the master is removed, with a connection timeout of about 5s?

sryanyuan · 2024-11-15T09:22:54Z

When EvbufferRead returns an error, we can't tell whether the cause is EOF or a real error, because errno is not set when the underlying I/O syscall returns EOF. So I added a new EOF status that breaks the loop when the connection has reached EOF.
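To make the distinction concrete, here is a minimal sketch built on plain `read()` rather than the project's `EvbufferRead` wrapper. The `ReadOutcome` enum and both helper functions are hypothetical names for this example, not the status codes added in this PR.

```cpp
#include <unistd.h>

#include <cerrno>
#include <cstddef>

// Hypothetical outcome codes for this sketch.
enum class ReadOutcome { kData, kEof, kTryAgain, kError };

// Classify the return value of a read-like syscall. A return of 0 means EOF,
// and errno carries no information in that case, so it must be handled
// before errno is consulted.
inline ReadOutcome ClassifyRead(ssize_t n, int saved_errno) {
  if (n > 0) return ReadOutcome::kData;
  if (n == 0) return ReadOutcome::kEof;
  if (saved_errno == EAGAIN || saved_errno == EWOULDBLOCK || saved_errno == EINTR) {
    return ReadOutcome::kTryAgain;  // timeout or interruption, safe to retry
  }
  return ReadOutcome::kError;  // genuine I/O error
}

// Perform one read and report how many bytes arrived plus the outcome.
inline ReadOutcome ReadOnce(int fd, char *buf, size_t len, size_t *nread) {
  errno = 0;
  ssize_t n = read(fd, buf, len);
  int saved_errno = errno;  // capture before any other call can overwrite it
  *nread = n > 0 ? static_cast<size_t>(n) : 0;
  return ClassifyRead(n, saved_errno);
}
```

A caller that loops on `kTryAgain` and breaks on `kEof` or `kError` cannot spin forever on a closed connection, which is the failure mode described above.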

434f148

@git-hulk @PragmaTwice please have a look

sonarcloud · 2024-11-16T14:09:45Z

Quality Gate failed

Failed conditions
37.9% Coverage on New Code (required ≥ 50%)

See analysis details on SonarQube Cloud

fix(replication): slave blocks until keepalive timer is reached when master is gone without fin/rst notification

522669b

torwig reviewed Nov 14, 2024

src/cluster/replication.cc (outdated, resolved)

chore: change sock_timeout_ms to static constexpr variable

b205dcf

PragmaTwice reviewed Nov 14, 2024

src/cluster/replication.cc (outdated, resolved)

PragmaTwice reviewed Nov 14, 2024

src/cluster/replication.cc (outdated, resolved)

tclxyxj25245 and others added 4 commits November 14, 2024 18:53

chore: change the default value for connect/recv

aa0b6c4

Merge branch 'unstable' into fix-slave-block

9d75c8f

feat(config): make slave_fullsync_connect_timeout and slave_fullsync_recv_timeout configurable

7943f3e

Merge branch 'fix-slave-block' of https://github.com/sryanyuan/kvrocks into fix-slave-block

1db057b

PragmaTwice reviewed Nov 15, 2024

kvrocks.conf (outdated, resolved)

mapleFU reviewed Nov 15, 2024

kvrocks.conf (outdated, resolved)

chore: rename config field name

19ea612

git-hulk reviewed Nov 15, 2024

src/cluster/replication.cc (outdated, resolved)

chore: unify connection timeout config

3425690

git-hulk previously approved these changes Nov 15, 2024

fix(replication): distinguish EOF and network error to avoid infinite loop

434f148

sryanyuan dismissed git-hulk’s stale review via 434f148 November 15, 2024 09:22

Merge branch 'unstable' into fix-slave-block

0198b63

git-hulk previously approved these changes Nov 15, 2024

git-hulk requested review from mapleFU, PragmaTwice, git-hulk and torwig November 15, 2024 10:48

PragmaTwice reviewed Nov 15, 2024

src/common/status.h (outdated, resolved)

PragmaTwice reviewed Nov 15, 2024

src/common/io_util.cc (outdated, resolved)

PragmaTwice reviewed Nov 15, 2024

src/common/io_util.cc (resolved)

PragmaTwice reviewed Nov 15, 2024

src/cluster/replication.cc (outdated, resolved)

chore: EvbufferRead can distinguish EOF and TryAgain from errors

4593603

Merge branch 'fix-slave-block' of https://github.com/sryanyuan/kvrocks into fix-slave-block

e07a231

sryanyuan dismissed git-hulk’s stale review via e07a231 November 15, 2024 14:31

Merge branch 'unstable' into fix-slave-block

eb71e7d

PragmaTwice reviewed Nov 15, 2024

src/common/io_util.cc (outdated, resolved)

Update src/common/io_util.cc

2744038

Co-authored-by: Twice <[email protected]>

PragmaTwice previously approved these changes Nov 16, 2024

Merge branch 'unstable' into fix-slave-block

63f73c4

git-hulk previously approved these changes Nov 16, 2024

Merge branch 'unstable' into fix-slave-block

3e705c9

torwig reviewed Nov 16, 2024

src/config/config.h (outdated, resolved)

PragmaTwice reviewed Nov 16, 2024

src/config/config.h (outdated, resolved)

Update src/config/config.h

fdacabe

Co-authored-by: Twice <[email protected]>

sryanyuan dismissed stale reviews from git-hulk and PragmaTwice via fdacabe November 16, 2024 12:20

torwig approved these changes Nov 16, 2024

PragmaTwice approved these changes Nov 16, 2024

PragmaTwice mentioned this pull request Nov 16, 2024

Add more test cases about replication and cluster #2671

Open

PragmaTwice merged commit 5e9db79 into apache:unstable Nov 16, 2024
31 of 32 checks passed
