Skip to content
This repository has been archived by the owner on Jun 23, 2022. It is now read-only.

fix: support retry when io write incompletely #818

Merged
merged 20 commits into from
Apr 26, 2021

Conversation

foreverneverer
Copy link
Contributor

@foreverneverer foreverneverer commented Apr 19, 2021

apache/incubator-pegasus#659 has reported the io write operation bug and only retry can decrease the probability of occurrence. This pr refer stackovflow: handling-incomplete-write-calls , rocksdb and add the retry operation.

Note:

  • This pr add FAIL_POINT_INJECT_VOID_F macro to mock pwrite to reproduce the bug

@foreverneverer foreverneverer marked this pull request as ready for review April 19, 2021 08:30
@foreverneverer foreverneverer changed the title fix: support retry when io write failed fix: support retry when io write incompletely Apr 19, 2021
Copy link
Contributor

@zhangyifan27 zhangyifan27 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you could look at the implementation of Kudu: https://github.com/apache/kudu/blob/6983d25c35df8569c1851bf27e2c75e98036c10d/src/kudu/util/env_posix.cc#L490

When errno == EINTR, Kudu would keeps retrying util other errors occur or write successfully, and also, keeps retrying until the number of remaining bytes (unwritten bytes) is 0.

In addition, it's better to test it on a real cluster to see if this issue could be resolved.

src/aio/native_linux_aio_provider.cpp Outdated Show resolved Hide resolved
src/aio/native_linux_aio_provider.cpp Outdated Show resolved Hide resolved
@foreverneverer
Copy link
Contributor Author

foreverneverer commented Apr 20, 2021

I think you could look at the implementation of Kudu: https://github.com/apache/kudu/blob/6983d25c35df8569c1851bf27e2c75e98036c10d/src/kudu/util/env_posix.cc#L490

When errno == EINTR, Kudu would keeps retrying util other errors occur or write successfully, and also, keeps retrying until the number of remaining bytes (unwritten bytes) is 0.

In addition, it's better to test it on a real cluster to see if this issue could be resolved.

Like this ? @zhangyifan27

ssize_t
write_with_retry (int fd, const void* buf, size_t size)
{
    ssize_t ret;
    while (size > 0) {
        do
        {
             ret = write(fd, buf, size);
        } while ((ret < 0) && (errno == EINTR));
        if (ret < 0)
            return ret;
        size -= ret;
        buf += ret;
    }
    return 0;
}

include/dsn/utility/fail_point.h Outdated Show resolved Hide resolved
include/dsn/utility/fail_point.h Outdated Show resolved Hide resolved
src/aio/native_linux_aio_provider.cpp Outdated Show resolved Hide resolved
src/aio/native_linux_aio_provider.cpp Outdated Show resolved Hide resolved
src/aio/native_linux_aio_provider.cpp Outdated Show resolved Hide resolved
levy5307
levy5307 previously approved these changes Apr 23, 2021
src/aio/test/aio.cpp Outdated Show resolved Hide resolved
@foreverneverer foreverneverer marked this pull request as draft April 23, 2021 09:25
@foreverneverer
Copy link
Contributor Author

foreverneverer commented Apr 25, 2021

I have test write incomplete case that the log is too large(>=2147479552 bytes) in test cluster, the result show the retry will resolve the problem:

// this request log only show once, is also say the retry is sucessful
write incomplete, request_size=2816868364, total_write_size=2147479552, this_write_size=2147479552, and will retry it.

however maybe since the log is too large so that the tcmalloc report problem when loop to write such a large amount of data:

asio write faitcmalloc: large alloc 3501096960 bytes == 0x7570e8000 @  0x7faab64b8e9b 0x7faab64dc8cb 0x4c42c2 0x7faab92ca929 0x7faab92cec18 0x7faab92cee04 0x7faab92ccf39 0x7faab8f25aeb 0x7faab8f3acf9 0x7faab8f436b1 0x7faab8f43edc 0x7faab92cac3b 0x7faab9232581 0x7faab924c4fa 0x7faab924c6fd 0x7faab4fbffd0
led: Connection reset by peer

But this is another problem and not the pr target.

@foreverneverer foreverneverer marked this pull request as ready for review April 25, 2021 07:19
src/aio/test/aio.cpp Outdated Show resolved Hide resolved
zhangyifan27
zhangyifan27 previously approved these changes Apr 26, 2021
levy5307
levy5307 previously approved these changes Apr 26, 2021
@foreverneverer foreverneverer dismissed stale reviews from levy5307 and zhangyifan27 via 4c1ca6d April 26, 2021 05:47
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants