
DAOS-16686 dfuse: Fix overlapping chunk reads. #15298

Open
wants to merge 48 commits into master
Conversation

@ashleypittman (Contributor) commented Oct 11, 2024

Handle concurrent reads in the chunk_read code. Rather than assuming
each slot only gets requested once, save the slot number as part of the
request and handle multiple requests.

This corrects the behaviour, avoids a crash when multiple readers read
the same file concurrently, and improves performance in this case.

Signed-off-by: Ashley Pittman <[email protected]>
Signed-off-by: Ashley Pittman <[email protected]>
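
As an illustration of the approach the description outlines, here is a minimal C sketch: each pending read records its slot number and is queued on the chunk, so the completion path can answer however many requests landed on the same slot. All names here (chunk_req, chunk_data, slot_reply_all, reply_read, SLOT_SIZE) are hypothetical stand-ins, not the actual dfuse structures.

```c
#include <stdio.h>
#include <stdlib.h>

#define SLOT_SIZE (128 * 1024) /* one 128 KiB slot within a cached chunk */

/* Stand-in for the reply path; in dfuse this would be a DFUSE_REPLY_*
 * macro answering the fuse request. */
static void
reply_read(void *req, const char *data, size_t len)
{
	printf("reply %p: %zu bytes from %p\n", req, len, (const void *)data);
}

/* One queued read: which slot it wants and which request to answer. */
struct chunk_req {
	struct chunk_req *next;
	int               slot;
	void             *req;
};

/* One chunk being fetched; reads that arrive while the fetch is in
 * flight are queued here, possibly several for the same slot. */
struct chunk_data {
	struct chunk_req *waiters;
	char             *buf;
};

/* Completion path: answer every queued request using its saved slot
 * number, rather than assuming each slot was requested exactly once. */
static void
slot_reply_all(struct chunk_data *cd)
{
	struct chunk_req *cr = cd->waiters;

	cd->waiters = NULL;
	while (cr != NULL) {
		struct chunk_req *next = cr->next;

		reply_read(cr->req, cd->buf + cr->slot * SLOT_SIZE, SLOT_SIZE);
		free(cr);
		cr = next;
	}
}
```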

github-actions bot commented Oct 11, 2024

Ticket title is 'Concurrent reads hit the network even when caching enabled in dfuse'
Status is 'In Progress'
Labels: 'google-cloud-daos'
https://daosio.atlassian.net/browse/DAOS-16686

@ashleypittman changed the title from "amd/dfuse concurrent read" to "DAOS-16686 dfuse: Improve concurrent overlapping read handling" on Oct 11, 2024
@daosbuild1 (Collaborator)

Test stage Functional Hardware Large completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15298/4/testReport/

@daosbuild1 (Collaborator)

Test stage Build RPM on EL 9 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15298/6/execution/node/357/log

@daosbuild1 (Collaborator)

Test stage Build RPM on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15298/6/execution/node/356/log

@daosbuild1 (Collaborator)

Test stage Build RPM on Leap 15.5 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15298/6/execution/node/351/log

@daosbuild1 (Collaborator)

Test stage Build DEB on Ubuntu 20.04 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15298/6/execution/node/348/log

Signed-off-by: Ashley Pittman <[email protected]>
Signed-off-by: Ashley Pittman <[email protected]>
@daosbuild1 (Collaborator)

Test stage NLT on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15298/8/testReport/

@daosbuild1 (Collaborator)

Test stage NLT on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15298/9/testReport/

@daosbuild1 (Collaborator)

Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15298/9/execution/node/1479/log

jolivier23 previously approved these changes Dec 16, 2024
Comments from Ashley

In chunk_cb(), there's no reference on cd held after the
last call to DFUSE_REPLY_*(), so the list needs to be spliced
onto the stack before it is iterated.

Required-githooks: true

Signed-off-by: Di Wang <[email protected]>
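
To make the hazard concrete: after the final DFUSE_REPLY_*() call nothing keeps cd alive, so the callback must not touch cd's list (or cd at all) once it starts replying. Below is a minimal sketch of the splice-onto-the-stack pattern; the types and helpers (read_req, chunk_data, reply_one) are hypothetical stand-ins, not the real gurt/dfuse ones.

```c
#include <pthread.h>
#include <stdlib.h>

struct read_req {
	struct read_req *next;
};

struct chunk_data {
	pthread_mutex_t  lock;
	struct read_req *reqs; /* reads waiting on this chunk */
};

/* Stand-in for the reply path; in dfuse the last reply may drop the
 * final reference on the chunk, freeing cd. */
static void
reply_one(struct read_req *rr)
{
	free(rr); /* the request is consumed by the reply */
}

static void
chunk_cb(struct chunk_data *cd)
{
	struct read_req *local;

	/* Splice the waiter list onto the stack under the lock ... */
	pthread_mutex_lock(&cd->lock);
	local = cd->reqs;
	cd->reqs = NULL;
	pthread_mutex_unlock(&cd->lock);

	/* ... then iterate only the local copy: cd may already have
	 * been freed by the time later entries are processed. */
	while (local != NULL) {
		struct read_req *next = local->next;

		reply_one(local);
		local = next;
	}
}
```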
@wangdi1 dismissed stale reviews from jolivier23 and themself via e8b1722 on December 16, 2024 at 18:47
phender previously approved these changes Dec 16, 2024

@phender (Contributor) left a comment

Removing -1.

@mchaarawi (Contributor) left a comment

Change looks OK to me, but since the description mentions an issue with a crash and not just a perf improvement, a test needs to be added.

Also, please push with "Features: dfuse".

@ashleypittman (Contributor, Author)

> Change looks OK to me, but since the description mentions an issue with a crash and not just a perf improvement, a test needs to be added.

I actually wrote a PR to test this issue; it's easy enough to trigger locally, but the timing in CI meant I wasn't able to reproduce it via ftest, possibly because reads there are just faster than when running over loopback.

#15312

@mchaarawi (Contributor)

> Change looks OK to me, but since the description mentions an issue with a crash and not just a perf improvement, a test needs to be added.
>
> I actually wrote a PR to test this issue; it's easy enough to trigger locally, but the timing in CI meant I wasn't able to reproduce it via ftest, possibly because reads there are just faster than when running over loopback.
>
> #15312

Yeah, I see how this can be a race that is not always reproducible, but at least having that test in would still give someone a way to run it.
I'm curious, though, whether putting that case in NLT would trigger the issue, since NLT runs on one host?

@ashleypittman (Contributor, Author)

> Yeah, I see how this can be a race that is not always reproducible, but at least having that test in would still give someone a way to run it.
>
> I'm curious, though, whether putting that case in NLT would trigger the issue, since NLT runs on one host?

Yes, it's reproducible in NLT every time: there dfuse runs under valgrind, so everything is much slower and the race always triggers. I didn't want to do that, though, as NLT runs with full debug on, so I've typically only used it for very small datasets/iteration counts. I don't see another way to get this into CI, though.

I'll paste the script I've been using tomorrow when I'm at my work laptop, but it's the same script I copied to make that test: essentially, create a 1M file in dfuse, evict it, and then read it in parallel in at least 128k chunks. The bug/race, of course, occurs when the same 128k chunk of a file is requested multiple times concurrently before the first request has been replied to.
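
The script itself isn't pasted here, but a C sketch of the same reproducer, under stated assumptions (the path is illustrative, and the file is assumed to already exist in dfuse and to have been evicted from the cache), would look something like this:

```c
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define CHUNK   (128 * 1024)  /* read size: one 128 KiB chunk */
#define FSIZE   (1024 * 1024) /* 1 MiB test file */
#define READERS 8

/* Illustrative path: a 1 MiB file already created in dfuse and
 * evicted, so every read has to go back over the network. */
static const char *path = "/mnt/dfuse/testfile";

static void *
reader(void *arg)
{
	char *buf = malloc(CHUNK);
	int   fd  = open(path, O_RDONLY);

	(void)arg;
	if (fd < 0 || buf == NULL) {
		perror("setup");
		return NULL;
	}

	/* All threads walk the same chunks at once, so the same 128 KiB
	 * chunk is requested again before the first reply arrives. */
	for (off_t off = 0; off < FSIZE; off += CHUNK)
		if (pread(fd, buf, CHUNK, off) != CHUNK)
			perror("pread");

	close(fd);
	free(buf);
	return NULL;
}

int
main(void)
{
	pthread_t tid[READERS];

	for (int i = 0; i < READERS; i++)
		pthread_create(&tid[i], NULL, reader, NULL);
	for (int i = 0; i < READERS; i++)
		pthread_join(tid[i], NULL);
	return 0;
}
```

Compile with -pthread and run against a dfuse mount; locally the concurrent requests for the same chunk trigger the race easily, matching the behaviour described above.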

wangdi1 added a commit that referenced this pull request Dec 18, 2024
From #15298

Handle concurrent reads in the chunk_read code. Rather than assuming
each slot only gets requested once, save the slot number as part of the
request and handle multiple requests.

This corrects the behaviour, avoids a crash when multiple readers read
the same file concurrently, and improves performance in this case.

Required-githooks: true

Signed-off-by: Ashley Pittman <[email protected]>
wangdi1 added a commit that referenced this pull request Dec 20, 2024
wangdi1 added a commit that referenced this pull request Dec 28, 2024
@ashleypittman (Contributor, Author)

I can't seem to push to this PR today for some reason, but it wants c7a870377d632454fd99c97dc7cdb4042b01d444 added to it.

@daosbuild1 (Collaborator)

Test stage Build RPM on EL 9 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15298/37/execution/node/342/log

@daosbuild1 (Collaborator)

Test stage Build RPM on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15298/37/execution/node/339/log

@daosbuild1 (Collaborator)

Test stage Build RPM on Leap 15.5 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15298/37/execution/node/345/log

Allow-unstable-test: true

Signed-off-by: Jeff Olivier <[email protected]>
@daosbuild1 (Collaborator)

Test stage Build RPM on EL 9 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15298/38/execution/node/338/log

@daosbuild1 (Collaborator)

Test stage Build RPM on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15298/38/execution/node/357/log

@daosbuild1 (Collaborator)

Test stage Build RPM on Leap 15.5 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15298/38/execution/node/337/log

@daosbuild1 (Collaborator)

Test stage Build DEB on Ubuntu 20.04 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15298/38/execution/node/341/log

@daosbuild1 (Collaborator)

Test stage Build on Leap 15.5 with Intel-C and TARGET_PREFIX completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15298/38/execution/node/481/log

wangdi1 added a commit that referenced this pull request Jan 27, 2025