
DAOS-16686 dfuse: Fix overlapping chunk reads. #15298

Open
wants to merge 48 commits into master
Conversation

@ashleypittman (Contributor) commented Oct 11, 2024

Handle concurrent reads in the chunk_read code. Rather than assuming
each slot only gets requested once, save the slot number as part of the
request and handle multiple requests.

This corrects the behaviour, avoids a crash when multiple readers read
the same file concurrently, and improves performance in this case.

Signed-off-by: Ashley Pittman <[email protected]>
Signed-off-by: Ashley Pittman <[email protected]>
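
As an illustration of the approach the description outlines, here is a minimal C sketch: each pending read records its slot number and is queued on the chunk, so the completion path can answer however many requests landed on the same slot. All names here (chunk_req, chunk_data, slot_reply_all, reply_read, SLOT_SIZE) are hypothetical stand-ins, not the actual dfuse structures.

```c
#include <stdio.h>
#include <stdlib.h>

#define SLOT_SIZE (128 * 1024) /* one 128 KiB slot within a cached chunk */

/* Stand-in for the reply path; in dfuse this would be a DFUSE_REPLY_*
 * macro answering the fuse request. */
static void
reply_read(void *req, const char *data, size_t len)
{
	printf("reply %p: %zu bytes from %p\n", req, len, (const void *)data);
}

/* One queued read: which slot it wants and which request to answer. */
struct chunk_req {
	struct chunk_req *next;
	int               slot;
	void             *req;
};

/* One chunk being fetched; reads that arrive while the fetch is in
 * flight are queued here, possibly several for the same slot. */
struct chunk_data {
	struct chunk_req *waiters;
	char             *buf;
};

/* Completion path: answer every queued request using its saved slot
 * number, rather than assuming each slot was requested exactly once. */
static void
slot_reply_all(struct chunk_data *cd)
{
	struct chunk_req *cr = cd->waiters;

	cd->waiters = NULL;
	while (cr != NULL) {
		struct chunk_req *next = cr->next;

		reply_read(cr->req, cd->buf + cr->slot * SLOT_SIZE, SLOT_SIZE);
		free(cr);
		cr = next;
	}
}
```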

github-actions bot commented Oct 11, 2024

Ticket title is 'Concurrent reads hit the network even when caching enabled in dfuse'
Status is 'In Progress'
Labels: 'google-cloud-daos'
https://daosio.atlassian.net/browse/DAOS-16686

@ashleypittman changed the title from "amd/dfuse concurrent read" to "DAOS-16686 dfuse: Improve concurrent overlapping read handling" on Oct 11, 2024
@daosbuild1 (Collaborator)

Test stage Functional Hardware Large completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15298/4/testReport/

@daosbuild1 (Collaborator)

Test stage Build RPM on EL 9 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15298/6/execution/node/357/log

@daosbuild1 (Collaborator)

Test stage Build RPM on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15298/6/execution/node/356/log

@daosbuild1 (Collaborator)

Test stage Build RPM on Leap 15.5 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15298/6/execution/node/351/log

@daosbuild1 (Collaborator)

Test stage Build DEB on Ubuntu 20.04 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15298/6/execution/node/348/log

Signed-off-by: Ashley Pittman <[email protected]>
Signed-off-by: Ashley Pittman <[email protected]>
@daosbuild1 (Collaborator)

Test stage NLT on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15298/8/testReport/

@daosbuild1 (Collaborator)

Test stage NLT on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15298/9/testReport/

@daosbuild1 (Collaborator)

Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15298/9/execution/node/1479/log

jolivier23 previously approved these changes Dec 16, 2024
Comments from Ashley

In chunk_cb(), there's no reference on cd held after the
last call to DFUSE_REPLY_*(), so the list needs to be spliced
onto the stack before it is iterated.

Required-githooks: true

Signed-off-by: Di Wang <[email protected]>
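
To make the hazard concrete: after the final DFUSE_REPLY_*() call nothing keeps cd alive, so the callback must not touch cd's list (or cd at all) once it starts replying. Below is a minimal sketch of the splice-onto-the-stack pattern; the types and helpers (read_req, chunk_data, reply_one) are hypothetical stand-ins, not the real gurt/dfuse ones.

```c
#include <pthread.h>
#include <stdlib.h>

struct read_req {
	struct read_req *next;
};

struct chunk_data {
	pthread_mutex_t  lock;
	struct read_req *reqs; /* reads waiting on this chunk */
};

/* Stand-in for the reply path; in dfuse the last reply may drop the
 * final reference on the chunk, freeing cd. */
static void
reply_one(struct read_req *rr)
{
	free(rr); /* the request is consumed by the reply */
}

static void
chunk_cb(struct chunk_data *cd)
{
	struct read_req *local;

	/* Splice the waiter list onto the stack under the lock ... */
	pthread_mutex_lock(&cd->lock);
	local = cd->reqs;
	cd->reqs = NULL;
	pthread_mutex_unlock(&cd->lock);

	/* ... then iterate only the local copy: cd may already have
	 * been freed by the time later entries are processed. */
	while (local != NULL) {
		struct read_req *next = local->next;

		reply_one(local);
		local = next;
	}
}
```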
@wangdi1 dismissed stale reviews from jolivier23 and themself via e8b1722 on December 16, 2024 at 18:47
phender previously approved these changes Dec 16, 2024

@phender (Contributor) left a comment

Removing -1.

@mchaarawi (Contributor) left a comment

Change looks OK to me, but since the description mentions an issue with a crash and not just a perf improvement, a test needs to be added.

Also, please push with "Features: dfuse".

@ashleypittman (Contributor, Author)

> Change looks OK to me, but since the description mentions an issue with a crash and not just a perf improvement, a test needs to be added.

I actually wrote a PR to test this issue; it's easy enough to trigger locally, but the timing in CI meant I wasn't able to reproduce it via ftest, possibly because reads there are just faster than when running over loopback.

#15312

@mchaarawi (Contributor)

> Change looks OK to me, but since the description mentions an issue with a crash and not just a perf improvement, a test needs to be added.
>
> I actually wrote a PR to test this issue; it's easy enough to trigger locally, but the timing in CI meant I wasn't able to reproduce it via ftest, possibly because reads there are just faster than when running over loopback.
>
> #15312

Yeah, I see how this can be a race that is not always reproducible, but at least having that test in would still give someone a way to run it.
I'm curious, though, whether putting that case in NLT would trigger the issue, since NLT runs on one host?

@ashleypittman (Contributor, Author)

> Yeah, I see how this can be a race that is not always reproducible, but at least having that test in would still give someone a way to run it.
>
> I'm curious, though, whether putting that case in NLT would trigger the issue, since NLT runs on one host?

Yes, it's reproducible in NLT every time: there dfuse runs under valgrind, so everything is much slower and the race always triggers. I didn't want to do that, though, as NLT runs with full debug on, so I've typically only used it for very small datasets/iteration counts. I don't see another way to get this into CI, though.

I'll paste the script I've been using tomorrow when I'm at my work laptop, but it's the same script I copied to make that test: essentially, create a 1M file in dfuse, evict it, and then read it in parallel in at least 128k chunks. The bug/race, of course, occurs when the same 128k chunk of a file is requested multiple times concurrently before the first request has been replied to.
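
The script itself isn't pasted here, but a C sketch of the same reproducer, under stated assumptions (the path is illustrative, and the file is assumed to already exist in dfuse and to have been evicted from the cache), would look something like this:

```c
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define CHUNK   (128 * 1024)  /* read size: one 128 KiB chunk */
#define FSIZE   (1024 * 1024) /* 1 MiB test file */
#define READERS 8

/* Illustrative path: a 1 MiB file already created in dfuse and
 * evicted, so every read has to go back over the network. */
static const char *path = "/mnt/dfuse/testfile";

static void *
reader(void *arg)
{
	char *buf = malloc(CHUNK);
	int   fd  = open(path, O_RDONLY);

	(void)arg;
	if (fd < 0 || buf == NULL) {
		perror("setup");
		return NULL;
	}

	/* All threads walk the same chunks at once, so the same 128 KiB
	 * chunk is requested again before the first reply arrives. */
	for (off_t off = 0; off < FSIZE; off += CHUNK)
		if (pread(fd, buf, CHUNK, off) != CHUNK)
			perror("pread");

	close(fd);
	free(buf);
	return NULL;
}

int
main(void)
{
	pthread_t tid[READERS];

	for (int i = 0; i < READERS; i++)
		pthread_create(&tid[i], NULL, reader, NULL);
	for (int i = 0; i < READERS; i++)
		pthread_join(tid[i], NULL);
	return 0;
}
```

Compile with -pthread and run against a dfuse mount; locally the concurrent requests for the same chunk trigger the race easily, matching the behaviour described above.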

wangdi1 added a commit that referenced this pull request Dec 18, 2024
From #15298

Handle concurrent reads in the chunk_read code. Rather than assuming
each slot only gets requested once, save the slot number as part of the
request and handle multiple requests.

This corrects the behaviour, avoids a crash when multiple readers read
the same file concurrently, and improves performance in this case.

Required-githooks: true

Signed-off-by: Ashley Pittman <[email protected]>
wangdi1 added a commit that referenced this pull request Dec 20, 2024
wangdi1 added a commit that referenced this pull request Dec 28, 2024
@ashleypittman (Contributor, Author)

I can't seem to push to this PR today for some reason, but it wants c7a870377d632454fd99c97dc7cdb4042b01d444 added to it.

@daosbuild1 (Collaborator)

Test stage Build RPM on EL 9 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15298/37/execution/node/342/log

@daosbuild1 (Collaborator)

Test stage Build RPM on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15298/37/execution/node/339/log

@daosbuild1 (Collaborator)

Test stage Build RPM on Leap 15.5 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15298/37/execution/node/345/log

Allow-unstable-test: true

Signed-off-by: Jeff Olivier <[email protected]>
@daosbuild1 (Collaborator)

Test stage Build RPM on EL 9 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15298/38/execution/node/338/log

@daosbuild1 (Collaborator)

Test stage Build RPM on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15298/38/execution/node/357/log

@daosbuild1 (Collaborator)

Test stage Build RPM on Leap 15.5 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15298/38/execution/node/337/log

@daosbuild1 (Collaborator)

Test stage Build DEB on Ubuntu 20.04 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15298/38/execution/node/341/log

@daosbuild1 (Collaborator)

Test stage Build on Leap 15.5 with Intel-C and TARGET_PREFIX completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15298/38/execution/node/481/log

wangdi1 added a commit that referenced this pull request Jan 27, 2025