-
Notifications
You must be signed in to change notification settings - Fork 309
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DAOS-16686 dfuse: Fix overlapping chunk reads. #15298
base: master
Are you sure you want to change the base?
Conversation
Signed-off-by: Ashley Pittman <[email protected]>
Signed-off-by: Ashley Pittman <[email protected]>
Signed-off-by: Ashley Pittman <[email protected]>
Ticket title is 'Concurrent reads hit the network even when caching enabled in dfuse' |
Signed-off-by: Ashley Pittman <[email protected]>
Signed-off-by: Ashley Pittman <[email protected]>
Signed-off-by: Ashley Pittman <[email protected]>
Test stage Functional Hardware Large completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15298/4/testReport/ |
… leak Signed-off-by: Ashley Pittman <[email protected]>
Signed-off-by: Ashley Pittman <[email protected]>
Signed-off-by: Ashley Pittman <[email protected]>
Signed-off-by: Ashley Pittman <[email protected]>
Test stage Build RPM on EL 9 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15298/6/execution/node/357/log |
Test stage Build RPM on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15298/6/execution/node/356/log |
Test stage Build RPM on Leap 15.5 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15298/6/execution/node/351/log |
Test stage Build DEB on Ubuntu 20.04 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15298/6/execution/node/348/log |
Signed-off-by: Ashley Pittman <[email protected]>
Signed-off-by: Ashley Pittman <[email protected]>
Test stage NLT on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15298/8/testReport/ |
Signed-off-by: Ashley Pittman <[email protected]>
Test-tag: dfuse Signed-off-by: Ashley Pittman <[email protected]>
Test stage NLT on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15298/9/testReport/ |
Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15298/9/execution/node/1479/log |
Test-tag: dfuse
Test-tag: dfuse Signed-off-by: Ashley Pittman <[email protected]>
Comments from Ashely In chunk_cb(), there's no reference on cd held here after the last call to DFUSE_REPLY../. so the list needs to be spliced onto the stack before the list is iterated. Required-githooks: true Signed-off-by: Di Wang <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removing -1.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
change looks ok to me. but since the description mentions an issue with a crash and not just a perf improvement, a test needs to be added.
Also please push with Features: dfuse
I actually wrote a PR to test this issue, it's easy enough to trigger locally but the timing in CI meant that I wasn't able to reproduce it via ftest, possibly because reads are just faster than running over loopback. |
Yea i see how this can be a race that is not always reproducible. but at least having that test in would still give someone a way to run that. |
Yes, it's reproducible in NLT every time. In that case dfuse is run under valgrind so will be much slower and hence it's reproducible every time. I didn't want to do that though as NLT runs with full debug on so I've typically only used that for very small datasets/iteration counts. I don't see another way to get this into CI though. I'll paste the script I've been using tomorrow when I'm at my work laptop but it's the same script I copied to make that test, essentially create a 1M file in dfuse, evict it and then read it in parallel in at least 128k chunks. The bug/race of course is when the same 128k bunk of a file is requested multiple times concurrently before the first request has been replied to. |
From #15298 Handle concurrent read in the chunk_read code. Rather than assuming each slot only gets requested once save the slot number as part of the request and handle multiple requests. This corrects the behaviour and avoids a crash when multiple readers read the same file concurrently and improves the performance in this case. Required-githooks: true Signed-off-by: Ashley Pittman <[email protected]>
From #15298 Handle concurrent read in the chunk_read code. Rather than assuming each slot only gets requested once save the slot number as part of the request and handle multiple requests. This corrects the behaviour and avoids a crash when multiple readers read the same file concurrently and improves the performance in this case. Required-githooks: true Signed-off-by: Ashley Pittman <[email protected]>
From #15298 Handle concurrent read in the chunk_read code. Rather than assuming each slot only gets requested once save the slot number as part of the request and handle multiple requests. This corrects the behaviour and avoids a crash when multiple readers read the same file concurrently and improves the performance in this case. Required-githooks: true Signed-off-by: Ashley Pittman <[email protected]>
I can't seem to push to this PR today for some reason but it want's c7a870377d632454fd99c97dc7cdb4042b01d444 added to it. |
Signed-off-by: Ashley Pittman <[email protected]>
Test stage Build RPM on EL 9 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15298/37/execution/node/342/log |
Test stage Build RPM on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15298/37/execution/node/339/log |
Test stage Build RPM on Leap 15.5 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15298/37/execution/node/345/log |
Allow-unstable-test: true Signed-off-by: Jeff Olivier <[email protected]>
Test stage Build RPM on EL 9 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15298/38/execution/node/338/log |
Test stage Build on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15298/38/execution/node/361/log |
Test stage Build RPM on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15298/38/execution/node/357/log |
Test stage Build RPM on Leap 15.5 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15298/38/execution/node/337/log |
Test stage Build DEB on Ubuntu 20.04 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15298/38/execution/node/341/log |
Test stage Build on Leap 15.5 with Intel-C and TARGET_PREFIX completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15298/38/execution/node/481/log |
Allow-unstable-test: true Signed-off-by: Jeff Olivier <[email protected]>
From #15298 Handle concurrent read in the chunk_read code. Rather than assuming each slot only gets requested once save the slot number as part of the request and handle multiple requests. This corrects the behaviour and avoids a crash when multiple readers read the same file concurrently and improves the performance in this case. Required-githooks: true Signed-off-by: Ashley Pittman <[email protected]>
Handle concurrent read in the chunk_read code. Rather than assuming
each slot only gets requested once save the slot number as part of the
request and handle multiple requests.
This corrects the behaviour and avoids a crash when multiple readers read
the same file concurrently and improves the performance in this case.