-
Notifications
You must be signed in to change notification settings - Fork 8.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HDFS-16262 Async refresh of cached locations in DFSInputStream #3527
Conversation
💔 -1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
f932816
to
ceba806
Compare
I've had this running in one of our test clusters, under load and with block moves occurring. I had it tuned to a short interval of 10s just to put it in an extreme condition. It works really well. |
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
ceba806
to
a27f3c9
Compare
💔 -1 overall
This message was automatically generated. |
a27f3c9
to
85a1a82
Compare
💔 -1 overall
This message was automatically generated. |
85a1a82
to
e6b877b
Compare
🎊 +1 overall
This message was automatically generated. |
c70e40d
to
2ceae4c
Compare
🎊 +1 overall
This message was automatically generated. |
2ceae4c
to
0c7d1c7
Compare
int retriesForLastBlockLength = conf.getRetryTimesForGetLastBlockLength(); | ||
while (retriesForLastBlockLength > 0) { | ||
|
||
while (true) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This openInfo
rewrite is functionally identical to the previous implementation, but is cleaner and makes it possible to share code between this method and the new refreshBlockLocations
method below. I didn't want to simply re-use this method for the background refresh, because it does multiple RPC calls within the infoLock.
The background refreshBlockLocations
is meant to be minimally impactful to the hot path of reads, so it does these RPC calls outside of any locks and then quickly swaps them in with the lock.
0c7d1c7
to
9604ba9
Compare
🎊 +1 overall
This message was automatically generated. |
9604ba9
to
ddeccfd
Compare
💔 -1 overall
This message was automatically generated. |
ddeccfd
to
43daa5f
Compare
💔 -1 overall
This message was automatically generated. |
The last CI runs flagged some checkstyle issues. Could you check those please and fix them? A few others parts of that run failed too, but it may have been some wider issue. Fixing the checkstyle will trigger a new run and we can see how it looks then. The patch looks mostly good to me, but there is a lot of change in DFSInputStream, so I will need to take some more time to go over that part of the PR again. |
Sorry! This had been succeeding last time i looked at it, I should have verified that it was still succeeding before pinging you. I'll take a look at the failures shortly. |
43daa5f
to
5409631
Compare
I just pushed a rebase of this branch on latest trunk, with checkstyle issues fixed. I'll keep an eye on the build and ping you once it looks good. |
💔 -1 overall
This message was automatically generated. |
Hey @sodonnel , the build has finished. Unfortunately there was 1 failure, yet I don't think it's related. It's a timeout in RollingUpgradeTest. Otherwise the checkstyle issues are cleaned up as are all the other failures. |
5409631
to
52f925d
Compare
💔 -1 overall
This message was automatically generated. |
@sodonnel So there is one failing test, TestRollingUpgrade. I've now run the test locally 3 times and it always succeeds. I don't really have a lot of context on this test, but I hope it won't block this PR. The changes here should be unrelated. |
It seems like TestRollingUpgrade has some larger issues. I was curious about this so I looked up recent merged PRs: https://github.com/apache/hadoop/pulls?q=is%3Apr+is%3Aclosed+TestRollingUpgrade. Looks like a bunch of recently merged PRs had failure sin TestRollingUpgrade. |
@bbeaudreault The CI looks good - that test often fails, so I don't believe it is related to this. I gave the changes another check and I don't see any issues, plus you have been running this change for a long time in production without issues, which further validates it. Thanks for your patience and sorry it took me so long to get back to this! +1 from me, I will commit it shortly. |
(cherry picked from commit 94b884a)
Thank you @sodonnel! |
Description of PR
Refactor refreshing of cached block locations so that it happens as part of an async process. This new implementation only refreshes DFSInputStreams whose deadNodes list is non-empty or contains non-local blocks.
Many DFSInputStreams can be very short lived. In order to reduce unnecessary contention and work fetching locations for a stream that doesn't need it, we only register the DFSInputStream for the async refresh on the next read after the first interval. In my testing in an HBase cluster, this drastically cuts down on the amount of registering and deregistering of streams.
One can disable automatic registration altogether if they'd prefer to handle that manually. One can manually register or de-register a stream using DFSClient.addLocatedBlocksRefresh and removeLocatedBlocksRefresh.
See https://issues.apache.org/jira/browse/HDFS-16262
How was this patch tested?
I added a new test class TestLocatedBlocksRefresher. This has also been running on a few of our internal test clusters.
For code changes:
LICENSE
,LICENSE-binary
,NOTICE-binary
files?