HADOOP-18829. S3A prefetch LRU cache eviction metrics #5893
Conversation
Tested against us-west-2.
🎊 +1 overall
This message was automatically generated.
@mehakmeet @mukund-thakur this is one of the follow-ups from HADOOP-18291, could you please review this PR?
this (and other future follow-ups from HADOOP-18291) can go in after #5832
Looks good, we could add some more metrics here. How about:
- a Gauge that tracks the size of the LRU cache?
- a duration tracker for the removal of a block?

Maybe some other ones that you think would be helpful for the LRU cache? (see the sketch after this list)
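For illustration, a duration tracker along these lines could wrap the eviction path. This is a minimal sketch assuming the statistics object implements Hadoop's DurationTrackerFactory; the class and the statistic name used here are invented for the example, not taken from this PR:

import org.apache.hadoop.fs.statistics.DurationTracker;
import org.apache.hadoop.fs.statistics.DurationTrackerFactory;

class EvictionTimingSketch {
  private final DurationTrackerFactory statistics;

  EvictionTimingSketch(DurationTrackerFactory statistics) {
    this.statistics = statistics;
  }

  void deleteBlockFileAndEvictCache() {
    // Closing the tracker (via try-with-resources) records the elapsed
    // time under the given statistic name; "stream_file_cache_eviction"
    // is illustrative only.
    try (DurationTracker ignored =
        statistics.trackDuration("stream_file_cache_eviction")) {
      // ... delete the block file and drop the cache entry here ...
    }
  }
}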
@@ -444,6 +444,7 @@ private void deleteBlockFileAndEvictCache(Entry elementToPurge) {
       entryListSize--;
       prefetchingStatistics.blockRemovedFromFileCache();
       blocks.remove(elementToPurge.blockNumber);
+      prefetchingStatistics.blockEvictedFromFileCache();
So, blocks removed from the cache would be equal to the blocks evicted?
They will be equal until the input stream is closed. When closed, all blocks get removed, which only increments blockRemovedFromFileCache but not blockEvictedFromFileCache.
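For context, the two counters are driven by separate callbacks on the prefetching statistics object. A rough sketch of the shape implied by the diff above (the method names match the calls in the diff; the interface body is assumed, not the actual Hadoop source):

interface PrefetchingStatistics {
  /** Invoked whenever a block leaves the file cache, including on stream close. */
  void blockRemovedFromFileCache();

  /** Invoked only when a block is purged by LRU eviction. */
  void blockEvictedFromFileCache();
}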
@@ -172,6 +174,8 @@ public void testSeeksWithLruEviction() throws Throwable {
     LambdaTestUtils.eventually(TIMEOUT_MILLIS, INTERVAL_MILLIS, () -> {
       LOG.info("IO stats: {}", ioStats);
       verifyStatisticGaugeValue(ioStats, STREAM_READ_BLOCKS_IN_FILE_CACHE, 0);
+      assertThatStatisticCounter(ioStats,
+          STREAM_EVICT_BLOCKS_FROM_FILE_CACHE).isGreaterThanOrEqualTo(5);
since this is eventually(), should we assert the exact number of evictions that will eventually happen, rather than a lower bound?
there is a bit of async behavior that will likely make the test flaky even with eventually(), because the order in which lazy seek triggers the block cache is not guaranteed, and that can still sometimes break the exact number we are trying to match.
i really wish we could do this, but it's still lazy seek, and verifying the exact order of seeks leading to cache loads is not guaranteed every time (some of these problems are also being discussed on the unbuffer PR).
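To make the trade-off concrete, here is the polling pattern under discussion with the reasoning as comments; the constants and helper methods are the ones from the diff above, and the bound value is illustrative:

LambdaTestUtils.eventually(TIMEOUT_MILLIS, INTERVAL_MILLIS, () -> {
  // the gauge must eventually settle at 0 once all cached blocks are purged
  verifyStatisticGaugeValue(ioStats, STREAM_READ_BLOCKS_IN_FILE_CACHE, 0);
  // lazy seek ordering is nondeterministic, so assert a lower bound, not ==
  assertThatStatisticCounter(ioStats, STREAM_EVICT_BLOCKS_FROM_FILE_CACHE)
      .isGreaterThanOrEqualTo(4);
});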
> a Gauge for the size that tracks the size of the LRU cache?

i thought of that earlier, but then realized that we already have stream_read_blocks_in_cache:
/**
* Total number of block in disk cache.
*/
public static final String STREAM_READ_BLOCKS_IN_FILE_CACHE
= "stream_read_blocks_in_cache";
done, thank you for this suggestion @mehakmeet!
🎊 +1 overall
This message was automatically generated.
Checkstyle complains about the 8th param added to the CachingBlockManager c'tor. Shall we mark it ignored for now, and introduce a builder class (for both CachingBlockManager and S3ACachingBlockManager) later, so this patch doesn't turn into refactoring work? We can put the builder pattern in before the upcoming 3.3 release (since HADOOP-18028 has not yet been released on 3.3).
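For illustration, the proposed builder might look roughly like this; the parameter set is a guess from the discussion (an 8-param constructor), and every name below is hypothetical:

final class CachingBlockManagerBuilder {
  // hypothetical fields standing in for the constructor's parameters
  private ExecutorServiceFuturePool futurePool;
  private int bufferPoolSize;
  private PrefetchingStatistics statistics;
  // ... remaining constructor params become setters ...

  CachingBlockManagerBuilder withFuturePool(ExecutorServiceFuturePool pool) {
    this.futurePool = pool;
    return this;
  }

  CachingBlockManagerBuilder withBufferPoolSize(int size) {
    this.bufferPoolSize = size;
    return this;
  }

  CachingBlockManagerBuilder withStatistics(PrefetchingStatistics stats) {
    this.statistics = stats;
    return this;
  }

  CachingBlockManager build() {
    // passing the builder keeps the constructor signature stable as
    // parameters are added in future patches
    return new CachingBlockManager(this);
  }
}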
🎊 +1 overall
This message was automatically generated.
Looks good and is close to getting merged.
A suggestion for future PRs: please don't rebase to trunk in the middle of a review. It makes reviewing changes a bit difficult; unless there are merge conflicts, we should have a consistent base for all commits stacked on top of each other.
@@ -175,7 +177,9 @@ public void testSeeksWithLruEviction() throws Throwable {
       LOG.info("IO stats: {}", ioStats);
       verifyStatisticGaugeValue(ioStats, STREAM_READ_BLOCKS_IN_FILE_CACHE, 0);
       assertThatStatisticCounter(ioStats,
-          STREAM_EVICT_BLOCKS_FROM_FILE_CACHE).isGreaterThanOrEqualTo(5);
+          STREAM_EVICT_BLOCKS_FROM_FILE_CACHE).isGreaterThanOrEqualTo(4);
we changed this to 4. Are we still seeing some flakiness with this? If it still fails rarely, maybe we can add an error message saying the failure may be transient, so that people re-run it before opening a bug jira or such :)
sounds good. Since eventually() throws the last exception (which comes from the assertj library, and those assertions don't take an error msg as an argument), let me add comments on this.
we don't see flakiness as such, but 4 is good enough based on the multiple rounds of tests that i did last week.
@mehakmeet does this sound good to you?
Sure, I was thinking of adding the error message to the assertion itself and not the eventually() block. It's always tricky with multi-threaded tests and flakiness.
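For illustration, AssertJ supports attaching a description that is printed on failure, which is one way to put the hint on the assertion itself; this is a sketch, not the committed change, and the message text is invented:

assertThatStatisticCounter(ioStats, STREAM_EVICT_BLOCKS_FROM_FILE_CACHE)
    // describedAs() prefixes the failure output with this message
    .describedAs("eviction count is timing-dependent; a rare failure here"
        + " may be transient, so re-run before filing a JIRA")
    .isGreaterThanOrEqualTo(4);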
sure. For this PR, since i just wanted to run the whole test suite against the latest head of trunk, i did the merge from latest trunk, but next time i can find another way of re-running the test suite on latest trunk (by manually applying the patch on latest trunk, only locally) :)
🎊 +1 overall
This message was automatically generated.
@mehakmeet sorry to bother you again, could you please take another look?
LGTM
Thanks for the patch @virajjasani, and sorry for the delay since the last review. Can you please cherry-pick this into branch-3.3 with a full test suite run, and then we can merge it there as well.
Contributed by: Viraj Jasani
Jira: HADOOP-18829