Improve the local cache fallback behavior for corrupted pages #18498

Merged

@beinan beinan commented Jan 25, 2024

What changes are proposed in this pull request?

Throw a PageCorruptedException when the length of a page is inconsistent with the metadata.
Make a best-effort attempt to delete the corrupted page file.
Reset the buffer offset when the data is found to be corrupted, to avoid an array-out-of-bounds exception.
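
As a rough illustration of the first item, a length check of this kind could look like the sketch below. The class `PageLengthCheck`, the method `verifyLength`, and the nested `PageCorruptedException` are hypothetical stand-ins written for this example; they are not the Alluxio classes and may differ from the real exception's type hierarchy and constructor.

```java
final class PageLengthCheck {
  /** Hypothetical stand-in for the exception named in the PR description. */
  static final class PageCorruptedException extends RuntimeException {
    PageCorruptedException(String message) {
      super(message);
    }
  }

  /**
   * Compares the byte count recorded in metadata with the byte count that
   * was actually read, and throws if they disagree.
   */
  static void verifyLength(String pageId, int expectedBytes, int actualBytes) {
    if (actualBytes != expectedBytes) {
      throw new PageCorruptedException(String.format(
          "Page %s: supposed to read %d bytes, %d bytes actually read",
          pageId, expectedBytes, actualBytes));
    }
  }

  public static void main(String[] args) {
    verifyLength("page-0", 100, 100); // lengths agree, nothing happens
    try {
      verifyLength("page-1", 100, 42); // short read is reported as corruption
    } catch (PageCorruptedException e) {
      System.out.println("Detected: " + e.getMessage());
    }
  }
}
```

Raising a dedicated exception type lets callers tell corruption apart from other read failures and react to it specifically, for example by discarding the page.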

Why are the changes needed?

We found that the cache makes Presto keep failing when some of the page files are corrupted.

Does this PR introduce any user facing changes?

No

@beinan beinan force-pushed the implement_fallback_for_data_corruption branch 3 times, most recently from 0d4b82f to c715238 Compare January 25, 2024 19:16
@beinan beinan force-pushed the implement_fallback_for_data_corruption branch from c715238 to 9b39619 Compare January 25, 2024 19:35
@@ -927,10 +929,20 @@ private int getPage(PageInfo pageInfo, int pageOffset, int bytesToRead,
// data read from page store is inconsistent from the metastore
LOG.error("Failed to read page {}: supposed to read {} bytes, {} bytes actually read",
pageInfo.getPageId(), bytesToRead, ret);
target.offset(originOffset); //reset the offset
//best efforts to delete the corrupted file without acquire the write lock
deletePage(pageInfo, false);
Contributor

Do we need to delete the metadata as well?

LOG.error("Data corrupted page {} from pageStore", pageInfo.getPageId(), e);
target.offset(originOffset); //reset the offset
//best efforts to delete the corrupted file without acquire the write lock
deletePage(pageInfo, false);
Contributor

Do we need to delete the metadata as well?
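
The pattern in the two hunks above (restore the target offset, then try to delete the bad file without failing the read path) can be sketched in isolation. This is a simplified illustration under stated assumptions: it uses `java.nio.ByteBuffer.position` in place of the PR's `target.offset(...)`, a plain file in place of the page store, and the hypothetical names `CorruptedPageCleanup` and `readOrDiscard`. It does not touch any metadata, which is the open question raised in the review comments.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.file.Files;
import java.nio.file.Path;

final class CorruptedPageCleanup {
  /**
   * Reads a page file into target. Returns the number of bytes copied, or -1
   * if the file is unreadable or its length differs from expectedBytes.
   */
  static int readOrDiscard(Path pageFile, int expectedBytes, ByteBuffer target) {
    int originPosition = target.position(); // remember where this read started
    try {
      byte[] data = Files.readAllBytes(pageFile);
      int copied = Math.min(data.length, target.remaining());
      target.put(data, 0, copied); // may leave a partial write behind
      if (data.length == expectedBytes) {
        return copied;
      }
    } catch (IOException e) {
      // An unreadable file is handled the same way as a length mismatch.
    }
    target.position(originPosition); // undo any partial write
    try {
      Files.deleteIfExists(pageFile);
    } catch (IOException e) {
      // Best effort only: a failed delete must not mask the corruption result.
    }
    return -1;
  }

  public static void main(String[] args) throws IOException {
    Path pageFile = Files.createTempFile("page", ".bin");
    Files.write(pageFile, new byte[] {1, 2, 3}); // only 3 bytes on disk
    ByteBuffer target = ByteBuffer.allocate(16);
    target.position(4);
    int result = readOrDiscard(pageFile, 10, target); // metadata claims 10 bytes
    System.out.println(result + " " + target.position() + " " + Files.exists(pageFile));
    // prints "-1 4 false"
  }
}
```

Restoring the position matters because a later retry or fallback read writes into the same buffer; without the reset it would start past the intended offset and could run off the end of the array.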

Comment on lines +586 to +601
@Test
public void testPageDataFileCorrupted() throws Exception
{
  int pages = 10;
  int fileSize = mPageSize * pages;
  byte[] testData = BufferUtils.getIncreasingByteArray(fileSize);
  ByteArrayCacheManager manager = new ByteArrayCacheManager();
  //by default local cache fallback is not enabled, the read should fail for any error
  LocalCacheFileInStream streamWithOutFallback = setupWithSingleFile(testData, manager);

  sConf.set(PropertyKey.USER_CLIENT_CACHE_FALLBACK_ENABLED, true);
  LocalCacheFileInStream streamWithFallback = setupWithSingleFile(testData, manager);
  Assert.assertEquals(100, streamWithFallback.positionedRead(0, new byte[10], 100, 100));
  Assert.assertEquals(1,
      MetricsSystem.counter(MetricKey.CLIENT_CACHE_POSITION_READ_FALLBACK.getName()).getCount());
}
Contributor

Does this method duplicate testPositionReadFallBack()?

Contributor Author

Ahhhh, I see, I got your point, yes, this test is covered by testPositionReadFallBack. I will remove this test case. Thanks!
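
For readers unfamiliar with what the fallback test in this thread asserts, the behavior can be sketched roughly as follows: a failing cache read is retried against the external source only when fallback is enabled, and a counter records each fallback. All names here (`FallbackReadSketch`, `PositionReader`, `fallbackCount`) are hypothetical and chosen for this example; this is not the `LocalCacheFileInStream` implementation.

```java
import java.util.concurrent.atomic.AtomicLong;

final class FallbackReadSketch {
  /** Hypothetical positioned-read interface; not an Alluxio type. */
  interface PositionReader {
    int read(long position, byte[] buffer, int offset, int length);
  }

  private final PositionReader cacheReader;
  private final PositionReader externalReader;
  private final boolean fallbackEnabled;
  private final AtomicLong fallbackCount = new AtomicLong();

  FallbackReadSketch(PositionReader cacheReader, PositionReader externalReader,
      boolean fallbackEnabled) {
    this.cacheReader = cacheReader;
    this.externalReader = externalReader;
    this.fallbackEnabled = fallbackEnabled;
  }

  int positionedRead(long position, byte[] buffer, int offset, int length) {
    try {
      return cacheReader.read(position, buffer, offset, length);
    } catch (RuntimeException e) {
      if (!fallbackEnabled) {
        throw e; // without fallback, any cache error fails the read
      }
      fallbackCount.incrementAndGet(); // stands in for the fallback metric
      return externalReader.read(position, buffer, offset, length);
    }
  }

  long fallbackCount() {
    return fallbackCount.get();
  }

  public static void main(String[] args) {
    PositionReader brokenCache = (pos, buf, off, len) -> {
      throw new IllegalStateException("corrupted page");
    };
    PositionReader external = (pos, buf, off, len) -> len; // pretends to read len bytes
    FallbackReadSketch stream = new FallbackReadSketch(brokenCache, external, true);
    System.out.println(stream.positionedRead(0, new byte[100], 0, 100)); // prints 100
    System.out.println(stream.fallbackCount()); // prints 1
  }
}
```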

@JiamingMai JiamingMai left a comment


Mostly LGTM, and I left some comments.

@JiamingMai JiamingMai added the type-code-quality code quality improvement label Feb 4, 2024

beinan commented Feb 23, 2024

alluxio-bot, merge this please.

@alluxio-bot alluxio-bot merged commit a3ecbc7 into Alluxio:main Feb 23, 2024
14 checks passed