[Improvement] trim redundant read when scanning entry log. #4161

Open · wants to merge 13 commits into master
Conversation

@thetumbled (Member) commented Dec 21, 2023

Descriptions of the changes in this PR:
Remove the redundant disk read when scanning the entry log file, which can significantly reduce disk pressure.

Motivation

We have org.apache.bookkeeper.bookie.storage.EntryLogScanner to process entry data when scanning an entry log file; its process method is invoked for each entry.
There are many implementations of EntryLogScanner, written for various reasons.
[Screenshot: the implementations of EntryLogScanner]
But some of them do not need to access all data in the entry log file.
For example:

  • the scanner implemented in extractEntryLogMetadataByScanning only needs the entrySize field, so no entry data needs to be read at all.
  • the scanner implemented in org.apache.bookkeeper.bookie.InterleavedStorageRegenerateIndexOp#initiate only needs the entryId field, so only the first 16 bytes (ledgerId + entryId) need to be read.
  • many more similar situations exist.

So, in some cases we do not need to read the entry payload at all when scanning the entry log; the sketch below illustrates such a header-only scan.
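To make the layout concrete, here is a minimal, hypothetical sketch (not code from this PR) of a header-only scan, assuming the usual entry log layout of a 4-byte size header followed by an 8-byte ledgerId and an 8-byte entryId:

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;

// Hypothetical sketch, not code from this PR: read just the 4-byte size
// header plus the 8-byte ledgerId and 8-byte entryId, then skip the payload.
public class HeaderOnlyScan {
    public static void scan(RandomAccessFile log, long startOffset) throws IOException {
        long pos = startOffset;
        ByteBuffer header = ByteBuffer.allocate(4 + 8 + 8);
        while (pos + header.capacity() <= log.length()) {
            header.clear();
            log.seek(pos); // the RandomAccessFile and its channel share a position
            if (log.getChannel().read(header) < header.capacity()) {
                break; // truncated file: stop scanning
            }
            header.flip();
            int entrySize = header.getInt();  // size of ledgerId + entryId + data
            long ledgerId = header.getLong();
            long entryId = header.getLong();
            System.out.printf("ledger=%d entry=%d size=%d%n", ledgerId, entryId, entrySize);
            pos += 4 + entrySize; // advance past the rest of the entry without reading it
        }
    }
}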

Changes

Add a getLengthToRead method to EntryLogScanner that indicates how much data needs to be read, and read only that amount.
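A rough sketch of what the extended interface could look like. getLengthToRead, READ_NOTHING, and READ_LEDGER_ENTRY_ID are named in this PR; the enum shape, READ_ALL, and the default method are illustrative assumptions:

import io.netty.buffer.ByteBuf;
import java.io.IOException;

// Rough sketch of the extended interface; everything beyond the names taken
// from this PR is an illustrative assumption.
public interface EntryLogScanner {
    enum LengthToRead {
        READ_NOTHING,         // only the entrySize header is needed
        READ_LEDGER_ENTRY_ID, // first 16 bytes: ledgerId + entryId
        READ_ALL              // full entry payload, the old behaviour
    }

    boolean accept(long ledgerId);

    void process(long ledgerId, long offset, ByteBuf entry) throws IOException;

    // How much of each entry this scanner needs; defaults to the old full read
    // so existing implementations are unaffected.
    default LengthToRead getLengthToRead() {
        return LengthToRead.READ_ALL;
    }
}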

@thetumbled (Member Author) commented:
Could you help review this PR? Thanks!
@eolivelli @dlg99 @hangc0276 @nicoloboschi @shoothzj @zymap @wenbingshen

@thetumbled thetumbled changed the title [Improvement] skip read rebundunct content when scanning entry log. [Improvement] trim rebundunct read when scanning entry log. Dec 21, 2023
@thetumbled thetumbled changed the title [Improvement] trim rebundunct read when scanning entry log. [Improvement] trim redundant read when scanning entry log. Dec 21, 2023
@thetumbled thetumbled requested a review from eolivelli December 21, 2023 12:46
@eolivelli (Contributor) left a comment:

very good, I left one final comment about the "switch" statement

return;
}
scanner.process(ledgerId, offset, data);
break;
A Contributor commented on the diff above:
please add the "default" case
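For illustration, a hedged sketch of the dispatch with the requested default case, continuing the hypothetical interface sketch above (so the LengthToRead names and imports come from there):

// Hypothetical continuation of the interface sketch above: the dispatch on
// the scanner's read type, now with the requested default case.
static void dispatch(EntryLogScanner scanner, long ledgerId, long offset,
                     ByteBuf data) throws IOException {
    switch (scanner.getLengthToRead()) {
        case READ_NOTHING:
            break; // header-only scan: nothing to hand to the scanner
        case READ_LEDGER_ENTRY_ID:
        case READ_ALL:
            scanner.process(ledgerId, offset, data);
            break;
        default:
            // fail fast on unhandled read types instead of skipping silently
            throw new IllegalStateException(
                    "unknown read type: " + scanner.getLengthToRead());
    }
}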

The Member Author (@thetumbled) replied:
Done, PTAL.

@hangc0276 (Contributor) left a comment:

I'm unsure of the extent of the improvements that can be achieved with this change, given the following concerns:

  • All the changes in this PR are specific to uncommon cases, occurring only when the entrylog file's index is missing.
  • The readFromLogChannel function utilizes BufferedLogChannel, which is a RandomAccessFile channel. When reading data from the file channel, the data is pre-fetched into the OS PageCache, with a default pre-fetch size of 4KB. Even though we only retrieve the entryId, the actual data read from the disk is a minimum of 4KB.

There is a potential risk associated with this PR:

  • We solely read the entryID without validating the entry data. If the file is corrupted, the scan operation will be unable to detect it and will incorrectly populate the index with the wrong ledger size. This introduces a high level of risk.

@hangc0276 hangc0276 requested review from merlimat and zymap December 25, 2023 03:25
@thetumbled (Member Author) commented Dec 25, 2023

All the changes in this PR are specific to uncommon cases, occurring only when the entrylog file's index is missing.

We have encountered cases where the ledger map in an entry log is missing, possibly because the bookie crashed before flushing the entry log. As long as the corrupted entry log file exists, the bookie will scan it via extractEntryLogMetadataByScanning to generate the EntryLogMetadataMap on every GC run, i.e. every gcWaitTime milliseconds (default 15 min).
The other cases are related to index rebuilding, which may be rarely used.

The readFromLogChannel function utilizes BufferedLogChannel, which is a RandomAccessFile channel. When reading data from the file channel, the data is pre-fetched into the OS PageCache, with a default pre-fetch size of 4KB. Even though we only retrieve the entryId, the actual data read from the disk is a minimum of 4KB.

It is common for a single entry to reach 4 MB (we set the max batch size of the Pulsar client to 4 MB), so this enhancement can cut disk reads by roughly 99.9%: a 4 KB page-cache prefetch instead of a 4 MB entry read is about 0.1% of the bytes.

There is a potential risk associated with this PR:
We solely read the entryID without validating the entry data. If the file is corrupted, the scan operation will be unable to detect it and will incorrectly populate the index with the wrong ledger size. This introduces a high level of risk.

The fault-detection logic is unchanged. In the old logic we read entrySize bytes of data and checked that we actually read that amount. With the new read types, READ_NOTHING and READ_LEDGER_ENTRY_ID, we still advance the read position in the loop and check that the expected amount of data was read. I don't think this is a breaking change; see the sketch below.
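For illustration, a hypothetical sketch (not the PR's actual code) of the truncation check being described, where the position always advances by the full entry size regardless of how much was read:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

// Hypothetical sketch, not the PR's actual code: even when only part of an
// entry is read, the loop verifies the read and advances by the full entry
// size, so a truncated or corrupted file is detected exactly as before.
final class PartialReadCheck {
    static long readAndAdvance(FileChannel ch, long pos, int entrySize,
                               int lengthToRead) throws IOException {
        if (lengthToRead > 0) {
            ByteBuffer buf = ByteBuffer.allocate(lengthToRead);
            int read = ch.read(buf, pos + 4); // skip the 4-byte size header
            if (read != lengthToRead) {
                // Short read => truncated/corrupted file: abort the scan
                // instead of populating the index with wrong metadata.
                throw new IOException("short read at offset " + pos);
            }
        }
        return pos + 4 + entrySize; // always skip the whole entry
    }
}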
