
Fix regression issue of too large score #10518

Closed
siying wants to merge 4 commits

Conversation

siying
Contributor

@siying commented Aug 10, 2022

Summary:
#10057 caused a regression bug: since the base level size is no longer adjusted based on L0 size, the L0 score can become very large. This makes compaction heavily favor L0->L1 compaction over L1->L2 compaction and, in some cases, causes data to get stuck in L1 without being moved down. We fix this by calculating the L0 score as size(L0)/size(L1) in the case where L0 is large.

Test Plan: run db_bench against data on tmpfs and verify that the behavior of data getting stuck in L1 goes away.
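A minimal sketch of the idea, assuming hypothetical helper and parameter names (the actual change is in db/version_set.cc and is not reproduced verbatim here):

#include <cstdint>

// Illustrative only: when Lbase is already larger than its target, score L0
// against the actual Lbase size, so L0->Lbase is only strongly favored when
// L0 is genuinely bigger than Lbase; otherwise score against the base target.
double L0SizeScore(uint64_t l0_size, uint64_t base_level_size,
                   uint64_t max_bytes_for_level_base) {
  if (base_level_size > max_bytes_for_level_base) {
    return static_cast<double>(l0_size) / static_cast<double>(base_level_size);
  }
  return static_cast<double>(l0_size) /
         static_cast<double>(max_bytes_for_level_base);
}

With this shape of scoring, the L0 score no longer explodes when Lbase is large, while L0->Lbase is still favored once L0 outgrows Lbase.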

Summary:
facebook#10057 caused a regression bug: since the base level size is no longer adjusted based on L0 size, the L0 score can become very large. This makes compaction heavily favor L0->L1 compaction over L1->L2 compaction and, in some cases, causes data to get stuck in L1 without being moved down. We fix this by calculating the L0 score in the same way as L1, so that L0->L1 is favored if the L0 size is larger than L1.

Test Plan: run db_bench against data on tmpfs and verify that the behavior of data getting stuck in L1 goes away.
@facebook-github-bot
Contributor

@siying has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@siying has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Contributor

@siying has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Contributor

@ajkr left a comment

LGTM!

@facebook-github-bot
Contributor

@siying has updated the pull request. You must reimport the pull request before landing.

Contributor

@ajkr left a comment

The change makes sense to me. Score(L0) looks lower than pre-#10057 but higher than post-#10057 in the case of a growing L0 and Lbase. In any case, I am fine with trying this.

@facebook-github-bot
Contributor

@siying has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Contributor

@siying has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

// It might be guaranteed by logic below anyway, but we are
// explicit here to make sure we don't stop writes with no
// compaction scheduled.
score = std::max(score, 1.01);
Contributor

Why not use std::max(score, std::min(level_multiplier_, total_size / max_bytes_for_level_base)) so that an increased Size(L0) would actually cause the score to increase even when Size(L0) is smaller than Size(Lbase)? In the current PR, for example, if Score(Lbase) = 3, then Size(L0)/Size(Lbase) must be larger than 3 to prioritize L0->Lbase. That seems like too large an L0 and also an hourglass shape. With total_size / max_bytes_for_level_base we will at least prefer L0 whenever it's larger than Lbase.
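For comparison, a sketch of the alternative suggested here, using the names from the comment above (level_multiplier_, total_size, max_bytes_for_level_base) wrapped in a hypothetical helper:

#include <algorithm>
#include <cstdint>

// Let a growing L0 raise the score even while it is smaller than Lbase, but
// clamp at the level multiplier so total_size / max_bytes_for_level_base
// cannot blow up.
double SuggestedL0Score(double score, double level_multiplier,
                        uint64_t total_size,
                        uint64_t max_bytes_for_level_base) {
  return std::max(score,
                  std::min(level_multiplier,
                           static_cast<double>(total_size) /
                               static_cast<double>(max_bytes_for_level_base)));
}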

Contributor Author

I tried that. I faced a problem that total_size / max_bytes_for_level_base can be very large. In the current algorithm, L1 will actually have a much smaller score than L0 (because we de-prioritize a level when there is a lot of pending data in the levels above it), and this causes data to get stuck in L1. That's why the score is always divided by Size(Lbase) when Lbase is relatively large: this makes it possible for the L0 score to be smaller than the Lbase score when Lbase is large.
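For illustration only (placeholder names, not the RocksDB code): the de-prioritization described here amounts to adding the pending bytes of upper levels to the denominator of a level's score, so a level with a lot of data still waiting to come down scores lower.

#include <cstdint>

// Sketch: the more data the levels above still need to push down into this
// level, the smaller this level's score becomes.
double LevelScore(uint64_t level_bytes, uint64_t level_target_bytes,
                  uint64_t pending_bytes_from_upper_levels) {
  return static_cast<double>(level_bytes) /
         static_cast<double>(level_target_bytes +
                             pending_bytes_from_upper_levels);
}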

Contributor

I faced a problem that total_size / max_bytes_for_level_base can be very large

The std::min(level_multiplier_, total_size / max_bytes_for_level_base) was intended to prevent that case.

That's why the score is always divided by Size(Lbase) when Lbase is relatively large: this makes it possible for the L0 score to be smaller than the Lbase score when Lbase is large.

I think Score(L0) < Score(Lbase) when Size(L0) < Size(Lbase) is good enough. The current formula makes Score(L0) < Score(Lbase) when Size(L0) < Score(Lbase) * Size(Lbase), which just seems like L0 may have to be much much bigger than Lbase in order for L0 to be prioritized.

Anyway, I don't mind trying it if you want, since an hourglass-shaped LSM might work fine. I haven't measured it, so my approval still stands.

Contributor Author

@siying Aug 12, 2022

I think Score(L0) < Score(Lbase) when Size(L0) < Size(Lbase) is good enough. The current formula makes Score(L0) < Score(Lbase) when Size(L0) < Score(Lbase) * Size(Lbase), which just seems like L0 may have to be much much bigger than Lbase in order for L0 to be prioritized.

The benchmark I ran (the one Mark used) actually shows that L0->L1 still feels prioritized, perhaps too much, over L1->L2. If L0 is larger than L1, L0->L1 is always prioritized over L1->L2, because the score math makes the L1 score < 0.5 (5 after scaling). Similarly, if the number of L0 files exceeds the L0 file trigger, we almost always pick from L0. L1->L2 usually only gets picked when L1 is significantly larger than L0.

Here are some snapshots of the benchmark runs (level multiplier is 8 and base level size is 64MB, the exact parameters used by Mark):

  L0     12/10  321.94 MB 
  L4     23/23  368.26 MB 
  L5     11/0   176.77 MB 
  L6    112/4    1.50 GB  
  L7    798/23  11.42 GB  

  L0      8/6   218.41 MB
  L4     24/24  374.77 MB
  L5     20/0   317.81 MB
  L6    118/5    1.53 GB 
  L7    814/25  11.69 GB 

  L0     10/6   278.14 MB
  L4     20/20  312.10 MB
  L5     11/0   175.13 MB
  L6    119/4    1.55 GB 
  L7    830/18  11.96 GB 

  L0      9/6   307.93 MB
  L4     20/20  321.85 MB
  L5     12/0   193.10 MB
  L6    113/0    1.52 GB 
  L7    853/0   12.21 GB 

  L0      5/0   131.31 MB 
  L4     26/0   402.99 MB 
  L5     18/0   282.12 MB 
  L6    123/3    1.60 GB  
  L7    871/15  12.42 GB  

Contributor

If L0 is larger than L1, L0->L1 is always prioritized over L1->L2, because the score math makes the L1 score < 0.5 (5 after scaling).

Got it, did the math and agree that Score(L0) > Score(Lbase) when Size(L0) > Size(Lbase). I had forgotten to incorporate Size(L0) into the denominator of Score(Lbase) earlier (courtesy of total_downcompact_bytes). This is challenging to reason about, but I am glad it works.

Similarly, if the number of L0 files exceeds the L0 file trigger, we almost always pick from L0.

Understood this part already. The size-based component of Score(L0) is not used for this case.

Contributor

@ajkr Aug 15, 2022

the score math makes the L1 score < 0.5 (5 after scaling).

My calculation for Score(Lbase) is 10 * S'(Lbase) / (max_bytes_for_level_base + S(L0)), where S'(Li) is the sum of compensated sizes of files not being compacted on level i, and S(Li) is the sum of all actual file sizes on level i.

I don't see why S'(L0) > S'(Lbase) (assuming that's what you mean by L0 is larger than L1) means Score(Lbase) is < 5. It looks to be typically less than 10, unless Lbase files have high size compensation.

My attempt to prove Score(L0) > Score(Lbase) in this scenario (S'(L0) > S'(Lbase)) required the following two small tweaks. I think they fix a case where Lbase has a lot of deletions, and I wonder if you think they make sense.

  • base_level_size uses S'(Lbase) instead of S*(Lbase). S*(Li) is the sum of all compensated file sizes on level i. This change removes files undergoing compaction from base_level_size.
  • total_downcompact_bytes uses S'(Li) instead of S(Li). This changes actual file size to compensated file size in total_downcompact_bytes, and removes files undergoing compaction.
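Restating the Score(Lbase) formula from above as a small sketch (helper and parameter names are placeholders; S' is the sum of compensated sizes of files not under compaction, S the sum of actual file sizes):

#include <cstdint>

// Score(Lbase) = 10 * S'(Lbase) / (max_bytes_for_level_base + S(L0)).
// The second tweak above would pass S'(L0) instead of S(L0) here.
double ScoreLbase(uint64_t lbase_compensated_not_compacting,  // S'(Lbase)
                  uint64_t l0_bytes,                          // S(L0)
                  uint64_t max_bytes_for_level_base) {
  return 10.0 * static_cast<double>(lbase_compensated_not_compacting) /
         static_cast<double>(max_bytes_for_level_base + l0_bytes);
}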

siying added a commit that referenced this pull request Aug 12, 2022
Summary:
#10057 caused a regression bug: since the base level size is no longer adjusted based on L0 size, the L0 score can become very large. This makes compaction heavily favor L0->L1 compaction over L1->L2 compaction and, in some cases, causes data to get stuck in L1 without being moved down. We fix this by calculating the L0 score as size(L0)/size(L1) in the case where L0 is large.

Pull Request resolved: #10518

Test Plan: run db_bench against data on tmpfs and verify that the behavior of data getting stuck in L1 goes away.

Reviewed By: ajkr

Differential Revision: D38603145

fbshipit-source-id: 4949e52dc28b54aacfe08417c6e6cc7e40a27225
tabokie pushed a commit to tabokie/rocksdb that referenced this pull request Aug 14, 2023
Summary:
facebook#10057 caused a regression bug: since the base level size is no longer adjusted based on L0 size, the L0 score can become very large. This makes compaction heavily favor L0->L1 compaction over L1->L2 compaction and, in some cases, causes data to get stuck in L1 without being moved down. We fix this by calculating the L0 score as size(L0)/size(L1) in the case where L0 is large.

Pull Request resolved: facebook#10518

Test Plan: run db_bench against data on tmpfs and verify that the behavior of data getting stuck in L1 goes away.

Reviewed By: ajkr

Differential Revision: D38603145

fbshipit-source-id: 4949e52dc28b54aacfe08417c6e6cc7e40a27225
Signed-off-by: tabokie <[email protected]>
tabokie pushed a commit to tabokie/rocksdb that referenced this pull request Aug 15, 2023