fix bug in auto-compact #9443
Conversation
ping @xiang90
works as expected now
I think this bug was introduced by PR #8563.
Yeah, I agree. We need to add a test for this.
Can you fix the test failures? Also, this only needs a backport to 3.3.
compactor/periodic.go (outdated diff)

```diff
@@ -86,6 +86,10 @@ func (t *Periodic) Run() {
 		}
 		plog.Noticef("Starting auto-compaction at revision %d (retention: %v)", rev, t.period)
 		_, err := t.c.Compact(t.ctx, &pb.CompactionRequest{Revision: rev})
+		// update the last compaction time
+		last = clock.Now()
```
Why add it here? Should we record the last compaction time only when it succeeds, i.e. put it under:

```go
if err == nil || err == mvcc.ErrCompacted {
	t.revs = remaining
	last = clock.Now()
}
```
I think it makes sense to reset `last` on every `Compact` call, whether it failed or not. Otherwise, once `clock.Now().Sub(last) >= t.period`, the retry interval would be 1/10 of the original period.
@gyuho the bug was introduced here: #8563 (diff). As you can see, the original code set `last` only when compaction succeeded.
Yeah, saw that. We can just reset `last` only when it succeeds, and maybe change this later in a refactoring.
Actually, when I wrote this fix I thought about it for a while. I think we should update `last` whether the compaction failed or not. When it fails, we don't want to retry faster either (10x faster); otherwise we should document that behavior for the user.
On second thought, we'd better update `last` only when the compaction succeeds. Otherwise we would have to wait another full compaction period, rather than `checkCompactInterval`, before retrying. It also makes the logging at line 98 wrong. Let's fix this as @fanminshi suggested.
My concern is that I don't want to retry faster when compaction fails; 10x faster is too fast and less safe.
@WIZARD-CXY if you have a compaction period of 10 hours, you definitely do not want to wait another 10 hours before retrying; it means your etcd db size would likely double.
@xiang90 the thing is, the compaction period is 10 minutes in our case, not 10 hours.
Actually it all comes down to different use cases. To fix this bug first, I agree with changing it to update `last` only when compaction succeeds.

I finally get why the test didn't catch this; the code below is from periodic_test.go, and as @fanminshi commented: "// after 2 hours, compaction happens at every checkCompactInterval."

It should be changed according to that comment, but I don't quite understand how this test works, so I won't change it now. @gyuho can you provide another test?
@fanminshi can you take a final look at this one?
@xiang90 yes, I am taking a look at this issue.
I did some more in-depth digging into what is going on with #8563.

Before the change in #8563, etcd only supported compaction retention at the hour level, so etcd always retained at least 1 hour's worth of history. But what if I want to retain 5 hours' worth of history, where I set the etcd flag accordingly? One might expect compaction to run every 5 hours. However, that's not the case: etcd actually first waits 5 hours for the first compaction and then compacts every hour afterward. That was the default behavior (please correct me if I am wrong).

After the change in #8563: the purpose of #8563 is to support compaction retention at a user-defined granularity, whatever retention policy the user specifies. Hence, #8563 decides to divide the user-defined retention period into check intervals.

Question to ask:
@WIZARD-CXY I see two options going forward with this. Then we update the test to reflect the agreed behavior.
I think we should reset the compact interval on success, so that we only retain the given retention number in hours. This is how v3.2 works.
@gyuho are you sure about this?
ah,
@gyuho yeah, we need to give more thought to the correct behavior before continuing with this PR.
See #7868. For hourly compaction, it makes sense to run more frequently to keep X hours of history. For a short compaction period, we can probably just compact every X interval.
@fanminshi I like option 2: "change the auto retention policy so that we compact every AutoCompactionRetention defined by the user." It is straightforward and the right match for my use case.
@WIZARD-CXY the behavior used to be like option 2, but it was changed because of #7868.
Let me provide some context. I once modified the periodic compactor to run auto compaction hourly (#7868, #7875). Before #7875, when you set 10 hours, etcd could keep revisions for up to 20 hours, which means you needed to provision your DB quota for 20 hours of data. That is not a big problem for shorter durations, but if you set a larger compaction retention time, the overhead can be too large (especially when the cap was limited to 8GB).

I once thought we could define another flag for the auto compaction interval, separate from the auto compaction retention time or retention revisions, but at the time I considered it over-engineering. However, it could be worth revisiting, given that we can see different needs for the auto compaction interval.

The 5-minute interval is used for the revision compactor as well, so I think we should also check whether the interval is reasonable for the revision compactor.

For the default value for the periodic compactor: as @xiang90 suggested, I think that
Oops, sorry, I had some misunderstanding of the current behavior. Please forget about this; the variable is no longer shared as of 3.3. The current interval is

As already suggested, I agree that just setting a minimum interval is enough to prevent too-aggressive auto compaction.
We need to update the `last` variable in the for loop to record the last compaction time. If we don't, we will compact every `checkCompactInterval` instead of every `t.period`. I'm not quite sure why the unit test failed to catch this.