changefeedccl: Emit span resolved event when end time reached #109439
Conversation
@jayshrivastava -- ignore first commit; it's being reviewed at #109346
Nice!
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker and @miretskiy)
-- commits
line 40 at r3:
typo: improve
pkg/ccl/changefeedccl/kvevent/event.go
line 100 at r3 (raw file):
// number of event types.
numEventTypes = 3
I think we should do TypeResolved + 1
pkg/ccl/changefeedccl/kvevent/event.go
line 131 at r3 (raw file):
	return int(TypeKV)
default:
	return int(TypeResolved)
Do you think we could add a case for Unknown? The blocking buf should never see an event of type unknown because the kvfeed will check before adding, but maybe someone in the future will call this without doing the check first.
pkg/ccl/changefeedccl/changefeed_test.go
line 6873 at r4 (raw file):
// Don't care much about the values emitted (tested elsewhere) -- all
// we want to make sure is that the feed terminates.
We should check if we get resolved timestamps at the boundary when end_time is reached. Can use the filter span with mutation knob.
pkg/ccl/changefeedccl/kvevent/blocking_buffer_test.go
line 180 at r3 (raw file):
require.NoError(t, buf.CloseWithReason(context.Background(), context.Canceled))
require.EqualValues(t, 0, metrics.AllocatedMem.Value())
I think we should assert that the numKV, numResolved, numFlush metrics are reset too. You probably need to store in flight counters to do so.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker and @jayshrivastava)
pkg/ccl/changefeedccl/kvevent/event.go
line 100 at r3 (raw file):
Previously, jayshrivastava (Jayant) wrote…
I think we should do
TypeResolved + 1
Done.
pkg/ccl/changefeedccl/kvevent/event.go
line 131 at r3 (raw file):
Previously, jayshrivastava (Jayant) wrote…
Do you think we could add a case for Unknown? The blocking buf should never see an event of type unknown because the kvfeed will check before adding, but maybe someone in the future will call this without doing the check first.
Made it more explicit, plus a log warning in case of an unknown event.
pkg/ccl/changefeedccl/changefeed_test.go
line 6873 at r4 (raw file):
Previously, jayshrivastava (Jayant) wrote…
We should check if we get resolved timestamps at the boundary when end_time is reached. Can use the filter span with mutation knob.
Good idea -- done.
pkg/ccl/changefeedccl/kvevent/blocking_buffer_test.go
line 180 at r3 (raw file):
Previously, jayshrivastava (Jayant) wrote…
I think we should assert that the numKV, numResolved, numFlush metrics are reset too. You probably need to store in flight counters to do so.
I'll pass for now. I tried that. First, numKV/flush/resolved are counters, not gauges.
And even the things that are counters (like the Acquired/Released counters) are updated only when
release is called. I realized that release sometimes is not called -- for example, when the buffer is closed without releasing all buffered events.
Plus, there is a fair bit of variability in when these flush events are generated.
The point is, this test was becoming more and more elaborate, and getting all metrics correctly reset was proving to be more trouble than it was worth.
Instead, I focused on the allocated memory -- that one is correct to assert is reset, because whether or not
I release events, the memory is returned when closing the quota.
I suspect we'll have to revisit this issue, since I'm not 100% convinced that we're not generating an excessive number of flush requests.
LGTM. Ty for working on this
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker and @miretskiy)
pkg/ccl/changefeedccl/changefeed_test.go
line 6920 at r5 (raw file):
// After changefeed completes, verify we have seen all ranges emit resolved
// event with end_time timestamp. That is: verify frontier.Frontier() is at end_time.
expectedFrontier := endTime.Prev()
For my own understanding, can you explain why it's Prev()?
pkg/ccl/changefeedccl/kvevent/blocking_buffer_test.go
line 180 at r3 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
I'll pass for now. I tried that. First, numKV/flush/resolved are counters, not gauges.
And even the things that are counters (like the Acquired/Released counters) are updated only when
release is called. I realized that release sometimes is not called -- for example, when the buffer is closed without releasing all buffered events.
Plus, there is a fair bit of variability in when these flush events are generated. The point is, this test was becoming more and more elaborate, and getting all metrics correctly reset was proving to be more trouble than it was worth.
Instead, I focused on the allocated memory -- that one is correct to assert is reset, because whether or not
I release events, the memory is returned when closing the quota. I suspect we'll have to revisit this issue, since I'm not 100% convinced that we're not generating an excessive number of flush requests.
Ack!
We have always treated end_time as exclusive. So, when the end_time boundary is reached, we emit the "EXIT" span boundary at boundary.Prev().
Not too familiar with the CDC parts, so only skimmed this, but overall LGTM.
Reviewed 3 of 8 files at r1, 1 of 1 files at r2, 4 of 5 files at r3, 7 of 9 files at r5.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @jayshrivastava and @miretskiy)
pkg/cmd/roachtest/tests/cdc_bench.go
line 201 at r5 (raw file):
// (and release allocations), resulting in significant drop in completed ranges throughput.
settings.ClusterSettings["kv.rangefeed.closed_timestamp_refresh_interval"] = "5s"
settings.ClusterSettings["changefeed.memory.per_changefeed_limit"] = "4G"
Shouldn't we benchmark out-of-the-box performance, since that's what users will typically experience? If the defaults are bad, then we should fix the defaults, not fudge the benchmark.
If we really need this, I'd argue it should be a separate variant, so that we benchmark both default performance and tuned performance -- it's important that we have a realistic view of what the real-world performance is.
pkg/cmd/roachtest/tests/cdc_bench.go
line 215 at r5 (raw file):
// due to elevated goroutine scheduling latency.
// Default is 2s; make it 3x.
settings.Env = append(settings.Env, "COCKROACH_NETWORK_TIMEOUT=6s")
Fixed on master, see #109578.
Good points, @erikgrinaker. I will test to see if this benchmark passes with the new closed timestamp.
Nice to know; I'm going to try to see if 4s is sufficient.
@erikgrinaker -- verified the benchmark still passes once I removed some of the settings you pointed out (a timeout of 4s is fine, and a 3s closed timestamp is also fine).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The typical user isn't going to change any of these settings unless we tell them to. I think we should focus on the default case, and make the default case better if necessary. We can have separate tuned benchmark variants if we want to look at their effect. Your call though, these are CDC's benchmarks.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @jayshrivastava and @miretskiy)
Fix data race in management of `InCatchup` range information.

Epic: None
Release note: None
Use `quotapool.Inf()` as the throttle limit instead of max int64 to ensure that when the throttler is disabled, attempts to acquire quota are roughly free.

Epic: None
Release note: None
Improve blocking buffer observability by adding:

* `changefeed.buffer_entries.allocated_memory` -- gauge keeping track of currently allocated memory for the events added to the blocking buffer.
* `changefeed.buffer_entries.<event_type>` -- counters keeping track of the number of events of each type (flush, KV, resolved) added.

Informs cockroachdb#108464

Release note: None
Changefeed supports a mode where the user wants to emit all events that occurred since some time in the past (`cursor`) and end the changefeed (`end_time`) at a time in the near future. In this mode, the rangefeed catchup scan starting from the `cursor` position could take some time -- maybe even a lot of time -- and in that case, the very first checkpoint the kvfeed observes will be after `end_time`. All of the events, including checkpoints after `end_time`, are skipped, as they should be. However, this meant that no changefeed checkpoint records could be produced until the entire changefeed completed. This PR ensures that once `end_time` is reached, we emit one "resolved event" for that span, so that the changefeed can produce a span-based checkpoint if needed.

Fixes cockroachdb#108464

Release note: None
Re-enable regular rangefeed catchup benchmark over 100k ranges. Adjust cdc bench configuration to ensure the benchmark completes in a reasonable time.

Fixes cockroachdb#108157

Release note: None
bors r+
Build succeeded.