Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[aggregator] Propagate cancellation through tick #3895

Merged
merged 8 commits into from
Nov 3, 2021

Conversation

arnikola
Copy link
Collaborator

@arnikola arnikola commented Nov 2, 2021

Aggregator's tickInternal method can soft-block aggregator.Close() for
a long period of time if the doneCh is signalled during a tick; these end up
iterating over internal shard ticks, which can take up to EntryCheckInterval,
which defaults to an hour. This is much longer than the graceful close period,
which is 15s by default, so aggregator graceful close almost never completes,
and ends up timing out instead.

@codecov
Copy link

codecov bot commented Nov 2, 2021

Codecov Report

Merging #3895 (ab6289d) into master (246e22f) will decrease coverage by 0.1%.
The diff coverage is 77.7%.

❗ Current head ab6289d differs from pull request most recent head 62d82ee. Consider uploading reports for the commit 62d82ee to get more accurate results

Impacted file tree graph

@@           Coverage Diff            @@
##           master   #3895     +/-   ##
========================================
- Coverage    56.9%   56.7%   -0.2%     
========================================
  Files         553     553             
  Lines       63286   63323     +37     
========================================
- Hits        36051   35961     -90     
- Misses      24047   24163    +116     
- Partials     3188    3199     +11     
Flag Coverage Δ
aggregator 63.2% <77.7%> (+<0.1%) ⬆️
cluster ∅ <ø> (∅)
collector 58.4% <ø> (ø)
dbnode 60.4% <ø> (-0.3%) ⬇️
m3em 46.4% <ø> (ø)
metrics 19.7% <ø> (ø)
msg 74.2% <ø> (-0.3%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.


Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 246e22f...62d82ee. Read the comment docs.

@@ -51,7 +66,14 @@ func Serve(
if err := m3msgServer.ListenAndServe(); err != nil {
return fmt.Errorf("could not start m3msg server at: addr=%s, err=%v", m3msgAddr, err)
}
defer m3msgServer.Close()

defer func() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could probably add all of this logging into x/server server.go Close func so you don't need to do these custom defer func

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Though prob only a subset of the servers we want this for

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm actually not all of these are using this x/server, e.g. the http handler just implements x/server here, so this may not be the best approach

Copy link
Collaborator

@ryanhall07 ryanhall07 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the logging feels like debug to me and kind of noisy when things are actually working as intended.

src/aggregator/aggregator/aggregator.go Outdated Show resolved Hide resolved
src/aggregator/aggregator/aggregator.go Outdated Show resolved Hide resolved
)

// NB: if no doneChan provided, do not interrupt the tick.
if doneCh == nil {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when is a done chan not provided?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It shouldn't be if everything is hooked up correctly, this is largely just to be defensive here; might remove it and hard panic or better yet error if we don't have a chan passed in to make it more explicit

Copy link
Collaborator Author

@arnikola arnikola Nov 3, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Huh, actually... TIL that you can use nil channels in a select https://play.golang.org/p/DltRooS3v7D, just going to get rid of this section

shardTickResult := shard.Tick(perShardTickDuration)
tickResult = tickResult.merge(shardTickResult)
select {
case <-agg.doneCh:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: we don't actually need to check here right? since we check within the tickShardFn when iterating over every metric right? just checking my understanding.

the only reason I mention this nit, is it at first it reads like we only check once per shard, which would be bad.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea here is if we've signaled on doneCh, we can avoid ticking the remainder of the shards, I'll make this more explicit

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah right. good point. nah it makes sense.

@arnikola arnikola enabled auto-merge (squash) November 3, 2021 14:24
@arnikola arnikola disabled auto-merge November 3, 2021 14:36
@arnikola arnikola merged commit 27ed1c0 into master Nov 3, 2021
@arnikola arnikola deleted the arnikola/graceful-close branch November 3, 2021 14:59
soundvibe added a commit that referenced this pull request Nov 4, 2021
* master:
  Fix race when checking for dirty aggregations (#3886)
  [aggregator] Add test coverage to expireValues (#3898)
  [aggregator] Propagate cancellation through tick (#3895)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants