Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Flush aggregator on shutdown #3971

Merged
merged 7 commits into from
Aug 4, 2020
Merged

Flush aggregator on shutdown #3971

merged 7 commits into from
Aug 4, 2020

Conversation

axw
Copy link
Member

@axw axw commented Jul 14, 2020

Motivation/summary

Flush aggregated transaction metrics on shutdown, to minimise data loss. This could be important in a rolling upgrade.

There are some significant, and necessary, changes to the publisher to support an interruptible graceful shutdown-with-timeout. The publisher's Stop method now takes a context, and waits until the context is cancelled, or the published events are acknowledged, before returning. This is all necessary because the shutdown timeout, from the user's point of view, should encompass the entire shutdown procedure -- not just closing the pipeline client.

Checklist

I have considered changes for:
- [ ] documentation
- [ ] logging (add log lines, choose appropriate log selector, etc.)
- [ ] metrics and monitoring (create issue for Kibana team to add metrics to visualizations, e.g. Kibana#44001)

How to test these changes

  1. Enable transaction aggregation
  2. Send some transactions
  3. Shutdown APM Server (Ctrl+C) before the aggregation interval elapses
  4. Check that metrics are published within the shutdown interval (default 5s)

Related issues

Closes #3789

@apmmachine
Copy link
Contributor

apmmachine commented Jul 14, 2020

💚 Build Succeeded

Pipeline View Test View Changes Artifacts preview

Expand to view the summary

Build stats

  • Build Cause: [Pull request #3971 event]

  • Start Time: 2020-08-04T05:32:15.574+0000

  • Duration: 45 min 36 sec

Test stats 🧪

Test Results
Failed 0
Passed 3220
Skipped 153
Total 3373

Steps errors

Expand to view the steps failures

  • Name: Compress

    • Description: tar --exclude=coverage-files.tgz -czf coverage-files.tgz coverage

    • Duration: 0 min 0 sec

    • Start Time: 2020-08-04T05:46:37.151+0000

    • log

  • Name: Compress

    • Description: tar --exclude=system-tests-linux-files.tgz -czf system-tests-linux-files.tgz system-tests

    • Duration: 0 min 0 sec

    • Start Time: 2020-08-04T06:09:12.518+0000

    • log

@axw axw force-pushed the aggregator-shutdown branch from 3f9b8aa to 1b3b98d Compare July 14, 2020 07:14
axw added 5 commits July 23, 2020 17:53
Change the publisher to wait for enqueued events to
be published when Stop is called, up to the configured
ShutdownTimeout.
Remove the context param from Aggregator.Run, and
instead add a Stop method which accepts a context.
When Stop is called it directs Run to exit after
performing one final publication. Stop waits for
Run to exit, or for the context to be cancelled,
whichever comes first.

We call the Stop method with a timeout based on
the `apm-server.shutdown_timeout` configuration.
publisher.Stop now accepts a context, which is used
to interrupt waiting for published events to be
acknowledged. This is implemented by introducing a
custom beat.ACKer.
@axw axw force-pushed the aggregator-shutdown branch from 1b3b98d to 5d435e8 Compare July 23, 2020 10:12
@axw axw changed the title Aggregator shutdown Flush aggregator on shutdown Jul 23, 2020
@axw axw marked this pull request as ready for review July 23, 2020 10:37
@axw axw requested a review from a team July 23, 2020 10:37
@axw axw merged commit a937ee8 into elastic:master Aug 4, 2020
@axw axw deleted the aggregator-shutdown branch August 4, 2020 06:33
axw added a commit to axw/apm-server that referenced this pull request Aug 20, 2020
* publish: publish enqueued events on shutdown

Change the publisher to wait for enqueued events to
be published when Stop is called, up to the configured
ShutdownTimeout.

* aggregation/txmetrics: graceful shutdown

Remove the context param from Aggregator.Run, and
instead add a Stop method which accepts a context.
When Stop is called it directs Run to exit after
performing one final publication. Stop waits for
Run to exit, or for the context to be cancelled,
whichever comes first.

We call the Stop method with a timeout based on
the `apm-server.shutdown_timeout` configuration.

* tests/system: add test for flushing aggregations

* publish: add context to publisher.Stop

publisher.Stop now accepts a context, which is used
to interrupt waiting for published events to be
acknowledged. This is implemented by introducing a
custom beat.ACKer.

* Move publisher ShutdownTimeout handling to beater

* Update changelog
axw added a commit that referenced this pull request Aug 20, 2020
* publish: publish enqueued events on shutdown

Change the publisher to wait for enqueued events to
be published when Stop is called, up to the configured
ShutdownTimeout.

* aggregation/txmetrics: graceful shutdown

Remove the context param from Aggregator.Run, and
instead add a Stop method which accepts a context.
When Stop is called it directs Run to exit after
performing one final publication. Stop waits for
Run to exit, or for the context to be cancelled,
whichever comes first.

We call the Stop method with a timeout based on
the `apm-server.shutdown_timeout` configuration.

* tests/system: add test for flushing aggregations

* publish: add context to publisher.Stop

publisher.Stop now accepts a context, which is used
to interrupt waiting for published events to be
acknowledged. This is implemented by introducing a
custom beat.ACKer.

* Move publisher ShutdownTimeout handling to beater

* Update changelog
@simitt
Copy link
Contributor

simitt commented Oct 15, 2020

successfully tested;

  • started APM Server with aggregation enabled
  • sent a couple of transaction docs
  • ensured histogram metrics are not yet sent to ES
  • stopped APM Server, ensured histogram metrics are indexed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Flush current histogram data on shutdown
3 participants