Enable APM profiling for edxapp #749
Comments
We should roll out to Stage, then Edge, then Prod.
DD support ticket for latency issues we encountered during the most recent rollout attempt: https://help.datadoghq.com/hc/requests/1909564
It seems like the newer version might be more efficient, so we should switch to using it. edx/edx-arch-experiments#749
I think I've managed to repro slow gunicorn startup on a sandbox instance.
Profiling setup
Added to
And then:
(Can also restart workers with
To get DD profiling data on both sides, pushed buttons in instructor dashboard and made calls to
Gunicorn repro
In a dev terminal, make short HTTP calls to the LMS 1-2 times per second:
For each config:
nginx output will look something like this:
The initial transition of
For comparison, here's
In this sample, the calls that were recorded as a 499 apparently did eventually reach the LMS, and were all processed in a burst about 10 seconds after the workers actually started.
Evaluation
After the 503s end: Find the number of seconds from the first 499 to the first 200. This is the "startup period".
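The evaluation rule above can be sketched in Python. This is only an illustration: it assumes the nginx log has already been parsed into chronological `(timestamp, status)` pairs, and the function name `startup_period` and the sample entries are hypothetical, not taken from the actual logs.

```python
from datetime import datetime, timedelta


def startup_period(entries):
    """Seconds from the first 499 to the first 200 after the 503s end.

    `entries` is assumed to be a chronological list of
    (datetime, http_status) pairs. Returns None if no 499 or no
    subsequent 200 is found.
    """
    # Ignore everything up to and including the last 503.
    last_503 = max(
        (i for i, (_, status) in enumerate(entries) if status == 503),
        default=-1,
    )
    tail = entries[last_503 + 1:]

    first_499 = next((ts for ts, status in tail if status == 499), None)
    if first_499 is None:
        return None

    first_200 = next(
        (ts for ts, status in tail if status == 200 and ts >= first_499),
        None,
    )
    if first_200 is None:
        return None

    return (first_200 - first_499).total_seconds()


# Made-up sample, for illustration only (not real measurements).
t0 = datetime(2024, 1, 1, 12, 0, 0)
sample = [
    (t0, 200),
    (t0 + timedelta(seconds=1), 503),
    (t0 + timedelta(seconds=2), 503),
    (t0 + timedelta(seconds=3), 499),
    (t0 + timedelta(seconds=4), 499),
    (t0 + timedelta(seconds=15), 200),
]
print(startup_period(sample))  # 12.0
```

Treating everything before the last 503 as noise is an assumption; if 503s can reappear mid-startup, the cutoff would need to be chosen differently.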
Experiments
Profiling off
With profiling off (no profiling-related settings), the startup period lasts 12 seconds.
Repro
With the below profiling config, which is what we most recently used in the stage environment, the startup period lasts 20 seconds.
Baselines
Just with profiling enabled, nothing else:
19 seconds (with one 499 a few seconds after the first 200s); 18; 18
Profiling, but v2 stack:
21; 22; 21
Experiment design
I'll keep this disabled for now since it's not needed for repro, and since we'll probably only want to use it when we want to actually look at the generated profiles:
To experiment with:
With a baseline of
On to the toggles... Turning every profiling feature off (except for profiling itself) gets to the "good" situation:
11, 9, 11
19, 19
12, 11, 11
18, 17, 17
16, 16, 16
14, 13, 13
13, 15, 15
More experiments... Profiling on (but v2 stack not enabled), memory disabled:
13, 12, 13
Also able to reproduce this on devstack.
Setup
Measurements
Baseline: [11, 15, 14, 14] seconds from "Booting worker" to first GET 200 in logs (13.4 seconds geometric mean)
Profiling enabled:
[27, 32, 31, 28] (29.4 seconds geometric mean)
Profiling, but not memory:
[19, 19, 20, 15, 20] (18.5 seconds geometric mean)
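For anyone re-running these measurements, the geometric means above can be recomputed with a short Python sketch. The dictionary keys are just labels chosen here for the three devstack configurations; the numbers are the ones listed above.

```python
from statistics import geometric_mean

# Seconds from "Booting worker" to the first GET 200, per configuration.
measurements = {
    "baseline": [11, 15, 14, 14],
    "profiling enabled": [27, 32, 31, 28],
    "profiling, but not memory": [19, 19, 20, 15, 20],
}

# Geometric mean per configuration, rounded to one decimal place.
summary = {
    name: round(geometric_mean(values), 1)
    for name, values in measurements.items()
}

for name, gm in summary.items():
    print(f"{name}: {gm} s")
```

This reproduces the 13.4, 29.4, and 18.5 second figures. On these samples, startup with full profiling takes roughly 2.2 times the baseline, and disabling memory profiling recovers a large part of the difference.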
ddtrace 2.18.1 is even worse, with the basic
Ultimately, we want to enable APM profiling for edxapp, when we think it is safe.
Notes:
The 2U Slack thread could probably be found if it would be helpful, but I doubt it would be, since we were just guessing at the time.