RecursionError observed in gettext calls #674
Comments
@timmc-edx: Is there any chance this could interfere with Datadog-related changes, and should it be prioritized? Or do you think it is fine to wait? Note that having the APM Profiler in place would be nice if this were a memory-related issue.
I suspect it's an unrelated issue. It's troubling, but probably not our top priority.
We recently saw another burst of RecursionError. This occurred on 7/28 UTC, a Sunday in the US, and there were no deploys around the start. This logs-based timeline shows a sustained burst starting around 21:30 and then slowly trailing off until around 03:00 the next day, with spikes at about 15 minutes past each hour (perhaps some automated job). A zoom-in of the start time shows the entire ramp-up occurring from 21:35 to 21:40. Note that the numbers are larger in this case, probably because logs are sampled and filtered less heavily than traces.
One of the strangest things is that this issue has each time been limited to a few hosts at a time, but occurring almost simultaneously, and across availability zones. A zoom-in of the start of the 7/28 event indicates that it affected the three hosts progressively, but in close succession (1-2 minutes apart). This either indicates the hosts were responding to the same network event or were running into some resource limit around the same time. (For reference: The hosts here were
We're observing bursts of RecursionError in the LMS relating to Python's `gettext` module. The earliest known case was May 15, but we don't know how far back these actually go, due to telemetry limitations.
Looking at the June 6 burst using `env:prod service:edx-edxapp-lms status:error error.type:builtins.RecursionError`, I do not see any visible precipitating incident, although the end coincided with a deploy. There is no correlation with code owner, IP address, user ID, route, org, or course ID. Three hosts were involved, across two AZs. The two instances I've spot-checked during the June 6 burst appear to involve this frame repeating:
The earliest one I can find (trace `66587fa500000000d95d5070b24064bd` on May 30 at 13:31:17.356 UTC) is a little different, but still involves gettext and fallbacks.

To investigate