Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RecursionError observed in gettext calls #674

Open
timmc-edx opened this issue Jun 7, 2024 · 4 comments
Open

RecursionError observed in gettext calls #674

timmc-edx opened this issue Jun 7, 2024 · 4 comments

Comments

@timmc-edx
Copy link
Member

timmc-edx commented Jun 7, 2024

We're observing bursts of RecursionError in the LMS relating to Python's gettext module.

The earliest known case was May 15, but we don't know how far back these actually go, due to telemetry limitations.

Looking at the June 6 burst using env:prod service:edx-edxapp-lms status:error error.type:builtins.RecursionError, I do not see any visible precipitating incident, although the end coincided with a deploy. No correlation with code owner, IP address, user ID, route, org, or course ID. 3 hosts were involved across 2 AZs.

The two instances I've spot-checked during the June 6 burst appear to involve this frame repeating:

  File "/usr/lib/python3.8/gettext.py", line 497, in gettext
    return self._fallback.gettext(message)

The earliest one I can find (trace 66587fa500000000d95d5070b24064bd on May 30 at 13:31:17.356 UTC) is a little different but still involves gettext and fallbacks:

  File "/usr/lib/python3.8/gettext.py", line 268, in add_fallback
    self._fallback.add_fallback(fallback)
  File "/usr/lib/python3.8/gettext.py", line 268, in add_fallback
    self._fallback.add_fallback(fallback)

To investigate

  • Given that this burst ended after a deploy and only involved a few hosts, is there a chance that this is analogous to a memory leak? Perhaps some code makes recursive calls, and this situation gets worse over time. Is something adding hundreds of fallbacks?
  • Any correlation with user agent or referrer?
  • django-countries shows up in some of these stack traces -- is it involved or just incident to them?
@timmc-edx timmc-edx converted this from a draft issue Jun 7, 2024
@robrap robrap moved this to Prioritized in Arch-BOM Jun 7, 2024
@robrap robrap removed the status in Arch-BOM Jun 7, 2024
@robrap
Copy link
Contributor

robrap commented Jun 7, 2024

@timmc-edx: Is there any chance this could interfere with Datadog-related changes, and should it be prioritized? Or do you think it is fine to wait? Note that having the APM Profiler in place would be nice if this were a memory-related issue.

@timmc-edx
Copy link
Member Author

timmc-edx commented Jun 10, 2024

I suspect it's an unrelated issue. It's troubling, but probably not our top priority.

@jristau1984 jristau1984 moved this to Backlog in Arch-BOM Jul 1, 2024
@timmc-edx
Copy link
Member Author

We recently saw another burst of RecursionError. This occurred on 7/28 UTC, a Sunday in the US, and there were no deploys around the start. This logs-based timeline shows a sustained burst starting around 21:30 and then slowly trailing off until around 03:00 the next day, with spikes at about 15 minutes past each hour (perhaps some automated job):

image

A zoom-in of the start time shows the entire ramp-up occurring from 21:35 to 21:40:

image

Note that the numbers are larger in this case, probably because logs are sampled and filtered less heavily than traces.

@timmc-edx
Copy link
Member Author

One of the strangest things is that this issue has each time been limited to a few hosts at a time, but occurring almost simultaneously, and across availability zones.

image

A zoom-in of the start of the 7/28 event indicates that it affected the three hosts progressively, but in close succession (1-2 minutes apart):

image

This either indicates the hosts were responding to the same network event or were running into some resource limit around the same time.

(For reference: The hosts here were i-075567276c7863e51, i-06f30d88537d08101916, i-002f78eb6f8519bd9819)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Backlog
Development

No branches or pull requests

2 participants