-
Notifications
You must be signed in to change notification settings - Fork 967
Conversation
PR Cover Letter
ReviewersIf you've been tagged for review, please check your corresponding box once you've given the 👍.
Post-review
|
@feanil almost all the rest of the Dogwood changes. |
👍 |
This PR broke sandbox builds for us. Restarting the certs service now fails:
This is the log output for the certs service:
Reverting this PR fixed the problem for us. |
@smarnach odd, because much of the change in this PR was to fix a problem where certs wouldn't start because xqueue wasn't ready yet, which looks like what's happening here. |
@nedbat Running the automated installation exactly as documented on a freshly installed Ubuntu 12.04 box results in the same error, so I think this is a genuine problem. We can work around the problem by reverting this PR, so we don't need a fix urgently. However, we probably don't want to leave it in the state it currently is. How should we proceed? @benpatterson Just for your information, here is another problem with sandbox deployment that wasn't caught by tests on your side. |
@smarnach I ran those instructions today, and it worked. It sounds like we have a race condition at work here. But the code changed specifically to fix this problem. The old way would use notify to restart servers, and the cert and xqueue restarts both happened very close together, which caused the problem. The new code restarts each server at the end of its own role, which should separate them more in time. Perhaps we need more separation? @feanil any thoughts? |
@nedbat Just as additional data points, we experienced the error in our automated sandbox deployment in 9 out of 9 attempts. Then I tried it manually on a freshly provisioned Vagrant instance using the |
@smarnach Has this been solved or discussed further somewhere else? |
@antoviaque I currently don't know whether we still experience this particular bug. Until last week, we were able to successfully provision sandboxes with the branch that reverts this pull request; then sandbox provisioning started failing again, but due to a different issue. I haven't tested specifically whether this is still a problem (and there haven't been any further discussions). |
What's the latest error? |
@nedbat It's a longer-standing issue with Celery workers hanging when being restarted by supervisorctl. There seems to be an (intermittently occurring) deadlock either in supervisord or in the Celery workers. The symptoms are that the playbook either hangs indefinitely when trying to restart one of the Celery workers, or it errors out completely. For a while, the changes in #2557 solved the problem for us, but currently we experience the problem consistently in every single deployment attempt again. In one respect, this is better than the intermittent failures, since we can now actually debug the problem, but I wasn't able to pin it down so far. My current theory is that Celery is the culprit (mostly because Celery Billiard is doing horrible things with mixing threads and fork() that definitely aren't POSIX compliant and in my opinion can't work reliably). |
@nedbat I can now confirm we are still experiencing the What mechanism is used to delay Ansible log (check out the timestamps):
Edit: Ah, I see now there is no "wait until xqueue is ready" mechanism. Sounds like we need to add an explicit check, or further increase the delay between these tasks. |
@bradenmacdonald let's open a new issue about this so we can get devops eyes on it. Also, how do you get timestamps on the ansible output? :) |
@nedbat I opened https://openedx.atlassian.net/browse/CRI-56 for now, since I wasn't sure which project to use (DEVOPS doesn't accept "Bug" reports). Feel free to move that issue if necessary. Timestamps are not from ansible but from the “OpenCraft Instance Manager” software which builds our sandboxes; it timestamps the logs from each instance as they stream back to the central console. |
The dogwood-specific detail about box defaults is not merged.