deployment shouldn't use a static temp directory #45

Open
tkellen opened this issue Sep 22, 2015 · 15 comments

Comments

@tkellen

tkellen commented Sep 22, 2015

If two deploys are happening simultaneously for some reason, they'll clobber each other.

@cowboy
Contributor

cowboy commented Sep 22, 2015

But if we change the temp dir name, then when two deploys are happening simultaneously, only one will actually deploy.

The real solution is probably to fail if a deployment is in progress. Just checking for the existence of the temp dir won't be adequate, however, because it could be leftover if a deploy fails partway through.

Ideas?

@tkellen
Author

tkellen commented Sep 22, 2015

I think it's okay if two deploys happen at the same time and whichever ends last wins. This relates to the work @brianloveswords is doing right now on http://github.com/brianloveswords/deployer, though, so I'm curious what he thinks. @MattSurabian too!

@tkellen
Author

tkellen commented Sep 22, 2015

@cowboy
Contributor

cowboy commented Sep 23, 2015

Do we need to consider that people might be running playbooks other than deploy at the same time, like configure?

@tkellen
Author

tkellen commented Sep 23, 2015

Multiple people running configure on a box in our infra seems incredibly unlikely to me. Trying to prevent something like that feels like an upstream concern (if we need to prevent concurrent runs, period, that should be something supported by Ansible itself, not by us here).

The only thing I want to address in this issue is handling concurrent deploys gracefully so we don't wind up with a site in production that is borked. I will admit the situation seems almost as rare as the configure one, but given how easy it is to mitigate, it seems worthwhile to put in the effort.

A temp directory suffixed with a hash is easy to implement and would mean any number of deploys could run simultaneously; whichever finished last would be the one that stayed live.
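To make that concrete, here's a minimal sketch of what hash/timestamp-suffixed temp directories might look like as Ansible tasks (deploy_tmp_dir and the /tmp/deploy- prefix are made-up names for illustration, not anything in this repo):

# Build a unique directory name per run; requires gathered facts for
# ansible_date_time. The random suffix guards against same-second runs.
- name: Generate a unique temp directory name for this deploy run
  set_fact:
    deploy_tmp_dir: "/tmp/deploy-{{ ansible_date_time.epoch }}-{{ 1000000 | random }}"

- name: Create the per-run temp directory
  file:
    path: "{{ deploy_tmp_dir }}"
    state: directory

Each deploy would then work in its own directory, and presumably whatever final step swaps the new build into place decides which finished deploy stays live.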

The only (sane) alternative that comes to mind is checking for the existence of a fixed temp directory and bailing out, telling the user to force. This is bad, though, because a temp directory could easily be left hanging around if a deploy was cut off mid-process, and it wouldn't work well in situations where we are automating things (re: @brianloveswords's work on deployer).

Finally, in addition to doing this, we should probably address this concern with simple communication. At previous gigs (where lots of people were competent in how to deploy), we all agreed to announce in chat when a deploy was kicked off. The deployment scripts also posted messages in chat as the deploy rolled out across our infra.

@cowboy
Contributor

cowboy commented Sep 23, 2015

Sounds like a good idea!

@brianloveswords

deployer is being designed such that each playbook gets a separate queue and only one instance of a given playbook can run at a time; any requests-to-deploy that happen while a run is in progress get queued, so the last one in wins [1].

Beyond that, deployer doesn't know or care what's in the playbooks, but I believe they should be designed to be as fault-tolerant as possible, even if in the automated world concurrent runs of the same playbook should be disallowed.


1. "Last one" meaning currently "last request-to-deploy message received". In an unlikely edge case where two messages come fairly close to each other and one gets delayed for some reason or needs to retry, the temporally earlier deploy message might end up getting received after the later message. We can mitigate this programmatically by adding a delay to the queue processor and sorting the messages by time sent so we process them in the order-of-intent rather than order-of-receipt. We can also mitigate this socially by telling people to avoid triggering two deploys in quick succession.

@cowboy
Contributor

cowboy commented Sep 23, 2015

I was really surprised that when I searched Google for ansible locking, I found just one Stack Overflow thread and nothing official. Ansible doesn't seem to care about this problem.

@cowboy
Contributor

cowboy commented Sep 23, 2015

I'm wondering if there's a way to write a task that would go at the top of each playbook where this matters, to check whether that playbook is also being run in another process.

Of course, I'm seeing stuff like this, which isn't exactly helpful at first glance.

$ ps aux | grep python
cowboy    7424  0.0  0.0   4440   652 pts/1    Ss+  16:23   0:00 /bin/sh -c sudo -k && sudo -H -S -p "[sudo via ansible, key=lgozyxysjhhceuqljyfxkvansecqpnou] password: " -u root /bin/sh -c 'echo BECOME-SUCCESS-lgozyxysjhhceuqljyfxkvansecqpnou; LANG=en_US.UTF-8 LC_CTYPE=en_US.UTF-8 /usr/bin/python /home/cowboy/.ansible/tmp/ansible-tmp-1443025429.21-51224197571946/npm; rm -rf /home/cowboy/.ansible/tmp/ansible-tmp-1443025429.21-51224197571946/ >/dev/null 2>&1'
root      7426  0.0  0.2  63668  2088 pts/1    S+   16:23   0:00 sudo -H -S -p [sudo via ansible, key=lgozyxysjhhceuqljyfxkvansecqpnou] password:  -u root /bin/sh -c echo BECOME-SUCCESS-lgozyxysjhhceuqljyfxkvansecqpnou; LANG=en_US.UTF-8 LC_CTYPE=en_US.UTF-8 /usr/bin/python /home/cowboy/.ansible/tmp/ansible-tmp-1443025429.21-51224197571946/npm; rm -rf /home/cowboy/.ansible/tmp/ansible-tmp-1443025429.21-51224197571946/ >/dev/null 2>&1
root      7427  0.0  0.0   4440   656 pts/1    S+   16:23   0:00 /bin/sh -c echo BECOME-SUCCESS-lgozyxysjhhceuqljyfxkvansecqpnou; LANG=en_US.UTF-8 LC_CTYPE=en_US.UTF-8 /usr/bin/python /home/cowboy/.ansible/tmp/ansible-tmp-1443025429.21-51224197571946/npm; rm -rf /home/cowboy/.ansible/tmp/ansible-tmp-1443025429.21-51224197571946/ >/dev/null 2>&1
root      7428  0.0  0.8  34052  8888 pts/1    S+   16:23   0:00 /usr/bin/python /home/cowboy/.ansible/tmp/ansible-tmp-1443025429.21-51224197571946/npm
cowboy    7491  0.0  0.0  10464   936 pts/0    S+   16:24   0:00 grep --color=auto python

@cowboy
Contributor

cowboy commented Sep 23, 2015

At the very least, we could have a playbook fail if it detects another ansible process. Thoughts?
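A very rough sketch of that idea, for what it's worth; the pattern and threshold are guesses, and it only catches a run that happens to be executing a module at that moment:

# Count distinct ansible-tmp-<timestamp> paths visible in running
# processes; the current run contributes exactly one, so more than one
# suggests another playbook run is executing a task on this host.
- name: Look for other in-flight Ansible module executions
  shell: ps -eo args | grep -o 'ansible-tmp-[0-9][0-9.]*' | sort -u | wc -l
  register: ansible_run_count
  changed_when: false

- name: Fail if another Ansible run appears to be active on this host
  fail:
    msg: "Another Ansible run appears to be in progress on this host."
  when: ansible_run_count.stdout|int > 1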

@jugglinmike
Member

@cowboy what about touching a file in $HOME/.ansible?
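Something like this, maybe? (The deploy.lock path and task wording are just illustrative.)

# Refuse to start if a lock file is present, then take the lock.
- name: Check for an existing deploy lock file
  stat:
    path: "{{ ansible_env.HOME }}/.ansible/deploy.lock"
  register: deploy_lock

- name: Fail if another deploy appears to hold the lock
  fail:
    msg: "Lock file present; another deploy may already be running."
  when: deploy_lock.stat.exists

- name: Create the lock file for this run
  file:
    path: "{{ ansible_env.HOME }}/.ansible/deploy.lock"
    state: touch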

@jugglinmike
Member

D'oh — as @cowboy pointed out above:

Just checking for the existence of the temp dir won't be adequate, however, because it could be leftover if a deploy fails partway through.

@jugglinmike
Member

Maybe blocks would help with lock cleanup?

The always section runs no matter what previous error did or did not occur in the block and rescue sections.
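For example, a hypothetical shape of that, reusing the made-up lock path from above (this needs the block syntax from Ansible 2.0):

# Whatever runs inside the block, the always section removes the lock,
# so a failed task doesn't leave a stale lock behind. (A hard-killed
# process would still leave it, as noted below.)
- block:
    - name: Deploy steps that must not leave a stale lock go here
      command: /bin/true   # placeholder for the real deploy tasks
  always:
    - name: Remove the lock file even if a task above failed
      file:
        path: "{{ ansible_env.HOME }}/.ansible/deploy.lock"
        state: absent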

But I have no context for this stuff. I'm going to leave you all alone now!

@tkellen
Author

tkellen commented Sep 23, 2015

@jugglinmike That also wouldn't work in the case of someone killing the Ansible process midway.

I really think this is a problem for Ansible itself, not for userland to resolve (except in the case of @brianloveswords's work, where a series of commits could kick off a ton of deploys and we can't guarantee that every one will use whatever special hack we put in place).

For the purposes of this issue, I believe our problem would be resolved with the temp-<hash> folder name. Broadly speaking, it would be awesome if someone opened an issue over on the Ansible tracker. I feel sure they've encountered this question before and have some good reason for why things are the way they are.

@tkellen
Author

tkellen commented Sep 25, 2015

Just leaving a note to myself, or whoever implements this...

These lines:
https://github.com/bocoup/deployment-workflow/blob/master/deploy/ansible/roles/deploy/tasks/checkout.yml#L5-L6
...should not delete all temp directories wholesale; they should only remove ones whose creation date is older than, say, an hour. This will prevent one deploy from clobbering a concurrent deploy's folder, and will also clean up any directories left lingering by aborted deploys.
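For the record, here's roughly what that could look like with the find module (Ansible 2.0+); the /tmp path and deploy-* pattern are placeholders for whatever naming scheme ends up being used:

# Collect only temp directories older than an hour, then delete just
# those, leaving any in-flight deploy's directory untouched.
- name: Find leftover deploy temp directories older than an hour
  find:
    paths: /tmp
    patterns: "deploy-*"
    file_type: directory
    age: 1h
  register: stale_deploy_dirs

- name: Remove only the stale temp directories
  file:
    path: "{{ item.path }}"
    state: absent
  with_items: "{{ stale_deploy_dirs.files }}"

On pre-2.0 Ansible, the same effect could be had with a shell task running find with -mmin +60 over the temp directory pattern.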
