crash looping on GKE: wal-e pipe capacity issue with newer kernels #154

Open
tmc opened this issue Oct 22, 2016 · 11 comments

@tmc

tmc commented Oct 22, 2016

This is apparently due to wal-e/wal-e#270

example spew:

wal_e.retries WARNING  MSG: retrying after encountering exception
        DETAIL: Exception information dump:
        Traceback (most recent call last):
          File "/usr/local/lib/python2.7/dist-packages/wal_e/retries.py", line 62, in shim
            return f(*args, **kwargs)
          File "/usr/local/lib/python2.7/dist-packages/wal_e/worker/gs/gs_worker.py", line 76, in fetch_partition
            with get_download_pipeline(PIPE, PIPE, self.decrypt) as pl:
          File "/usr/local/lib/python2.7/dist-packages/wal_e/pipeline.py", line 92, in __enter__
            self.stdin = pipebuf.NonBlockBufferedWriter(stdin)
          File "/usr/local/lib/python2.7/dist-packages/wal_e/pipebuf.py", line 225, in __init__
            _setup_fd(self._fd)
          File "/usr/local/lib/python2.7/dist-packages/wal_e/pipebuf.py", line 62, in _setup_fd
            set_buf_size(fd)
          File "/usr/local/lib/python2.7/dist-packages/wal_e/pipebuf.py", line 53, in set_buf_size
            fcntl.fcntl(fd, fcntl.F_SETPIPE_SZ, OS_PIPE_SZ)
        IOError: [Errno 1] Operation not permitted

        HINT: A better error message should be written to handle this exception.  Please report this output and, if possible, the situation under which it arises.
        STRUCTURED: time=2016-10-22T20:14:41.882253-00 pid=221
tmc changed the title from "crash loop wal-e failing on GKE" to "crash looping on GKE: wal-e fcntl issue" on Oct 22, 2016
tmc changed the title from "crash looping on GKE: wal-e fcntl issue" to "crash looping on GKE: wal-e pipe capacity issue with newer kernels" on Oct 22, 2016
@bacongobbler
Member

Thanks for the report @tmc! What provider and k8s version did you use to deploy Workflow? That can help us nail down the issue so we can figure out a fix we could propose upstream and then bump wal-e to a release with that fix.

Did the fix in wal-e/wal-e#270 (comment) work for you?

@bacongobbler
Member

It might be a neat experiment to try a slightly older kernel version with kubernetes and see if this issue still persists.

@tmc
Author

tmc commented Oct 28, 2016

Yes, that fix worked. This is on GKE with k8s 1.4.

@tmc
Author

tmc commented Nov 23, 2016

This is still present on k8s 1.4.6 on GKE.

@bacongobbler
Member

bacongobbler commented Nov 23, 2016

Unfortunately there is nothing we can do on our end to fix this other than to use the provided workaround or to fix it in wal-e and bump the installed version. If you can provide a patch that fixes this issue for you, please make a PR upstream and we can bump wal-e forwards to the fix once it's merged.

I haven't seen this issue in the wild on GKE with k8s 1.4+ so I don't have a reliable test case (or even the slightest idea how this issue crops up) to test a fix against. Until then I cannot help you.

@fdr

fdr commented Jan 3, 2017

Hey all, WAL-E maintainer here.

I will accept a patch that either lowers the pipe size to a value that works with the default limits without tanking performance, or adds some adaptive code to deal with this new limit. I suspect the adaptive approaches may be more trouble than they're worth, but if someone can surprise me, that'd be great.
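For readers wondering what "adaptive code" could look like, here is a minimal sketch, not WAL-E's actual implementation. It assumes Linux and Python 3; the function name `set_pipe_size_adaptive` and the `wanted` and `floor` parameters are hypothetical. It asks for a large pipe buffer and, when the kernel answers EPERM as in the traceback above, halves the request until a size is accepted or the floor is reached.

```python
import errno
import fcntl
import os

# Linux-specific fcntl commands. The fcntl module only exposes them on
# Python 3.10+, so fall back to the raw values from <linux/fcntl.h>.
F_SETPIPE_SZ = getattr(fcntl, "F_SETPIPE_SZ", 1031)
F_GETPIPE_SZ = getattr(fcntl, "F_GETPIPE_SZ", 1032)


def set_pipe_size_adaptive(fd, wanted, floor=4096):
    """Try to resize the pipe behind fd, backing off on EPERM.

    `wanted` and `floor` are hypothetical parameters for this sketch.
    Returns the capacity the kernel reports after the attempts.
    """
    size = wanted
    while size >= floor:
        try:
            fcntl.fcntl(fd, F_SETPIPE_SZ, size)
            break  # the kernel accepted this size
        except OSError as e:
            if e.errno != errno.EPERM:
                raise  # anything other than "not permitted" is unexpected
            size //= 2  # ask for less and try again
    # Whether or not a request succeeded, report the real capacity.
    return fcntl.fcntl(fd, F_GETPIPE_SZ)


if __name__ == "__main__":
    r, w = os.pipe()
    print("pipe capacity:", set_pipe_size_adaptive(w, 1 << 20))
    os.close(r)
    os.close(w)
```

Whether the smaller buffer is acceptable for throughput is exactly the open question raised above; this sketch only avoids the crash, it does not measure performance.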

@lgastako

I'm on k8s 1.5.2 on GKE. When I try the workaround from wal-e/wal-e#270 (comment) I get:

root@deis-database-540367895-x6r6d:/# echo 0 > /proc/sys/fs/pipe-user-pages-soft
bash: /proc/sys/fs/pipe-user-pages-soft: Read-only file system

Am I doing something wrong?
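Before attempting the write, it can help to see what the container actually exposes. The following is a small diagnostic sketch, assuming a Linux /proc layout; the helper name `read_pipe_sysctls` is hypothetical. It reads the pipe-related limits under /proc/sys/fs and reports whether each file appears writable from inside the pod.

```python
import os

# Pipe-related sysctl files documented in pipe(7).
PIPE_SYSCTLS = ("pipe-max-size", "pipe-user-pages-soft", "pipe-user-pages-hard")


def read_pipe_sysctls(base="/proc/sys/fs"):
    """Return {name: {"value": int or None, "writable": bool}}."""
    report = {}
    for name in PIPE_SYSCTLS:
        path = os.path.join(base, name)
        try:
            with open(path) as f:
                value = int(f.read().strip())
        except (OSError, ValueError):
            value = None  # file missing (older kernel) or unreadable
        # os.access reports False on a read-only mount, even for root.
        report[name] = {"value": value, "writable": os.access(path, os.W_OK)}
    return report


if __name__ == "__main__":
    for name, info in read_pipe_sysctls().items():
        print(name, info)
```

If the files show up as not writable, the setting has to be changed from outside the container rather than with echo inside it.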

@lgastako

For anyone else who runs into the same problem: I solved it by downloading the Workflow chart, unpacking it, and editing database-deployment.yml to add the annotation security.alpha.kubernetes.io/sysctls: fs.pipe-user-pages-soft=0 to the Deployment.
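As a rough illustration of that edit, the sketch below adds the annotation to a manifest that has already been loaded into a Python dict. The comment above only says "to the Deployment"; placing it under the pod template's metadata (spec.template.metadata.annotations) is an assumption of this sketch, and the helper name `add_sysctl_annotation` and the stub manifest are hypothetical.

```python
import copy
import json

ANNOTATION_KEY = "security.alpha.kubernetes.io/sysctls"
ANNOTATION_VALUE = "fs.pipe-user-pages-soft=0"


def add_sysctl_annotation(deployment):
    """Return a copy of the manifest dict with the sysctl annotation set.

    Assumption: the annotation goes on the pod template, not on the
    top-level Deployment metadata.
    """
    patched = copy.deepcopy(deployment)
    template = patched.setdefault("spec", {}).setdefault("template", {})
    metadata = template.setdefault("metadata", {})
    metadata.setdefault("annotations", {})[ANNOTATION_KEY] = ANNOTATION_VALUE
    return patched


if __name__ == "__main__":
    # Hypothetical stub standing in for the contents of database-deployment.yml.
    stub = {
        "kind": "Deployment",
        "metadata": {"name": "deis-database"},
        "spec": {"template": {"metadata": {"labels": {"app": "deis-database"}}}},
    }
    print(json.dumps(add_sysctl_annotation(stub), indent=2))
```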

@tmc
Author

tmc commented May 10, 2017

I ran into this again on an upgrade.

@Bregor

Bregor commented May 10, 2017

security.alpha.kubernetes.io will not work on GKE, because all alpha features are disabled there.

@fdr

fdr commented May 10, 2017 via email
