failed to open /proc/0/status: No such file or directory #2467
If for any reason we fail to handle the "/proc/$pid/status" file, the process used to crash. With this patch, that error is handled gracefully, with a single group added as root by default.

Updates: gluster#2467
Change-Id: I897a8f954deecabc48598dce03806154c7c1d189
Signed-off-by: Amar Tumballi <[email protected]>
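For context, the lookup in question reads the Groups line of /proc/&lt;pid&gt;/status to pick up the caller's auxiliary GIDs. A minimal shell sketch of what that read returns, and how it fails for pid 0 (illustrative only, not the client's actual code path):

```sh
# the Groups: line is what gets parsed for auxiliary group IDs
grep '^Groups:' /proc/$$/status

# pid 0 never has a /proc entry, so the equivalent read fails:
cat /proc/0/status   # -> cat: /proc/0/status: No such file or directory
```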
Hello, I am having the same issue. Happy to provide any more information needed.
Not a kadalu user, but I started getting this intermittently after upgrading 8.4 -> 9.2. OS: Debian bullseye. This also results in the FUSE mount on the client going down. In my case it seems to be triggered by synchronous writes:
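The exact command did not survive in this copy of the thread; as a purely hypothetical stand-in, a per-block synchronous write of this shape is the kind of trigger being described (/mnt/glustervol is a made-up path):

```sh
# hypothetical example: per-block synchronous writes onto the gluster FUSE mount
dd if=/dev/zero of=/mnt/glustervol/testfile bs=4k count=256 oflag=sync
```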
Relevant line:
Without fully understanding what's going on, it seems it for some reason fails to get the root PID (getting 0 instead), which would then be used to get the process GIDs. It may be worth a try to add the mount option, though I suspect it will fail in a similar way, if the cause is that ...
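The option name is missing above; assuming it was resolve-gids (my guess, not confirmed by the thread), the suggestion would look like:

```sh
# assumption: resolve-gids makes the client resolve the caller's full group
# list itself rather than relying solely on the /proc/<pid>/status listing
mount -t glusterfs -o resolve-gids server1:/myvol /mnt/glustervol
```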
Some more information on this error/log. We noticed that this happens mostly in container ecosystems, especially when some operations are done with 'bind' mount parameters. With the added PR, the crash is not happening, but the logs are still coming, hinting that the issue is still present. Yet to debug completely.
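To check whether a client is hitting this, the warning shows up in the FUSE client's mount log. A hedged sketch (log file names follow the mount point, so the glob may need adjusting):

```sh
# client mount logs live under /var/log/glusterfs/, one file per mount point
grep 'failed to open /proc' /var/log/glusterfs/*.log
```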
Can confirm it's containers with bind mounts/Docker volumes in my case as well.
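A minimal sketch of the kind of setup being described, with hypothetical names throughout:

```sh
# hypothetical repro: bind-mount a directory that lives on a gluster FUSE
# mount into a container, then do synchronous writes from inside it
docker run --rm -v /mnt/glustervol/appdata:/data debian \
    dd if=/dev/zero of=/data/x bs=4k count=64 oflag=sync
```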
@csabahenk With a commit like amarts@181d41f I was able to figure out that the issue was happening in a READ call. Any possibility of getting a READ call from the kernel module with pid 0 when it's a bind mount?
* There was no clue on which operation caused the pid to be '0'.
* When the error happened without setting ngroups, it crashed the process.

Updates: gluster#2467
Change-Id: Ic3a4561f73947c4acfeef40028c3a6cf3975392e
Signed-off-by: Amar Tumballi <[email protected]>
Can you maybe release the quick win mentioned here, so that the client isn't crashing anymore?
* There was no clue on which operation caused the pid to be '0' - added the relevant op in the log.
* When the error happened without setting ngroups, it crashed the process.
* Looks like in container use cases, when the namespace pid is different, there are chances of fuse not getting the proper pid, hence it would be 0. Handled the crash, and treated it as the 'root' user.

Fixes: gluster#2467
Change-Id: Ic3a4561f73947c4acfeef40028c3a6cf3975392e
Signed-off-by: Amar Tumballi <[email protected]>
Thanks @mohit84 for pointing at the issue in libfuse. Looks like ...
* There was no clue on which operation caused the pid to be '0' - added the relevant op in the log.
* When the error happened without setting ngroups, it crashed the process.
* Looks like in container use cases, when the namespace pid is different, there are chances of fuse not getting the proper pid, hence it would be 0. Handled the crash, and treated it as the 'root' user.

Fixes: #2467
Change-Id: Ic3a4561f73947c4acfeef40028c3a6cf3975392e
Signed-off-by: Amar Tumballi <[email protected]>
(cherry picked from commit 387fcb0)
Signed-off-by: Shree Vatsa N <[email protected]>
Co-authored-by: Amar Tumballi <[email protected]>
Hey there - after a massive struggle for 2 weeks now, and searching all over, I've finally found this thread exactly describing my issue, also in the same context of bind mounts from containers. I see that there are some commits that have gone in to resolve this; what version of gluster do I need to be on for this to be fixed? I'm currently on 9.2, as included in the default repos of Ubuntu impish.
If it helps to track down the root cause, I can pretty much cause it on demand with my setup. If diagnosis/logs/dumps are fairly trivial to get and would help you with root-cause diagnosis, I'm happy to. Preferably, though, I'd get this system stable again ASAP. I'm assuming my best option is to downgrade to 9.0 (where this issue either never happened, or happened only once or so a week)?
9.5 will be available by the second week of Jan. The delay is due to the year-end holidays.
Is there a way I can avoid, work around, or patch this issue without having to wait for 9.5? I've had to drop all but one container from my cluster as a result, and I don't want to temporarily rearchitect around another storage solution.
I would be interested in hearing if there is an interim workaround or anything, as I have just migrated from an NFS share to glusterfs for my Docker Swarm, and am seeing this issue.
@Shwetha-Acharya As per my understanding, a release has been done with this fix. Right?
Hi @pranithk, we have not yet officially announced the release, as we are handling some issues in the CentOS Stream releases. The rest of the packages are built and available. I hope to announce the release as quickly as possible.
Hi there, thank you for the quick reply on this, it is very much appreciated. Does this mean the new packages are available to download and apply manually?
I recommend waiting until the official announcement, which can happen in a day or two.
I will do that then. Again, thank you for all that you put into the project and for replying quickly.
I can't see any notes on this issue in the 10.1 release docs. Is this definitely fixed in 10.1?
It's fixed as part of 9.5 and can be expected in the next minor release of gluster 10.
@dfoxg I was verifying the commits that went into the gluster 10 releases for this issue:
@Shwetha-Acharya Okay, thank you!
In trying to find the official announcements for versions, I came across the Roadmap page on the website, which seems a little out of date. Shall I raise a separate issue for that? I did find the release notes for 9.5; I assume this means it has been released to all package repos and should be as safe to upgrade to as an upgrade can ever be? Has anyone else watching/commenting on this thread tried it with containers/Docker and found their mounts are no longer crashing?
9.5 isn't being offered as an upgrade package on Ubuntu 21.10 - has it been published to major OS repos?
The Ubuntu 21.10 (impish) package was successfully built: https://launchpad.net/~gluster/+archive/ubuntu/glusterfs-9/+packages and was uploaded to Ubuntu Launchpad. From which version are you trying to upgrade, and what error message/code are you seeing?
Right, so it's only on Launchpad, not in the standard Ubuntu repos? I've added the glusterfs-9 source on Launchpad now, so I'll just use that moving forward. I had originally installed Gluster via the standard Ubuntu sources for 21.10/impish, so I was somewhat expecting updates to be published there too.
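For reference, using that Launchpad PPA looks roughly like the following (the PPA id is inferred from the link above; verify it there):

```sh
# add the gluster 9.x PPA and pull the updated packages from it
sudo add-apt-repository ppa:gluster/glusterfs-9
sudo apt update
sudo apt install --only-upgrade glusterfs-client glusterfs-server
```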
I too am not seeing 9.5 released to bullseye or bullseye-backports... should we be seeing it by now?
There seems to be some confusion about the Debian repos. On the one hand, there are the official Debian repos (which are managed by Debian maintainers, not glusterfs). These can be tracked, and the maintainer located, here: https://tracker.debian.org/pkg/glusterfs

Apart from that, glusterfs hosts its own Debian repos, and 9.5 is available there. On bullseye you can start using the glusterfs repos by adding the appropriate sources entry. Ubuntu users can also use the Launchpad PPA linked above.

IIRC there are some minor discrepancies in systemd service names between the two, so pay attention if migrating an existing installation, and don't attempt to mix-and-match between them.
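A sketch of what that can look like on bullseye, assuming the usual download.gluster.org layout (verify the exact paths and signing key there before relying on this):

```sh
# assumed download.gluster.org layout for the glusterfs 9.x Debian repo
wget -O - https://download.gluster.org/pub/gluster/glusterfs/9/rsa.pub | sudo apt-key add -
echo 'deb [arch=amd64] https://download.gluster.org/pub/gluster/glusterfs/9/LATEST/Debian/bullseye/amd64/apt bullseye main' \
  | sudo tee /etc/apt/sources.list.d/gluster.list
sudo apt update
sudo apt install --only-upgrade glusterfs-server glusterfs-client
```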
Ah, I am looking for armhf. Do you have the source on the gluster repo (deb-src), so that I could pull and build with apt? At the moment I have pulled and built the release-9 branch from git for the armhf machines, and for the arm64s I have built the release-10 branch.
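For anyone else building from the release branches, it is the standard autotools flow; a rough sketch (build dependencies omitted, see the developer docs):

```sh
# build glusterfs from a release branch using the standard autotools flow
git clone https://github.com/gluster/glusterfs.git
cd glusterfs
git checkout release-9
./autogen.sh
./configure
make -j"$(nproc)"
sudo make install
```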
I'm not sure, actually; I would expect this to be it, but it's still at 9.4: https://github.com/gluster/glusterfs-debian/tree/bullseye-glusterfs-9
Got this problem using 9.2 on a Bullseye arm64 Docker Swarm cluster. The workaround that works for me now:
Looks like this is now updated to 9.5 GA... but it's still 9.4 in the deb.debian.org bullseye-backports repo.
Finally updated my nodes to Gluster 9.5; unfortunately I'm still getting the exact same problem as before: soon after a sqlite database hosted on the gluster FUSE mount is accessed by a multi-threaded container (e.g., a web server), the FUSE mount crashes. Unmounting and remounting works temporarily before the FUSE mount crashes again. The gluster server nodes are still online, and other gluster clients on other nodes connected to the same volume do not crash, provided they are not running one of these multi-threaded sqlite containers. I really don't understand this at all. My understanding is that sqlite should work without any issues over the native gluster FUSE mount, but 9.2 and 9.5 have both had this issue. Which logs can I look at to get more detail?
@webash Do you still get the reported /proc/0/status error? Do you have WAL enabled on the sqlite db/process? If so, that is known to cause issues on glusterfs, though breaking the FUSE mount does sound like a glusterfs issue even when using WAL.
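A quick way to check, with a hypothetical database path:

```sh
# prints "wal" if write-ahead logging is enabled on the database
sqlite3 /mnt/glustervol/app.db 'PRAGMA journal_mode;'

# switch back to the default rollback journal (stop the app first)
sqlite3 /mnt/glustervol/app.db 'PRAGMA journal_mode=DELETE;'
```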
You're right, @3nprob, I can't seem to find the error. There are two databases that trigger this, and both of them appear to have WAL enabled. What's extremely bizarre is that I was running one of them on a gluster volume without any issue, until something changed around the time I upgraded to 9.2 from 9.1, I believe. I would've expected that gluster's architecture meant sqlite wouldn't experience the same issues as it does over a purely network-based filesystem (e.g., NFS), due to the local element. I'm happy to spin out another issue to explore this if you think it's worth anyone's time; otherwise I might just rearchitect around a different storage solution. Despite all the articles out there recommending gluster for Docker Swarm, every issue I've had since implementing a Swarm has been traced to the storage.
@webash I had similar experiences to what you described. But since I moved from Docker Swarm to k3s, most of the errors are gone; maybe that is also a solution for you.
@dfoxg So that suggests it's some kind of issue between the way Docker mounts the volume and gluster's FUSE? Converting all my infrastructure over to k3s just because of an issue with clustered storage is painful :(
I'd be really surprised if that were a solution; if that is indeed the case, it would be really helpful to get a repro.
I am still seeing some issues with glusterfs/Docker Swarm, although it is more stable now. I am recompiling on armv7 when there are updates available in the release-9 branch. Mostly the errors I see are with MongoDB (UniFi controller) writing to gluster.
Description of problem:
Some of our users are seeing logs like the above (container use case), which result in a crash of the glusterfs process. Examples:
kadalu/kadalu#540 kadalu/kadalu#468
It may be possible that the issue is with the setup, but the crash shouldn't happen regardless.
The exact command to reproduce the issue:
Not clear right now. Only 2 users out of 200+ have reported this.
The full output of the command that failed:
Expected results:
Mandatory info:
- The output of the `gluster volume info` command:
- The output of the `gluster volume status` command:
- The output of the `gluster volume heal` command:
- Provide logs present at the following locations on client and server nodes: /var/log/glusterfs/
- Is there any crash? Provide the backtrace and coredump.
Additional info:
- The operating system / glusterfs version: Kubernetes deployment; glusterfs version is series_1 (a few patches on top of the glusterfs devel branch).
Note: Please hide any confidential data which you don't want to share in public, such as IP addresses, file names, hostnames, or any other configuration.