soft lockups and NULL pointer dereferences after upgrade to ubuntu 12.04 #837
We just had another dereference when about 200 processes were accessing ZFS concurrently (this is our normal load). Load jumped quickly from ~20 (normal) to 140 and the system froze (we have those symptoms every time we experience a dereference).
Today around 12pm there were 14 soft lockups and this error showed up in the syslog:
And there's a log entry longer than usual:
As this seems to be a freeing issue, we're trying to limit the ARC size to 20GB now.
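For anyone landing here with the same symptom: the ARC cap is set via the `zfs_arc_max` module parameter. A minimal sketch, assuming a stock zfsonlinux install (the file name `zfs.conf` is just a convention; 21474836480 is 20 GiB in bytes, matching the figure above):

```shell
# Hedged sketch: cap the ARC at 20 GiB (20 * 1024^3 = 21474836480 bytes).
# zfs_arc_max is the standard zfsonlinux module parameter; the
# modprobe.d file makes the setting persist across module reloads.
echo "options zfs zfs_arc_max=21474836480" | sudo tee /etc/modprobe.d/zfs.conf

# On a running system the value can also be changed without a reboot:
echo 21474836480 | sudo tee /sys/module/zfs/parameters/zfs_arc_max
```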
One other thing came to my mind: when we upgraded to Ubuntu 12.04 we copied the root partition from a reiserfs partition to a new xfs root partition. As I look through the other issues, ZFS and xfs seem to have big trouble working together...
Today we had to restart the server twice, the dmesg output:
Btw: after seeing lots of problems with xfs and ZFS on the same system in the issues, we moved the root fs from xfs to ext4 the day before. What can we do to help fix this?
Is there any chance that you could post the zfs.ko kernel module somewhere online with an accompanying panic message? That would let me disassemble it to get a better idea of where the NULL pointer dereference occurs.
ryao, I've just sent you an e-mail with a panic message and the compiled module.
Several hours after the NULL pointer dereference:
Issues are fixed now (the system has 8 days of uptime) after we downgraded to zfs 0.6.0.56 and spl 0.6.0.56.
I suspect commit 302f753 might be responsible for your issue. This change was introduced in 0.6.0.62; if you get a chance, could you try 0.6.0.61?
We just installed 0.6.0.61 on our server. If there is any trouble we'll let you know.
There we go … some hours after we installed version 0.6.0.61:
That's surprising... the changes between 0.6.0.56 and 0.6.0.61 are for the most part pretty safe. None of them look like they could cause this. Is there any chance you could bisect the remaining couple of tags and determine which one introduced the issue?

$ git log upstream/0.6.0.56..upstream/0.6.0.61 --oneline
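For anyone unfamiliar with the workflow being requested here, bisection between two tags can be scripted with `git bisect run`. A toy sketch below; the repository, tag names and the "bug" are all synthetic, since with the real tree the test step would be "rebuild zfs/spl, reboot, try to reproduce the lockup" rather than a one-line grep:

```shell
# Toy sketch of bisecting between two tags. Everything here is
# synthetic: a throwaway repo where one commit flips a file from
# "ok" to "broken", mimicking a regression between releases.
set -e
cd "$(mktemp -d)"
git init -q repo && cd repo
git config user.email you@example.com && git config user.name you
echo ok > state && git add state && git commit -qm "0.6.0.56" && git tag 0.6.0.56
git commit -q --allow-empty -m "0.6.0.57"
git commit -q --allow-empty -m "0.6.0.58"
echo broken > state && git commit -qam "0.6.0.59"   # the offending commit
git commit -q --allow-empty -m "0.6.0.61" && git tag 0.6.0.61
git bisect start 0.6.0.61 0.6.0.56 >/dev/null
# the run script's exit code decides: 0 = good build, non-zero = bad build
git bisect run sh -c 'grep -q ok state' 2>/dev/null | grep "first bad commit"
```

`git bisect run` walks the history automatically and prints which commit "is the first bad commit"; in the real case each step would be a module rebuild and a stress test instead of the grep.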
Short summary: In March 2012 we switched to ZFS 0.6.0.56 on Ubuntu 10.04; no problems. In July 2012 we upgraded Ubuntu to 12.04. After the upgrade the first problems with ZFS occurred, so we thought it might be the version and tried different versions. After we tried version 0.6.0.56 we had an uptime of ~8 days, but then it failed again … We experienced high load average after a null pointer dereference, and after a reboot a null pointer dereference immediately showed up again. Finally we compiled and installed version 0.6.0.61, but still no solution.
I'm wondering if this might be the same issue I'm facing, which I previously thought had to do with me mounting server shares on my client computer. But the last couple of lockups haven't happened with shares mounted; they have happened while numerous threads were having fun with the zfs pool. This last one happened while testing 160+ GB files' md5sums while par2'ing a 160+ GB tar archive while watching Avengers in XBMC. Usually I'm left with a completely unresponsive system, or at least one I cannot get to via SSH or a TTY. But this time I actually have SSH access; any process that attempts to access my pool will hang indefinitely, though. Syslog mentions this "Ooops", which interestingly enough mentions arc_adapt:
I think I'll try that 0.6.0.56 version you're talking about (if possible through the PPA), or perhaps some ARC fiddling.
Just a thought... Could this simply be ZFS running out of memory? This isn't exactly enterprise-grade hardware I'm running here, and the memory isn't that impressive. I have 8GB of RAM, a RAIDZ3 zpool of 19 2TB disks (plus a 60GB SSD cache), a max ARC size of 1024 MB, and a 4 GB swap partition. As this guy says - https://groups.google.com/a/zfsonlinux.org/forum/?fromgroups=#!topic/zfs-discuss/8dDfYK1p1oc - that doesn't necessarily mean that it won't use up more memory, and adding to that I'm md5summing like hell (on mostly smaller files) while par2'ing a 160+ GB file. I have no idea how much RAM par2 was allocating for that, but probably a significant chunk. All memory has now been released from the hanging md5sum and par2 processes.
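To check whether the ARC is actually honoring a cap like that, its current size can be read from the `arcstats` kstat. A minimal sketch; on a live system the file is `/proc/spl/kstat/zfs/arcstats`, and the sample excerpt below is fabricated so the snippet runs anywhere, even without ZFS loaded:

```shell
# Sketch: reading the current ARC size from arcstats. On a live system
# read /proc/spl/kstat/zfs/arcstats directly; this fabricated excerpt
# just mimics its three-column name/type/data layout.
cat > arcstats.sample <<'EOF'
name                            type data
size                            4    1073741824
c_max                           4    1073741824
EOF
awk '$1 == "size" {printf "ARC size: %d MiB\n", $3 / 1048576}' arcstats.sample
# -> ARC size: 1024 MiB
```

Here `size` is the ARC's current footprint and `c_max` its configured ceiling; if `size` regularly sits at `c_max` while the rest of the workload is memory-hungry, an out-of-memory explanation becomes more plausible.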
Hm. Actually, neither of them is using any measurable memory at all :/ Thought I'd reboot and retry the whole thing while watching top. Next thing will be to try to get that .56 version installed then :)
We just got another null pointer dereference, but the stack trace seems to be new for me:
I'd like to note that my issue stemmed from problems with Intel SpeedStep vs. a rather low-quality consumer motherboard. I no longer have this issue after disabling SpeedStep. So please disregard my messages, as the problems I experienced were not related to ZFS. Sorry for any inconvenience and thank you so much for native ZFS :) |
@phillipp No promises this will fix your issue, but a very small race was recently closed in the spl condition variable code. Since this is where your NULL dereference is, you may want to pick up the fix and see if you're still able to reproduce the problem. See commit openzfs/spl@3c60f50
We moved all data off the ZFS pool and back to reiserfs because of the problems...
Totally understandable. |
@phillipp It might be too late to ask, but if that kernel is still around, would you load the raw vmlinux into gdb and run
Yep, sorry, it's too late...
Since we've lost our test case and there have been a lot of zfs fixes since this was last updated I'm going to close the issue. We can reopen it or file a new issue if similar symptoms are observed with the latest code. |
On our largest system, we now have lots of soft lockups and null pointer dereferences (sometimes more than two per day), which makes it VERY hard to maintain uptime greater than 95%.
We noticed that the problems started just after upgrading to Ubuntu 12.04. Before that, we had no problems. You can see traces of the soft lockups and the null pointer dereferences (as in ticket #805) below. I wrote a ticket for the NULL pointer dereferences already, but this seems to be related to the Ubuntu upgrade (10.10 => 12.04) we did before.
The system has 64GB of RAM. Maybe freeing memory is an issue with that amount?
zpool, zfs, spl+zfs version (and one error), uname
dmesg:
Syslog: