NULL deref in balance_pgdat() #287
One thing that's missing in the dmesg output is an allocation failure that the user must have had shortly before that oops (SLUB: Unable to allocate memory on node -1 (gfp=0x20)). This is just theoretical so far (I only encountered this once while copying some large files between two datasets), but the only way current->reclaim_state could end up being NULL there seems to be the following call chain:

kswapd -> balance_pgdat -> shrink_slab -> do_shrinker_shrink -> spl_kmem_cache_generic_shrinker -> spl_kmem_cache_reap_now -> spl_slab_reclaim -> some dtor func -> ??? (alloc func) -> alloc_pages_node / alloc_pages_exact_node -> __alloc_pages -> __alloc_pages_nodemask -> _alloc_pages??? -> get_page_from_freelist -> zone_reclaim -> __zone_reclaim (sets current->reclaim_state to NULL)

Adding a dump_stack() call right before the zone_reclaim() call in get_page_from_freelist() might make this easier to debug (it requires recompiling the kernel, though). |
Another way this could have happened is if get_current() returned a valid but incorrect address. In balance_pgdat the only user of 'current' is the reclaim_state, so if that occurred, this is where you would expect to see the NULL deref. I know that sounds like a stretch, but after digging into issue #279 a bit I'm beginning to wonder if we are hitting some sort of issue with the percpu counters. Both get_current() and get_next_timer_interrupt() (for x86_64) are built on top of percpu counters, and both of these issues could be explained if percpu_read_stable() were somehow returning a bogus address. I don't have any evidence of this yet, though, just a hunch. |
This one was pretty tough to reproduce. Here's what finally worked:
I've added this in page_alloc.c:__alloc_pages_direct_reclaim (after try_to_free_pages()):
and in vmscan.c:balance_pgdat (after shrink_slab()):
Here's the debug output I got from that:
|
Hit this bug again. Similar workload on the server.
|
I'm having the same problem while loading files from a ZFS filesystem to an OpenAFS volume also on ZFS.
|
I was able to recreate this issue once under RHEL6.1 with xfstests test 224 which completely fills the filesystem as a stress/correctness test. |
Got this while rsyncing from a 4-disk RAIDZ volume to an ext4 volume; around 200 GB (around 8000 files) had been transferred before my system locked up, with a 1-minute server load average of 32 at the time. The server is a 4-core Xeon X3430 with 8 GB RAM.

Sep 5 16:12:43 server_name kernel: [394279.482080] BUG: unable to handle kernel NULL pointer dereference at (null) |
Same problem here: [17978.602553] BUG: unable to handle kernel NULL pointer dereference at (null) Max arc size set to 256 MB. Happens when I am using more than ten Chrome tabs and memory use is high. |
I'm hitting this bug as well. Hardware/software: Atom 330, 2 GB RAM, Ubuntu 11.10 server 32-bit, but with the Debian 64-bit-kernel-on-32-bit-userland kernel (3.0.0-2-amd64). ZFS modules 0.6.0-rc6 built from Ubuntu's PPA (0.6.0.34-0ubuntu1~oneiric1). zpool config:
Operation: rsync from an ext4 partition on 1 drive (sdc) to a zfs mountpoint, source and destination are local.
|
I got hit on Debian Unstable, standard Debian
I got this after ~1 day of uptime (24.66 hours, according to the log). No particular workload, the pool must have been mostly idle when this happened. From past experiences I get the feeling that the crash happens after some fixed amount of work, meaning it will crash quickly if you stress the pool continuously, but survive longer if you're not using it much. |
Just encountered this as well with 0.6.0-rc6:
It was scrubbing a zpool and running a Windows 7 VMWare VM (from the same zpool) at the time. |
Ok, I have some more info. Inspired by the work of gunnarbeutner I noticed in
What's interesting is that the kernel doesn't check if

So I added the following debug statements before the aforementioned code:
Then I waited. Several hours later, the debug message appears in the log and immediately after, the NULL dereference occurs. What's very interesting is that, contrary to gunnarbeutner, I have ZFS functions in the stack trace!
|
In
This seems targeted at fixing this kind of issue… except it doesn't: we see |
Just to check, I added this to
Results are clear, right from the boot:
Well, that explains it. |
As shown in my stack trace above, in the path leading up to the direct reclaim, this is called as part of
In light of this, I'm thinking that maybe disabling the atime feature might be a workaround for this issue. This is assuming no other code path triggers allocation within a reclaim, which is a bit of a stretch, but it's worth giving it a try so that impacted people get a little more air. So, assuming I'm right, this workaround should stop the
I would be glad if some of you could test it and report back. |
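For anyone wanting to try the workaround while waiting for a proper fix: atime can be turned off per pool or per dataset with the standard property commands (the pool name `tank` is a placeholder):

```shell
# Disable access-time updates on the pool; child datasets inherit it.
zfs set atime=off tank

# Confirm the property took effect everywhere.
zfs get -r atime tank
```

Disabling atime only avoids the most common trigger identified above; it does not fix the underlying reclaim-recursion bug.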
Nice digging work! |
Nice legwork on this. I think you've clearly identified that the issue is we are trashing current->reclaim_state. Exactly how this occurs is still a little less obvious to me, but I'll carefully walk the stack provided to see if it sheds some light on things. In the meantime, let me comment on your above observations:
This change eec8164 was made to explicitly prevent a deadlock in the txg_sync_thread. This thread will hold the txg_sync lock while flushing dirty data to disk; if it were to ever attempt to reacquire this lock it would deadlock on itself. In fact, because the Linux VFS layer differs significantly from the Solaris VFS, this is entirely possible via the direct memory reclaim path described. Any memory allocation could potentially cause this; to avoid the issue entirely, direct reclaim is disabled for the txg_sync_thread.

Your stack above is able to call zfs_inactive()->...->dmu_tx_assign() because it is being done in the context of the kswapd thread. This should be safe because kswapd won't be holding the txg_sync lock; however, it does appear that somehow in the call stack current->reclaim_state was damaged.

Your second observation, about PF_MEMALLOC somehow getting cleared in txg_sync_thread(), is troubling for two reasons. First off, this should be impossible, since __alloc_pages_slowpath() checks for this pretty early on and aborts reclaim if it's set. Perhaps there is another call path where it can be cleared?

```c
/* Avoid recursion of direct reclaim */
if (current->flags & PF_MEMALLOC)
    goto nopage;
```

And secondly, because prior to this fix it was quite easy to deadlock the txg_sync_thread. There's every indication that it is actually working as designed; are you absolutely sure you got this check correct?

Anyway, I think you've still hit on the heart of the issue. We need to determine exactly where things go wrong with the current->reclaim_state management. |
It seems pretty clear to me. Let me summarize my findings:
Conclusion: if you trigger a direct reclaim while you're already reclaiming (asynchronously) in

As a personal note, it boggles my mind that

Moving on. There seems to be one predominant path where ZFS has a significant probability of entering direct reclaim in
Although atime updates are probably not the only case where ZFS allocates memory inside

The random nature of this issue is a result of the fact that the system only crashes if a direct reclaim is triggered by ZFS inside

The fact that

It's still too soon to tell, but it seems my system doesn't suffer from this issue anymore after disabling atime on my pool. If nothing happens in the next 48 hours then I'll take it as a confirmation of my diagnosis. In the meantime, I added some
IMO, the moral of the story is: don't try to allocate memory in a thread which has
There are a few places in the kernel where
Well, I don't know how I could have got it wrong, since I copy-pasted the code inside the main loop of |
Be careful not to unconditionally clear the PF_MEMALLOC bit in the task structure. It may have already been set when entering zpl_putpage() in which case it must remain set on exit. In particular the kswapd thread will have PF_MEMALLOC set in order to prevent it from entering direct reclaim. By clearing it we allow the following NULL deref to potentially occur. BUG: unable to handle kernel NULL pointer dereference at (null) IP: [<ffffffff8109c7ab>] balance_pgdat+0x25b/0x4ff Signed-off-by: Brian Behlendorf <[email protected]> Issue openzfs#287
Right, I completely agree. However, what you describe should be impossible. By design, kswapd runs with PF_MEMALLOC set:

```c
static int kswapd(void *p)
{
    ...
    tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
```

This flag is then checked early in __alloc_pages_slowpath(). For this to occur, something has to be mistakenly clearing the PF_MEMALLOC bit in the task struct. As it turns out, a quick look at the spl and zfs code turns up two places where PF_MEMALLOC is set and then cleared unconditionally to avoid a deadlock. In both cases it's possible that PF_MEMALLOC was already set when entering the function, in which case we would have just mistakenly cleared it and armed the time bomb. Incidentally, this also explains the other odd symptom you reported of PF_MEMALLOC not being set in

Can you please apply the following two patches and verify that they resolve the issue? If someone has a good test case for this, or just a workload/system which seems to hit this a lot, I'd love the extra testing before merging it to master. spl patch: behlendorf/spl@940d49e |
Oh, and I absolutely agree. A BUG_ON() would be a very good thing right there. |
Oh, right… the kernel folks prevented this. Never underestimate the Power of the Kernel ;) One thing worries me, though: isn't
You're absolutely right. Everything makes sense now :)
I'm a little busy right now, I'll try them as soon as I can. Sounds very promising though. |
At the moment the kmem_alloc() implementation in the spl always retries on failure... forever. I'm not proud of this, but we really have no choice. As you say, kmem_alloc() must succeed, period. The zfs code completely depends on this behavior and thus contains zero error handling for it. In contrast, failure of Linux's native kmalloc() is a downright common event and needs to be handled. As for the exact case you were worried about, we should be OK. The kmem_alloc() call run under the kswapd thread may fail repeatedly and retry, but there are other threads running on the system which are allowed to perform direct reclaim. One of them will free the needed memory, and then the kmem_alloc() under kswapd will succeed and move on. This should also be a very unlikely event.
No problem, I'll let it soak on a few test systems here but the more testing the better. This is actually a pretty safe change regardless. |
One quick nontechnical note: It's simply exhilarating to see how you guys are cornering this bug with nail bats. I am privileged enough to have found and be part of this conversation, and I am enjoying every second of this. If any of you guys stop by the bay area, at any point, drinks are on me. |
I wasn't hit by the bug in the last few days with atime disabled. Nice!
And bam, kernel NULL deref. |
Nice job. The PF_MEMALLOC bit is something I missed during my initial investigation. :) There are at least half a dozen open tickets which depend on this fix. |
Things sound very good so far. I've applied the patch and added one printk in each of the two new if-PF_MEMALLOC-already-set blocks, and:
We really had a lot of occasions to create this deadlock before... Now I'm running the patch without the printks, and the above loop has been running for ~1 hour. No kernel bug so far. I'll let it run for several days, every 10 minutes, to be extra sure that nothing crashes horribly under different cpu/mem/etc. conditions (a.k.a. normal usage). But that would be strange given the changeset. Kudos, guys. |
I'm glad to announce that since I applied Brian's patches, my " I just re-enabled atime, let's see how it goes. |
Imma test drive your branch fix now. If you don't hear from me, either I am dead or all went well. |
I applied Brian's patch yesterday, re-enabled atime, and left a zfs scrub (on a 1.4 TB pool) and Windows VM running all night... no problems yet. |
No problems yet here either. |
Since I'm hearing nothing but good things I'll be merging it in shortly. |
Fixed! Thank you everyone for the help with this. I'm closing the issue now that the patches have been merged. |
Be careful not to unconditionally clear the PF_MEMALLOC bit in the task structure. It may have already been set when entering kv_alloc() in which case it must remain set on exit. In particular the kswapd thread will have PF_MEMALLOC set in order to prevent it from entering direct reclaim. By clearing it we allow the following NULL deref to potentially occur. BUG: unable to handle kernel NULL pointer dereference at (null) IP: [<ffffffff8109c7ab>] balance_pgdat+0x25b/0x4ff Signed-off-by: Brian Behlendorf <[email protected]> Closes ZFS issue #287
Sometimes they come back? I hit what I believe is a similar issue at the same place today:
No allocation failures or the like. Now, this is running 0.7.1 (I will update to the latest soon), but was anything like that fixed recently? I have a crashdump if needed. |
@verygreen we haven't fixed anything like this recently. Although, we have been able to occasionally reproduce something which might be related using |
Reported on the zfs-discuss list by Tommy Cheng.
I am running a VM that is doing unrar via NFS to the host computer's zfs pool (both the rar files and destination are on the zfs pool). Got this while unpacking 4+ GB of data.