ZFS on Linux null pointer dereference #11679
Comments
I am experiencing the same or similar issue. A hang-up during zfs send/receive of encrypted filesystem. My system is a standard proxmox install 6.3 Os: Debian Buster 10
|
Similar hung system, on the same distribution. The problem has been around ever since reinstalling the system and enabling encryption with ZFS 0.8.5, and persists with 5.4.114-1-pve / 2.0.4-pve1 dmesg
My problem is triggered by receiving snapshots into an encrypted dataset; I am not sending data anywhere from this node. This machine is a consumer of snapshots for multiple machines (all running different versions of ZFS). The commands being run are something along these lines: I have tried checking /proc/spl/kstat/zfs/dbgmsg (zfs_flags = 256; turning it higher produces more data than I can manage), but nothing interesting showed up. Perhaps someone could tell us how to diagnose it further? On a side note, it seems all three of us are using relatively old architectures, from the same era:
|
Same problem here, on NixOS, Linux 5.4.114, ZFS 2.0.4, also (probably) when receiving into an encrypted dataset. Kernel log: https://gist.github.com/lheckemann/1c11ab6865c44a192c9d2597d17ba72b This is the second time these symptoms have shown up, though I didn't think to get logs the first time around. I'm on a Celeron G3900 (Skylake, Q4'15). EDIT: I'm receiving non-raw sends. |
I'm not sure if this is related to the age of the architecture. But add another of my systems: Intel(R) Xeon(R) CPU E3-1231 running on ASUS P9D-M board. This is relatively "new" when compared to the others. Incidentally, I also have a X8SIL running an X3440 cpu and 32GB ram. That has not experienced the issue (yet). All these systems are, as you define them, consumers for ZFS snapshots (i.e. backup servers receiving the snapshots from production). |
Today I had a corruption on the receiving system.
But I'm not sure what that file is. |
Could you share the output of these two commands? I have a vague suspicion that all the CPUs mentioned here don't believe in aesni, and am wondering if there's something awry in the codepath they're taking. (Note that I am not remotely an expert on this code, just a random person musing about what could be going so awry for you and yet not causing much more screaming by encryption users.) (If you'd like to more aggressively experiment with this theory, I suspect the speed order for this goes aesni > x86_64 > generic, so if you are indeed using x86_64, you could try using the "generic" implementation and see if the problem still reproduces.) |
I have switched both parameters to generic now. Will report back as things advance. |
One further thought - I don't see a way to validate which implementation it was currently using on "fastest", so it's also possible you were already using generic for one of both of those. So if it still NULL derefs, you could try explicitly setting x86_64 and pclmulqdq. (I thought of this because if #12106 is indeed related to this, it definitely can't be using x86_64, so...) |
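A minimal sketch for checking the two things discussed above: whether the CPU advertises the relevant flags, and which implementation the ICP module parameters currently select. The parameter directories and the bracketed "current selection" format (for example `cycle [fastest] generic x86_64 aesni`) are assumptions based on how these tunables are usually exposed on Linux; the helper names are made up for illustration.

```python
from pathlib import Path

# Assumed locations: a separate icp module on some releases, the zfs module on others.
PARAM_DIRS = [Path("/sys/module/icp/parameters"), Path("/sys/module/zfs/parameters")]


def parse_impl(text):
    """Split a parameter value into (selected, options).

    The selected implementation is assumed to be the token in square brackets.
    """
    selected = None
    options = []
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            token = token[1:-1]
            selected = token
        options.append(token)
    return selected, options


def cpu_flags(cpuinfo_text):
    """Return the set of flags from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags") and ":" in line:
            return set(line.split(":", 1)[1].split())
    return set()


def report():
    """Print CPU flag support and the currently selected ICP implementations."""
    cpuinfo = Path("/proc/cpuinfo")
    flags = cpu_flags(cpuinfo.read_text()) if cpuinfo.exists() else set()
    for flag in ("aes", "pclmulqdq", "avx"):
        print(f"cpu flag {flag}: {'yes' if flag in flags else 'no'}")
    for name in ("icp_aes_impl", "icp_gcm_impl"):
        for directory in PARAM_DIRS:
            path = directory / name
            if path.exists():
                selected, options = parse_impl(path.read_text())
                print(f"{name}: selected={selected} options={options}")
                break
        else:
            print(f"{name}: parameter file not found")


if __name__ == "__main__":
    report()
```

Note that a selection of "fastest" only says that the module picks for itself; as mentioned above, it does not reveal which concrete implementation ends up being used.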
Same output here, on all CPU I use (aes-ni aware and in use not only for zfs but for many other things).
Are we suggesting that the aes-ni implementation in the kernel is "buggy"? That is most certainly not the case. Not sure if I pointed this out, but, in my case, I'm using different encryption keys on the sending and on the receiving side. Also, some of my senders are not encrypted, but the receiver is (via a parent fs which is encrypted). |
No, I was suggesting precisely the opposite - that the reason people here might be seeing an issue and others aren't is the absence of use of aesni. edit: Just for reference, on my (Broadwell) Xeon:
So I was hypothesizing the absence of use of one or both of those could be related to this, and was more suspicious of aesni than avx. |
Understood. I'm going to turn explicitly on aesni and avx on one of the servers now. Just to test the opposite case of generic.
|
AFAICS from quickly skimming the code, there's no equivalent of vdev_raidz_bench for the crypto bits. Maybe I'll see how hard implementing one would be after this. |
Just to be clear I'm trying to reproduce the right thing, could somebody share which values of "encryption/keyformat" they're using on the dataset that things are being received under? (And of course, if they turn out to all be the same/different, that's another data point...) |
My consumer is:
(After changing icp_[aes|gcm]_impl to "generic" a couple of hours ago, the next backup iteration caused the same null pointer/stack trace. After restarting and re-applying the "generic" settings, has yet crashed again) |
Well, unless generic was mysteriously calculated to be faster before and you were always using that, there goes that theory. Rats. edit: Well, I guess it's technically not impossible for both the non-aesni/avx implementations to cause similar misbehavior, but I'm going to assume for now that it's not specific to those...somehow. |
@rincebrain The fastest implementation is statically selected according to the available hardware. See here and here. To summarize: GCM: I'm still working on supporting more instruction set extensions for GCM and am thinking of adding a bench as part of that work. Currently the static selection should always select the fastest one. Further it's quite unlikely that the implementations differ in behavior since |
Yeah, I'd buy that, the only reason I proposed the selection as an idea before was, as I said, that I noticed all of them happened to be using CPUs that did not believe in aesni, and the assumption that most people are not running into this regularly, so something about their setups is different. I also, annoyingly, have not been able to reproduce this in a VM no matter how I've varied the amount of computational power or cores, so far. :/ |
I must say, this is the first time I've heard someone talk about CPU beliefs. It might be related to Proxmox. I thought I had never seen it with the 0.8.5 ZFS module, but it turns out it was there. Now I have the same issue with 2.0.4. The thing that changed was encryption. For now the data corruption affects only snapshots. I hope it won't start affecting live files. It also seems there's some silent corruption going on. On one of the servers with data errors, I never had this null pointer dereference, yet there are corrupted snapshots, and many of them:
Those hex ids are deleted snapshots (most likely). It can happen that a snapshot is created/deleted during the send/receive. They are asynchronous processes. At the moment I'm running a scrub now on all servers. Will see if those errors go away or not (deleted snapshots). But the "vm-1234" snapshots above cannot be deleted. They're not visible when issuing zfs list -t snapshot, yet, they do have a name in the error list.
There's also more:
|
Ok, I was waken up this morning by a (perhaps related) issue on one of the "sending" servers. The stack trace does not show the encryption part and seems more related to the arc. I haven't seen one of those for years, so being a very rare thing, I decided to post it here in hope it would help. I was doing some things on this server tonight, namely:
The server was completely locked up, load average around 6800 and the following stack trace:
This is an Intel(R) Xeon(R) CPU E5-1620 v2 (4 cores, 8 threads), with 32GB of ECC RAM, running on a Supermicro X9SRE-3F motherboard. No other errors were logged. The system was responsive thanks to a separate ext4 root unrelated to ZFS. I was unable to unmount, mount, or touch anything ZFS-related. I had to issue an unmount/sync/reboot magic SysRq sequence to reboot it. Maybe the reason we're able to trigger this issue is that these are server systems doing heavy lifting on real parallel threads as opposed to "simulated" ones under a VM? I hope this helps. Let me know if I can do anything to help diagnose further. Edit: I'm reading the metrics on the server and can give some more details. The issue started around 2 o'clock in the morning. The total lockup of the server arrived around 4 o'clock, 2 hours later. I had the following symptoms:
All other parameters are ok. This last lock-up seems unrelated to the crypt path above. It looks more like the issues of ram starvation like we used to have back a few years ago. |
A data point: Can you share the output of |
I have randomly picked two sources with two datasets, and the destination: zfs get all output
single thread, icp_aes_impl icp_gcm_impl on fastest (Since the last two crashes were so close to each other, I've implemented a watchdog and automatic unlocking, and went full berserk, re-enabling concurrency) 4 threads icp_aes_impl icp_gcm_impl on generic: 4 threads icp_aes_impl icp_gcm_impl on x86_64 / pclmulqdq |
This time, I left a loop of 4 sends and rollbacks running, after creating an incremental by (on the same source I used for my prior send loop:
Then Unfortunately, as you might guess from my not leading with a success statement, it did not reproduce after 24 hours of looping. I'll probably explore including a zvol next, unless someone else comes up with a brighter idea. |
I think I described my setup before, but I'll try to explain it in more depth:
I do not lock up easily: once every week to 10 days. Now, some meat. In the past few days, I detected some corruption on one of the source servers. ZFS could mount the dataset, but the data in it was inconsistent (most directories under the root were missing). I also could not unmount it once mounted. I had this issue on two related datasets: the root and the data of the same LXC container (1TB of mail server data). Weirdly enough, sending the corrupted dataset to another dataset on the same machine seemed to fix the issue on one of them. For the other (the mail spool), I had to roll back to a known good state, then resync all missing mails from an off-site server before putting it back in production (this is a multisite, geo-replicated mail spool setup, so I had some live data to sync in addition). I was able to delete the corrupted datasets. A full scrub fixed the data errors in the already deleted snapshots. I know all the above does not give any hints, other than anecdotal evidence of a possible problem. I'm only sure there was corruption to start with and that it was silent. It probably broke some algorithm somewhere. I'd start with a full scrub on all affected pools. Here are some get all statistics. Please note, despite the name rpool (the Proxmox root pool), this is not an actual root pool containing the boot files. I just use the same name for historic reasons, but the pool is mounted inside an "ext4" root. First: sending server pool:
Sending server dataset. This is an example. All my datasets are configured the same way (lz4, xattr=sa etc).
Here's a receiver get all :
And a dataset:
And a last thing I almost forgot: Sorry to drop all this meat on the table. I hope I'm not causing confusion. I've not been able to isolate the issue in a better way. |
Update: I was able to get the zfs related processes:
Since the destroys (those are some of the receives where I deliberately make a snapshot for syncing, and they are being cleaned up after the transfer) have the highest pid, perhaps the crash is related to that? |
4 threads icp_aes_impl icp_gcm_impl on x86_64 / pclmulqdq Unfortunately i couldn't capture a process list this time. Here are some events from previous crashes up to the point the system is rebooted and the pool is imported: zpool history
Aren't there some commands that I could run after the exception, to gather more useful data? |
It seems that "reflowing" some of the problematic datasets on different mount point in the same pool has had some beneficial effects in my case. I mean just moving datasets with syncoid from rpool/mount1/dataset[x] -> rpool/mount2/dataset[x] and deleting some older snapshots and scrubbing the pool. BTW: zfs send/receive is so painfully slow. 2x2 mirrors spinning disk often manage only 20-30MB on average, with lots of time spent under the 5-10MB region. |
You might benefit, if you're not using it, from edit: I'm terribly sorry, I was thinking of someone else in the thread who is using gzip-9. My suggestion remains as stated, but assuming its properties work for you, you might want to explore using -w to avoid paying the de+re-encryption overhead. Because, for example, I get much faster throughput than that on my single-disk pools sans encryption. You might also want to use e.g. |
I expect this to be already fixed by #16104 in master. It is likely too fresh for upcoming 2.2.4, but should appear in following 2.2.5 release. |
With zfs 2.2.4 with patch #16104 applied I still get kernel panics with this stacktrace:
|
Is this fixed with 2.2.6? |
I updated to 2.2.6 a few days ago, so haven't had enough time to be confident about this. I'd been running a patched earlier release with significant success, but had also stopped daisy-chained sending to a third host, so it's not clear to me if this is the main reason for the vastly improved stability of the system. |
I'm currently starting the process of migrating my backups to a new platform which will be heavily stressing send/receive so we'll see how it goes.
|
Indeed, but my box was running 2.2.4 with the proposed patches, so never tested 2.2.5 as released. |
Sorry to report that the machine running 2.2.6 spontaneously rebooted overnight. I can only assume a kernel panic, but as nothing has been logged I can't be certain that ZFS was the cause. |
Usually a kernel panic freezes a system and requires a manual reboot. |
You can set a tunable to trigger triple fault on panic, though I don't think any distros do by default, to my knowledge. ...well, sorry, that was imprecise and inaccurate. A panic will generally reboot the machine outright by default on most distros (possibly with a round trip through kexec+kdump). Oops and BUG_ON don't, out of the box. |
Indeed. I was surprised to see it stuck at the LUKS (boot drive) passphrase prompt with a clearly faulty network connection as Clevis hadn’t auto-unlocked it. It then needed a further reboot before it came up, so any lingering previous boot info was long gone before I could interact with it. The system logs just stop with nothing in them. kernel.panic = 0 So I suspect the watchdog kicked in. |
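For reference, a small sketch of how the `kernel.panic` value quoted above is interpreted according to the kernel's sysctl documentation: zero means the kernel loops forever after a panic, a positive value is a reboot delay in seconds, and a negative value means an immediate reboot. The function names are hypothetical, and whether an oops escalates to a panic is governed separately by `kernel.panic_on_oops`.

```python
from pathlib import Path


def describe_panic_timeout(value):
    """Explain a kernel.panic value (semantics per the kernel sysctl documentation)."""
    if value == 0:
        return "no automatic reboot: the kernel loops forever after a panic"
    if value < 0:
        return "reboot immediately after a panic"
    return f"reboot {value} second(s) after a panic"


def read_sysctl_int(name):
    """Read an integer sysctl such as 'kernel/panic' from /proc/sys, or None if absent."""
    path = Path("/proc/sys") / name
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    for key in ("kernel/panic", "kernel/panic_on_oops"):
        print(key, "=", read_sysctl_int(key))
    timeout = read_sysctl_int("kernel/panic")
    if timeout is not None:
        print(describe_panic_timeout(timeout))
```

With `kernel.panic = 0`, a reboot without any logged panic is consistent with something else, such as a hardware watchdog, resetting the machine.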
I'd suggest configuring kdump if you want to figure out more, possibly over the network if a kernel core dump saved on local disk is fraught for you, though perhaps just logging the output of when it broke would be sufficient for you. |
It definitely panicked a few minutes ago. Message from syslogd@backup01 at Sep 23 10:00:16 ... Message from syslogd@backup01 at Sep 23 10:00:16 ... It had managed to internally (old to new pool) send/receive over 6TB of data over the weekend, a process that had completed. Looking at logs there were a couple of remote receives launched at the time of the crash - looks like I need to re-instate my Syncoid concurrency prevention wrapper. Section from /var/log/messages attached. Although kdump is configured it didn't write anything to /var/crash :-( |
Today's errant behaviour is a stuck 'zfs receive' - 100% CPU utilisation, no obvious progress (dataset size is static) and receive is unkillable. |
Until zfs native encrypted backups become stable, you can use restic to make encrypted remote snapshots of ZFS snapshots. Restic snapshots can be encrypted and incremental, and you can delete any restic snapshot without losing data. Restic has its own deduplicated blocks. You can use https://zfs.rent or https://www.rsync.net/ with restic right now. For now, LUKS is faster than zfs native encryption. Thus, LUKS, zfs, and restic are the best options now. |
@scratchings If you remove dust in your computer case, does the issue go away? Sometimes, dust in computer case causes errors in RAM or GPU. My GPU used to freeze until I removed dust in my computer case. |
My host is server grade, ECC RAM (with no reported bit corrections) in a clean server room, and only a few months old, so I very much doubt this. |
As I have written above, I believe the original panic of the report and its later reproductions should be fixed in 2.2.5 and up. The following report seems to be different, so let's close this and open a new issue if needed |
@scratchings The panic definitely happened during receive. What is that "Syncoid concurrency prevention wrapper", what does it prevent, and why was it needed? Is there any evidence that concurrent receives are bad? I suppose it does not try to concurrently receive the same dataset or something like that? |
It's a python script that uses lock files to ensure that only one syncoid can run on the instigating host, and also checks that syncoid isn't running on the sending host (to prevent crashes on that host when it's being used in a daisy-chain fashion). I've reached the point with this particular host (which is replacing the original host that first suffered this issue, and has always had ZFS related crashes, no matter which version of ZFS it used) that I'm going to give up and convert it to LUKS + plain ZFS as I was having to send/receive onto a new JBOD anyway. As someone mentioned historic pool migration, the pool I'm migrating from is from the 0.8 RC era - old enough to have originally suffered from Errata #4 (but the dataset in question was send/received to a fresh dataset to clear this years ago). We will still have a few hosts running native encryption, one of which is running 2.2.6, and those that take snapshots regularly experience #12014. At least these hosts have never crashed, I've only ever seen this on the receiving side. Even post LUKSing I'll still be running encryption for the month+ it will take to transfer the >400TB of data, so if I get further crashes I'll report them. Perhaps of interest... When I got the first zfs process hang, it was a receive on the old pool (aes-256-ccm, lz4), there was a recursive receive in progress from that pool to the new pool (aes-256-gcm + recompression with zstd), this continued to operate for several more days (10+TB transferred and multiple datasets) until it too got a stuck zfs process. I guess if this is a race condition, then the higher performance of the gcm vs ccm encryption may be helping somewhat. |
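The actual wrapper is not shown in the thread. The following is only a rough sketch of the lock-file idea described above, assuming a single advisory lock taken with `flock` around each syncoid call; the lock path, function names, and timeout handling are all hypothetical, and the check for syncoid running on the sending host is not covered.

```python
import contextlib
import fcntl
import os
import subprocess
import time


@contextlib.contextmanager
def exclusive_lock(path, timeout=None, poll=1.0):
    """Hold an exclusive advisory lock on `path` for the duration of the block.

    Waits until the lock is free; raises TimeoutError after `timeout` seconds
    if a timeout is given. The lock is released when the descriptor is closed.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        deadline = None if timeout is None else time.monotonic() + timeout
        while True:
            try:
                fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
                break
            except BlockingIOError:
                if deadline is not None and time.monotonic() >= deadline:
                    raise TimeoutError(f"could not lock {path}")
                time.sleep(poll)
        yield
    finally:
        os.close(fd)


def run_serialized(command, lock_path="/run/lock/syncoid-wrapper.lock"):
    """Run `command` (an argv list) only while holding the lock.

    The default lock path is a placeholder, not taken from the thread.
    """
    with exclusive_lock(lock_path):
        return subprocess.run(command, check=False).returncode
```

Cron jobs can then start on schedule and simply wait their turn, for example `run_serialized(["syncoid", "-r", SOURCE, TARGET])` with the site's own source and target.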
@scratchings You still haven't really answered my questions. While working on #16104 I've noticed that the ZFS dbuf layer is generally unable to handle both encrypted and non-encrypted data for a dataset at the same time. It should not happen normally, since the dataset should not be accessible until the encrypted receive is fully completed and the TXG is synced, but if we assume there are some races, then we may see all kinds of problems, and I want to know about them. That is why I want to know what sort of concurrency creates a problem for you and what sort of serialization fixes it. It would be a huge help if we could reproduce it manually somehow, but we are rarely that lucky. Speaking of the "recompression" mentioned as part of encrypted dataset replication, that would be possible only if the replication is unencrypted. Is that what we are talking about in general, or was it only that specific case? Unencrypted replication of an encrypted dataset? Because it might be a very different case from the encrypted replication I was thinking about, with very different dragons. |
This is a host that backs up several (ca 10) hosts over both local 10Gbit and remote 1Gbit SSH connections. The most stable environment has been when the host pulls backups from remote hosts such that it can ensure that only one receive is ever happening across the whole pool. I wrote a py script to wrap syncoid calls using lock files such that the cron tasks can launch on schedule and then wait until nothing has a lock on this file. We also pushed backups to a remote location (slower network) and check for syncoids running there (it's doing some backups of hosts local to it). It's this latter process that seems to have the biggest impact on stability. Our remote push destination failed so regularly that the common snapshots were lost and thus sends stopped. At this point the stability of the primary backup location improved markedly. I've now embarked on a replacement of this primary backup host (new server hardware and new JBOD). As part of this I briefly (based on the optimism that 2.2.5 fixed the crashes) switched back to client instigated receives (i.e. not central control of when this might happen, cron tasks on the client launch syncoid in push mode) with no exclusivity locks as this obviously has benefits for ensuring the client is in a 'good' state with respect to service backups and means that other backups aren't held up in the case of a large number of blocks needing to be transferred. This made things go bad. As to the 'recompression', this is by overriding the compression setting on receive, e.g. for the internal transfer to the new pool:
When we see 'corrupt' snapshots on the client end, this is 'discovered' during the send process; syncoid will report:
This is Red Hat Enterprise 9. Happy to clarify further as required. |
This is not a hardware problem. I managed to reproduce it on a VM server. A few times I got this error message about dmu_recv, but I cannot reproduce it on every run. It looks like it is caused by sending an unencrypted snapshot from an encrypted dataset, which keeps unencrypted data in memory, and then receiving a new snapshot. On a production backup server the same situation happens when a source server sends a snapshot to the backup server shortly after that dataset was sent from this backup server to another backup server (to balance disk usage across servers). |
In case it's significant. My receiving pool is quite full - 94% (has been as high as 98%) and ca 20% fragmentation (it's a 650TB pool, so even at 94% there's ca 40TB of free space). |
System information
Describe the problem you're observing
When I start sending raw ZFS snapshots to a different system, my Linux system (4.19.0-14-amd64) starts to hang completely. I can ping it and I can run a very few commands (such as dmesg), but most commands hang (including zfs, zpool, htop, ps, ...). The entire system hangs completely.
Dmesg shows the following entries at the time of the occurrence:
Interestingly, the transfer continues happily but just everything else in the system hangs.
The only way to recover is resetting the machine (since not even reboot works).
Describe how to reproduce the problem
It's a tough one. It seems to me that the issue might be load related in some sense, since it only occurs if I have two zfs sends (via syncoid) running in parallel that both involve encrypted datasets.
Transfer 1
The first one sends datasets from an unencrypted dataset into an encrypted one (I am migrating to encryption).
I use syncoid and use the command:
syncoid -r --skip-parent --no-sync-snap zpradix1imain/sys/vz zpradix1imain/sys/vz_enc
This translates into
zfs send -I 'zpradix1imain/sys/vz/main'@'zfs-auto-snap_hourly-2021-03-02-1917' 'zpradix1imain/sys/vz/main'@'zfs-auto-snap_frequent-2021-03-02-1932' | mbuffer -q -s 128k -m 16M 2>/dev/null | pv -s 16392592 | zfs receive -s -F 'zpradix1imain/sys/vz_enc/main'
Transfer 2
I transfer data from an encrypted dataset raw to a secondary server.
The syncoid command is:
syncoid -r --skip-parent --no-sync-snap --sendoptions=w --exclude=zfs-auto-snap_hourly --exclude=zfs-auto-snap_frequent zpradix1imain/data [email protected]:zpzetta/radix/data
This translates into:
zfs send -w 'zpradix1imain/data/home'@'vicari-prev' | pv -s 179222507064 | lzop | mbuffer -q -s 128k -m 16M 2>/dev/null | ssh ...
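For anyone trying to reproduce this, the two transfers need to overlap in time. A minimal harness for starting several commands at once and collecting their exit codes could look like the sketch below; `run_in_parallel` is a made-up helper, and the syncoid argument lists have to be filled in from the commands above (the remote target is site-specific). Only run this on a machine that is allowed to hang.

```python
import subprocess


def run_in_parallel(commands):
    """Start every command (each an argv list) at once, then wait for all of them.

    Returns the exit codes in the same order as `commands`.
    """
    procs = [subprocess.Popen(cmd) for cmd in commands]
    return [proc.wait() for proc in procs]


# Hypothetical usage, with the arguments copied from Transfer 1 and Transfer 2:
#
#   transfer_1 = ["syncoid", "-r", "--skip-parent", "--no-sync-snap",
#                 "zpradix1imain/sys/vz", "zpradix1imain/sys/vz_enc"]
#   transfer_2 = ["syncoid", "-r", "--skip-parent", "--no-sync-snap",
#                 "--sendoptions=w", "zpradix1imain/data", REMOTE_TARGET]
#   print(run_in_parallel([transfer_1, transfer_2]))
```

Because both processes are started before either is waited on, the receive into the encrypted dataset and the raw send run concurrently, which is the condition described above.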
In summary: