Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deadlock with Postgres on 2.0.1 #11463

Closed
dkacar-oradian opened this issue Jan 14, 2021 · 12 comments
Closed

Deadlock with Postgres on 2.0.1 #11463

dkacar-oradian opened this issue Jan 14, 2021 · 12 comments
Labels
Type: Defect Incorrect behavior (e.g. crash, hang)

Comments

@dkacar-oradian
Copy link

System information

Type Version/Name
Distribution Name CentOS
Distribution Version 7.9
Linux Kernel 3.10.0-1160.11.1
Architecture x86_64
ZFS Version 2.0.1
SPL Version 2.0.1

Describe the problem you're observing

I have an easily reproducible deadlock on CentOS 7.9 with OpenZFS 2.0.1 (kmod kernel package from the official RPM repo) which doesn't happen on 2.0.0.

I have this in top:

top - 14:45:30 up 10 min,  2 users,  load average: 1.97, 1.68, 0.89
Tasks: 258 total,   4 running, 254 sleeping,   0 stopped,   0 zombie
%Cpu0  :  0.7 us,  1.0 sy,  0.0 ni, 97.7 id,  0.7 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu1  :  0.3 us,  0.7 sy,  0.0 ni, 99.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu2  :  0.3 us,  1.0 sy,  0.0 ni, 97.3 id,  1.0 wa,  0.0 hi,  0.3 si,  0.0 st
%Cpu3  :  0.0 us,100.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu4  :  0.0 us,100.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu5  :  0.0 us,100.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 41185060 total, 36549828 free,  2478316 used,  2156916 buff/cache
KiB Swap:  2097148 total,  2097148 free,        0 used. 37563100 avail Mem

At first there are only 2 CPUs spinning at 100%, but after some time the third joins in. And so it stays and I have to power off the VM (running in VmWare). I suppose those are deadlocked spinlocks in the kernel or something like that.

Debug log shows:

# cat /proc/spl/kstat/zfs/dbgmsg
1610631466   spa_history.c:304:spa_history_log_sync(): txg 408988 snapshot pg_data/prod-db-c09_5433@rmf_6432 (id 645)  
1610631466   spa_history.c:329:spa_history_log_sync(): ioctl snapshot
1610631466   spa_history.c:296:spa_history_log_sync(): command: zfs snapshot pg_data/prod-db-c09_5433@rmf_6432
1610631466   spa_history.c:304:spa_history_log_sync(): txg 408989 clone pg_data/rmf_6432 (id 651) origin=pg_data/prod-db-c09_5433@rmf_6432 (645)
1610631466   spa_history.c:304:spa_history_log_sync(): txg 408991 set pg_data/rmf_6432 (id 651) redundant_metadata=1
1610631466   spa_history.c:304:spa_history_log_sync(): txg 408991 set pg_data/rmf_6432 (id 651) compression=16
1610631466   spa_history.c:304:spa_history_log_sync(): txg 408991 set pg_data/rmf_6432 (id 651) atime=0
1610631466   spa_history.c:304:spa_history_log_sync(): txg 408991 set pg_data/rmf_6432 (id 651) xattr=2
1610631466   spa_history.c:304:spa_history_log_sync(): txg 408991 set pg_data/rmf_6432 (id 651) mountpoint=/data/pg_data/snapshots/rmf_6432
1610631468   spa_history.c:329:spa_history_log_sync(): ioctl clone
1610631468   spa_history.c:296:spa_history_log_sync(): command: zfs clone -oredundant_metadata=most -ocompression=zstd -oatime=off -oxattr=sa -o mountpoint=/data/pg_data/snapshots/rmf_6432 pg_data/prod-db-c09_5433@rmf_6432 pg_data/rmf_6432

And nothing new appears.

Describe how to reproduce the problem

What I did was: create 2 Postgres slaves with ZFS 2.0.0, each on its own ZFS file system. Then upgraded to 2.0.1 and created snapshot and clone on one of them and tried to start primary Postgres instance on the clone. That will apply a certain number of WAL files and then it gets stuck like this. Reproducible every time, so far.

Include any warning/errors/backtraces from the system logs

However, this data isn't particularly useful for debugging the issue. When I had similar problems with BTRFS there were stack traces for stuck processes obtainable with dmsg, due to:

# cat /proc/sys/kernel/hung_task_timeout_secs
120

But there's absolutely nothing after the boot messages with ZFS.

So how do I find out relevant data to debug this?

@dkacar-oradian dkacar-oradian added Status: Triage Needed New issue which needs to be triaged Type: Defect Incorrect behavior (e.g. crash, hang) labels Jan 14, 2021
@sempervictus
Copy link
Contributor

Does this happen if you unmount the source of the clone and then re-mount it before spinning up the WAL writer threads on the clone itself?

@Do-while-Bacon
Copy link

Do-while-Bacon commented Jan 17, 2021

This deadlock behavior is very similar to what I was seeing in #11476, also on CentOS 7.9, and also resolved by downgrading to 2.0.0.

@dkacar-oradian
Copy link
Author

@sempervictus WAL directory is moved to ext4 before Postgres is started. So WAL writers are not in the picture, I think.
I'll try unmounting and remounting the origin before starting Postgres.

@dkacar-oradian
Copy link
Author

I did git bisect on this and I got:

58bc86c5cb47bb9d1886a0d0d3f47df90dc3955a is the first bad commit
commit 58bc86c5cb47bb9d1886a0d0d3f47df90dc3955a
Author: Brian Behlendorf <[email protected]>
Date:   Fri Dec 18 08:48:26 2020 -0800

    Linux 5.10 compat: use iov_iter in uio structure
    

Unfortunately, that one is quite large.

Testing methodology:
good - if starting Postgres on a clone passed three times without deadlock and if those clones were deleted without deadlock
bad - one of the three attempts would deadlock. But in practice it was always the first one.

Interesting thing is that with the full 2.0.1 release I first see 2 CPUs spinning at 100% and after some time the third joins them. But with this commit it's only one CPU at 100%. The others never join.

@sempervictus
Copy link
Contributor

Skimming over the commit the uiomove_iter setup looks a bit suspect in the "full pipe" case documented in the function.
Looks like we also gained some cycles in zpl_xattr_get_dir.
The whole thing's weird - it kind of looks like an attempt to unify access paths for the future when everything is great and everyone's on the same happy/stable UIO codepaths, not taking into account that most humans are using older stable kernels likely containing distro backports for the "very much in-flux" Linux IO subsystem. I dont know if this makes sense in 2.0.1 at all (the tip-kernel fanboy club is probably less important to support than release package consumers), or why the legacy implementation known to work was not kept ifdefd for everyone before 5.10.

@arm32let
Copy link

courtesy of echo l > /proc/sysrq-trigger

[ 1793.329066] CPU: 2 PID: 1315 Comm: postgres Tainted: P           OE  ------------   3.10.0-1160.11.1.el7.x86_64 #1
[ 1793.329066] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014
[ 1793.329067] task: ffff8f8b3a208000 ti: ffff8f8b3a540000 task.ti: ffff8f8b3a540000
[ 1793.329067] RIP: 0010:[<ffffffffa2d97b80>]  [<ffffffffa2d97b80>] copy_user_handle_tail+0x70/0x90
[ 1793.329068] RSP: 0018:ffff8f8b3a543a68  EFLAGS: 00040206
[ 1793.329068] RAX: 0000000000003000 RBX: 0000000000003000 RCX: 0000000000000000
[ 1793.329069] RDX: 0000000000000318 RSI: ffff9bdc014b9ce9 RDI: ffff9bdc014b9ce8
[ 1793.329069] RBP: ffff8f8b3a543a68 R08: 0000000000000000 R09: 0000000000000000
[ 1793.329070] R10: 0000000000011000 R11: ffffdf5100eaed80 R12: ffff8f8b3a543d80
[ 1793.329070] R13: 0000000000001000 R14: 0000000000003000 R15: ffff8f8b3a543df8
[ 1793.329071] FS:  00007f8fc3b7b880(0000) GS:ffff8f8b3fd00000(0000) knlGS:0000000000000000
[ 1793.329071] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1793.329072] CR2: 00007f8fb5a5b000 CR3: 000000003a53c000 CR4: 0000000000360ee0
[ 1793.329072] Call Trace:
[ 1793.329073]  [<ffffffffc094b2d6>] uiomove_iov+0x1d6/0x2b0 [zfs]
[ 1793.329073]  [<ffffffffc094b554>] uiomove+0x14/0x30 [zfs]
[ 1793.329074]  [<ffffffffc07f969b>] dmu_write_uio_dnode+0x9b/0x200 [zfs]
[ 1793.329074]  [<ffffffffc07f9854>] dmu_write_uio_dbuf+0x54/0x70 [zfs]
[ 1793.329075]  [<ffffffffc09188f5>] zfs_write+0xe65/0x1180 [zfs]
[ 1793.329075]  [<ffffffffa2c71d94>] ? mntput+0x24/0x40
[ 1793.329076]  [<ffffffffa2c5e08d>] ? do_last+0x66d/0x1340
[ 1793.329076]  [<ffffffffc0967c3d>] zpl_aio_write+0x13d/0x1d0 [zfs]
[ 1793.329076]  [<ffffffffa2c4d243>] do_sync_write+0x93/0xe0
[ 1793.329077]  [<ffffffffa2c4dd30>] vfs_write+0xc0/0x1f0
[ 1793.329077]  [<ffffffffa2c4eb0f>] SyS_write+0x7f/0xf0
[ 1793.329078]  [<ffffffffa3193f92>] system_call_fastpath+0x25/0x2a

@arm32let
Copy link

postgresql-server-9.2 running on top of ZFS.

wal_level = archive
fsync = on 
synchronous_commit = on
wal_sync_method = fsync

in /var/lib/pgsql/data/postgresql.conf if that even matters.

reproduced running sudo -u postgres psql -c 'create database test;'

@dkacar-oradian
Copy link
Author

How do I build kernel modules to include enough debug information to decode hex offsets to line numbers? I've tried adding --enable-debuginfo, but that doesn't seem to do the trick. The decode_stacktrace.sh script wants to find .debug_line ELF section.

@behlendorf
Copy link
Contributor

@dkacar-oradian thanks for narrowing this down, I'll see if I can reproduce it on CentOS 7 with the provided reproducer. While the patch is fairly large and some code was refactored to deal with the 5.10 kernel change what the code does should be identical for CentOS 7. Though clearly something a little different, we just need to identify it.

The whole thing's weird - it kind of looks like an attempt to unify access paths for the future when everything is great and everyone's on the same happy/stable UIO code paths not taking into account that most humans are using older stable kernels

@sempervictus yes, there's more churn than I would have liked but it was necessary to support both 3.10 and 5.10 kernels. I'm surprised we didn't encounter this issue in the master branch before backporting the change. We manually ran a ton of regression tests as part of making this change. I considered the ifdef approach, but backporting the change unmodified from master with all the testing felt like the lowest risk approach. The alternative would be doing something custom for the 2.0 branch which would have seen less testing and complicated any needed future compatibility backports.

@behlendorf
Copy link
Contributor

I've opened PR #11484 with a proposed fix for this regression. Any additional testing would be welcome. Thus far I've only verified it resolves the provided Postgres test case on CentOS 7.9 with the 3.10.0-1160.11.1 kernel. @Do-while-Bacon I'd appreciate any help in verifying it also resolves the docker container issue as well (it should). The patch applies cleanly to the 2.0.1 tag.

@dkacar-oradian
Copy link
Author

dkacar-oradian commented Jan 20, 2021

I've applied that patch on top of 2.0.1 and things work fine with starting Postgres on a clone (three in a row) and deleting them. Reboot also works fine.

behlendorf added a commit to behlendorf/zfs that referenced this issue Jan 21, 2021
As part of commit 1c2358c the custom uio_prefaultpages() code
was removed in favor of using the generic kernel provided
iov_iter_fault_in_readable() interface.  Unfortunately, it
turns out that up until the Linux 4.7 kernel the function would
only ever fault in the first iovec of the iov_iter.  The result
being uiomove_iov() may hang waiting for the page.

This commit effectively restores the custom uio_prefaultpages()
pages code for Linux 4.9 and earlier kernels which contain the
troublesome version of iov_iter_fault_in_readable().

Signed-off-by: Brian Behlendorf <[email protected]>
Closes openzfs#11463 
Closes openzfs#11484
@behlendorf
Copy link
Contributor

The fix has been merged to the master branch and I've opened #11493 to make sure it gets applied for OpenZFS 2.0.2.

@behlendorf behlendorf removed the Status: Triage Needed New issue which needs to be triaged label Jan 21, 2021
behlendorf added a commit that referenced this issue Jan 22, 2021
As part of commit 1c2358c the custom uio_prefaultpages() code
was removed in favor of using the generic kernel provided
iov_iter_fault_in_readable() interface.  Unfortunately, it
turns out that up until the Linux 4.7 kernel the function would
only ever fault in the first iovec of the iov_iter.  The result
being uiomove_iov() may hang waiting for the page.

This commit effectively restores the custom uio_prefaultpages()
pages code for Linux 4.9 and earlier kernels which contain the
troublesome version of iov_iter_fault_in_readable().

Signed-off-by: Brian Behlendorf <[email protected]>
Closes #11463 
Closes #11484
jsai20 pushed a commit to jsai20/zfs that referenced this issue Mar 30, 2021
As part of commit 1c2358c the custom uio_prefaultpages() code
was removed in favor of using the generic kernel provided
iov_iter_fault_in_readable() interface.  Unfortunately, it
turns out that up until the Linux 4.7 kernel the function would
only ever fault in the first iovec of the iov_iter.  The result
being uiomove_iov() may hang waiting for the page.

This commit effectively restores the custom uio_prefaultpages()
pages code for Linux 4.9 and earlier kernels which contain the
troublesome version of iov_iter_fault_in_readable().

Signed-off-by: Brian Behlendorf <[email protected]>
Closes openzfs#11463 
Closes openzfs#11484
sempervictus pushed a commit to sempervictus/zfs that referenced this issue May 31, 2021
As part of commit 1c2358c the custom uio_prefaultpages() code
was removed in favor of using the generic kernel provided
iov_iter_fault_in_readable() interface.  Unfortunately, it
turns out that up until the Linux 4.7 kernel the function would
only ever fault in the first iovec of the iov_iter.  The result
being uiomove_iov() may hang waiting for the page.

This commit effectively restores the custom uio_prefaultpages()
pages code for Linux 4.9 and earlier kernels which contain the
troublesome version of iov_iter_fault_in_readable().

Signed-off-by: Brian Behlendorf <[email protected]>
Closes openzfs#11463 
Closes openzfs#11484
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Type: Defect Incorrect behavior (e.g. crash, hang)
Projects
None yet
Development

No branches or pull requests

5 participants