NFS server soft-locks on attempting to umount a snapshot from which client copied dirs #5853

Closed
bokkiedog opened this issue Mar 2, 2017 · 2 comments

bokkiedog commented Mar 2, 2017

System information

Type                  Version/Name
Distribution Name     Debian GNU/Linux
Distribution Version  8 (Jessie)
Linux Kernel          3.16.0-4-amd64 #1 SMP Debian 3.16.39-1+deb8u1 (2017-02-22) x86_64 GNU/Linux
Architecture          x86_64
ZFS Version           v0.6.5.9-2~bpo8+1
SPL Version           v0.6.5.9-1~bpo8+1
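
For reference, the versions above can be collected with something like the following (a rough sketch; assumes the zfs and spl modules are loaded and a Debian-style package manager):

uname -a                         # kernel version and architecture
cat /sys/module/zfs/version      # version of the loaded zfs module
cat /sys/module/spl/version      # version of the loaded spl module
dpkg -l | grep -Ei 'zfs|spl'     # installed package versions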

Describe the problem you're observing

On my Proxmox NFS-server VM, after an NFS-client VM copies files out of a .zfs snapshot hierarchy, attempting to destroy that snapshot soft-locks the NFS server's kernel on the underlying umount, even after exportfs -f. The snapshot destroys successfully after a hard reboot.

Describe how to reproduce the problem

1. Create a ZFS dataset on an NFS server (within an encapsulating dataset, in my case).
2. Create a ZFS snapshot of that new dataset called 'snapped'.
3. On an NFS client connected to that server, copy some dirs and files out of the .zfs/snapshot/snapped/ directory.
4. Try to destroy that snapshot on the server; it hangs even after exportfs -f.

The kernel soft-locks while attempting to umount the snapshot, and the VM eventually grinds to a halt.
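
When it gets into this state, the hang can be confirmed by looking for the umount process stuck in uninterruptible sleep and dumping its kernel stack (a rough sketch; assumes /proc/<pid>/stack and sysrq are available, and <pid> is the PID of the stuck umount):

# Show processes stuck in uninterruptible (D) sleep and what they are waiting on
ps axo pid,stat,wchan:32,cmd | awk 'NR == 1 || $2 ~ /^D/'

# Dump the kernel stack of the stuck umount
cat /proc/<pid>/stack

# Or ask the kernel to log the stacks of all blocked tasks to the system log
echo w > /proc/sysrq-trigger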

I include here a script which reliably triggers this pathological behaviour on my system. I have renamed some of the paths and servers below for clarity:

#!/bin/bash

# This script crashes a proxmox VM running an NFS server

# The full name of the dataset to create
DATASET=pool/test

# The mountpoint (on my system, it's the same as the dataset name)
MNT=/$DATASET

# The short name of the snapshot to take
SNAPNAME=snapped

# The full name of the snapshot for the dataset
SNAPSHOT=$DATASET@$SNAPNAME

# The full path to the test directory hierarchy to create
TEST_DIR=$MNT/contents/foo

# The command with which to ssh to the NFS client that mounts this server
NFS_CLIENT_SSH=user@nfs-client-local

echo '** Create the dataset'
zfs create $DATASET

echo '** Create directories and files within'
mkdir -p $TEST_DIR
touch $TEST_DIR/bar $TEST_DIR/baz
chmod 777 $MNT/.

echo '** Make a snapshot'
zfs snapshot $SNAPSHOT

echo '** SSH to the client and ask it to copy data from the snapshot'
ssh $NFS_CLIENT_SSH cp -a $MNT/.zfs/snapshot/$SNAPNAME/contents $MNT/restored

echo '** Destroy the snapshot'
exportfs -f
zfs destroy $SNAPSHOT

# The command never completes, with the umount entering D-state forever
# and the kernel soft-locking
#
# On hard rebooting, the above destroy command DOES work
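
For what it's worth, the snapshot appears to get automounted on the server as a side effect of the client walking .zfs/snapshot over NFS, which is why the destroy ends up running an umount. A rough way to inspect and try to release that automount before the destroy, assuming the paths from the script above:

# List snapshot automounts currently held on the server
grep '\.zfs/snapshot' /proc/mounts

# See what, if anything, is still using the snapshot mountpoint
fuser -vm /pool/test/.zfs/snapshot/snapped

# Try to unmount it manually before the destroy
umount /pool/test/.zfs/snapshot/snapped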

Include any warning/errors/backtraces from the system logs

The running process list once my script has stalled trying to umount the snapshot (using my actual paths and hostnames, not the simplified names in the test script above):


$ ps rauxwww
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root        11  0.0  0.0      0     0 ?        R    10:19   0:00 [watchdog/1]
root        12  0.0  0.0      0     0 ?        R    10:19   0:00 [migration/1]
root        28  0.0  0.0      0     0 ?        R    10:19   0:00 [kworker/1:1]
root      3497  0.0  0.1  35320  3300 pts/0    D+   10:25   0:00 zfs destroy data/shardstor01/home/testcrash@snapped
root      3499  1.6  0.1  19856  2336 ?        R    10:25   0:00 umount -t zfs -fn /data/shardstor01/home/testcrash/.zfs/snapshot/snapped

Message from syslogd@tpdev-nfs at Mar  2 10:26:07 ...
 kernel:[  412.048005] BUG: soft lockup - CPU#1 stuck for 22s! [umount:3499]

I have attached a kern.log that has everything from boot to the cycling soft-lock message.

broken_kern_log.txt
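
The relevant portion is the repeating stack trace of the stuck umount task; it can be pulled out of the attached log (or from the live ring buffer on the server) with something like:

grep -A 40 'soft lockup' broken_kern_log.txt
dmesg | grep -A 40 'soft lockup'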

bokkiedog (Author) commented:

(BTW, is this related to issue #5810? Should I attempt the patch there, or is this new?)

tuxoko (Contributor) commented Mar 2, 2017

@bokkiedog
It's the same issue; you should try the patch and post further updates on #5810.

tuxoko closed this as completed Mar 2, 2017