

segfault when using rollback while doing dd #795

Closed
Pinkbyte opened this issue Jun 23, 2012 · 3 comments

@Pinkbyte

System:

zfstest ~ # uname -a
Linux zfstest 3.2.12-gentoo-ZFSTEST #2 SMP Sun Jun 17 03:40:10 MSK 2012 x86_64 Intel(R) Core(TM)2 Duo CPU T7700 @ 2.40GHz GenuineIntel GNU/Linux

zfstest ~ # cat /proc/meminfo | grep Mem
MemTotal:        4046560 kB
MemFree:         3817324 kB

zfstest ~ # zfs list
NAME                 USED  AVAIL  REFER  MOUNTPOINT
bledge              1.81G  7.91G    30K  none
bledge/root         1.44G  7.91G  1.37G  /
bledge/usr           382M  7.91G    30K  none
bledge/usr/portage   382M  7.91G   382M  /usr/portage

zfstest ~ # zfs list -t snapshot
NAME                            USED  AVAIL  REFER  MOUNTPOINT
bledge/root@before_updates      658K      -  1.36G  -
bledge/root@eix_installed       112K      -  1.36G  -
bledge/root@eix_cache_created   115K      -  1.37G  -
bledge/root@world_updated       192K      -  1.37G  -
bledge/root@sshd_enabled         73K      -  1.37G  -
bledge/root@create_test_sh       78K      -  1.37G  -

ZFS and SPL versions: 0.6.0_rc9

How to reproduce this situation (it doesn't happen every time, but very often):

  1. open two terminals;
  2. in the first terminal, run: dd if=/dev/zero of=/zerofile bs=1M
  3. in the second terminal, run: zfs rollback bledge/root@create_test_sh

dd aborts with an input/output error, but I think that is expected in this situation...

Second terminal:

zfstest ~ # zfs rollback bledge/root@create_test_sh
internal error: unable to open /etc/mtab
zfstest ~ # ls /etc/mtab
ls: cannot access /etc/mtab: Input/output error
zfstest ~ # rm /etc/mtab
Segmentation fault

First terminal:

general protection fault: 0000 [#1] SMP
CPU 1
Modules linked in: zfs(P) zunicode(P) zavl(P) zcommon(P) znvpair(P) spl(O) scsi_wait_scan

Pid: 3393, comm: rm Tainted: P W O 3.2.12-gentoo-ZFSTEST #2 Bochs Bochs
RIP: 0010:[<ffffffffa01221eb>] [<ffffffffa01221eb>] zfs_inode_destroy+0x6b/0x100 [zfs]
RSP: 0018:ffff8801143cda78 EFLAGS: 00010282
RAX: ffff8800c90e4980 RBX: ffff8800c90e49a8 RCX: dead000000100100
RDX: dead000000200200 RSI: dead000000100100 RDI: ffff8800cab29420
RBP: ffff8800cab29000 R08: ff18f1e80b868e03 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff8800c90e4800
R13: ffff8800cab29420 R14: 0000000000000600 R15: 0000000000000000
FS: 00007f069cb3f700(0000) GS:ffff88011fc80000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f069c66d3b0 CR3: 00000001143b6000 CR4: 00000000000006a0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process rm (pid: 3393, threadinfo ffff8801143cc000, task ffff8800c6aac1a0)
Stack:
 0000000000000000 ffff8800cab29000 ffff8800c90e49a8 0000000000000000
 ffff8800c90ae180 ffffffffa01227dc 000000000000002c 000000000001d02c
 ffff8800c90e4920 0000000000000000 0000000000050008 ffff8800c8366e08
Call Trace:
 [<ffffffffa01227dc>] ? zfs_inode_update+0x55c/0x670 [zfs]
 [<ffffffffa0124a04>] ? zfs_zget+0x154/0x1e0 [zfs]
 [<ffffffffa01056f0>] ? zfs_dirent_lock+0x470/0x570 [zfs]
 [<ffffffffa011d72d>] ? zfs_remove+0x12d/0x450 [zfs]
 [<ffffffff815f5f95>] ? _raw_spin_lock+0x5/0x10
 [<ffffffffa0132010>] ? zpl_fallocate_common+0x610/0x710 [zfs]
 [<ffffffff810eb1dd>] ? vfs_unlink+0x8d/0x100
 [<ffffffff810eb3f1>] ? do_unlinkat+0x1a1/0x1d0
 [<ffffffff810e1c4f>] ? sys_newfstatat+0x1f/0x50
 [<ffffffff815f68d2>] ? system_call_fastpath+0x16/0x1b
Code: ad 20 04 00 00 4c 89 ef e8 e3 28 4d e1 48 be 00 01 10 00 00 00 ad de 4c 89 ef 4c 89 e0 48 03 85 00 04 00 00 48 8b 08 48 8b 50 08 <48> 89 51 08 48 89 0a 48 89 30 48 b9 00 02 20 00 00 00 ad de 48
RIP [<ffffffffa01221eb>] zfs_inode_destroy+0x6b/0x100 [zfs]
 RSP <ffff8801143cda78>
---[ end trace 596d4c2a441ea645 ]---

After that, any command in an already-opened terminal hangs that terminal; only a reboot helps...

@behlendorf
Contributor

Thanks for the reproducer, it should make this easier to run down.

@nedbass
Contributor

nedbass commented Dec 27, 2012

The following procedure reliably reproduces a variation of this bug for me.

I'm not sure if rolling back a mounted filesystem is even supposed to work. The old Solaris ZFS administration guide says a mounted filesystem is unmounted and remounted, and the rollback fails if the unmount fails. However, the current code doesn't seem to attempt this, so that documentation may be obsolete. Also, I verified that this test works on OpenIndiana, so there may be a reasonable way to fix this on Linux.

zfs create tank/fish
zfs snapshot tank/fish@a
touch /tank/fish/a
tail -f /tank/fish/a &
sleep 1
zfs rollback tank/fish@a
sleep 1
touch /tank/fish/b

The last touch triggers a WARNING and a GPF, then segfaults:

[  365.184184] ------------[ cut here ]------------
[  365.184193] WARNING: at /build/buildd/linux-3.5.0/fs/inode.c:964 unlock_new_inode+0x79/0x90()
[  365.184195] Hardware name: Bochs
[  365.184196] Modules linked in: zfs(PO) zcommon(PO) zunicode(PO) znvpair(PO) zavl(PO) splat(O) spl(O) zlib_deflate stap_1e31e8a81d4a0393e2aa39814f900635_2290(O) parport_pc rfcomm bnep bluetooth ppdev kvm snd_hda_intel snd_hda_codec snd_hwdep snd_pcm snd_seq_midi microcode snd_rawmidi snd_seq_midi_event snd_seq snd_timer psmouse snd_seq_device serio_raw snd soundcore virtio_balloon snd_page_alloc i2c_piix4 mac_hid lp parport floppy
[  365.184228] Pid: 2843, comm: touch Tainted: P           O 3.5.0-17-generic #28-Ubuntu
[  365.184229] Call Trace:
[  365.184247]  [<ffffffff81051c4f>] warn_slowpath_common+0x7f/0xc0
[  365.184250]  [<ffffffff81051caa>] warn_slowpath_null+0x1a/0x20
[  365.184253]  [<ffffffff8119bc69>] unlock_new_inode+0x79/0x90
[  365.184289]  [<ffffffffa03d2633>] zfs_znode_alloc+0x5c3/0x780 [zfs]
[  365.184309]  [<ffffffffa03634c6>] ? sa_idx_tab_rele+0x66/0x1b0 [zfs]
[  365.184332]  [<ffffffffa03d30b4>] zfs_mknode+0x8c4/0xed0 [zfs]
[  365.184356]  [<ffffffffa03cb98a>] zfs_create+0x59a/0x760 [zfs]
[  365.184379]  [<ffffffffa03ee7a8>] zpl_create+0xa8/0x1d0 [zfs]
[  365.184382]  [<ffffffff8118eb04>] vfs_create+0xb4/0x120
[  365.184384]  [<ffffffff8118ff4b>] do_last+0x8ab/0xa10
[  365.184388]  [<ffffffff81191399>] path_openat+0xd9/0x430
[  365.184393]  [<ffffffff81689ad6>] ? ftrace_call+0x5/0x2b
[  365.184396]  [<ffffffff81191811>] do_filp_open+0x41/0xa0
[  365.184400]  [<ffffffff8119e906>] ? alloc_fd+0xc6/0x110
[  365.184403]  [<ffffffff81181355>] do_sys_open+0xf5/0x230
[  365.184406]  [<ffffffff811814b1>] sys_open+0x21/0x30
[  365.184408]  [<ffffffff81689d29>] system_call_fastpath+0x16/0x1b
[  365.184410] ---[ end trace c6ce396af0377dc6 ]---
[  365.184424] general protection fault: 0000 [#1] SMP 
[  365.184428] CPU 10 
[  365.184429] Modules linked in: zfs(PO) zcommon(PO) zunicode(PO) znvpair(PO) zavl(PO) splat(O) spl(O) zlib_deflate stap_1e31e8a81d4a0393e2aa39814f900635_2290(O) parport_pc rfcomm bnep bluetooth ppdev kvm snd_hda_intel snd_hda_codec snd_hwdep snd_pcm snd_seq_midi microcode snd_rawmidi snd_seq_midi_event snd_seq snd_timer psmouse snd_seq_device serio_raw snd soundcore virtio_balloon snd_page_alloc i2c_piix4 mac_hid lp parport floppy
[  365.184453] 
[  365.184455] Pid: 2843, comm: touch Tainted: P        W  O 3.5.0-17-generic #28-Ubuntu Bochs Bochs
[  365.184461] RIP: 0010:[<ffffffffa03d1d3e>]  [<ffffffffa03d1d3e>] zfs_inode_destroy+0x9e/0x1e0 [zfs]
[  365.184488] RSP: 0018:ffff880051de7708  EFLAGS: 00010292
[  365.184489] RAX: ffff8800527276b0 RBX: ffff8800527276d8 RCX: dead000000100100
[  365.184491] RDX: dead000000200200 RSI: dead000000200200 RDI: ffff880052710530
[  365.184492] RBP: ffff880051de7748 R08: f018000000000000 R09: 00527277780c0000
[  365.184493] R10: ff8f8d9d2491de03 R11: 0000000000000006 R12: ffff880052710000
[  365.184494] R13: ffff880052727528 R14: ffff880052710530 R15: 0000000000000000
[  365.184496] FS:  00007f7389e0e700(0000) GS:ffff88007fd40000(0000) knlGS:0000000000000000
[  365.184498] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  365.184499] CR2: 0000000000407b10 CR3: 00000000791d9000 CR4: 00000000000006e0
[  365.184503] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  365.184506] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  365.184508] Process touch (pid: 2843, threadinfo ffff880051de6000, task ffff880052708000)
[  365.184509] Stack:
[  365.184510]  ffff880051de7778 ffffffff81689ad6 ffffffffa03eed70 0000000000000006
[  365.184514]  ffff8800527276d8 ffff880052727760 ffffffffa0402fe0 ffffffffa0402fe0
[  365.184518]  ffff880051de7778 ffffffffa03eed93 ffff880051de7778 ffffffff8119c19e
[  365.184522] Call Trace:
[  365.184525]  [<ffffffff81689ad6>] ? ftrace_call+0x5/0x2b
[  365.184547]  [<ffffffffa03eed70>] ? zpl_dirty_inode+0x10/0x10 [zfs]
[  365.184569]  [<ffffffffa03eed93>] zpl_inode_destroy+0x23/0x70 [zfs]
[  365.184572]  [<ffffffff8119c19e>] ? __destroy_inode+0x2e/0xf0
[  365.184575]  [<ffffffff8119c29c>] destroy_inode+0x3c/0x70
[  365.184577]  [<ffffffff8119c3f3>] evict+0x123/0x1b0
[  365.184580]  [<ffffffff8119c589>] iput+0x109/0x210
[  365.184602]  [<ffffffffa03d263b>] zfs_znode_alloc+0x5cb/0x780 [zfs]
[  365.184621]  [<ffffffffa03634c6>] ? sa_idx_tab_rele+0x66/0x1b0 [zfs]
[  365.184645]  [<ffffffffa03d30b4>] zfs_mknode+0x8c4/0xed0 [zfs]
[  365.184668]  [<ffffffffa03cb98a>] zfs_create+0x59a/0x760 [zfs]
[  365.184690]  [<ffffffffa03ee7a8>] zpl_create+0xa8/0x1d0 [zfs]
[  365.184693]  [<ffffffff8118eb04>] vfs_create+0xb4/0x120
[  365.184695]  [<ffffffff8118ff4b>] do_last+0x8ab/0xa10
[  365.184699]  [<ffffffff81191399>] path_openat+0xd9/0x430
[  365.184702]  [<ffffffff81689ad6>] ? ftrace_call+0x5/0x2b
[  365.184705]  [<ffffffff81191811>] do_filp_open+0x41/0xa0
[  365.184708]  [<ffffffff8119e906>] ? alloc_fd+0xc6/0x110
[  365.184710]  [<ffffffff81181355>] do_sys_open+0xf5/0x230
[  365.184713]  [<ffffffff811814b1>] sys_open+0x21/0x30
[  365.184716]  [<ffffffff81689d29>] system_call_fastpath+0x16/0x1b
[  365.184717] Code: 84 24 18 05 00 00 0f 84 13 01 00 00 4c 89 e8 49 03 84 24 10 05 00 00 48 be 00 02 20 00 00 00 ad de 4c 89 f7 48 8b 08 48 8b 50 08 <48> 89 51 08 48 89 0a 48 b9 00 01 10 00 00 00 ad de 48 89 08 48 
[  365.184752] RIP  [<ffffffffa03d1d3e>] zfs_inode_destroy+0x9e/0x1e0 [zfs]
[  365.184773]  RSP <ffff880051de7708>
[  365.184776] ---[ end trace c6ce396af0377dc7 ]---

The WARN from unlock_new_inode() fires because the inode doesn't have the I_NEW flag set:

        WARN_ON(!(inode->i_state & I_NEW));                                     

Regarding the GPF, zfs_inode_destroy+0x9e resolves to __list_del(), so it seems that zfs_inode_destroy() is calling list_remove() on a corrupt zsb->z_all_znodes list.
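
Note that the register values in the trace (0xdead000000100100 and 0xdead000000200200) are the kernel's LIST_POISON1/LIST_POISON2 pointers, which list_del() writes into an entry after unlinking it, so this looks like a second removal of an already-removed list entry. A toy Python model of the kernel's doubly linked list (hypothetical, for illustration only; not the actual ZFS/kernel code) shows why that blows up:

```python
# Stand-ins for the kernel's poison pointers (0xdead000000100100 /
# 0xdead000000200200); dereferencing them faults, just as assigning
# through a string attribute fails here.
LIST_POISON1 = "poison1"
LIST_POISON2 = "poison2"

class ListHead:
    """Circular doubly linked list node, like struct list_head."""
    def __init__(self):
        self.prev = self.next = self

def list_add(node, head):
    node.next = head.next
    node.prev = head
    head.next.prev = node
    head.next = node

def list_del(node):
    # __list_del(): unlink, then poison the pointers like the kernel does
    node.prev.next = node.next   # faults if node.prev is already poisoned
    node.next.prev = node.prev
    node.next = LIST_POISON1
    node.prev = LIST_POISON2

head = ListHead()
n = ListHead()
list_add(n, head)
list_del(n)          # fine: the list is empty again
caught = False
try:
    list_del(n)      # double removal dereferences the poison values
except AttributeError:
    caught = True    # the kernel instead takes a general protection fault
```

In the kernel there is no exception to catch; dereferencing the poison pointer is the GPF seen in the trace.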

Finally, I've determined that insert_inode_locked() is returning EBUSY when called from zfs_znode_alloc(), causing control to jump to the error label:

error:
        unlock_new_inode(ip);
        iput(ip);
        return NULL;

where unlock_new_inode() triggers the WARN and iput() leads to the GPF.

@behlendorf
Contributor

@Pinkbyte If you get a chance, can you test #1214? It addresses your rollback issue. It passes all of my testing, but a little more never hurts.

behlendorf added a commit to behlendorf/zfs that referenced this issue Jan 17, 2013
Rolling back a mounted filesystem with open file handles and
cached dentries+inodes never worked properly in ZoL.  The
major issue was that Linux provides no easy mechanism for
modules to invalidate the inode cache for a file system.

Because of this it was possible that an inode from the previous
filesystem would not get properly dropped from the cache during
the rollback.  Then a new inode with the same inode number would
be created and collide with the existing cached inode.  Ideally
this would trigger a VERIFY() but in practice the error wasn't
handled and it would just dereference NULL.

Luckily, this issue can be resolved by sprucing up the existing
Solaris zfs_rezget() functionality for the Linux VFS.

The way it works now is that when a file system is rolled back
all the cached inodes will be traversed and refetched from disk.
If a version of the cached inode exists on disk the in-core
copy will be updated accordingly.  If there is no match for that
object on disk it will be unhashed from the inode cache and
marked as stale.

This will effectively make the inode unfindable for lookups
allowing the inode number to be immediately recycled.  The inode
will then only be accessible from the cached dentries.  Subsequent
dentry lookups which reference a stale inode will result in the
dentry being invalidated.  Once invalidated the dentry will drop
its reference on the inode allowing it to be safely pruned from
the cache.

Special care is taken for negative dentries since they do not
reference any inode.  These dentries will be invalidated based
on when they were added to the dentry cache.  Entries added
before the last rollback will be invalidated to prevent them
from masking real files in the dataset.

Two nice side effects of this fix are:

* Removes the dependency on spl_invalidate_inodes(), it can now
  be safely removed from the SPL when we choose to do so.

* zfs_znode_alloc() no longer requires a dentry to be passed.
  This effectively reverts this portion of the code to its
  upstream counterpart.  The dentry is now instantiated more
  correctly in the Linux ZPL layer.

Signed-off-by: Brian Behlendorf <[email protected]>
Issue openzfs#795
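
The invalidation scheme the commit message describes can be sketched as a toy model (hypothetical Python, not the actual zfs_rezget() code): on rollback, every cached inode is refetched from disk; matches are updated in place, non-matches are unhashed and marked stale so their inode numbers can be recycled immediately.

```python
class Inode:
    def __init__(self, num, data):
        self.num, self.data, self.stale = num, data, False

class InodeCache:
    def __init__(self):
        self.hashed = {}          # inode cache, keyed by inode number

    def rollback(self, on_disk):
        # Traverse all cached inodes and refetch each one from disk.
        for num, ino in list(self.hashed.items()):
            if num in on_disk:
                ino.data = on_disk[num]   # update the in-core copy
            else:
                del self.hashed[num]      # unhash: lookups can't find it
                ino.stale = True          # dentries will drop it lazily

cache = InodeCache()
cache.hashed[1] = Inode(1, "kept")
cache.hashed[2] = Inode(2, "created-after-snapshot")
survivor = cache.hashed[1]
victim = cache.hashed[2]

# Roll back: on disk, only object 1 exists (in its snapshot version).
cache.rollback(on_disk={1: "snapshot-version"})
```

After the rollback, inode 1 survives with refreshed contents, while inode 2 is unhashed and stale, so inode number 2 cannot collide with a newly created file, which is exactly the collision the original bug hit.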
behlendorf added a commit to openzfs/spl that referenced this issue Jan 17, 2013
This functionality is no longer required by ZFS, see commit
openzfs/zfs@7b3e34b.
Since there are no other consumers, and because it adds
additional autoconf complexity which must be maintained,
the spl_invalidate_inodes() function has been removed.

Signed-off-by: Brian Behlendorf <[email protected]>
Issue openzfs/zfs#795
behlendorf added a commit to behlendorf/spl that referenced this issue Jan 18, 2013
unya pushed a commit to unya/zfs that referenced this issue Dec 13, 2013
Signed-off-by: Brian Behlendorf <[email protected]>
Signed-off-by: Ned Bass <[email protected]>
Closes openzfs#795
pcd1193182 pushed a commit to pcd1193182/zfs that referenced this issue Sep 26, 2023
Blocks are ingested to the zettacache in 2 cases:
* when writing a new object, we add all its blocks to the zettacache
* when reading a block which is not in the zettacache, we get the object
  and add all its (not-already-present) blocks to the zettacache

In both cases, we are adding many blocks (up to ~300) that are part of
the same object.  The current code mostly handles each block
individually, causing repeated work, especially to lock and unlock
various data structures.

This commit streamlines the batch insertion of all the
(not-already-present) blocks in an object.  There are 3 main aspects to
this:
* bulk lookup: `zettacache::Inner::lookup_all_impl()` takes a list of
  keys and executes a callback for each of them, providing the
  IndexValue.  The Locked lock is obtained at most twice.
* bulk insert: `zettacache::Inner::insert_all_impl()` takes a list of
  keys and buffers, and writes all of them to disk.  The Locked lock is
  obtained once.
* bulk LockedKey: the new `RangeLock` is used to lock the range of keys
  covered by the object.  Only one lock is obtained for the object,
  instead of one lock from the LockSet for each block.

The performance of ingesting via reading random blocks is improved by
100-200% (performance is 2-3x what it was before).
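
The lock-amortization idea behind the bulk paths above can be sketched with a toy model (hypothetical Python names, not the actual zettacache code): per-block insertion takes the lock once per block, while bulk insertion takes it once per object.

```python
import threading

class ToyCache:
    def __init__(self):
        self._lock = threading.Lock()
        self.lock_acquisitions = 0    # count lock round-trips
        self.blocks = {}

    def insert_one(self, key, buf):
        # Per-block path: one lock acquisition per block.
        with self._lock:
            self.lock_acquisitions += 1
            self.blocks.setdefault(key, buf)

    def insert_all(self, entries):
        # Bulk path: the lock is obtained once for the whole object.
        with self._lock:
            self.lock_acquisitions += 1
            for key, buf in entries:
                self.blocks.setdefault(key, buf)

# Ingest the ~300 blocks of one object both ways.
per_block = ToyCache()
for i in range(300):
    per_block.insert_one(i, b"x")

bulk = ToyCache()
bulk.insert_all([(i, b"x") for i in range(300)])
```

The cached state ends up identical, but the per-block path pays 300 lock round-trips where the bulk path pays one, which is the repeated work the commit eliminates.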