
SPL Panic during ZFS Receive #12270

Open
pquan opened this issue Jun 23, 2021 · 3 comments
Labels
Bot: Not Stale (Override for the stale bot) · Status: Triage Needed (New issue which needs to be triaged) · Type: Defect (Incorrect behavior, e.g. crash, hang)

Comments


pquan commented Jun 23, 2021

System information

Type                  Version/Name
Distribution Name     Proxmox VE
Distribution Version  6.4
Linux Kernel          5.4.119-1-pve
Architecture          X86
ZFS Version           2.0.4-1
SPL Version           2.0.4-1

Describe the problem you're observing

zfs receive hangs, resulting in a huge load on the server. Access to ZFS datasets is no longer possible.

Describe how to reproduce the problem

Not easy to replicate; it seems to happen at random. This is a backup server: it receives into encrypted datasets from an unencrypted source and, at the same time, lists/deletes old snapshots. It holds ~3 TB across 10+ datasets and around 1000 snapshots.

It could depend on load or hard-disk latency (the 2 TB disks are arranged as 2x2 mirror vdevs); the IOPS are modest. The workload is sketched below.
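
For illustration, a minimal sketch of that concurrent workload (dataset names, snapshot names, and the retention count are hypothetical, not taken from this server):

# 1) An incremental stream arriving over ssh is received into the
#    encrypted target (resumable, with forced rollback):
ssh sender 'zfs send -I data/mail@old data/mail@new' | zfs receive -s -F rpool/backup/mail

# 2) Meanwhile, a pruning job lists and destroys the oldest snapshots
#    of the same dataset:
zfs list -H -t snapshot -o name -s creation rpool/backup/mail \
  | head -n 50 \
  | xargs -r -n 1 zfs destroy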

Include any warning/errors/backtraces from the system logs

No warnings in the logs, only the stack trace below.
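
(Blocked-task stacks like the hung-task reports further down can also be dumped on demand via sysrq, assuming sysrq is enabled on the host:)

# Ask the kernel to log the stacks of all uninterruptible (blocked)
# tasks, then read them back from the ring buffer:
echo w > /proc/sysrq-trigger
dmesg | tail -n 100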

[784093.123687] VERIFY3(0 == dmu_bonus_hold_by_dnode(dn, FTAG, &db, flags)) failed (0 == 5)
[784093.124493] PANIC at dmu_recv.c:1811:receive_object()
[784093.125228] Showing stack for process 9871
[784093.126076] CPU: 0 PID: 9871 Comm: receive_writer Tainted: P           O      5.4.119-1-pve #1
[784093.126842] Hardware name: ASUSTeK COMPUTER INC. P9D-M Series/P9D-M Series, BIOS 2201 07/17/2019
[784093.127570] Call Trace:
[784093.128360]  dump_stack+0x6d/0x8b
[784093.129049]  spl_dumpstack+0x29/0x2b [spl]
[784093.129926]  spl_panic+0xd3/0xfb [spl]
[784093.130813]  ? _cond_resched+0x19/0x30
[784093.131470]  ? down_read+0x12/0xa0
[784093.132156]  ? dbuf_read+0x1c9/0x510 [zfs]
[784093.132817]  ? dmu_bonus_hold_by_dnode+0x152/0x1b0 [zfs]
[784093.133480]  ? kmem_asprintf+0x90/0xb0 [spl]
[784093.134141]  receive_object+0x981/0xd10 [zfs]
[784093.134746]  ? _cond_resched+0x19/0x30
[784093.135404]  ? mutex_lock+0x12/0x30
[784093.135999]  receive_writer_thread+0x279/0xa00 [zfs]
[784093.136568]  ? spl_kmem_free+0x33/0x40 [spl]
[784093.137117]  ? receive_process_write_record+0x180/0x180 [zfs]
[784093.137636]  thread_generic_wrapper+0x74/0x90 [spl]
[784093.138182]  ? receive_process_write_record+0x180/0x180 [zfs]
[784093.138705]  ? thread_generic_wrapper+0x74/0x90 [spl]
[784093.139213]  kthread+0x120/0x140
[784093.139715]  ? __thread_exit+0x20/0x20 [spl]
[784093.140219]  ? kthread_park+0x90/0x90
[784093.140720]  ret_from_fork+0x35/0x40
[784320.319731] INFO: task txg_quiesce:1020 blocked for more than 120 seconds.
[784320.320296]       Tainted: P           O      5.4.119-1-pve #1
[784320.320845] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[784320.321403] txg_quiesce     D    0  1020      2 0x80004000
[784320.321917] Call Trace:
[784320.322429]  __schedule+0x2e6/0x700
[784320.322921]  schedule+0x33/0xa0
[784320.323416]  cv_wait_common+0x104/0x130 [spl]
[784320.324003]  ? wait_woken+0x80/0x80
[784320.324555]  __cv_wait+0x15/0x20 [spl]
[784320.325082]  txg_quiesce_thread+0x2ba/0x3b0 [zfs]
[784320.325594]  ? txg_sync_thread+0x4a0/0x4a0 [zfs]
[784320.326066]  thread_generic_wrapper+0x74/0x90 [spl]
[784320.326527]  kthread+0x120/0x140
[784320.326974]  ? __thread_exit+0x20/0x20 [spl]
[784320.327406]  ? kthread_park+0x90/0x90
[784320.327875]  ret_from_fork+0x35/0x40
[784320.328414] INFO: task receive_writer:9871 blocked for more than 120 seconds.
[784320.328814]       Tainted: P           O      5.4.119-1-pve #1
[784320.329212] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[784320.329625] receive_writer  D    0  9871      2 0x80004000
[784320.330001] Call Trace:
[784320.330366]  __schedule+0x2e6/0x700
[784320.330727]  schedule+0x33/0xa0
[784320.331094]  spl_panic+0xf9/0xfb [spl]
[784320.331438]  ? _cond_resched+0x19/0x30
[784320.331841]  ? down_read+0x12/0xa0
[784320.332274]  ? dbuf_read+0x1c9/0x510 [zfs]
[784320.332628]  ? dmu_bonus_hold_by_dnode+0x152/0x1b0 [zfs]
[784320.332967]  ? kmem_asprintf+0x90/0xb0 [spl]
[784320.333325]  receive_object+0x981/0xd10 [zfs]
[784320.333665]  ? _cond_resched+0x19/0x30
[784320.334004]  ? mutex_lock+0x12/0x30
[784320.334361]  receive_writer_thread+0x279/0xa00 [zfs]
[784320.334712]  ? spl_kmem_free+0x33/0x40 [spl]
[784320.335078]  ? receive_process_write_record+0x180/0x180 [zfs]
[784320.335428]  thread_generic_wrapper+0x74/0x90 [spl]
[784320.335860]  ? receive_process_write_record+0x180/0x180 [zfs]
[784320.336248]  ? thread_generic_wrapper+0x74/0x90 [spl]
[784320.336598]  kthread+0x120/0x140
[784320.336944]  ? __thread_exit+0x20/0x20 [spl]
[784320.337287]  ? kthread_park+0x90/0x90
[784320.337627]  ret_from_fork+0x35/0x40
[784441.151508] INFO: task txg_quiesce:1020 blocked for more than 241 seconds.
pquan added the Status: Triage Needed and Type: Defect labels on Jun 23, 2021
kernelOfTruth (Contributor) commented

> This is a backup server. It receives into encrypted datasets from an unencrypted source. At the same time, it lists/deletes old snapshots.

Could you please post the exact commands that you used on the sending/receiving side?

(Of course, leaving out any sensitive information/data.)

Also, what are the custom settings on the pools (if applicable)?

pquan (Author) commented Jun 24, 2021

The sender uses:

/usr/sbin/syncoid --compress=lzo --no-sync-snap --mbuffer-size=256M data/mail root@remote-server:rpool/backup/mail

This results in the following commands.

On the sender:

sh -c zfs send  -I 'data/mail'@'autosnap_2021-06-24_08:00:01_hourly' 'data/mail'@'autosnap_2021-06-24_09:00:01_hourly' | pv -p -t -e -r -b -s 200435704 | lzop  | mbuffer  -q -s 128k -m 256M 2>/dev/null | ssh     -S /tmp/syncoid-root-root@remote-server-1624520294 root@remote-server ' mbuffer  -q -s 128k -m 256M 2>/dev/null | lzop -dfc |  zfs receive  -s -F '"'"'rpool/backup/mail'"'"' 2>&1'

And executed on the receiver:

ssh -S /tmp/syncoid-root-root@remote-server-1624520294 root@remote-server  mbuffer  -q -s 128k -m 256M 2>/dev/null | lzop -dfc |  zfs receive  -s -F 'rpool/backup/mail' 2>&1
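
Stripped of the syncoid plumbing (pv, lzop, mbuffer, and the ssh control socket), the pipeline is essentially the sketch below. Note that the send has no -w/--raw flag, so the stream travels as plaintext and is re-encrypted on write by the receiving (encrypted) dataset:

# Incremental stream between two hourly snapshots, received resumably
# (-s) with forced rollback (-F) into the encrypted target:
zfs send -I 'data/mail@autosnap_2021-06-24_08:00:01_hourly' \
    'data/mail@autosnap_2021-06-24_09:00:01_hourly' \
  | ssh root@remote-server 'zfs receive -s -F rpool/backup/mail'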

A sample sending pool is this (there are several on different servers):

# zpool get all data
NAME  PROPERTY                       VALUE                          SOURCE
data  size                           1.29T                          -
data  capacity                       90%                            -
data  altroot                        -                              default
data  health                         ONLINE                         -
data  version                        -                              default
data  bootfs                         -                              default
data  delegation                     on                             default
data  autoreplace                    off                            default
data  cachefile                      -                              default
data  failmode                       wait                           default
data  listsnapshots                  off                            default
data  autoexpand                     off                            default
data  dedupditto                     0                              default
data  dedupratio                     1.00x                          -
data  free                           123G                           -
data  allocated                      1.17T                          -
data  readonly                       off                            -
data  ashift                         12                             local
data  comment                        -                              default
data  expandsize                     -                              -
data  freeing                        108M                           -
data  fragmentation                  72%                            -
data  leaked                         0                              -
data  multihost                      off                            default
data  checkpoint                     -                              -
data  autotrim                       off                            default
data  feature@async_destroy          enabled                        local
data  feature@empty_bpobj            active                         local
data  feature@lz4_compress           active                         local
data  feature@multi_vdev_crash_dump  enabled                        local
data  feature@spacemap_histogram     active                         local
data  feature@enabled_txg            active                         local
data  feature@hole_birth             active                         local
data  feature@extensible_dataset     active                         local
data  feature@embedded_data          active                         local
data  feature@bookmarks              enabled                        local
data  feature@filesystem_limits      enabled                        local
data  feature@large_blocks           enabled                        local
data  feature@large_dnode            enabled                        local
data  feature@sha512                 enabled                        local
data  feature@skein                  enabled                        local
data  feature@edonr                  enabled                        local
data  feature@userobj_accounting     active                         local
data  feature@encryption             enabled                        local
data  feature@project_quota          active                         local
data  feature@device_removal         enabled                        local
data  feature@obsolete_counts        enabled                        local
data  feature@zpool_checkpoint       enabled                        local
data  feature@spacemap_v2            active                         local
data  feature@allocation_classes     enabled                        local
data  feature@resilver_defer         enabled                        local
data  feature@bookmark_v2            enabled                        local

While the receiver pool is this:

# zpool get all rpool
NAME   PROPERTY                       VALUE                          SOURCE
rpool  size                           3.62T                          -
rpool  capacity                       70%                            -
rpool  altroot                        -                              default
rpool  health                         ONLINE                         -
rpool  version                        -                              default
rpool  bootfs                         rpool/ROOT/pve-1               local
rpool  delegation                     on                             default
rpool  autoreplace                    off                            default
rpool  cachefile                      -                              default
rpool  failmode                       wait                           default
rpool  listsnapshots                  off                            default
rpool  autoexpand                     off                            default
rpool  dedupratio                     1.00x                          -
rpool  free                           1.07T                          -
rpool  allocated                      2.56T                          -
rpool  readonly                       off                            -
rpool  ashift                         12                             local
rpool  comment                        -                              default
rpool  expandsize                     -                              -
rpool  freeing                        0                              -
rpool  fragmentation                  34%                            -
rpool  leaked                         0                              -
rpool  multihost                      off                            default
rpool  checkpoint                     -                              -
rpool  autotrim                       off                            default
rpool  feature@async_destroy          enabled                        local
rpool  feature@empty_bpobj            active                         local
rpool  feature@lz4_compress           active                         local
rpool  feature@multi_vdev_crash_dump  enabled                        local
rpool  feature@spacemap_histogram     active                         local
rpool  feature@enabled_txg            active                         local
rpool  feature@hole_birth             active                         local
rpool  feature@extensible_dataset     active                         local
rpool  feature@embedded_data          active                         local
rpool  feature@bookmarks              enabled                        local
rpool  feature@filesystem_limits      enabled                        local
rpool  feature@large_blocks           enabled                        local
rpool  feature@large_dnode            active                         local
rpool  feature@sha512                 enabled                        local
rpool  feature@skein                  enabled                        local
rpool  feature@edonr                  enabled                        local
rpool  feature@userobj_accounting     active                         local
rpool  feature@encryption             active                         local
rpool  feature@project_quota          active                         local
rpool  feature@device_removal         enabled                        local
rpool  feature@obsolete_counts        enabled                        local
rpool  feature@zpool_checkpoint       enabled                        local
rpool  feature@spacemap_v2            active                         local
rpool  feature@allocation_classes     enabled                        local
rpool  feature@resilver_defer         enabled                        local
rpool  feature@bookmark_v2            enabled                        local
rpool  feature@redaction_bookmarks    disabled                       local
rpool  feature@redacted_datasets      disabled                       local
rpool  feature@bookmark_written       disabled                       local
rpool  feature@log_spacemap           disabled                       local
rpool  feature@livelist               disabled                       local
rpool  feature@device_rebuild         disabled                       local
rpool  feature@zstd_compress          disabled                       local

While the receiving dataset is:

# zfs get all rpool/backup
NAME                PROPERTY              VALUE                  SOURCE
rpool/backup  type                  filesystem             -
rpool/backup  creation              Wed Apr  7 19:37 2021  -
rpool/backup  used                  2.45T                  -
rpool/backup  available             972G                   -
rpool/backup  referenced            240K                   -
rpool/backup  compressratio         1.22x                  -
rpool/backup  mounted               yes                    -
rpool/backup  quota                 none                   default
rpool/backup  reservation           none                   default
rpool/backup  recordsize            128K                   default
rpool/backup  mountpoint            /rpool/backup          default
rpool/backup  sharenfs              off                    default
rpool/backup  checksum              on                     default
rpool/backup  compression           lz4                    inherited from rpool
rpool/backup  atime                 off                    inherited from rpool
rpool/backup  devices               on                     default
rpool/backup  exec                  on                     default
rpool/backup  setuid                on                     default
rpool/backup  readonly              off                    default
rpool/backup  zoned                 off                    default
rpool/backup  snapdir               hidden                 default
rpool/backup  aclmode               discard                default
rpool/backup  aclinherit            restricted             default
rpool/backup  createtxg             299                    -
rpool/backup  canmount              on                     default
rpool/backup  xattr                 on                     default
rpool/backup  copies                1                      default
rpool/backup  version               5                      -
rpool/backup  utf8only              off                    -
rpool/backup  normalization         none                   -
rpool/backup  casesensitivity       sensitive              -
rpool/backup  vscan                 off                    default
rpool/backup  nbmand                off                    default
rpool/backup  sharesmb              off                    default
rpool/backup  refquota              none                   default
rpool/backup  refreservation        none                   default
rpool/backup  primarycache          all                    default
rpool/backup  secondarycache        all                    default
rpool/backup  usedbysnapshots       0B                     -
rpool/backup  usedbydataset         240K                   -
rpool/backup  usedbychildren        2.45T                  -
rpool/backup  usedbyrefreservation  0B                     -
rpool/backup  logbias               latency                default
rpool/backup  objsetid              402                    -
rpool/backup  dedup                 off                    default
rpool/backup  mlslabel              none                   default
rpool/backup  sync                  standard               inherited from rpool
rpool/backup  dnodesize             legacy                 default
rpool/backup  refcompressratio      1.00x                  -
rpool/backup  written               240K                   -
rpool/backup  logicalused           2.83T                  -
rpool/backup  logicalreferenced     86K                    -
rpool/backup  volmode               default                default
rpool/backup  filesystem_limit      none                   default
rpool/backup  snapshot_limit        none                   default
rpool/backup  filesystem_count      none                   default
rpool/backup  snapshot_count        none                   default
rpool/backup  snapdev               hidden                 default
rpool/backup  acltype               off                    default
rpool/backup  context               none                   default
rpool/backup  fscontext             none                   default
rpool/backup  defcontext            none                   default
rpool/backup  rootcontext           none                   default
rpool/backup  relatime              off                    default
rpool/backup  redundant_metadata    all                    default
rpool/backup  overlay               on                     default
rpool/backup  encryption            aes-256-gcm            -
rpool/backup  keylocation           prompt                 local
rpool/backup  keyformat             passphrase             -
rpool/backup  pbkdf2iters           350000                 -
rpool/backup  encryptionroot        rpool/backup           -
rpool/backup  keystatus             available              -
rpool/backup  special_small_blocks  0                      default
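
For context, an encryption root with the properties above (aes-256-gcm, passphrase key, 350000 PBKDF2 iterations) would typically be created along these lines; this is a sketch, not the exact command used on this server:

# Create the encrypted receive target; datasets received below it
# inherit rpool/backup as their encryption root:
zfs create -o encryption=aes-256-gcm \
    -o keyformat=passphrase \
    -o pbkdf2iters=350000 \
    rpool/backup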


stale bot commented Jun 24, 2022

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

stale bot added the Status: Stale (No recent activity for issue) label on Jun 24, 2022
stale bot closed this as completed on Sep 24, 2022
ryao added the Bot: Not Stale (Override for the stale bot) label and removed the Status: Stale label on Sep 29, 2022
ryao reopened this on Sep 29, 2022