many errors "attempt to access beyond end of device" after upgrade to 0.7.10 #7906

Closed
japz opened this issue Sep 14, 2018 · 4 comments

@japz

japz commented Sep 14, 2018

System information

Type                  Version/Name
Distribution Name     CentOS
Distribution Version  7
Linux Kernel          3.10.0-862.11.6.el7.x86_64
Architecture          x86_64
ZFS Version           0.7.10
SPL Version           0.7.10

Describe the problem you're observing

After upgrading from 0.7.9 to 0.7.10, ZFS issues I/O to sectors beyond the end of the device, which causes disks to be marked as failed and the pool to become degraded or unavailable.

This looks related to #7724; if I read the commit logs correctly, the fix for that issue landed in 0.7.10.

Describe how to reproduce the problem

Upgrade from 0.7.9 to 0.7.10. The problem shows up only on my pool with 4Kn drives (pool1), not on pool2, whose drives have 512-byte sectors. Both pools were created with ZFS 0.7.9, and the whole disks were given to ZFS.

When I rolled back to 0.7.9, everything worked fine again.
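
To confirm which disks are 4Kn, something like the following shows the logical/physical sector sizes (a sketch; the lsblk column names may vary with the util-linux version, while the sysfs files are always present). 4Kn drives report 4096 for the logical size:

# lsblk -d -o NAME,LOG-SEC,PHY-SEC
# cat /sys/block/sdg/queue/logical_block_size /sys/block/sdg/queue/physical_block_size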

Include any warnings/errors/backtraces from the system logs

[   45.895934] sdg1: rw=0, want=23437749792, limit=23437606912
[   45.896157] attempt to access beyond end of device
[   45.896158] sda1: rw=14, want=23437749760, limit=23437606912
[   45.896189] attempt to access beyond end of device
[   45.896190] sda1: rw=14, want=23437750272, limit=23437606912
[   45.897621] attempt to access beyond end of device
[   45.897623] sda1: rw=0, want=23437749280, limit=23437606912
[   45.897625] attempt to access beyond end of device
[   45.897626] sda1: rw=0, want=23437749792, limit=23437606912
[   45.910790] sdf1: rw=0, want=23437749792, limit=23437606912
[   45.914647] attempt to access beyond end of device
[   45.915124] sdf1: rw=14, want=23437749504, limit=23437606912
[   45.915827] attempt to access beyond end of device
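
For scale, assuming want and limit are 512-byte sector offsets (as the block layer reports them), the rejected requests land roughly 70 MiB past the end of the partition:

# echo $(( (23437749792 - 23437606912) * 512 ))
73154560
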
[root@zapp log]# zpool status
  pool: pool1
 state: ONLINE
  scan: scrub repaired 0B in 18h38m with 0 errors on Tue Sep 11 06:52:18 2018
config:

	NAME                                   STATE     READ WRITE CKSUM
	pool1                                  ONLINE       0     0     0
	  raidz1-0                             ONLINE       0     0     0
	    ata-HGST_HUH721212ALN600_8DGK7L2H  ONLINE       0     0     0
	    ata-HGST_HUH721212ALN600_8DGZ0VBH  ONLINE       0     0     0
	    ata-HGST_HUH721212ALN604_8DG4BWAD  ONLINE       0     0     0
	    ata-HGST_HUH721212ALN604_8DG4S9YD  ONLINE       0     0     0
	    ata-HGST_HUH721212ALN600_8DGZ5XZH  ONLINE       0     0     0
	logs
	  pool1_log                            ONLINE       0     0     0
	cache
	  nvme-pool1_cache                     ONLINE       0     0     0
	  ssd2-pool1_cache                     ONLINE       0     0     0

errors: No known data errors

  pool: pool2
 state: ONLINE
  scan: resilvered 476M in 0h0m with 0 errors on Tue Sep  4 17:09:39 2018
config:

	NAME                                          STATE     READ WRITE CKSUM
	pool2                                         ONLINE       0     0     0
	  raidz1-0                                    ONLINE       0     0     0
	    ata-WDC_WD60EFRX-68L0BN1_WD-WX41D758N35U  ONLINE       0     0     0
	    ata-WDC_WD60EFRX-68MYMN1_WD-WX21DA4D1TDJ  ONLINE       0     0     0
	    ata-WDC_WD60EFRX-68MYMN1_WD-WX41D948Y8FY  ONLINE       0     0     0
	    ata-WDC_WD60EFRX-68L0BN1_WD-WX61D380T01V  ONLINE       0     0     0
	    ata-WDC_WD60EFRX-68L0BN1_WD-WX41D758NZ0C  ONLINE       0     0     0
	logs
	  pool2_log                                   ONLINE       0     0     0
	cache
	  nvme-pool2_cache                            ONLINE       0     0     0
	  ssd2-pool2_cache                            ONLINE       0     0     0
	spares
	  sdb                                         AVAIL

errors: No known data errors
pool1:
    version: 5000
    name: 'pool1'
    state: 0
    txg: 8808902
    pool_guid: 10993764123530153711
    errata: 0
    hostname: 'x'
    com.delphix:has_per_vdev_zaps
    vdev_children: 2
    vdev_tree:
        type: 'root'
        id: 0
        guid: 10993764123530153711
        children[0]:
            type: 'raidz'
            id: 0
            guid: 5505180680812326603
            nparity: 1
            metaslab_array: 256
            metaslab_shift: 39
            ashift: 12
            asize: 59999976161280
            is_log: 0
            create_txg: 4
            com.delphix:vdev_zap_top: 129
            children[0]:
                type: 'disk'
                id: 0
                guid: 12995749272061415635
                path: '/dev/disk/by-id/ata-HGST_HUH721212ALN600_8DGK7L2H-part1'
                devid: 'ata-HGST_HUH721212ALN600_8DGK7L2H-part1'
                phys_path: 'pci-0000:07:00.0-sas-0x300062b200f20bc5-lun-0'
                whole_disk: 1
                DTL: 910
                create_txg: 4
                com.delphix:vdev_zap_leaf: 130
            children[1]:
                type: 'disk'
                id: 1
                guid: 11533917232550691754
                path: '/dev/disk/by-id/ata-HGST_HUH721212ALN600_8DGZ0VBH-part1'
                devid: 'ata-HGST_HUH721212ALN600_8DGZ0VBH-part1'
                phys_path: 'pci-0000:07:00.0-sas-0x300062b200f20bc4-lun-0'
                whole_disk: 1
                DTL: 909
                create_txg: 4
                com.delphix:vdev_zap_leaf: 131
            children[2]:
                type: 'disk'
                id: 2
                guid: 7056770016647008821
                path: '/dev/disk/by-id/ata-HGST_HUH721212ALN604_8DG4BWAD-part1'
                devid: 'ata-HGST_HUH721212ALN604_8DG4BWAD-part1'
                phys_path: 'pci-0000:07:00.0-sas-0x300062b200f20bc7-lun-0'
                whole_disk: 1
                DTL: 908
                create_txg: 4
                com.delphix:vdev_zap_leaf: 132
            children[3]:
                type: 'disk'
                id: 3
                guid: 18008172160548078487
                path: '/dev/disk/by-id/ata-HGST_HUH721212ALN604_8DG4S9YD-part1'
                devid: 'ata-HGST_HUH721212ALN604_8DG4S9YD-part1'
                phys_path: 'pci-0000:07:00.0-sas-0x300062b200f20bc6-lun-0'
                whole_disk: 1
                DTL: 907
                create_txg: 4
                com.delphix:vdev_zap_leaf: 133
            children[4]:
                type: 'disk'
                id: 4
                guid: 6383227245028307072
                path: '/dev/disk/by-id/ata-HGST_HUH721212ALN600_8DGZ5XZH-part1'
                devid: 'ata-HGST_HUH721212ALN600_8DGZ5XZH-part1'
                phys_path: 'pci-0000:07:00.0-sas-0x300062b200f20bc2-lun-0'
                whole_disk: 1
                DTL: 389
                create_txg: 4
                com.delphix:vdev_zap_leaf: 302
        children[1]:
            type: 'disk'
            id: 1
            guid: 16610821259748890787
            path: '/dev/nvme/pool1_log'
            devid: 'dm-uuid-LVM-GTpdVXxCBIXsE88JMruHyfBhAVqhzbXOhBdfZZYAuwn0N35pNHNzADVABNMoc0OK'
            whole_disk: 0
            metaslab_array: 653
            metaslab_shift: 26
            ashift: 12
            asize: 10732699648
            is_log: 1
            DTL: 906
            create_txg: 401908
            com.delphix:vdev_zap_leaf: 793
            com.delphix:vdev_zap_top: 794
    features_for_read:
        com.delphix:hole_birth
        com.delphix:embedded_data
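
The configuration dump above looks like cachefile output from zdb; the command isn't shown in the report, but on 0.7.x running zdb with no arguments prints the cached configuration of each imported pool in this form:

# zdb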

I'm happy to provide any additional debug info.

@yerkanian

I've just been bitten by a similar issue after updating kmod-zfs-0.7.9-1.el7_4.x86_64 to 0.7.10-1.el7_4.x86_64 on CentOS 7.5. After the reboot, the members of one vdev were degraded, with one of them eventually faulted (the disks are HUS724040ALS640 with 512-byte sectors, but the pool was created with ashift=12):

        NAME                              STATE     READ WRITE CKSUM
        backups-1                         DEGRADED     0     0     0
          mirror-0                        DEGRADED     0     0     0
            scsi-35000cca03b32032c        FAULTED      0     0    17  too many errors
            scsi-35000cca03b678c80        DEGRADED     0     0     0  too many errors
          mirror-1                        ONLINE       0     0     0
            scsi-35000cca07334eff4        ONLINE       0     0     0
            scsi-35000cca05d6d4b04        ONLINE       0     0     0
          mirror-3                        ONLINE       0     0     0
            scsi-35000cca0739b40d0        ONLINE       0     0     0
            scsi-35000cca0739b5c80        ONLINE       0     0     0
        logs
          mirror-2                        ONLINE       0     0     0
            scsi-35000cca013025ed4-part1  ONLINE       0     0     0
            scsi-35000cca01308e9cc-part1  ONLINE       0     0     0

/var/log/messages is flooded with messages about access attempts beyond the partition boundary:

[11656.401721] sdo1: rw=0, want=7814016032, limit=7812458496
[11656.402606] attempt to access beyond end of device
[11656.402621] sdo1: rw=14, want=7814016256, limit=7812458496
[11656.408906] attempt to access beyond end of device
[11656.408908] sdo1: rw=0, want=7814015520, limit=7812458496
[11656.408911] attempt to access beyond end of device
[11656.408912] sdo1: rw=0, want=7814016032, limit=7812458496

Comparing the two affected disks to the other vdev members shows a smaller zfs partition:

# fdisk -l /dev/sdo
WARNING: fdisk GPT support is currently new, and therefore in an experimental phase. Use at your own discretion.

Disk /dev/sdo: 4000.8 GB, 4000787030016 bytes, 7814037168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk label type: gpt
Disk identifier: 263D54E3-E899-AE45-A3B3-7DD7A8A07D9A


#         Start          End    Size  Type            Name
 1         2048   7812460543    3.7T  Solaris /usr &  zfs-97bf78d81ae4c082
 9   7812460544   7812476927      8M  Solaris reserve

versus a good one:

# fdisk -l /dev/sda
WARNING: fdisk GPT support is currently new, and therefore in an experimental phase. Use at your own discretion.

Disk /dev/sda: 4000.8 GB, 4000787030016 bytes, 7814037168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk label type: gpt
Disk identifier: 8BD86933-F560-DB43-AEBD-59EA1B9E8E68


#         Start          End    Size  Type            Name
 1         2048   7814019071    3.7T  Solaris /usr &  zfs-ee30da11927f1c93
 9   7814019072   7814035455      8M  Solaris reserve

The affected disks show no errors in the grown defect list.
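
A quick sanity check on those numbers: the shrunken sdo1 spans sectors 2048 through 7812460543, i.e. 7812458496 sectors, which is exactly the "limit" in the kernel messages above, while the healthy layout (7814019071 - 2048 + 1 = 7814017024 sectors) would have accommodated the want values around 7814016000:

# echo $(( 7812460543 - 2048 + 1 )) $(( 7814019071 - 2048 + 1 ))
7812458496 7814017024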

@behlendorf
Contributor

@tonyhutter will be putting out a 0.7.11 release shortly which reverts the change. For the moment, you'll want to roll back to 0.7.9 on systems experiencing this issue.
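
On CentOS with the kmod packages, the rollback is roughly the following (a sketch; the exact package set and release suffixes depend on the configured repository, and the 0.7.9 packages must still be available there):

# yum downgrade zfs-0.7.9 kmod-zfs-0.7.9 spl-0.7.9 kmod-spl-0.7.9
# reboot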

@behlendorf
Contributor

Closing; 0.7.11 was released with the fix: https://github.com/zfsonlinux/zfs/releases/tag/zfs-0.7.11

@japz
Author

japz commented Sep 15, 2018

Thanks for the quick fix!

gentoo-bot pushed a commit to gentoo/gentoo that referenced this issue Sep 16, 2018
0.7.10 will not be committed due to a regression:

openzfs/zfs#7906

Package-Manager: Portage-2.3.40, Repoman-2.3.9
rincebrain added a commit to rincebrain/zfsonlinux.github.com that referenced this issue Sep 18, 2018