many errors "attempt to access beyond end of device" after upgrade to 0.7.10 #7906

Closed
japz opened this issue Sep 14, 2018 · 4 comments

@japz

japz commented Sep 14, 2018

System information

Type                  Version/Name
Distribution Name     CentOS
Distribution Version  7
Linux Kernel          3.10.0-862.11.6.el7.x86_64
Architecture          x86_64
ZFS Version           0.7.10
SPL Version           0.7.10

Describe the problem you're observing

After upgrading from 0.7.9 to 0.7.10, ZFS issues I/O to sectors beyond the end of the device, which causes disks to be marked as failed and the pool to become degraded or unavailable.

This looks related to #7724; if I read the commit logs correctly, the fix for that issue landed in 0.7.10.

Describe how to reproduce the problem

Upgrade from 0.7.9 to 0.7.10. The problem shows up only on my pool with 4Kn drives (pool1), not on pool2, whose drives have 512-byte sectors. Both pools were created with ZFS 0.7.9, and the whole disks were given to ZFS.

When I rolled back to 0.7.9, everything worked fine again.
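
To confirm which disks are 4Kn, something like the following shows the logical/physical sector sizes (a sketch; the lsblk column names may vary with the util-linux version, while the sysfs files are always present). 4Kn drives report 4096 for the logical size:

# lsblk -d -o NAME,LOG-SEC,PHY-SEC
# cat /sys/block/sdg/queue/logical_block_size /sys/block/sdg/queue/physical_block_size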

Include any warnings/errors/backtraces from the system logs

[   45.895934] sdg1: rw=0, want=23437749792, limit=23437606912
[   45.896157] attempt to access beyond end of device
[   45.896158] sda1: rw=14, want=23437749760, limit=23437606912
[   45.896189] attempt to access beyond end of device
[   45.896190] sda1: rw=14, want=23437750272, limit=23437606912
[   45.897621] attempt to access beyond end of device
[   45.897623] sda1: rw=0, want=23437749280, limit=23437606912
[   45.897625] attempt to access beyond end of device
[   45.897626] sda1: rw=0, want=23437749792, limit=23437606912
[   45.910790] sdf1: rw=0, want=23437749792, limit=23437606912
[   45.914647] attempt to access beyond end of device
[   45.915124] sdf1: rw=14, want=23437749504, limit=23437606912
[   45.915827] attempt to access beyond end of device
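
For scale, assuming want and limit are 512-byte sector offsets (as the block layer reports them), the rejected requests land roughly 70 MiB past the end of the partition:

# echo $(( (23437749792 - 23437606912) * 512 ))
73154560
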
[root@zapp log]# zpool status
  pool: pool1
 state: ONLINE
  scan: scrub repaired 0B in 18h38m with 0 errors on Tue Sep 11 06:52:18 2018
config:

	NAME                                   STATE     READ WRITE CKSUM
	pool1                                  ONLINE       0     0     0
	  raidz1-0                             ONLINE       0     0     0
	    ata-HGST_HUH721212ALN600_8DGK7L2H  ONLINE       0     0     0
	    ata-HGST_HUH721212ALN600_8DGZ0VBH  ONLINE       0     0     0
	    ata-HGST_HUH721212ALN604_8DG4BWAD  ONLINE       0     0     0
	    ata-HGST_HUH721212ALN604_8DG4S9YD  ONLINE       0     0     0
	    ata-HGST_HUH721212ALN600_8DGZ5XZH  ONLINE       0     0     0
	logs
	  pool1_log                            ONLINE       0     0     0
	cache
	  nvme-pool1_cache                     ONLINE       0     0     0
	  ssd2-pool1_cache                     ONLINE       0     0     0

errors: No known data errors

  pool: pool2
 state: ONLINE
  scan: resilvered 476M in 0h0m with 0 errors on Tue Sep  4 17:09:39 2018
config:

	NAME                                          STATE     READ WRITE CKSUM
	pool2                                         ONLINE       0     0     0
	  raidz1-0                                    ONLINE       0     0     0
	    ata-WDC_WD60EFRX-68L0BN1_WD-WX41D758N35U  ONLINE       0     0     0
	    ata-WDC_WD60EFRX-68MYMN1_WD-WX21DA4D1TDJ  ONLINE       0     0     0
	    ata-WDC_WD60EFRX-68MYMN1_WD-WX41D948Y8FY  ONLINE       0     0     0
	    ata-WDC_WD60EFRX-68L0BN1_WD-WX61D380T01V  ONLINE       0     0     0
	    ata-WDC_WD60EFRX-68L0BN1_WD-WX41D758NZ0C  ONLINE       0     0     0
	logs
	  pool2_log                                   ONLINE       0     0     0
	cache
	  nvme-pool2_cache                            ONLINE       0     0     0
	  ssd2-pool2_cache                            ONLINE       0     0     0
	spares
	  sdb                                         AVAIL

errors: No known data errors
pool1:
    version: 5000
    name: 'pool1'
    state: 0
    txg: 8808902
    pool_guid: 10993764123530153711
    errata: 0
    hostname: 'x'
    com.delphix:has_per_vdev_zaps
    vdev_children: 2
    vdev_tree:
        type: 'root'
        id: 0
        guid: 10993764123530153711
        children[0]:
            type: 'raidz'
            id: 0
            guid: 5505180680812326603
            nparity: 1
            metaslab_array: 256
            metaslab_shift: 39
            ashift: 12
            asize: 59999976161280
            is_log: 0
            create_txg: 4
            com.delphix:vdev_zap_top: 129
            children[0]:
                type: 'disk'
                id: 0
                guid: 12995749272061415635
                path: '/dev/disk/by-id/ata-HGST_HUH721212ALN600_8DGK7L2H-part1'
                devid: 'ata-HGST_HUH721212ALN600_8DGK7L2H-part1'
                phys_path: 'pci-0000:07:00.0-sas-0x300062b200f20bc5-lun-0'
                whole_disk: 1
                DTL: 910
                create_txg: 4
                com.delphix:vdev_zap_leaf: 130
            children[1]:
                type: 'disk'
                id: 1
                guid: 11533917232550691754
                path: '/dev/disk/by-id/ata-HGST_HUH721212ALN600_8DGZ0VBH-part1'
                devid: 'ata-HGST_HUH721212ALN600_8DGZ0VBH-part1'
                phys_path: 'pci-0000:07:00.0-sas-0x300062b200f20bc4-lun-0'
                whole_disk: 1
                DTL: 909
                create_txg: 4
                com.delphix:vdev_zap_leaf: 131
            children[2]:
                type: 'disk'
                id: 2
                guid: 7056770016647008821
                path: '/dev/disk/by-id/ata-HGST_HUH721212ALN604_8DG4BWAD-part1'
                devid: 'ata-HGST_HUH721212ALN604_8DG4BWAD-part1'
                phys_path: 'pci-0000:07:00.0-sas-0x300062b200f20bc7-lun-0'
                whole_disk: 1
                DTL: 908
                create_txg: 4
                com.delphix:vdev_zap_leaf: 132
            children[3]:
                type: 'disk'
                id: 3
                guid: 18008172160548078487
                path: '/dev/disk/by-id/ata-HGST_HUH721212ALN604_8DG4S9YD-part1'
                devid: 'ata-HGST_HUH721212ALN604_8DG4S9YD-part1'
                phys_path: 'pci-0000:07:00.0-sas-0x300062b200f20bc6-lun-0'
                whole_disk: 1
                DTL: 907
                create_txg: 4
                com.delphix:vdev_zap_leaf: 133
            children[4]:
                type: 'disk'
                id: 4
                guid: 6383227245028307072
                path: '/dev/disk/by-id/ata-HGST_HUH721212ALN600_8DGZ5XZH-part1'
                devid: 'ata-HGST_HUH721212ALN600_8DGZ5XZH-part1'
                phys_path: 'pci-0000:07:00.0-sas-0x300062b200f20bc2-lun-0'
                whole_disk: 1
                DTL: 389
                create_txg: 4
                com.delphix:vdev_zap_leaf: 302
        children[1]:
            type: 'disk'
            id: 1
            guid: 16610821259748890787
            path: '/dev/nvme/pool1_log'
            devid: 'dm-uuid-LVM-GTpdVXxCBIXsE88JMruHyfBhAVqhzbXOhBdfZZYAuwn0N35pNHNzADVABNMoc0OK'
            whole_disk: 0
            metaslab_array: 653
            metaslab_shift: 26
            ashift: 12
            asize: 10732699648
            is_log: 1
            DTL: 906
            create_txg: 401908
            com.delphix:vdev_zap_leaf: 793
            com.delphix:vdev_zap_top: 794
    features_for_read:
        com.delphix:hole_birth
        com.delphix:embedded_data
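
The configuration dump above looks like cachefile output from zdb; the command isn't shown in the report, but on 0.7.x running zdb with no arguments prints the cached configuration of each imported pool in this form:

# zdb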

I'm happy to provide any additional debug info.

@yerkanian

I've just been bitten by a similar issue after updating kmod-zfs-0.7.9-1.el7_4.x86_64 to 0.7.10-1.el7_4.x86_64 on CentOS 7.5. After the reboot, the members of one vdev were degraded, with one of them eventually faulted (the disks are HUS724040ALS640 with 512-byte sectors, but the pool was created with ashift=12):

        NAME                              STATE     READ WRITE CKSUM
        backups-1                         DEGRADED     0     0     0
          mirror-0                        DEGRADED     0     0     0
            scsi-35000cca03b32032c        FAULTED      0     0    17  too many errors
            scsi-35000cca03b678c80        DEGRADED     0     0     0  too many errors
          mirror-1                        ONLINE       0     0     0
            scsi-35000cca07334eff4        ONLINE       0     0     0
            scsi-35000cca05d6d4b04        ONLINE       0     0     0
          mirror-3                        ONLINE       0     0     0
            scsi-35000cca0739b40d0        ONLINE       0     0     0
            scsi-35000cca0739b5c80        ONLINE       0     0     0
        logs
          mirror-2                        ONLINE       0     0     0
            scsi-35000cca013025ed4-part1  ONLINE       0     0     0
            scsi-35000cca01308e9cc-part1  ONLINE       0     0     0

/var/log/messages is flooded with messages about access attempts beyond the partition boundary:

[11656.401721] sdo1: rw=0, want=7814016032, limit=7812458496
[11656.402606] attempt to access beyond end of device
[11656.402621] sdo1: rw=14, want=7814016256, limit=7812458496
[11656.408906] attempt to access beyond end of device
[11656.408908] sdo1: rw=0, want=7814015520, limit=7812458496
[11656.408911] attempt to access beyond end of device
[11656.408912] sdo1: rw=0, want=7814016032, limit=7812458496

Comparing the two affected disks to the other vdev members shows a smaller zfs partition:

# fdisk -l /dev/sdo
WARNING: fdisk GPT support is currently new, and therefore in an experimental phase. Use at your own discretion.

Disk /dev/sdo: 4000.8 GB, 4000787030016 bytes, 7814037168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk label type: gpt
Disk identifier: 263D54E3-E899-AE45-A3B3-7DD7A8A07D9A


#         Start          End    Size  Type            Name
 1         2048   7812460543    3.7T  Solaris /usr &  zfs-97bf78d81ae4c082
 9   7812460544   7812476927      8M  Solaris reserve

versus a good one:

# fdisk -l /dev/sda
WARNING: fdisk GPT support is currently new, and therefore in an experimental phase. Use at your own discretion.

Disk /dev/sda: 4000.8 GB, 4000787030016 bytes, 7814037168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk label type: gpt
Disk identifier: 8BD86933-F560-DB43-AEBD-59EA1B9E8E68


#         Start          End    Size  Type            Name
 1         2048   7814019071    3.7T  Solaris /usr &  zfs-ee30da11927f1c93
 9   7814019072   7814035455      8M  Solaris reserve

The affected disks show no errors in the grown defect list.
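
A quick sanity check on those numbers: the shrunken sdo1 spans sectors 2048 through 7812460543, i.e. 7812458496 sectors, which is exactly the "limit" in the kernel messages above, while the healthy layout (7814019071 - 2048 + 1 = 7814017024 sectors) would have accommodated the want values around 7814016000:

# echo $(( 7812460543 - 2048 + 1 )) $(( 7814019071 - 2048 + 1 ))
7812458496 7814017024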

@behlendorf
Contributor

@tonyhutter will be putting out a 0.7.11 release shortly which reverts the change. For the moment, you'll want to roll back to 0.7.9 on systems experiencing this issue.
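
On CentOS with the kmod packages, the rollback is roughly the following (a sketch; the exact package set and release suffixes depend on the configured repository, and the 0.7.9 packages must still be available there):

# yum downgrade zfs-0.7.9 kmod-zfs-0.7.9 spl-0.7.9 kmod-spl-0.7.9
# reboot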

@behlendorf
Contributor

Closing; 0.7.11 was released with the fix: https://github.com/zfsonlinux/zfs/releases/tag/zfs-0.7.11

@japz
Author

japz commented Sep 15, 2018

Thanks for the quick fix!

gentoo-bot pushed a commit to gentoo/gentoo that referenced this issue Sep 16, 2018
0.7.10 will not be committed due to a regression:

openzfs/zfs#7906

Package-Manager: Portage-2.3.40, Repoman-2.3.9
rincebrain added a commit to rincebrain/zfsonlinux.github.com that referenced this issue Sep 18, 2018