Draid rebase [WIP] #8016

Closed · wants to merge 3 commits

Conversation

@don-brady (Contributor) commented Oct 12, 2018

Motivation and Context

This is a WIP rebase of dRAID from @thegreatgazoo's PR-7078, to allow continued community development on a branch based on the most recent master.

Description

Rebasing Notes:

  • Conditionally disabled dRAID mirrored metaslabs (they were causing failed assertions).

  • Excluded dRAID pools from device removal operations.

  • Added dRAID-specific testing to ztest and zloop that exercises dRAID configurations across a range of settings (see zloop.sh for details).

  • Removed the linking of libzpool into libzfs (moved common dRAID config code into zcommon/draid_config.c).

How Has This Been Tested?

Used the new zloop and ztest options to verify the dRAID code base.
Performed some manual testing of dRAID configurations.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)

Checklist:

  • My code follows the ZFS on Linux code style requirements.
  • I have updated the documentation accordingly.
  • I have read the contributing document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.
  • All commit messages are properly formatted and contain Signed-off-by.
  • Change has been approved by a ZFS on Linux member.

Don Brady added 3 commits October 11, 2018 17:03
Included new command options to control ztest testing with dRAID:
  -K draid|raidz|random -- kind of RAID to test
  -D <value> -- dRAID data drives per redundancy group
  -G <value> -- dRAID redundancy group count
  -S <value> -- dRAID distributed spare drives
  -R <value> -- RAID parity (raidz or dRAID)

The above values are used to generate a dRAID config from the
draidcfg command. This also allows for a more comprehensive set
of configurations when used with zloop.  By default, dRAID will
be provisioned 50% of the time in ztest runs.

Added dRAID specific section to zloop to exercise dRAID across a
wide range of configuration parameters. For example:

 ztest -VVVV -K draid -D 7 -G 6 -S 1 -m 0 -r 1 -R 1 -v 0 -a 12 -s 384m

Updated the zloop ztest time range to be 30 - 120 seconds.

Signed-off-by: Don Brady <[email protected]>
Removed libzfs dependency on libzpool

The new source relationship is as follows:

  [include/draid_config.h]<--------+
    ^        ^         ^           |
    |        |         | (include) |
    |        |         |           |
 [zpool] [draidcfg] [ztest]        |
    |        |         |           |
    |        |         |  [vdev_draid_impl.h]
    +-+    +-+         |           ^
      |    |   (link)  |           |
      v    v           v           |
     [libzfs]     [libzpool]   [zfs-mod]
        |              |           |
        +-----+        | (object)  |
              |        |           |
              v        v           v
             [zcommon/draid_config.c]

Signed-off-by: Don Brady <[email protected]>
@ahrens added the Type: Feature and Status: Work in Progress labels on Oct 12, 2018
codecov bot commented Oct 12, 2018

Codecov Report

Merging #8016 into master will increase coverage by 11.34%.
The diff coverage is 57.92%.


@@             Coverage Diff             @@
##           master    #8016       +/-   ##
===========================================
+ Coverage   66.74%   78.08%   +11.34%     
===========================================
  Files         314      383       +69     
  Lines       97292   116167    +18875     
===========================================
+ Hits        64934    90714    +25780     
+ Misses      32358    25453     -6905
Flag       Coverage Δ
#kernel    77.64% <14.74%> (?)
#user      67.32% <57.79%> (+0.58%) ⬆️


@don-brady (Contributor, Author)

So clearly we need some ZTS test coverage (the 67% codecov user score is too low).

As for the ztest runs, it looks like it's finding some bugs. We need some volunteers to chase them down (see buildbot runs for details). They are all failed assertions. Below are the top 3:

spa_config_held(os->os_spa, SCL_ALL, RW_WRITER) == 0 || (spa_is_root(os->os_spa) && spa_config_held(os->os_spa, SCL_STATE, RW_WRITER))
ASSERT at ../../module/zfs/dnode.c:1285:dnode_hold_impl()/sbin/ztest(+0x8c13)[0x55e9f52f0c13]
offset < vd->vdev_psize - VDEV_LABEL_START_SIZE - VDEV_LABEL_END_SIZE (0xffffffffffc3f000 < 0x7b80000)
ASSERT at ../../module/zfs/vdev_draid.c:1310:vdev_dspare_get_child()/sbin/ztest(+0x8c13)[0x55bded087c13]
!msp->ms_rebuilding
ASSERT at ../../module/zfs/metaslab.c:3603:metaslab_free_concrete()/usr/sbin/ztest(+0x8d67)[0x55d9a6cf5d67]

@behlendorf mentioned this pull request on Oct 12, 2018
@naclosagc commented Oct 13, 2018

Hi.

Trying out the new rebase. It's a VM on an Intel(R) Xeon(R) CPU X5650; the VM has 16 cores and 16 GB of memory. I'm using a SCSI Virtio controller with 28 disks attached: 1 for boot and 27 for a pool.
Created the pool with 27 drives, parity 1, 2 spares, and 4 data drives. I was continuously copying data to it for a couple of hours, and it was behaving better than the last build did; the last one would get checksum errors in less than an hour. I stopped the copy with the pool fairly full and did a scrub, which finished fine. I then failed a drive and did a replace. The replace failed after just a few minutes with:

gentoo ~ # zpool replace tank scsi-0QEMU_QEMU_HARDDISK_drive-scsi3-0-0-1 '%draid1-0-s0'
Message from syslogd@gentoo at Fri Oct 12 20:33:45 2018 ...
gentoo kernel: [14202.454290] PANIC at metaslab.c:3603:metaslab_free_concrete()

Message from syslogd@gentoo at Fri Oct 12 20:33:45 2018 ...
gentoo kernel: [14202.454288] VERIFY(!msp->ms_rebuilding) failed

I attached the dmesg output. Let me know what else I need to provide.

P.S. I tried this twice, and the same thing happened. The first time, I did not have debug enabled or the suggested zfs parameters set; the second time I did.

P.P.S. It was a QEMU VM under libvirt, with Gentoo as the OS. Going to try Ubuntu under ESXi.

dmesg.txt

@don-brady (Contributor, Author)

@naclosagc Thanks for the report. That looks to be the same failed assertion that ztest is hitting.

@carlesmateo commented Oct 16, 2018

Created the pool with 27 drives, parity 1, 2 spares, 4 drive data. I was continuously copying data to it for a couple of hours, and it was acting better than the last one did. The last one would get checksum errors in less than an hour. I stopped the copy with the pool fairly full, and did a scrub. The scrub finished fine. I then failed a drive, and did a replace. The replace failed after just a few minutes with:

When you say "I then failed a drive", do you mean that you forced Linux to stop seeing the drive, i.e. you made it UNAVAIL on purpose, or are you reporting a bug where ZFS saw the drive as failed?
Can you indicate the exact steps you took to make the drive appear as failing?
Can you paste your zpool history?

Did you have the ZED service running?

@naclosagc commented Oct 16, 2018 via email

@naclosagc commented Oct 16, 2018 via email

@feltstykket

Trying out the rebase as well; the EXPANDSZ value is wrong on a test pool:
root@gaia:~# zpool list
NAME   SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
gaia  70.7T  1.50M  70.7T        -     16.0E     0%     0%  1.00x  ONLINE  -

root@gaia:~# zpool status gaia
pool: gaia
state: ONLINE
scan: none requested
config:

    NAME                                            STATE     READ WRITE CKSUM
    gaia                                            ONLINE       0     0     0
      draid1-0                                      ONLINE       0     0     0
        ata-Hitachi_HUA722020ALA330_JK11A8B9KEE58F  ONLINE       0     0     0
        ata-Hitachi_HUA722020ALA330_JK11A8B9KEYSUF  ONLINE       0     0     0
        ata-ST32000644NS_9WM1QR5H                   ONLINE       0     0     0
        ata-ST32000644NS_9WM1XP7W                   ONLINE       0     0     0
        ata-ST32000644NS_9WM1YJX3                   ONLINE       0     0     0
        ata-ST32000644NS_9WM1YLJC                   ONLINE       0     0     0
        ata-ST32000644NS_9WM1YLJG                   ONLINE       0     0     0
        ata-ST32000644NS_9WM1ZGLD                   ONLINE       0     0     0
        ata-ST32000644NS_9WM20JHL                   ONLINE       0     0     0
        ata-ST32000644NS_9WM2S91S                   ONLINE       0     0     0
        ata-ST32000644NS_9WM70FXN                   ONLINE       0     0     0
        ata-ST32000644NS_9WM72X94                   ONLINE       0     0     0
        ata-WDC_WD2000FYYZ-01UL1B0_WD-WCC1P0022601  ONLINE       0     0     0
        ata-WDC_WD2002FYPS-01U1B1_WD-WCAVY1425565   ONLINE       0     0     0
        ata-WDC_WD2002FYPS-01U1B1_WD-WCAVY2455644   ONLINE       0     0     0
        ata-WDC_WD2002FYPS-01U1B1_WD-WCAVY2523196   ONLINE       0     0     0
        ata-WDC_WD2002FYPS-01U1B1_WD-WCAVY2586569   ONLINE       0     0     0
        ata-WDC_WD2002FYPS-01U1B1_WD-WCAVY2588016   ONLINE       0     0     0
        ata-WDC_WD2002FYPS-02W3B0_WD-WCAVY6425945   ONLINE       0     0     0
        ata-WDC_WD2003FYYS-02W0B0_WD-WMAY00055867   ONLINE       0     0     0
        ata-WDC_WD2003FYYS-02W0B0_WD-WMAY00058754   ONLINE       0     0     0
        ata-WDC_WD2003FYYS-02W0B0_WD-WMAY00691102   ONLINE       0     0     0
        ata-WDC_WD2003FYYS-02W0B0_WD-WMAY01342101   ONLINE       0     0     0
        ata-WDC_WD2003FYYS-02W0B0_WD-WMAY03223177   ONLINE       0     0     0
        ata-WDC_WD2003FYYS-02W0B1_WD-WCAY00008597   ONLINE       0     0     0
        ata-WDC_WD2003FYYS-02W0B1_WD-WCAY00012700   ONLINE       0     0     0
        ata-WDC_WD2003FYYS-02W0B1_WD-WCAY00014911   ONLINE       0     0     0
        ata-WDC_WD2003FYYS-02W0B1_WD-WCAY00015392   ONLINE       0     0     0
        ata-WDC_WD2003FYYS-02W0B1_WD-WCAY00016755   ONLINE       0     0     0
        ata-WDC_WD2003FYYS-02W0B1_WD-WCAY00017646   ONLINE       0     0     0
        ata-WDC_WD2003FYYS-02W0B1_WD-WCAY00019016   ONLINE       0     0     0
        ata-WDC_WD2003FYYS-02W0B1_WD-WCAY00022891   ONLINE       0     0     0
        ata-WDC_WD2003FYYS-02W0B1_WD-WCAY00025608   ONLINE       0     0     0
        ata-WDC_WD2003FYYS-02W0B1_WD-WCAY00038080   ONLINE       0     0     0
        ata-WDC_WD2003FYYS-02W0B1_WD-WCAY00038451   ONLINE       0     0     0
        ata-WDC_WD2003FYYS-02W0B1_WD-WCAY00040879   ONLINE       0     0     0
        ata-WDC_WD2003FYYS-02W0B1_WD-WCAY00041792   ONLINE       0     0     0
        ata-WDC_WD2003FYYS-02W0B1_WD-WCAY00087852   ONLINE       0     0     0
        ata-WDC_WD2003FYYS-02W0B1_WD-WCAY00492436   ONLINE       0     0     0
        ata-WDC_WD2003FYYS-02W0B1_WD-WCAY00875337   ONLINE       0     0     0
    spares
      %draid1-0-s0                                  AVAIL

errors: No known data errors

test.nvl.gz

@prod-feng

Hi,

I tested draidcfg with the rebase, using an unusually small case: 3 data + 1 parity + 1 spare. It gives the base permutations below:

draidcfg -r my5.nvl

dRAID1 vdev of 5 child drives: 1 x (3 data + 1 parity) and 1 distributed spare
Using 32 base permutations

3, 0, 2, 1, 4,
1, 3, 4, 2, 0,
0, 4, 2, 3, 1,
3, 1, 2, 0, 4,
0, 1, 3, 2, 4,
0, 1, 2, 3, 4,
3, 2, 1, 4, 0,
2, 1, 0, 4, 3,
4, 3, 1, 2, 0,
4, 1, 2, 0, 3,
1, 4, 2, 0, 3,
2, 4, 1, 0, 3,
0, 3, 2, 4, 1,
4, 1, 3, 2, 0,
3, 4, 2, 0, 1,
0, 3, 2, 4, 1,
0, 4, 3, 1, 2,
4, 1, 3, 0, 2,
4, 2, 0, 3, 1,
1, 0, 4, 3, 2,
3, 2, 4, 1, 0,
3, 0, 2, 1, 4,
4, 3, 1, 0, 2,
3, 0, 1, 4, 2,
2, 0, 3, 4, 1,
1, 3, 0, 4, 2,
3, 2, 0, 1, 4,
4, 2, 3, 1, 0,
3, 2, 1, 4, 0,
2, 1, 0, 4, 3,
2, 0, 4, 1, 3,
3, 1, 2, 0, 4,

There are 5 pairs of identical rows, such as "3, 0, 2, 1, 4". It seems the result is not as random as expected?

Also, I am wondering whether dRAID can support the case of (n - s) % (p + d) != 0, for example n=10, d=4, p=2, s=1. Basically, the random permutation approach could also work in this case.

@richardelling (Contributor)

Rows can be repeated; the goal is to get a random spread as data is allocated according to rows. In this example there are 120 possible permutations, and draidcfg randomly picks 32. It does this a few times and keeps the set that has the fewest collisions (implying it is more random). This has no effect on data protection, though it might have a small impact on random read performance.

The answer to the second question is no, not today. What would be a good use case for having more disks than usable space or spares? IMHO, it would be better to use the space for spares, parity, or data; do you agree?
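
For illustration, here is a rough sketch of that selection idea in C. It is only an illustration with made-up parameters (N_DRIVES, N_PERMS, N_TRIALS) and a simple duplicate count as the collision metric, not the actual draidcfg code:

/*
 * Illustrative sketch only, not draidcfg itself: draw a few candidate
 * sets of random base permutations and keep the set with the fewest
 * duplicate rows, mirroring the "least collisions" selection described
 * above.  N_DRIVES, N_PERMS, and N_TRIALS are assumed values.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define	N_DRIVES	5	/* children in the dRAID vdev */
#define	N_PERMS		32	/* base permutations per candidate set */
#define	N_TRIALS	8	/* candidate sets to compare */

static void
random_perm(int *p, int n)
{
	/* Fisher-Yates shuffle of 0..n-1 */
	for (int i = 0; i < n; i++)
		p[i] = i;
	for (int i = n - 1; i > 0; i--) {
		int j = rand() % (i + 1);
		int t = p[i]; p[i] = p[j]; p[j] = t;
	}
}

static int
count_duplicate_rows(int perms[N_PERMS][N_DRIVES])
{
	int dups = 0;
	for (int a = 0; a < N_PERMS; a++)
		for (int b = a + 1; b < N_PERMS; b++)
			if (memcmp(perms[a], perms[b], sizeof (perms[a])) == 0)
				dups++;
	return (dups);
}

int
main(void)
{
	int cand[N_PERMS][N_DRIVES], best[N_PERMS][N_DRIVES];
	int best_dups = -1;

	srand((unsigned)time(NULL));
	for (int t = 0; t < N_TRIALS; t++) {
		/* generate a candidate set and keep it if it has fewer dups */
		for (int r = 0; r < N_PERMS; r++)
			random_perm(cand[r], N_DRIVES);
		int dups = count_duplicate_rows(cand);
		if (best_dups < 0 || dups < best_dups) {
			memcpy(best, cand, sizeof (best));
			best_dups = dups;
		}
	}
	printf("kept the set with %d duplicate rows out of %d\n",
	    best_dups, N_PERMS);
	return (0);
}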

@prod-feng

Thanks for your reply, Richard.

For the case of 3 data + 1 parity + 1 spare, the 32 base permutations x 5 = 160 combinations, which is more than 120, the total number of permutations of 5 disks. Since 5 base permutations are repeated, that leaves (32 - 5) x 5 = 135 combinations (still > 120), so some repetition is unavoidable and the chance that data is still distributed evenly is good. I just want to be sure that with more disks, like 20, 40, 80, etc., repeated base permutations would not badly hurt the expected even distribution (there the number of possible permutations is far larger than the number of base permutations). One easy fix might be to simply remove an identical base permutation and generate a new, different one to replace it.
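
As a rough sanity check (an illustration with assumed values, not project code): by linearity of expectation, the expected number of duplicate pairs among k random permutations of n drives is about C(k,2)/n!, so for n=5 and k=32 roughly 4 duplicate pairs are expected, while for 20 or more drives duplicates are vanishingly unlikely. A tiny C program tabulating this:

/*
 * Back-of-the-envelope estimate (illustration only): expected number of
 * duplicate pairs among k uniformly random permutations of n drives is
 * C(k,2) / n!.  For n = 5 and k = 32 this is 496 / 120 ~= 4.1, which
 * matches the handful of repeated rows seen above; for larger n the
 * expectation is effectively zero.
 */
#include <stdio.h>

static double
factorial(int n)
{
	double f = 1.0;
	for (int i = 2; i <= n; i++)
		f *= i;
	return (f);
}

int
main(void)
{
	const int k = 32;				/* base permutations drawn */
	const int sizes[] = { 5, 10, 20, 40, 80 };	/* drive counts */
	const double pairs = (double)k * (k - 1) / 2.0;	/* C(k,2) */

	for (unsigned i = 0; i < sizeof (sizes) / sizeof (sizes[0]); i++)
		printf("n=%2d: expected duplicate pairs ~ %g\n",
		    sizes[i], pairs / factorial(sizes[i]));
	return (0);
}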

For my second question: in that case, with 32/64/128 base permutations, would it eventually use all the disks, just like a regular random distribution of data?

Thanks

@richardelling (Contributor)

I'm not sure I follow your math. There are n! permutations, so for n=5 there are 120 of them.

In general, the way this works is that offset % base_permutations picks the row; parity, data, and spares are the columns. So for any offset there is a direct map to the disks and their roles (parity, data, spare). This just needs to be random enough to spread the workload across all of the disks. By contrast, RAID-5 typically rotates the parity through a much smaller number of combinations, n. This likely doesn't make much difference for small values of n. We use n=56 and base_permutations=64 (56! ≈ 7.1e74 possible permutations) for small products, and larger n for others, so we want a random choice.

If it helps, we could prevent duplicates if n! >= base_permutations. But I'm not sure how to quantify the impact.
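
To make the offset-to-row mapping above concrete, here is a minimal sketch in C. The table is a made-up 4-row example reusing rows from the draidcfg output above (real configs use 32 or more base permutations), and the parity/data/spare column layout is an assumption for illustration, not the vdev_draid on-disk format:

/*
 * Minimal illustration (assumed layout, not the vdev_draid implementation):
 * offset % N_PERMS selects a base permutation row, and the columns of
 * that row assign drives to parity, data, and spare roles.
 */
#include <stdio.h>

#define	N_DRIVES	5
#define	N_PERMS		4	/* tiny example; real configs use 32+ */
#define	N_PARITY	1
#define	N_DATA		3
#define	N_SPARE		1

static const int base_perms[N_PERMS][N_DRIVES] = {
	{ 3, 0, 2, 1, 4 },
	{ 1, 3, 4, 2, 0 },
	{ 0, 4, 2, 3, 1 },
	{ 3, 1, 2, 0, 4 },
};

int
main(void)
{
	for (unsigned long long offset = 0; offset < 8; offset++) {
		/* offset modulo the permutation count selects the row */
		const int *row = base_perms[offset % N_PERMS];
		int c = 0;

		printf("offset %llu: parity=", offset);
		for (int i = 0; i < N_PARITY; i++, c++)
			printf("d%d ", row[c]);
		printf("data=");
		for (int i = 0; i < N_DATA; i++, c++)
			printf("d%d ", row[c]);
		printf("spare=");
		for (int i = 0; i < N_SPARE; i++, c++)
			printf("d%d", row[c]);
		printf("\n");
	}
	return (0);
}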

@PrivatePuffin (Contributor)

Is there any chance @thegreatgazoo is able or willing to continue work on this any time soon?

It seems a bit odd that he is still listed as a ZoL organisation member but appears totally inactive and unresponsive...

@PrivatePuffin (Contributor)

@kpande The usual stuff, basically...
Although according to @ahrens it should still be actively developed?
https://twitter.com/mahrens1/status/1191717726118563842

That being said, it would have been nice of Intel (as a sponsor) to send a replacement if their developer leaves, or at least to make sure he posts about leaving.

Shame though, because this is very much in Intel's interest. Let's be real: that is why they started pushing this and allocation classes in the first place.

@ofaaland mentioned this pull request on Nov 6, 2019
@ahrens (Member) commented Nov 7, 2019

Superseded by #9558

@ahrens closed this on Nov 7, 2019