monitor physical disks #242

Open · K0zka opened this issue Dec 18, 2018 · 8 comments

K0zka commented Dec 18, 2018

Check the drives periodically for predicted errors and create a problem when a drive fails.

K0zka commented Dec 18, 2018

smartmontools
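
As a rough sketch of what a smartmontools-based check could look like (the `smartctl -H` command and its PASSED/FAILED result line are standard smartctl behaviour; the function name and the local ProcessBuilder call are just assumptions, not kerub's actual monitor code):

```kotlin
import java.util.concurrent.TimeUnit

// Hypothetical helper, not kerub's monitor code: runs "smartctl -H" on the
// given device and interprets the ATA overall-health self-assessment line.
// Returns true for PASSED, false for FAILED, null if it could not be determined.
fun smartHealthOk(device: String): Boolean? {
    val process = ProcessBuilder("smartctl", "-H", device)
        .redirectErrorStream(true)
        .start()
    val output = process.inputStream.bufferedReader().readText()
    process.waitFor(30, TimeUnit.SECONDS)
    return when {
        "PASSED" in output -> true
        "FAILED" in output -> false
        else -> null // unknown device type, no SMART support, etc.
    }
}
```

In kerub itself the command would presumably run through the host connection layer rather than a local ProcessBuilder, and SCSI disks report a differently worded health line, so real parsing would need to handle more than the ATA format.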

K0zka added a commit that referenced this issue Dec 19, 2018
Bug-Url: #242
Signed-off-by: Laszlo Hornyak <[email protected]>
K0zka added a commit that referenced this issue Dec 21, 2018
Bug-Url: #242
Signed-off-by: Laszlo Hornyak <[email protected]>
K0zka added a commit that referenced this issue Dec 22, 2018
Bug-Url: #242
Signed-off-by: Laszlo Hornyak <[email protected]>
K0zka added a commit that referenced this issue Dec 23, 2018
Bug-Url: #242
Signed-off-by: Laszlo Hornyak <[email protected]>

K0zka commented Dec 23, 2018

TODO

  1. OpenIndiana and Windows integration is missing (Windows: smartmontools is available for Cygwin)
  2. A problem must be created on any negative disk health indication (see the sketch after this list)
  3. There should be step factories to provide solutions for the problem, e.g. migrate storage, remove the disk from the VG, and so on (removing the disk from the VG is done, the same for gvinum is not done yet)
  4. There should be a few test stories about disk failure scenarios with all kinds of storage solutions (fs, lvm, gvinum)
  5. The story should also be introduced in kerub-ext-tests, but it is unclear whether a SMART disk failure can be emulated with qemu
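
A minimal sketch of item 2, with every type and property name made up for illustration (kerub's real model and problem-detector classes are not shown here):

```kotlin
// Result of one periodic health check of one physical disk on a host.
data class DiskHealthReport(
    val hostId: String,
    val device: String,
    val healthy: Boolean,
    val failingAttributes: List<String> = listOf()
)

// Problem raised whenever a disk gives any negative health indication.
data class PhysicalDiskFailure(
    val hostId: String,
    val device: String,
    val details: List<String>
)

// Item 2 of the TODO: any negative indication (failed self-assessment or
// failing SMART attributes) becomes a problem for the planner to work on.
fun detectDiskProblems(reports: List<DiskHealthReport>): List<PhysicalDiskFailure> =
    reports
        .filter { !it.healthy || it.failingAttributes.isNotEmpty() }
        .map { PhysicalDiskFailure(it.hostId, it.device, it.failingAttributes) }
```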

K0zka pinned this issue Dec 24, 2018
K0zka added a commit that referenced this issue Dec 24, 2018
what is missing is the list of storage devices from the storage
capabilities

Bug-Url: #242
Signed-off-by: Laszlo Hornyak <[email protected]>
K0zka added a commit that referenced this issue Dec 24, 2018
Bug-Url: #242
Signed-off-by: Laszlo Hornyak <[email protected]>

K0zka commented Dec 25, 2018

Windows: `wmic diskdrive get status` should theoretically do the trick
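
A sketch of that check, assuming the plain-text output of `wmic diskdrive get status`; the helper below is only an illustration, not kerub's Windows integration:

```kotlin
import java.util.concurrent.TimeUnit

// Hypothetical helper: reads the Status column of Win32_DiskDrive via wmic.
// Documented values include OK, Degraded, Pred Fail and Error; anything
// other than OK would count as a negative health indication.
fun windowsDiskStatuses(): List<String> {
    val process = ProcessBuilder("wmic", "diskdrive", "get", "status")
        .redirectErrorStream(true)
        .start()
    val lines = process.inputStream.bufferedReader().readLines()
    process.waitFor(30, TimeUnit.SECONDS)
    return lines
        .map { it.trim() }
        .filter { it.isNotEmpty() && it != "Status" } // drop header and blank lines
}
```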

K0zka added a commit that referenced this issue Dec 26, 2018
Bug-Url: #242
Signed-off-by: Laszlo Hornyak <[email protected]>
K0zka added a commit that referenced this issue Dec 26, 2018
lvm vgreduce and pvmove junix utilities

Bug-Url: #242
Signed-off-by: Laszlo Hornyak <[email protected]>
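
For reference, the plain LVM commands behind those wrappers amount to something like the following local-execution sketch (the junix utilities mentioned in the commit presumably drive the same commands through kerub's own execution layer, which is not reproduced here):

```kotlin
// Local-execution sketch of the LVM side: pvmove evacuates all extents from
// the failing PV, vgreduce then removes the emptied PV from the volume group.
fun evacuatePhysicalVolume(vgName: String, pvDevice: String) {
    exec("pvmove", pvDevice)
    exec("vgreduce", vgName, pvDevice)
}

private fun exec(vararg command: String) {
    val process = ProcessBuilder(*command).inheritIO().start()
    val exit = process.waitFor()
    check(exit == 0) { "${command.joinToString(" ")} failed with exit code $exit" }
}
```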
K0zka added a commit that referenced this issue Dec 26, 2018
K0zka added a commit that referenced this issue Dec 26, 2018
Bug-Url: #242
Signed-off-by: Laszlo Hornyak <[email protected]>
K0zka added a commit that referenced this issue Dec 27, 2018
K0zka added a commit that referenced this issue Dec 27, 2018
Bug-Url: #242
Signed-off-by: Laszlo Hornyak <[email protected]>
K0zka added this to the 0.2 milestone Dec 27, 2018

K0zka commented Dec 28, 2018

Reminder

When there is an LVM VG with a single PV in it and that PV signals a SMART failure, we cannot remove the PV (at least one must always remain). There is no operation for removing the VG completely either, so the disk failure keeps being detected as a problem.

The idea is that if it is the only PV in the VG and there are no virtual storage allocations on it, the problem detector should consider it fine. I will implement this if I still believe it is a good idea tomorrow.

K0zka commented Dec 29, 2018

But in any case, a storage failure should create an alert (#51).

K0zka commented Jan 1, 2019

Suppressing the problem detection when there is only one failing PV in the VG does not seem to be such a good idea after all. But what else could one do:

  • remove the VG completely - but this could be e.g. a system VG, and then the server would fail
  • raise one more problem for each virtual disk allocation on the volume group and let the planner come up with ideas about what we can do. Eventually a single problem should remain with the VG, which we cannot solve without human interaction, so let's leave it there (still: raise an alert). A sketch of this per-allocation fan-out follows below.
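
A sketch of the second option, with all names invented for illustration (kerub's real allocation and problem types differ):

```kotlin
sealed interface StorageProblem
data class AllocationOnFailingStorage(val virtualDiskId: String, val vgName: String) : StorageProblem
data class FailingVolumeGroup(val vgName: String) : StorageProblem

// A virtual disk allocation that lives on an LVM volume group.
data class VgAllocation(val virtualDiskId: String, val vgName: String)

// One problem per allocation lets the planner evacuate each virtual disk on
// its own; the VG-level problem stays behind for a human (plus an alert, #51).
fun problemsForFailingVg(vgName: String, allocations: List<VgAllocation>): List<StorageProblem> {
    val perAllocation: List<StorageProblem> = allocations
        .filter { it.vgName == vgName }
        .map { AllocationOnFailingStorage(it.virtualDiskId, it.vgName) }
    return perAllocation + FailingVolumeGroup(vgName)
}
```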

K0zka added a commit that referenced this issue Jan 1, 2019
…ity with failing storage

Bug-Url: #242
Signed-off-by: Laszlo Hornyak <[email protected]>
K0zka added a commit that referenced this issue Jan 1, 2019
K0zka added a commit that referenced this issue Jan 1, 2019
K0zka added a commit that referenced this issue Jan 2, 2019
K0zka added a commit that referenced this issue Jan 4, 2019
Bug-Url: #242
Signed-off-by: Laszlo Hornyak <[email protected]>

K0zka commented Feb 23, 2019

For testing, there is some documentation here: https://www.kernel.org/doc/Documentation/fault-injection/fault-injection.txt
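
For the block-device case, the `fail_make_request` knobs from that document could be driven roughly like this (the paths are taken from the documentation; the Kotlin wrapper itself is made up and only meant for tests):

```kotlin
import java.io.File

// Test-only sketch: turn on fail_make_request for a block device, using the
// debugfs knobs described in the linked document. Needs root and a kernel
// built with CONFIG_FAIL_MAKE_REQUEST.
fun injectBlockIoFailures(device: String, probabilityPercent: Int = 100) {
    val knobs = File("/sys/kernel/debug/fail_make_request")
    File(knobs, "probability").writeText("$probabilityPercent\n")
    File(knobs, "times").writeText("-1\n") // keep injecting failures forever
    File(knobs, "verbose").writeText("1\n")
    // mark the device itself as eligible for injected IO errors
    File("/sys/block/$device/make-it-fail").writeText("1\n")
}
```

This injects IO errors rather than a SMART prediction, so it would cover the hard-failure scenarios but not the "predicted failure" path.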

K0zka commented Feb 23, 2019

Reminder

Right now the FS does not track any information about its backing block device. Therefore, when a storage device signals a future problem, kerub could evacuate the filesystem, but it does not even know that it should.
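
One possible way to learn the backing device of a mounted filesystem would be to resolve the mount point against /proc/self/mounts; this is only an illustration of the missing piece, not a decision on how the FS model will be extended:

```kotlin
import java.io.File

// Hypothetical helper: resolve the block device backing a mount point by
// scanning /proc/self/mounts (fields: device, mount point, fstype, options, ...).
fun backingDeviceOf(mountPoint: String): String? =
    File("/proc/self/mounts").readLines()
        .map { it.split(" ") }
        .firstOrNull { it.size >= 2 && it[1] == mountPoint }
        ?.get(0)
        ?.takeIf { it.startsWith("/dev/") } // ignore tmpfs, nfs and friends
```

For LVM-backed filesystems this would return a /dev/mapper device, which would still have to be mapped back to the VG and its PVs before the SMART state becomes useful.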
