monitor physical disks #242

Open · K0zka opened this issue Dec 18, 2018 · 8 comments

K0zka commented Dec 18, 2018

Check the drives periodically for predicted errors and create a problem when a drive fails.

K0zka commented Dec 18, 2018

smartmontools
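
As a rough sketch of what a smartmontools-based check could look like (the `smartctl -H` command and its PASSED/FAILED result line are standard smartctl behaviour; the function name and the local ProcessBuilder call are just assumptions, not kerub's actual monitor code):

```kotlin
import java.util.concurrent.TimeUnit

// Hypothetical helper, not kerub's monitor code: runs "smartctl -H" on the
// given device and interprets the ATA overall-health self-assessment line.
// Returns true for PASSED, false for FAILED, null if it could not be determined.
fun smartHealthOk(device: String): Boolean? {
    val process = ProcessBuilder("smartctl", "-H", device)
        .redirectErrorStream(true)
        .start()
    val output = process.inputStream.bufferedReader().readText()
    process.waitFor(30, TimeUnit.SECONDS)
    return when {
        "PASSED" in output -> true
        "FAILED" in output -> false
        else -> null // unknown device type, no SMART support, etc.
    }
}
```

In kerub itself the command would presumably run through the host connection layer rather than a local ProcessBuilder, and SCSI disks report a differently worded health line, so real parsing would need to handle more than the ATA format.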

K0zka added a commit that referenced this issue Dec 19, 2018
Bug-Url: #242
Signed-off-by: Laszlo Hornyak <[email protected]>
K0zka added a commit that referenced this issue Dec 21, 2018
Bug-Url: #242
Signed-off-by: Laszlo Hornyak <[email protected]>
K0zka added a commit that referenced this issue Dec 22, 2018
Bug-Url: #242
Signed-off-by: Laszlo Hornyak <[email protected]>
K0zka added a commit that referenced this issue Dec 23, 2018
Bug-Url: #242
Signed-off-by: Laszlo Hornyak <[email protected]>

K0zka commented Dec 23, 2018

TODO

  1. OpenIndiana and Windows integration is missing (Windows: smartmontools is available for Cygwin)
  2. A problem must be created on any negative disk health indication (see the sketch after this list)
  3. There should be step factories to provide solutions for the problem, e.g. migrate storage, remove the disk from the VG, and so on (removing the disk from the VG is done, the same for gvinum is not done yet)
  4. There should be a few test stories about disk failure scenarios with all kinds of storage solutions (fs, lvm, gvinum)
  5. The story should also be introduced in kerub-ext-tests, but it is unclear whether a SMART disk failure can be emulated with qemu
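
A minimal sketch of item 2, with every type and property name made up for illustration (kerub's real model and problem-detector classes are not shown here):

```kotlin
// Result of one periodic health check of one physical disk on a host.
data class DiskHealthReport(
    val hostId: String,
    val device: String,
    val healthy: Boolean,
    val failingAttributes: List<String> = listOf()
)

// Problem raised whenever a disk gives any negative health indication.
data class PhysicalDiskFailure(
    val hostId: String,
    val device: String,
    val details: List<String>
)

// Item 2 of the TODO: any negative indication (failed self-assessment or
// failing SMART attributes) becomes a problem for the planner to work on.
fun detectDiskProblems(reports: List<DiskHealthReport>): List<PhysicalDiskFailure> =
    reports
        .filter { !it.healthy || it.failingAttributes.isNotEmpty() }
        .map { PhysicalDiskFailure(it.hostId, it.device, it.failingAttributes) }
```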

K0zka pinned this issue Dec 24, 2018
K0zka added a commit that referenced this issue Dec 24, 2018
what is missing is the list of storage devices from the storage
capabilities

Bug-Url: #242
Signed-off-by: Laszlo Hornyak <[email protected]>
K0zka added a commit that referenced this issue Dec 24, 2018
Bug-Url: #242
Signed-off-by: Laszlo Hornyak <[email protected]>

K0zka commented Dec 25, 2018

Windows: `wmic diskdrive get status` should theoretically do the trick
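
A sketch of that check, assuming the plain-text output of `wmic diskdrive get status`; the helper below is only an illustration, not kerub's Windows integration:

```kotlin
import java.util.concurrent.TimeUnit

// Hypothetical helper: reads the Status column of Win32_DiskDrive via wmic.
// Documented values include OK, Degraded, Pred Fail and Error; anything
// other than OK would count as a negative health indication.
fun windowsDiskStatuses(): List<String> {
    val process = ProcessBuilder("wmic", "diskdrive", "get", "status")
        .redirectErrorStream(true)
        .start()
    val lines = process.inputStream.bufferedReader().readLines()
    process.waitFor(30, TimeUnit.SECONDS)
    return lines
        .map { it.trim() }
        .filter { it.isNotEmpty() && it != "Status" } // drop header and blank lines
}
```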

K0zka added a commit that referenced this issue Dec 26, 2018
Bug-Url: #242
Signed-off-by: Laszlo Hornyak <[email protected]>
K0zka added a commit that referenced this issue Dec 26, 2018
lvm vgreduce and pvmove junix utilities

Bug-Url: #242
Signed-off-by: Laszlo Hornyak <[email protected]>
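
For reference, the plain LVM commands behind those wrappers amount to something like the following local-execution sketch (the junix utilities mentioned in the commit presumably drive the same commands through kerub's own execution layer, which is not reproduced here):

```kotlin
// Local-execution sketch of the LVM side: pvmove evacuates all extents from
// the failing PV, vgreduce then removes the emptied PV from the volume group.
fun evacuatePhysicalVolume(vgName: String, pvDevice: String) {
    exec("pvmove", pvDevice)
    exec("vgreduce", vgName, pvDevice)
}

private fun exec(vararg command: String) {
    val process = ProcessBuilder(*command).inheritIO().start()
    val exit = process.waitFor()
    check(exit == 0) { "${command.joinToString(" ")} failed with exit code $exit" }
}
```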
K0zka added a commit that referenced this issue Dec 26, 2018
K0zka added a commit that referenced this issue Dec 26, 2018
Bug-Url: #242
Signed-off-by: Laszlo Hornyak <[email protected]>
K0zka added a commit that referenced this issue Dec 27, 2018
K0zka added a commit that referenced this issue Dec 27, 2018
Bug-Url: #242
Signed-off-by: Laszlo Hornyak <[email protected]>
K0zka added this to the 0.2 milestone Dec 27, 2018

K0zka commented Dec 28, 2018

Reminder

When there is an LVM VG with a single PV in it and that PV signals a SMART failure, we cannot remove the PV (at least one must always remain). There is no operation for removing the VG completely either, so the disk failure keeps being detected as a problem.

The idea is that if it is the only PV in the VG and there are no virtual storage allocations on it, the problem detector should consider it fine. I will implement this if I still believe it is a good idea tomorrow.

K0zka commented Dec 29, 2018

But in any case, a storage failure should create an alert (#51).

K0zka commented Jan 1, 2019

Suppressing the problem detection when there is only one failing PV in the VG does not seem to be such a good idea after all. But what else could one do:

  • remove the VG completely - but this could be e.g. a system VG, and then the server would fail
  • raise one more problem for each virtual disk allocation on the volume group and let the planner come up with ideas about what we can do. Eventually a single problem should remain with the VG, which we cannot solve without human interaction, so let's leave it there (still: raise an alert). A sketch of this per-allocation fan-out follows below.
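
A sketch of the second option, with all names invented for illustration (kerub's real allocation and problem types differ):

```kotlin
sealed interface StorageProblem
data class AllocationOnFailingStorage(val virtualDiskId: String, val vgName: String) : StorageProblem
data class FailingVolumeGroup(val vgName: String) : StorageProblem

// A virtual disk allocation that lives on an LVM volume group.
data class VgAllocation(val virtualDiskId: String, val vgName: String)

// One problem per allocation lets the planner evacuate each virtual disk on
// its own; the VG-level problem stays behind for a human (plus an alert, #51).
fun problemsForFailingVg(vgName: String, allocations: List<VgAllocation>): List<StorageProblem> {
    val perAllocation: List<StorageProblem> = allocations
        .filter { it.vgName == vgName }
        .map { AllocationOnFailingStorage(it.virtualDiskId, it.vgName) }
    return perAllocation + FailingVolumeGroup(vgName)
}
```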

K0zka added a commit that referenced this issue Jan 1, 2019
…ity with failing storage

Bug-Url: #242
Signed-off-by: Laszlo Hornyak <[email protected]>
K0zka added a commit that referenced this issue Jan 1, 2019
K0zka added a commit that referenced this issue Jan 1, 2019
K0zka added a commit that referenced this issue Jan 2, 2019
K0zka added a commit that referenced this issue Jan 4, 2019
Bug-Url: #242
Signed-off-by: Laszlo Hornyak <[email protected]>

K0zka commented Feb 23, 2019

For testing, there is some documentation here: https://www.kernel.org/doc/Documentation/fault-injection/fault-injection.txt
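
For the block-device case, the `fail_make_request` knobs from that document could be driven roughly like this (the paths are taken from the documentation; the Kotlin wrapper itself is made up and only meant for tests):

```kotlin
import java.io.File

// Test-only sketch: turn on fail_make_request for a block device, using the
// debugfs knobs described in the linked document. Needs root and a kernel
// built with CONFIG_FAIL_MAKE_REQUEST.
fun injectBlockIoFailures(device: String, probabilityPercent: Int = 100) {
    val knobs = File("/sys/kernel/debug/fail_make_request")
    File(knobs, "probability").writeText("$probabilityPercent\n")
    File(knobs, "times").writeText("-1\n") // keep injecting failures forever
    File(knobs, "verbose").writeText("1\n")
    // mark the device itself as eligible for injected IO errors
    File("/sys/block/$device/make-it-fail").writeText("1\n")
}
```

This injects IO errors rather than a SMART prediction, so it would cover the hard-failure scenarios but not the "predicted failure" path.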

K0zka commented Feb 23, 2019

Reminder

Right now the FS does not track any information about its backing block device. Therefore, when a storage device signals a future problem, kerub could evacuate the filesystem, but it does not even know that it should.
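
One possible way to learn the backing device of a mounted filesystem would be to resolve the mount point against /proc/self/mounts; this is only an illustration of the missing piece, not a decision on how the FS model will be extended:

```kotlin
import java.io.File

// Hypothetical helper: resolve the block device backing a mount point by
// scanning /proc/self/mounts (fields: device, mount point, fstype, options, ...).
fun backingDeviceOf(mountPoint: String): String? =
    File("/proc/self/mounts").readLines()
        .map { it.split(" ") }
        .firstOrNull { it.size >= 2 && it[1] == mountPoint }
        ?.get(0)
        ?.takeIf { it.startsWith("/dev/") } // ignore tmpfs, nfs and friends
```

For LVM-backed filesystems this would return a /dev/mapper device, which would still have to be mapped back to the VG and its PVs before the SMART state becomes useful.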
