Module/metricset for Metricbeat: RAID Metrics #5600

plinde · 2017-11-15T14:39:55Z

Enhancement to Metricbeat for collecting RAID-related metrics; specifically for the equivalents of these commands:

cat /proc/mdstat
mdadm


ruflin · 2017-11-16T04:34:15Z

Here is an example of the content of /proc/mdstat from a machine with 2 disks and raid1:

cat mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md2 : active raid1 sda5[0] sdb5[1]
      2925531648 blocks super 1.2 [2/2] [UU]

md1 : active raid1 sda2[0] sdb2[1]
      2097088 blocks [2/2] [UU]

md0 : active raid1 sda1[0] sdb1[1]
      2490176 blocks [2/2] [UU]

unused devices: <none>

@andrewkroh We should probably add support for this in gosigar? https://github.com/elastic/gosigar
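For illustration, the sample above can be parsed with a few lines of Python. This is a minimal sketch and not what gosigar or procfs do: the function name `parse_mdstat`, the regular expressions, and the output field names are all made up here, and the parser only handles active arrays whose status line carries an `[n/m] [UU]` pair, as in this sample.

```python
import re

SAMPLE = """\
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md2 : active raid1 sda5[0] sdb5[1]
      2925531648 blocks super 1.2 [2/2] [UU]

md1 : active raid1 sda2[0] sdb2[1]
      2097088 blocks [2/2] [UU]

md0 : active raid1 sda1[0] sdb1[1]
      2490176 blocks [2/2] [UU]

unused devices: <none>
"""

# Header line, e.g. "md2 : active raid1 sda5[0] sdb5[1]".
# Assumption: the array is active, so the third token is the RAID level.
ARRAY_RE = re.compile(r"^(md\S+) : (\S+) (\S+) (.*)$")
# Status line, e.g. "2925531648 blocks super 1.2 [2/2] [UU]".
STATUS_RE = re.compile(r"^\s+(\d+) blocks.*\[(\d+)/(\d+)\] \[([U_]+)\]")


def parse_mdstat(text):
    """Return one dict per md array found in /proc/mdstat-style text."""
    arrays = []
    current = None
    for line in text.splitlines():
        header = ARRAY_RE.match(line)
        if header:
            current = {
                "name": header.group(1),
                "state": header.group(2),
                "level": header.group(3),
                "members": header.group(4).split(),
            }
            arrays.append(current)
            continue
        status = STATUS_RE.match(line)
        if status and current is not None:
            total, active = int(status.group(2)), int(status.group(3))
            current.update(
                blocks=int(status.group(1)),
                disks_total=total,
                disks_active=active,
                # "_" in the [UU] pattern marks a missing member.
                degraded=active < total or "_" in status.group(4),
            )
    return arrays


if __name__ == "__main__":
    for array in parse_mdstat(SAMPLE):
        print(array["name"], array["level"], array["disks_active"], "/", array["disks_total"])
```

Inactive arrays and levels without redundancy (raid0, linear) print different lines, so a real parser would need more cases than this.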

andrewkroh · 2017-11-16T14:02:48Z

@ruflin There's support for this in procfs so we should use that. https://github.com/elastic/procfs/blob/master/mdstat.go

The question I'm pondering is where we add this: linux/mdstat, system/mdstat, or maybe a more general system/raid?

ruflin · 2017-11-16T22:57:50Z

My thought process here:

RAID is something generic that exists on all (?) operating systems
Information on Windows should be similar (I have never created a RAID on Windows)

This kind of leads to system/raid. One issue I have with raid is that it could perhaps be mdstat + ?, meaning: is there more information under /proc that we should add to raid?

raid also seems to be the most user-friendly name. I would not have known off the top of my head that the information is in mdstat.

ruflin · 2017-11-20T03:56:03Z

@plinde I put a PR together with the data in. #5642

@andrewkroh The metricset is not tested yet as I'm not sure how best to test this with "actual" raid data. Any ideas here?

ruflin · 2017-11-20T23:09:38Z

@plinde Could you let me know exactly which metrics you are looking for from the mdstat file? I have a few in #5642 but I'm not sure if these are the right ones or if you need some more.

plinde · 2017-11-21T18:12:10Z

@ruflin Looks great! I think it would be beneficial to include the following additional metrics per RAID device. However, I can see this would be more of an enhancement to procfs's mdstat.go

Working Devices: int
Failed Devices: int
Spare Devices: int

If possible, it would also be good to compare the blocks.synced/total and perhaps include a Boolean for "synced: true". The scenarios for this being false would include during the rebuild (syncing) of a disk.
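A minimal sketch of how such a "synced" flag could be derived from the mdstat status lines. The progress-line format (`recovery = 12.6% (synced/total) ...`) is written from memory of the kernel's output and the sample line below is illustrative, not captured from a real host; `sync_status` and its field names are hypothetical.

```python
import re

# Progress line printed while an array is rebuilding, for example:
#   [==>..................]  recovery = 12.6% (37043392/292945152) finish=127.5min speed=33440K/sec
# Assumption: the action label is one of recovery, resync, reshape, check.
PROGRESS_RE = re.compile(
    r"(recovery|resync|reshape|check)\s*=\s*([\d.]+)%\s*\((\d+)/(\d+)\)"
)


def sync_status(status_lines, total_blocks):
    """Return sync progress for one array.

    If no progress line is present, the array is treated as fully synced.
    """
    for line in status_lines:
        match = PROGRESS_RE.search(line)
        if match:
            synced, total = int(match.group(3)), int(match.group(4))
            return {
                "action": match.group(1),
                "percent": float(match.group(2)),
                "blocks_synced": synced,
                "blocks_total": total,
                "synced": synced >= total,
            }
    return {
        "action": "idle",
        "percent": 100.0,
        "blocks_synced": total_blocks,
        "blocks_total": total_blocks,
        "synced": True,
    }


if __name__ == "__main__":
    rebuilding = "      [==>..................]  recovery = 12.6% (37043392/292945152) finish=127.5min speed=33440K/sec"
    print(sync_status([rebuilding], 292945152))
```

Note that a degraded array with no rebuild in progress would still report `synced: True` here, so the flag only makes sense alongside the active/total device counts.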

ruflin · 2017-11-27T02:36:17Z

The raid metricset creates an event for each device on every fetch. The metrics you described above are more of a summary, which is probably best done with a query in ES / KB. You can group by the field system.raid.activity_state for that. Is there a "state" for spare devices?

In case you are mainly interested in the overall stats, we could think about either only doing the overview or having something similar to the process and process_summary metricsets, such as a raid_summary metricset.

For the blocks.synced / total are you referring to the recovery lines? See https://raid.wiki.kernel.org/index.php/Mdstat
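To make the summary idea concrete, here is a small sketch of rolling per-device events up into one overview document. The event shape, the `summarize` helper, and the state values are assumptions for illustration; the only name taken from this thread is the `activity_state` field.

```python
from collections import Counter


def summarize(events):
    """Roll per-array events up into one summary dict.

    Each event is a hypothetical dict with "activity_state",
    "disks_active" and "disks_total" keys.
    """
    states = Counter(event["activity_state"] for event in events)
    degraded = sum(
        1 for event in events if event["disks_active"] < event["disks_total"]
    )
    return {
        "arrays": len(events),
        "by_state": dict(states),
        "degraded": degraded,
    }


if __name__ == "__main__":
    events = [
        {"name": "md0", "activity_state": "active", "disks_active": 2, "disks_total": 2},
        {"name": "md1", "activity_state": "active", "disks_active": 2, "disks_total": 2},
        {"name": "md2", "activity_state": "recovering", "disks_active": 1, "disks_total": 2},
    ]
    print(summarize(events))
```

The same grouping can be done at query time, which is the trade-off discussed above: a summary metricset saves the query but duplicates data already present in the per-device events.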


andrewkroh · 2017-12-05T03:11:57Z

Leaving this open because I think we need

metrics for number of failed and spare devices
resolve the mdstat parsing issue discussed in Add system/raid metricset #5642 (comment)

jsoriano · 2018-04-03T10:18:36Z

It'd also be nice to support common hardware RAID controllers like MegaRAID.

ruflin · 2018-04-03T11:27:38Z

@jsoriano How is the data accessed for these?

jsoriano · 2018-04-03T12:34:12Z

Access to them is usually via commands, for example megacli for MegaRAID.

ruflin · 2018-04-04T12:53:01Z

I would prefer if we would not have to execute commands (if possible). So far we stayed away from it for security reasons.

jsoriano · 2018-04-04T14:04:02Z

Oh ok, I understand. These commands probably use ioctls in the end; I don't know if they are also based on libraries that we could use.

It may be complex if their commands are not used, because sometimes they are based on proprietary solutions.

ruflin · 2018-12-04T19:57:30Z

@plinde We did the first part of the implementation but so far haven't followed up with the second part, and we haven't heard back yet. Is there still a need for part 2?

plinde · 2019-01-07T15:50:17Z

@ruflin I'd say that mdadm falls into the same category as megacli, as it is a command to execute. That said, there is likely value in digging deeper as there are some vital bits of information in mdadm's output. Please see below for an example.

/dev/md0:
        Version : 1.2
  Creation Time : Thu Dec  7 17:52:05 2017
     Raid Level : raid1
     Array Size : 83820544 (79.94 GiB 85.83 GB)
  Used Dev Size : 83820544 (79.94 GiB 85.83 GB)
   Raid Devices : 2
  Total Devices : 2
    Persistence : Superblock is persistent

    Update Time : Thu Dec  7 18:55:14 2017
          State : active, degraded, recovering
 Active Devices : 1
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 1

 Rebuild Status : 19% complete

           Name : ip-10-47-187-215:0  (local to host ip-10-47-187-215)
           UUID : 0bf0183d:dd02021a:cb5eb9e1:3da0c5c6
         Events : 26

    Number   Major   Minor   RaidDevice State
       2     202       80        0      spare rebuilding   /dev/xvdf
       1     202       96        1      active sync   /dev/xvdg
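As a rough illustration of which bits are machine-readable, the "Key : Value" section of this output can be split with a few lines of Python. This sketch only parses text that has already been captured (it does not run mdadm); `parse_detail`, `device_counters`, and the normalized key names are hypothetical, and the `DETAIL` string is a subset of the output above.

```python
DETAIL = """\
/dev/md0:
     Raid Level : raid1
          State : active, degraded, recovering
 Active Devices : 1
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 1

 Rebuild Status : 19% complete
"""

COUNTERS = ("active_devices", "working_devices", "failed_devices", "spare_devices")


def parse_detail(text):
    """Split 'Key : Value' lines into a dict with snake_case keys."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(" : ")
        if sep:  # lines without " : " (device header, member table) are skipped
            fields[key.strip().lower().replace(" ", "_")] = value.strip()
    return fields


def device_counters(fields):
    """Pick out the integer device counters and the rebuild percentage."""
    out = {name: int(fields[name]) for name in COUNTERS if name in fields}
    rebuild = fields.get("rebuild_status", "")
    if rebuild.endswith("% complete"):
        out["rebuild_status"] = int(rebuild.split("%")[0])
    return out


if __name__ == "__main__":
    fields = parse_detail(DETAIL)
    print(fields["state"].split(", "))
    print(device_counters(fields))
```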

ruflin · 2019-01-22T11:53:48Z

@plinde Could you share some details on what exact values you are interested in from the above?

plinde · 2019-01-22T12:19:58Z

I think particularly relevant metrics might be:

{
"md0.raid1.active_devices" : 1,
"md0.raid1.working_devices" : 2,
"md0.raid1.failed_devices" : 0,
"md0.raid1.spare_devices" : 1,
"md0.raid1.rebuild_status" : 19,
...
"md1.raid1.active_devices" : 1,
"md1.raid1.working_devices" : 1,
"md1.raid1.failed_devices" : 1,
"md1.raid1.spare_devices" : 1,
"md1.raid1.rebuild_status" : 0,
}

jsoriano · 2019-01-27T23:34:03Z

@plinde @ruflin mdadm manages software RAID, the same thing we are already monitoring. We collect the data from /proc/mdstat, but only total and active devices and synced bytes; we could try to collect more values if needed.

alvarolobato · 2019-03-04T14:26:34Z

@plinde is this something you still need?

fearful-symmetry · 2019-03-08T21:08:48Z

So, there are 3 sources of this information:

ioctl() via GET_DISK_INFO and GET_ARRAY_INFO
/proc/mdstat
/sys/block/md*/md/*

Tools like mdadm are just collecting all these and putting them on a single screen.

The data that @plinde mentions above is from GET_ARRAY_INFO

Annoyingly, other interesting information, such as recovery/rebuild/resync percentage in mdadm, comes from /proc/mdstat, and we're using a 3rd party library to parse that, and said library doesn't support grabbing those values.
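For the third source, a minimal sketch of reading the md attributes under /sys/block. The attribute names (`level`, `raid_disks`, `array_state`, `degraded`, `sync_action`, `sync_completed`, and `dev-*/state`) are the ones I recall from the kernel's md documentation; not every attribute exists for every RAID level, so missing files are skipped. `read_md_sysfs` is a hypothetical helper, and the root directory is a parameter so it can be pointed at a test fixture.

```python
import glob
import os

# Per-array attributes under /sys/block/mdX/md/ (assumed names, see above).
ATTRS = ("level", "raid_disks", "array_state", "degraded", "sync_action", "sync_completed")


def _read(path):
    """Return the stripped file content, or None if it cannot be read."""
    try:
        with open(path) as handle:
            return handle.read().strip()
    except OSError:
        return None


def read_md_sysfs(sys_block="/sys/block"):
    """Collect md attributes and member-disk states for every md array."""
    arrays = {}
    for md_dir in sorted(glob.glob(os.path.join(sys_block, "md*", "md"))):
        name = os.path.basename(os.path.dirname(md_dir))
        info = {}
        for attr in ATTRS:
            value = _read(os.path.join(md_dir, attr))
            if value is not None:
                info[attr] = value
        disks = {}
        for dev_dir in sorted(glob.glob(os.path.join(md_dir, "dev-*"))):
            state = _read(os.path.join(dev_dir, "state"))
            if state is not None:
                # Directory is named "dev-<disk>", e.g. "dev-sda1".
                disks[os.path.basename(dev_dir)[len("dev-"):]] = state
        info["disks"] = disks
        arrays[name] = info
    return arrays


if __name__ == "__main__":
    for name, info in read_md_sysfs().items():
        print(name, info)
```

Reading sysfs avoids both executing commands and issuing ioctls, at the cost of one small file read per attribute.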

fearful-symmetry · 2019-03-08T21:24:32Z

Addendum: The Array size that mdadm reports comes from a third ioctl call, BLKGETSIZE64.
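A hedged sketch of what the GET_ARRAY_INFO and BLKGETSIZE64 calls look like from userspace. The request numbers are derived with the usual `_IOR` encoding used on x86 and ARM, and the 18-field layout of `mdu_array_info_t` is written from memory of `linux/raid/md_u.h`, so both should be checked against the kernel headers before being relied on. The helper names are made up, and opening /dev/mdX normally requires elevated privileges.

```python
import fcntl
import struct


def _ior(type_, nr, size):
    """Encode a read-direction ioctl number (x86/ARM layout assumed)."""
    return (2 << 30) | (size << 16) | (type_ << 8) | nr


# mdu_array_info_t as recalled from linux/raid/md_u.h: 18 32-bit fields.
FIELDS = (
    "major_version", "minor_version", "patch_version", "ctime", "level",
    "size", "nr_disks", "raid_disks", "md_minor", "not_persistent",
    "utime", "state", "active_disks", "working_disks", "failed_disks",
    "spare_disks", "layout", "chunk_size",
)
ARRAY_INFO = struct.Struct("=" + "iiiI" + "iiiiii" + "I" + "iiiii" + "ii")

MD_MAJOR = 9
GET_ARRAY_INFO = _ior(MD_MAJOR, 0x11, ARRAY_INFO.size)
BLKGETSIZE64 = _ior(0x12, 114, 8)  # size argument is a 64-bit size_t


def decode_array_info(buf):
    """Map a raw GET_ARRAY_INFO buffer onto field names."""
    return dict(zip(FIELDS, ARRAY_INFO.unpack(bytes(buf))))


def array_info(path):
    """Issue GET_ARRAY_INFO on an md device such as /dev/md0."""
    buf = bytearray(ARRAY_INFO.size)
    with open(path, "rb") as dev:
        fcntl.ioctl(dev, GET_ARRAY_INFO, buf)  # buffer is filled in place
    return decode_array_info(buf)


def block_device_size(path):
    """Return the device size in bytes via BLKGETSIZE64."""
    buf = bytearray(8)
    with open(path, "rb") as dev:
        fcntl.ioctl(dev, BLKGETSIZE64, buf)
    return struct.unpack("=Q", buf)[0]


if __name__ == "__main__":
    print(hex(GET_ARRAY_INFO), hex(BLKGETSIZE64))
```

The working, failed and spare device counts that mdadm prints would correspond to `working_disks`, `failed_disks` and `spare_disks` in the decoded struct, if the layout above is right.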

fearful-symmetry · 2019-03-13T15:55:31Z

So, I made a brief PoC using ioctl in Go. The GET_DISK_INFO stuff that we want is actually rather straightforward, but there seems to be a lot of nefarious logic going on with the GET_DISK_INFO and BLKGETSIZE64 calls that I can't completely understand. The md subsystem isn't very well documented, which isn't helping.

ruflin · 2019-03-15T12:03:37Z

@fearful-symmetry In general, do you think it's worth pursuing this route further?

fearful-symmetry · 2019-03-15T16:26:00Z

@ruflin Yes. I'm currently putting together a PR for the ioctl implementation of this. I'm going to be running around airports most of the day, so it probably won't happen until Monday.

plinde added the enhancement and Metricbeat labels Nov 15, 2017

ruflin mentioned this issue Nov 27, 2017

Add system/raid metricset #5642

Merged

ruflin added a commit to ruflin/beats that referenced this issue Dec 4, 2017

Add system/raid metricset to Metricbeat

46c2691

Closes elastic#5600

andrewkroh closed this as completed in 6536115 Dec 5, 2017

andrewkroh reopened this Dec 5, 2017

ruflin added the module label Feb 26, 2018

ruflin added the Team:Integrations label Nov 21, 2018

alvarolobato assigned ruflin Dec 4, 2018

ruflin removed their assignment Jan 22, 2019

alvarolobato added the [zube]: Ready label Mar 4, 2019

jsoriano assigned fearful-symmetry Mar 14, 2019

ruflin added [zube]: In Progress and removed [zube]: Ready labels Mar 18, 2019

fearful-symmetry mentioned this issue Mar 18, 2019

[metricbeat] add support for working, failed and spare disks in raid metricset #11292

Closed

fearful-symmetry mentioned this issue Apr 2, 2019

[metricbeat] Raid metricset with expanded disk states, using /sys/block #11613

Merged

alvarolobato added [zube]: In Review and removed [zube]: In Progress labels Apr 15, 2019

andresrc closed this as completed Jun 17, 2019

andresrc added [zube]: Done and removed [zube]: In Review [zube]: Done labels Jun 17, 2019
