Module/metricset for Metricbeat: RAID Metrics #5600

Closed
plinde opened this issue Nov 15, 2017 · 24 comments
Assignees
Labels
enhancement Metricbeat Metricbeat module Team:Integrations Label for the Integrations team

Comments

@plinde
Member

plinde commented Nov 15, 2017

Enhancement to Metricbeat for collecting RAID-related metrics; specifically for the equivalents of these commands:

  • cat /proc/mdstat
  • mdadm
@ruflin
Contributor

ruflin commented Nov 16, 2017

Here is example content of /proc/mdstat from a machine with 2 disks and RAID 1:

cat mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md2 : active raid1 sda5[0] sdb5[1]
      2925531648 blocks super 1.2 [2/2] [UU]

md1 : active raid1 sda2[0] sdb2[1]
      2097088 blocks [2/2] [UU]

md0 : active raid1 sda1[0] sdb1[1]
      2490176 blocks [2/2] [UU]

unused devices: <none>

@andrewkroh We should probably add support for this in gosigar? https://github.com/elastic/gosigar
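
As a rough illustration of what a parser for the sample above has to extract, here is a minimal Go sketch. It is not the procfs or gosigar implementation; the `mdDevice` type, `parseMdstat` function, and both regular expressions are hypothetical names invented for this example, and it only handles active arrays laid out like the three shown above.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// sample is the /proc/mdstat content quoted in the comment above.
const sample = `Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md2 : active raid1 sda5[0] sdb5[1]
      2925531648 blocks super 1.2 [2/2] [UU]

md1 : active raid1 sda2[0] sdb2[1]
      2097088 blocks [2/2] [UU]

md0 : active raid1 sda1[0] sdb1[1]
      2490176 blocks [2/2] [UU]

unused devices: <none>
`

// mdDevice is a hypothetical container for the fields visible in the sample.
type mdDevice struct {
	Name        string
	State       string
	Level       string
	Blocks      int64
	DisksTotal  int
	DisksActive int
}

var (
	// Matches "md2 : active raid1 ..." (assumes the array is active, so a level follows the state).
	deviceLine = regexp.MustCompile(`^(md\d+) : (\S+) (\S+)`)
	// Matches "      2925531648 blocks super 1.2 [2/2] [UU]"; [n/m] is [total/active].
	statusLine = regexp.MustCompile(`^\s+(\d+) blocks .*\[(\d+)/(\d+)\]`)
)

// parseMdstat returns one mdDevice per "mdN :" line found in content.
func parseMdstat(content string) []mdDevice {
	var devices []mdDevice
	scanner := bufio.NewScanner(strings.NewReader(content))
	for scanner.Scan() {
		line := scanner.Text()
		if m := deviceLine.FindStringSubmatch(line); m != nil {
			devices = append(devices, mdDevice{Name: m[1], State: m[2], Level: m[3]})
			continue
		}
		if m := statusLine.FindStringSubmatch(line); m != nil && len(devices) > 0 {
			// The status line belongs to the most recently seen device.
			d := &devices[len(devices)-1]
			// The regexp only captures digits, so conversion errors are ignored here.
			d.Blocks, _ = strconv.ParseInt(m[1], 10, 64)
			d.DisksTotal, _ = strconv.Atoi(m[2])
			d.DisksActive, _ = strconv.Atoi(m[3])
		}
	}
	return devices
}

func main() {
	for _, d := range parseMdstat(sample) {
		fmt.Printf("%s: %s %s, %d blocks, %d/%d disks active\n",
			d.Name, d.State, d.Level, d.Blocks, d.DisksActive, d.DisksTotal)
	}
}
```

Inactive arrays, recovery lines, and bitmap lines are deliberately left out; a real metricset would need to cover them.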

@andrewkroh
Member

@ruflin There's support for this in procfs so we should use that. https://github.com/elastic/procfs/blob/master/mdstat.go

The question I'm pondering is where we add this: linux/mdstat, system/mdstat, or maybe a more general system/raid?

@ruflin
Contributor

ruflin commented Nov 16, 2017

My thought process here:

  • RAID is something generic that exists on all (?) OSes
  • Information on Windows should be similar (I have never created a RAID on Windows)

This kind of leads to system/raid. One issue I have with raid is that it could perhaps be mdstat + ?, meaning: is there more information under /proc that we should add to raid?

raid also seems to be the most user-friendly name. I would not have known off the top of my head that the information is in mdstat.

@ruflin
Contributor

ruflin commented Nov 20, 2017

@plinde I put together a PR with the data: #5642

@andrewkroh The metricset is not tested yet, as I'm not sure how best to test this with "actual" RAID data. Any ideas here?

@ruflin
Contributor

ruflin commented Nov 20, 2017

@plinde Could you let me know exactly which metrics you are looking for from the mdstat file? I have a few in #5642, but I'm not sure if these are the right ones or if you need more.

@plinde
Member Author

plinde commented Nov 21, 2017

@ruflin Looks great! I think it would be beneficial to include the following additional metrics per RAID device. However, I can see this would be more of an enhancement to procfs's mdstat.go

  • Working Devices: int
  • Failed Devices: int
  • Spare Devices: int

If possible, it would also be good to compare the blocks.synced/total and perhaps include a Boolean for "synced: true". The scenarios for this being false would include during the rebuild (syncing) of a disk.
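
The synced flag described above could be derived along these lines. This is only a sketch: the `blockStatus` type and its methods are hypothetical names, not fields that procfs or the raid metricset is confirmed to expose.

```go
package main

import "fmt"

// blockStatus is a hypothetical pair of counters, mirroring blocks.synced and blocks.total.
type blockStatus struct {
	Synced int64
	Total  int64
}

// inSync reports whether every block is synced; it is false while a disk is rebuilding.
func (b blockStatus) inSync() bool {
	return b.Total > 0 && b.Synced == b.Total
}

// percent returns the synced share as a percentage, or 0 for an empty array.
func (b blockStatus) percent() float64 {
	if b.Total == 0 {
		return 0
	}
	return float64(b.Synced) * 100 / float64(b.Total)
}

func main() {
	healthy := blockStatus{Synced: 2490176, Total: 2490176}
	rebuilding := blockStatus{Synced: 50, Total: 200}
	fmt.Println(healthy.inSync(), healthy.percent())
	fmt.Println(rebuilding.inSync(), rebuilding.percent())
}
```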

@ruflin
Contributor

ruflin commented Nov 27, 2017

The raid metricset creates one event per device on every fetch. The metrics you described above are more of a summary, which is probably best done with a query in ES / KB. You can group by the field system.raid.activity_state for that. Is there a "state" for a spare device?

In case you are mainly interested in the overall stats, we could think about either only doing the overview or having something similar to the process and process_summary metricsets, such as a raid_summary metricset.

For the blocks.synced / total, are you referring to the recovery lines? See https://raid.wiki.kernel.org/index.php/Mdstat
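
To make the raid_summary idea concrete, a summary could be as simple as counting per-device events by activity state. The sketch below is an assumption about what such a metricset might do, not existing Metricbeat code; `raidEvent` and `summarize` are hypothetical names.

```go
package main

import "fmt"

// raidEvent is a hypothetical, reduced form of one per-device raid event.
type raidEvent struct {
	Name          string
	ActivityState string
}

// summarize counts devices per activity state, similar in spirit to process_summary.
func summarize(events []raidEvent) map[string]int {
	counts := make(map[string]int)
	for _, e := range events {
		counts[e.ActivityState]++
	}
	return counts
}

func main() {
	events := []raidEvent{
		{Name: "md0", ActivityState: "active"},
		{Name: "md1", ActivityState: "active"},
		{Name: "md2", ActivityState: "inactive"},
	}
	summary := summarize(events)
	fmt.Printf("active=%d inactive=%d\n", summary["active"], summary["inactive"])
}
```

The same grouping can be done at query time in Elasticsearch, which is the alternative suggested above.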

ruflin added a commit to ruflin/beats that referenced this issue Dec 4, 2017
@andrewkroh andrewkroh reopened this Dec 5, 2017
@andrewkroh
Member

andrewkroh commented Dec 5, 2017

Leaving this open because I think we need

@ruflin ruflin added the module label Feb 26, 2018
@jsoriano
Member

jsoriano commented Apr 3, 2018

It'd also be nice to support common hardware RAID controllers like MegaRAID.

@ruflin
Contributor

ruflin commented Apr 3, 2018

@jsoriano How is the data accessed for these?

@jsoriano
Member

jsoriano commented Apr 3, 2018

Access to them is usually via commands, for example megacli for MegaRAID.

@ruflin
Contributor

ruflin commented Apr 4, 2018

I would prefer that we not have to execute commands (if possible). So far we have stayed away from that for security reasons.

@jsoriano
Member

jsoriano commented Apr 4, 2018

Oh ok, I understand. These commands probably use ioctls in the end; I don't know if they are also based on libraries that we could use.

It may be complex if their commands are not used, because sometimes they are based on proprietary solutions.

@ruflin ruflin added the Team:Integrations Label for the Integrations team label Nov 21, 2018
@ruflin
Contributor

ruflin commented Dec 4, 2018

@plinde We did the first part of the implementation but so far haven't followed up with the second part, and we haven't heard back yet. Is there still a need for part 2?

@plinde
Member Author

plinde commented Jan 7, 2019

@ruflin I'd say that mdadm falls into the same category as megacli, as it is a command to execute. That said, there is likely value in digging deeper as there are some vital bits of information in mdadm's output. Please see below for an example.

/dev/md0:
        Version : 1.2
  Creation Time : Thu Dec  7 17:52:05 2017
     Raid Level : raid1
     Array Size : 83820544 (79.94 GiB 85.83 GB)
  Used Dev Size : 83820544 (79.94 GiB 85.83 GB)
   Raid Devices : 2
  Total Devices : 2
    Persistence : Superblock is persistent

    Update Time : Thu Dec  7 18:55:14 2017
          State : active, degraded, recovering
 Active Devices : 1
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 1

 Rebuild Status : 19% complete

           Name : ip-10-47-187-215:0  (local to host ip-10-47-187-215)
           UUID : 0bf0183d:dd02021a:cb5eb9e1:3da0c5c6
         Events : 26

    Number   Major   Minor   RaidDevice State
       2     202       80        0      spare rebuilding   /dev/xvdf
       1     202       96        1      active sync   /dev/xvdg

@ruflin
Contributor

ruflin commented Jan 22, 2019

@plinde Could you share some details on exactly which values you are interested in from the above?

@ruflin ruflin removed their assignment Jan 22, 2019
@plinde
Member Author

plinde commented Jan 22, 2019

I think particularly relevant metrics might be:

{
"md0.raid1.active_devices" : 1,
"md0.raid1.working_devices" : 2,
"md0.raid1.failed_devices" : 0,
"md0.raid1.spare_devices" : 1,
"md0.raid1.rebuild_status" : 19,
...
"md1.raid1.active_devices" : 1,
"md1.raid1.working_devices" : 1,
"md1.raid1.failed_devices" : 1,
"md1.raid1.spare_devices" : 1,
"md1.raid1.rebuild_status" : 0,
}
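
One way to produce keys shaped like the ones listed above is to flatten a per-array struct into a map. This is a sketch only: the `arrayStats` type and `metrics` method are hypothetical, and the `<device>.<level>.<metric>` key layout simply copies the proposal above rather than any existing Metricbeat field naming.

```go
package main

import (
	"fmt"
	"sort"
)

// arrayStats is a hypothetical holder for the per-array counters requested above.
type arrayStats struct {
	Name           string
	Level          string
	ActiveDevices  int
	WorkingDevices int
	FailedDevices  int
	SpareDevices   int
	RebuildPercent int
}

// metrics flattens the struct into "<device>.<level>.<metric>" keys.
func (a arrayStats) metrics() map[string]int {
	prefix := a.Name + "." + a.Level + "."
	return map[string]int{
		prefix + "active_devices":  a.ActiveDevices,
		prefix + "working_devices": a.WorkingDevices,
		prefix + "failed_devices":  a.FailedDevices,
		prefix + "spare_devices":   a.SpareDevices,
		prefix + "rebuild_status":  a.RebuildPercent,
	}
}

func main() {
	// Values taken from the md0 mdadm output quoted earlier in the thread.
	md0 := arrayStats{
		Name: "md0", Level: "raid1",
		ActiveDevices: 1, WorkingDevices: 2, FailedDevices: 0,
		SpareDevices: 1, RebuildPercent: 19,
	}
	m := md0.metrics()
	keys := make([]string, 0, len(m))
	for k := range m {
		keys = append(keys, k)
	}
	sort.Strings(keys) // map iteration order is random, so sort for stable output
	for _, k := range keys {
		fmt.Printf("%s = %d\n", k, m[k])
	}
}
```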

@jsoriano
Member

@plinde @ruflin mdadm manages software RAID, the same thing we are already monitoring. We collect the data from /proc/mdstat, but only total and active devices and synced bytes; we could try to collect more values if needed.

@alvarolobato

@plinde is this something you still need?

@fearful-symmetry
Contributor

fearful-symmetry commented Mar 8, 2019

So, there are 3 sources of this information:

  • ioctl() via GET_DISK_INFO and GET_ARRAY_INFO
  • /proc/mdstat
  • /sys/block/md*/md/*

Tools like mdadm are just collecting all these and putting them on a single screen.

The data that @plinde mentions above is from GET_ARRAY_INFO

Annoyingly, other interesting information, such as the recovery/rebuild/resync percentage that mdadm shows, comes from /proc/mdstat. We're using a third-party library to parse that, and said library doesn't support grabbing those values.
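
For reference, pulling the progress values out of a /proc/mdstat recovery line can be sketched with a single regular expression. This is not part of the third-party library mentioned above; `syncProgress`, `parseProgress`, and the pattern are hypothetical, and the sample line in `main` is illustrative, written in the format described on the kernel wiki page linked earlier in the thread.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// syncProgress is a hypothetical holder for one progress line.
type syncProgress struct {
	Action  string // recovery, resync, reshape, or check
	Percent float64
	Done    int64
	Total   int64
}

// progressRe matches e.g. "recovery = 12.6% (37043392/292945152)".
var progressRe = regexp.MustCompile(`(recovery|resync|reshape|check)\s*=\s*([0-9.]+)%\s*\((\d+)/(\d+)\)`)

// parseProgress extracts the progress values from a line, reporting false if
// the line is not a progress line.
func parseProgress(line string) (syncProgress, bool) {
	m := progressRe.FindStringSubmatch(line)
	if m == nil {
		return syncProgress{}, false
	}
	pct, err := strconv.ParseFloat(m[2], 64)
	if err != nil {
		return syncProgress{}, false
	}
	// The regexp only captures digits, so these conversions fail only on overflow.
	done, _ := strconv.ParseInt(m[3], 10, 64)
	total, _ := strconv.ParseInt(m[4], 10, 64)
	return syncProgress{Action: m[1], Percent: pct, Done: done, Total: total}, true
}

func main() {
	line := "      [==>..................]  recovery = 12.6% (37043392/292945152) finish=127.5min speed=33440K/sec"
	if p, ok := parseProgress(line); ok {
		fmt.Printf("%s: %.1f%% (%d of %d)\n", p.Action, p.Percent, p.Done, p.Total)
	}
}
```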

@fearful-symmetry
Contributor

Addendum: The Array size that mdadm reports comes from a third ioctl call, BLKGETSIZE64 .

@fearful-symmetry
Contributor

So, I made a brief PoC using ioctl in Go. The GET_DISK_INFO stuff that we want is actually rather straightforward, but there seems to be a lot of nefarious logic going on with the GET_DISK_INFO and BLKGETSIZE64 calls that I can't completely understand. The md subsystem isn't very well documented, which isn't helping.
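
For readers unfamiliar with the ioctl route, a GET_ARRAY_INFO call from Go might look roughly like the sketch below. It is not the PoC mentioned above. The struct layout is an assumption copied from `mdu_array_info_t` in the kernel's `linux/raid/md_u.h`, the request number is derived with the `_IOR` encoding used on x86 and ARM (other architectures differ), and the names `mduArrayInfo`, `ior`, and `getArrayInfo` are hypothetical. It is Linux-only and needs permission to open the md device.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
	"unsafe"
)

// mduArrayInfo mirrors mdu_array_info_t (18 C ints); layout assumed from md_u.h.
type mduArrayInfo struct {
	MajorVersion, MinorVersion, PatchVersion int32
	Ctime                                    uint32
	Level, Size, NrDisks, RaidDisks          int32
	MdMinor, NotPersistent                   int32
	Utime                                    uint32
	State                                    int32
	ActiveDisks, WorkingDisks                int32
	FailedDisks, SpareDisks                  int32
	Layout, ChunkSize                        int32
}

const mdMajor = 9 // MD_MAJOR, the block major number of md devices

// ior builds a read-direction ioctl request number the way the C _IOR macro
// does on x86/ARM: direction in bits 30-31, size in 16-29, type in 8-15, nr in 0-7.
func ior(typ, nr, size uintptr) uintptr {
	return 2<<30 | size<<16 | typ<<8 | nr
}

var (
	// GET_ARRAY_INFO is _IOR(MD_MAJOR, 0x11, mdu_array_info_t).
	getArrayInfoReq = ior(mdMajor, 0x11, unsafe.Sizeof(mduArrayInfo{}))
	// BLKGETSIZE64 is _IOR(0x12, 114, size_t); 8 assumes a 64-bit size_t.
	blkGetSize64Req = ior(0x12, 114, 8)
)

// getArrayInfo opens an md device such as /dev/md0 and issues GET_ARRAY_INFO.
func getArrayInfo(path string) (mduArrayInfo, error) {
	var info mduArrayInfo
	f, err := os.Open(path)
	if err != nil {
		return info, err
	}
	defer f.Close()
	_, _, errno := syscall.Syscall(syscall.SYS_IOCTL, f.Fd(), getArrayInfoReq,
		uintptr(unsafe.Pointer(&info)))
	if errno != 0 {
		return info, errno
	}
	return info, nil
}

func main() {
	fmt.Printf("GET_ARRAY_INFO=%#x BLKGETSIZE64=%#x\n", getArrayInfoReq, blkGetSize64Req)
	info, err := getArrayInfo("/dev/md0")
	if err != nil {
		fmt.Println("ioctl not available here:", err)
		return
	}
	fmt.Printf("active=%d working=%d failed=%d spare=%d\n",
		info.ActiveDisks, info.WorkingDisks, info.FailedDisks, info.SpareDisks)
}
```

The active, working, failed, and spare counts requested earlier in the thread map directly onto fields of this struct, which is why the ioctl route is attractive despite the sparse documentation.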

@ruflin
Contributor

ruflin commented Mar 15, 2019

@fearful-symmetry In general, do you think it's worth pursuing this route further?

@fearful-symmetry
Contributor

@ruflin Yes. I'm currently putting together a PR for the ioctl implementation of this. I'm going to be running around airports most of the day, so it probably won't happen until Monday.

7 participants