Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Interrupts+spurious #5519

Closed
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
130 changes: 85 additions & 45 deletions plugins/inputs/interrupts/README.md
Original file line number Diff line number Diff line change
@@ -1,6 +1,7 @@
# Interrupts Input Plugin

The interrupts plugin gathers metrics about IRQs from `/proc/interrupts` and `/proc/softirqs`.
The interrupts plugin gathers metrics about IRQs from `/proc/interrupts`, `/proc/softirqs`,
and `/proc/irq/IRQ_NUMBER/spurious`

### Configuration
```toml
Expand All @@ -13,65 +14,99 @@ The interrupts plugin gathers metrics about IRQs from `/proc/interrupts` and `/p
## deployments.
# cpu_as_tag = false

## spurious interrupt counters can be collected
# spurious = false
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's just enable this by default, and rely on the metric filtering for those who don't want spurious counts per IRQ.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok


## To filter which IRQs to collect, make use of tagpass / tagdrop, i.e.
# [inputs.interrupts.tagdrop]
# irq = [ "NET_RX", "TASKLET" ]
```

### Metrics
### Measurements

There are two styles depending on the value of `cpu_as_tag`.
There are two styles of `interrupts` and `soft_interrupts` depending on the value of `cpu_as_tag`.
When `cpu_as_tag` is `false` the per-CPU count is in a field. When `true` the CPU is a tag and
there is one point per interrupt per CPU. Having the CPU as a tag easily allows queries by-CPU at
the cost of requiring greater cardinality.

With `cpu_as_tag = false`:
For `spurious_interrupts` there is only a per-interrupt counter available; no per-CPU info is
available.

- interrupts
- tags:
- irq (IRQ name)
- type
- device (name of the device that is located at the IRQ)
- cpu
- fields:
- cpu (int, number of interrupts per cpu)
- total (int, total number of interrupts)

- soft_interrupts
- tags:
- irq (IRQ name)
- type
- device (name of the device that is located at the IRQ)
- cpu
- fields:
- cpu (int, number of interrupts per cpu)
- total (int, total number of interrupts)
#### interrupts
The `interrupts` measurement reports the hard interrupt data collected from the
`/proc/interrupts` file in Linux.
These interrupts are typically from hardware devices, but can be also generated by
software, such as timers and interprocessor interrupts.

With `cpu_as_tag = true`:
##### interrupts tags
| cpu_as_tag | tag | description |
|:---:|:---:|---|
| - | device | description of device |
| - | irq | IRQ name (ephemeral, could be a number) |
| - | type | interrupt type |
| true | cpu | CPU number |

##### interrupts fields
| cpu_as_tag | field | counter | units | description |
|:---:|:---:|:---:|:---:|---|
| - | count | counter | events | number of times the interrupt has been triggered |
| false | cpu# | counter | events | number of times the interrupt has been handled by CPU _#_ |
| false | total | counter | events | total number of times the interrupt has been handled by all CPUs |

#### soft_interrupts
The `soft_interrupts` measurement reports the soft interrupt data collected from the
`/proc/softirqs` file in Linux.

Note: for some Linux systems there can be fixed, large number of CPUs reported in
the `/proc/softirqs` file. This number could be much larger than the actual number of
CPUs in the system. The fields for these phantom CPUs contain zeroes. The approach
taken to remove these phantom CPUs is to remove the columns containing all zeros
to the right (higher CPU numbers). For systems where CPUs are dynamically enabled,
this can lead to CPUs not being reported until enabled. However, this is preferable
to collecting metrics for tens or hundreds of phantom CPUs. For queries with fixed
numbers of CPUs, consider using `fill(0)` rather than `fill(null)`

##### soft_interrupts tags
| cpu_as_tag | tag | description |
|:---:|:---:|---|
| - | irq | IRQ name |
| true | cpu | CPU number |

- interrupts
- tags:
- irq (IRQ name)
- type
- device (name of the device that is located at the IRQ)
- cpu
- fields:
- count (int, number of interrupts)

- soft_interrupts
- tags:
- irq (IRQ name)
- type
- device (name of the device that is located at the IRQ)
- cpu
- fields:
- count (int, number of interrupts)
##### soft_interrupts fields
| cpu_as_tag | field | type | units | description |
|:---:|:---:|:---:|:---:|---|
| false | cpu# | counter | events | number of times the interrupt has been handled by CPU _#_ |
| false | total | counter | events | total number of times the interrupt has been handled by all CPUs |
| true | count | counter | events | number of times the interrupt has been handled by CPU in the tag |


#### spurious_interrupts
The `spurious_interrupts` measurement reports the number of spurious interrupts triggered
and unhandled for IRQs. This data is collected from the `/proc/irq/IRQ/spurious` file.
This data is identified by IRQ and not per-CPU.

##### spurious_interrupts tags
| tag | description |
|:---:|---|
| device | description of device |
| irq | IRQ name |
| type | interrupt type |

##### spurious_interrupts fields
| field | type | units | description |
|:---:|:---:|:---:|---|
| count | counter | events | number of times the interrupt has been handled (modulo 100,000) |
| total | counter | events | total number of times the interrupt has been handled |
| unhandled | counter | events | number of times an interrupt was not handled |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks nice but we need to stick to the previous format style to match https://raw.githubusercontent.com/influxdata/telegraf/master/plugins/inputs/EXAMPLE_README.md

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will do, though I will assert your standard is ugly and takes up too much screen space :-)


### Example Output

With `cpu_as_tag = false`:
```
interrupts,irq=0,type=IO-APIC,device=2-edge\ timer,cpu=cpu0 count=23i 1489346531000000000
interrupts,irq=1,type=IO-APIC,device=1-edge\ i8042,cpu=cpu0 count=9i 1489346531000000000
interrupts,irq=30,type=PCI-MSI,device=65537-edge\ virtio1-input.0,cpu=cpu1 count=1i 1489346531000000000
soft_interrupts,irq=NET_RX,cpu=cpu0 count=280879i 1489346531000000000
interrupts,irq=0,type=IO-APIC,device=2-edge\ timer cpu0=23i,cpu1=0i,total=23i 1489346531000000000
interrupts,irq=1,type=IO-APIC,device=1-edge\ i8042 cpu0=4i,cpu1=5i,total=9i 1489346531000000000
interrupts,irq=30,type=PCI-MSI,device=65537-edge\ virtio1-input.0 cpu0=2i,cpu1=40i,total=42i 1489346531000000000
soft_interrupts,irq=NET_RX cpu0=140412i,cpu1=140467,total=280879i 1489346531000000000
```

With `cpu_as_tag = true`:
Expand All @@ -81,3 +116,8 @@ interrupts,cpu=cpu7,irq=PIW,type=Posted-interrupt\ wakeup\ event count=0i 154353
soft_interrupts,cpu=cpu0,irq=HI count=246441i 1543539773000000000
soft_interrupts,cpu=cpu1,irq=HI count=159154i 1543539773000000000
```

With `spurious = true` add to the above:
```
spurious_interrupts,device=17-fasteoi\ ioc0,irq=17,type=IO-APIC count=27836i,total=327836i,unhandled=0i 1551582077000000000
```
89 changes: 79 additions & 10 deletions plugins/inputs/interrupts/interrupts.go
Original file line number Diff line number Diff line change
Expand Up @@ -5,6 +5,7 @@ import (
"fmt"
"io"
"os"
"path/filepath"
"strconv"
"strings"

Expand All @@ -17,11 +18,13 @@ type Interrupts struct {
}

type IRQ struct {
ID string
Type string
Device string
Total int64
Cpus []int64
ID string
Type string
Device string
Total int64
Cpus []int64
SpuriousCount uint64
SpuriousUnhandled uint64
}

func NewIRQ(id string) *IRQ {
Expand Down Expand Up @@ -57,7 +60,7 @@ func parseInterrupts(r io.Reader) ([]IRQ, error) {
if scanner.Scan() {
cpus := strings.Fields(scanner.Text())
if cpus[0] != "CPU0" {
return nil, fmt.Errorf("Expected first line to start with CPU0, but was %s", scanner.Text())
return nil, fmt.Errorf("expected first line to start with CPU0, but was %s", scanner.Text())
}
cpucount = len(cpus)
}
Expand Down Expand Up @@ -90,14 +93,67 @@ scan:
} else if len(fields) > cpucount {
irq.Type = strings.Join(fields[cpucount+1:], " ")
}

// collect spurious interrupt data for this irq.ID
file := filepath.Join("/proc/irq", irq.ID, "spurious")
f, err := os.Open(file)
if err == nil {
irq.SpuriousCount, irq.SpuriousUnhandled = parseSpurious(f)
}
irqs = append(irqs, *irq)
}
if scanner.Err() != nil {
return nil, fmt.Errorf("Error scanning file: %s", scanner.Err())
return nil, fmt.Errorf("error scanning file: %s", scanner.Err())
}

// For some Linux systems there can be fixed, large number of CPUs reported in
// the `/proc/softirqs` file. This number could be much larger than the actual number of
// CPUs in the system. The fields for these phantom CPUs contain zeroes. The approach
// taken to remove these phantom CPUs is to remove the columns containing all zeros
// to the right (higher CPU numbers). For systems where CPUs are dynamically enabled,
// this can lead to CPUs not being reported until enabled. However, this is preferable
// to collecting metrics for tens or hundreds of phantom CPUs.

// First, determine the rightmost CPU column with non-zero data
validCpuIndex := 0
for _, irq := range irqs {
var i int
for i = len(irq.Cpus) - 1; i > validCpuIndex && irq.Cpus[i] == 0; i-- {
}
if i > validCpuIndex {
validCpuIndex = i
}
}
// Secondly, remove data for any CPUs above the validCpuIndex
validCpuCount := validCpuIndex + 1
for i := 0; i < len(irqs); i++ {
if len(irqs[i].Cpus) > validCpuCount {
irqs[i].Cpus = append(irqs[i].Cpus[:validCpuCount])
}
srebhan marked this conversation as resolved.
Show resolved Hide resolved
}

return irqs, nil
}

func parseSpurious(r io.Reader) (uint64, uint64) {
count := uint64(0)
unhandled := uint64(0)
scanner := bufio.NewScanner(r)
for scanner.Scan() {
s := strings.Fields(scanner.Text())
if len(s) < 2 {
continue
}
switch s[0] {
case "count":
count, _ = strconv.ParseUint(s[1], 10, 64)
case "unhandled":
unhandled, _ = strconv.ParseUint(s[1], 10, 64)
}
}
return count, unhandled
}

func gatherTagsFields(irq IRQ) (map[string]string, map[string]interface{}) {
tags := map[string]string{"irq": irq.ID, "type": irq.Type, "device": irq.Device}
fields := map[string]interface{}{"total": irq.Total}
Expand All @@ -112,16 +168,17 @@ func (s *Interrupts) Gather(acc telegraf.Accumulator) error {
for measurement, file := range map[string]string{"interrupts": "/proc/interrupts", "soft_interrupts": "/proc/softirqs"} {
f, err := os.Open(file)
if err != nil {
acc.AddError(fmt.Errorf("Could not open file: %s", file))
acc.AddError(fmt.Errorf("could not open file: %s", file))
continue
}
defer f.Close()
irqs, err := parseInterrupts(f)
_ = f.Close()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The previous defer method is actually preferred, since it will be less likely to be broken in future code updates.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem with the previous method is that the static code checkers complain about the f.close() return code being ignored, which it is. Two choices:

  1. add proper func for defer f.Close() to explicitly ignore the return value
  2. add comment for static checkers to ignore the ignored return value error

I'm fine either way. Do you have a preference?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not our goal to keep static code checkers quiet, outside of the ones required by the integration tests (go vet in particular). So I'd rather just leave it as before, I don't find assigning to _ to be helpful in explaining the code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'll prefer to f.Close() in the loop. If defer is in a loop, we trigger possible leak detection because the defer's closure is the function, not the loop. In this particular location inside Gather, we loop twice, not a real problem. In parseInterrupts we open one file per irq, this can easily grow into the hundreds of open file descriptors, so it is better to close in the loop.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, this is possibly an indication that we should extract a function.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, I'm not particularly happy with the overall design, lemme try a more detailed refactor, including some improved testing

if err != nil {
acc.AddError(fmt.Errorf("Parsing %s: %s", file, err))
acc.AddError(fmt.Errorf("parsing %s: %s", file, err))
continue
}
reportMetrics(measurement, irqs, acc, s.CpuAsTag)
reportSpuriousMetrics(irqs, acc)
}
return nil
}
Expand All @@ -143,6 +200,18 @@ func reportMetrics(measurement string, irqs []IRQ, acc telegraf.Accumulator, cpu
}
}

func reportSpuriousMetrics(irqs []IRQ, acc telegraf.Accumulator) {
for _, irq := range irqs {
tags, _ := gatherTagsFields(irq)
spuriousFields := map[string]interface{}{
"count": irq.SpuriousCount,
"unhandled": irq.SpuriousUnhandled,
"total": irq.Total,
}
acc.AddFields("spurious_interrupts", spuriousFields, tags)
}
}

func init() {
inputs.Add("interrupts", func() telegraf.Input {
return &Interrupts{}
Expand Down
23 changes: 22 additions & 1 deletion plugins/inputs/interrupts/interrupts_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -33,7 +33,7 @@ func expectCpuAsFields(m *testutil.Accumulator, t *testing.T, measurement string

func setup(t *testing.T, irqString string, cpuAsTags bool) (*testutil.Accumulator, []IRQ) {
f := bytes.NewBufferString(irqString)
irqs, err := parseInterrupts(f)
irqs, err := parseInterrupts(f, false)
require.Equal(t, nil, err)
require.NotEqual(t, 0, len(irqs))

Expand Down Expand Up @@ -154,3 +154,24 @@ func TestCpuAsFieldsHwIrqs(t *testing.T) {
expectCpuAsFields(acc, t, "interrupts", irq)
}
}

// =====================================================================================
// spurious interrupts
//
// Note: the spurious interrupt ID is gathered from /proc/interrupts as part of the
// hardware interrupts, so its test is not recreated here.
// =====================================================================================

const spuriousIrqsString = `
count 12345
unhandled 89
last_unhandled 6677
`

func TestSpuriousParser(t *testing.T) {
f := bytes.NewBufferString(spuriousIrqsString)
hasSpurious, count, unhandled := parseSpurious(f)
require.True(t, hasSpurious, "spurious data found")
require.Equal(t, uint64(12345), count, "incorrect parsed count")
require.Equal(t, uint64(89), unhandled, "incorrect parsed unhandled")
}