-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Heartbeat] Defer monitor / ICMP errors to monitor runtime / ES #29413
Conversation
💚 Build Succeeded
Expand to view the summary
Build stats
Test stats 🧪
💚 Flaky test reportTests succeeded. 🤖 GitHub commentsTo re-run your PR in the CI, just comment with:
|
@@ -91,10 +91,6 @@ func (jf *jobFactory) checkConfig() error { | |||
} | |||
|
|||
func (jf *jobFactory) makePlugin() (plugin2 plugin.Plugin, err error) { | |||
if err := jf.loop.checkNetworkMode(jf.ipVersion); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now checked in stdloop.go, we don't want to check at plugin initialization, just at exec time
@@ -159,29 +158,6 @@ func newICMPLoop() (*stdICMPLoop, error) { | |||
return l, nil | |||
} | |||
|
|||
func (l *stdICMPLoop) checkNetworkMode(mode string) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved to main loop icmp recv so we don't interfere with heartbeat initialization.
@@ -379,7 +363,7 @@ func (l *stdICMPLoop) sendEchoRequest(addr *net.IPAddr) (*requestContext, error) | |||
|
|||
_, err := conn.WriteTo(encoded, addr) | |||
if err != nil { | |||
return nil, err | |||
return nil, fmt.Errorf("could not write to conn: %w", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
more specific error context is always nice
Pinging @elastic/uptime (Team:Uptime) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've started doing the E2E testing and so far one of the things that I couldn't get to is to actually see the errors in the lines below, which trigger specific IPv6 and IPv4 error messages depending on whether the stdICMPLoop
has ipv4 or ipv6 capabilities.
isIPv6 := addr.IP.To4() == nil
if isIPv6 && l.conn6 == nil {
return -1, -1, fmt.Errorf("cannot ping IPv6 address '%s', no IPv6 connection available", addr)
}
if !isIPv6 && l.conn4 == nil {
return -1, -1, fmt.Errorf("cannot ping IPv4 address '%s', no IPv4 connection available", addr)
}
For that, I tried, for example, disabling my connection's ipv6 support with networksetup -setv6off Wi-Fi
(note that I'm only using Wi-Fi on a Mac).
For some reason it seems that the the createListener
function never assigns to err
, but does assign to conn
, and thus we never get a false ipv6 support. To confirm that, I've logged the value of err
when we create the listener:
func createListener(name, network string) *icmp.PacketConn {
conn, err := icmp.ListenPacket(network, "")
fmt.Println("CONN", conn)
fmt.Println("ERR", err)
// XXX: need to check for conn == nil, as 'err != nil' seems always to be
// true, even if error value itself is `nil`. Checking for conn suppresses
// misleading log message.
if conn == nil && err != nil {
return nil
}
return conn
}
I'll do some more digging into this to see if I'm doing anything wrong in my tests or whether I could offer a suggestion on how to fix this. For now, I've been stuck on this for a bit.
Btw, for further testers, it's worth highlighting you will need adequate permissions so that createListener
can work. In my case, I just quickly ran heartbeat
with sudo and made sure that root
owned my config file.
On another note, I've also seen we have a mock function which doesn't seem to be necessary anymore as we've removed checkNetworkMode
:
func (t mockLoop) checkNetworkMode(mode string) error {
return t.checkNetworkModeErr
}
Let me do some more testing and get back to you as soon as I can with more specific feedback on what could be going wrong in these tests.
// Note, needed to hoist err to this scope, not just to add a prefix | ||
fullErr := fmt.Errorf("job could not be initialized: %s", err) | ||
// A placeholder job that always returns an error | ||
p.Jobs = []jobs.Job{func(event *beat.Event) ([]jobs.Job, error) { | ||
return nil, fullErr | ||
}} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is great! Thanks for the helpful comment on the top too. This monitor initialisation part is really elegant, but having the comment on the top does help a lot given the further wrapping and how the jobs then get queued.
@@ -0,0 +1 @@ | |||
.synthetics |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🙏
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've spent some more significant time on this today with testing and familiarising myself with the codebase and the code itself LGTM, but please notice that I couldn't reproduce the exact problem we've seen on the SDH where there were no ipv6 capabilities enabled at all in the user's system. On my Mac, it seems like I can't fully disable IPv6 in a way which causes Go to throw an error when creating the listener here. Despite my attempts at fully disabling IPv6 capabilities on my Mac, Go will succeed in creating a conn
, and thus we'll not see the errors "no IPvX connection available".
Looking through the net
code itself, I also couldn't find a meaningful way to trigger these errors without messing with the code. Perhaps more experienced Go developers/systems programmers could help me figure out what's needed to make Go fail in creating a conn
. That would be much appreciated.
I have also tried to run this code with
dlv
and see whether any of the other members ofPacketConn
(the type ofconn
) could be used to determine whether we have IPvX capabilities ahead of time, but I couldn't see anything useful there as well. Reference: https://pkg.go.dev/golang.org/x/net/icmp#PacketConn.
I also confirmed that I could still ping my own IPv6 loopback interface, and external IPv4.
Besides that, my only other comment is non-blocking, and related to tests.
heartbeat/tests/system/test_icmp.py
Outdated
if has_failed_message(): | ||
proc.check_kill_and_wait(1) | ||
assert monitor_status == "down" | ||
self.assertRegex(monitor_error, ".*Insufficient privileges to perform ICMP ping.*") | ||
else: | ||
# Check that documents are moving through | ||
self.wait_until(lambda: self.output_has(lines=1)) | ||
assert monitor_status == "up" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've seen we have the following comment in the code to explain why we have these different assertions, which makes this test really test different things depending on the underlying environment:
We don't know if the system tests are running is configured to support or not support ping, but we can at least check that the ICMP loop was initiated. In the future we should start up VMs with the correct perms configured and be more specific. In addition to that we should run pings on those machines and make sure they work.
However, I wonder whether we should just add a further verification above to check whether we can perform ICMP pings and then we can use that to decide which logs and statuses we'll look for.
Right now we use Heartbeat's logs to assert on its behaviour, while the suggestion above would help us use Python to look into the underlying platform and then assert on Heartbeat's behaviour. In other words, even though it would be a conditional test, its conditions would not depend only on Heartbeat.
This is not a blocking comment btw, and you can consider this to be outside the scope of this PR if you judge it to be unnecessary for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think it's a good point for a follow-up PR perhaps.
The only to determine if you can do a ping is to actually do one sadly or maybe to use getcap
.
The tricky thing from a CI perspective is you need to see if you have permission, somehow, to run sudo setcap cap_net_raw+eip
on the heartbeat binary. I think we'd need to execute this from python as well.
We probably need this logic regardless because on most people's laptops the no pings behavior will be the default.
I'm not averse to us reaching out to robots to fix this, but I don't think it's a priority given the ultra-low churn WRT the ICMP code.
This PR generally improves the error behavior of all monitors, and some specific ICMP related errors as well. These two items are combined in one PR because the general theme here is improving the ICMP error experience, and improving ICMP required improving all monitors. Fixes #29346 and incremental progress toward #29692 General monitor improvements Generally speaking, per #29692 we are trying to send monitor output to ES wherever possible. With this PR we now send any monitor initialization errors (such as a lack of ICMP kernel capabilities) during monitor creation to ES. We do this by allowing the monitor to initialize and run on schedule, even though we know it will always send the same error message. This lets users more easily debug issues in Kibana. ICMP Specific Improvement This PR also Removes broken a IP capability check that caused heartbeat to be unable to start. We now just rely on return codes from attempts to actually send packets. This is the more specific fix for #29346 . I was not able to exactly reproduce the exact customer reported issue, where the user somehow disabled ipv6 in a way that the ICMP loop that I can't exactly reproduce. I tried disabling ipv6 fully with sudo sysctl net.ipv6.conf.all.disable_ipv6=1 but that didn't yield the error in #29346 The logic is now simplified, there's no truly reliable way to know if you can send an ipv6 (or ipv4) ping before you send it (settings can change at any time! network cards can disappear!), so we just let the error codes happen as the check is executed. This is also generally a better UX in that the errors will now be visible in the Uptime app, not just the logs. It should be noted that the ipv4 and ipv6 boolean options only are documented to affect how DNS lookups happen. With this change the behavior matches the docs. Note that ICMP is a bit weird in that there's a single ICMP loop in heartbeat, and all monitors are really just interacting with that. Removal of .synthetics This also ignores the .synthetics folder which has been inconvenient for some time for devs, in that it dirties the git path (cherry picked from commit 616db13)
This PR generally improves the error behavior of all monitors, and some specific ICMP related errors as well. These two items are combined in one PR because the general theme here is improving the ICMP error experience, and improving ICMP required improving all monitors. Fixes #29346 and incremental progress toward #29692 General monitor improvements Generally speaking, per #29692 we are trying to send monitor output to ES wherever possible. With this PR we now send any monitor initialization errors (such as a lack of ICMP kernel capabilities) during monitor creation to ES. We do this by allowing the monitor to initialize and run on schedule, even though we know it will always send the same error message. This lets users more easily debug issues in Kibana. ICMP Specific Improvement This PR also Removes broken a IP capability check that caused heartbeat to be unable to start. We now just rely on return codes from attempts to actually send packets. This is the more specific fix for #29346 . I was not able to exactly reproduce the exact customer reported issue, where the user somehow disabled ipv6 in a way that the ICMP loop that I can't exactly reproduce. I tried disabling ipv6 fully with sudo sysctl net.ipv6.conf.all.disable_ipv6=1 but that didn't yield the error in #29346 The logic is now simplified, there's no truly reliable way to know if you can send an ipv6 (or ipv4) ping before you send it (settings can change at any time! network cards can disappear!), so we just let the error codes happen as the check is executed. This is also generally a better UX in that the errors will now be visible in the Uptime app, not just the logs. It should be noted that the ipv4 and ipv6 boolean options only are documented to affect how DNS lookups happen. With this change the behavior matches the docs. Note that ICMP is a bit weird in that there's a single ICMP loop in heartbeat, and all monitors are really just interacting with that. Removal of .synthetics This also ignores the .synthetics folder which has been inconvenient for some time for devs, in that it dirties the git path (cherry picked from commit 616db13)
@Mergifyio backport 7.17 |
This PR generally improves the error behavior of all monitors, and some specific ICMP related errors as well. These two items are combined in one PR because the general theme here is improving the ICMP error experience, and improving ICMP required improving all monitors. Fixes #29346 and incremental progress toward #29692 General monitor improvements Generally speaking, per #29692 we are trying to send monitor output to ES wherever possible. With this PR we now send any monitor initialization errors (such as a lack of ICMP kernel capabilities) during monitor creation to ES. We do this by allowing the monitor to initialize and run on schedule, even though we know it will always send the same error message. This lets users more easily debug issues in Kibana. ICMP Specific Improvement This PR also Removes broken a IP capability check that caused heartbeat to be unable to start. We now just rely on return codes from attempts to actually send packets. This is the more specific fix for #29346 . I was not able to exactly reproduce the exact customer reported issue, where the user somehow disabled ipv6 in a way that the ICMP loop that I can't exactly reproduce. I tried disabling ipv6 fully with sudo sysctl net.ipv6.conf.all.disable_ipv6=1 but that didn't yield the error in #29346 The logic is now simplified, there's no truly reliable way to know if you can send an ipv6 (or ipv4) ping before you send it (settings can change at any time! network cards can disappear!), so we just let the error codes happen as the check is executed. This is also generally a better UX in that the errors will now be visible in the Uptime app, not just the logs. It should be noted that the ipv4 and ipv6 boolean options only are documented to affect how DNS lookups happen. With this change the behavior matches the docs. Note that ICMP is a bit weird in that there's a single ICMP loop in heartbeat, and all monitors are really just interacting with that. Removal of .synthetics This also ignores the .synthetics folder which has been inconvenient for some time for devs, in that it dirties the git path (cherry picked from commit 616db13)
✅ Backports have been created
|
… (backport #29413) (#29896) * [Heartbeat] Defer monitor / ICMP errors to monitor runtime / ES (#29413) This PR generally improves the error behavior of all monitors, and some specific ICMP related errors as well. These two items are combined in one PR because the general theme here is improving the ICMP error experience, and improving ICMP required improving all monitors. Fixes #29346 and incremental progress toward #29692 General monitor improvements Generally speaking, per #29692 we are trying to send monitor output to ES wherever possible. With this PR we now send any monitor initialization errors (such as a lack of ICMP kernel capabilities) during monitor creation to ES. We do this by allowing the monitor to initialize and run on schedule, even though we know it will always send the same error message. This lets users more easily debug issues in Kibana. ICMP Specific Improvement This PR also Removes broken a IP capability check that caused heartbeat to be unable to start. We now just rely on return codes from attempts to actually send packets. This is the more specific fix for #29346 . I was not able to exactly reproduce the exact customer reported issue, where the user somehow disabled ipv6 in a way that the ICMP loop that I can't exactly reproduce. I tried disabling ipv6 fully with sudo sysctl net.ipv6.conf.all.disable_ipv6=1 but that didn't yield the error in #29346 The logic is now simplified, there's no truly reliable way to know if you can send an ipv6 (or ipv4) ping before you send it (settings can change at any time! network cards can disappear!), so we just let the error codes happen as the check is executed. This is also generally a better UX in that the errors will now be visible in the Uptime app, not just the logs. It should be noted that the ipv4 and ipv6 boolean options only are documented to affect how DNS lookups happen. With this change the behavior matches the docs. Note that ICMP is a bit weird in that there's a single ICMP loop in heartbeat, and all monitors are really just interacting with that. Removal of .synthetics This also ignores the .synthetics folder which has been inconvenient for some time for devs, in that it dirties the git path (cherry picked from commit 616db13) * [Heartbeat] Fix broken macOS ICMP test (#29900) Fixes broken macos python e2e test Co-authored-by: Andrew Cholakian <[email protected]> Co-authored-by: Justin Kambic <[email protected]>
…itor runtime / ES (#29892) * [Heartbeat] Defer monitor / ICMP errors to monitor runtime / ES (#29413) This PR generally improves the error behavior of all monitors, and some specific ICMP related errors as well. These two items are combined in one PR because the general theme here is improving the ICMP error experience, and improving ICMP required improving all monitors. Fixes #29346 and incremental progress toward #29692 General monitor improvements Generally speaking, per #29692 we are trying to send monitor output to ES wherever possible. With this PR we now send any monitor initialization errors (such as a lack of ICMP kernel capabilities) during monitor creation to ES. We do this by allowing the monitor to initialize and run on schedule, even though we know it will always send the same error message. This lets users more easily debug issues in Kibana. ICMP Specific Improvement This PR also Removes broken a IP capability check that caused heartbeat to be unable to start. We now just rely on return codes from attempts to actually send packets. This is the more specific fix for #29346 . I was not able to exactly reproduce the exact customer reported issue, where the user somehow disabled ipv6 in a way that the ICMP loop that I can't exactly reproduce. I tried disabling ipv6 fully with sudo sysctl net.ipv6.conf.all.disable_ipv6=1 but that didn't yield the error in #29346 The logic is now simplified, there's no truly reliable way to know if you can send an ipv6 (or ipv4) ping before you send it (settings can change at any time! network cards can disappear!), so we just let the error codes happen as the check is executed. This is also generally a better UX in that the errors will now be visible in the Uptime app, not just the logs. It should be noted that the ipv4 and ipv6 boolean options only are documented to affect how DNS lookups happen. With this change the behavior matches the docs. Note that ICMP is a bit weird in that there's a single ICMP loop in heartbeat, and all monitors are really just interacting with that. Removal of .synthetics This also ignores the .synthetics folder which has been inconvenient for some time for devs, in that it dirties the git path (cherry picked from commit 616db13) * [Heartbeat] Fix broken macOS ICMP test (#29900) Fixes broken macos python e2e test Co-authored-by: Andrew Cholakian <[email protected]> Co-authored-by: Justin Kambic <[email protected]>
* upstream/7.17: (30 commits) [7.17](backport #29966) Add the Elastic product origin header when talking to Elasticsearch or Kibana. (#30000) [Heartbeat] Change size of data on ICMP packet (#29948) (#29978) Add clarification about enableing dashboard loading (#29985) (#29989) Improve aws-s3 gzip file detection to avoid false negatives (#29969) (#29974) ci: docker login step for pulling then pushing (#29960) (#29963) x-pack/auditbeat/module/system/socket: get full length path and arg from /proc when not available from kprobe (#29410) (#29958) [Automation] Update elastic stack version to 7.17.0-ab4975a2 for testing (#29956) [Automation] Update elastic stack version to 7.17.0-1bd58b32 for testing (#29938) [7.17](backport #29913) [Metricbeat] gcp.gke: fix overview dashboard (#29914) [7.17](backport #29605) Fix annotation enrichment (#29834) [Automation] Update elastic stack version to 7.17.0-e1efbe3a for testing (#29922) [Automation] Update elastic stack version to 7.17.0-68da5d12 for testing (#29904) [7.17][Heartbeat] Defer monitor / ICMP errors to monitor runtime / ES (backport #29413) (#29896) Merge pull request from GHSA-rj4h-hqvq-cc6q [7.17](backport #29681) Change docker image from CentOS 7 to Ubuntu 20.04 (#29817) Fix YAML indentation in `parsers` examples (#29663) (#29894) [Automation] Update elastic stack version to 7.17.0-079761a0 for testing (#29864) Fix Filebeat dissect processor field tokenization in documentation (#29680) (#29883) Enable require_alias for Bulk requests for all actions when target is a write alias (#29879) Update Index template loading guide to use the correct endpoint (#29869) (#29877) ...
…tic#29413) This PR generally improves the error behavior of all monitors, and some specific ICMP related errors as well. These two items are combined in one PR because the general theme here is improving the ICMP error experience, and improving ICMP required improving all monitors. Fixes elastic#29346 and incremental progress toward elastic#29692 General monitor improvements Generally speaking, per elastic#29692 we are trying to send monitor output to ES wherever possible. With this PR we now send any monitor initialization errors (such as a lack of ICMP kernel capabilities) during monitor creation to ES. We do this by allowing the monitor to initialize and run on schedule, even though we know it will always send the same error message. This lets users more easily debug issues in Kibana. ICMP Specific Improvement This PR also Removes broken a IP capability check that caused heartbeat to be unable to start. We now just rely on return codes from attempts to actually send packets. This is the more specific fix for elastic#29346 . I was not able to exactly reproduce the exact customer reported issue, where the user somehow disabled ipv6 in a way that the ICMP loop that I can't exactly reproduce. I tried disabling ipv6 fully with sudo sysctl net.ipv6.conf.all.disable_ipv6=1 but that didn't yield the error in elastic#29346 The logic is now simplified, there's no truly reliable way to know if you can send an ipv6 (or ipv4) ping before you send it (settings can change at any time! network cards can disappear!), so we just let the error codes happen as the check is executed. This is also generally a better UX in that the errors will now be visible in the Uptime app, not just the logs. It should be noted that the ipv4 and ipv6 boolean options only are documented to affect how DNS lookups happen. With this change the behavior matches the docs. Note that ICMP is a bit weird in that there's a single ICMP loop in heartbeat, and all monitors are really just interacting with that. Removal of .synthetics This also ignores the .synthetics folder which has been inconvenient for some time for devs, in that it dirties the git path
* upstream/7.17: (30 commits) [7.17](backport elastic#29966) Add the Elastic product origin header when talking to Elasticsearch or Kibana. (elastic#30000) [Heartbeat] Change size of data on ICMP packet (elastic#29948) (elastic#29978) Add clarification about enableing dashboard loading (elastic#29985) (elastic#29989) Improve aws-s3 gzip file detection to avoid false negatives (elastic#29969) (elastic#29974) ci: docker login step for pulling then pushing (elastic#29960) (elastic#29963) x-pack/auditbeat/module/system/socket: get full length path and arg from /proc when not available from kprobe (elastic#29410) (elastic#29958) [Automation] Update elastic stack version to 7.17.0-ab4975a2 for testing (elastic#29956) [Automation] Update elastic stack version to 7.17.0-1bd58b32 for testing (elastic#29938) [7.17](backport elastic#29913) [Metricbeat] gcp.gke: fix overview dashboard (elastic#29914) [7.17](backport elastic#29605) Fix annotation enrichment (elastic#29834) [Automation] Update elastic stack version to 7.17.0-e1efbe3a for testing (elastic#29922) [Automation] Update elastic stack version to 7.17.0-68da5d12 for testing (elastic#29904) [7.17][Heartbeat] Defer monitor / ICMP errors to monitor runtime / ES (backport elastic#29413) (elastic#29896) Merge pull request from GHSA-rj4h-hqvq-cc6q [7.17](backport elastic#29681) Change docker image from CentOS 7 to Ubuntu 20.04 (elastic#29817) Fix YAML indentation in `parsers` examples (elastic#29663) (elastic#29894) [Automation] Update elastic stack version to 7.17.0-079761a0 for testing (elastic#29864) Fix Filebeat dissect processor field tokenization in documentation (elastic#29680) (elastic#29883) Enable require_alias for Bulk requests for all actions when target is a write alias (elastic#29879) Update Index template loading guide to use the correct endpoint (elastic#29869) (elastic#29877) ...
What does this PR do?
This PR generally improves the error behavior of all monitors, and some specific ICMP related errors as well. These two items are combined in one PR because the general theme here is improving the ICMP error experience, and improving ICMP required improving all monitors.
Fixes #29346
and incremental progress toward #29692
General monitor improvements
Generally speaking, per #29692 we are trying to send monitor output to ES wherever possible. With this PR we now send any monitor initialization errors (such as a lack of ICMP kernel capabilities) during monitor creation to ES. We do this by allowing the monitor to initialize and run on schedule, even though we know it will always send the same error message. This lets users more easily debug issues in Kibana.
ICMP Specific Improvement
This PR also Removes broken a IP capability check that caused heartbeat to be unable to start. We now just rely on return codes from attempts to actually send packets. This is the more specific fix for #29346 . I was not able to exactly reproduce the exact customer reported issue, where the user somehow disabled ipv6 in a way that the ICMP loop that I can't exactly reproduce. I tried disabling ipv6 fully with
sudo sysctl net.ipv6.conf.all.disable_ipv6=1
but that didn't yield the error in #29346The logic is now simplified, there's no truly reliable way to know if you can send an ipv6 (or ipv4) ping before you send it (settings can change at any time! network cards can disappear!), so we just let the error codes happen as the check is executed. This is also generally a better UX in that the errors will now be visible in the Uptime app, not just the logs.
It should be noted that the
ipv4
andipv6
boolean options only are documented to affect how DNS lookups happen. With this change the behavior matches the docs.Note that ICMP is a bit weird in that there's a single ICMP loop in heartbeat, and all monitors are really just interacting with that.
Removal of .synthetics
This also ignores the
.synthetics
folder which has been inconvenient for some time for devs, in that it dirties the git pathChecklist
CHANGELOG.next.asciidoc
orCHANGELOG-developer.next.asciidoc
.How to test this PR locally
This PR MUST be tested locally given that IPv6 checks don't really happen right on CI. It's also really hard to replicate broken environments since there are different ways IPv6 can fail, and I wasn't even able to replicate the original customer complaint behind #29346
We will need to rely on code review more than testing here.
Some useful monitor configs for testing this are: