Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Heartbeat] Defer monitor / ICMP errors to monitor runtime / ES #29413

Merged
merged 12 commits into from
Jan 18, 2022

Conversation

andrewvc
Copy link
Contributor

@andrewvc andrewvc commented Dec 14, 2021

What does this PR do?

This PR generally improves the error behavior of all monitors, and some specific ICMP related errors as well. These two items are combined in one PR because the general theme here is improving the ICMP error experience, and improving ICMP required improving all monitors.

Fixes #29346
and incremental progress toward #29692

General monitor improvements

Generally speaking, per #29692 we are trying to send monitor output to ES wherever possible. With this PR we now send any monitor initialization errors (such as a lack of ICMP kernel capabilities) during monitor creation to ES. We do this by allowing the monitor to initialize and run on schedule, even though we know it will always send the same error message. This lets users more easily debug issues in Kibana.

ICMP Specific Improvement

This PR also Removes broken a IP capability check that caused heartbeat to be unable to start. We now just rely on return codes from attempts to actually send packets. This is the more specific fix for #29346 . I was not able to exactly reproduce the exact customer reported issue, where the user somehow disabled ipv6 in a way that the ICMP loop that I can't exactly reproduce. I tried disabling ipv6 fully with sudo sysctl net.ipv6.conf.all.disable_ipv6=1 but that didn't yield the error in #29346

The logic is now simplified, there's no truly reliable way to know if you can send an ipv6 (or ipv4) ping before you send it (settings can change at any time! network cards can disappear!), so we just let the error codes happen as the check is executed. This is also generally a better UX in that the errors will now be visible in the Uptime app, not just the logs.

It should be noted that the ipv4 and ipv6 boolean options only are documented to affect how DNS lookups happen. With this change the behavior matches the docs.

Note that ICMP is a bit weird in that there's a single ICMP loop in heartbeat, and all monitors are really just interacting with that.

Removal of .synthetics

This also ignores the .synthetics folder which has been inconvenient for some time for devs, in that it dirties the git path

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

How to test this PR locally

This PR MUST be tested locally given that IPv6 checks don't really happen right on CI. It's also really hard to replicate broken environments since there are different ways IPv6 can fail, and I wasn't even able to replicate the original customer complaint behind #29346

We will need to rely on code review more than testing here.

Some useful monitor configs for testing this are:

heartbeat.monitors:
- type: icmp
  enabled: true
  id: public-ipv6
  hosts: "2600::"
  schedule: '@every 10s'
- type: icmp
  enabled: true
  id: public-ipv4
  hosts: "google.com"
  schedule: '@every 10s'
- type: icmp
  enabled: true
  id: public-ipv4-and-ipv6
  hosts: "netflix.com"
  mode: all # netflix has A and AAAA records
  schedule: '@every 10s'
# Monitor this with tail -f /tmp/hbout/* | jq .monitor.ip,.error
output.file.path: /tmp/hbout

@andrewvc andrewvc added bug v8.0.0 backport-v8.0.0 Automated backport with mergify backport-v7.16.0 Automated backport with mergify labels Dec 14, 2021
@andrewvc andrewvc self-assigned this Dec 14, 2021
@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Dec 14, 2021
@elasticmachine
Copy link
Collaborator

elasticmachine commented Dec 14, 2021

💚 Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS
Pipeline View Test View Changes Artifacts preview preview

Expand to view the summary

Build stats

  • Start Time: 2022-01-18T13:25:38.177+0000

  • Duration: 67 min 17 sec

  • Commit: e9db6b3

Test stats 🧪

Test Results
Failed 0
Passed 3295
Skipped 71
Total 3366

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@andrewvc andrewvc changed the title Report IP version errors at execution time, not startup [Heartbeat] Defer monitor / ICMP errors to monitor runtime / ES Jan 13, 2022
@andrewvc andrewvc added Heartbeat Team:obs-ds-hosted-services Label for the Observability Hosted Services team labels Jan 13, 2022
@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Jan 13, 2022
@@ -91,10 +91,6 @@ func (jf *jobFactory) checkConfig() error {
}

func (jf *jobFactory) makePlugin() (plugin2 plugin.Plugin, err error) {
if err := jf.loop.checkNetworkMode(jf.ipVersion); err != nil {
Copy link
Contributor Author

@andrewvc andrewvc Jan 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now checked in stdloop.go, we don't want to check at plugin initialization, just at exec time

@@ -159,29 +158,6 @@ func newICMPLoop() (*stdICMPLoop, error) {
return l, nil
}

func (l *stdICMPLoop) checkNetworkMode(mode string) error {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved to main loop icmp recv so we don't interfere with heartbeat initialization.

@@ -379,7 +363,7 @@ func (l *stdICMPLoop) sendEchoRequest(addr *net.IPAddr) (*requestContext, error)

_, err := conn.WriteTo(encoded, addr)
if err != nil {
return nil, err
return nil, fmt.Errorf("could not write to conn: %w", err)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

more specific error context is always nice

@andrewvc andrewvc marked this pull request as ready for review January 14, 2022 22:14
@andrewvc andrewvc requested a review from a team as a code owner January 14, 2022 22:14
@elasticmachine
Copy link
Collaborator

Pinging @elastic/uptime (Team:Uptime)

Copy link
Contributor

@lucasfcosta lucasfcosta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've started doing the E2E testing and so far one of the things that I couldn't get to is to actually see the errors in the lines below, which trigger specific IPv6 and IPv4 error messages depending on whether the stdICMPLoop has ipv4 or ipv6 capabilities.

isIPv6 := addr.IP.To4() == nil
if isIPv6 && l.conn6 == nil {
  return -1, -1, fmt.Errorf("cannot ping IPv6 address '%s', no IPv6 connection available", addr)
}
if !isIPv6 && l.conn4 == nil {
  return -1, -1, fmt.Errorf("cannot ping IPv4 address '%s', no IPv4 connection available", addr)
}

For that, I tried, for example, disabling my connection's ipv6 support with networksetup -setv6off Wi-Fi (note that I'm only using Wi-Fi on a Mac).

For some reason it seems that the the createListener function never assigns to err, but does assign to conn, and thus we never get a false ipv6 support. To confirm that, I've logged the value of err when we create the listener:

func createListener(name, network string) *icmp.PacketConn {
	conn, err := icmp.ListenPacket(network, "")

	fmt.Println("CONN", conn)
	fmt.Println("ERR", err)

	// XXX: need to check for conn == nil, as 'err != nil' seems always to be
	//      true, even if error value itself is `nil`. Checking for conn suppresses
	//      misleading log message.
	if conn == nil && err != nil {
		return nil
	}
	return conn
}

I'll do some more digging into this to see if I'm doing anything wrong in my tests or whether I could offer a suggestion on how to fix this. For now, I've been stuck on this for a bit.

Btw, for further testers, it's worth highlighting you will need adequate permissions so that createListener can work. In my case, I just quickly ran heartbeat with sudo and made sure that root owned my config file.

On another note, I've also seen we have a mock function which doesn't seem to be necessary anymore as we've removed checkNetworkMode:

func (t mockLoop) checkNetworkMode(mode string) error {
	return t.checkNetworkModeErr
}

Let me do some more testing and get back to you as soon as I can with more specific feedback on what could be going wrong in these tests.

Comment on lines +175 to +180
// Note, needed to hoist err to this scope, not just to add a prefix
fullErr := fmt.Errorf("job could not be initialized: %s", err)
// A placeholder job that always returns an error
p.Jobs = []jobs.Job{func(event *beat.Event) ([]jobs.Job, error) {
return nil, fullErr
}}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great! Thanks for the helpful comment on the top too. This monitor initialisation part is really elegant, but having the comment on the top does help a lot given the further wrapping and how the jobs then get queued.

@@ -0,0 +1 @@
.synthetics
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🙏

Copy link
Contributor

@lucasfcosta lucasfcosta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've spent some more significant time on this today with testing and familiarising myself with the codebase and the code itself LGTM, but please notice that I couldn't reproduce the exact problem we've seen on the SDH where there were no ipv6 capabilities enabled at all in the user's system. On my Mac, it seems like I can't fully disable IPv6 in a way which causes Go to throw an error when creating the listener here. Despite my attempts at fully disabling IPv6 capabilities on my Mac, Go will succeed in creating a conn, and thus we'll not see the errors "no IPvX connection available".

Looking through the net code itself, I also couldn't find a meaningful way to trigger these errors without messing with the code. Perhaps more experienced Go developers/systems programmers could help me figure out what's needed to make Go fail in creating a conn. That would be much appreciated.

I have also tried to run this code with dlv and see whether any of the other members of PacketConn (the type of conn) could be used to determine whether we have IPvX capabilities ahead of time, but I couldn't see anything useful there as well. Reference: https://pkg.go.dev/golang.org/x/net/icmp#PacketConn.

I also confirmed that I could still ping my own IPv6 loopback interface, and external IPv4.

Besides that, my only other comment is non-blocking, and related to tests.

Comment on lines 52 to 56
if has_failed_message():
proc.check_kill_and_wait(1)
assert monitor_status == "down"
self.assertRegex(monitor_error, ".*Insufficient privileges to perform ICMP ping.*")
else:
# Check that documents are moving through
self.wait_until(lambda: self.output_has(lines=1))
assert monitor_status == "up"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've seen we have the following comment in the code to explain why we have these different assertions, which makes this test really test different things depending on the underlying environment:

We don't know if the system tests are running is configured to support or not support ping, but we can at least check that the ICMP loop was initiated. In the future we should start up VMs with the correct perms configured and be more specific. In addition to that we should run pings on those machines and make sure they work.

However, I wonder whether we should just add a further verification above to check whether we can perform ICMP pings and then we can use that to decide which logs and statuses we'll look for.

Right now we use Heartbeat's logs to assert on its behaviour, while the suggestion above would help us use Python to look into the underlying platform and then assert on Heartbeat's behaviour. In other words, even though it would be a conditional test, its conditions would not depend only on Heartbeat.

This is not a blocking comment btw, and you can consider this to be outside the scope of this PR if you judge it to be unnecessary for now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think it's a good point for a follow-up PR perhaps.

The only to determine if you can do a ping is to actually do one sadly or maybe to use getcap.

The tricky thing from a CI perspective is you need to see if you have permission, somehow, to run sudo setcap cap_net_raw+eip on the heartbeat binary. I think we'd need to execute this from python as well.

We probably need this logic regardless because on most people's laptops the no pings behavior will be the default.

I'm not averse to us reaching out to robots to fix this, but I don't think it's a priority given the ultra-low churn WRT the ICMP code.

@andrewvc andrewvc merged commit 616db13 into elastic:master Jan 18, 2022
@andrewvc andrewvc deleted the fix-icmp branch January 18, 2022 15:09
mergify bot pushed a commit that referenced this pull request Jan 18, 2022
This PR generally improves the error behavior of all monitors, and some specific ICMP related errors as well. These two items are combined in one PR because the general theme here is improving the ICMP error experience, and improving ICMP required improving all monitors.

Fixes #29346
and incremental progress toward #29692

General monitor improvements
Generally speaking, per #29692 we are trying to send monitor output to ES wherever possible. With this PR we now send any monitor initialization errors (such as a lack of ICMP kernel capabilities) during monitor creation to ES. We do this by allowing the monitor to initialize and run on schedule, even though we know it will always send the same error message. This lets users more easily debug issues in Kibana.

ICMP Specific Improvement
This PR also Removes broken a IP capability check that caused heartbeat to be unable to start. We now just rely on return codes from attempts to actually send packets. This is the more specific fix for #29346 . I was not able to exactly reproduce the exact customer reported issue, where the user somehow disabled ipv6 in a way that the ICMP loop that I can't exactly reproduce. I tried disabling ipv6 fully with sudo sysctl net.ipv6.conf.all.disable_ipv6=1 but that didn't yield the error in #29346

The logic is now simplified, there's no truly reliable way to know if you can send an ipv6 (or ipv4) ping before you send it (settings can change at any time! network cards can disappear!), so we just let the error codes happen as the check is executed. This is also generally a better UX in that the errors will now be visible in the Uptime app, not just the logs.

It should be noted that the ipv4 and ipv6 boolean options only are documented to affect how DNS lookups happen. With this change the behavior matches the docs.

Note that ICMP is a bit weird in that there's a single ICMP loop in heartbeat, and all monitors are really just interacting with that.

Removal of .synthetics
This also ignores the .synthetics folder which has been inconvenient for some time for devs, in that it dirties the git path

(cherry picked from commit 616db13)
mergify bot pushed a commit that referenced this pull request Jan 18, 2022
This PR generally improves the error behavior of all monitors, and some specific ICMP related errors as well. These two items are combined in one PR because the general theme here is improving the ICMP error experience, and improving ICMP required improving all monitors.

Fixes #29346
and incremental progress toward #29692

General monitor improvements
Generally speaking, per #29692 we are trying to send monitor output to ES wherever possible. With this PR we now send any monitor initialization errors (such as a lack of ICMP kernel capabilities) during monitor creation to ES. We do this by allowing the monitor to initialize and run on schedule, even though we know it will always send the same error message. This lets users more easily debug issues in Kibana.

ICMP Specific Improvement
This PR also Removes broken a IP capability check that caused heartbeat to be unable to start. We now just rely on return codes from attempts to actually send packets. This is the more specific fix for #29346 . I was not able to exactly reproduce the exact customer reported issue, where the user somehow disabled ipv6 in a way that the ICMP loop that I can't exactly reproduce. I tried disabling ipv6 fully with sudo sysctl net.ipv6.conf.all.disable_ipv6=1 but that didn't yield the error in #29346

The logic is now simplified, there's no truly reliable way to know if you can send an ipv6 (or ipv4) ping before you send it (settings can change at any time! network cards can disappear!), so we just let the error codes happen as the check is executed. This is also generally a better UX in that the errors will now be visible in the Uptime app, not just the logs.

It should be noted that the ipv4 and ipv6 boolean options only are documented to affect how DNS lookups happen. With this change the behavior matches the docs.

Note that ICMP is a bit weird in that there's a single ICMP loop in heartbeat, and all monitors are really just interacting with that.

Removal of .synthetics
This also ignores the .synthetics folder which has been inconvenient for some time for devs, in that it dirties the git path

(cherry picked from commit 616db13)
@andrewvc
Copy link
Contributor Author

@Mergifyio backport 7.17

mergify bot pushed a commit that referenced this pull request Jan 18, 2022
This PR generally improves the error behavior of all monitors, and some specific ICMP related errors as well. These two items are combined in one PR because the general theme here is improving the ICMP error experience, and improving ICMP required improving all monitors.

Fixes #29346
and incremental progress toward #29692

General monitor improvements
Generally speaking, per #29692 we are trying to send monitor output to ES wherever possible. With this PR we now send any monitor initialization errors (such as a lack of ICMP kernel capabilities) during monitor creation to ES. We do this by allowing the monitor to initialize and run on schedule, even though we know it will always send the same error message. This lets users more easily debug issues in Kibana.

ICMP Specific Improvement
This PR also Removes broken a IP capability check that caused heartbeat to be unable to start. We now just rely on return codes from attempts to actually send packets. This is the more specific fix for #29346 . I was not able to exactly reproduce the exact customer reported issue, where the user somehow disabled ipv6 in a way that the ICMP loop that I can't exactly reproduce. I tried disabling ipv6 fully with sudo sysctl net.ipv6.conf.all.disable_ipv6=1 but that didn't yield the error in #29346

The logic is now simplified, there's no truly reliable way to know if you can send an ipv6 (or ipv4) ping before you send it (settings can change at any time! network cards can disappear!), so we just let the error codes happen as the check is executed. This is also generally a better UX in that the errors will now be visible in the Uptime app, not just the logs.

It should be noted that the ipv4 and ipv6 boolean options only are documented to affect how DNS lookups happen. With this change the behavior matches the docs.

Note that ICMP is a bit weird in that there's a single ICMP loop in heartbeat, and all monitors are really just interacting with that.

Removal of .synthetics
This also ignores the .synthetics folder which has been inconvenient for some time for devs, in that it dirties the git path

(cherry picked from commit 616db13)
@mergify
Copy link
Contributor

mergify bot commented Jan 18, 2022

backport 7.17

✅ Backports have been created

andrewvc added a commit that referenced this pull request Jan 19, 2022
… (backport #29413) (#29896)

* [Heartbeat] Defer monitor / ICMP errors to monitor runtime / ES (#29413)

This PR generally improves the error behavior of all monitors, and some specific ICMP related errors as well. These two items are combined in one PR because the general theme here is improving the ICMP error experience, and improving ICMP required improving all monitors.

Fixes #29346
and incremental progress toward #29692

General monitor improvements
Generally speaking, per #29692 we are trying to send monitor output to ES wherever possible. With this PR we now send any monitor initialization errors (such as a lack of ICMP kernel capabilities) during monitor creation to ES. We do this by allowing the monitor to initialize and run on schedule, even though we know it will always send the same error message. This lets users more easily debug issues in Kibana.

ICMP Specific Improvement
This PR also Removes broken a IP capability check that caused heartbeat to be unable to start. We now just rely on return codes from attempts to actually send packets. This is the more specific fix for #29346 . I was not able to exactly reproduce the exact customer reported issue, where the user somehow disabled ipv6 in a way that the ICMP loop that I can't exactly reproduce. I tried disabling ipv6 fully with sudo sysctl net.ipv6.conf.all.disable_ipv6=1 but that didn't yield the error in #29346

The logic is now simplified, there's no truly reliable way to know if you can send an ipv6 (or ipv4) ping before you send it (settings can change at any time! network cards can disappear!), so we just let the error codes happen as the check is executed. This is also generally a better UX in that the errors will now be visible in the Uptime app, not just the logs.

It should be noted that the ipv4 and ipv6 boolean options only are documented to affect how DNS lookups happen. With this change the behavior matches the docs.

Note that ICMP is a bit weird in that there's a single ICMP loop in heartbeat, and all monitors are really just interacting with that.

Removal of .synthetics
This also ignores the .synthetics folder which has been inconvenient for some time for devs, in that it dirties the git path

(cherry picked from commit 616db13)

* [Heartbeat] Fix broken macOS ICMP test (#29900)

Fixes broken macos python e2e test

Co-authored-by: Andrew Cholakian <[email protected]>
Co-authored-by: Justin Kambic <[email protected]>
andrewvc added a commit that referenced this pull request Jan 19, 2022
…itor runtime / ES (#29892)

* [Heartbeat] Defer monitor / ICMP errors to monitor runtime / ES (#29413)

This PR generally improves the error behavior of all monitors, and some specific ICMP related errors as well. These two items are combined in one PR because the general theme here is improving the ICMP error experience, and improving ICMP required improving all monitors.

Fixes #29346
and incremental progress toward #29692

General monitor improvements
Generally speaking, per #29692 we are trying to send monitor output to ES wherever possible. With this PR we now send any monitor initialization errors (such as a lack of ICMP kernel capabilities) during monitor creation to ES. We do this by allowing the monitor to initialize and run on schedule, even though we know it will always send the same error message. This lets users more easily debug issues in Kibana.

ICMP Specific Improvement
This PR also Removes broken a IP capability check that caused heartbeat to be unable to start. We now just rely on return codes from attempts to actually send packets. This is the more specific fix for #29346 . I was not able to exactly reproduce the exact customer reported issue, where the user somehow disabled ipv6 in a way that the ICMP loop that I can't exactly reproduce. I tried disabling ipv6 fully with sudo sysctl net.ipv6.conf.all.disable_ipv6=1 but that didn't yield the error in #29346

The logic is now simplified, there's no truly reliable way to know if you can send an ipv6 (or ipv4) ping before you send it (settings can change at any time! network cards can disappear!), so we just let the error codes happen as the check is executed. This is also generally a better UX in that the errors will now be visible in the Uptime app, not just the logs.

It should be noted that the ipv4 and ipv6 boolean options only are documented to affect how DNS lookups happen. With this change the behavior matches the docs.

Note that ICMP is a bit weird in that there's a single ICMP loop in heartbeat, and all monitors are really just interacting with that.

Removal of .synthetics
This also ignores the .synthetics folder which has been inconvenient for some time for devs, in that it dirties the git path

(cherry picked from commit 616db13)

* [Heartbeat] Fix broken macOS ICMP test (#29900)

Fixes broken macos python e2e test

Co-authored-by: Andrew Cholakian <[email protected]>
Co-authored-by: Justin Kambic <[email protected]>
v1v added a commit that referenced this pull request Jan 28, 2022
* upstream/7.17: (30 commits)
  [7.17](backport #29966) Add the Elastic product origin header when talking to Elasticsearch or Kibana. (#30000)
  [Heartbeat] Change size of data on ICMP packet (#29948) (#29978)
  Add clarification about enableing dashboard loading (#29985) (#29989)
  Improve aws-s3 gzip file detection to avoid false negatives (#29969) (#29974)
  ci: docker login step for pulling then pushing (#29960) (#29963)
  x-pack/auditbeat/module/system/socket: get full length path and arg from /proc when not available from kprobe (#29410) (#29958)
  [Automation] Update elastic stack version to 7.17.0-ab4975a2 for testing (#29956)
  [Automation] Update elastic stack version to 7.17.0-1bd58b32 for testing (#29938)
  [7.17](backport #29913) [Metricbeat] gcp.gke: fix overview dashboard (#29914)
  [7.17](backport #29605) Fix annotation enrichment (#29834)
  [Automation] Update elastic stack version to 7.17.0-e1efbe3a for testing (#29922)
  [Automation] Update elastic stack version to 7.17.0-68da5d12 for testing (#29904)
  [7.17][Heartbeat] Defer monitor / ICMP errors to monitor runtime / ES (backport #29413) (#29896)
  Merge pull request from GHSA-rj4h-hqvq-cc6q
  [7.17](backport #29681) Change docker image from CentOS 7 to Ubuntu 20.04 (#29817)
  Fix YAML indentation in `parsers` examples (#29663) (#29894)
  [Automation] Update elastic stack version to 7.17.0-079761a0 for testing (#29864)
  Fix Filebeat dissect processor field tokenization in documentation (#29680) (#29883)
  Enable require_alias for Bulk requests for all actions when target is a write alias (#29879)
  Update Index template loading guide to use the correct endpoint (#29869) (#29877)
  ...
yashtewari pushed a commit to build-security/beats that referenced this pull request Jan 30, 2022
…tic#29413)

This PR generally improves the error behavior of all monitors, and some specific ICMP related errors as well. These two items are combined in one PR because the general theme here is improving the ICMP error experience, and improving ICMP required improving all monitors.

Fixes elastic#29346
and incremental progress toward elastic#29692

General monitor improvements
Generally speaking, per elastic#29692 we are trying to send monitor output to ES wherever possible. With this PR we now send any monitor initialization errors (such as a lack of ICMP kernel capabilities) during monitor creation to ES. We do this by allowing the monitor to initialize and run on schedule, even though we know it will always send the same error message. This lets users more easily debug issues in Kibana.

ICMP Specific Improvement
This PR also Removes broken a IP capability check that caused heartbeat to be unable to start. We now just rely on return codes from attempts to actually send packets. This is the more specific fix for elastic#29346 . I was not able to exactly reproduce the exact customer reported issue, where the user somehow disabled ipv6 in a way that the ICMP loop that I can't exactly reproduce. I tried disabling ipv6 fully with sudo sysctl net.ipv6.conf.all.disable_ipv6=1 but that didn't yield the error in elastic#29346

The logic is now simplified, there's no truly reliable way to know if you can send an ipv6 (or ipv4) ping before you send it (settings can change at any time! network cards can disappear!), so we just let the error codes happen as the check is executed. This is also generally a better UX in that the errors will now be visible in the Uptime app, not just the logs.

It should be noted that the ipv4 and ipv6 boolean options only are documented to affect how DNS lookups happen. With this change the behavior matches the docs.

Note that ICMP is a bit weird in that there's a single ICMP loop in heartbeat, and all monitors are really just interacting with that.

Removal of .synthetics
This also ignores the .synthetics folder which has been inconvenient for some time for devs, in that it dirties the git path
leweafan pushed a commit to leweafan/beats that referenced this pull request Apr 28, 2023
* upstream/7.17: (30 commits)
  [7.17](backport elastic#29966) Add the Elastic product origin header when talking to Elasticsearch or Kibana. (elastic#30000)
  [Heartbeat] Change size of data on ICMP packet (elastic#29948) (elastic#29978)
  Add clarification about enableing dashboard loading (elastic#29985) (elastic#29989)
  Improve aws-s3 gzip file detection to avoid false negatives (elastic#29969) (elastic#29974)
  ci: docker login step for pulling then pushing (elastic#29960) (elastic#29963)
  x-pack/auditbeat/module/system/socket: get full length path and arg from /proc when not available from kprobe (elastic#29410) (elastic#29958)
  [Automation] Update elastic stack version to 7.17.0-ab4975a2 for testing (elastic#29956)
  [Automation] Update elastic stack version to 7.17.0-1bd58b32 for testing (elastic#29938)
  [7.17](backport elastic#29913) [Metricbeat] gcp.gke: fix overview dashboard (elastic#29914)
  [7.17](backport elastic#29605) Fix annotation enrichment (elastic#29834)
  [Automation] Update elastic stack version to 7.17.0-e1efbe3a for testing (elastic#29922)
  [Automation] Update elastic stack version to 7.17.0-68da5d12 for testing (elastic#29904)
  [7.17][Heartbeat] Defer monitor / ICMP errors to monitor runtime / ES (backport elastic#29413) (elastic#29896)
  Merge pull request from GHSA-rj4h-hqvq-cc6q
  [7.17](backport elastic#29681) Change docker image from CentOS 7 to Ubuntu 20.04 (elastic#29817)
  Fix YAML indentation in `parsers` examples (elastic#29663) (elastic#29894)
  [Automation] Update elastic stack version to 7.17.0-079761a0 for testing (elastic#29864)
  Fix Filebeat dissect processor field tokenization in documentation (elastic#29680) (elastic#29883)
  Enable require_alias for Bulk requests for all actions when target is a write alias (elastic#29879)
  Update Index template loading guide to use the correct endpoint (elastic#29869) (elastic#29877)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-v7.16.0 Automated backport with mergify backport-v8.0.0 Automated backport with mergify bug Heartbeat Team:obs-ds-hosted-services Label for the Observability Hosted Services team v7.17.0 v8.0.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Heartbeat] ICMP checks fail on systems without ipv6, even if they are only checking ICMPv4
3 participants