Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Log any FQDN lookup errors and fallback to OS-reported hostname #34946

Merged
merged 16 commits into from
Mar 30, 2023
Merged

Log any FQDN lookup errors and fallback to OS-reported hostname #34946

merged 16 commits into from
Mar 30, 2023

Conversation

ycombinator
Copy link
Contributor

@ycombinator ycombinator commented Mar 28, 2023

What does this PR do?

This PR fixes a bug wherein a Beat would fail to start if the FQDN lookup failed.

Why is it important?

FQDN lookup is "best effort". As such, if the lookup fails for some reason, we should not fail to start a Beat or halt execution otherwise. Instead, we should fallback to the OS-provided hostname and continue execution.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

How to test this PR locally

  1. Build Filebeat (or any other Beat) with this PR.

    cd beats/filebeat
    mage clean build
  2. Set hostname to something that wouldn't resolve via DNS. On MacOS this can be done via:

    orig_hostname=$(hostname -f)
    sudo scutil --set HostName cowchicken
  3. Run filebeat version. This command should report the version as usual and NOT fail with an FQDN-related error (as reported in Cannot start Beats - fails with error: could not get FQDN #34910)

    ./filebeat version
    filebeat version 8.8.0 (arm64), libbeat 8.8.0 [528128f44ee870cedca830d8acbf63d8ee35ae26 built 2023-03-28 16:45:09 +0000 UTC]
    
  4. Create a configuration file for the Beat that enables the add_host_metadata processor and also enables the FQDN feature flag.

    cat <<EOF >filebeat.test.yml
    filebeat.inputs:
    - type: stdin
    
    processors:
    - add_host_metadata: ~
    
    features.fqdn.enabled: true
    output.console.enabled: true
    EOF
  5. Start the Beat with the above configuration and look for log entries mentioning FQDN. You should see two log entries in all, one from Beat initialization and one from the add_host_metadata processor.

    ./filebeat -c ./filebeat.test.yml -e 2>&1 | grep -i fqdn
    {"log.level":"warn","@timestamp":"2023-03-28T11:38:17.639-0700","log.origin":{"file.name":"instance/beat.go","file.line":802},"message":"unable to lookup FQDN: could not get FQDN, all methods failed: failed looking up CNAME: lookup cowchicken: no such host: failed looking up IP: lookup cowchicken: no such host, using hostname = cowchicken as FQDN","service.name":"filebeat","ecs.version":"1.6.0"}
    {"log.level":"warn","@timestamp":"2023-03-28T11:38:17.642-0700","log.logger":"add_host_metadata","log.origin":{"file.name":"add_host_metadata/add_host_metadata.go","file.line":152},"message":"unable to lookup FQDN: could not get FQDN, all methods failed: failed looking up CNAME: lookup cowchicken: no such host: failed looking up IP: lookup cowchicken: no such host, using hostname = cowchicken as FQDN","service.name":"filebeat","ecs.version":"1.6.0"}
    
  6. Reset hostname at the end of the test. On MacOS this can be done via:

    sudo scutil --set HostName $orig_hostname

Related issues

Use cases

Screenshots

Logs

@ycombinator ycombinator added bug Team:Elastic-Agent Label for the Agent team backport-v8.7.0 Automated backport with mergify labels Mar 28, 2023
@botelastic botelastic bot added needs_team Indicates that the issue/PR needs a Team:* label and removed needs_team Indicates that the issue/PR needs a Team:* label labels Mar 28, 2023
@elasticmachine
Copy link
Collaborator

elasticmachine commented Mar 28, 2023

💚 Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS
Pipeline View Test View Changes Artifacts preview preview

Expand to view the summary

Build stats

  • Start Time: 2023-03-30T00:17:36.901+0000

  • Duration: 56 min 48 sec

❕ Flaky test report

No test was executed to be analysed.

🤖 GitHub comments

Expand to view the GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@ycombinator ycombinator marked this pull request as ready for review March 28, 2023 19:11
@ycombinator ycombinator requested a review from a team as a code owner March 28, 2023 19:11
@ycombinator ycombinator requested review from rdner and faec and removed request for a team March 28, 2023 19:11
@elasticmachine
Copy link
Collaborator

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

@ycombinator ycombinator requested a review from andrewkroh March 28, 2023 19:11
if err != nil {
// FQDN lookup is "best effort". We log the error, fallback to
// the OS-reported hostname, and move on.
p.logger.Warnf("unable to lookup FQDN: %s, using hostname = %s as FQDN", err.Error(), hostname)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How often would this be logged? We may want to lower this to Debug to avoid unnecessarily spamming the logs when this fails.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It feels like we would be better off with a metrics counter for how often this resolution has failed.

That would tell us it's failing without worrying about log spam, and we could then turn on debug logging to see exactly what is happening.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How often would this be logged? We may want to lower this to Debug to avoid unnecessarily spamming the logs when this fails.

Whenever the add_host_metadata cache expires. By default, that's every 5 minutes. That's quite frequent, so agreed on lowering the log level here to debug.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It feels like we would be better off with a metrics counter for how often this resolution has failed.

That would tell us it's failing without worrying about log spam, and we could then turn on debug logging to see exactly what is happening.

Makes sense. Let me look into implementing a metrics counter.

@cmacknz
Copy link
Member

cmacknz commented Mar 28, 2023

I assume writing a regression test for this is challenging because it requires us to modify the hostname of the system running the tests to one that doesn't resolve via DNS?

@cmacknz
Copy link
Member

cmacknz commented Mar 28, 2023

Also your test steps are excellent, thank you! I followed them exactly and the lack of fqdn logs caught that my attempt to switch branches failed because I had local file modifications to discard first :p

@ycombinator
Copy link
Contributor Author

I assume writing a regression test for this is challenging because it requires us to modify the hostname of the system running the tests to one that doesn't resolve via DNS?

Yes. Perhaps we can do something with a Docker testcontainer, though. Let me look into that.

@leehinman
Copy link
Contributor

I assume writing a regression test for this is challenging because it requires us to modify the hostname of the system running the tests to one that doesn't resolve via DNS?

Yes. Perhaps we can do something with a Docker testcontainer, though. Let me look into that.

maybe https://github.com/foxcpp/go-mockdns ?

@ycombinator
Copy link
Contributor Author

Thanks to @leehinman's suggestion about https://github.com/foxcpp/go-mockdns, I was able to implement a couple of unit test cases - one where the FQDN lookup will succeed and one where it will fail. I've also implemented the metrics counter and dropped the log level for the FQDN lookup failure in the add_host_metadata processor to debug, as suggested by @cmacknz.

This PR is ready for review again.

@ycombinator ycombinator requested a review from cmacknz March 29, 2023 09:57
// New constructs a new add_host_metadata processor.
func New(cfg *config.C) (processors.Processor, error) {
c := defaultConfig()
if err := cfg.Unpack(&c); err != nil {
return nil, fmt.Errorf("fail to unpack the %v configuration: %w", processorName, err)
}

// Logging and metrics (each processor instance has a unique ID).
var (
id = int(instanceID.Inc())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First time I've seen this way of getting around the namespace collision issue. CC @fearful-symmetry, this could help the version of this problem you have in the shipper.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

huh, interesting!

Copy link
Member

@cmacknz cmacknz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. My comment about the processors stats being per instance of the processor is more of a nit.

Thanks for finding a way to test this!

@ycombinator
Copy link
Contributor Author

/test

@ycombinator ycombinator merged commit 895505c into elastic:main Mar 30, 2023
@ycombinator ycombinator deleted the fqdn-lookup-errors branch March 30, 2023 01:13
mergify bot pushed a commit that referenced this pull request Mar 30, 2023
* Log any FQDN lookup errors, fallback to OS hostname, and move on.

* Implement fallback logic

* Bumping up go-sysinfo dependency version

* Update call to host.ReportInfo

* Better log message

* Bumping up version on elastic-agent-system-metrics dependency

* Updating NOTICE.txt

* Log FQDN lookup failure as warning

* Move FQDN initialization to after logging has been configured

* Better log message

* Add metric + test

* Update NOTICE.txt

* Use a single namespace for all add_host_metadata processor instances' monitoring

* Fixing imports

* Create monitoring registry only once

* Check errors

(cherry picked from commit 895505c)

# Conflicts:
#	NOTICE.txt
#	go.mod
#	go.sum
#	libbeat/cmd/instance/beat.go
#	libbeat/processors/add_host_metadata/add_host_metadata.go
#	libbeat/processors/add_host_metadata/add_host_metadata_test.go
ycombinator added a commit that referenced this pull request Apr 6, 2023
* Log any FQDN lookup errors, fallback to OS hostname, and move on.

* Implement fallback logic

* Bumping up go-sysinfo dependency version

* Update call to host.ReportInfo

* Better log message

* Bumping up version on elastic-agent-system-metrics dependency

* Updating NOTICE.txt

* Log FQDN lookup failure as warning

* Move FQDN initialization to after logging has been configured

* Better log message

* Add metric + test

* Update NOTICE.txt

* Use a single namespace for all add_host_metadata processor instances' monitoring

* Fixing imports

* Create monitoring registry only once

* Check errors

(cherry picked from commit 895505c)
ycombinator added a commit that referenced this pull request Apr 6, 2023
…reported hostname (#34971)

* Log any FQDN lookup errors and fallback to OS-reported hostname (#34946)

* Log any FQDN lookup errors, fallback to OS hostname, and move on.

* Implement fallback logic

* Bumping up go-sysinfo dependency version

* Update call to host.ReportInfo

* Better log message

* Bumping up version on elastic-agent-system-metrics dependency

* Updating NOTICE.txt

* Log FQDN lookup failure as warning

* Move FQDN initialization to after logging has been configured

* Better log message

* Add metric + test

* Update NOTICE.txt

* Use a single namespace for all add_host_metadata processor instances' monitoring

* Fixing imports

* Create monitoring registry only once

* Check errors

(cherry picked from commit 895505c)

* Making changes lost in rebase

* Fixing conflicts

* Reordering imports

---------

Co-authored-by: Shaunak Kashyap <[email protected]>
chrisberkhout pushed a commit that referenced this pull request Jun 1, 2023
* Log any FQDN lookup errors, fallback to OS hostname, and move on.

* Implement fallback logic

* Bumping up go-sysinfo dependency version

* Update call to host.ReportInfo

* Better log message

* Bumping up version on elastic-agent-system-metrics dependency

* Updating NOTICE.txt

* Log FQDN lookup failure as warning

* Move FQDN initialization to after logging has been configured

* Better log message

* Add metric + test

* Update NOTICE.txt

* Use a single namespace for all add_host_metadata processor instances' monitoring

* Fixing imports

* Create monitoring registry only once

* Check errors
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-v8.7.0 Automated backport with mergify bug Team:Elastic-Agent Label for the Agent team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Cannot start Beats - fails with error: could not get FQDN
5 participants