[receiver/hostmetrics] change the log level when filesystem fails to scrape partition #18236

Closed
dloucasfx opened this issue Feb 1, 2023 · 8 comments
Assignees
Labels
bug Something isn't working receiver/hostmetrics

Comments

@dloucasfx
Contributor

dloucasfx commented Feb 1, 2023

Component(s)

receiver/hostmetrics filesystem scraper

What happened?

Description

This is a gray area between a bug and an improvement, but due to the large number of "unnecessary" error messages in the logs, I am filing it as a bug.

After change a0abefc, the filesystem scraper logs every partition that fails to be scraped: it adds an error message through errors.AddPartial.

At first glance this is the right approach. However, some partitions (for example, Windows partitions that are BitLocker-encrypted, or any partition that we don't have access to) are known to fail. The problem is that the user has no way to filter them out before they get scraped, so they end up with error messages polluting their logs.

Steps to Reproduce

Run the hostmetrics/filesystem receiver/scraper on a system with a non-accessible partition, for example Windows with a BitLocker drive.

Expected Result

No errors should be logged unless the agent's log level is set to debug.
Alternatively, provide a way to filter out those partitions.

Actual Result

```shell
error   scraperhelper/scrapercontroller.go:197  Error scraping metrics  {"kind": "receiver", "name": "hostmetrics", "pipeline": "metrics", "error": "failed collecting partitions information: \tError 0: This drive is locked by BitLocker Drive Encryption. You must unlock this drive from Control Panel.\n", "scraper": "filesystem"}
go.opentelemetry.io/collector/receiver/scraperhelper.(*controller).scrapeMetricsAndReport
        /builds/o11y-gdi/splunk-otel-collector-releaser/.go/pkg/mod/go.opentelemetry.io/[email protected]/receiver/scraperhelper/scrapercontroller.go:197
```

### Collector version

latest

### Environment information

## Environment
OS: (e.g., "Ubuntu 20.04")
Compiler(if manually compiled): (e.g., "go 14.2")


### OpenTelemetry Collector configuration

_No response_

### Log output

```shell
error   scraperhelper/scrapercontroller.go:197  Error scraping metrics  {"kind": "receiver", "name": "hostmetrics", "pipeline": "metrics", "error": "failed collecting partitions information: \tError 0: This drive is locked by BitLocker Drive Encryption. You must unlock this drive from Control Panel.\n", "scraper": "filesystem"}
go.opentelemetry.io/collector/receiver/scraperhelper.(*controller).scrapeMetricsAndReport
        /builds/o11y-gdi/splunk-otel-collector-releaser/.go/pkg/mod/go.opentelemetry.io/[email protected]/receiver/scraperhelper/scrapercontroller.go:197
```


### Additional context

A few ideas:
- Provide a way for the user to filter out partitions before we call gopsutil
- Provide an option specific to the filesystem scraper to skip those errors
- Extend `AddPartial` to pass the log level (e.g. debug)

@dloucasfx added the bug and needs triage labels Feb 1, 2023
@github-actions
Contributor

github-actions bot commented Feb 1, 2023

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@atoulme removed the needs triage label Feb 1, 2023
@atoulme self-assigned this Feb 2, 2023
@atoulme
Contributor

atoulme commented Feb 2, 2023

@dloucasfx for first option, it looks like this is now supported. Check out

@atoulme
Contributor

atoulme commented Feb 2, 2023

Did you also mention it would be good to offer a way to configure a zap.Filter on the logger?

@dloucasfx
Contributor Author

dloucasfx commented Feb 2, 2023

> @dloucasfx for first option, it looks like this is now supported. Check out

@atoulme

The link is for the disk scraper; this issue is in the filesystem scraper. Regardless, the filesystem scraper has filtering options, but the filtering happens after all the filesystem info is collected, i.e. after the error is logged.
You can see here that the filtering happens after the errors.AddPartial call.
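
A rough sketch of why the order matters, under stated assumptions: `usageFunc`, `collectThenFilter`, `filterThenCollect` and `fakeUsage` are hypothetical names for illustration only and are not taken from the scraper's source. The first function reads usage for every partition before applying the exclusion, so an excluded partition still records an error; the second applies the exclusion first.

```go
package main

import (
	"errors"
	"fmt"
)

// usageFunc is a hypothetical stand-in for the per-partition usage lookup.
type usageFunc func(mountpoint string) (uint64, error)

// collectThenFilter reads usage for every partition first, so an excluded
// partition that fails still records an error.
func collectThenFilter(mounts []string, excluded map[string]bool, usage usageFunc) (map[string]uint64, []error) {
	used := make(map[string]uint64)
	var errs []error
	for _, m := range mounts {
		u, err := usage(m)
		if err != nil {
			errs = append(errs, fmt.Errorf("%s: %w", m, err))
			continue
		}
		if excluded[m] {
			continue // filtering only happens after the lookup
		}
		used[m] = u
	}
	return used, errs
}

// filterThenCollect applies the exclusion before the lookup, so an excluded
// partition never reaches the failing call.
func filterThenCollect(mounts []string, excluded map[string]bool, usage usageFunc) (map[string]uint64, []error) {
	used := make(map[string]uint64)
	var errs []error
	for _, m := range mounts {
		if excluded[m] {
			continue
		}
		u, err := usage(m)
		if err != nil {
			errs = append(errs, fmt.Errorf("%s: %w", m, err))
			continue
		}
		used[m] = u
	}
	return used, errs
}

// fakeUsage simulates a locked drive mounted at D:.
func fakeUsage(mountpoint string) (uint64, error) {
	if mountpoint == "D:" {
		return 0, errors.New("this drive is locked")
	}
	return 42, nil
}

func main() {
	mounts := []string{"C:", "D:"}
	excluded := map[string]bool{"D:": true}

	_, errs := collectThenFilter(mounts, excluded, fakeUsage)
	fmt.Println("errors when filtering after collection:", len(errs))

	_, errs = filterThenCollect(mounts, excluded, fakeUsage)
	fmt.Println("errors when filtering before collection:", len(errs))
}
```

In this toy setup the user has excluded D: in both cases, yet only the second ordering keeps the locked drive out of the error list.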

@dloucasfx
Contributor Author

> Did you also mention it would be good to offer a way to configure a zap.Filter on the logger?

Oh yeah, when I was looking into this issue I was hoping that our logging supported zapfilter (https://pkg.go.dev/moul.io/zapfilter), where the user can filter based on log messages. This is definitely an enhancement, but if we had it in place, we could work around this bug.
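
To illustrate the kind of message-based filtering meant here, below is a small sketch that uses only the Go standard library `log` package rather than zap or zapfilter, so it says nothing about how either of those is actually configured. `dropWriter` and `filteredOutput` are hypothetical helpers: the writer discards any log line that contains one of the configured substrings.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"log"
	"strings"
)

// dropWriter is a hypothetical io.Writer wrapper that discards any log line
// containing one of the configured substrings.
type dropWriter struct {
	out  io.Writer
	drop []string
}

// Write forwards p to the wrapped writer unless it matches a drop pattern.
func (w *dropWriter) Write(p []byte) (int, error) {
	for _, s := range w.drop {
		if strings.Contains(string(p), s) {
			return len(p), nil // report success but emit nothing
		}
	}
	return w.out.Write(p)
}

// filteredOutput logs each message through a filtering writer and returns
// whatever was kept.
func filteredOutput(drop []string, msgs ...string) string {
	var buf bytes.Buffer
	logger := log.New(&dropWriter{out: &buf, drop: drop}, "", 0)
	for _, m := range msgs {
		logger.Print(m)
	}
	return buf.String()
}

func main() {
	fmt.Print(filteredOutput(
		[]string{"locked by BitLocker"},
		"Error scraping metrics: This drive is locked by BitLocker Drive Encryption.",
		"Error scraping metrics: disk read timeout",
	))
}
```

Only the second message reaches the output here. A real zap-based filter would operate on structured log entries rather than raw bytes, so treat this purely as a picture of the idea.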

@jvoravong
Contributor

Taking a look into this issue now.

@jvoravong
Contributor

Merged in a small fix, should be available in v0.73.0.

@dmitryax closed this as completed Mar 1, 2023
@csmith-poppulo

csmith-poppulo commented Sep 27, 2024

I'm having a very similar issue with version 0.95.0 of the collector agent on Windows servers where the disks are locked by SIOS. Our Windows Application event logs are flooded with errors from the agent when it hits locked disks. The disks are only active on the SQL Server node to which the roles are currently assigned, and if we have to fail over they will migrate. This is a dynamic setup and we do need metrics from those disks whenever they are active on any of the nodes in the cluster.

Should I open a new bug ticket for this? I'm not completely familiar with the process but can definitely use this ticket as my guide along with the documentation for contributing.

1.7274589621483753e+09 error scraperhelper/scrapercontroller.go:200 Error scraping metrics {"kind": "receiver", "name": "hostmetrics/localhost_windows_system", "data_type": "metrics", "error": "failed collecting partitions information: \tError 0: Access is denied.\n", "scraper": "filesystem"} go.opentelemetry.io/collector/receiver/scraperhelper.(*controller).scrapeMetricsAndReport go.opentelemetry.io/collector/[email protected]/scraperhelper/scrapercontroller.go:200 go.opentelemetry.io/collector/receiver/scraperhelper.(*controller).startScraping.func1 go.opentelemetry.io/collector/[email protected]/scraperhelper/scrapercontroller.go:176

Quick edit here

I did go ahead and try 0.110.0 and I am having the same issue. I'm not sure how Logz.io repackages the collector though so I was only doing this as a quick test.
