Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Metricbeat module for Traefik #7413

Merged
merged 49 commits into from
Jul 3, 2018
Merged

Metricbeat module for Traefik #7413

merged 49 commits into from
Jul 3, 2018

Conversation

ycombinator
Copy link
Contributor

@ycombinator ycombinator commented Jun 26, 2018

Resolves #7412.

This PR adds a Metricbeat module for the Traefik reverse proxy/load balancer. In this module, it adds the health metricset, whose data it retrieves from polling the Traefik instance's health HTTP API endpoint.

@ycombinator ycombinator self-assigned this Jun 26, 2018
@ruflin ruflin added the in progress Pull request is currently in progress. label Jun 26, 2018
@ycombinator
Copy link
Contributor Author

ycombinator commented Jun 26, 2018

@ruflin Looking at the beats-ci console output, the failure is this:

08:21:59 ./module/traefik/health/health.go
08:21:59 Code differs from goimports' style ^

How can I fix this?

@exekias Is there some VS Code extension I should use to auto-fix this on save or something? I'm currently using this in my VS Code config:

  // Pick 'gofmt', 'goimports', 'goreturns' or 'goformat' to run on format.
  "go.formatTool": "goreturns",

[EDIT] I tried changing that setting to "goimports" and re-saved health.go but nothing seems to have changed. Maybe I'm barking up the wrong tree here?

@exekias
Copy link
Contributor

exekias commented Jun 26, 2018

You can use make fmt to fix this kind of issues.

As for VS Code, try adding this:

    "go.formatFlags": ["-local", "github.com/elastic/beats"],

Copy link
Contributor

@ruflin ruflin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't forget to add a changelog.

It seems somehow the template loading test is failing. Not sure why, there must be something invalid in the template. If you run the test locally it should give you a good idea on what goes wrong in the logs.

var (
schema = s.Schema{
"uptime": s.Object{
"sec": c.Float("uptime_sec"),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be Int?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This value is a float in the response from Traefik's health API (http://localhost:8080/health) so I kept it the same here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case i would suggest we use milliseconds instead here, especially having elastic/elasticsearch#31244 in mind. TBH for uptime I would also be fine to just convert it to Int. I don't think there is a benefit of having uptime in sub second detail.

Copy link
Contributor Author

@ycombinator ycombinator Jun 27, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think there is a benefit of having uptime in sub second detail.

Yeah, good point. Will change to Int.

- key: traefik
title: "traefik"
description: >
experimental[]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we now generate this automatically if the release setting is not set. So you can skip it here.

@ycombinator
Copy link
Contributor Author

ycombinator commented Jun 28, 2018

I'm investigating the integration test failures by running them locally as you suggested. From what I can tell so far I think the failure has to do with the Docker container for Traefik not starting up correctly.

Concretely, I ran this command:

$ go test -v -tags integration github.com/elastic/beats/metricbeat/module/traefik/health

And it's output got "stuck" like so:

=== RUN   TestFetch
time="2018-06-27T18:17:28-07:00" level=info msg="[0/33] [traefik]: Starting "
time="2018-06-27T18:17:28-07:00" level=warning msg="Error while reading .dockerignore (/Users/shaunak/go/src/github.com/elastic/beats/metricbeat/module/traefik/_meta/.dockerignore) : open /Users/shaunak/go/src/github.com/elastic/beats/metricbeat/module/traefik/_meta/.dockerignore: no such file or directory"
time="2018-06-27T18:17:28-07:00" level=info msg="Building metricbeat_traefik..."
time="2018-06-27T18:17:28-07:00" level=info msg="Recreating traefik"
time="2018-06-27T18:17:29-07:00" level=info msg="[1/33] [traefik]: Started "

Running docker-compose ps in another window, I see this:

$ docker-compose ps
        Name           Command            State           Ports
----------------------------------------------------------------
metricbeat_traefik_1   /traefik   Up (health: starting)   80/tcp

And eventually that becomes this:

        Name           Command        State        Ports
---------------------------------------------------------
metricbeat_traefik_1   /traefik   Up (unhealthy)   80/tcp

In either output of docker-compose ps above, notice that port 8080 is not shown, which is supposed to be the Traefik HTTP API port (which is also used by the HEALTHCHECK command in the Traefik Dockerfile, so it's not surprising that the container ultimately reports as unhealthy).

My guess is that this failing Docker health check is what's causing the integration test to be stuck. So I'm playing with the Dockerfile and docker-compose.yml to see if I can get port 8080 up somehow and try the tests again. Stay tuned...

@ruflin
Copy link
Contributor

ruflin commented Jun 28, 2018

It seems that the container tries to expose port 80 instead of 8080. Is treafik by default running on 80 or 8080? Perhaps you have change the config here and add EXPOSE 8080 to the docker file.

@ycombinator
Copy link
Contributor Author

Turned out to be a combination of issues in the Dockerfile, you can see the necessary changes to make the HEALTHCHECK work here: 2fcc0de27355bd68c2719850ed6e207992dcb094.

@@ -0,0 +1,19 @@
{
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the purpose of this file? How is it used?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an example output document that we should in the docs. It is generated by this command here: https://github.com/elastic/beats/blob/master/metricbeat/Makefile#L73 As this updates all data.json I normally run the following command:

go test -tags=integration github.com/elastic/beats/metricbeat/module/traefik/... -data

In you case this assumes the traefik docker container is running and exposes the port on localhost. This will generate a data.json file and make update will link it in the docs.

Here you can see an example in the docs: https://www.elastic.co/guide/en/beats/metricbeat/master/metricbeat-metricset-elasticsearch-index.html

@ycombinator ycombinator added review and removed in progress Pull request is currently in progress. labels Jun 28, 2018
@ycombinator ycombinator requested review from jsoriano and ruflin June 28, 2018 09:00
@ycombinator ycombinator changed the title [WIP] Metricbeat module for Traefik Metricbeat module for Traefik Jun 28, 2018
description: >
Metrics obtained from Traefik's health API endpoint
fields:
- name: uptime
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One trick you can do here not to have to specify the group and describe is to directly write uptime.sec. The generators will do the rest of the magic for you.

Average response time
fields:
- name: sec
type: float
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed with the other sec type I struggle that we have a sec value that is float. I understand that for the average it makes more sense but in this case I would rather use a smaller unit.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, will use milliseconds and int here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For proxies performance monitoring it is useful to have sub-millisecond response times if available.

description: >
Respone status codes
fields:
- name: code
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if this is correct. If understand the code correct it will be status_codes.404: 17? So code is actually the key and not the value? For such cases have a look at the dynamic templates. Search for object_type usage in our fields.yml to see how it can be used.

"avg_time": s.Object{
"sec": c.Float("average_response_time_sec"),
},
"status_codes": s.Object{},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would not define it here but leave it empty. Like this if we don't have any status codes, no object shows up in Elasticsearch.

)

// raw response copied from Traefik instance's health API endpoint
const response = `{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of having it to define here, I would recommend to put this into a file which is loaded. This makes it then very easy to have a list of files to test against, for example different versions. See Elasticsearch module as an example.

COMPOSE_SERVICES = ['traefik']

@unittest.skipUnless(metricbeat.INTEGRATION_TESTS, "integration test")
def test_health(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is a new method check_metricset in metricbeat.py. Using it could simplify your code here.

@@ -0,0 +1,36 @@
import os
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We recently started to put this file directly into the module directory (see Elasticsearch) to have all pieces in one place. But still both works.

@@ -119,6 +119,7 @@ https://github.com/elastic/beats/compare/v6.2.3...master[Check the HEAD diff]
- Fix field mapping for the system process CPU ticks fields. {pull}7230[7230]
- Fix Windows service metricset when using a 32-bit binary on a 64-bit OS. {pull}7294[7294]
- Fix Jolokia attribute mapping when using wildcards and MBean names with multiple properties. {pull}7321[7321]
- Added Traefik module with health metricset. {pull}7413[7413]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move this entry to the Added section

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, good catch. Thanks!

Average response time
fields:
- name: sec
type: float
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For proxies performance monitoring it is useful to have sub-millisecond response times if available.


event := reporter.GetEvents()[0]
assert.NotNil(t, event)
t.Logf("%s/%s event: %+v", ms.Module().Name(), ms.Name(), event)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you also check something in the event fields?

@ycombinator
Copy link
Contributor Author

@jsoriano @ruflin Thanks for your review. I believe I've addressed all the feedback now. Ready for your 👀 again.

Copy link
Member

@jsoriano jsoriano left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a missing import in a test file, for the rest it LGTM :)

import time
from parameterized import parameterized

sys.path.append(os.path.join(os.path.dirname(__file__), '../../tests/system'))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

import sys :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahahaha, thank you.

Related: Is there some way to run system tests (i.e. the Python ones) for a specific module?

Copy link
Contributor

@ruflin ruflin Jun 29, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

INTEGRATION_TESTS=1 nosetests module/traefik/ should do the trick. But you must have traefik running locally with docker exposing the port.


event, _ := schema.Apply(health)

statusCodeCountMap := health["total_status_code_count"].(map[string]interface{})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you should check that the key exists and that the type assertion worked. you can add a second param ok which will contain a bool to do the second part.

type: long
description: >
Average response time in microseconds
- name: status_codes
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would call it status_code because when someone access it in Kibana he will only look at 1 at the time:

health.status_code.404 > 17


event, _ := eventMapping(data)
report.Event(mb.Event{
MetricSetFields: event,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To follow ECS, you could you add service.name: traefik to it?

@@ -0,0 +1,49 @@
// Licensed to Elasticsearch B.V. under one or more contributor
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jsoriano Should these now go under mtest package in each module?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, I'd prefer that, yes.

@ycombinator
Copy link
Contributor Author

@ruflin Made the changes you suggested in the latest round of review. Ready for your eyes again, when you get a chance. Thanks!

Copy link
Contributor

@ruflin ruflin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a few minor additional comments. Except for the float64 issue ok for me to be merged.

"github.com/pkg/errors"

"github.com/elastic/beats/libbeat/common"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

newline can be removed

}.Build()
)

// MetricSet holds any configuration or state information. It must implement
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I often remove the generator comments and make them more specific to the metricset or just very short. The comments were intended for someone that creates the metricset. Applies also to all other comments.


func eventMapping(health map[string]interface{}) (common.MapStr, *s.Errors) {
if averageResponseTimeSec, ok := health["average_response_time_sec"]; ok {
health["average_response_time_us"] = averageResponseTimeSec.(float64) * 1000 * 1000
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if for whatever reason averageResponseTimeSec cannot be typed to (float64)

@@ -0,0 +1,7 @@
# Module: traefik
# Docs: https://www.elastic.co/guide/en/beats/metricbeat/master/metricbeat-module-traefik.html
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the link to the docs. One issue this has is that we hardcode master here to every released config file will point to master. Would be nice if in the future this would be a variable that is replaced by the version during packaging.

@ycombinator
Copy link
Contributor Author

@ruflin Addressed latest round of review feedback. Ready for your 👀 again. Thanks!

@ruflin ruflin merged commit 4071bc7 into elastic:master Jul 3, 2018
@ruflin
Copy link
Contributor

ruflin commented Jul 3, 2018

@ycombinator Thanks for pushing this forward.

@ycombinator ycombinator added needs_backport PR is waiting to be backported to other branches. v7.0.0-alpha1 v6.4.0 labels Jul 3, 2018
@ycombinator
Copy link
Contributor Author

ycombinator commented Jul 3, 2018

Backported to:

6.x / 6.4.0: 0ec0e7f

EDIT: This backport commit was later removed.

@ycombinator ycombinator removed the needs_backport PR is waiting to be backported to other branches. label Jul 3, 2018
@ycombinator ycombinator deleted the metricbeat/module/traefik branch July 5, 2018 20:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants