feat(metrics): add metrics integration with prometheus #3339

Merged
raymondfeng merged 2 commits into master from prometheus
Nov 1, 2019

raymondfeng
Contributor

@raymondfeng raymondfeng commented Jul 11, 2019

PoC for https://prometheus.io/ integration

Checklist

👉 Read and sign the CLA (Contributor License Agreement) 👈

  • npm test passes on your machine
  • New tests added or existing tests modified to cover all changes
  • Code conforms with the style guide
  • API Documentation in code was updated
  • Documentation in /docs/site was updated
  • Affected artifact templates in packages/cli were updated
  • Affected example projects in examples/* were updated

👉 Check out how to submit a PR 👈

@raymondfeng raymondfeng requested a review from bajtos as a code owner July 11, 2019 18:13
@raymondfeng raymondfeng changed the title feat(metrics): add metrics integration with prometheus [RFC WIP] feat(metrics): add metrics integration with prometheus Jul 11, 2019
@raymondfeng raymondfeng force-pushed the prometheus branch 2 times, most recently from b401dbb to 8bec17b Compare July 12, 2019 05:01
```ts
this.component(MetricsComponent);
```

By default, Metrics route is mounted at `/metrics`. This path can be customized
Contributor

How about access control?

Contributor Author

It's similar to how we expose OpenAPI specs. Let's worry about that later.

There are a few things to consider:

  1. pull vs push - prometheus prefers pull
  2. use a different rest endpoint (host/port)
  3. ACL
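
To make the pull vs. push distinction concrete, here is a minimal TypeScript sketch. It does not use prom-client and is not the extension's implementation: the `Counter` class, `renderExposition`, `handleScrape`, and `buildPush` are hypothetical names, and the gateway URL and job name are assumptions. In the pull model the application only renders text when it is scraped; in the push model it sends the same text to a push gateway.

```typescript
// Minimal sketch; all names are hypothetical and prom-client is not used.
class Counter {
  private value = 0;
  constructor(readonly name: string, readonly help: string) {}
  inc(by = 1): void {
    this.value += by;
  }
  get(): number {
    return this.value;
  }
}

// Render counters in the Prometheus text exposition format.
function renderExposition(counters: Counter[]): string {
  return counters
    .map(
      c =>
        `# HELP ${c.name} ${c.help}\n# TYPE ${c.name} counter\n${c.name} ${c.get()}\n`,
    )
    .join('');
}

// Pull: the application answers a scrape of its metrics endpoint.
function handleScrape(counters: Counter[]): {status: number; body: string} {
  return {status: 200, body: renderExposition(counters)};
}

// Push: the application itself builds a request for a push gateway.
interface PushRequest {
  method: 'PUT';
  url: string;
  body: string;
}

function buildPush(
  gatewayUrl: string,
  job: string,
  counters: Counter[],
): PushRequest {
  return {
    method: 'PUT',
    url: `${gatewayUrl}/metrics/job/${encodeURIComponent(job)}`,
    body: renderExposition(counters),
  };
}

const invocations = new Counter(
  'loopback_invocation_total',
  'method invocation counts',
);
invocations.inc();
invocations.inc();

// Assumed gateway address and job name, for illustration only.
const scrape = handleScrape([invocations]);
const push = buildPush('http://localhost:9091', 'loopback', [invocations]);
```

Both paths carry the same payload; what differs is who initiates the transfer, which is why access control and a separate host/port matter mostly for the pull endpoint.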

@raymondfeng
Contributor Author

FYI: I refactored the module to extensions/metrics - to be consistent with #3360

I had to include c74b870 so that CI can run.

@bajtos
Member

bajtos commented Jul 30, 2019

I had to include c74b870 so that CI can run.

As I commented in the other PR, can you please open a new PR to make the necessary changes to allow extensions to be hosted in extensions/ directory?

@bajtos
Member

bajtos commented Jul 30, 2019

I am not familiar with Prometheus. What kind of metrics does it collect? What kind of metrics does it make sense to provide from a LoopBack application? What can be exposed out of the box, and what requires explicit configuration from users?

For example:

  • Request latency - how long does it take to handle a request.
    • From the time we receive request headers until we send the response headers (not waiting for the response body to be sent entirely)
    • From the time we receive request body until we send the response headers
    • From the time we receive request headers until the last byte of the response was sent
  • DataSource latency - how long does it take to execute a call to a datasource (make a DB query, call a backend web-service)
  • Memory usage
  • CPU usage
  • Event loop latency
  • Anything else?

I'd like the documentation for the extension (the README?) to better describe these aspects and educate LB4 users that are new to Prometheus.
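
As an illustration of the request latency item above, latency is commonly tracked as a histogram. The sketch below is self-contained and hypothetical: it is not the extension's implementation, it does not use prom-client, and the `LatencyHistogram` name and bucket bounds are arbitrary assumptions. It keeps bucket counts cumulative, which is how Prometheus histograms report their `le` buckets.

```typescript
// Hypothetical latency histogram; bucket bounds are in seconds and arbitrary.
class LatencyHistogram {
  private readonly buckets: {le: number; count: number}[];
  private sum = 0;
  private count = 0;

  constructor(upperBounds: number[]) {
    this.buckets = upperBounds.map(le => ({le, count: 0}));
  }

  // Record one observation. Every bucket whose bound is >= the value is
  // incremented, so the counts stay cumulative.
  observe(seconds: number): void {
    this.buckets.forEach(bucket => {
      if (seconds <= bucket.le) bucket.count++;
    });
    this.sum += seconds;
    this.count++;
  }

  // Time a synchronous handler and record how long it took.
  // Where this wrapper is placed decides which of the latencies
  // listed above is actually measured.
  time<T>(handler: () => T): T {
    const start = Date.now();
    try {
      return handler();
    } finally {
      this.observe((Date.now() - start) / 1000);
    }
  }

  snapshot(): {buckets: Record<string, number>; sum: number; count: number} {
    const buckets: Record<string, number> = {};
    this.buckets.forEach(bucket => {
      buckets[String(bucket.le)] = bucket.count;
    });
    buckets['+Inf'] = this.count;
    return {buckets, sum: this.sum, count: this.count};
  }
}

const latency = new LatencyHistogram([0.1, 0.5, 1]);
latency.observe(0.05);
latency.observe(0.3);
latency.observe(2);
const snap = latency.snapshot();
```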

@raymondfeng raymondfeng force-pushed the prometheus branch 2 times, most recently from 96d7916 to 0964afd Compare July 30, 2019 17:57
@raymondfeng
Contributor Author

@bajtos I have updated README to include more information.

@raymondfeng raymondfeng force-pushed the prometheus branch 2 times, most recently from bd3e6d9 to fbb7f30 Compare August 19, 2019 17:06
@raymondfeng raymondfeng force-pushed the prometheus branch 2 times, most recently from 0ba3204 to 9639c29 Compare August 23, 2019 02:16
@raymondfeng
Contributor Author

@bajtos PTAL.

@raymondfeng raymondfeng changed the title [RFC WIP] feat(metrics): add metrics integration with prometheus feat(metrics): add metrics integration with prometheus Aug 26, 2019
Member

@bajtos bajtos left a comment

The proposal looks reasonable. I'd like to discuss a few aspects and design decisions you have made.

// Only run the test on Travis with Linux
const verb =
process.env.TRAVIS && os.platform() === 'linux' ? describe : describe.skip;
verb('Metrics (with push gateway)', function() {
Member

Contributor Author

skipIf(describe, '...', () => {}); does not compile.

Member

skipIf(describe, '...', () => {}); does not compile.

That's a known limitation of the current version and/or TypeScript.

Did you try the slightly longer form I showed in my comment?

skipIf<[(this: Suite) => void], void>( 
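
For readers unfamiliar with the pattern under discussion, the following is a self-contained approximation of a `skipIf`-style helper. It is a sketch under assumptions, not the project's actual helper: the `TestVerb` type, this `skipIf` signature, and `fakeDescribe` are hypothetical, and `fakeDescribe` merely stands in for a test verb that has a `.skip` variant. The type arguments are passed explicitly, as in the longer form quoted above.

```typescript
// Hypothetical shape of a test verb that also offers a `.skip` variant.
type TestVerb<ARGS extends unknown[], RET> = ((
  name: string,
  ...args: ARGS
) => RET) & {
  skip: (name: string, ...args: ARGS) => RET;
};

// Call `verb.skip` when the condition holds, otherwise call `verb`.
function skipIf<ARGS extends unknown[], RET>(
  skip: boolean,
  verb: TestVerb<ARGS, RET>,
  name: string,
  ...args: ARGS
): RET {
  return skip ? verb.skip(name, ...args) : verb(name, ...args);
}

// A stand-in for a real test runner verb, so the sketch runs on its own.
const log: string[] = [];
const fakeDescribe = Object.assign(
  (name: string, fn: () => void): void => {
    log.push(`run: ${name}`);
    fn();
  },
  {
    skip: (name: string, _fn: () => void): void => {
      log.push(`skip: ${name}`);
    },
  },
);

// In the PR the condition depends on the CI environment and platform;
// plain booleans are used here to keep the example deterministic.
skipIf<[() => void], void>(true, fakeDescribe, 'Metrics (with push gateway)', () => {});
skipIf<[() => void], void>(false, fakeDescribe, 'Metrics (pull)', () => {});
```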

process.env.TRAVIS && os.platform() === 'linux' ? describe : describe.skip;
verb('Metrics (with push gateway)', function() {
// eslint-disable-next-line no-invalid-this
this.timeout(30000);
Member

Is this going to increase the duration of npm test by another 10-30 seconds? I am concerned that npm test is already taking too long to finish to make TDD practical, and I am reluctant to make it even worse.

Contributor Author

I just set the timeout conservatively to allow the prom/pushgateway docker container to be up and running. We may have to define a nightly build to run the costly integration tests. What do you think?

Member

So how long does it usually take to get the prom/pushgateway docker container up and running? When I run npm test on my local machine for the second time, how long a delay will be introduced by waiting for docker?

We may have to define a nightly build to run the costly integration tests. What do you think?

I feel it's not enough to run these tests nightly. If a pull request breaks one of these tests, then we will discover the problem too late.

Can we use the approach I have in place for running repository-test tests against real databases? Here is the gist:

  • These tests ARE NOT run as part of npm test.
  • There are clear instructions on how to run the tests locally - see e.g. MySQL instructions.
  • The tests expect external services like databases to be already running and available. This way we pay the cost of starting the services only once, not for every test run.
  • There is a single Travis CI job for each test suite (MongoDB, MySQL, etc.); these jobs are executed in parallel with other jobs like npm test, code linting, commit linting, etc. Example job config: .travis.yml#L53-L69 - it does not use Docker, because native MySQL is faster to set up, but Travis CI does support Docker.

Member

Can you write a mock-up pushgateway that will call our metrics endpoint the same way as the real gateway does? That way we can verify push functionality from extensions/metrics tests.

Then we can add a new package to acceptance directory, e.g. acceptance/push-metrics, where we will use a real push gateway running in a docker container, to ensure our push implementation works with real gateways too and catch any discrepancies between our mock gateway and real gateways.
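
The mock-up gateway idea could look roughly like the sketch below. It is an in-memory stand-in, not a real Pushgateway and not the mock that ended up in the PR: `MockPushGateway`, `pushMetrics`, the status codes, and the job name are all assumptions for illustration, and the sketch ignores any additional grouping labels in the path.

```typescript
// Hypothetical in-memory stand-in for a push gateway.
interface RecordedPush {
  job: string;
  body: string;
}

class MockPushGateway {
  readonly pushes: RecordedPush[] = [];

  // Handle one request and return the status code a server might send.
  handle(method: string, path: string, body: string): number {
    const prefix = '/metrics/job/';
    if (path.indexOf(prefix) !== 0 || path.length === prefix.length) {
      return 404;
    }
    if (method !== 'PUT' && method !== 'POST') return 405;
    // Simplification: everything after the prefix is treated as the job name.
    this.pushes.push({job: decodeURIComponent(path.slice(prefix.length)), body});
    return 200;
  }
}

// Stand-in for the component code under test: it only has to deliver
// the metrics text to whatever gateway it is given.
function pushMetrics(gateway: MockPushGateway, job: string, text: string): number {
  return gateway.handle('PUT', `/metrics/job/${encodeURIComponent(job)}`, text);
}

const gateway = new MockPushGateway();
const status = pushMetrics(gateway, 'loopback', 'loopback_invocation_total 1\n');
const notFound = gateway.handle('PUT', '/unknown', '');
```

A mock like this only verifies that a push happened and what it contained; whether it matches a real gateway's behavior is exactly what the proposed acceptance test would check.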

Member

Can we remove this 30 second timeout now, or at least reduce it to something like 5-10s?

Also, I see that you introduced a mock push gateway, which is great! But how can we be sure that it accurately simulates the behavior of a real gateway? Shouldn't we have an acceptance test using the docker-based gateway, as I proposed in my comment above?

Contributor Author

I think the mock-up push gateway gives us enough confidence, as we just have to make sure this component is pushing metrics to the gateway (the correctness should already be covered by prom-client).

@bajtos
Member

bajtos commented Aug 27, 2019

Would you like to expose README of this new component in https://loopback.io/doc/en/lb4/Using-components.html?

@raymondfeng
Contributor Author

@bajtos PTAL

Member

@bajtos bajtos left a comment

The patch looks much better now! I'd like to discuss a few points before approving it.

@raymondfeng
Contributor Author

@bajtos PTAL

Member

@bajtos bajtos left a comment

I quickly skimmed through the changes and have a few more comments.

Please get at least one more person from the team (@strongloop/loopback-maintainers) to review the changes too.

@raymondfeng raymondfeng force-pushed the prometheus branch 2 times, most recently from ba183e4 to cb0aa16 Compare October 21, 2019 22:09
@raymondfeng raymondfeng force-pushed the prometheus branch 2 times, most recently from 94ce1b5 to dfb854c Compare November 1, 2019 03:21
Member

@bajtos bajtos left a comment

I don't have any more comments.

Please get approval from at least one more person from @strongloop/loopback-maintainers before landing.

@hacksparrow
Contributor

As a POC, this looks good.

However, I am not sure if hardwiring a core extension to a particular service is a good idea. Ideally, its interface should be an adapter - users should be able to use Prometheus alternatives, if they want to.
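
One way to picture the adapter idea is the sketch below. It is purely hypothetical: the PR does not define a `MetricsAdapter` interface, and `PrometheusTextAdapter`, `JsonAdapter`, and `recordInvocation` are invented names showing how framework code could stay independent of the reporting backend.

```typescript
// Hypothetical adapter interface; the extension in this PR does not define it.
interface MetricsAdapter {
  increment(name: string, by?: number): void;
  report(): string;
}

// One adapter renders Prometheus-style "name value" lines.
class PrometheusTextAdapter implements MetricsAdapter {
  private readonly values: Record<string, number> = {};
  increment(name: string, by = 1): void {
    this.values[name] = (this.values[name] || 0) + by;
  }
  report(): string {
    return Object.keys(this.values)
      .map(name => `${name} ${this.values[name]}`)
      .join('\n');
  }
}

// Another adapter could emit JSON for a different backend.
class JsonAdapter implements MetricsAdapter {
  private readonly values: Record<string, number> = {};
  increment(name: string, by = 1): void {
    this.values[name] = (this.values[name] || 0) + by;
  }
  report(): string {
    return JSON.stringify(this.values);
  }
}

// Framework code depends only on the interface, not on a backend.
function recordInvocation(adapter: MetricsAdapter): void {
  adapter.increment('loopback_invocation_total');
}

const prom = new PrometheusTextAdapter();
const json = new JsonAdapter();
recordInvocation(prom);
recordInvocation(json);
```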

targetName: invocationCtx.targetName,
});
try {
this.counter.inc();
Contributor

A question about the counter: I ran the demo and noticed that the method invocation count is 105:

# HELP loopback_invocation_total method invocation counts
# TYPE loopback_invocation_total counter
loopback_invocation_total 105

The demo app doesn't have any controller/endpoints, so I am wondering... what are the methods being invoked?

Contributor Author

Good question. We have a built-in controller in extensions/metrics.

So whenever /metrics is scraped, our metrics interceptor is triggered.
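
The effect described here can be reproduced with a small stand-alone sketch. The names `intercepted`, `invocationTotal`, and `report` are hypothetical stand-ins for the interceptor, its counter, and the built-in metrics controller; this is not the extension's code.

```typescript
// Hypothetical stand-in for the counter kept by the interceptor.
let invocationTotal = 0;

// Wrap a method so that each call increments the counter first,
// roughly as a global interceptor would.
function intercepted<T>(method: () => T): () => T {
  return () => {
    invocationTotal++;
    return method();
  };
}

// The metrics endpoint is itself an intercepted controller method.
const report = intercepted(
  () => `loopback_invocation_total ${invocationTotal}`,
);

// Three scrapes and no application traffic at all.
const scrapes = [report(), report(), report()];
```

Each scrape counts itself, so the total grows even in an application with no controllers of its own.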

Contributor

ah, that makes sense 👍

Contributor

@jannyHou jannyHou left a comment

LGTM 🚢 I ran the demo and tried the /metrics endpoint; the report looks reasonable.
I have a general question about Prometheus: is it also meant for monitoring particular endpoints, or is it more for monitoring the health of an app?

My understanding of @loopback/extension-metrics is that people can use it to monitor their app or project, as in the demo. Do we plan to add a new package under /metrics to monitor our project with this extension?

@raymondfeng
Contributor Author

I have a general question for prometheus: is it also aimed to monitor particular endpoints or it's more for monitoring the health of an app?

We already have @loopback/extension-health for health checks. The metrics extension enables metrics reporting for Prometheus. The metrics include the Node.js runtime, LoopBack framework code (TBA), and application logic.

@raymondfeng raymondfeng force-pushed the prometheus branch 2 times, most recently from 67cb790 to aa9c952 Compare November 1, 2019 20:23
@raymondfeng raymondfeng merged commit 2c11c6d into master Nov 1, 2019
@raymondfeng raymondfeng deleted the prometheus branch November 1, 2019 21:53