receive: Add liveness and readiness probe #1537

Merged
brancz merged 5 commits into thanos-io:master on Sep 20, 2019

kakkoyun
Member

This PR,

  • Adds /-/healthy endpoint for liveness checks.
  • Adds /-/ready endpoint for readiness checks.

Changes

  • Adds /-/healthy endpoint for liveness checks.
  • Adds /-/ready endpoint for readiness checks.
  • Uses prober.Prober for readiness and liveness endpoints.

Verification

  1. make test

  2. Started thanos receive and sent requests to the new endpoints:

$ curl http://0.0.0.0:10902/-/healthy
thanos receive is healthy
$ curl http://0.0.0.0:10902/-/ready
thanos receive is not ready. Reason: thanos receive is initializing
$ curl http://0.0.0.0:10902/-/ready
thanos receive is ready

@kakkoyun
Member Author

cc @FUSAKLA

Member

@bwplotka bwplotka left a comment

LGTM, Thanks!

@@ -278,6 +284,7 @@ func runReceive(
s := newStoreGRPCServer(logger, reg, tracer, tsdbStore, opts)

level.Info(logger).Log("msg", "listening for StoreAPI gRPC", "address", grpcBindAddr)
statusProber.SetReady()
Member

I think the receiver should probably not be ready until the TSDB is ready? I'm also not sure about the hashring, and the receive interface is not guaranteed to be up at this point either.

Maybe this will require some more complex condition for the ready state 🤔

Member Author

@FUSAKLA The TSDB is ready at this stage: if you check line 270, this code runs after the TSDB is opened.
For the receive interface, I assumed that if something goes wrong it will change the liveness state, so a readiness check won't be needed.

I guess I need to double-check the hashring readiness.

I'll have another look at it.

@kakkoyun
Member Author

@FUSAKLA @bwplotka I've updated the logic that marks receive as ready. Please have another look at it.

Signed-off-by: Kemal Akkoyun <[email protected]>
Signed-off-by: Kemal Akkoyun <[email protected]>
Signed-off-by: Kemal Akkoyun <[email protected]>
Signed-off-by: Kemal Akkoyun <[email protected]>
Signed-off-by: Kemal Akkoyun <[email protected]>
@brancz brancz merged commit 3a6f8e1 into thanos-io:master Sep 20, 2019
@@ -277,6 +290,8 @@ func runReceive(
}
s := newStoreGRPCServer(logger, reg, tracer, tsdbStore, opts)

// Wait hashring to be ready before start serving metrics
<-hashringReady
Member

Why are we waiting for the hashring to be ready before serving metrics from the store? These things are entirely independent IMO
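
The independence this comment describes could look roughly like the sketch below: the store server starts right away, and only the readiness flip waits for the hashring. This shows one possible arrangement, not what the project adopted. `startReceive` and its callbacks are hypothetical names.

```go
package main

import "fmt"

// startReceive starts the store server immediately and flips readiness in
// the background once the hashring signals. The returned channel is closed
// after setReady has run.
func startReceive(hashringReady <-chan struct{}, startStore, setReady func()) <-chan struct{} {
	startStore() // does not depend on the hashring

	done := make(chan struct{})
	go func() {
		defer close(done)
		<-hashringReady
		setReady()
	}()
	return done
}

func main() {
	hashringReady := make(chan struct{})
	storeUp, ready := false, false

	done := startReceive(hashringReady,
		func() { storeUp = true },
		func() { ready = true },
	)
	fmt.Println("store up:", storeUp, "ready:", ready)

	close(hashringReady)
	<-done
	fmt.Println("store up:", storeUp, "ready:", ready)
}
```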

ivan-kiselev pushed a commit to ivan-kiselev/thanos that referenced this pull request Sep 26, 2019
* Add prober to receive

Signed-off-by: Kemal Akkoyun <[email protected]>

* Add changelog entries

Signed-off-by: Kemal Akkoyun <[email protected]>

* Update README

Signed-off-by: Kemal Akkoyun <[email protected]>

* Remove default

Signed-off-by: Kemal Akkoyun <[email protected]>

* Wait hashring to be ready

Signed-off-by: Kemal Akkoyun <[email protected]>
ivan-kiselev pushed a commit to ivan-kiselev/thanos that referenced this pull request Sep 26, 2019
* Add prober to receive

Signed-off-by: Kemal Akkoyun <[email protected]>

* Add changelog entries

Signed-off-by: Kemal Akkoyun <[email protected]>

* Update README

Signed-off-by: Kemal Akkoyun <[email protected]>

* Remove default

Signed-off-by: Kemal Akkoyun <[email protected]>

* Wait hashring to be ready

Signed-off-by: Kemal Akkoyun <[email protected]>
Signed-off-by: Ivan Kiselev <[email protected]>
brancz pushed a commit that referenced this pull request Sep 26, 2019
* Some updates to compact docs

Signed-off-by: Ivan Kiselev <[email protected]>

* some formatting

Signed-off-by: Ivan Kiselev <[email protected]>

* Update docs/components/compact.md

accept PR suggestions

Co-Authored-By: Bartlomiej Plotka <[email protected]>
Signed-off-by: Ivan Kiselev <[email protected]>

* Add metalmatze to list of maintainers (#1547)

Signed-off-by: Matthias Loibl <[email protected]>
Signed-off-by: Ivan Kiselev <[email protected]>

* resolve comments

Signed-off-by: Ivan Kiselev <[email protected]>

* resolve last comment

Signed-off-by: Ivan Kiselev <[email protected]>

* receive: Add liveness and readiness probe (#1537)

* Add prober to receive

Signed-off-by: Kemal Akkoyun <[email protected]>

* Add changelog entries

Signed-off-by: Kemal Akkoyun <[email protected]>

* Update README

Signed-off-by: Kemal Akkoyun <[email protected]>

* Remove default

Signed-off-by: Kemal Akkoyun <[email protected]>

* Wait hashring to be ready

Signed-off-by: Kemal Akkoyun <[email protected]>
Signed-off-by: Ivan Kiselev <[email protected]>

* downsample: Add liveness and readiness probe (#1540)

* Add readiness and liveness probes for downsampler

Signed-off-by: Kemal Akkoyun <[email protected]>

* Add changelog entry

Signed-off-by: Kemal Akkoyun <[email protected]>

* Remove default

Signed-off-by: Kemal Akkoyun <[email protected]>

* Set ready

Signed-off-by: Kemal Akkoyun <[email protected]>

* Update CHANGELOG

Signed-off-by: Kemal Akkoyun <[email protected]>

* Clean CHANGELOG

Signed-off-by: Kemal Akkoyun <[email protected]>
Signed-off-by: Ivan Kiselev <[email protected]>

* Document the dnssrvnoa option (#1551)

Signed-off-by: Antonio Santos <[email protected]>
Signed-off-by: Ivan Kiselev <[email protected]>

* feat store: added readiness and livenes prober (#1460)

Signed-off-by: Martin Chodur <[email protected]>
Signed-off-by: Ivan Kiselev <[email protected]>

* Add Hotstar to adopters. (#1553)

It's the largest streaming service in India that does cricket and GoT
for India. They have insane scale and are using Thanos to scale their
Prometheus.

Spoke to them offline about adding the logo and will get a signoff here
too.

Signed-off-by: Goutham Veeramachaneni <[email protected]>
Signed-off-by: Ivan Kiselev <[email protected]>

* Fix hotstar logo in the adoptor's list (#1558)

Signed-off-by: Karthik Vijayaraju <[email protected]>
Signed-off-by: Ivan Kiselev <[email protected]>

* Fix typos, including 'fomrat' -> 'format' in tracing.config-file help text. (#1552)

Signed-off-by: Callum Styan <[email protected]>
Signed-off-by: Ivan Kiselev <[email protected]>

* Compactor: Fix for #844 - Ignore object if it is the current directory (#1544)

* Ignore object if it is the current directory

Signed-off-by: Jamie Poole <[email protected]>

* Add full-stop

Signed-off-by: Jamie Poole <[email protected]>
Signed-off-by: Ivan Kiselev <[email protected]>

* Adding doc explaining the importance of groups for compactor (#1555)

Signed-off-by: Leo Meira Vital <[email protected]>
Signed-off-by: Ivan Kiselev <[email protected]>

* Add blank line for list (#1566)

The format of these files is wrong in the web.

Signed-off-by: dongwenjuan <[email protected]>
Signed-off-by: Ivan Kiselev <[email protected]>

* Refactor compactor constants, fix bucket column (#1561)

* compact: unify different time constants

Use downsample.* constants where possible. Move the downsampling time
ranges into constants and use them as well.

Signed-off-by: Giedrius Statkevičius <[email protected]>

* bucket: refactor column calculation into compact

Fix the column's name and name it UNTIL-DOWN because that is what it
actually shows - time until the next downsampling.

Move out the calculation into a separate function into the compact
package. Ideally we could use the retention policies in this calculation
as well but the `bucket` subcommand knows nothing about them :-(

Signed-off-by: Giedrius Statkevičius <[email protected]>

* compact: fix issues with naming

Reorder the constants and fix mistakes.

Signed-off-by: Giedrius Statkevičius <[email protected]>
Signed-off-by: Ivan Kiselev <[email protected]>

* remove duplicate

Signed-off-by: Ivan Kiselev <[email protected]>
GiedriusS pushed a commit that referenced this pull request Oct 28, 2019
* Add prober to receive

Signed-off-by: Kemal Akkoyun <[email protected]>

* Add changelog entries

Signed-off-by: Kemal Akkoyun <[email protected]>

* Update README

Signed-off-by: Kemal Akkoyun <[email protected]>

* Remove default

Signed-off-by: Kemal Akkoyun <[email protected]>

* Wait hashring to be ready

Signed-off-by: Kemal Akkoyun <[email protected]>
Signed-off-by: Giedrius Statkevičius <[email protected]>
GiedriusS pushed a commit that referenced this pull request Oct 28, 2019
* Some updates to compact docs

Signed-off-by: Ivan Kiselev <[email protected]>

* some formatting

Signed-off-by: Ivan Kiselev <[email protected]>

* Update docs/components/compact.md

accept PR suggestions

Co-Authored-By: Bartlomiej Plotka <[email protected]>
Signed-off-by: Ivan Kiselev <[email protected]>

* Add metalmatze to list of maintainers (#1547)

Signed-off-by: Matthias Loibl <[email protected]>
Signed-off-by: Ivan Kiselev <[email protected]>

* resolve comments

Signed-off-by: Ivan Kiselev <[email protected]>

* resolve last comment

Signed-off-by: Ivan Kiselev <[email protected]>

* receive: Add liveness and readiness probe (#1537)

* Add prober to receive

Signed-off-by: Kemal Akkoyun <[email protected]>

* Add changelog entries

Signed-off-by: Kemal Akkoyun <[email protected]>

* Update README

Signed-off-by: Kemal Akkoyun <[email protected]>

* Remove default

Signed-off-by: Kemal Akkoyun <[email protected]>

* Wait hashring to be ready

Signed-off-by: Kemal Akkoyun <[email protected]>
Signed-off-by: Ivan Kiselev <[email protected]>

* downsample: Add liveness and readiness probe (#1540)

* Add readiness and liveness probes for downsampler

Signed-off-by: Kemal Akkoyun <[email protected]>

* Add changelog entry

Signed-off-by: Kemal Akkoyun <[email protected]>

* Remove default

Signed-off-by: Kemal Akkoyun <[email protected]>

* Set ready

Signed-off-by: Kemal Akkoyun <[email protected]>

* Update CHANGELOG

Signed-off-by: Kemal Akkoyun <[email protected]>

* Clean CHANGELOG

Signed-off-by: Kemal Akkoyun <[email protected]>
Signed-off-by: Ivan Kiselev <[email protected]>

* Document the dnssrvnoa option (#1551)

Signed-off-by: Antonio Santos <[email protected]>
Signed-off-by: Ivan Kiselev <[email protected]>

* feat store: added readiness and livenes prober (#1460)

Signed-off-by: Martin Chodur <[email protected]>
Signed-off-by: Ivan Kiselev <[email protected]>

* Add Hotstar to adopters. (#1553)

It's the largest streaming service in India that does cricket and GoT
for India. They have insane scale and are using Thanos to scale their
Prometheus.

Spoke to them offline about adding the logo and will get a signoff here
too.

Signed-off-by: Goutham Veeramachaneni <[email protected]>
Signed-off-by: Ivan Kiselev <[email protected]>

* Fix hotstar logo in the adoptor's list (#1558)

Signed-off-by: Karthik Vijayaraju <[email protected]>
Signed-off-by: Ivan Kiselev <[email protected]>

* Fix typos, including 'fomrat' -> 'format' in tracing.config-file help text. (#1552)

Signed-off-by: Callum Styan <[email protected]>
Signed-off-by: Ivan Kiselev <[email protected]>

* Compactor: Fix for #844 - Ignore object if it is the current directory (#1544)

* Ignore object if it is the current directory

Signed-off-by: Jamie Poole <[email protected]>

* Add full-stop

Signed-off-by: Jamie Poole <[email protected]>
Signed-off-by: Ivan Kiselev <[email protected]>

* Adding doc explaining the importance of groups for compactor (#1555)

Signed-off-by: Leo Meira Vital <[email protected]>
Signed-off-by: Ivan Kiselev <[email protected]>

* Add blank line for list (#1566)

The format of these files is wrong in the web.

Signed-off-by: dongwenjuan <[email protected]>
Signed-off-by: Ivan Kiselev <[email protected]>

* Refactor compactor constants, fix bucket column (#1561)

* compact: unify different time constants

Use downsample.* constants where possible. Move the downsampling time
ranges into constants and use them as well.

Signed-off-by: Giedrius Statkevičius <[email protected]>

* bucket: refactor column calculation into compact

Fix the column's name and name it UNTIL-DOWN because that is what it
actually shows - time until the next downsampling.

Move out the calculation into a separate function into the compact
package. Ideally we could use the retention policies in this calculation
as well but the `bucket` subcommand knows nothing about them :-(

Signed-off-by: Giedrius Statkevičius <[email protected]>

* compact: fix issues with naming

Reorder the constants and fix mistakes.

Signed-off-by: Giedrius Statkevičius <[email protected]>
Signed-off-by: Ivan Kiselev <[email protected]>

* remove duplicate

Signed-off-by: Ivan Kiselev <[email protected]>
Signed-off-by: Giedrius Statkevičius <[email protected]>