-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
querier: Add /api/v1/labels support #905
Conversation
dc1498f
to
89e5efc
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the idea of this PR.Good stuff 👍
pkg/store/bucket.go
Outdated
s.mtx.RUnlock() | ||
|
||
if err := g.Wait(); err != nil { | ||
return nil, status.Error(codes.Aborted, err.Error()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Check here for partial response here and return it if it's enabled
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have checked partial response at the ProxyStore.LabelNames()
, should I do the same things here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might have misunderstood things, the whole point is to make partialResponse work so once you finish the tests this might be not needed :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hm. I don't get you comment @povilasv either as literally there is no way this will return error :
defer runutil.CloseWithLogOnErr(s.logger, indexr, "label names")
res := indexr.LabelNames()
mtx.Lock()
sets = append(sets, res)
mtx.Unlock()
return nil
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so partial response does not exists here 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@povilasv maybe g.Wait()
will not return an error when the timeout context is cancelled. So in this case, there is no way to return errors in g.Go()
block.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So g.Go()
needs to check ctx.Err()
, before executing stuff.
g.Go(func() error {
defer runutil.CloseWithLogOnErr(s.logger, indexr, "label names")
if ctx.Err() != nil {
//handle partial data
}
res := indexr.LabelNames()
mtx.Lock()
sets = append(sets, res)
mtx.Unlock()
return nil
})
Imagine the case where user canceld the request, and we still have tons of blocks to go thru.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jojohappy you forgot to address this ^
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I had updated at this moment, I tried to check partial response enabled when ctx
was timeout or cancelled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Quite confused by this thread, and sorry for delayed answer. I think partial response has really small reason of being handled here in store gw. It might just create confusion and being volation of YAGNI as no one requested every to ignore if block errors out. It is totally different to Querier partial response, where query might be still true if one StoreAPI is down
@povilasv Thanks for your awesome reivew. I have updated the PR, PTAL. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, just small suggestions.
Plus, let's add CHANGELOG.md entry for Added
section (:
func (s *BucketStore) LabelNames(ctx context.Context, _ *storepb.LabelNamesRequest) (*storepb.LabelNamesResponse, error) { | ||
g, gctx := errgroup.WithContext(ctx) | ||
|
||
s.mtx.RLock() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think at some point we need to make this more readable.
Either abstract away this or rename as I don't know now if we block Every gRPC call now or what. I needed to dive really deep to tell. But that's not for this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So we can create a ticket to record the issue.
pkg/store/bucket.go
Outdated
s.mtx.RUnlock() | ||
|
||
if err := g.Wait(); err != nil { | ||
return nil, status.Error(codes.Aborted, err.Error()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hm. I don't get you comment @povilasv either as literally there is no way this will return error :
defer runutil.CloseWithLogOnErr(s.logger, indexr, "label names")
res := indexr.LabelNames()
mtx.Lock()
sets = append(sets, res)
mtx.Unlock()
return nil
pkg/store/bucket.go
Outdated
s.mtx.RUnlock() | ||
|
||
if err := g.Wait(); err != nil { | ||
return nil, status.Error(codes.Aborted, err.Error()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so partial response does not exists here 👍
974373f
to
201f5e3
Compare
return nil, status.Error(codes.Unknown, err.Error()) | ||
} | ||
|
||
if m.Status != "success" { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a possability that both data and error are returned together? if yes then handle partial response, if no then close this :D
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is same for other requests as well we do here 💩 We need to agree on something consistently.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@povilasv I will be double check the response from Prometheus. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@povilasv Yes, I had checked and updated the code at this moment below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for being a bit PITA but it's hard to handle correctly partial response and contexts ⛰️
I do like this PR and functionality
@povilasv Sorry for the delay, I will update the PR today ASAP. |
@povilasv Done. |
7f35a5d
to
14a3d33
Compare
pkg/store/bucket.go
Outdated
s.mtx.RUnlock() | ||
|
||
if err := g.Wait(); err != nil { | ||
return nil, status.Error(codes.Aborted, err.Error()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jojohappy you forgot to address this ^
@povilasv All comments had been addressed, PTAL, thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, so overall I love this PR, good work @jojohappy @povilasv .. We are quite close to merging this.
However, in recent commits you added PartialResponse
handling everywhere, this is wrong, but let's discuss. It should be only in Querier and I mentioned reasons in my comments.
pkg/store/bucket.go
Outdated
s.mtx.RUnlock() | ||
|
||
if err := g.Wait(); err != nil { | ||
if !r.PartialResponseDisabled { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please kill this, just aborting is fine. We are not in Querier.
Essentially If we have error here, means something serious on store gateway indexes and our version is not TSDB compatibile.
pkg/store/bucket.go
Outdated
s.mtx.RUnlock() | ||
|
||
if err := g.Wait(); err != nil { | ||
return nil, status.Error(codes.Aborted, err.Error()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Quite confused by this thread, and sorry for delayed answer. I think partial response has really small reason of being handled here in store gw. It might just create confusion and being volation of YAGNI as no one requested every to ignore if block errors out. It is totally different to Querier partial response, where query might be still true if one StoreAPI is down
pkg/store/bucket.go
Outdated
func (r *bucketIndexReader) LabelNames(ctx context.Context) ([]string, error) { | ||
res := make([]string, 0, len(r.block.lvals)) | ||
for ln, _ := range r.block.lvals { | ||
if ctx.Err() != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mixed feelings, as IMO it is early, micro optimization that only increases code complexity.
There is no much of the point in checking ctx.Err in every place. It takes more cycles to check this IF than rest of this function not being cancelled (: I guess we wanted to save that memory in this slice when request is cancelled right? I would drop it IMO.
pkg/store/prometheus.go
Outdated
} | ||
|
||
if m.Status != "success" { | ||
if !r.PartialResponseDisabled { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please kill this.
No point in this masking actual error when we have ZERO data to return. No point, and really bad anti-pattern.
If Prometheus errors it means we have no way to tell anything - so error is OK.
Let's agree on something: PartialResponse can be handled and triggered only by querier. It's clear that Partial Response is really tricky to handle and reason about, so let's limit this to basic minimum (Querier proxy StoreAPI only). IMO no value in having it here and on store.
pkg/store/prometheus.go
Outdated
} | ||
|
||
if m.Status != "success" { | ||
if !r.PartialResponseDisabled { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Kill this as commented above.
pkg/store/prometheus.go
Outdated
sort.Strings(m.Data) | ||
|
||
if resp.StatusCode == http.StatusNoContent { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does it mean NoContent? Do you think Prometheus return that ?
What if the resp status is something not 200 though? I am pretty sure we need to check for non 200 response as well before even parsing body to sturct. Here and in other methods.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if the resp status is something not 200 though? I am pretty sure we need to check for non 200 response as well before even parsing body to sturct. Here and in other methods.
Sorry, I forgot to handle the server error response, would be added.
What does it mean NoContent? Do you think Prometheus return that ?
I think we should handle it, because Promethues API will return the http header of StatusNoContent
without data. Maybe it would be failed for parsing body, but not sure.
}) | ||
if err != nil { | ||
err = errors.Wrapf(err, "fetch label names from store %s", st) | ||
if r.PartialResponseDisabled { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO this is the only valid partial response handling/triggering in this PR (:
} | ||
|
||
return &storepb.LabelNamesResponse{ | ||
Names: strutil.MergeUnsortedSlices(names...), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unsorted ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the names from all stores could not be guaranteed sorted, so should we made a sort in all stores or just do it in proxy
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
HMM prometheus docs don't state the need for sorting, but all examples are sorted so I guess let's do sorting. I would do it in proxy :)
https://prometheus.io/docs/prometheus/latest/querying/api/#getting-label-names
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The MergeUnsortedSlices
would do sorting if the slices were not ordered yet. so the method is ok.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow the name is so confusing then! ;p
460009a
to
3ce75cb
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add sorting and IMO we are good to go.
@jojohappy sorry for misguiding you on partial response. I owe you 🍺 if we meet in some conference or smth :)
@povilasv It's ok, and I must give respects for your careful code review. Thank you very much. |
Some one said 🍺 ? ❤️ |
Signed-off-by: jojohappy <[email protected]>
Signed-off-by: jojohappy <[email protected]>
Signed-off-by: jojohappy <[email protected]>
Signed-off-by: jojohappy <[email protected]>
Signed-off-by: jojohappy <[email protected]>
Signed-off-by: jojohappy <[email protected]>
Signed-off-by: jojohappy <[email protected]>
Signed-off-by: jojohappy <[email protected]>
Signed-off-by: jojohappy <[email protected]>
3ce75cb
to
90fb062
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome, thanks!
* query: cleanup store statuses as they come and go (thanos-io#910) Signed-off-by: Adrien Fillon <[email protected]> * [docs] Example of using official prometheus helm chart to deploy server with sidecar (thanos-io#1003) * update documentation with an example of using official prometheus helm chart Signed-off-by: Ivan Kiselev <[email protected]> * a little formatting to values Signed-off-by: Ivan Kiselev <[email protected]> * satisfy PR comments Signed-off-by: Ivan Kiselev <[email protected]> * Compact: group concurrency (thanos-io#1010) * compact: add concurrency to group compact * add flag to controll the number of goroutines to use when compacting group * update compact.md for group-compact-concurrency * fixed: miss wg.Add() * address CR * regenerate docs * use err group * fix typo in flag description * handle context * set up workers in main loop * move var initialisation * remove debug log * validate concurrency * move comment * warn -> error * remove extra newline * fix typo * dns: Added miekgdns resolver as a hidden option to query and ruler. (thanos-io#1016) Fixes: thanos-io#1015 Signed-off-by: Bartek Plotka <[email protected]> * query: set default evaluation interval (thanos-io#1028) Subqueries allows request with no [specified resolution](https://prometheus.io/blog/2019/01/28/subquery-support/). Set it up and allow to configure default evaluation interval. * store+compactor: pre-compute index cache during compaction (thanos-io#986) Fixes first part of thanos-io#942 This changes allow to safe some startup & sync time in store gateway as it is no longer is needed to compute index-cache from block index on its own. For compatibility store GW still can do it, but it first checks bucket if there is index-cached uploaded already. In the same time, compactor precomputes the index cache file on every compaction. To allow quicker addition of index cache files we added `--index.generate-missing-cache-file` flag, that if enabled precompute missing files on compactor startup. Note that it will take time and it's only one-off step per bucket. Signed-off-by: Aleksei Semiglazov <[email protected]> * Added website for Thanos' docs using Hugo. (thanos-io#807) Hosted in github pages. Signed-off-by: adrien-f <[email protected]> Signed-off-by: Bartek Plotka <[email protected]> * gcs: Fixed scopes for inline ServiceAccount option. (thanos-io#1033) Without this that option was unusable. Signed-off-by: Bartek Plotka <[email protected]> * Fixed root docs and liche is now checking root dir as well. (thanos-io#1040) Signed-off-by: Bartek Plotka <[email protected]> * storage docs: add detail about GCS policies and testing (thanos-io#1037) * add more details about GCS policies and testing * remove fixed names from exec command * Prometheus library updated to v2.8.1 (thanos-io#1009) * compact: group concurrency improvements (thanos-io#1029) * group concurrency improvements * remove unnecessary error check * add to wg in main goroutine * receive: Add block shipping (thanos-io#1011) * receive: Add retention flag for local tsdb storage (thanos-io#1046) * querier: Add /api/v1/labels support (thanos-io#905) * Feature: add /api/v1/labels support Signed-off-by: jojohappy <[email protected]> * Disabled gossip by default, marked all flags as deprecated. (thanos-io#1055) + changed small label. Signed-off-by: Bartek Plotka <[email protected]> * ruler: Fixed Chunk going out or Max Uint16. (thanos-io#1041) Fixes thanos-io#1038 Signed-off-by: Bartek Plotka <[email protected]> * store: azure: allow passing an endpoint parameter for specific regions (thanos-io#980) Fix thanos-io#968 Signed-off-by: Adrien Fillon <[email protected]> * feature: support POST method for series endpoint (thanos-io#1021) Signed-off-by: Joseph Lee <[email protected]> * bucket verify: repair out of order labels (thanos-io#964) * bucket verify: repair out of order labels * verify repair: correctly order series in the index on rewrite When we have label sets that are not in the correct order, fixing that changes the order of the series in the index. So the index must be rewritten in that new order. This makes this repair tool take up a bunch more memory, but produces blocks that verify correctly. * Fix the TSDB block safe-delete function The directory name must be the block ID name exactly to verify. A temp directory or random name will not work here. * verify repair: fix duplicate chunk detection Pointer/reference logic error was eliminating all chunks for a series in a given TSDB block that wasn't the first chunk. Chunks are now referenced correctly via pointers. * PR feedback: use errors.Errorf() instead of fmt.Errorf() * Use errors.New() Some linters catch errors.Errorf() as its not really part of the errors package. * Liberally comment this for loop We're comparing items by pointers, using Go's range variables is misleading here and we need not fall into the same trap. * Take advantage of sort.Interface This prevents us from having to re-implement label sorting. * PR Feedback: Comments are full sentences. * Cut release 0.4.0-rc.0 (thanos-io#1017) * Cut release 0.4.0-rc.0 🎉 🎉 NOTE: This is last release that has gossip. Signed-off-by: Bartek Plotka <[email protected]> Co-Authored-By: povilasv <[email protected]> * Fixed crossbuild. Signed-off-by: Bartek Plotka <[email protected]> * ci: Env fixes. (thanos-io#1058) Signed-off-by: Bartek Plotka <[email protected]> * Removed bzr requirement for make crossbuild. Signed-off-by: Bartek Plotka <[email protected]> * Bump github.com/hashicorp/golang-lru from 0.5.0 to 0.5.1 (thanos-io#1051) Bumps [github.com/hashicorp/golang-lru](https://github.com/hashicorp/golang-lru) from 0.5.0 to 0.5.1. - [Release notes](https://github.com/hashicorp/golang-lru/releases) - [Commits](hashicorp/golang-lru@v0.5.0...v0.5.1) Signed-off-by: dependabot[bot] <[email protected]> * Initialze and correctly register all index cache metrics. (thanos-io#1069) * store/cache: add more tests (thanos-io#1071) * Fixed Downsampling process; Fixed `runutil.CloseAndCaptureErr` (thanos-io#1070) * runutil. Simplified CloseWithErrCapture. Signed-off-by: Bartek Plotka <[email protected]> * Fixed Downsampling process; Fixed runutil.CloseAndCaptureErr Fixes thanos-io#1065 Root cause: * runutil defered capture error function was not passing error properly so unit tests were passing, event though there was bug * streamed block write index cache requires index file which was not closed (saved) properly yet. Closers need to be closed to perform this. Signed-off-by: Bartek Plotka <[email protected]> * objstore: Expose S3 region attribute (thanos-io#1060) Minio is able to autodetect the region for cloud providers like AWS but the logic fails with Scaleway Object Storage solution. Related issue on Minio: minio/mc#2570 * Fixed fetching go-bindata failed (thanos-io#1074) * Fixed bug: - fetching go-bindata failed. - change the repo of go-bindata to github.com/go-bindata/go-bindata, because old repo has been archived. - pin the go-bindata as v3.3.1. Signed-off-by: jojohappy <[email protected]> * Add CHANGELOG Signed-off-by: jojohappy <[email protected]> * Remove CHANGELOG Signed-off-by: jojohappy <[email protected]> * add compare flags func to compare flags between prometheus and sidecar (thanos-io#838) Original message: * update documentation for a max/min block duration add compare flags func to compare flags between prom and sidecar * fix some nits Functional change: now we check the configured flags (if possible) and error out if MinTime != MaxTime. We need to check this always since if that is not true then we will get overlapping blocks. Additionally, an error message is printed out if it is not equal to 2h (the recommended value). * Ensured index cache is best effort, refactored tests, validated edge cases. (thanos-io#1073) Fixes thanos-io#651 Current size also includes slice header. Signed-off-by: Bartek Plotka <[email protected]> * website: Moved to netlify. (thanos-io#1078) Signed-off-by: Bartek Plotka <[email protected]> * website: Fixing netlify. (thanos-io#1080) Signed-off-by: Bartek Plotka <[email protected]> * website: Added "founded by" footer. (thanos-io#1081) Signed-off-by: Bartek Plotka <[email protected]> * store/proxy: properly check if context has ended (thanos-io#1082) How the code was before it could happen that we might receive some series from the stream however by the time we'd send them back to the reader, it would not read it anymore since the deadline would have been exceeded. Properly use a `select` here to get out of the goroutine if the deadline has been exceeded. Might potentially fix a problem where we see one goroutine hanging constantly (and thus blocking from work being done): ``` goroutine profile: total 126 25 @ 0x42f62f 0x40502b 0x405001 0x404de5 0xe7435b 0x45cc41 0xe7435a github.com/improbable-eng/thanos/pkg/store.startStreamSeriesSet.func1+0x18a /go/src/github.com/improbable-eng/thanos/pkg/store/proxy.go:318 ``` * Cut release v0.4.0-rc.1 (thanos-io#1088) Signed-off-by: Bartek Plotka <[email protected]> * website: Removed ghpages handling; fixed docs; and status badge. (thanos-io#1084) Signed-off-by: Bartek Plotka <[email protected]> * Fix readme (thanos-io#1090) * store: Compose indexCache properly allowing injection for testing purposes. (thanos-io#1098) Signed-off-by: Bartek Plotka <[email protected]> * website: add sponsor section on homepage (thanos-io#1062) * website: Adjusted logos sizing and responsiveness. (thanos-io#1105) Signed-off-by: Bartek Plotka <[email protected]> * Add Monzo to "Used by" section 🎉 (thanos-io#1106) * Compactor: remove malformed blocks after delay (thanos-io#1053) * compactor removes malformed blocks after delay * compactor removes malformed blocks after delay * include missing file * reuse existing freshness check * fix comment * remove unused var * fix comment * syncDelay -> consistencyDelay * fix comment * update flag description * address cr * fix dupliacte error handling * minimum value for --consistency-delay * update * docs * add test case * move test to inmem bucket * Add Utility Warehouse to "used by" section (thanos-io#1108) * Add Utility Warehouse logo * Make logo smaller * website: add Adform as users (thanos-io#1109) We use Thanos extensively as well so I have added Adform. * Cut release v0.4.0 (thanos-io#1107) Signed-off-by: Bartek Plotka <[email protected]>
Signed-off-by: jojohappy [email protected]
Changes
Implements: #702
Add
/api/v1/labels
api to thanos query.Verification
Run
curl <query url>/api/v1/labels
to fetch all of labels in Prometheus. Noticed that the api is added in Prometheus 2.6.