Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

store: Querying with regex label matchers return invalid metrics in version 0.3.0 #833

Closed
benclapp opened this issue Feb 11, 2019 · 13 comments

Comments

@benclapp
Copy link
Contributor

Thanos, Prometheus and Golang version used

  • Thanos: all components 0.3.0
  • Prometheus: 2.7.1

What happened
After upgrading all components from 0.2.1 to 0.3.0, some dashboards using queries with regex label matchers started showing incorrect metrics. Querying thanos store v0.2.1 from thanos query v0.3.0 works as expected. Toggling deduplication, partial response, and all downsampling options return the same incorrect results. Some query examples:

  • container_memory_usage_bytes{name=~"thanos-compact} returns metrics for all containers, whereas container_memory_usage_bytes{name="thanos-compact} returns metrics for only thanos-compact. In this case, the invalid results are only for a 6h window, all results outside of this window are correct. The blocks in this 6h window were uploaded with thanos 0.2.1 components.
  • up{job=~"thanos"} returns thanos metrics correctly, however up{invalidLabelName=~"thanos"} behaves as if the query was for only up. In this case, the query returns incorrect results for all time.

What you expected to happen
Queries return metrics that correctly match the regex in the query, just as prometheus and thanos-store v0.2.1 do.

How to reproduce it (as minimally and precisely as possible):
Should be covered by the description above, but happy to provide any information required.

Full logs to relevant components
thanos-store

level=info ts=2019-02-11T09:16:17.930002237Z caller=flags.go:75 msg="gossip is disabled"
level=info ts=2019-02-11T09:16:17.930311959Z caller=factory.go:39 msg="loading bucket configuration"
level=info ts=2019-02-11T09:16:42.611885128Z caller=main.go:256 msg="disabled TLS, key and cert must be set to enable"
level=info ts=2019-02-11T09:16:42.6122304Z caller=store.go:197 msg="starting store node"
level=info ts=2019-02-11T09:16:42.612546496Z caller=main.go:308 msg="Listening for metrics" address=0.0.0.0:19193
level=info ts=2019-02-11T09:16:42.612433471Z caller=store.go:166 msg="Listening for StoreAPI gRPC" address=0.0.0.0:19092

thanos-query

level=info ts=2019-02-11T09:30:25.069246106Z caller=flags.go:75 msg="gossip is disabled"
level=info ts=2019-02-11T09:30:25.088541595Z caller=main.go:256 component=query msg="disabled TLS, key and cert must be set to enable"
level=info ts=2019-02-11T09:30:25.088775851Z caller=query.go:468 msg="starting query node"
level=info ts=2019-02-11T09:30:25.092463697Z caller=query.go:437 msg="Listening for query and metrics" address=0.0.0.0:19192
level=info ts=2019-02-11T09:30:25.094654745Z caller=query.go:460 component=query msg="Listening for StoreAPI gRPC" address=0.0.0.0:10901
level=info ts=2019-02-11T09:30:30.096756862Z caller=storeset.go:250 component=storeset msg="adding new store to query storeset" address=172.18.0.4:19092
level=info ts=2019-02-11T09:30:30.096901412Z caller=storeset.go:250 component=storeset msg="adding new store to query storeset" address=172.18.0.3:19090

Anything else we need to know

@bwplotka
Copy link
Member

Hm.. I think it is actually quite related to this:

#834
#829

This all is related to this PR: #753 where we change the postings and matchers. Looking more into that, but I am on holidays, so have limited time.

@bwplotka
Copy link
Member

In mean time try deploying thanos-store in 0.2.1.

@benclapp
Copy link
Contributor Author

Thanos store 0.2.1 seems to be working fine for now, enjoy your holidays!

@bwplotka
Copy link
Member

We fixed the matchers here: #839

Also we added extended test cases, but I don't think we can repro your case. Can you try master image for thanos-store?

@benclapp
Copy link
Contributor Author

Looks like the second scenario from the issue has been fixed! Unfortunately the first has not.

If it'd help I can provide access to the GCS bucket? DM: @BenClapp

@bwplotka
Copy link
Member

bwplotka commented Feb 16, 2019

I think just more info in what is wrong is fine for now.

So this doesn't not work?

up{job=~"thanos"}returns thanos metrics correctly, however up{invalidLabelName=~"thanos"} behaves as if the query was for only up. In this case, the query returns incorrect results for all time.

What Prometheus itself is doing? It is doing as you expect? And what you expect? Because for "invalid label" ~= the flow is that it gives all, because we filtered out what we could - nothing because label does not exist. (:

@benclapp
Copy link
Contributor Author

benclapp commented Feb 17, 2019

Sure :) So with the build from master, this is what I see.

  • Query for up{invalidLabelName=~"foo"} returns all series, like querying up
  • Query for up{job=~"invalidLabelValue"} returns all series, like querying up or up{job=~".+"}

In both cases Prometheus returns no results, as I would expect.

I may be mistaken, but shouldn't the logic be to use the label selectors to select series, rather than to filter away series which don't match?

@benclapp
Copy link
Contributor Author

To update on this scenario from the original post

container_memory_usage_bytes{name=~"thanos-compact} returns metrics for all containers, whereas container_memory_usage_bytes{name="thanos-compact} returns metrics for only thanos-compact. In this case, the invalid results are only for a 6h window, all results outside of this window are correct. The blocks in this 6h window were uploaded with thanos 0.2.1 components.

Blocks that have a successful regex match return results as I would expect, that is only series for {name=thanos-compact}. Blocks that do not have any series that match, return all series for this metric name instead.

@sepich
Copy link
Contributor

sepich commented Feb 19, 2019

Confirming the issue.
Another thing to note is that it works fine from thanos-query UI, for example
go_goroutines{instance=~"thanos-query"}
returns single value.
But grafana doing:

GET api/datasources/proxy/1/api/v1/query_range?query=go_goroutines%7Binstance%3D~%22thanos-query%22%7D&start=1550571840&end=1550582670&step=30

to the same thanos-query and get 174 results.

@bwplotka
Copy link
Member

Blocks that do not have any series that match, return all series for this metric name instead.

Ok I think this is a key issue here (: let me again double check between our and tsdb matching in this detail.

@bwplotka
Copy link
Member

Found & fixed: #862

PTAL and thanks all for details input on what's broken (: Silly bug in edge case ): I wish there were more unit tests for TSDB we could get test cases from. Now our test cases are much more extended then for TSDB itself - hopefully it covers now all we need.

bwplotka added a commit that referenced this issue Feb 20, 2019
domgreen pushed a commit that referenced this issue Feb 21, 2019
@domgreen
Copy link
Contributor

#862 fixed and confirmed by @benclapp 👍 closing.

@sepich
Copy link
Contributor

sepich commented Feb 21, 2019

Thank you, confirming the fix in master-2019-02-21-0c730c1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants