Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Long-Range Query Weirdness #396

Closed
treed opened this issue Jun 29, 2018 · 11 comments
Closed

Long-Range Query Weirdness #396

treed opened this issue Jun 29, 2018 · 11 comments

Comments

@treed
Copy link

treed commented Jun 29, 2018

Thanos, Prometheus and Golang version used

thanos, version v0.1.0-rc.1 (branch: master, revision: 466371fca24319f0bf07b579685fe13c5f28d19d)
  build user:       ir@irp250-v23g04
  build date:       20180629-22:42:58
  go version:       go1.10.2

prometheus, version 2.2.1 (branch: HEAD, revision: bc6058c81272a8d938c05e75607371284236aadc)
  build user:       root@149e5b3f0829
  build date:       20180314-14:15:45
  go version:       go1.10

What happened

Queried for weeks of data, and found a bunch of it missing:

screen shot 2018-06-29 at 16 04 45

With a tighter view, Thanos can return the data from the lines on the right, just not with a wide view.

What you expected to happen

Seeing weeks of data (here's one of the underlying Prometheus instances):

screen shot 2018-06-29 at 16 04 10

How to reproduce it (as minimally and precisely as possible):

Unsure. I've done some experimentation to see exactly where the threshold between "full data" and "partial data" lies.

At a query width (unsure what the term for this is) of:

  • 979498s, Thanos shows full data for both sides
  • 979499s, replica A exhibits the problem, but B still shows full data
  • 979512s, Still showing B but not A
  • 979513s, I get a banner saying "no data found" (but the data from the 979512s attempt sticks around)
  • 979529s, the banner goes away and the graph is blank, but the legend says it has data for replica A (but nothing from B)

(I have screenshots for all of the above)

The same problem also happens if you remove the irate() part, and just query for raw data.

Full logs to relevant components

Logs

Querier:

level=info ts=2018-06-29T22:44:14.513851044Z caller=flags.go:53 msg="StoreAPI address that will be propagated through gossip" address=10.62.203.43:10909
level=info ts=2018-06-29T22:44:14.528827684Z caller=flags.go:68 msg="QueryAPI address that will be propagated through gossip" address=10.62.203.43:9090
level=info ts=2018-06-29T22:44:14.543604525Z caller=query.go:251 msg="starting query node"
level=info ts=2018-06-29T22:44:14.549198932Z caller=query.go:243 component=query msg="Listening for StoreAPI gRPC" address=0.0.0.0:10909
level=info ts=2018-06-29T22:44:14.549569349Z caller=query.go:225 msg="Listening for query and metrics" address=0.0.0.0:9090
level=info ts=2018-06-29T22:44:19.551188012Z caller=storeset.go:223 component=storeset msg="adding new store to query storeset" address=10.62.203.43:10905
level=info ts=2018-06-29T22:44:19.551287768Z caller=storeset.go:223 component=storeset msg="adding new store to query storeset" address=10.62.203.43:10901

Sidecar for Replica A:

level=info ts=2018-06-29T22:44:09.700629869Z caller=flags.go:53 msg="StoreAPI address that will be propagated through gossip" address=10.62.203.43:10901
level=info ts=2018-06-29T22:44:09.714410305Z caller=sidecar.go:233 msg="No GCS or S3 bucket was configured, uploads will be disabled"
level=info ts=2018-06-29T22:44:09.714455003Z caller=sidecar.go:270 msg="starting sidecar" peer=
level=info ts=2018-06-29T22:44:09.714541843Z caller=main.go:243 msg="Listening for metrics" address=0.0.0.0:10902
level=info ts=2018-06-29T22:44:09.714571609Z caller=sidecar.go:215 component=sidecar msg="Listening for StoreAPI gRPC" address=0.0.0.0:10901

The other replica's sidecar has identical logs except for the timestamps and port numbers. (All components are running on the same VM)

Anything else we need to know

Object store isn't, in play here. This is just a pair of Prometheus instances with the default 15d retention.

I updated to latest master today after first seeing the problem before starting to gather data for a bug report.

@bwplotka
Copy link
Member

bwplotka commented Jul 2, 2018

Thanks for reporting. Can you try without auto downsampling option? (choose raw data). Can you see all data then?

I might think this is a duplicate of #376 but let us know.

Seems like downsampling is not working 100%, we need to investigate it closer.

@pcallewaert
Copy link

I've the same problem, same setup (Prom v2.2.1 and Thanos v0.1.0-rc1).
In the Thanos WebUI I have tried the Downsampling option Raw data, but did not have any effect.

@bwplotka bwplotka added the bug label Jul 2, 2018
@bwplotka
Copy link
Member

bwplotka commented Jul 2, 2018

Oh I misunderstood. There is no thanos strore. So indeed no downsampling issue for sure. Interesting.

  • Is the query results reproducible all the time?
  • What if you ask for different time range.. for just 1h? Does it show anything?
  • Have you changed the sidecar external labels lately? What are these now and in the past (if there was a change)?

I am suspecting some bug / configuration error with matching the time range and labels, but let's see. As discussed with @pcallewaert, he has different issues (downsampling one)

@treed
Copy link
Author

treed commented Jul 5, 2018

The results I filed the bug about happened consistently throughout an afternoon. I just came back from a (very) long weekend, and found that the queries that were failing then (including the original timeframe, ending on June 29th) are now fully loading, but, if I move the window to look at more recent data, the same problem occurs. It seems to be about crossing some boundary back from the current time.

If I ask for a shorter range, as mentioned in the bug report, things are fine, until the range crosses some threshold and then it hits.

No external labels have been changed since this was all set up a while back.

And, yeah, changing the downsampling setting has no effect. In fact, I'm not even sure it was available on the version I was using when I first noticed this bug.

@bwplotka
Copy link
Member

Interesting characteristics of the downsampling bug found by @asbjxrn:

peculiar case of missing data for long term queries (Cluster is one query node and one store node only. The blobstore contains 2 blocks 162h blocks compacted by Prometheus and uploaded. thanos downsample has been run on it since):

I get a result (graph in ui) when querying the time series directly. The query is a 1 week range query for: node_time_seconds{instance="hostname"}

If I wrap it in rate(...[5m]) I get "no datapoints found" using downsampling (Auto, 5m and 1h all fails, I believe there is only 5m downsamples in the store).

If I choose only raw data, the rate query works again.

Using a shorter range (2d) I can scroll across the whole week using the rate function and downsampled data and get result back each time.

I know there is a known issue with downsampling/long time range queries, but was surprised wrapping it in a function changed the outcome when the plain query worked.

Will be useful to track it down.

@trevorwhitney
Copy link

Seeing this same behavior on 0.1.0-rc2. Graph for one week of data:
screen shot 2018-11-28 at 11 02 57 am

Graph for two weeks of data:
screen shot 2018-11-28 at 11 03 11 am

Changing deduplication, resolution, and downsampling/raw data option has no effect.

Are there any updates on this issue? We have a large environment setup with 20 HA Prometheus pairs each with a Thanos sidecar and a similar set of metrics. We're happy to provide access if it can help debug the problem.

@robertjsullivan
Copy link
Contributor

I spent some time debugging this today. It appears that the Prometheus XOR encoding scheme used in the sidecar cannot encode/decode more than 2^16th number of samples. I think what is happening is the sidecar queries the remote read endpoint without a step size, gets more than 2^16 number of points back and that overflows a uint16 counter on the compression. Those points are then sent back to the querier via grpc with a size indicating only the amount of samples that overflowed the buffer.

Any thoughts on what the proper fix is @bwplotka ?

Here is a quick test that proves it. Apologies for copying the encodeChunk method out of store/prometheus.go

package encoding_test

import (
    "github.com/improbable-eng/thanos/pkg/store/prompb"
    "github.com/improbable-eng/thanos/pkg/store/storepb"
    "github.com/improbable-eng/thanos/pkg/testutil"
    "github.com/prometheus/tsdb/chunkenc"
    "testing"
    "time"
)

func TestEncoding_EncodesAndDecodes_Correctly(t *testing.T) {
    largestNumberOfSamples := 65535
    samples := createSamples(largestNumberOfSamples)
    testutil.Equals(t, len(samples), largestNumberOfSamples)

    _, cb, err := encodeChunk(samples)
    testutil.Ok(t, err)

    chk, err := chunkenc.FromData(chunkenc.EncXOR, cb)
    testutil.Ok(t, err)

    testutil.Equals(t, chk.NumSamples(), largestNumberOfSamples)
}

func TestEncoding_EncodesAndDecodes2ToThe16th_Fails(t *testing.T) {
    tooManySamples := 65536 //2^16
    samples := createSamples(tooManySamples)
    testutil.Equals(t, len(samples), tooManySamples)

    _, cb, err := encodeChunk(samples)
    testutil.Ok(t, err)

    chk, err := chunkenc.FromData(chunkenc.EncXOR, cb)
    testutil.Ok(t, err)

    testutil.Equals(t, chk.NumSamples(), tooManySamples)
}

func createSamples(size int) []prompb.Sample {
    samples := make([]prompb.Sample, 0)
    for i := 0; i < size; i++ {
        sample := prompb.Sample{
            Timestamp: time.Now().UnixNano() + 1,
            Value:     float64(i),
        }
        samples = append(samples, sample)
    }

    return samples
}

func encodeChunk(ss []prompb.Sample) (storepb.Chunk_Encoding, []byte, error) {
    c := chunkenc.NewXORChunk()

    a, err := c.Appender()
    if err != nil {
        return 0, nil, err
    }

    for _, s := range ss {
        a.Append(int64(s.Timestamp), float64(s.Value))
    }
    return storepb.Chunk_XOR, c.Bytes(), nil
}

@robertjsullivan
Copy link
Contributor

I think this is the uint16 enforcing line in prometheus/tsdb . https://github.com/prometheus/tsdb/blob/8d991bdc1e494074ca2d559e5a75f49f13a789ac/chunkenc/xor.go#L75

@kitemongerer
Copy link

It seems like this comment might explain what's going on https://github.com/improbable-eng/thanos/blob/270d81feec83ba5defa19d64e022d84e3b2e1a2a/pkg/store/prometheus.go#L158-L160

If there are more than 2^16 samples, they might need to be split into multiple chunks

@bwplotka
Copy link
Member

bwplotka commented Jan 8, 2019

Wow huge thanks @robertjsullivan for digging into it so deeply. Once again, I got positively surprised by commuinty! ❤️ 💪

So overall.. The limit on of uint16 is totally fine from Prometheus and TSDB perspective, because there should be no more than 120 samples within single chunk and this is done purpose: https://github.com/prometheus/tsdb/blob/master/head.go#L1514

It appears that the Prometheus XOR encoding scheme used in the sidecar cannot encode/decode more than 2^16th number of samples. I think what is happening is the sidecar queries the remote read endpoint without a step size, gets more than 2^16 number of points back and that overflows a uint16 counter on the compression. Those points are then sent back to the querier via grpc with a size indicating only the amount of samples that overflowed the buffer.

You and @kitemongerer are totally right here. I think you spotted valid bug. We totally missed it as we were focused on progressing on this: (https://docs.google.com/document/d/1JqrU3NjM9HoGLSTPYOvR217f5HBKBiJTqikEB9UiJL0/edit?ts=5c2cdaf0#) which would solve this issue entirely.

Since stream remote read is still in progress and since we care about backward compatibility the bug is valid and we should fix the the implementation.

The fix is extremely straightforward to me, we should split series by 2^16 and produce biggest chunks possible, because the comment there is still valid:

// We generally expect all samples of the requested range to be traversed
// so we just encode all samples into one big chunk regardless of size.

I will leave the honour of fixing this to you @robertjsullivan (: If you prefer me or anyone doing fix, I am fine, let us know.

@robertjsullivan
Copy link
Contributor

Created the PR. Let me know if anything needs to change. Thanks!

bwplotka pushed a commit that referenced this issue Jan 9, 2019
…16 samples (#718)

* fixes issue #396, split response into chunks no bigger than 2^16 to avoid overflowing XOR compression

Signed-off-by: Robert Sullivan <[email protected]>

* use single underlying array for chunkSamples

* make SplitSamplesIntoChunksWithMaxSizeOfUint16_e2e use 3 chunks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants