query results are incorrect #2027

Closed
dstrek opened this issue Mar 20, 2015 · 8 comments
@dstrek
Contributor

dstrek commented Mar 20, 2015

I inserted one million 'events' with two fields, value and name, where value is always 1 and name is a random 3-letter string. The timestamp was a random time between now and 30 days ago. Each batch of 10,000 has an 'app' tag set to a random string, so there are 100 unique tag values.

> select * from events limit 1
name: events
------------
time            name    value
2015-02-18T11:43:36Z    4z4 1

I can see there are roughly a million points inserted. They were inserted in batches of 10,000 using this code: https://gist.github.com/dstrek/e342c32ae1b5dc0b7132. Nothing complex, just a basic use case.
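For reference, the write pattern described above could look roughly like this. This is a minimal sketch of the data generation only, based on the description in this issue; the names and structure are assumptions, not the actual code from the linked gist:

```python
import random
import string
import time

WINDOW_SECONDS = 30 * 24 * 3600  # ~2.6 million seconds in 30 days


def random_string(length):
    """Random lowercase alphanumeric string."""
    return ''.join(random.choice(string.ascii_lowercase + string.digits)
                   for _ in range(length))


def make_batch(batch_size=10_000, now=None):
    """One batch of points: value always 1, name a random 3-letter
    string, timestamp a random second in the last 30 days, and the
    whole batch sharing a single random 'app' tag value."""
    now = int(time.time()) if now is None else now
    app = random_string(14)  # one tag value per batch -> 100 unique tags
    return [{
        "measurement": "events",
        "tags": {"app": app},
        "time": now - random.randrange(WINDOW_SECONDS),
        "fields": {"name": random_string(3), "value": 1},
    } for _ in range(batch_size)]
```

Because the timestamps are drawn at second resolution from a fixed window, two points in the same batch (and therefore the same series) can land on the same timestamp and overwrite each other.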

> select count(value) from events
name: events
------------
time            count
1970-01-01T00:00:00Z    997960

However if I try to query for the last 30, 31, 32, 40 days the counts are all different and incorrect.

> select count(value) from events where time > now() - 30d
name: events
------------
time                count
2015-02-04T00:00:00.001207251Z   515206

> select count(value) from events where time > now() - 31d
name: events
------------
time                count
2015-01-26T00:00:00.000789066Z     249888

> select count(value) from events where time > now() - 32d
name: events
------------
time                count
2015-02-14T00:00:00.001704135Z   915488

> select count(value) from events where time > now() - 40d
name: events
------------
time                count
2015-01-05T00:00:00.00068226Z   0

Now if I group the -40d query into 1-day buckets, there's actually a count.

> select count(value) from events where time > now() - 40d group by time(1d)
name: events
------------
time            count
2015-02-08T00:00:00Z    0
2015-02-09T00:00:00Z    0
2015-02-10T00:00:00Z    0
2015-02-11T00:00:00Z    0
2015-02-12T00:00:00Z    0
2015-02-13T00:00:00Z    0
2015-02-14T00:00:00Z    0
2015-02-15T00:00:00Z    0
2015-02-16T00:00:00Z    0
2015-02-17T00:00:00Z    0
2015-02-18T00:00:00Z    17108
2015-02-19T00:00:00Z    33187
2015-02-20T00:00:00Z    33415
2015-02-21T00:00:00Z    32977
2015-02-22T00:00:00Z    33317
2015-02-23T00:00:00Z    33315
2015-02-24T00:00:00Z    33242
2015-02-25T00:00:00Z    33327
2015-02-26T00:00:00Z    33464
2015-02-27T00:00:00Z    32919
2015-02-28T00:00:00Z    33427
2015-03-01T00:00:00Z    33467
2015-03-02T00:00:00Z    33238
2015-03-03T00:00:00Z    33298
2015-03-04T00:00:00Z    33282
2015-03-05T00:00:00Z    33156
2015-03-06T00:00:00Z    33072
2015-03-07T00:00:00Z    33445
2015-03-08T00:00:00Z    33138
2015-03-09T00:00:00Z    33477
2015-03-10T00:00:00Z    33290
2015-03-11T00:00:00Z    33371
2015-03-12T00:00:00Z    33080
2015-03-13T00:00:00Z    33262
2015-03-14T00:00:00Z    33348
2015-03-15T00:00:00Z    33450
2015-03-16T00:00:00Z    33260
2015-03-17T00:00:00Z    33156
2015-03-18T00:00:00Z    32890
2015-03-19T00:00:00Z    33450
2015-03-20T00:00:00Z    16132
@dstrek
Contributor Author

dstrek commented Mar 20, 2015

Forgot to say I am using the latest master, commit 8e4309b

@dstrek
Contributor Author

dstrek commented Mar 20, 2015

It occurred to me to check using a fixed timestamp, and it seems like the issue may be the way events are written out. Repeating the same query seconds apart, the count I get keeps increasing.

> select count(value) from events where time > '2015-02-16T00:00:00Z'
name: events
------------
time                count
2015-01-31T09:55:29.927419114Z   481521

> select count(value) from events where time > '2015-02-16T00:00:00Z'
name: events
------------
time                count
2015-01-31T10:53:11.446247926Z   482837

> select count(value) from events where time > '2015-02-16T00:00:00Z'
name: events
------------
time                count
2015-01-31T11:15:56.312206756Z   483404

So now I'm not sure what this issue should be. There's definitely something wrong with the way these queries return inconsistent results, but why does the call to write a batch of points return successfully if the writes aren't actually done?

I can sum all the values grouped into yearly buckets and get what seems to be the correct total, yet without the group by it's wrong.

> select sum(value) from events where time > now() - 1500d group by time(365d)
name: events
------------
time            sum
2010-12-22T00:00:00Z
2011-12-22T00:00:00Z
2012-12-21T00:00:00Z
2013-12-21T00:00:00Z
2014-12-21T00:00:00Z    997960

> select sum(value) from events where time > now() - 1500d
name: events
------------
time                sum
2011-01-26T00:00:00.00002219Z   516139

@pauldix
Member

pauldix commented Mar 20, 2015

Thanks @dstrek, any chance you can zip up the data directory and post it somewhere so we can troubleshoot?

@pauldix
Member

pauldix commented Mar 20, 2015

Wait, so after your test, are the results inconsistent? Are you sure that you're not posting points in the same series with the same timestamp? I ask that last question because they'll overwrite previous points.

@dstrek
Contributor Author

dstrek commented Mar 20, 2015

Yes, I know points with the same timestamp in a series overwrite each other. Each batch has 10,000 chances of drawing the same random offset between 0 and ~2.6 million (the number of seconds in a month), and I figured that accounted for the missing points where the total is 997,960 instead of an even million.
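That collision math roughly checks out. Assuming each batch of 10,000 points is its own series and timestamps are drawn uniformly from a 30-day window of 2,592,000 seconds, a birthday-style estimate of the expected number of distinct timestamps gives a total very close to the observed 997,960:

```python
# Birthday-style estimate: n uniform draws from N possible seconds.
n = 10_000          # points per batch (each batch is its own series)
N = 30 * 24 * 3600  # 2,592,000 seconds in the 30-day window
batches = 100

# Expected number of distinct timestamps among n uniform draws from N.
expected_unique = N * (1 - (1 - 1 / N) ** n)
expected_total = batches * expected_unique
# ~19 overwrites per batch, so an expected total a little under one
# million, consistent with the observed count of 997,960.
```

So the shortfall from one million is fully explained by duplicate timestamps; the inconsistent window queries are a separate problem.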

I have another measurement with 10 million points that exhibits the same issues. There are ~100k series in that one, though it shouldn't matter, as each query should only merge 1k of them, and it responds quickly.

> select count(value) from impressions where app = 'm8o5j29bFR44Yw'
name: impressions
-----------------
time            count
1970-01-01T00:00:00Z   99998

> select count(value) from impressions where time > now() - 32d and app = 'm8o5j29bFR44Yw'
name: impressions
-----------------
time                count
2015-02-14T00:00:00.00167993Z   91448

> select count(value) from impressions where time > now() - 33d and app = 'm8o5j29bFR44Yw'
name: impressions
-----------------
time                count
2015-02-01T00:00:00.001133728Z   51593

I can't make sense of the dates returned in these: 2015-02-14 isn't 32 days ago, and one day further back certainly isn't 2015-02-01.

I will upload the data, though it may take a bit.

@dstrek
Contributor Author

dstrek commented Mar 20, 2015

http://cdn.fancy.support/influx_db_11m.tar.bz2

$ shasum influx_db_11m.tar.bz2
1d0d9d2283ed3898a5d3f185bfcbbc044774b5c1  influx_db_11m.tar.bz2

@cloudjunky

Yeah, I am getting a very similar result to this.

Also on count(distinct(value)) I am getting ERR: json: cannot unmarshal object into Go value of type error

e.g.

> select count(distinct(value)) from mydb where time > now() - 7d group by time(5m)
ERR: json: cannot unmarshal object into Go value of type error

If I just count without the 'where time' clause I get:

> select count(value) from mydb
name: mydb
--------------
time            count
1970-01-01T00:00:00Z    92988

However if I count (without distinct) and use time, I get proper results:

> select count(value) from mydb where time > now() - 7d group by time(1d)
name: mydb
--------------
time            count
2015-03-25T00:00:00Z    94319
2015-03-26T00:00:00Z    

@jwilder
Contributor

jwilder commented May 18, 2015

I ran the linked test program and verified that it is writing duplicate timestamps, which accounts for the missing values.

The slow increase in the count is likely due to timestamps in the future; I saw the same thing when writing points in the future. You could try running a query with time < now() + 7d or similar to see if the missing values show up.
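One way to see why repeated counts creep upward: if queries implicitly cap the time range at now() (which the `time < now() + 7d` suggestion implies), then points accidentally written with future timestamps only become visible as the clock catches up to them. A minimal simulation of that behaviour, where the implicit upper bound is an assumption rather than something stated in this thread:

```python
DAY = 24 * 3600


def count_visible(timestamps, now, window=30 * DAY):
    # Models `count(value) ... where time > now() - 30d` with an
    # assumed implicit `time < now()` upper bound: future-dated points
    # are excluded until now() passes their timestamp.
    return sum(1 for t in timestamps if now - window < t < now)


now = 1_000_000_000
# Ten points in the past, five accidentally written in the future.
timestamps = ([now - i * DAY for i in range(10)] +
              [now + i * DAY for i in range(1, 6)])

first = count_visible(timestamps, now)            # future points hidden
later = count_visible(timestamps, now + 3 * DAY)  # some have "arrived"
```

Running the same query later yields a larger count even though no new points were written, matching the slowly increasing counts reported above.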

The timestamp returned by the count(value) queries without a group by time is effectively random and should be ignored.
