
[0.9.3-rc1] Data disappears, replaced with a single point every 9 to 20 minutes #3781

Closed
gerrickw opened this issue Aug 21, 2015 · 35 comments

@gerrickw

This is slightly hard to explain, but essentially data is disappearing every 10-20 minutes. I am writing data every 10 seconds using a variety of metric names with different tags on the same field, and suddenly 10-20 minutes of data will disappear. The exception is a single tick at roughly the 9-14 to 20 minute mark where the 10-second metric is still visible. With more data the interval seems to be around 10 minutes; with less data, around 20 minutes.

Query:
SELECT last(value) FROM "requests" WHERE "colo" = 'aaa' AND "pool" = 'zzz' AND "x" = '001' AND "y" = '001' and time > now() - 10m GROUP BY time(10s)

This query will show:
2015-08-20T23:42:20Z 1183
2015-08-20T23:42:30Z 1071
2015-08-20T23:42:40Z 993
2015-08-20T23:42:50Z 1002
2015-08-20T23:43:00Z 1083
2015-08-20T23:43:10Z 1044
2015-08-20T23:43:20Z 1029
2015-08-20T23:43:30Z 1099
2015-08-20T23:43:40Z 1102
2015-08-20T23:43:50Z 1054
... 10 minutes later

2015-08-20T23:42:20Z
2015-08-20T23:42:30Z
2015-08-20T23:42:40Z
2015-08-20T23:42:50Z
2015-08-20T23:43:00Z 1083
2015-08-20T23:43:10Z
2015-08-20T23:43:20Z
2015-08-20T23:43:30Z
2015-08-20T23:43:40Z
2015-08-20T23:43:50Z
...
2015-08-20T23:54:40Z
2015-08-20T23:54:50Z
2015-08-20T23:55:00Z
2015-08-20T23:55:10Z 976
2015-08-20T23:55:20Z 1103
2015-08-20T23:55:30Z 1030
2015-08-20T23:55:40Z 1087
2015-08-20T23:55:50Z 956
2015-08-20T23:56:00Z

SHOW RETENTION POLICIES ON db
name duration replicaN default
default "0" 1 true

Example of inputs:

  • This format is used on all servers, with different measurement names (latency/requests/networking/etc.).
  • During each 10-second period all metrics are pulled and reported, so every metric shares the same integer epoch timestamp every 10s.

requests,colo=aaa,pool=zzz,prehostname=server-,x=021,y=027 value=1003 1440114198
requests,colo=aaa,pool=zzz,prehostname=server-,x=021,y=028 value=906 1440114198
requests,colo=aaa,pool=zzz,prehostname=server-,x=021,y=029 value=1151 1440114198
requests,colo=aaa,pool=zzz,prehostname=server-,x=021,y=030 value=1009 1440114198
requests,colo=aaa,pool=zzz,prehostname=server-,x=021,y=031 value=1001 1440114198
requests,colo=aaa,pool=zzz,prehostname=server-,x=021,y=032 value=1108 1440114198
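
For reference, points in this format can be written straight to the HTTP write endpoint. A minimal sketch, assuming a local instance on the default port with auth disabled and the database named "db" (precision=s because the timestamps above are epoch seconds):

$ curl -i -XPOST 'http://localhost:8086/write?db=db&precision=s' --data-binary \
'requests,colo=aaa,pool=zzz,prehostname=server-,x=021,y=027 value=1003 1440114198
requests,colo=aaa,pool=zzz,prehostname=server-,x=021,y=028 value=906 1440114198'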

A few notes:

  • This was happening on rc1; I wiped completely and installed the latest master today, but it is still happening.
  • Config is default.
  • Confirmed it isn't my client overwriting the value.

Example of one of the pools of servers from Grafana (10-second metrics were reported and shown earlier in the time period):
[screenshot: example-of-datapoints]

As a note, I confirmed this isn't a Grafana issue, as the same happens when querying directly with the query above.
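
For example, the same query can be run straight against the HTTP API; a minimal sketch, assuming a local instance on the default port with auth disabled and the database named "db":

$ curl -G 'http://localhost:8086/query' --data-urlencode "db=db" \
  --data-urlencode "q=SELECT last(value) FROM \"requests\" WHERE \"colo\" = 'aaa' AND \"pool\" = 'zzz' AND \"x\" = '001' AND \"y\" = '001' AND time > now() - 10m GROUP BY time(10s)"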

Let me know if you need more information; it's hard to know where to start explaining.

@otoolep
Contributor

otoolep commented Aug 21, 2015

Thanks @gerrickw for the report.

Without knowing exactly what is going on here, can you try the nightly build when it next becomes available? Some significant bug fixes went in earlier today, and it would be good to rule out those issues.

https://influxdb.com/download/index.html

@gerrickw
Author

@otoolep
I installed the latest nightly build today. Do you mean the one tomorrow?

@otoolep
Contributor

otoolep commented Aug 21, 2015

We need to update our docs so this is clear.

Nightly is generated at midnight Pacific time, so there should be another
in 6 hours with the fixes. Alternatively you can build master or the 0.9.3
branch from source.


@gerrickw
Author

Good to know. I'll deploy latest tomorrow.

Thanks.

@otoolep
Contributor

otoolep commented Aug 21, 2015

Great, thanks @gerrickw -- let us know what you find.

@otoolep otoolep added this to the 0.9.3 milestone Aug 21, 2015
@otoolep
Contributor

otoolep commented Aug 21, 2015

Flagging the milestone for review.

@huhongbo

I have exactly the same problem: GROUP BY queries lose points.

@pauldix
Member

pauldix commented Aug 21, 2015

Fairly certain this was fixed by #3761. Closing for now, but reopen if you still see this problem on the nightly build from last night.

@pauldix pauldix closed this as completed Aug 21, 2015
@gerrickw
Author

Still have the same problem, although the ticks now seem to come about every 4-6 minutes, with missing data in between. Upgraded to today's nightly build. I also have a new 500 error / timeout issue, but I'll report that in a different ticket after lunch.

Steps taken today, starting from yesterday's master (consolidated as a shell sketch after the list):

  1. sudo service influxdb stop
  2. ps -ef | grep influxdb # Execute a few times, waiting for influx to die.
  3. Delete the wal, hh, and data directories.
  4. sudo dpkg -i influxdb_nightly_amd64.deb # latest nightly today.
  5. sudo service influxdb start
  6. Log shows, "2015/08/21 11:26:30 InfluxDB starting, version 0.9.3-nightly-d259afe, branch master, commit d259afe"
  7. Start loading data.
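
Roughly, as a shell sketch (the data/wal/hh paths are an assumption based on the 0.9 packaged defaults; adjust to your config):

$ sudo service influxdb stop
$ while pgrep -x influxd > /dev/null; do sleep 1; done   # wait for the daemon to exit
$ sudo rm -rf /var/opt/influxdb/data /var/opt/influxdb/wal /var/opt/influxdb/hh   # assumed default paths
$ sudo dpkg -i influxdb_nightly_amd64.deb                # latest nightly .deb, downloaded beforehand
$ sudo service influxdb start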

I don't see a way to reopen this ticket, possibly need permissions?

@otoolep
Contributor

otoolep commented Aug 21, 2015

OK, thanks @gerrickw -- you are running a build with the important fixes, if that is the commit-hash of your system.

Please open a ticket regarding your 500 timeout, and be sure to include details of how you are sending data to the system.

@otoolep
Contributor

otoolep commented Aug 21, 2015

@pauldix -- I am re-opening this, please close if in error.

@otoolep otoolep reopened this Aug 21, 2015
@desa
Contributor

desa commented Aug 21, 2015

@gerrickw I'm having a hard time reproducing this error. I wrote 9,190,000 points a few hours ago and everything is still there now.

@gerrickw
Author

I'll see if I can put together a test script that writes example points similar to my workflow. I need to do a few other things today, but I'll try to have something by tonight.

@otoolep
Contributor

otoolep commented Aug 21, 2015

@gerrickw -- that would be great, we're keen to see what is going on here.

@Jhors2

Jhors2 commented Aug 22, 2015

Updated this morning. I appear to have the same problem. It seems that whenever a new shard is created (I'm not entirely sure this is the case), the data at timestamps behind that shard disappears and only shows up in 9-10 minute increments. I dumped the DB and started over to verify this behavior. If I can recreate it, I can unicast you my data, @otoolep.

FWIW, all nodes are in a consistent raft state according to "show servers".

Edit: just upgraded to RC2 to see if the fix is there. Will report back.

@gerrickw
Author

Reproduced using the script in the gist below. You will need to pip install influxdb.

There are a number of arguments to customize things, although it appears to be reproducible with just:
python simple_influx_writer.py --hhh_iii

User/pass/db default to test_db against localhost, but you can set them as desired.

https://gist.github.com/gerrickw/f83fb4d4d69aef2dfd37

Once there are about 15-20 minutes of data, run the following query and notice data disappearing over time.
SELECT last(value) FROM "e" WHERE "colo" = 'coloa' AND "pool" = 'hhh_iii' AND time > now() - 20m GROUP BY time(10s), "y"
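
For anyone who would rather not run the gist, a rough shell stand-in for the write pattern it produces (hypothetical, not the actual script; assumes a local instance with auth disabled and a test_db database already created):

#!/bin/bash
# Write one point per "y" tag every 10 seconds, all sharing the same epoch-second timestamp.
# The 50 series here is an arbitrary choice for illustration.
while true; do
  ts=$(date +%s)
  for y in $(seq -w 1 50); do
    curl -s -XPOST 'http://localhost:8086/write?db=test_db&precision=s' \
      --data-binary "e,colo=coloa,pool=hhh_iii,y=$y value=$((RANDOM % 200 + 900)) $ts" > /dev/null
  done
  sleep 10
done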

@otoolep
Contributor

otoolep commented Aug 22, 2015

If either @gerrickw or @Jhors2 can build from source, I'd be very interested in knowing if you see the same problem with the patch below in place:

$ git diff
diff --git a/tsdb/engine.go b/tsdb/engine.go
index 71da46a..748c2db 100644
--- a/tsdb/engine.go
+++ b/tsdb/engine.go
@@ -18,7 +18,7 @@ var (
 )

 // DefaultEngine is the default engine used by the shard when initializing.
-const DefaultEngine = "bz1"
+const DefaultEngine = "b1"

 // Engine represents a swappable storage engine for the shard.
 type Engine interface {
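
To try it, one way is to apply the diff to a source checkout and rebuild influxd -- a sketch, assuming a standard Go workspace and the 2015-era import path (the patch filename is hypothetical):

$ go get -d github.com/influxdb/influxdb
$ cd $GOPATH/src/github.com/influxdb/influxdb
$ git apply engine-default.patch   # the diff above, saved to a local file
$ go build ./cmd/influxd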

@otoolep
Contributor

otoolep commented Aug 22, 2015

Oh, and @gerrickw and @Jhors2 -- if you do run any of these tests please be sure to start with a new system, as shards created with a previous engine are not changed when running with patched software. Thanks for your help.

@otoolep
Contributor

otoolep commented Aug 24, 2015

Alternatively running your test against the stable release 0.9.2 and telling us if you see the same thing would help us rule out a lot of changes.

@gerrickw

@gerrickw
Author

Sure, I'll try out 0.9.2 tomorrow. A bit busy today. >_<

@huhongbo

I've tested 0.9.2 and it works fine, but 0.9.3-rc2 still loses points with GROUP BY.

@s1m0

s1m0 commented Aug 24, 2015

0.9.3-rc1 was very bad at losing points, but rc3 seems better; after 1 hour of running I haven't seen any lost data.

@s1m0

s1m0 commented Aug 24, 2015

Lost all data from 15 minutes ago and beyond; I'll have to revert to 0.9.2...

@otoolep
Contributor

otoolep commented Aug 24, 2015

We are managing to reproduce this issue in-house with the script supplied by @gerrickw -- thanks @gerrickw. However, we'd still like further confirmation from the community that this is a 0.9.3-rc issue only and that 0.9.2 did not suffer from it. We'll also check that ourselves.

@Jhors2

Jhors2 commented Aug 24, 2015

@otoolep I can confirm I do not experience this problem at all with 0.9.2. This degradation appears to have started with the compaction/WAL patch that landed right when RC1 was cut.

@otoolep
Contributor

otoolep commented Aug 24, 2015

OK, thanks @Jhors2

@otoolep
Contributor

otoolep commented Aug 24, 2015

We have confirmed here that this appears to be an issue with the new bz1 engine, and it looks like it is triggered by a flush/compaction cycle.

@otoolep
Contributor

otoolep commented Aug 24, 2015

The problem also persists through a restart of the process.

@gerrickw
Author

Also confirmed it looks better on 0.9.2, although I'm running into #3748 (which I reported previously) due to the b1 engine, which causes timeouts a few hours in, after the points flush. Either way, as far as data disappearing goes, it looks better after running for a few hours.

Thanks.

@gerrickw
Author

Oh, and as a note, I wasn't able to reproduce the 500 error code I mentioned above. There was a point in time where I received nothing but 500 server errors and had to restart the service before data would load again. If I run into it again and find a way to reproduce it, I'll open a separate issue.

pauldix added a commit that referenced this issue Aug 25, 2015
Seeking to the middle of a compressed block wasn't working properly. Fixes #3781
@otoolep
Contributor

otoolep commented Aug 25, 2015

Thanks very much @gerrickw for initially reporting this issue and providing the test script -- your help was very important. We believe this issue has been addressed now, and we can no longer reproduce it with your script.

Please let us know if you do not see an improvement with this change in place.

@gerrickw
Author

Oh great, thanks for the quick fix. Glad the script was useful. I'll try it out tomorrow when the next build is released. :-D

@otoolep
Contributor

otoolep commented Aug 25, 2015

@gerrickw
Author

Installed and running for 10 minutes, and it's looking good. Will let it run overnight. Thanks.

@s1m0

s1m0 commented Aug 26, 2015

Installed rc3 last night. Checked various measurements and found that data which had been missing has reappeared, so there was no data loss, which is good news. Thanks for the quick fix!
