
[0.9.3-rc1] Data disappears, replaced with a single point every 9 to 20 minutes #3781

Closed
gerrickw opened this issue Aug 21, 2015 · 35 comments

@gerrickw

This is slightly hard to explain, but essentially data is disappearing every 10-20 minutes. I am writing data every 10 seconds using a variety of metric names with different tags on the same field, and suddenly 10-20 minutes of data will disappear. The exception is a single tick at roughly the 9-14 to 20 minute mark where the 10-second metric is still visible. With more data the interval seems to be around 10 minutes; with less data, around 20 minutes.

Query:
SELECT last(value) FROM "requests" WHERE "colo" = 'aaa' AND "pool" = 'zzz' AND "x" = '001' AND "y" = '001' and time > now() - 10m GROUP BY time(10s)

This query will show:
2015-08-20T23:42:20Z 1183
2015-08-20T23:42:30Z 1071
2015-08-20T23:42:40Z 993
2015-08-20T23:42:50Z 1002
2015-08-20T23:43:00Z 1083
2015-08-20T23:43:10Z 1044
2015-08-20T23:43:20Z 1029
2015-08-20T23:43:30Z 1099
2015-08-20T23:43:40Z 1102
2015-08-20T23:43:50Z 1054
... 10 minutes later

2015-08-20T23:42:20Z
2015-08-20T23:42:30Z
2015-08-20T23:42:40Z
2015-08-20T23:42:50Z
2015-08-20T23:43:00Z 1083
2015-08-20T23:43:10Z
2015-08-20T23:43:20Z
2015-08-20T23:43:30Z
2015-08-20T23:43:40Z
2015-08-20T23:43:50Z
...
2015-08-20T23:54:40Z
2015-08-20T23:54:50Z
2015-08-20T23:55:00Z
2015-08-20T23:55:10Z 976
2015-08-20T23:55:20Z 1103
2015-08-20T23:55:30Z 1030
2015-08-20T23:55:40Z 1087
2015-08-20T23:55:50Z 956
2015-08-20T23:56:00Z

SHOW RETENTION POLICIES ON db
name duration replicaN default
default "0" 1 true

Example of inputs:

  • This format is used on all servers, with different measurement names (latency/requests/networking/etc.).
  • During each 10-second period all metrics are pulled and reported, so every metric shares the same integer epoch timestamp every 10s.

requests,colo=aaa,pool=zzz,prehostname=server-,x=021,y=027 value=1003 1440114198
requests,colo=aaa,pool=zzz,prehostname=server-,x=021,y=028 value=906 1440114198
requests,colo=aaa,pool=zzz,prehostname=server-,x=021,y=029 value=1151 1440114198
requests,colo=aaa,pool=zzz,prehostname=server-,x=021,y=030 value=1009 1440114198
requests,colo=aaa,pool=zzz,prehostname=server-,x=021,y=031 value=1001 1440114198
requests,colo=aaa,pool=zzz,prehostname=server-,x=021,y=032 value=1108 1440114198
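
For reference, points in this format can be written straight to the HTTP write endpoint. A minimal sketch, assuming a local instance on the default port with auth disabled and the database named "db" (precision=s because the timestamps above are epoch seconds):

$ curl -i -XPOST 'http://localhost:8086/write?db=db&precision=s' --data-binary \
'requests,colo=aaa,pool=zzz,prehostname=server-,x=021,y=027 value=1003 1440114198
requests,colo=aaa,pool=zzz,prehostname=server-,x=021,y=028 value=906 1440114198'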

A few notes:

  • This was happening on rc1; I wiped completely and installed the latest master today, but it is still happening.
  • Config is default.
  • Confirmed it isn't my client overwriting the value.

Example of one of the pools of servers from Grafana (10-second metrics were reported and shown earlier in the time period):
[screenshot: example-of-datapoints]

As a note, I confirmed this isn't a Grafana issue, as the same happens when querying directly with the query above.
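
For example, the same query can be run straight against the HTTP API; a minimal sketch, assuming a local instance on the default port with auth disabled and the database named "db":

$ curl -G 'http://localhost:8086/query' --data-urlencode "db=db" \
  --data-urlencode "q=SELECT last(value) FROM \"requests\" WHERE \"colo\" = 'aaa' AND \"pool\" = 'zzz' AND \"x\" = '001' AND \"y\" = '001' AND time > now() - 10m GROUP BY time(10s)"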

Let me know if you need more information; it's hard to know where to start explaining.

@otoolep
Contributor

otoolep commented Aug 21, 2015

Thanks @gerrickw for the report.

Without knowing exactly what is going on here, can you try the nightly build when it next becomes available? Some significant bug fixes went in earlier today, and it would be good to rule out those issues.

https://influxdb.com/download/index.html

@gerrickw
Author

@otoolep
I installed the latest nightly build today. Do you mean the one tomorrow?

@otoolep
Contributor

otoolep commented Aug 21, 2015

We need to update our docs so this is clear.

Nightly is generated at midnight Pacific time, so there should be another
in 6 hours with the fixes. Alternatively you can build master or the 0.9.3
branch from source.


@gerrickw
Author

Good to know. I'll deploy latest tomorrow.

Thanks.

@otoolep
Contributor

otoolep commented Aug 21, 2015

Great, thanks @gerrickw -- let us know what you find.

@otoolep otoolep added this to the 0.9.3 milestone Aug 21, 2015
@otoolep
Contributor

otoolep commented Aug 21, 2015

Flagging the milestone for review.

@huhongbo

I have exactly the same problem: GROUP BY queries lose points.

@pauldix
Member

pauldix commented Aug 21, 2015

Fairly certain this was fixed by #3761. Closing for now, but reopen if you still see this problem on the nightly build from last night.

@pauldix pauldix closed this as completed Aug 21, 2015
@gerrickw
Author

Still have the same problem, although the ticks now seem to come about every 4-6 minutes, with missing data in between. Upgraded to today's nightly build. I also have a new 500 error / timeout issue, but I'll report that in a different ticket after lunch.

Steps taken today, starting from yesterday's master (consolidated as a shell sketch after the list):

  1. sudo service influxdb stop
  2. ps -ef | grep influxdb # Execute a few times, waiting for influx to die.
  3. Delete the wal, hh, and data directories.
  4. sudo dpkg -i influxdb_nightly_amd64.deb # latest nightly today.
  5. sudo service influxdb start
  6. Log shows, "2015/08/21 11:26:30 InfluxDB starting, version 0.9.3-nightly-d259afe, branch master, commit d259afe"
  7. Start loading data.
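
Roughly, as a shell sketch (the data/wal/hh paths are an assumption based on the 0.9 packaged defaults; adjust to your config):

$ sudo service influxdb stop
$ while pgrep -x influxd > /dev/null; do sleep 1; done   # wait for the daemon to exit
$ sudo rm -rf /var/opt/influxdb/data /var/opt/influxdb/wal /var/opt/influxdb/hh   # assumed default paths
$ sudo dpkg -i influxdb_nightly_amd64.deb                # latest nightly .deb, downloaded beforehand
$ sudo service influxdb start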

I don't see a way to reopen this ticket, possibly need permissions?

@otoolep
Contributor

otoolep commented Aug 21, 2015

OK, thanks @gerrickw -- you are running a build with the important fixes, if that is the commit-hash of your system.

Please open a ticket regarding your 500 timeout, and be sure to include details of how you are sending data to the system.

@otoolep
Contributor

otoolep commented Aug 21, 2015

@pauldix -- I am re-opening this, please close if in error.

@otoolep otoolep reopened this Aug 21, 2015
@desa
Contributor

desa commented Aug 21, 2015

@gerrickw I'm having a hard time reproducing this error. I wrote 9,190,000 points a few hours ago and everything is still there now.

@gerrickw
Author

I'll see if I can put together a test script that writes example points similar to my workflow. I need to do a few other things today, but I'll try to have something by tonight.

@otoolep
Contributor

otoolep commented Aug 21, 2015

@gerrickw -- that would be great, we're keen to see what is going on here.

@Jhors2

Jhors2 commented Aug 22, 2015

Updated this morning. I appear to have the same problem. It seems that whenever a new shard is created (I'm not entirely sure this is the case), the data at timestamps behind that shard disappears and only shows up in 9-10 minute increments. I dumped the DB and started over to verify this behavior. If I can recreate it, I can unicast you my data, @otoolep.

FWIW, all nodes are in a consistent raft state according to "show servers".

Edit: just upgraded to RC2 to see if the fix is there. Will report back.

@gerrickw
Author

Reproduced using the script in the gist below. You will need to pip install influxdb.

There are a number of arguments to customize things, although it appears to be reproducible with just:
python simple_influx_writer.py --hhh_iii

User/pass/db default to test_db against localhost, but you can set them as desired.

https://gist.github.com/gerrickw/f83fb4d4d69aef2dfd37

Once there are about 15-20 minutes of data, run the following query and notice data disappearing over time.
SELECT last(value) FROM "e" WHERE "colo" = 'coloa' AND "pool" = 'hhh_iii' AND time > now() - 20m GROUP BY time(10s), "y"
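
For anyone who would rather not run the gist, a rough shell stand-in for the write pattern it produces (hypothetical, not the actual script; assumes a local instance with auth disabled and a test_db database already created):

#!/bin/bash
# Write one point per "y" tag every 10 seconds, all sharing the same epoch-second timestamp.
# The 50 series here is an arbitrary choice for illustration.
while true; do
  ts=$(date +%s)
  for y in $(seq -w 1 50); do
    curl -s -XPOST 'http://localhost:8086/write?db=test_db&precision=s' \
      --data-binary "e,colo=coloa,pool=hhh_iii,y=$y value=$((RANDOM % 200 + 900)) $ts" > /dev/null
  done
  sleep 10
done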

@otoolep
Contributor

otoolep commented Aug 22, 2015

If either @gerrickw or @Jhors2 can build from source, I'd be very interested in knowing if you see the same problem with the patch below in place:

$ git diff
diff --git a/tsdb/engine.go b/tsdb/engine.go
index 71da46a..748c2db 100644
--- a/tsdb/engine.go
+++ b/tsdb/engine.go
@@ -18,7 +18,7 @@ var (
 )

 // DefaultEngine is the default engine used by the shard when initializing.
-const DefaultEngine = "bz1"
+const DefaultEngine = "b1"

 // Engine represents a swappable storage engine for the shard.
 type Engine interface {
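
To try it, one way is to apply the diff to a source checkout and rebuild influxd -- a sketch, assuming a standard Go workspace and the 2015-era import path (the patch filename is hypothetical):

$ go get -d github.com/influxdb/influxdb
$ cd $GOPATH/src/github.com/influxdb/influxdb
$ git apply engine-default.patch   # the diff above, saved to a local file
$ go build ./cmd/influxd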

@otoolep
Contributor

otoolep commented Aug 22, 2015

Oh, and @gerrickw and @Jhors2 -- if you do run any of these tests please be sure to start with a new system, as shards created with a previous engine are not changed when running with patched software. Thanks for your help.

@otoolep
Contributor

otoolep commented Aug 24, 2015

Alternatively running your test against the stable release 0.9.2 and telling us if you see the same thing would help us rule out a lot of changes.

@gerrickw

@gerrickw
Author

Sure, I'll try out 0.9.2 tomorrow. A bit busy today. >_<

@huhongbo

I've tested 0.9.2 and it works fine, but 0.9.3-rc2 still loses points with GROUP BY.

@s1m0

s1m0 commented Aug 24, 2015

0.9.3-rc1 was very bad at losing points, but rc3 seems better; after 1 hour of running I haven't seen any lost data.

@s1m0

s1m0 commented Aug 24, 2015

Lost all data from 15 minutes ago and beyond; I'll have to revert to 0.9.2...

@otoolep
Contributor

otoolep commented Aug 24, 2015

We are managing to reproduce this issue in-house with the script supplied by @gerrickw -- thanks @gerrickw. However, we'd still like further confirmation from the community that this is a 0.9.3-rc issue only and that 0.9.2 did not suffer from it. We'll also check that ourselves.

@Jhors2

Jhors2 commented Aug 24, 2015

@otoolep I can confirm I do not experience this problem at all with 0.9.2. This degradation appears to have started with the compaction/WAL patch that landed right when RC1 was cut.

@otoolep
Contributor

otoolep commented Aug 24, 2015

OK, thanks @Jhors2

@otoolep
Contributor

otoolep commented Aug 24, 2015

We have confirmed here that this appears to be an issue with the new bz1 engine, and it looks like it is triggered by a flush/compaction cycle.

@otoolep
Contributor

otoolep commented Aug 24, 2015

The problem also persists through a restart of the process.

@gerrickw
Author

Also confirmed it looks better on 0.9.2, although I'm running into #3748 (which I reported previously) due to the b1 engine, which causes timeouts a few hours in, after the points flush. Either way, as far as data disappearing goes, it looks better after running for a few hours.

Thanks.

@gerrickw
Author

Oh, and as a note, I wasn't able to reproduce the 500 error code I mentioned above. There was a point in time where I received nothing but 500 server errors and had to restart the service before data would load again. If I run into it again and find a way to reproduce it, I'll open a separate issue.

pauldix added a commit that referenced this issue Aug 25, 2015
Seeking to the middle of a compressed block wasn't working properly. Fixes #3781
@otoolep
Contributor

otoolep commented Aug 25, 2015

Thanks very much @gerrickw for initially reporting this issue and providing the test script -- your help was very important. We believe this issue has been addressed now, and we can no longer reproduce it with your script.

Please let us know if you do not see an improvement with this change in place.

@gerrickw
Author

Oh great, thanks for the quick fix. Glad the script was useful. I'll try it out tomorrow when the next build is released. :-D

@otoolep
Contributor

otoolep commented Aug 25, 2015

@gerrickw
Author

Installed and running for 10 minutes, and it's looking good. Will let it run overnight. Thanks.

@s1m0

s1m0 commented Aug 26, 2015

Installed rc3 last night. Checked various measurements and found that data which had been missing has reappeared, so there was no data loss, which is good news. Thanks for the quick fix!
