[0.9.3-rc1] Data disappears, replaced with a single point every 9 to 20 minutes #3781
Comments
Thanks @gerrickw for the report. Without knowing exactly what is going on here, can you try the nightly build when it next becomes available? Some significant bug fixes went in earlier today, and it would be good to rule out those issues.
@otoolep
We need to update our docs; that should be made clear there. The nightly is generated at midnight Pacific time, so there should be another one available then.
Good to know. I'll deploy latest tomorrow. Thanks.
Great, thanks @gerrickw -- let us know what you find.
Flagging the milestone for review.
I have exactly the same problem.
Fairly certain this was fixed by #3761. Closing for now, but reopen if you still see this problem on the nightly build from last night.
Still have the same problem, although there now seem to be ticks about every 4-6 minutes with missing data in between. Upgraded to today's nightly build. I have a new error code 500 timeout issue as well, but I'll report that in a different ticket after lunch. Steps starting at yesterday's master through today:
I don't see a way to reopen this ticket; possibly I need permissions?
OK, thanks @gerrickw -- you are running a build with the important fixes, if that is the commit-hash of your system. Please open a ticket regarding your 500 timeout, and be sure to include details of how you are sending data to the system.
@pauldix -- I am re-opening this, please close if in error.
@gerrickw having a hard time reproducing this error. Wrote 9,190,000 points a few hours ago and everything is still there now.
I'll see if I can get a test script writing example points similar to my workflow. Need to do a few other things today, but I'll see if I can get something by tonight.
@gerrickw -- that would be great, we're keen to see what is going on here.
Updated this morning. I appear to have the same problem. It seems that whenever a new shard is created (I'm not entirely sure this is the case), the data disappears from timestamps behind that shard and only shows up in 9-10 minute increments. I dumped the DB and started over to verify this behavior. If I can recreate it, I can unicast you my data @otoolep. FWIW, we are all in a raft-consistent state according to "show server". /edit/ Just upgraded to RC2 to see if the fix is there. Will report back. /edit/
Reproduced using the script in the gist below. You will need to pip install influxdb. It takes a number of arguments to customize things, although the issue appears to be reproducible with the defaults. User/pass/db default to test_db against localhost, but you can set them as desired. https://gist.github.com/gerrickw/f83fb4d4d69aef2dfd37 Once about 15-20 minutes of data have been written, run the following query and notice data disappearing over time.
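The gist itself isn't inlined in this thread; purely as an illustration, here is a minimal sketch of what a reproduction script along these lines might look like, using the influxdb Python client. The host, credentials, tag values, and batch size are assumptions, not the actual gist contents:

```python
# Hypothetical sketch of a reproduction script (assumptions, not the actual gist):
# write a batch of "requests" points every 10 seconds, then watch for gaps later.
import time
from influxdb import InfluxDBClient  # pip install influxdb

client = InfluxDBClient(host="localhost", port=8086,
                        username="root", password="root",
                        database="test_db")
try:
    client.create_database("test_db")
except Exception:
    pass  # the database may already exist

while True:
    # One point per tag combination, mimicking the "requests" series in the issue
    points = [{
        "measurement": "requests",
        "tags": {"colo": "aaa", "pool": "zzz", "x": "%03d" % x, "y": "001"},
        "fields": {"value": 1000 + x},
    } for x in range(1, 11)]
    client.write_points(points)
    time.sleep(10)
```

After roughly 15-20 minutes, a query like the SELECT last(value) ... GROUP BY time(10s) one shown in the issue description should make the disappearing intervals visible.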
If either @gerrickw or @Jhors2 can build from source, I'd be very interested in knowing if you see the same problem with the patch below in place:
Alternatively, running your test against the stable release 0.9.2 and telling us if you see the same thing would help us rule out a lot of changes.
Sure, I'll try out 0.9.2 tomorrow. A bit busy today. >_<
I've tested 0.9.2 and it works fine.
0.9.3-rc1 was very bad at losing points, but rc3 seems better; after 1 hour of running I haven't seen any lost data.
Lost all data from 15 minutes ago and beyond; I'll have to revert to 0.9.2...
@otoolep I can confirm I do not experience this problem at all with 0.9.2. This degradation appears to have started once the Compaction/WAL patch happened right when RC1 was cut.
OK, thanks @Jhors2
We have confirmed here that this appears to be an issue with the new bz1 engine, and looks like it's triggered by a flush/compaction cycle.
The problem also persists through a restart of the process.
Also confirmed it looks better on 0.9.2, although I'm running into #3748, which I reported previously due to the b1 engine and which causes timeouts a few hours in, after the points flush. Either way, it looks better with respect to data disappearing after I ran it for a few hours. Thanks.
Oh, and as a note, I wasn't able to reproduce the 500 error code I mentioned above. There was a point in time where I received nothing but 500 server errors, which required me to restart the service for data to load again. If I run into it again and find a way to reproduce it, I'll open a separate issue.
Seeking to the middle of a compressed block wasn't working properly. Fixes #3781
Thanks very much @gerrickw for initially reporting this issue and providing the test script -- your help was very important. We believe this issue has now been addressed, and we can no longer reproduce it with your script. Please let us know if you do not see an improvement with this change in place.
Oh great, thanks for the quick fix. Glad the script was useful. I'll try it out tomorrow when the next build is released. :-D
RC3 is now available, which has the fix for this issue: https://s3.amazonaws.com/influxdb/influxdb_0.9.3-rc3_amd64.deb
Installed and running for 10 minutes, and it's looking good. Will let it run overnight. Thanks.
Installed rc3 last night. Checked various measurements and found that data that had been missing has reappeared, so there was no data loss, which is good news. Thanks for the quick fix!
This is slightly hard to explain, but essentially data is disappearing every 10-20 minutes. I am entering data every 10 seconds using a variety of metric names with different tags on the same field, and suddenly 10-20 minutes of data will disappear. The exception is a single tick somewhere around the 9-20 minute mark where the 10-second metric is still visible. With more data the interval seems to be around 10 minutes; with less data, around 20 minutes.
Query:
SELECT last(value) FROM "requests" WHERE "colo" = 'aaa' AND "pool" = 'zzz' AND "x" = '001' AND "y" = '001' and time > now() - 10m GROUP BY time(10s)
This query will show:
2015-08-20T23:42:20Z 1183
2015-08-20T23:42:30Z 1071
2015-08-20T23:42:40Z 993
2015-08-20T23:42:50Z 1002
2015-08-20T23:43:00Z 1083
2015-08-20T23:43:10Z 1044
2015-08-20T23:43:20Z 1029
2015-08-20T23:43:30Z 1099
2015-08-20T23:43:40Z 1102
2015-08-20T23:43:50Z 1054
... 10 minutes later
2015-08-20T23:42:20Z
2015-08-20T23:42:30Z
2015-08-20T23:42:40Z
2015-08-20T23:42:50Z
2015-08-20T23:43:00Z 1083
2015-08-20T23:43:10Z
2015-08-20T23:43:20Z
2015-08-20T23:43:30Z
2015-08-20T23:43:40Z
2015-08-20T23:43:50Z
...
2015-08-20T23:54:40Z
2015-08-20T23:54:50Z
2015-08-20T23:55:00Z
2015-08-20T23:55:10Z 976
2015-08-20T23:55:20Z 1103
2015-08-20T23:55:30Z 1030
2015-08-20T23:55:40Z 1087
2015-08-20T23:55:50Z 956
2015-08-20T23:56:00Z
SHOW RETENTION POLICIES ON db
name duration replicaN default
default "0" 1 true
Example of inputs:
requests,colo=aaa,pool=zzz,prehostname=server-,x=021,y=027 value=1003 1440114198
requests,colo=aaa,pool=zzz,prehostname=server-,x=021,y=028 value=906 1440114198
requests,colo=aaa,pool=zzz,prehostname=server-,x=021,y=029 value=1151 1440114198
requests,colo=aaa,pool=zzz,prehostname=server-,x=021,y=030 value=1009 1440114198
requests,colo=aaa,pool=zzz,prehostname=server-,x=021,y=031 value=1001 1440114198
requests,colo=aaa,pool=zzz,prehostname=server-,x=021,y=032 value=1108 1440114198
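As context for how points like these might reach the server, here is a rough, hypothetical illustration of posting the line protocol above to InfluxDB's HTTP write endpoint using Python's requests library. The database name is an assumption, precision=s reflects the second-resolution timestamps in the examples, and this is not necessarily how the original data was sent:

```python
import requests

# Two of the example points above, as newline-separated line protocol
lines = "\n".join([
    "requests,colo=aaa,pool=zzz,prehostname=server-,x=021,y=027 value=1003 1440114198",
    "requests,colo=aaa,pool=zzz,prehostname=server-,x=021,y=028 value=906 1440114198",
])

# precision=s because the trailing timestamps are in seconds
resp = requests.post("http://localhost:8086/write",
                     params={"db": "test_db", "precision": "s"},
                     data=lines)
resp.raise_for_status()
```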
A few notes:
Example of one of the pools of servers from Grafana (10-second metrics were reported and shown earlier in the time period):
![example-of-datapoints](https://cloud.githubusercontent.com/assets/656611/9398728/3884156e-475f-11e5-9ae5-72779479ed3f.png)
As a note, I confirmed this isn't a Grafana issue, as the same thing happens when querying directly with the query above.
Let me know if more information is needed. Hard to know where to start explaining.