Slow writes / memory peaks / crashes caused by creation of large numbers of shards #6635
Can you describe what you are seeing? What writes or data are outside of those dirs?
I have no idea what is being written (tmp files?). This machine only has Ubuntu, InfluxDB and Telegraf (which stores to another InfluxDB host), no swap file, and an EBS volume mounted on / (slow). As you can see, during retention policy actions we have a lot of writes on our slow disk. In the big picture we are trying to find out why these writes are so slow (40s). I guess this all is not normal for a small database:
@francisdb Are you swapping? You could try running ... Also, when your writes are slow, can you grab some profiles with:
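(For reference, a minimal sketch of how such profiles can be pulled, assuming the instance exposes Go's standard net/http/pprof handlers on the default HTTP port 8086; the port, endpoint paths, and file naming are assumptions for this example.)

```go
// profile_dump.go: fetch goroutine and heap profiles from an InfluxDB
// instance, assuming Go's net/http/pprof handlers are reachable on :8086.
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"time"
)

func dump(endpoint, outFile string) error {
	resp, err := http.Get("http://localhost:8086/debug/pprof/" + endpoint)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	f, err := os.Create(outFile)
	if err != nil {
		return err
	}
	defer f.Close()

	_, err = io.Copy(f, resp.Body)
	return err
}

func main() {
	ts := time.Now().Format("20060102-150405")
	// debug=1 renders the profiles as plain text, similar to the
	// goroutine.txt attachment posted below.
	for _, p := range []string{"goroutine?debug=1", "heap?debug=1"} {
		name := p[:len(p)-len("?debug=1")]
		if err := dump(p, fmt.Sprintf("%s-%s.txt", ts, name)); err != nil {
			fmt.Fprintln(os.Stderr, "failed to fetch", name, ":", err)
		}
	}
}
```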
We have no swap file; let me have a go at getting those profiles.
Is there a way to get more internal logging? 20160516-16:30:46:910736791-goroutine.txt
Here is the log where you can see inserts taking more than 2 minutes:
[tsm1] 2016/05/16 18:33:50 /var/lib/influxdb/data/server/default/83512 database index loaded in 1.061µs
The profiles show many writes blocked trying to create shard groups, which is strange. Looking at the code, there is some optimization that could be done there, but the heap dumps show quite a lot of memory in use for shard groups. How many shards do you have and how are they configured? Did you change any configuration settings from the defaults?
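(A minimal sketch of one way to answer the shard-count question, assuming the default HTTP query API on localhost:8086 and the usual query-response JSON layout, with one series per database in the `SHOW SHARDS` output; add credentials as query parameters if auth is enabled.)

```go
// count_shards.go: ask the HTTP query API for SHOW SHARDS and count the
// rows returned per database. Assumes the default bind address (:8086).
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
	"os"
)

type queryResponse struct {
	Results []struct {
		Series []struct {
			Name   string          `json:"name"`
			Values [][]interface{} `json:"values"`
		} `json:"series"`
	} `json:"results"`
}

func main() {
	q := url.Values{"q": {"SHOW SHARDS"}}
	resp, err := http.Get("http://localhost:8086/query?" + q.Encode())
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer resp.Body.Close()

	var qr queryResponse
	if err := json.NewDecoder(resp.Body).Decode(&qr); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for _, r := range qr.Results {
		for _, s := range r.Series {
			// One series per database, one row per shard.
			fmt.Printf("%s: %d shards\n", s.Name, len(s.Values))
		}
	}
}
```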
Nope, have always used default config settings.
@francisdb Can you provide the output of ...
This is a fresh machine from this morning where I imported ...
@jwilder thanks for looking into this. Is there anything we can do to make things better? Our server crashes many times per day, and it sometimes takes a few minutes to restart. I will also have a look at why we end up with these 2014-11-04 shards; I guess that means we have written data that is going to be discarded anyway.
Replies to questions from the mailing list:
- We also have this problem at night when there are 0 queries performed, only data added. Could this be related to overwriting existing data?
- We constantly have 4 cores at ~25%, with longer 50+% peaks on all cores.
- I will try to upload a log today and am currently reviewing the writes.
Some logs we got just now during a memory spike:
The restart is triggered by our own monitoring script, which restarts influx when it goes over 92% RAM.
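(A minimal sketch of such a watchdog, assuming a Linux host with /proc/meminfo and a sysvinit-style `service influxdb restart`; the 92% threshold comes from the comment above, everything else is illustrative.)

```go
// mem_watchdog.go: restart influxdb when system memory use crosses a
// threshold. A real deployment would add rate-limiting and logging.
package main

import (
	"bufio"
	"os"
	"os/exec"
	"strconv"
	"strings"
	"time"
)

// memUsedPercent parses /proc/meminfo (Linux; MemAvailable needs a
// reasonably recent kernel).
func memUsedPercent() float64 {
	f, err := os.Open("/proc/meminfo")
	if err != nil {
		return 0
	}
	defer f.Close()

	vals := map[string]float64{}
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) >= 2 {
			v, _ := strconv.ParseFloat(fields[1], 64)
			vals[strings.TrimSuffix(fields[0], ":")] = v
		}
	}
	total, avail := vals["MemTotal"], vals["MemAvailable"]
	if total == 0 {
		return 0
	}
	return 100 * (total - avail) / total
}

func main() {
	for range time.Tick(10 * time.Second) {
		if memUsedPercent() > 92 {
			// "service influxdb restart" matches a default .deb install
			// on Ubuntu 14.04; adjust for systemd on 16.04.
			exec.Command("service", "influxdb", "restart").Run()
		}
	}
}
```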
I see we now have shards going back to 2009-11-18. Can a single data point uploaded for an old date trigger the creation of many shards? We now have 425 shards, which will be reduced to about 30 the next time the retention policy kicks in!
There is a change in the current nightlies that reduced the lock contention around creating shard groups. You could try out the latest nightly to see if that helps.
@jwilder Always a bit tricky to put a nightly in production...
One more question: can we track the number of shards (groups) with Telegraf?
@francisdb I thought you had a test env you could try the nightly on. My mistake. Shard groups are created based on the time range of points. If you insert one point way in the past, it will cause a shard group for that time range to get created if it does not exist.
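(To put rough numbers on that: assuming a 7-day shard group duration for a long retention policy, which is an assumption for this sketch rather than something stated above, a backfill scattered across several years can create hundreds of shard groups, one per 7-day window that receives at least one point.)

```go
// shard_group_estimate.go: rough upper bound on how many shard groups a
// backfill spanning a given time range can create, assuming one group per
// shard-group-duration window that actually receives a point.
package main

import (
	"fmt"
	"time"
)

func main() {
	groupDuration := 7 * 24 * time.Hour // assumed default for long retention policies

	oldest := time.Date(2009, 11, 18, 0, 0, 0, 0, time.UTC) // oldest shard seen above
	newest := time.Date(2016, 5, 16, 0, 0, 0, 0, time.UTC)

	// Upper bound: every window in the range receives at least one point.
	windows := int(newest.Sub(oldest)/groupDuration) + 1
	fmt.Printf("up to %d shard groups between %s and %s\n",
		windows, oldest.Format("2006-01-02"), newest.Format("2006-01-02"))
}
```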
I think we found what is triggering our issues. We pull in data from other services, and we had a webhook that told us to pull in data from way in the past. We now make sure we only pull in data that is not outside of the retention period. In any case, I guess creating hundreds of shards should not kill the server?
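(For anyone hitting the same pattern, a minimal sketch of such a guard: drop incoming points older than the retention period before they are written. The `Point` type and the 30-day retention used here are made up for the example.)

```go
// backfill_guard.go: filter out points older than the retention period so
// a stray webhook cannot trigger creation of shard groups far in the past.
package main

import (
	"fmt"
	"time"
)

type Point struct {
	Measurement string
	Value       float64
	Time        time.Time
}

// filterRecent keeps only points newer than now-retention.
func filterRecent(points []Point, retention time.Duration) []Point {
	cutoff := time.Now().Add(-retention)
	kept := points[:0]
	for _, p := range points {
		if p.Time.After(cutoff) {
			kept = append(kept, p)
		}
	}
	return kept
}

func main() {
	pts := []Point{
		{"cpu", 0.42, time.Now()},
		{"cpu", 0.13, time.Date(2009, 11, 18, 0, 0, 0, 0, time.UTC)}, // would create an old shard group
	}
	fmt.Printf("writing %d of %d points\n", len(filterRecent(pts, 30*24*time.Hour)), len(pts))
}
```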
@jwilder To conclude: this should be easy to reproduce.
@francisdb I logged #6708, which, once fixed, should hopefully prevent this situation from occurring in the future.
We are having a lot of problems with our influx instance, and while debugging we saw this:
(disk io charts are totals)

Machine:
- AWS m3.xlarge (seen the same on c3.xlarge)

Influx:
- 0.12.2 (on 0.13.0 we have even more problems)
- Seen issues on both Ubuntu 14.04 and 16.04
- Default .deb file install with auth enabled
- /data mounted on local ssd1
- /wal mounted on local ssd2

Workload:
- We have only ~100 measures
- Each with one field and one tag (max 800 values)
- Writes: about 3 per second (1-8 or ~1000 rows)
- Queries: about 3 per minute (small bursts)

Steps to reproduce:
Just keep a system running and wait for the retention policy to kick in.

Expected behavior:
Fast removal of old data, no access to disk outside of /data.

Actual behavior:
Takes a long time to apply the retention policy, with a lot of access outside of /data and /wal.

Additional info:
During this period we see write times go up from ~150 ms to 40 s, and sometimes the server just goes out of memory.