[2.x] InfluxDB bucket stops reading+writing every couple of days #23956
Comments
Hey @nward, I've looked through the profiles attached and haven't seen anything that immediately highlights the issue. If you are able to share more data as mentioned in the issue, I can provide you with a private sftp link. If that works for you, please email me at jsmith @ influxdata.com and I'll get those details to you. Thanks!
I've emailed you now - cheers!
Hi @nward, any update on this problem? I've been experiencing it for months already.
Hi @zekena2 - yes, @jeffreyssmith2nd sent me details to upload my core dumps, which I'll do soon. Sorry, just getting back into work here after the holidays :)
Hi @jeffreyssmith2nd, are there any updates on this? It's still happening in peak hours.
Thanks for the reminder @zekena2. I've uploaded one of our core files now @jeffreyssmith2nd, to the SFTP you provided. Let me know if there's more you need. We have worked around this issue with a script that restarts influxdb if there are no writes for a few minutes, but we can probably update that script to generate new core files if that would be useful.
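(The actual restart script wasn't shared in this thread, so the following is only a minimal sketch of such a watchdog. The metric name comes from the issue body below; the endpoint, paths, and service name are assumptions.)

```bash
#!/usr/bin/env bash
# Sketch of a write-watchdog run from cron every few minutes (not the poster's
# actual script). It restarts influxd if the storage_writer_ok_points counter
# has not increased since the previous run; endpoint/paths/service name assumed.
METRICS_URL="http://localhost:8086/metrics"
STATE_FILE="/var/tmp/influx_write_watchdog.last"

current=$(curl -s "$METRICS_URL" \
  | awk '/^storage_writer_ok_points/ {sum += $2} END {printf "%.0f", sum}')
previous=$(cat "$STATE_FILE" 2>/dev/null || echo 0)
echo "$current" > "$STATE_FILE"

if [ "$current" = "$previous" ]; then
  logger -t influx-watchdog "no successful writes since last check, restarting influxdb"
  systemctl restart influxdb
fi
```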
Hey, thanks for uploading that @nward. Do you know by chance if you created the core dump with GOTRACEBACK=crash set?
I don't believe so - we are running influxdb from the RPMs and aren't modifying any environment variables. Note that the process is not crashing - it is simply stopping working. We create a core file before restarting it. Given that, I don't think GOTRACEBACK=crash applies here.
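(Since the comment above describes creating a core file from the still-running process rather than from a crash, here is one way that can be done with gdb's gcore before the restart; paths and service name are assumptions, not the poster's actual procedure.)

```bash
# Dump a core file from the hung but still-running influxd process, then restart.
# gcore ships with gdb; the output file will be named <prefix>.<pid>.
sudo gcore -o /var/tmp/influxd-$(date +%Y%m%d-%H%M%S) "$(pidof influxd)"
sudo systemctl restart influxdb
```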
Good catch, that would only apply for a crash. The core file appears to be corrupted (I'm getting a lot of errors reading it) - is it possible the disk ran out of space while it was being written? A couple of other thoughts:
- Is your ulimit for open files set appropriately?
- Do you have compactions running on the affected bucket, and are any of them failing?
- Are you able to test on a more recent version?
- Do you still have the trace file you mentioned?
In my case file limits aren't an issue, but when this situation happens the number of open files and unix sockets, as well as memory, keeps growing without stopping until restart (this is actually how I know there's a problem; my script restarts influxdb afterward). I upgraded to the latest version and ironically it happened the same night I did the upgrade. I'll try to monitor the compactions and let you know.
Yes, ulimit is set correctly. It could be that the disk ran out of space though, I recall one of the engineers ran into that problem at some point, maybe it was that file I uploaded. I have a bunch of core files, I'll upload the lot.
Yes, we have compactions running and... well, they're not failing, though I don't see an explicit "success" type message. I see "beginning compaction", "TSM compaction start", "TSM compaction end". We see this ~1375 times per day.
Yes, we can test on a later version. We had not upgraded around the time of this issue as, from memory, there were a bunch of open 2.5 performance issues or something like that.
I'm not sure if I still have the trace file, I'll see what I can dig up.
OK - I have uploaded 4 more core dumps from various times (I am not sure if they are complete, sorry). We have an aggregation query running every 5 mins storing data into another bucket, and that appears to still run, though of course without the data from the main bucket that gets stuck - only one bucket is impacted by this issue, so far as I can tell. The trace mentioned was captured at the same time as the attachment posted earlier on this issue - on Nov 29.
After looking at the trace, there is something hammering a couple of methods in the storage engine. You mentioned earlier that you have flux logging enabled - can you share the queries that are happening within a few minutes of the issue? Also, can you provide any error logs from both influxdb and Telegraf when the lockup happens?
It happened half an hour ago and all the logs are info logs with msg between "Executing query" and "executing new query"; some of the queries are just for alerting and some are actual users watching a dashboard. Telegraf is filled with "context deadline exceeded (Client.Timeout exceeded while awaiting headers)" during the hang. I can send you logs by email if you want; they contain the queries as well.
That could well be the case. I don't know the internals of InfluxDB well enough to know exactly what those methods do, or why they are called - but we have a lot of series and filter them with tags. Note that only the primary data bucket is impacted, and only for writes.
Influxdb doesn't have any error messages from memory... I'll have a look. We have flux logging enabled, so I'll email you the full influx logs leading up to, and just after, the issue time. Not secret, but not public enough to post somewhere that Google will index :-) In telegraf we normally get "did not complete within its flush interval" for the influxdb_v2 output from time to time (every few minutes). At the time of the issue, that picks up to maybe 6-7 times per minute. At the time of the issue we also get an additional telegraf error, maybe once every second or two.
Any updates on this issue?
After reviewing the logs, it looks like both reading and compactions stop on the bucket that is no longer allowing writes. My assumption is something has a lock on that bucket, usually this is caused by deletes. @zekena2 and @nward, are you doing deletes on the buckets that stop receiving writes? Also, @nward do you see the open file/socket issue that @zekena2 mentioned seeing?
We don't have any deletes; it's only writes and reads, except of course the bucket retention, which is 30d.
I use this to get the number of open sockets. More info here.
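(The exact command and link above were lost when this thread was archived; the following are common stand-ins for watching open sockets and file descriptors of the influxd process, assuming standard procfs/ss tooling.)

```bash
# Count open file descriptors held by influxd.
sudo ls /proc/"$(pidof influxd)"/fd | wc -l

# System-wide socket summary (TCP/UDP/UNIX totals).
ss -s

# Count TCP sockets owned specifically by influxd.
sudo ss -tnp | grep -c influxd
```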
We don't run any deletes, though the data retention runs of course. Would that qualify as "deletes"? I don't see a large number of open sockets, no. We restart influxd pretty fast (within a few minutes) after it fails, so perhaps we don't get to the point where it's a problem. I also wonder if @zekena2's problem is a symptom rather than a cause - if they have lots of client requests (writes or reads) which are blocking, you might reasonably expect there to be lots of open sockets waiting for their requests to be processed.
We have the same problem. The influx service freezes after 00:00 UTC - sometimes after 2 days, sometimes after a week. After freezing, influx is not available for writing, and after ~1 hour, when the RAM runs out, the OOM killer comes and restarts the influx service. System info: Linux elka2023-influxdb 5.10.0-18-cloud-amd64 #1 SMP Debian 5.10.140-1 (2022-09-02) x86_64 GNU/Linux
config:
@nward, no, retention wouldn't count for what I was thinking about. Retention is the low-cost, best way to do deletes, so that's the right choice. When doing an explicit delete, a lock is taken that can block the bucket.
@MaxOOOOON, so when the writing freezes, RAM grows unboundedly until it OOMs? And you're not seeing RAM issues otherwise? Same questions for you: are you doing any explicit deletes, and do you see any odd behavior from compactions/retention/etc?
Yes, around the same time the errors appear, memory consumption starts to increase. Initially we had 6GB RAM and influx froze for half an hour; then we added an additional 8GB and now influx freezes for an hour. During the day, 40-50% of the memory is used, and when the nights are quiet (when influx does not freeze and grafana does not fire all its alerts), there are no memory jumps. We do not have special deletions on a timer; we use only the built-in bucket retention and downsampling. I can send you logs from when influx freezes and from when it works normally.
By the way, when we use storage-compact-throughput-burst = 1331648, influx seems to hang less often: I set this value on March 22nd and influx has not hung up yet as of today. The default value was set earlier, and we also tried setting storage-compact-throughput-burst = 8331648; with those values, in our case, influx hung more often. Perhaps this is a coincidence and next week influx will hang again in a couple of days :(
@MaxOOOOON can you try with storage-max-concurrent-compactions = 1? I'm interested to see if limiting compactions resolves the issue, since it sounds like it may be happening around the same time as them. Any logs you have around the freezes could also help investigate the issue.
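(For anyone following along: these settings can go in the influxd config file, or equivalently be set as environment variables. Below is a sketch of the environment-variable form; the option-to-variable mapping is my reading of the 2.x config docs, and the values are only the ones being experimented with in this thread, not a recommendation.)

```bash
# Environment-variable equivalents of the compaction settings discussed above
# (e.g. in a systemd drop-in or /etc/default/influxdb2). Values are just the
# ones under test in this thread, not recommended defaults.
export INFLUXD_STORAGE_MAX_CONCURRENT_COMPACTIONS=1
export INFLUXD_STORAGE_COMPACT_THROUGHPUT_BURST=1331648
```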
My virtual machine has 2 cores. I have now set storage-max-concurrent-compactions = 1 and storage-compact-throughput-burst = 1331648. After specifying storage-compact-throughput-burst = 1331648, influx did not hang for 11 days: I specified the parameter on March 22 and influx hung up on the morning of April 3. I sent you a log by email.
With parameters storage-max-concurrent-compactions = 1 and storage-compact-throughput-burst = 1331648, influx froze after 4 days (it was April 7th). I also sent the log from April 7 to your email.
Hey @MaxOOOOON, thanks for sending those logs over. Your issue has slightly different behavior than the issues above, in that you are getting timeouts on the write rather than a full bucket lockout. Can you generate profiles when you're seeing those writes fail? You can watch the logs for the write timeout errors and capture the profiles while they are occurring.
I will try to execute the command.
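(The exact command referenced above didn't survive in the thread. InfluxDB 2.x exposes the usual Go pprof endpoints, so capturing profiles while writes are failing could look roughly like this; host, port, and output names are assumptions.)

```bash
# Bundle of all runtime profiles, with the CPU profile sampled for 30 seconds.
curl -o profiles.tar.gz "http://localhost:8086/debug/pprof/all?cpu=30s"

# Individual profiles via the standard Go pprof endpoints.
curl -o goroutine.txt "http://localhost:8086/debug/pprof/goroutine?debug=1"
curl -o heap.pb.gz "http://localhost:8086/debug/pprof/heap"
```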
Having the same issue. Did tweaking those parameters (storage-max-concurrent-compactions and storage-compact-throughput-burst) help?
For me tweaking those parameters did reduce the number of times this happens but did not entirely stop the problem.
I set the parameter storage-compact-throughput-burst = 665824 and influx does not freeze, but now the cleanup does not work correctly: if I look through the UI, there is no data in the bucket, but if I look directly at the server, the disk space does not decrease. After restarting the influxdb service, the space is freed up.
Thanks for the info @zekena2 & @MaxOOOOON. Yeah, I'd like to avoid unnecessary (scheduled) restarts, as that is only a temporary workaround and creates gaps in data. Is there any official info from @influxdata on this - possible RC?
@MaxOOOOON - I'm glad I found your post; I have been struggling with the same issue since moving from v1.8 and thought I was alone. I am running on Windows and was contemplating moving to a Linux install to see if it would help. It seems like at 00:00 UTC something triggers and puts Influx into a bad state where it continually eats memory. During this bad state nothing is working, and influx responds to clients with "internal error". I even gave the server gobs and gobs of memory (like >100GB, hot-added since it's a VM) to see if it was just a process it needed to work through, and that didn't seem to help the situation. Every time, I've needed to restart the service (or wait for it to crash on its own due to OOM). I'm going to try tweaking the same parameters too and see if it helps my situation.
I'm having the same problem (InfluxDB 2.7). Thank you.
After endless fine-tuning and testing we still experience this issue on a regular basis - though the automatic restart when it happens at least mitigates the problem.
Same issue here as well. Restarting is the way to go; it temporarily fixes the issue until the next time it appears. Wondering if converting Flux boards to InfluxQL could solve the issue in the longer term.
I don't use Flux and the queries are only InfluxQL, but we still have this problem.
Interestingly, I've noticed PR #24584. It's for 1.x but sounds oddly similar to what is happening with 2.x. Could this be the cause, and is the proposed PR a solution? We also see these issues arise at midnight, when new shards are being created.
I have the same issue with InfluxDB 2.7.5. I have background scripts that delete some of the data from a cron job, and when I display the data in Grafana and query InfluxDB at the same time as it's deleting the data, the bucket is locked and I can't access it anymore until I restart influxd. I try to minimize the restarts by relaunching directly, but it can still happen and show errors on our Grafana dashboards.
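(For context on the kind of scripted delete described above - not this user's actual script - an explicit delete through the v2 CLI looks roughly like the following; bucket, org, time range, and predicate are made-up placeholders. Per the earlier comments, explicit deletes like this are the kind that can take locks on a bucket.)

```bash
# Hypothetical example of a scripted delete; bucket, org, range, and predicate
# are placeholders, not taken from this user's setup.
influx delete \
  --bucket my-bucket \
  --org my-org \
  --start 2024-01-01T00:00:00Z \
  --stop 2024-01-02T00:00:00Z \
  --predicate '_measurement="old_data"'
```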
Steps to reproduce:
Expected behavior:
Things keep working
Actual behavior:
InfluxDB2 main bucket ("telegraf") stops reading/writing data.
Other buckets work fine - including one which is 5m aggregates of the raw telegraf bucket (obviously this does not get any new data while the main one is stuck).
We have had this in the past randomly, but in the last few weeks it has happened every few days.
In the past it seemed to happen at 00:00 UTC when influx did some internal DB maintenance - but now it happens at random times.
Environment info:
Our database is 170GB, mostly metrics inserted every 60s, some every 600s.
storage_writer_ok_points is around 2.5k/s for 7mins, then ~25k/s for 3mins for the every-600s burst.
VM has 32G RAM, 28G of which is in buffers/cache.
4 cores, and typically sits at around 90% idle.
~ 24IOPS, 8MiB/s
Config:
We have enabled flux-log to see if specific queries are causing this - but it doesn't seem to be.
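(The flux-log switch mentioned here is presumably the flux-log-enabled option; in environment-variable form that would be something like the following - double-check the option name against your InfluxDB version.)

```bash
# Assumed environment-variable form of the Flux query logging option
# (flux-log-enabled = true in the config file). Verify against your version.
export INFLUXD_FLUX_LOG_ENABLED=true
```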
Logs:
Performance:
I captured a 10s pprof which I will attach.
I also have a core dump, and a 60s dump of debug/pprof/trace (I'm not sure if the trace has sensitive info, but I can share it privately - the core dump certainly will need to be).
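(For anyone wanting to capture the same artifacts: the 10s pprof and the 60s execution trace mentioned here come from the standard Go pprof endpoints that influxd exposes; host, port, and file names below are assumptions.)

```bash
# 10-second CPU profile and 60-second execution trace from the running influxd,
# matching the artifacts described above (host/port and file names assumed).
curl -o cpu.pprof "http://localhost:8086/debug/pprof/profile?seconds=10"
curl -o trace.out "http://localhost:8086/debug/pprof/trace?seconds=60"
```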