Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[2.x] InfluxDB bucket stops reading+writing every couple of days #23956

Open
nward opened this issue Nov 29, 2022 · 41 comments
Open

[2.x] InfluxDB bucket stops reading+writing every couple of days #23956

nward opened this issue Nov 29, 2022 · 41 comments
Labels
area/2.x OSS 2.0 related issues and PRs

Comments

@nward
Copy link

nward commented Nov 29, 2022

Steps to reproduce:
List the minimal actions needed to reproduce the behavior.

  1. Run influxdb2
  2. Insert metrics (with telegraf)
  3. Wait for some time

Expected behavior:
Things keep working

Actual behavior:
InfluxDB2 main index ("telegraf") stops reading/writing data.
Other indexes work fine - including one which is 5m aggregates of the telegraf raw index. (obviously this does not get any new data)

We have had this in the past randomly, but in the last few weeks has happened every few days.
In the past it seemed to happen at 00:00UTC when influx did some internal DB maintenance - but now happens at random times.

Environment info:

  • System info: Linux 3.10.0-1160.66.1.el7.x86_64 x86_64
  • InfluxDB version: InfluxDB v2.3.0+SNAPSHOT.090f681737 (git: 090f681) build_date: 2022-06-16T19:33:50Z
  • Other relevant environment details: CentOS 7 on vmware - lots of spare IO, CPU, memory.

Our database is 170GB, mostly metrics inserted every 60s, some every 600s.
storage_writer_ok_points is around 2.5k/s for 7mins, then ~25k/s for 3mins for the every-600s burst.

VM has 32G RAM, 28G of which is in buffers/cache.
4 cores, and typically sits at around 90% idle.
~ 24IOPS, 8MiB/s

Config:

bolt-path = "/var/lib/influxdb/influxd.bolt"
engine-path = "/var/lib/influxdb/engine"
flux-log-enabled = "true"

We have enabled flux-log to see if specific queries are causing this - but it doesn't seem to be.

Logs:
Include snippet of errors in log.

Performance:

I captured a 10s pprof which I will attach.

I also have a core dump, and a 60s dump of debug/pprof/trace (though not sure if this has sensitive info but can share privately - the core dump certainly will)

@nward
Copy link
Author

nward commented Nov 29, 2022

all.tar.gz

@jeffreyssmith2nd
Copy link
Contributor

Hey @nward, I've looked through the profiles attached and haven't seen anything that immediately highlights the issue. If you are able to share more data as mentioned in the issue, I can provide you with a private sftp link. If that works for you, please email me at jsmith @ influxdata.com and I'll get those details to you. Thanks!

@nward
Copy link
Author

nward commented Jan 1, 2023

Hey @nward, I've looked through the profiles attached and haven't seen anything that immediately highlights the issue. If you are able to share more data as mentioned in the issue, I can provide you with a private sftp link. If that works for you, please email me at jsmith @ influxdata.com and I'll get those details to you. Thanks!

I've emailed you now - cheers!

@zekena2
Copy link

zekena2 commented Jan 10, 2023

Hi @nward, Any update on this problem? I'm experiencing it for months already.

@nward
Copy link
Author

nward commented Jan 10, 2023

Hi @nward, Any update on this problem? I'm experiencing it for months already.

Hi @zekena2 - yes @jeffreyssmith2nd sent me details to upload my core dumps which I'll do soon. Sorry, just getting back into work here after the holidays :)

@zekena2
Copy link

zekena2 commented Feb 11, 2023

hi @jeffreyssmith2nd Is there any updates on this? it's still happening in peak hours.

@nward
Copy link
Author

nward commented Feb 12, 2023

Thanks for the reminder @zekena2. I've uploaded one of our core files now @jeffreyssmith2nd, to the SFTP you provided. Let me know if there's more you need.

We have worked around this issue with a script that restarts influxdb if there are not writes for a few minutes, but we can probably update that script to generate new core files if that would be useful.

@jeffreyssmith2nd
Copy link
Contributor

Hey thanks for uploading that @nward, do you know by chance if you created the core dump with GOTRACEBACK=crash set?

@nward
Copy link
Author

nward commented Feb 14, 2023

I don't believe so - we are running influxdb from the RPMs and aren't modifying any environment variables.

Note that the process is not crashing - it is simply stoping working. We create a core file before restarting it. Given that, I don't think GOTRACEBACK=crash would have any effect.

@jeffreyssmith2nd
Copy link
Contributor

Good catch, that would only apply for a crash. The core file appears to be corrupted (getting a lot of unreadable error derefing *G error while reading spliced memory at ...), can you check that the ulimit is set properly before you generate it?

A couple of other thoughts:

  • Do you have compactions running regularly and succeeding?
  • Do you have the ability to test on a later version and see if the issue is still present?
  • You mentioned a 60s trace file, if you can throw that on the ftp server too that may provide some insight.

@zekena2
Copy link

zekena2 commented Feb 16, 2023

In my case file limits aren't an issue but when this situation happen the number of open files and unix sockets as well as memeroy keeps growing without stopping until restarting (This is actually how I know there's a problem and my script restart influxdb afterward). I upgraded to the latest version and ironically it happened the same night I did the upgrade. I'll try to monitor the compactions and let you know.

@nward
Copy link
Author

nward commented Feb 16, 2023

Hi @jeffreyssmith2nd

Good catch, that would only apply for a crash. The core file appears to be corrupted (getting a lot of unreadable error derefing *G error while reading spliced memory at ...), can you check that the ulimit is set properly before you generate it?

Yes, ulimit is set correctly. It could be that the disk ran out of space though, I recall one of the engineers ran in to that problem at some point, maybe it was that file I uploaded. I have a bunch of core files, I'll upload the lot.

  • Do you have compactions running regularly and succeeding?

Yes we have compactions running and.. well they're not failing, though I don't see an explicit "success" type message. I see beginning compaction, TSM compaction start, TSM compaction end. We see this.. ~1375 times per day.
We have not configured anything special here - it is the default config from the RPM.

  • Do you have the ability to test on a later version and see if the issue is still present?

Yes we can test on a later version. We had not upgraded around the time of this issue as from memory there was a bunch of open 2.5 performance issues or something like that.

  • You mentioned a 60s trace file, if you can throw that on the ftp server too that may provide some insight.

I'm not sure if I still have the trace file, I'll see what I can dig up.

@nward
Copy link
Author

nward commented Feb 16, 2023

OK - I have uploaded 4 more core dumps from various times (I am not sure if they are complete I am sorry).
I have also uploaded the trace file that I captured when this issue was happening.

We have an aggregation query running every 5 mins storing data in to another bucket, and that appears to still run, though of course without the data from the main bucket that gets stuck - only one bucket is impacted by this issue, so far as I can tell.

The trace mentioned was captured at the same time as the attachment posted earlier on this issue - on Nov 29.

@jeffreyssmith2nd
Copy link
Contributor

After looking at the trace, there is something hammering the HasTagValue call which ultimately stems from a MeasurementNamesByPredicate call. This implies to me there's some a query running that is filtering by measurement/tag value and apparently quite frequently (300,000 calls to HasTagValue in that 1min trace).

You mentioned earlier that you have flux logging enabled, can you share the queries that are happening within a few minutes of the issue?

Also, can you provide any error logs from both influxdb and Telegraf when the lockup happens?

@zekena2
Copy link

zekena2 commented Feb 16, 2023

It happened half an hour ago and all the logs are info logs with msg between "Executing query" and " executing new query" which some of the quries are just for alerting and some are actual users watching a dashboard.

Telegraf is filled with context deadline exceeded (Client.Timeout exceeded while awaiting headers) during the hanging
Influxdb didn't have any error logs.

I can send you logs by email if you want. They contain the queries as well.

@nward
Copy link
Author

nward commented Feb 17, 2023

After looking at the trace, there is something hammering the HasTagValue call which ultimately stems from a MeasurementNamesByPredicate call. This implies to me there's some a query running that is filtering by measurement/tag value and apparently quite frequently (300,000 calls to HasTagValue in that 1min trace).

That could well be the case. I don't know the internals of InfluxDB to know exactly what those methods do, or why they are called - but we have a lot of series and filter them with tags.

Note that only the primary data bucket is impacted, and only for writes.
Reads are OK, and a scheduled task which runs every 10 mins reading from the primary bucket and writing to an aggregated data bucket is fine - it just doesn't write much data after the primary bucket stops writes.

You mentioned earlier that you have flux logging enabled, can you share the queries that are happening within a few minutes of the issue?

Also, can you provide any error logs from both influxdb and Telegraf when the lockup happens?

Influxdb doesn't have any error messages from memory.. I'll have a look We have flux logging enabled, so I'll email you the full influx logs leading up to, and just after the issue time. Not secret, but, not public enough to post somewhere that Google will index :-)

In telegraf we get "did not complete within its flush interval" for influxdb_v2 output from time to time normally (every few minutes). At the time of the issue, that picks up to maybe 6-7 times per minute.

At the time of the issue, in telegraf we also get:

Error writing to outputs.influxdb_v2: Post "http://xxxx:8086/api/v2/write?bucket=telegraf&org=xxxx": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

This happens maybe once every second or two.

@zekena2
Copy link

zekena2 commented Mar 4, 2023

Any updates on this issue?

@jeffreyssmith2nd
Copy link
Contributor

After reviewing the logs, it looks like both reading and compactions stop on the bucket that is no longer allowing writes. My assumption is something has a lock on that bucket, usually this is caused by deletes. @zekena2 and @nward, are you doing deletes on the buckets that stop receiving writes? Also, @nward do you see the open file/socket issue that @zekena2 mentioned seeing?

@zekena2
Copy link

zekena2 commented Mar 6, 2023

We don't have any deletes it's only writes and reads except of course the bucket retention which is 30d.

@zekena2
Copy link

zekena2 commented Mar 7, 2023

I use this to get the sockets number sar -n SOCK 1 1 | tail -n 1 | awk '{ print $2 }' Which probably exists under /proc/net/sockstat in the first line
sockets: used 189

More info here

@nward
Copy link
Author

nward commented Mar 21, 2023

After reviewing the logs, it looks like both reading and compactions stop on the bucket that is no longer allowing writes. My assumption is something has a lock on that bucket, usually this is caused by deletes. @zekena2 and @nward, are you doing deletes on the buckets that stop receiving writes? Also, @nward do you see the open file/socket issue that @zekena2 mentioned seeing?

We don't run any deletes, though the data retention runs of course. Would that qualify as "deletes"?
We see failures at all times of the day, do data retention deletes run constantly, or, only on set intervals (i.e. daily, or similar)?

I don't see a large number of open sockets, no. We restart influxd pretty fast (within a few minutes) after it fails, so perhaps we don't get to a point where it's a problem.

I also wonder if @zekena2's problem is a symptom rather than a cause - if they have lots of client requests (writes or reads) which are blocking, you might reasonably expect there to be lots of open sockets waiting for their requests to be processed.

@MaxOOOOON
Copy link

MaxOOOOON commented Mar 24, 2023

We have the same problem. The influx service freezes after 00:00UTC. Sometimes after 2 days, sometimes after a week.
In logs only:
lvl=debug msg=Request log_id=0giXx1PW000 service=http method=POST host=xxxxx path=/api/v2/write query="bucket=xxxx" proto=HTTP/1.1 status_code=499 response_size=90 content_length=-1 referrer= remote=ip:port user_agent=Telegraf took=10019.286ms error="internal error" error_code="internal error"
and 400 errors on flux requests
lvl=debug msg=Request log_id=0giXx1PW000 service=http method=POST host=xxxx:xxxxx path=/api/v2/query query="org=ORG" proto=HTTP/1.1 status_code=400 response_size=52 content_length=576 referrer= remote=xxxxxx user_agent=influxdb-client-go took=11.008ms error=invalid error_code=invalid body="{\"dialect\":{\"annotations\":[\"datatype\",\"group\",\"default\"],\"delimiter\":\",\",\"header\":true},\"query\":\"xxxxxxxxx\",\"type\":\"flux\"}"

After freezing, the influx is not available for writing and after ~ 1 hour, when the RAM runs out, OOM killer comes and restarting influx service.

System info: Linux elka2023-influxdb 5.10.0-18-cloud-amd64 #1 SMP Debian 5.10.140-1 (2022-09-02) x86_64 GNU/Linux
Influx version: InfluxDB v2.4.0 (git: de247ba) build_date: 2022-08-18T19:41:15Z
OS: Debian 11

config:

bolt-path = "/opt/influxdb-data/influxdb/influxd.bolt"
engine-path = "/opt/influxdb-data/influxdb/engine"

flux-log-enabled = "true"
http-read-timeout = "15s"
http-write-timeout = "15s"
reporting-disabled = "true"
query-queue-size = 10
query-memory-bytes = 20485760
storage-compact-throughput-burst = 1331648
query-concurrency = 10
log-level = "debug"

@jeffreyssmith2nd
Copy link
Contributor

jeffreyssmith2nd commented Mar 30, 2023

We don't run any deletes, though the data retention runs of course. Would that qualify as "deletes"?

@nward, no retention wouldn't count for what I was thinking about. Retention is the low cost, best way to deletes so that's the right choice. When doing a influx delete, we have to walk every shard and grab a lock. If that is getting stuck, that would explain the writing issues but unfortunately sounds like that is not the case here.

@MaxOOOOON So when the writing freezes, RAM grows unboundedly until it OOMS? And you're not seeing RAM issues otherwise?

Same questions for you, are you doing an explicit deletes and do you see any odd behavior from compactions/retention/etc?

@MaxOOOOON
Copy link

MaxOOOOON commented Mar 30, 2023

Yes, around the same time that errors appear, memory consumption starts to increase. Initially, we had 6GB RAM and influx freezes for half an hour, then we added an additional 8GB and now influx frezzes for an hour. during the day, 40-50% of the memory is used and when the nights are quiet(when the influx does not freeze and the grafana does not alert with all alerts), there are no memory jumps.

We do not have special deletions by timer. use only built-in bucket retention and downsampling.

I can send you logs when influx freezes and when it works normally

@MaxOOOOON
Copy link

MaxOOOOON commented Mar 30, 2023

By the way, when we use storage-compact-throughput-burst = 1331648, the influx seems to hang less often, on the march 22nd I set this value and the influx has not yet hung up until today. the default value was set earlier and we also tried to set storage-compact-throughput-burst = 8331648. with this values, in our case, influx hung more often. perhaps this is a coincidence and next week the influx will hang again in a couple of days :(

@jeffreyssmith2nd
Copy link
Contributor

@MaxOOOOON can you try with storage-compact-throughput-burst set to something low (1331648 is fine) and also set storage-max-concurrent-compactions to 1/4 your available threads?

I'm interested to see if limiting compactions resolves the issue since it sounds like it may be happening around the same time as them.

Any logs you have around the freezes could also help investigate the issue.

@MaxOOOOON
Copy link

MaxOOOOON commented Apr 4, 2023

my virtual machine has 2 cores. I set now storage-max-concurrent-compactions = 1 and storage-compact-throughput-burst = 1331648.

after specifying storage-compact-throughput-burst = 1331648, the influx did not hang for 11 days. I specified the parameter on March 22, the influx hung up on the morning of April 3.

I sent you a log by email.

@MaxOOOOON
Copy link

with parameters storage-max-concurrent-compactions = 1 and storage-compact-throughput-burst = 1331648 influx freeze after 4 days(It was April 7th). I also sent the log from April 7 to your email.

@jeffreyssmith2nd
Copy link
Contributor

Hey @MaxOOOOON thanks for sending those logs over. Your issue has slightly different behavior than the issues above in that you are getting timeouts on the write rather than a full bucket lockout. Can you generate profiles when you're seeing those writes fail? You can watch the logs for status_code=499 to know when the writes are timing out and profiles can be generated with

curl -o profiles.tar.gz "http://localhost:8086/debug/pprof/all?cpu=30s"
iostat -xd 1 30 > iostat.txt

@MaxOOOOON
Copy link

MaxOOOOON commented May 17, 2023

curl -o profiles.tar.gz "http://localhost:8086/debug/pprof/all?cpu=30s"
response
curl: (52) Empty reply from server

I will try to execute the command iostat -xd 1 30 > iostat.txt during the problem, but I can send the logs of the atop command with an interval of 1 minute.

@mmarkus13
Copy link

Having the same issue.

Did tweaking those parameters (storage-compact-throughput-burst storage-max-concurrent-compactions) solve it for you, or have you done anything else since to fix it?

@zekena2
Copy link

zekena2 commented Jun 19, 2023

For me tweaking those parameters did reduce the number of times this happens but did not entirely stop the problem.

@MaxOOOOON
Copy link

I set the parameter storage-compact-throughput-burst = 665824 and the influx does not freeze, but now the cleaning does not work correctly. if look through the UI, there is no data in the bucket, but if you look directly at the server, the space does not decrease. After restarting the influxdb service, the space is freed up.

@mmarkus13
Copy link

Thanks for the info @zekena2 & @MaxOOOOON. Yeah I'd like to avoid unnecessary (scheduled) restarts, as it is only a temporary workaround, and creates gaps in data.

Is there any official info from @influxdata on this - possible RC?
Are they aware of the issue?

@john159753
Copy link

@MaxOOOOON - I'm glad I found your post, I have been struggling with the same issue since moving from v1.8, I thought I was alone. I am running on windows and was contemplating moving to a Linux install to see if it would help. It seems like at 00:00UTC something triggers, causes Influx to go into a bad state where it continually eats memory. During this bad state nothing is working, and influx responds to clients with "internal error". I even gave the server gobs and gobs of memory (like >100GB; hot-add because its a VM) to see if it was just a process that it needed to work through and it didn't seem like it was helping the situation. Every time I've needed to restart the service (or wait for it to crash on its own due to OOM).

I'm going to try tweaking the same parameters too and see if it helps my situation.

@ddakker
Copy link

ddakker commented Jan 29, 2024

I'm having the same problem.(Influxdb 2.7)
Did you solve it?

Thank you.

@mmarkus13
Copy link

After endless of fine-tuning and testing we still experience this issue on regular basis - tho automatic restart when the issue happens at least mitigates the problem.

@orjanj
Copy link

orjanj commented Jan 29, 2024

Same issue here as well. Restarting is the way to go, temporarily fixes the issue until next time it appears.
Qualified guessing points to using the Flux language with large amounts of data, from my point of view (might also point to some compaction bugs?).
Might this fit with the same symptoms as others here?

Wondering if converting Flux-boards to InfluxQL can solve the issue on a longer term.

@zekena2
Copy link

zekena2 commented Jan 29, 2024

I don't use flux and the queries are only in influxql but we still have this problem.

@philjb philjb changed the title InfluxDB bucket stops reading+writing every couple of days [2.x] InfluxDB bucket stops reading+writing every couple of days Feb 6, 2024
@philjb philjb added the area/2.x OSS 2.0 related issues and PRs label Feb 6, 2024
@AlbertasB
Copy link

Interestingly I've noticed this PR, #24584 its for 1.x but sounds oddly similar to what is happening with 2.x. Could this be the cause and proposed PR a solution? We also see these issues arrise at midnight, when new shards are being created.

@jeremow
Copy link

jeremow commented Mar 15, 2024

I have the same issue with InfluxDB 2.7.5. I have background scripts who delete some of the data in a cron and when I display in Grafana the data and I query the InfluxDB at the same time as it's deleting the data, the bucket is locked and I couldn't access it anymore until I restart influxd. I try to minimize the restarting by relaunching directly but still it can happen and display errors on our Grafana dashboards

@jeffreyssmith2nd jeffreyssmith2nd removed their assignment Jul 2, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/2.x OSS 2.0 related issues and PRs
Projects
None yet
Development

No branches or pull requests