startup script influxd-systemd-start.sh stuck in while loop if http auth set #21967
Comments
After upgrading to Debian package 1.8.7-1 we were hit not only by #21933 (missing executable permissions) but also by this issue. Unfortunately, only 1.8.7-1 is still available in the Ubuntu repo, with no previous versions, which makes a downgrade more complicated than necessary in an automated environment... |
thanks @IgorSimic .
So I dug into what the issue is in the new file, and by removing:
in NOTE: before set:
|
That seems to fix it, however the startup script /usr/lib/influxdb/scripts/influxd-systemd-start.sh keeps polling and waiting for 200 which isn't returned when
/usr/lib/influxdb/scripts/influxd-systemd-start.sh needs to be rewritten completely. |
Updating automated deployments to account for this broken release is a major pain. |
Another issue is that if the My workaround is this customization to the influxdb.service (going back to the old startup behavior):
|
Hi there. v1.8.9 was released today with a fix for this issue. |
Sorry @codyshepherd, |
I can confirm that this issue is not fixed in the v1.8.9 release |
Issue still persists. Changing the systemd service file for influx from Type=forking to Type=simple allows the service to start. The issue also affects queries sent over https using a self-signed certificate. It's as if the option to skip TLS verification is totally ignored. |
I'm just curious, why isn't In fact, Influxdb should probably implement proper Liveness and Readiness (in Kubernetes syntax) tests and these could be used instead of this workaround. Since Influxdb 1.x has been superseded by 2.x I'm going to assume nobody cares about this. I'm also curious why a startup behavior change (like synchronous waiting for the full load of the data files) was introduced in a patch release. It might not be a breaking change, per se, but not far from it. Starting up Influxdb usually takes seconds, but loading all the data files can easily take minutes on a large installation. |
For us, adding a counter makes it even worse. The resulting behavior is a timeout of 10 seconds (10 attempts, sleep 1 second) to start the influxdb daemon. For large databases, this timeout is far too short to read all data from disk. Now we see a systemd loop:
|
We have the exact same issue, and have had to bump the |
With the current script, a large database doesn't have time to finish starting up in the 10 seconds allotted by this script. This causes the service to fail to start up, and systemd to sit looping forever. This PR increases the sleep value to 5 seconds, and the max_attempts to 12, meaning the script will attempt to connect for up to 1 minute before giving up. Part of influxdata#21967
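The arithmetic in the PR description (12 attempts, 5 seconds apart, for up to 1 minute) can be illustrated with a bounded polling loop. This is only a sketch, not the packaged influxd-systemd-start.sh: the function name `wait_until_ready` and its arguments are hypothetical, and the URL in the usage comment assumes the default HTTP bind address.

```shell
#!/bin/sh
# Hypothetical helper, not part of the InfluxDB package.
# Usage: wait_until_ready MAX_ATTEMPTS SLEEP_SECONDS PROBE [ARGS...]
# Runs PROBE until it succeeds or MAX_ATTEMPTS probes have failed.
wait_until_ready() {
  max_attempts=$1
  sleep_seconds=$2
  shift 2
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    # "$@" is the probe command; exit status 0 means "ready".
    if "$@"; then
      return 0
    fi
    echo "not ready after $attempt attempt(s)" >&2
    attempt=$((attempt + 1))
    sleep "$sleep_seconds"
  done
  # Every probe failed: report failure so the caller can give up.
  return 1
}

# 12 attempts x 5 seconds = up to 60 seconds of waiting.
# The URL below is an assumption (default port, no TLS):
# wait_until_ready 12 5 curl -sf -o /dev/null http://localhost:8086/health
```

Whatever values are chosen, the total wait is simply attempts times sleep, which is why no single pair of numbers suits every database size.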
I don't think there is a "correct" value for max_attempts and sleep time, since the real timeout highly depends on a given setup, data size and workload. The whole approach of guessing the startup time from a third-party shell script is more than fishy. The only instance that knows when the daemon is ready to handle client connections is the daemon itself. |
Indeed, that's where my issues stem from. While I'd prefer this was fixed properly (that script has a lot of code issues!), the PR raised above is designed to get influxdb working again in the short term for those of us with large databases, as right now it's broken out of the box. Let's get it working in the short term, then come up with a perfectly designed fix :) |
Hi all. Thanks for the continued feedback. We will be pushing an rc (release candidate) release with some additional fixes to the systemd wrapper. I'll post a link to that rc when it is out. |
My instance takes ~50s from process start to http socket bind, so the |
My workaround on RaspberryPi running influx 1.8.9 with 18 month old database is to bump the |
If your database gets that memory-demanding, you might want to move to stronger hardware than a Raspi. Eventually you will get more and more issues. |
@thorian93 yeah, thanks for the advice. I haven't had problems so far, but I guess it is getting big now. I have created a new InfluxDB (2.0) installation in a docker container on my Synology NAS, which is successfully storing my sensor data. I am however still searching for a way to save my old data and transfer it to the Influx installation on my NAS. I know this is a bit off topic, but any help would be appreciated. I currently can't make a backup of my database, because I can't get InfluxDB to run. I have tried having it read files directly from disk, instead of loading them into memory, but that has led to errors of 'unable to open shard'. I suspect this has to do with a limit on the allowed number of open files. I'll continue investigating. |
Maybe open a dedicated issue detailing your problem @michaelbeljaars, so we do not get off topic here. 🙂 |
For me too
|
Is there a timeline for a final release of rc10? |
I took the rc10 but was not able to start InfluxDB. I am using an Rpi 0 with buster installed.
pi@raspberrypi:~ $ influx
pi@raspberrypi:~ $ influxd
2021-09-21T07:40:50.241434Z info InfluxDB starting {"log_id": "0Wipe0jG000", "version": "1.8.10rc0", "branch": "1.8", "commit": "115e24083d89"}
But when I run "sudo /usr/lib/influxdb/scripts/influxd-systemd-start.sh" I do see InfluxDB started, yet influx still returns "failed to connect":
pi@raspberrypi:~ $ sudo /usr/lib/influxdb/scripts/influxd-systemd-start.sh
2021-09-21T07:38:45.096953Z info InfluxDB starting {"log_id": "0WipXNpW000", "version": "1.8.10rc0", "branch": "1.8", "commit": "115e24083d89"}
pi@raspberrypi:~ $ sudo systemctl status influxdb.service
Sep 21 00:38:11 raspberrypi influxd-systemd-start.sh[302]: ts=2021-09-21T07:38:11.407127Z lvl=info msg="opened HTTP access log" log_id=0WipSoUl000 service=httpd path=stderr |
We will release another rc (release candidate) within the next week or two, with the official 1.8.10 release coming sometime soon after that. |
@codyshepherd I tried the 1.8.20 rc but still was not able to start InfluxDB on the Raspberry Pi (buster). I had to downgrade to 1.8.5, which worked like a charm. I was facing the same challenge when I upgraded InfluxDB to 1.8.9, so I had to downgrade back to 1.8.5: sudo apt update |
@vibhubithar which version are you using (neither rc10 nor 1.8.20 are valid versions) and what package/installation method are you using? What is the architecture of the system you're using? Have you tried any of the workarounds described in this issue's conversation, and if so, what were the results (this will help narrow down the cause of the error you're seeing)? From the output at the end of your post above, it does seem like the systemd daemon is in fact starting correctly. You should not need to manually run the systemd start script. 1.8.9 is known to have a startup problem when started via systemd. Upgrading from an old install will not fix this yet, as it will fail to upgrade to 1.8.10rc0 (the current working version), as this is a release candidate only, and has not been published to our package repository. You will need to download and install the rc release manually using apt or dpkg. Please do the following to reproduce your error:
|
The 1.8.10rc1 build is now available. Find the deb for your architecture at https://dl.influxdata.com/influxdb/releases/influxdb_1.8.10rc1_< arch >.deb The deb is available for And the rpm at https://dl.influxdata.com/influxdb/releases/influxdb-1.8.10rc1.< arch >.rpm The rpm is available for |
I was debugging a setup the last hours and now see this bug report. For that i put in a My learnings ( influxdb 1.8.9-1 on raspberry zero ): influxdb takes ~2 minutes to start. I wonder if it wouldn't make sense to have influxd send systemd notifications itself once ready and have systemd handle starting failures? |
Work-around via Ansible:
Template:
|
The startup script in 1.8.10 no longer has a max number of attempts waiting for the health endpoint to respond. But since the script doesn't exit until the health endpoint responds, this is exceeding systemd's |
On my RPi Zero W, I changed the systemd timeout (with the default of 60, the process is restarted forever...) |
Landed here via a search for I can't tell from the service status what's going on, but I didn't restart the service because it's only configured to run at startup, and my host's uptime is 28 days, since I upgraded to 1.8.10rc0.
|
I'm closing this bug as fixed in 1.8.10, with the recommendation that those affected should upgrade to that release. Please open a new bug for any issues encountered in 1.8.10. |
As I mentioned in my comment above, I was still running into this issue after upgrading to 1.8.10. Downgrading to 1.8.5 fixed it, without any configuration changes. |
Apologies @tredondo I must have skimmed right over your response. I'm a bit unclear on the current status of your influxdb installation. You upgraded, correct? To which version? And what series of actions led to the log you have posted above? i.e. can you describe a reproducer? |
Np @codyshepherd. Unfortunately I can't provide repro steps. I have a set-it-and-forget-it InfluxDB setup that I don't touch. I restarted the host after upgrading Influx to 1.8.10rc0 per this comment. Influx started successfully, but at some point last week, clients could no longer connect to it. I didn't initiate an I suspect the slow startup might be caused by the size of the database - ~88GB on an XFS volume. |
@tredondo If the influxdb installation did initially come up after you restarted your host, it seems unlikely that you are experiencing this particular issue. I would recommend submitting a new issue describing the sequence of events leading to the unresponsive database, as well as your system configuration in as much detail as you can manage. It could be, for example, that you are experiencing out-of-memory symptoms or disk-related problems. Dialing in to the possibilities would be best done in a fresh issue. |
Thank you @jriobello , It worked on a Rasp 1. |
@codyshepherd: filed a separate issue, #22824. I'm unable to start influxd at all now. Neither 1.8.5 nor 1.8.10 manages to start. |
Upgraded a (large) server from 1.8.4 to 1.8.10, got the broken systemd wrapper as part of the upgrade. Syslog just fills with lines like Restoring the systemd service file from 1.8.4 lets it start just fine. If you're going to do this, remove the superfluous symlink of |
I just wanted to add that I came across this page after a search for "InfluxDB API unavailable after". My setup is also largely untouched and InfluxDB was working prior to a reboot, but after the reboot it was stuck in the activating state. I am pretty good about keeping it up to date and was already running version 1.8.10, so although I do recall seeing InfluxDB restarting in previous versions, for my setup it seems that InfluxDB has just started taking longer to start. At this point it could be due to the size of the DB (mine is ~200 GB), but I'm not sure. Anyhow, I was seeing the influxd-systemd-start.sh script continuously looping, and I noticed it would loop at the 86th attempt. I was able to resolve it by adding a longer startup timeout (I set mine to 120 seconds) as described in issue 22803 here: #22803
Before:
Feb 7 10:46:17 Cent7Test influxd-systemd-start.sh[20506]: InfluxDB API unavailable after 85 attempts...
I edited influxd.service to add TimeoutStartSec=120:
[judsonm@Cent7Test]:/home/judsonm>cat /etc/systemd/system/influxd.service
[Unit]
[Service]
[Install]
Finally able to start after the 110th attempt:
Feb 7 11:24:04 Cent7Test influxd-systemd-start.sh[5254]: InfluxDB API unavailable after 109 attempts... |
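For anyone wanting to apply the TimeoutStartSec workaround without editing the packaged unit file, a systemd drop-in is one option. The sketch below is an assumption-laden illustration: the helper name `write_timeout_dropin` is hypothetical, and the unit name in the usage comment must be adjusted, since this thread mentions both influxdb.service and influxd.service.

```shell
#!/bin/sh
# Hypothetical helper, not shipped with InfluxDB.
# Usage: write_timeout_dropin DROPIN_DIR SECONDS
# Writes DROPIN_DIR/override.conf containing only a TimeoutStartSec override.
write_timeout_dropin() {
  dropin_dir=$1
  timeout_seconds=$2
  mkdir -p "$dropin_dir"
  # A drop-in only needs the section and the directive being overridden.
  printf '[Service]\nTimeoutStartSec=%s\n' "$timeout_seconds" \
    > "$dropin_dir/override.conf"
}

# Example (as root; the unit name here is an assumption):
# write_timeout_dropin /etc/systemd/system/influxdb.service.d 120
# systemctl daemon-reload
# systemctl restart influxdb
```

The value still has to exceed the real startup time of the database, so the 120 seconds used above is a per-installation guess rather than a general recommendation.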
Actual behavior:
systemd service stuck in /usr/lib/influxdb/scripts/influxd-systemd-start.sh in while loop if http authentication is set
I believe that the problem was introduced in commit c8de72d
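To illustrate why a loop that waits strictly for HTTP 200 can spin forever, here is a sketch that separates "no response at all" from "the server answered with some status". It is not the project's fix. It assumes that with HTTP authentication enabled the server replies with a non-200 status such as 401 (the thread does not spell out the exact code), and the helper names are hypothetical. curl's `%{http_code}` write-out variable prints 000 when no HTTP response was received.

```shell
#!/bin/sh
# Hypothetical helpers for illustration only.

# http_status URL: print the HTTP status code, or 000 if nothing answered.
http_status() {
  curl -s -o /dev/null -w '%{http_code}' "$1"
}

# server_is_answering CODE: succeed for any real HTTP status (200, 401, ...),
# fail only when curl reported that no response was received.
server_is_answering() {
  case "$1" in
    000|"") return 1 ;;
    *)      return 0 ;;
  esac
}

# Example (URL is an assumption: default port, no TLS):
# code=$(http_status http://localhost:8086/health)
# if server_is_answering "$code"; then echo "listening (HTTP $code)"; fi
```

Note that treating any answer as "up" only shows that the HTTP listener is bound; it does not prove the data files have finished loading, which is the other concern raised in this thread.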
Environment info:
dpkg -l | grep influx
ii influxdb 1.8.7-1 arm64 Distributed time-series database.
cat /etc/issue
Ubuntu 18.04.5 LTS