Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

startup script influxd-systemd-start.sh stuck in while loop if http auth set #21967

Closed
IgorSimic opened this issue Jul 29, 2021 · 62 comments
Closed
Assignees

Comments

@IgorSimic
Copy link

IgorSimic commented Jul 29, 2021

Actual behavior:

systemd service stuck in /usr/lib/influxdb/scripts/influxd-systemd-start.sh in while loop if http authentification is set
I belive that problem is introduced in commit c8de72d

Environment info:

  • System info: Linux 5.10.53-sunxi64 aarch64
  • InfluxDB version: InfluxDB v1.8.7 (git: 1.8 v1.8.7)
  • Other relevant environment details:
    dpkg -l | grep influx
    ii influxdb 1.8.7-1 arm64 Distributed time-series database.
    cat /etc/issue
    Ubuntu 18.04.5 LTS
@IgorSimic IgorSimic changed the title startup script influxd-systemd-start.sh startup script influxd-systemd-start.sh stuck in while loop if http auth set Jul 29, 2021
@AndreasUfert
Copy link

AndreasUfert commented Jul 30, 2021

After Upgrade to Debian Package 1.8.7-1 we were hit not only by #21933 (missing executable permissions) but also by this issue.

Unfortunately, in the ubuntu repo, only 1.8.7-1 is available anymore, no previous versions, which makes a downgrade more complicated than necessary in an automated environment...

@danpawlik
Copy link

danpawlik commented Jul 30, 2021

thanks @IgorSimic .
I replace the systemd service file to old one before the change c8de72d and it works (on Centos 7).

## If you modify this, please also make sure to edit init.sh
#
[Unit]
Description=InfluxDB is an open-source, distributed, time series database
Documentation=https://docs.influxdata.com/influxdb/
After=network-online.target
[Service]
User=influxdb
Group=influxdb
LimitNOFILE=65536
EnvironmentFile=-/etc/default/influxdb
ExecStart=/usr/bin/influxd
KillMode=control-group
Restart=on-failure

[Install]
WantedBy=multi-user.target
Alias=influxd.service

So I dig what is an issue in the new file and by removing:

"Type=forking" 

in
/etc/systemd/system/multi-user.target.wants/influxdb.service it seems to work.

NOTE: before set:

chmod +x /usr/lib/influxdb/scripts/influxd-systemd-start.sh

@AndreasUfert
Copy link

AndreasUfert commented Jul 30, 2021

So I dig what is an issue in the new file and by removing:

"Type=forking" 

in
/etc/systemd/system/multi-user.target.wants/influxdb.service it seems to work.

That seems to fix it, however the startup script /usr/lib/influxdb/scripts/influxd-systemd-start.sh keeps polling and waiting for 200 which isn't returned when

  • https is active (400) or
  • auth is enabled (401)

/usr/lib/influxdb/scripts/influxd-systemd-start.sh needs to be rewritten completely.

@hughes
Copy link

hughes commented Jul 30, 2021

Updating automated deployments to account for this broken release is a major pain.

@robinsmidsrod
Copy link

robinsmidsrod commented Jul 31, 2021

Another issue is that if the http section setting https-enabled = true is used, the script also loops forever, because it assumes that the HTTP endpoint is always listening on http, without checking if it is using https. This was also introduced by the change in 1.8.7.

My workaround is this customization to the influxdb.service (going back to the old startup behavior):

# /etc/systemd/system/influxdb.service.d/override.conf
[Service]
ExecStart=
ExecStart=/usr/bin/influxd -config /etc/influxdb/influxdb.conf $INFLUXD_OPTS
PIDFile=
Type=simple

@codyshepherd
Copy link
Contributor

Hi there. v1.8.9 was released today with a fix for this issue.

@IgorSimic
Copy link
Author

Sorry @codyshepherd,
but it is not fixed yet. What is done is added counter which exits loop with error if maximun tryes reached.
if [ $max_attempts -le 0 ]; then
echo "Failed to reach influxdb $PROTOCOL endpoint at $url"
exit 1
fi
Problem is if http authentication is set, there is no way to get return code 200 from line:
result=$(curl -k -s -o /dev/null $url -w %{http_code})
Return code is 401 and script fail, and in previous version it stuck in loop, but in either version it don't work.
My only chance is to put username and password in that lines like this:
result=$(curl -s -o /dev/null "http://$HOST:$PORT/health?u=username&p=password" -w %{http_code})

@mayasd
Copy link

mayasd commented Aug 6, 2021

I can confirm than this issue is not fixed in v1.8.9 release

@samuraii78
Copy link

Issue still persists. Changing the systemd service file for influx from type=forking to simple allows the service to start.

The issue also affects queries sent over https using a self signed certificate. Its as if the ignore TLS verification is totally ignored.

@robinsmidsrod
Copy link

I'm just curious, why isn't ExecStartPost=/path/to/check-up-script.sh used here to perform the validation that the service is operational? Then you could avoid overriding more than this single line if the script in question doesn't work for your installation.

In fact, Influxdb should probably implement proper Liveness and Readiness (in Kubernetes syntax) tests and these could be used instead of this workaround. Since Influxdb 1.x has been superseeded by 2.x I'm going to assume nobody cares about this.

I'm also curious why a startup behavior change (like synchronous waiting for the full load of the data files) was introduced in a patch release. It might not be a breaking change, per se, but not far from it. Starting up Influxdb usually takes seconds, but loading all the data files can easily take minutes on a large installation.

@stefan-as
Copy link

For us, adding a counter makes it even worse. The resulting behavior is a timeout of 10 seconds (10 attempts, sleep 1 second) to start the influxdb daemon. For large databases, this timeout is far too short to read all data from disk. Now we see a systemd loop:

Aug  6 12:21:43 services-staging-monitor-00 influxd-systemd-start.sh[8931]: Failed to reach influxdb http endpoint at http://ip6-localhost:8086/health
Aug  6 12:21:43 services-staging-monitor-00 systemd[1]: influxdb.service: Control process exited, code=exited status=1
Aug  6 12:21:43 services-staging-monitor-00 systemd[1]: influxdb.service: Unit entered failed state.
Aug  6 12:21:43 services-staging-monitor-00 systemd[1]: influxdb.service: Failed with result 'exit-code'.
Aug  6 12:21:44 services-staging-monitor-00 systemd[1]: influxdb.service: Service hold-off time over, scheduling restart.

@thinkl33t
Copy link

We have the exact same issue, and have had to bump the sleep 1 to sleep 5

thinkl33t added a commit to thinkl33t/influxdb that referenced this issue Aug 6, 2021
With the current script, a large database doesn't have time to finish 
starting up in the 10 seconds alotted by this script.  This causes the 
service to fail to start up, and systemd to sit looping forever.

This PR increases the sleep value to 5 seconds, and the max_attempts to 
12, meaning the script will attempt to connect for up to 1 minute before 
giving up.

Part of influxdata#21967
@stefan-as
Copy link

stefan-as commented Aug 6, 2021

I don't think there is a "correct" value for max_attempts and sleep time, since the real timeout highly depends on a given setup, data size and workload. The whole approach to guess the startup time by a third party shell script is more than fishy. The only instance who know when the daemon is ready to handle client connections is the daemon itself.

@thinkl33t
Copy link

Indeed, thats where my issues stem from.

While i'd prefer this was fixed properly (that script has a lot of code issues!), the PR raised above is designed to get influxdb working again in the short-term for those of us with large databases, as right now its broken out of the box.

Lets get it working in the short term then come up with a perfectly designed fix :)

@codyshepherd
Copy link
Contributor

Hi all. Thanks for the continued feedback. We will be pushing an rc (release candidate) release with some additional fixes to the systemd wrapper. I'll post a link to that rc when it is out.

@dswarbrick
Copy link

My instance takes ~50s from process start to http socket bind, so the max_attempts=10 and sleep 1 is highly optimistic. Is this shell script really the best we can do in the systemd era? It feels more like something that belongs in a sysvinit script. Why not use go-systemd for signalling to systemd when the daemon is ready?

@majorbob
Copy link

majorbob commented Aug 7, 2021

My workaround on RaspberryPi running influx 1.8.9 with 18 month old database is to bump the max_attempts to 15, in /usr/lib/influxdb/scripts/influxd-system-start.sh. Now working again (until next update!).

@thorian93
Copy link

Yeah, I'm now pretty sure it is memory related. Increasing the swap file didn't help, Raspbian doesn't seem to use it. I will need to look into why this is. I did monitor the amount of available memory while starting influxd. It goes down rapidly, until at around 150 MB the program crashes.

If your database gets that memory demanding you probably might want to go to some stronger hardware than a Raspi. Eventually you will get more and more issues.

@michaelbeljaars
Copy link

@thorian93 yeah, thanks for the advice. I haven't had problems so far, but I guess it is getting big now. I have created a new InfluxDB (2.0) installation in a docker container on my Synology NAS, which is succesfully storing my sensor data. I am however still searching for a way to safe my old data and transfer it to the Influx installation on my NAS. I know this is a bit off topic, but any help would be appreciated. I currently can't make a backup of my database, because I can't get InfluxDB to run. I have tried having it read files directly from disk, instead of loading them into memory, but that has lead to errors of 'unable to open shard'. I suspect this has to do with a limit on the allowed number of open files. I'll continue investigating.

@thorian93
Copy link

Maybe open a dedicated issue detailing your problem @michaelbeljaars, so we do not get off topic here. 🙂

@framps
Copy link

framps commented Sep 7, 2021

Removing "Type=forking" worked for me as well.

For me too

# cat /etc/os-release 
PRETTY_NAME="Raspbian GNU/Linux 10 (buster)"
NAME="Raspbian GNU/Linux"
VERSION_ID="10"
VERSION="10 (buster)"
VERSION_CODENAME=buster
ID=raspbian
ID_LIKE=debian
HOME_URL="http://www.raspbian.org/"
SUPPORT_URL="http://www.raspbian.org/RaspbianForums"
BUG_REPORT_URL="http://www.raspbian.org/RaspbianBugs" 
# uname -a
Linux asterix.@@@.de 5.10.60-v7l+ #1449 SMP Wed Aug 25 15:00:44 BST 2021 armv7l GNU/Linux
# dpkg -l | grep influx
ii  influxdb                       1.8.9-1                             armhf        Distributed time-series database.

@prutschman
Copy link

Is there a timeline for a final release of rc10?

@vibhubithar
Copy link

I took the rc10 but not able to start the influxDB

I am using Rpi 0 with buster installed

pi@raspberrypi:~ $ influx
Failed to connect to http://localhost:8086: Get http://localhost:8086/ping: dial tcp [::1]:8086: connect: connection refused
Please check your connection settings and ensure 'influxd' is running.

pi@raspberrypi:~ $ influxd

2021-09-21T07:40:50.241434Z info InfluxDB starting {"log_id": "0Wipe0jG000", "version": "1.8.10rc0", "branch": "1.8", "commit": "115e24083d89"}
2021-09-21T07:40:50.245835Z info Go runtime {"log_id": "0Wipe0jG000", "version": "go1.13.8", "maxprocs": 1}
run: open server: listen: listen tcp 127.0.0.1:8088: bind: address already in use

but when i run "sudo /usr/lib/influxdb/scripts/influxd-systemd-start.sh" i do see influxdb started but influx still returns failed to connect

pi@raspberrypi:~ $ sudo /usr/lib/influxdb/scripts/influxd-systemd-start.sh
Merging with configuration at: /etc/influxdb/influxdb.conf

2021-09-21T07:38:45.096953Z info InfluxDB starting {"log_id": "0WipXNpW000", "version": "1.8.10rc0", "branch": "1.8", "commit": "115e24083d89"}
2021-09-21T07:38:45.105030Z info Go runtime {"log_id": "0WipXNpW000", "version": "go1.13.8", "maxprocs": 1}
run: open server: listen: listen tcp 127.0.0.1:8088: bind: address already in use
Merging with configuration at: /etc/influxdb/influxdb.conf
InfluxDB started

pi@raspberrypi:~ $ sudo systemctl status influxdb.service
● influxdb.service - InfluxDB is an open-source, distributed, time series database
Loaded: loaded (/lib/systemd/system/influxdb.service; enabled; vendor preset: enabled)
Active: active (running) since Tue 2021-09-21 00:38:12 PDT; 5min ago
Docs: https://docs.influxdata.com/influxdb/
Main PID: 311 (influxd)
Tasks: 8 (limit: 877)
CGroup: /system.slice/influxdb.service
└─311 /usr/bin/influxd -config /etc/influxdb/influxdb.conf

Sep 21 00:38:11 raspberrypi influxd-systemd-start.sh[302]: ts=2021-09-21T07:38:11.407127Z lvl=info msg="opened HTTP access log" log_id=0WipSoUl000 service=httpd path=stderr
Sep 21 00:38:11 raspberrypi influxd-systemd-start.sh[302]: ts=2021-09-21T07:38:11.407617Z lvl=info msg="Listening on HTTP" log_id=0WipSoUl000 service=httpd addr=192.168.8.227:808
Sep 21 00:38:11 raspberrypi influxd-systemd-start.sh[302]: ts=2021-09-21T07:38:11.412829Z lvl=info msg="Starting retention policy enforcement service" log_id=0WipSoUl000 service=
Sep 21 00:38:11 raspberrypi influxd-systemd-start.sh[302]: ts=2021-09-21T07:38:11.413961Z lvl=info msg="Listening for signals" log_id=0WipSoUl000
Sep 21 00:38:11 raspberrypi influxd-systemd-start.sh[302]: ts=2021-09-21T07:38:11.419409Z lvl=info msg="Storing statistics" log_id=0WipSoUl000 service=monitor db_instance=_intern
Sep 21 00:38:11 raspberrypi influxd-systemd-start.sh[302]: ts=2021-09-21T07:38:11.428506Z lvl=info msg="Sending usage statistics to usage.influxdata.com" log_id=0WipSoUl000
Sep 21 00:38:12 raspberrypi influxd-systemd-start.sh[302]: [httpd] 192.168.8.227 - - [21/Sep/2021:00:38:12 -0700] "GET /health HTTP/1.1" 200 110 "-" "curl/7.64.0" dfd262fb-1aae-1
Sep 21 00:38:12 raspberrypi influxd-systemd-start.sh[302]: InfluxDB started
Sep 21 00:38:12 raspberrypi systemd[1]: Started InfluxDB is an open-source, distributed, time series database.
Sep 21 00:38:45 raspberrypi influxd-systemd-start.sh[302]: [httpd] 192.168.8.227 - - [21/Sep/2021:00:38:45 -0700] "GET /health HTTP/1.1" 200 110 "-" "curl/7.64.0" f3a8ed77-1aae-1

@codyshepherd
Copy link
Contributor

Is there a timeline for a final release of rc10?

We will release another rc (release candidate) within the next week or two, with the official 1.8.10 release coming sometime soon after that.

@vibhubithar
Copy link

@codyshepherd i tried 1.8.20 rc but still was not able to start the influxdb on the raspberry pi (buster) i had to downgrade to 1.8.5 which worked like a charm

I was facing the same challenge when upgraded influxdb to 1.8.9 so I have to downgrade back to 1.8.5

https://vibhubithar.medium.com/workaround-latest-version-of-influxdb-not-starting-on-raspberry-pi-buster-a8b5afa84fce

sudo apt update
sudo apt upgrade -y
wget https://s3.amazonaws.com/dl.influxdata.com/influxdb/releases/influxdb_1.8.5_armhf.deb
sudo systemctl unmask influxdb.service
sudo systemctl start influxdb
sudo systemctl enable influxdb.service

@codyshepherd
Copy link
Contributor

@vibhubithar which version are you using (neither rc10 nor 1.8.20 are valid versions) and what package/installation method are you using? What is the architecture of the system you're using? Have you tried any of the workarounds described in this issue's conversation, and if so, what were the results (this will help narrow down the cause of the error you're seeing)?

From the output at the end of your post above, it does seem like the systemd daemon is in fact starting correctly. You should not need to manually run the systemd start script.

1.8.9 is known to have a startup problem when started via systemd. Upgrading from an old install will not fix this yet, as it will fail to upgrade to 1.8.10rc0 (the current working version), as this is a release candidate only, and has not been published to our package repository. You will need to download and install the rc release manually using apt or dpkg.

Please do the following to reproduce your error:

  • Ensure you've downloaded the 1.8.10rc0 package at https://dl.influxdata.com/influxdb/releases/influxdb_1.8.10rc0_< your arch >.deb
  • install the package using Apt or dpkg
  • attempt one more more of the steps in the getting started guide
  • define what you expect to see happen, and what error behavior you observed instead
  • reproducing this error on a "fresh" machine such as a VM (e.g. using Multipass) can help ensure errors aren't from conflicting system configuration

@codyshepherd
Copy link
Contributor

The 1.8.10rc1 build is now available.

Find the deb for your architecture at https://dl.influxdata.com/influxdb/releases/influxdb_1.8.10rc1_< arch >.deb

The deb is available for amd64, arm64, armel, armhf, and i386

And the rpm at https://dl.influxdata.com/influxdb/releases/influxdb-1.8.10rc1.< arch >.rpm

The rpm is available for x86_64, aarch64, armel, armv7hl, and i386.

@wscheicher
Copy link

I was debugging a setup the last hours and now see this bug report.
Figured myself that the 10s timeout was totally unrealistic and increased it to several minutes.
Since i also had the impression that influxdb fails to start when other resource intensive services are starting/running, i tried to make telegraf wait.

For that i put in a systemd-notify --ready at the end of influxd-systemd-start.sh and Type=notify and NotifyAccess=all to influxdb.service

My learnings ( influxdb 1.8.9-1 on raspberry zero ): influxdb takes ~2 minutes to start.

I wonder if it wouldn't make sense to have influxd send systemd notifications itself once ready and have systemd handle starting failures?

@prupert
Copy link

prupert commented Oct 4, 2021

Work-around via Ansible:

# See: https://github.com/influxdata/influxdb/issues/21967
# Fix startup issue with InfluxDB up to 1.8.9

- name: Ensure drop-in config directory for InfluxDB systemd service exists
  file:
    path: /etc/systemd/system/influxdb.service.d
    state: directory

- name: Ensure work-around for InfluxDB service startup
  template:
    src: service.d.override.conf.j2
    dest: /etc/systemd/system/influxdb.service.d/override.conf
  register: _influxdb_unitfile
  notify:
    - restart influxdb

- name: Force daemon-reload
  systemd:
    daemon_reload: yes
  when: _influxdb_unitfile.changed

Template:

# {{ ansible_managed }}
# Work-around for https://github.com/influxdata/influxdb/issues/21967
[Service]
Type=simple

@dswarbrick
Copy link

dswarbrick commented Oct 17, 2021

The startup script in 1.8.10 no longer has a max number of attempts waiting for the health endpoint to respond. But since the script doesn't exit until the health endpoint responds, this is exceeding systemd's DefaultTimeoutStartSec (= 90s) on my system. I had to add a drop-in to override TimeoutStartSec - 120s was sufficient, but YMMV.

@jriobello
Copy link

thanks @IgorSimic . I replace the systemd service file to old one before the change c8de72d and it works (on Centos 7).

## If you modify this, please also make sure to edit init.sh
#
[Unit]
Description=InfluxDB is an open-source, distributed, time series database
Documentation=https://docs.influxdata.com/influxdb/
After=network-online.target
[Service]
User=influxdb
Group=influxdb
LimitNOFILE=65536
EnvironmentFile=-/etc/default/influxdb
ExecStart=/usr/bin/influxd
KillMode=control-group
Restart=on-failure

[Install]
WantedBy=multi-user.target
Alias=influxd.service

So I dig what is an issue in the new file and by removing:

"Type=forking" 

in /etc/systemd/system/multi-user.target.wants/influxdb.service it seems to work.

NOTE: before set:

chmod +x /usr/lib/influxdb/scripts/influxd-systemd-start.sh

In my rpi zero w change timeout of systemd..
TimeoutStartSec=120

(default 60 restart de process forever...)

@tredondo
Copy link

Landed here via a search for "InfluxDB API unavailable after" attempts, and this was the only result. Influx was working fine a few days ago, but now connections to it time out.

I can't tell from the service status what's going on, but I didn't restart the service because it's only configured to run at startup, and my host's uptime is 28 days, since I upgraded to 1.8.10rc0. Ubuntu 18.04.6 LTS (GNU/Linux 4.15.0-156-generic x86_64)

# systemctl status influxdb.service
● influxdb.service - InfluxDB is an open-source, distributed, time series database
   Loaded: loaded (/lib/systemd/system/influxdb.service; enabled; vendor preset: enabled)
   Active: activating (start) since Mon 2021-10-18 09:02:03 EDT; 1min 20s ago
     Docs: https://docs.influxdata.com/influxdb/
 Main PID: 1062 (code=exited, status=2); Control PID: 16822 (influxd-systemd)
    Tasks: 12 (limit: 4702)
   CGroup: /system.slice/influxdb.service
           ├─16822 /bin/bash -e /usr/lib/influxdb/scripts/influxd-systemd-start.sh
           ├─16827 /usr/bin/influxd -config /etc/influxdb/influxdb.conf
           └─17330 sleep 1

Oct 18 09:03:26 myhost influxd-systemd-start.sh[16822]: ts=2021-10-18T13:03:26.253222Z lvl=info msg="Opened shard" log_id=0XGslC4l000 service=store trace_id=0XGslfpG000 op_name=tsdb_open index_version=inmem path=/mnt/influxdb/data/mydb/autogen/3098 duration=51.100ms
Oct 18 09:03:26 myhost influxd-systemd-start.sh[16822]: ts=2021-10-18T13:03:26.256154Z lvl=info msg="Opened file" log_id=0XGslC4l000 engine=tsm1 service=filestore path=/mnt/influxdb/data/mydb/autogen/3099/000000001-000000001.tsm id=0 duration=0.750ms
Oct 18 09:03:26 myhost influxd-systemd-start.sh[16822]: ts=2021-10-18T13:03:26.259869Z lvl=info msg="Opened shard" log_id=0XGslC4l000 service=store trace_id=0XGslfpG000 op_name=tsdb_open index_version=inmem path=/mnt/influxdb/data/mydb/autogen/3097 duration=58.175ms
Oct 18 09:03:26 myhost influxd-systemd-start.sh[16822]: ts=2021-10-18T13:03:26.276195Z lvl=info msg="Opened file" log_id=0XGslC4l000 engine=tsm1 service=filestore path=/mnt/influxdb/data/mydb/autogen/31/000000084-000000002.tsm id=0 duration=10.887ms
Oct 18 09:03:26 myhost influxd-systemd-start.sh[16822]: ts=2021-10-18T13:03:26.297516Z lvl=info msg="Opened shard" log_id=0XGslC4l000 service=store trace_id=0XGslfpG000 op_name=tsdb_open index_version=inmem path=/mnt/influxdb/data/mydb/autogen/3099 duration=44.045ms
Oct 18 09:03:26 myhost influxd-systemd-start.sh[16822]: ts=2021-10-18T13:03:26.300472Z lvl=info msg="Opened file" log_id=0XGslC4l000 engine=tsm1 service=filestore path=/mnt/influxdb/data/mydb/autogen/3100/000000001-000000001.tsm id=0 duration=0.804ms
Oct 18 09:03:26 myhost influxd-systemd-start.sh[16822]: ts=2021-10-18T13:03:26.315064Z lvl=info msg="Opened shard" log_id=0XGslC4l000 service=store trace_id=0XGslfpG000 op_name=tsdb_open index_version=inmem path=/mnt/influxdb/data/mydb/autogen/31 duration=55.053ms
Oct 18 09:03:26 myhost influxd-systemd-start.sh[16822]: ts=2021-10-18T13:03:26.325964Z lvl=info msg="Opened shard" log_id=0XGslC4l000 service=store trace_id=0XGslfpG000 op_name=tsdb_open index_version=inmem path=/mnt/influxdb/data/mydb/autogen/3100 duration=28.314ms
Oct 18 09:03:26 myhost influxd-systemd-start.sh[16822]: ts=2021-10-18T13:03:26.328916Z lvl=info msg="Opened file" log_id=0XGslC4l000 engine=tsm1 service=filestore path=/mnt/influxdb/data/mydb/autogen/3102/000000001-000000001.tsm id=0 duration=0.696ms
Oct 18 09:03:26 myhost influxd-systemd-start.sh[16822]: ts=2021-10-18T13:03:26.330406Z lvl=info msg="Opened file" log_id=0XGslC4l000 engine=tsm1 service=filestore path=/mnt/influxdb/data/mydb/autogen/3101/000000001-000000001.tsm id=0 duration=0.843ms
Oct 18 09:03:26 myhost influxd-systemd-start.sh[16822]: ts=2021-10-18T13:03:26.416401Z lvl=info msg="Opened shard" log_id=0XGslC4l000 service=store trace_id=0XGslfpG000 op_name=tsdb_open index_version=inmem path=/mnt/influxdb/data/mydb/autogen/3101 duration=101.245ms
Oct 18 09:03:26 myhost influxd-systemd-start.sh[16822]: ts=2021-10-18T13:03:26.416904Z lvl=info msg="Opened shard" log_id=0XGslC4l000 service=store trace_id=0XGslfpG000 op_name=tsdb_open index_version=inmem path=/mnt/influxdb/data/mydb/autogen/3102 duration=90.820ms
Oct 18 09:03:26 myhost influxd-systemd-start.sh[16822]: ts=2021-10-18T13:03:26.419852Z lvl=info msg="Opened file" log_id=0XGslC4l000 engine=tsm1 service=filestore path=/mnt/influxdb/data/mydb/autogen/3104/000000001-000000001.tsm id=0 duration=1.072ms
Oct 18 09:03:26 myhost influxd-systemd-start.sh[16822]: ts=2021-10-18T13:03:26.424410Z lvl=info msg="Opened file" log_id=0XGslC4l000 engine=tsm1 service=filestore path=/mnt/influxdb/data/mydb/autogen/3103/000000001-000000001.tsm id=0 duration=0.995ms
Oct 18 09:03:26 myhost influxd-systemd-start.sh[16822]: ts=2021-10-18T13:03:26.471923Z lvl=info msg="Opened shard" log_id=0XGslC4l000 service=store trace_id=0XGslfpG000 op_name=tsdb_open index_version=inmem path=/mnt/influxdb/data/mydb/autogen/3103 duration=55.408ms
Oct 18 09:03:26 myhost influxd-systemd-start.sh[16822]: ts=2021-10-18T13:03:26.484100Z lvl=info msg="Opened file" log_id=0XGslC4l000 engine=tsm1 service=filestore path=/mnt/influxdb/data/mydb/autogen/3105/000000001-000000001.tsm id=0 duration=0.870ms
Oct 18 09:03:26 myhost influxd-systemd-start.sh[16822]: ts=2021-10-18T13:03:26.504035Z lvl=info msg="Opened shard" log_id=0XGslC4l000 service=store trace_id=0XGslfpG000 op_name=tsdb_open index_version=inmem path=/mnt/influxdb/data/mydb/autogen/3104 duration=87.102ms
Oct 18 09:03:26 myhost influxd-systemd-start.sh[16822]: ts=2021-10-18T13:03:26.523172Z lvl=info msg="Opened file" log_id=0XGslC4l000 engine=tsm1 service=filestore path=/mnt/influxdb/data/mydb/autogen/3106/000000001-000000001.tsm id=0 duration=4.405ms
Oct 18 09:03:26 myhost influxd-systemd-start.sh[16822]: ts=2021-10-18T13:03:26.543759Z lvl=info msg="Opened shard" log_id=0XGslC4l000 service=store trace_id=0XGslfpG000 op_name=tsdb_open index_version=inmem path=/mnt/influxdb/data/mydb/autogen/3105 duration=71.663ms
Oct 18 09:03:26 myhost influxd-systemd-start.sh[16822]: ts=2021-10-18T13:03:26.559327Z lvl=info msg="Opened file" log_id=0XGslC4l000 engine=tsm1 service=filestore path=/mnt/influxdb/data/mydb/autogen/3107/000000001-000000001.tsm id=0 duration=1.215ms
Oct 18 09:03:26 myhost influxd-systemd-start.sh[16822]: ts=2021-10-18T13:03:26.581875Z lvl=info msg="Opened shard" log_id=0XGslC4l000 service=store trace_id=0XGslfpG000 op_name=tsdb_open index_version=inmem path=/mnt/influxdb/data/mydb/autogen/3106 duration=77.750ms
Oct 18 09:03:26 myhost influxd-systemd-start.sh[16822]: ts=2021-10-18T13:03:26.586638Z lvl=info msg="Opened file" log_id=0XGslC4l000 engine=tsm1 service=filestore path=/mnt/influxdb/data/mydb/autogen/3108/000000001-000000001.tsm id=0 duration=0.737ms
Oct 18 09:03:26 myhost influxd-systemd-start.sh[16822]: ts=2021-10-18T13:03:26.590155Z lvl=info msg="Opened shard" log_id=0XGslC4l000 service=store trace_id=0XGslfpG000 op_name=tsdb_open index_version=inmem path=/mnt/influxdb/data/mydb/autogen/3107 duration=46.308ms
Oct 18 09:03:26 myhost influxd-systemd-start.sh[16822]: ts=2021-10-18T13:03:26.594937Z lvl=info msg="Opened file" log_id=0XGslC4l000 engine=tsm1 service=filestore path=/mnt/influxdb/data/mydb/autogen/3109/000000001-000000001.tsm id=0 duration=1.864ms
Oct 18 09:03:26 myhost influxd-systemd-start.sh[16822]: ts=2021-10-18T13:03:26.641453Z lvl=info msg="Opened shard" log_id=0XGslC4l000 service=store trace_id=0XGslfpG000 op_name=tsdb_open index_version=inmem path=/mnt/influxdb/data/mydb/autogen/3108 duration=58.813ms
Oct 18 09:03:26 myhost influxd-systemd-start.sh[16822]: ts=2021-10-18T13:03:26.641547Z lvl=info msg="Opened shard" log_id=0XGslC4l000 service=store trace_id=0XGslfpG000 op_name=tsdb_open index_version=inmem path=/mnt/influxdb/data/mydb/autogen/3109 duration=51.323ms
Oct 18 09:03:26 myhost influxd-systemd-start.sh[16822]: ts=2021-10-18T13:03:26.644278Z lvl=info msg="Opened file" log_id=0XGslC4l000 engine=tsm1 service=filestore path=/mnt/influxdb/data/mydb/autogen/3111/000000001-000000001.tsm id=0 duration=0.997ms
Oct 18 09:03:26 myhost influxd-systemd-start.sh[16822]: ts=2021-10-18T13:03:26.663997Z lvl=info msg="Opened file" log_id=0XGslC4l000 engine=tsm1 service=filestore path=/mnt/influxdb/data/mydb/autogen/3110/000000001-000000001.tsm id=0 duration=4.597ms
Oct 18 09:03:26 myhost influxd-systemd-start.sh[16822]: InfluxDB API unavailable after 82 attempts...

@codyshepherd
Copy link
Contributor

I'm closing this bug as fixed in 1.8.10, with the recommendation that those affected should upgrade to that release. Please open a new bug for any issues encountered in 1.8.10.

@tredondo
Copy link

As I mentioned in my comment above, I was still running into this issue after upgrading to 1.8.10. Downgrading to 1.8.5 fixed it, without any configuration changes.

@codyshepherd
Copy link
Contributor

codyshepherd commented Oct 19, 2021

As I mentioned in my comment above, I was still running into this issue after upgrading to 1.8.10. Downgrading to 1.8.5 fixed it, without any configuration changes.

Apologies @tredondo I must have skimmed right over your response.

I'm a bit unclear on the current status of your influxdb installation. You upgraded, correct? To which version? And what series of actions led to the log you have posted above? i.e. can you describe a reproducer?

@tredondo
Copy link

Np @codyshepherd. Unfortunately I can't provide repro steps. I have a set-it-and-forget-it InfluxDB setup that I don't touch. I restarted the host after upgrading Influx to 1.8.10rc0 per this comment. Influx started successfully, but at some point last week, clients could no longer connect to it. I didn't initiate an influxd restart and the host's uptime was 28+ days, but let me know if I can determine if influxd restarted on its own.

I suspect the slow startup might be caused by the size of the database - ~88GB on an XFS volume.

@codyshepherd
Copy link
Contributor

codyshepherd commented Oct 20, 2021

@tredondo If the influxdb installation did initially come up after you restarted your host, it seems unlikely that you are experiencing this particular issue. I would recommend submitting a new issue describing the sequence of events leading to the unresponsive database, as well as your system configuration in as great of detail you can manage. It could be, for example, that you are experiencing out-of-memory symptoms or disk-related problems. Dialing in to the possibilities would be best done in a fresh issue.

@mterrasson
Copy link

In my rpi zero w change timeout of systemd.. TimeoutStartSec=120

(default 60 restart de process forever...)

Thank you @jriobello , It worked on a Rasp 1.

@tredondo
Copy link

tredondo commented Nov 4, 2021

@codyshepherd: filed a separate issue, #22824. I'm unable to start influxd at all not. Neither 1.8.5, nor 1.8.10 manage to start.

@keith4
Copy link

keith4 commented Dec 30, 2021

Upgraded a (large) server from 1.8.4 to 1.8.10, got the broken systemd wrapper as part of the upgrade. Syslog just fills with lines like influxd-systemd-start.sh[6444]: InfluxDB API unavailable after 56 attempts... (I let it get to 87 before killing it and finding this issue.)

Restoring the systemd service file from 1.8.4 lets it start just fine. If you're going to do this, remove the superfluous symlink of /etc/systemd/system/influxd.service -> /lib/systemd/system/influxdb.service, put the old/working influxdb.service file into /etc/systemd/system (it will take precedence over the bad one in /lib/systemd/system), and run systemctl daemon-reload

@mattjudson
Copy link

mattjudson commented Feb 7, 2022

I just wanted to add that I came across this page after a search for "InfluxDB API unavailable after". My setup is also largely untouched and Influxdb was working prior to reboot, but after a reboot it was stuck in the activating state. I am pretty good about keeping it up to date and was already running the 1.8.10 version so although I do recall seeing influxdb restarting in previous versions, for my setup it seems that InfluxDB has just started taking longer to start. At this point it could be due to the size of the DB. Mine is ~200 Gig's, but I'm not sure.

Anyhow I seeing the influxd-systemd-start.sh script continuously looping and I noticed it would loop at the 86th attempt I was able to resolve it by adding the longer startup timeout ( I set mine to 120 seconds) as described in issue 22803 here: #22803

Before
[judsonm@Cent7Test]:/home/judsonm>sudo yum list influxdb
Loaded plugins: fastestmirror
Loading mirror speeds from cached hostfile

  • base: mirror.pit.teraswitch.com
  • epel: repos.eggycrew.com
  • extras: nc-centos-mirror.iwebfusion.net
  • updates: mirror.ilumno.com
    Installed Packages
    influxdb.x86_64 1.8.10-1 @InfluxDB
    [judsonm@Cent7Test]:/home/judsonm>

Feb 7 10:46:17 Cent7Test influxd-systemd-start.sh[20506]: InfluxDB API unavailable after 85 attempts...
Feb 7 10:46:18 Cent7Test influxd-systemd-start.sh[20506]: ts=2022-02-07T15:46:18.156772Z lvl=info msg="Opened shard" log_id=0ZXDaBrl000 service=store trace_id=0ZXDaCJ0000 op_name=tsdb_open index_version=inmem path=/var/lib/influxdb/data/telegraf/autogen/1222 duration=3214.365ms
Feb 7 10:46:18 Cent7Test influxd-systemd-start.sh[20506]: ts=2022-02-07T15:46:18.417608Z lvl=info msg="Opened file" log_id=0ZXDaBrl000 engine=tsm1 service=filestore path=/var/lib/influxdb/data/telegraf/autogen/1290/000009196-000000002.tsm id=0 duration=259.313ms
Feb 7 10:46:18 Cent7Test influxd-systemd-start.sh[20506]: InfluxDB API unavailable after 86 attempts...
Feb 7 10:46:18 Cent7Test systemd[1]: influxdb.service start operation timed out. Terminating.
Feb 7 10:46:19 Cent7Test systemd[1]: Failed to start InfluxDB is an open-source, distributed, time series database.
Feb 7 10:46:19 Cent7Test systemd[1]: Unit influxdb.service entered failed state.
Feb 7 10:46:19 Cent7Test systemd[1]: influxdb.service failed.
Feb 7 10:46:19 Cent7Test influxd-systemd-start.sh[21264]: Merging with configuration at: /etc/influxdb/influxdb.conf
Feb 7 10:46:19 Cent7Test influxd-systemd-start.sh[21264]: ts=2022-02-07T15:46:19.592193Z lvl=info msg="InfluxDB starting" log_id=0ZXDfkI0000 version=1.8.10 branch=1.8 commit=688e697c51fd
Feb 7 10:46:19 Cent7Test influxd-systemd-start.sh[21264]: ts=2022-02-07T15:46:19.592221Z lvl=info msg="Go runtime" log_id=0ZXDfkI0000 version=go1.13.8 maxprocs=4
Feb 7 10:46:19 Cent7Test influxd-systemd-start.sh[21264]: Merging with configuration at: /etc/influxdb/influxdb.conf
Feb 7 10:46:19 Cent7Test influxd-systemd-start.sh[21264]: InfluxDB API unavailable after 1 attempts...
Feb 7 10:46:19 Cent7Test influxd-systemd-start.sh[21264]: ts=2022-02-07T15:46:19.698705Z lvl=info msg="Using data dir" log_id=0ZXDfkI0000 service=store path=/var/lib/influxdb/data
Feb 7 10:46:19 Cent7Test influxd-systemd-start.sh[21264]: ts=2022-02-07T15:46:19.698742Z lvl=info msg="Compaction settings" log_id=0ZXDfkI0000 service=store max_concurrent_compactions=2 throughput_bytes_per_second=50331648 throughput_bytes_per_second_burst=50331648
Feb 7 10:46:19 Cent7Test influxd-systemd-start.sh[21264]: ts=2022-02-07T15:46:19.698754Z lvl=info msg="Open store (start)" log_id=0ZXDfkI0000 service=store trace_id=0ZXDfkhW000 op_name=tsdb_open op_event=start
Feb 7 10:46:19 Cent7Test influxd-systemd-start.sh[21264]: ts=2022-02-07T15:46:19.775833Z lvl=info msg="Opened file" log_id=0ZXDfkI0000 engine=tsm1 service=filestore path=/var/lib/influxdb/data/_internal/monitor/1351/000000003-000000002.tsm id=0 duration=21.120ms
Feb 7 10:46:19 Cent7Test influxd-systemd-start.sh[21264]: ts=2022-02-07T15:46:19.775972Z lvl=info msg="Opened file" log_id=0ZXDfkI0000 engine=tsm1 service=filestore path=/var/lib/influxdb/data/_internal/monitor/1341/000000062-000000002.tsm id=0 duration=25.299ms
Feb 7 10:46:19 Cent7Test influxd-systemd-start.sh[21264]: ts=2022-02-07T15:46:19.775996Z lvl=info msg="Reading file" log_id=0ZXDfkI0000 engine=tsm1 service=cacheloader path=/var/lib/influxdb/wal/_internal/monitor/1351/_00016.wal size=10583641
Feb 7 10:46:19 Cent7Test influxd-systemd-start.sh[21264]: ts=2022-02-07T15:46:19.786360Z lvl=info msg="Opened file" log_id=0ZXDfkI0000 engine=tsm1 service=filestore path=/var/lib/influxdb/data/_internal/monitor/1339/000000062-000000002.tsm id=0 duration=34.111ms
Feb 7 10:46:19 Cent7Test influxd-systemd-start.sh[21264]: ts=2022-02-07T15:46:19.795010Z lvl=info msg="Opened file" log_id=0ZXDfkI0000 engine=tsm1 service=filestore path=/var/lib/influxdb/data/_internal/monitor/1343/000000062-000000002.tsm id=0 duration=42.740ms
Feb 7 10:46:20 Cent7Test influxd-systemd-start.sh[21264]: ts=2022-02-07T15:46:20.327701Z lvl=info msg="Reading file" log_id=0ZXDfkI0000 engine=tsm1 service=cacheloader path=/var/lib/influxdb/wal/_internal/monitor/1351/_00017.wal size=10586036
Feb 7 10:46:20 Cent7Test influxd-systemd-start.sh[21264]: InfluxDB API unavailable after 2 attempts...

edited the influxd.service to add the TimeoutStartSec=120......

[judsonm@Cent7Test]:/home/judsonm>cat /etc/systemd/system/influxd.service

[Unit]
Description=InfluxDB is an open-source, distributed, time series database
Documentation=https://docs.influxdata.com/influxdb/
After=network-online.target

[Service]
User=influxdb
Group=influxdb
LimitNOFILE=65536
EnvironmentFile=-/etc/default/influxdb
ExecStart=/usr/lib/influxdb/scripts/influxd-systemd-start.sh
KillMode=control-group
Restart=on-failure
Type=forking
PIDFile=/var/lib/influxdb/influxd.pid
TimeoutStartSec=120

[Install]
WantedBy=multi-user.target
Alias=influxd.service
[judsonm@Cent7Test]:/home/judsonm>

Finally able to start after the 110th attempt:

Feb 7 11:24:04 Cent7Test influxd-systemd-start.sh[5254]: InfluxDB API unavailable after 109 attempts...
Feb 7 11:24:04 Cent7Test influxd-systemd-start.sh[5254]: ts=2022-02-07T16:24:04.749828Z lvl=info msg="Opened shard" log_id=0ZXFj3l0000 service=store trace_id=0ZXFj49G000 op_name=tsdb_open index_version=inmem path=/var/lib/influxdb/data/telegraf/autogen/1320 duration=3422.958ms
Feb 7 11:24:05 Cent7Test influxd-systemd-start.sh[5254]: ts=2022-02-07T16:24:05.554035Z lvl=info msg="Opened shard" log_id=0ZXFj3l0000 service=store trace_id=0ZXFj49G000 op_name=tsdb_open index_version=inmem path=/var/lib/influxdb/data/telegraf/autogen/1262 duration=2319.729ms
Feb 7 11:24:05 Cent7Test influxd-systemd-start.sh[5254]: InfluxDB API unavailable after 110 attempts...
Feb 7 11:24:05 Cent7Test influxd-systemd-start.sh[5254]: ts=2022-02-07T16:24:05.898781Z lvl=info msg="Open store (end)" log_id=0ZXFj3l0000 service=store trace_id=0ZXFj49G000 op_name=tsdb_open op_event=end op_elapsed=114661.413ms
Feb 7 11:24:05 Cent7Test influxd-systemd-start.sh[5254]: ts=2022-02-07T16:24:05.901623Z lvl=info msg="Opened service" log_id=0ZXFj3l0000 service=subscriber
Feb 7 11:24:05 Cent7Test influxd-systemd-start.sh[5254]: ts=2022-02-07T16:24:05.901652Z lvl=info msg="Starting monitor service" log_id=0ZXFj3l0000 service=monitor
Feb 7 11:24:05 Cent7Test influxd-systemd-start.sh[5254]: ts=2022-02-07T16:24:05.901664Z lvl=info msg="Registered diagnostics client" log_id=0ZXFj3l0000 service=monitor name=build
Feb 7 11:24:05 Cent7Test influxd-systemd-start.sh[5254]: ts=2022-02-07T16:24:05.901672Z lvl=info msg="Registered diagnostics client" log_id=0ZXFj3l0000 service=monitor name=runtime
Feb 7 11:24:05 Cent7Test influxd-systemd-start.sh[5254]: ts=2022-02-07T16:24:05.901676Z lvl=info msg="Registered diagnostics client" log_id=0ZXFj3l0000 service=monitor name=network
Feb 7 11:24:05 Cent7Test influxd-systemd-start.sh[5254]: ts=2022-02-07T16:24:05.901699Z lvl=info msg="Registered diagnostics client" log_id=0ZXFj3l0000 service=monitor name=system
Feb 7 11:24:05 Cent7Test influxd-systemd-start.sh[5254]: ts=2022-02-07T16:24:05.901727Z lvl=info msg="Starting precreation service" log_id=0ZXFj3l0000 service=shard-precreation check_interval=10m advance_period=30m
Feb 7 11:24:05 Cent7Test influxd-systemd-start.sh[5254]: ts=2022-02-07T16:24:05.902650Z lvl=info msg="Storing statistics" log_id=0ZXFj3l0000 service=monitor db_instance=_internal db_rp=monitor interval=10s
Feb 7 11:24:05 Cent7Test influxd-systemd-start.sh[5254]: ts=2022-02-07T16:24:05.903930Z lvl=info msg="Starting snapshot service" log_id=0ZXFj3l0000 service=snapshot
Feb 7 11:24:05 Cent7Test influxd-systemd-start.sh[5254]: ts=2022-02-07T16:24:05.903951Z lvl=info msg="Starting continuous query service" log_id=0ZXFj3l0000 service=continuous_querier
Feb 7 11:24:05 Cent7Test influxd-systemd-start.sh[5254]: ts=2022-02-07T16:24:05.903968Z lvl=info msg="Starting HTTP service" log_id=0ZXFj3l0000 service=httpd authentication=false
Feb 7 11:24:05 Cent7Test influxd-systemd-start.sh[5254]: ts=2022-02-07T16:24:05.903975Z lvl=info msg="opened HTTP access log" log_id=0ZXFj3l0000 service=httpd path=stderr
Feb 7 11:24:05 Cent7Test influxd-systemd-start.sh[5254]: ts=2022-02-07T16:24:05.911032Z lvl=info msg="Listening on HTTP" log_id=0ZXFj3l0000 service=httpd addr=0.0.0.0:8086 https=false
Feb 7 11:24:05 Cent7Test influxd-systemd-start.sh[5254]: ts=2022-02-07T16:24:05.911067Z lvl=info msg="Starting retention policy enforcement service" log_id=0ZXFj3l0000 service=retention check_interval=30m
Feb 7 11:24:05 Cent7Test influxd-systemd-start.sh[5254]: ts=2022-02-07T16:24:05.911570Z lvl=info msg="Listening for signals" log_id=0ZXFj3l0000
Feb 7 11:24:05 Cent7Test influxd-systemd-start.sh[5254]: ts=2022-02-07T16:24:05.918103Z lvl=info msg="Sending usage statistics to usage.influxdata.com" log_id=0ZXFj3l0000
Feb 7 11:24:06 Cent7Test influxd-systemd-start.sh[5254]: [httpd] 127.0.0.1 - - [07/Feb/2022:11:24:06 -0500] "GET /health HTTP/1.1" 200 107 "-" "curl/7.29.0" 5f24f92d-8832-11ec-8001-0050569bd466 241
Feb 7 11:24:06 Cent7Test influxd-systemd-start.sh[5254]: InfluxDB started

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests