
Compaction crash loops and data loss on Raspberry Pi 3 B+ under minimal load #11339

Closed
ryan-williams opened this issue Jan 21, 2019 · 68 comments

@ryan-williams

Following up on this post with a fresh issue to highlight worse symptoms that don't seem explainable by a db-size cutoff (as was speculated on #6975 and elsewhere):

In the month since that post, I've had to forcibly mv the data/collectd directory twice to unstick influx from 1-2min crash loops that lasted days, seemingly due to compaction errors.

Today I'm noticing that my temps database (which I've not messed with during these collectd db problems, and gets about 5 points per second written to it) is missing large swaths of data from the 2 months I've been writing to it:

The last gap, between 1/14 and 1/17, didn't exist this morning (when influx was still crash-looping, before the most recent time I ran mv /var/lib/influxdb/data/collectd ~/collectd.bak). That data was just recently discarded, it seems, possibly around the time I performed my "work-around" for the crash loop:

sudo service influxdb stop
sudo mv /var/lib/influxdb/data/collectd ~/collectd.bak
sudo service influxdb start
influx -execute 'create database collectd'

The default retention policy should not be discarding data, afaict:

> show retention policies on temps
name    duration shardGroupDuration replicaN default
----    -------- ------------------ -------- -------
autogen 0s       168h0m0s           1        true

Here's the last ~7d of syslogs from the RPi server, 99.9% of which is logs from Influx crash-looping.

There seem to be messages about:

  • WAL paths "already existing" when they were about to be written to, and
  • compactions failing because they couldn't allocate memory
    • that's confusing because this RPi has consistently been using ≈½ of its 1GB of memory and 0% of its 2GB of swap, with a 64GB USB flash drive as its hard disk, which is about 30% full.

Is running InfluxDB on an RPi supposed to generally work, or am I in uncharted territory just by attempting it?

@wollew

wollew commented Apr 6, 2019

I don't know if it is supposed to work but I can definitely reproduce this issue on my Raspi 3B+.

@ryan-williams
Author

I've given up trying to keep it running.

My planned next steps, whenever I have time, are:

  • use a regular, non-timeseries db
  • try running the whole thing in a docker container to provide better repro/forensic hints

@wollew

wollew commented Apr 10, 2019

If you're willing to build InfluxDB yourself, you could try the branch in pr #12362

@aemondis

aemondis commented Apr 25, 2019

My RPi 3B+ with InfluxDB has just started being hit by this same issue, ironically also while collecting environmental data. I haven't started losing data yet, but I'm getting the endless crash loops filling up syslog with the same memory allocation and compaction errors.

#12362 also mentions this issue in relation to mmap on 32-bit platforms, due to the limited allocatable memory. I too am debating my options... it runs on an RPi because I don't want a hot and power-hungry server for what should be a simple function!

@ryan-williams
Author

Thanks for the corroboration!

FWIW, I no longer think this is due to the 32-bit platform / max memory problem.

I don't remember the details, but I think I saw a DB get past that size on my RPi, and I also saw this crash start well before anything should have been hitting that limit.

I've seen evidence of disk write failures or slowness possibly causing the initial problem (files quarantined with .bad extensions in Influx's data directories).

It seems like we're stuck until one of us captures the full state of one of these failing deployments, and someone who knows how to parse that state has a look…

@aemondis

aemondis commented May 3, 2019

I do find it seems to go through fits and starts... it runs fine for several days, then suddenly services start dropping offline. I'm actually thinking of modifying the service file to limit scheduler priority (via chrt) to ensure influx cannot consume all the resources of the system, as I get the feeling the load average spiking as high as it does can't be helping the situation, since the CPU on the Pi tends to become near-unusable under high load. I have also been finding that under high I/O caused by Influx, I start getting corruption in some places on the SD card...
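
For what it's worth, a systemd drop-in is one way to express that idea without editing the packaged unit file. This is only a sketch using standard systemd directives; it assumes the service is named influxdb.service, and I haven't verified how much it actually helps with this problem:

# /etc/systemd/system/influxdb.service.d/priority.conf  (hypothetical drop-in)
[Service]
Nice=10                        # lower CPU priority than the default 0
CPUSchedulingPolicy=batch      # hint that this is a background/batch workload
IOSchedulingClass=best-effort
IOSchedulingPriority=7         # lowest best-effort I/O priority
# apply with: sudo systemctl daemon-reload && sudo systemctl restart influxdb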

My influx db sits on an external USB3-based SSD (an old Intel X25-M 74GB SSD, so by no means fast, but definitely a million times faster than the SD card!) - so I don't think the disk I/O is an issue in my case.

Perhaps if you are finding corruption and influx is sitting on the SD, it could be the same issue as me with the high CPU... but perhaps also give a different SD a try? SD cards aren't really designed for heavy write activity, as they don't have any smarts to clean up deleted files with trim etc...

I have also seen many cases of failed SD cards on thin clients due to antivirus definitions, so I know SD card failures are definitely not unheard of...

@ryan-williams
Author

Good to know you are seeing this, or something similar, on an SSD.

I switched influx to a USB thumb drive in my Pi when I suspected it was an SD card IO issue. It should be an order of magnitude faster than the SD card, but I saw the issue similarly on both.

@akors

akors commented May 15, 2019

Hi, just to share my experiences for anybody affected by the issue. I had a database that was about 1.5 GB in size, and influxdb would keep crashing on me.

  • After a few months, influxdb went into a crash loop
  • Upgraded from 1.0 to 1.7, still kept crash looping with OOM errors.
  • Used influx_inspect buildtsi to convert to the new index format
  • Added 2 GB of swap space by setting CONF_SWAPSIZE=2048 in /etc/dphys-swapfile and rebooting
  • Set index-version = "tsi1" in the [data] section (rough command sketch below).
  • Started the service and let it run for a long time, about 2 hours. This will spam the logfile pretty hard with errors.

The swap space can be disabled or reduced after that.
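
Roughly, the commands involved look like this (a sketch only; paths assume a stock Raspbian/Debian package install of InfluxDB 1.x, so double-check them against your own setup before running anything):

# 1. grow the swap file (Raspbian's dphys-swapfile)
sudo sed -i 's/^CONF_SWAPSIZE=.*/CONF_SWAPSIZE=2048/' /etc/dphys-swapfile
sudo reboot

# 2. with influxd stopped, convert the shards to the disk-based TSI index
sudo systemctl stop influxdb
sudo -u influxdb influx_inspect buildtsi -datadir /var/lib/influxdb/data -waldir /var/lib/influxdb/wal

# 3. in /etc/influxdb/influxdb.conf, under [data]:
#      index-version = "tsi1"

# 4. start it back up and let it chew through the backlog (expect noisy logs for a while)
sudo systemctl start influxdb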

After reaching about 900 MB and starting to swap, memory usage actually dropped back down, and InfluxDB is now using only 200 MB. All my data looks like it has been retained, at least from before InfluxDB started crapping out two weeks ago.

@pinkynrg

pinkynrg commented May 31, 2019

@alexpaxton I see you are one of the programmers who has contributed the most to the InfluxDB project.

Sorry to drag you into the conversation here, but I would love to see some more attention to this post.
Would you be so kind as to address the issue somehow?

This is a similar issue: #6975
And this is a pull request that might fix it: #12362

It would be good to know whether we can stick with Influx on our RPis or not.

Moving to another engine would be painful for a lot of us, and I suspect a good number of people use Influx on RPis for IoT projects.

Please let us know and thank you very much in advance.

@fluffynukeit

@ryan-williams , how large is each of your uncompressed shard groups under the default retention policy of 168 hours? Using mmapped TSM files, compaction jobs can grab a large chunk of the process address space because they are writing out new TSM files (presumably mmapped) while reading other TSM files (also mmapped). Just ballparking, but you end up needing 2x the mmapped address space: one for TSM inputs and one for TSM outputs. So if your shard group duration is large, resulting in large file sizes, you can hit your mmap limit during a compaction job when otherwise you'd have enough headroom in the process address space.

If this is the issue you are encountering, I think #12362 will definitely help you. You might also or alternatively need to change the default retention policy so that the shard group duration is much smaller, so your compaction jobs are handling a much smaller amount of data at a time.
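
If you want to try the smaller-shard-group route, something along these lines should do it (a sketch only, reusing the temps database and autogen policy shown above; note that only newly created shard groups pick up the new duration, existing shards keep theirs):

-- keep data forever (shown as 0s in SHOW RETENTION POLICIES) but cut shard groups from 7 days to 1 day
ALTER RETENTION POLICY "autogen" ON "temps" DURATION INF REPLICATION 1 SHARD DURATION 1d DEFAULT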

@pinkynrg

pinkynrg commented May 31, 2019

@fluffynukeit, what do you think the uncompressed shard group size limit is at a 168-hour RP?
I thought the problem was the total size of the database.

@fluffynukeit

@pinkynrg I think it will depend on how many TSM files you have and how much data you collect in that 168 hours. By default, TSMs all get mmapped to the process address space. So you could have a situation where a compaction job for 168 hours RP works fine for a nearly empty database, but eventually the size of all the TSM files could be large enough that the compaction job fails because there is not enough address space to do it.

Avoiding so much mmapping was one of the motivating reasons for #12362. My use case is that I wanted to keep my data forever on a device with a projected 10 year lifespan. Even if I made my shard group duration a tiny 1 hour (making compaction jobs very small), I would still eventually hit the address space limit as my database filled up with more and more TSM files.

@pinkynrg

pinkynrg commented May 31, 2019

I would like to collect ~5k tags every minute for 90 weeks.
That would also need to be downsampled into 1h and 1d buckets.

I would then route my queries to the best bucket (minute, hour, day), depending on the time delta of the query.

I was waiting to size my shards in the best possible way. Right now they are all 7d long.

@fluffynukeit

fluffynukeit commented May 31, 2019

What matters is the MB size of the TSM files for each shard group. I'd guess that 2x this size is the upper bound of address space needed for a compaction job. TSM size is not easy to predict because the data get compressed, so you just have to test it out and measure it. In my case, if you look at the logs on #12362, the uncompressed shard group is about ~400-450 MB. So let's assume a compaction job requires 900 MB of address space. With an empty DB, there are no mmapped TSM files, so your process address space is close to empty, and there is much more than 900 MB free. The compaction job runs.

Over time, the older shards will stop getting compacted, but they will still take up address space. Let's say you have 15 shard groups each with 200 MB in them, plus an uncompacted hot shard of 450 MB. That's 3.45GB of address space taken up by database data. If your user-space address limit is 3.6 GB, the next compaction job will likely fail because there's not enough free address space to run it. It would need an additional 450 MB to mmap the compaction job output file.

Don't take my size figures as gospel. I'm just making up numbers to be illustrative. You'll have to test it out for your own data and tune it appropriately. Or use #12362.

@pinkynrg

OK, I will test #12362.

Can you confirm that it has been working fine for you so far? No errors at all?

@fluffynukeit

I have not encountered any problems, but I also have not tested it exhaustively. Our device is still in development.

@pinkynrg

pinkynrg commented Jun 2, 2019

In the meantime I think I will also try an unofficial 64-bit image for the RPi.

https://wiki.debian.org/RaspberryPi3

@fluffynukeit, that should technically resolve the issue too, correct?

UPDATE:

We weren't able to stick with a 64-bit OS because, as predicted, it ends up using almost double the memory for other processes (such as the Gunicorn web server), so even if it solved the InfluxDB problem it wouldn't be a good final solution anyway.

@stuartcarnie
Contributor

tl;dr

Server memory resources were low enough that newly compacted TSM files were unable to be mmapped during the final phase of a compaction.

The log files were analyzed and it was determined that a low memory condition (trace_id=0Cy_LzMW000) resulted in a failed compaction:

TSM compaction (start)
Beginning compaction
Compacting file	/var/lib/influxdb/data/collectd/autogen/49/000000021-000000001.tsm
Compacting file	/var/lib/influxdb/data/collectd/autogen/49/000000022-000000001.tsm
Compacting file	/var/lib/influxdb/data/collectd/autogen/49/000000023-000000001.tsm
Compacting file	/var/lib/influxdb/data/collectd/autogen/49/000000024-000000001.tsm
Compacting file	/var/lib/influxdb/data/collectd/autogen/49/000000025-000000001.tsm
Compacting file	/var/lib/influxdb/data/collectd/autogen/49/000000026-000000001.tsm
Compacting file	/var/lib/influxdb/data/collectd/autogen/49/000000027-000000001.tsm
Compacting file	/var/lib/influxdb/data/collectd/autogen/49/000000028-000000001.tsm
Error replacing new TSM files		cannot allocate memory
TSM compaction (end)

This in turn caused temporary TSM files to be orphaned. Subsequent compactions for this group failed due to the orphaned .tsm.tmp files:

TSM compaction (start)
Beginning compaction
Compacting file	/var/lib/influxdb/data/collectd/autogen/49/000000021-000000001.tsm
Compacting file	/var/lib/influxdb/data/collectd/autogen/49/000000022-000000001.tsm
Compacting file	/var/lib/influxdb/data/collectd/autogen/49/000000023-000000001.tsm
Compacting file	/var/lib/influxdb/data/collectd/autogen/49/000000024-000000001.tsm
Compacting file	/var/lib/influxdb/data/collectd/autogen/49/000000025-000000001.tsm
Compacting file	/var/lib/influxdb/data/collectd/autogen/49/000000026-000000001.tsm
Compacting file	/var/lib/influxdb/data/collectd/autogen/49/000000027-000000001.tsm
Compacting file	/var/lib/influxdb/data/collectd/autogen/49/000000028-000000001.tsm
Aborted compaction		compaction in progress: open /var/lib/influxdb/data/collectd/autogen/49/000000028-000000002.tsm.tmp: file exists
TSM compaction (end)

This issue is filed as #14058.

Low memory issues

Fixing #14058 will not address the problems that occur when additional TSM data cannot be mmapped during a low-memory state.

Due to the low memory condition, snapshots eventually began to fail, resulting in the same mmap error, cannot allocate memory. This caused a buildup of .wal files and allowed the in-memory cache to keep growing. Eventually, the server panicked with no available memory after the cache grew too large. On restart, the server continues to crash, as it is unable to allocate sufficient memory to load the existing .tsm files and rebuild the cache from the large number of existing .wal files.

Notes collected during analysis

FileStore.replace enumerates the new files, removing the .tmp extension:

if err := os.Rename(oldName, newName); err != nil {

Creates a new TSMReader:

tsm, err := NewTSMReader(fd, WithMadviseWillNeed(f.tsmMMAPWillNeed))

which attempts to mmap the file:

m.b, err = mmap(m.f, 0, int(stat.Size()))

mmap fails with ENOMEM and returns cannot allocate memory. FileStore.replace handles the error:

if newName != oldName {
    if err1 := os.Rename(newName, oldName); err1 != nil {
        return err1
    }
}
return err

and renames the file back to .tsm.tmp. The error is returned to the caller, ultimately resulting in the Error replacing new TSM files:

log.Info("Error replacing new TSM files", zap.Error(err))

The .tsm.tmp files are not cleaned up, which only happens in Compactor.writeNewFiles:

for _, f := range files {
    if err := os.RemoveAll(f); err != nil {
        return nil, err
    }
}

// We hit an error and didn't finish the compaction. Remove the temp file and abort.
if err := os.RemoveAll(fileName); err != nil {
    return nil, err
}

@jjakob

jjakob commented Aug 27, 2019

I had the same issue with InfluxDB on a Raspberry Pi; it was crashing at startup, even before starting the compaction. Setting swap via dphys-swapfile to 2GB had no effect. I had reservations about converting from TSM to TSI, as there are some other open issues reporting that TSI uses more memory.

The fix was to copy /var/lib/influxdb to a 64-bit Debian Buster based system and run InfluxDB there. This loaded the files and started the compaction immediately, which took about 5 minutes to complete as there were a ton of uncompacted files. Memory usage spiked to about 3.8G resident during the initial startup. Subsequent startups after compaction used about 217M resident.

Copying the database back to the Pi resulted in a successful startup of InfluxDB with it using only 163M resident.

So 64-bit systems will use considerably more RAM during normal operation (217M vs 163M resident), which means a 64-bit build of Raspbian may not be the best choice. It definitely wouldn't have helped in my case: the initial startup took 3.8G, the Pi only has 1G of RAM, and even a 2G swap file may not have been enough.

A long term solution would be to start the compaction way earlier so we don't end up with so many uncompacted files. Perhaps this can be tuned via compaction settings.
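
For reference, the knobs I'd experiment with live under [data] in influxdb.conf (1.x). The values below are illustrative guesses rather than tested recommendations:

[data]
  # snapshot the in-memory cache to TSM sooner, so less accumulates between compactions
  cache-snapshot-memory-size = "16m"          # default "25m"
  cache-snapshot-write-cold-duration = "5m"   # default "10m"
  # fully compact cold shards sooner
  compact-full-write-cold-duration = "1h"     # default "4h"
  # run only one compaction at a time, leaving it the most address space
  max-concurrent-compactions = 1              # default 0 (= half the CPU cores)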

@vogler

vogler commented Aug 27, 2019

Sorry, this thread is getting long. Can someone give a summary of what the problem/status is?
My RPi3's influxdb just started cycling "Aborted compaction".

$ sudo du -h -d1 /var/lib/influxdb
1.4G    /var/lib/influxdb/data
8.0K    /var/lib/influxdb/meta
1.8M    /var/lib/influxdb/wal
1.4G    /var/lib/influxdb
$ sudo find /var/lib/influxdb/ -iname '*.tmp'
/var/lib/influxdb/data/telegraf/autogen/60/000000016-000000002.tsm.tmp
/var/lib/influxdb/data/telegraf/autogen/62/000000014-000000002.tsm.tmp

Moving those .tmp files away and restarting doesn't help.
I would like to keep the data - what should I do?
It looks like there's ~600MB free while influxdb starts, so how come it "cannot allocate memory"? Can't it just compact less at once?

@jjakob

jjakob commented Aug 27, 2019

@vogler I'd first make sure you have as much free memory as possible - stop all other services, and reboot if possible (that will clear possible memory fragmentation).
My Influx used up all available memory on my Pi (~860MiB) all on its own and still ran out, and this was in the startup phase, even before compaction. The only fix was to copy it to a more powerful 64-bit machine (it can be non-ARM; just install the same version of Influx from the repos as was on the failing machine), have it compact, then copy it back again. There was no data loss and there were no other errors. I don't know whether this will keep recurring.
Possibly try setting max-concurrent-compactions = 1.

My errors were "Failed to open shard" log_id=0EqCYeWl000 service=store trace_id=0EqCYf1G000 op_name=tsdb_open db_shard_id=88 error="[shard 88] error opening memory map for file /var/lib/influxdb/data/telegraf/autogen/88/000000007-000000002.tsm: cannot allocate memory"

@aemondis

aemondis commented Sep 8, 2019

@vogler I second @jjakob on the move to another server. It's the only way to recover the data. Until the InfluxDB team addresses the way compaction operates on address-space-limited devices (e.g. a 32-bit OS and the restrictive RAM of the RPi; even with the 4GB Pi 4 I have the same issue!), you have no alternative short of using another time-series DB platform. I tried max-concurrent-compactions = 1, but in my case at least it still fails. I just gave up on the compaction process entirely on the Pi and rely on occasionally shipping everything to a VM on my main PC.

I have since recovered my influx DB multiple times now using the method of transferring to a PC VM. I simply stop the services, tar the files, scp them over, and start the services; within about 2 minutes the files are compacted... then I stop the services, ship the files back, done. The lack of a solution for this issue suggests it may be easier to simply script this shipping between hosts (a rough sketch follows).
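
A rough sketch of what that script could look like (hostnames and paths are placeholders; it assumes the same InfluxDB version on both machines and that influxd is stopped on both ends while the files are in transit):

#!/bin/sh
# on the Pi: stop influx and ship the data off to a 64-bit box ("bigbox" is a placeholder)
sudo systemctl stop influxdb
sudo tar czf /tmp/influx-data.tgz -C /var/lib/influxdb data wal
scp /tmp/influx-data.tgz bigbox:/tmp/

# on bigbox: unpack into its /var/lib/influxdb, start influxd, wait for the
# "TSM compaction (end)" log lines to go quiet, stop influxd, then tar the result
# back up as /tmp/influx-data-compacted.tgz

# back on the Pi: restore the compacted files and restart
scp bigbox:/tmp/influx-data-compacted.tgz /tmp/
sudo tar xzf /tmp/influx-data-compacted.tgz -C /var/lib/influxdb
sudo chown -R influxdb:influxdb /var/lib/influxdb
sudo systemctl start influxdb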

@akors

akors commented Sep 11, 2019

Question for the people who moved away from InfluxDB as a result of this issue: which database do you use instead?

By the way, my current "workaround" is to keep my database size very very small. The main offender for DB size was collectd. I cleaned out the store, created retention policies and continuous queries for data downsampling and now the collectd DB currently sits at around 60 megabytes.

This will probably work just fine for me, but is obviously not a solution if you need high-volume, high-resolution data.

@pinkynrg
Copy link

@ITguyDave, do you really have the same issue with RPI4? Is it with a 32 or 64 bit OS?

@jjakob

jjakob commented Sep 13, 2019

My above-mentioned "fix" only lasted 3.5 days before influxdb started crashing again. Then it was offline for 4 days until it somehow came back on its own; I have no idea how. I didn't check on it until now, so I have no logs older than when the DB came back; it may have OOMed the Pi so hard that it rebooted or something.
The logs show it crashed at least 5 times before successfully starting and compacting; it's been running fine for the 2 days since.

Sep 11 21:20:41 hapi influxd[14877]: ts=2019-09-11T19:20:41.963474Z lvl=info msg="Reading file" log_id=0HpS1GbW000 engine=tsm1 service=cacheloader path=/var/lib/influxdb/wal/_internal/monitor/527/_00831.
Sep 11 21:20:41 hapi influxd[14877]: runtime: out of memory: cannot allocate 1622016-byte block (565379072 in use)
Sep 11 21:20:41 hapi influxd[14877]: fatal error: out of memory

@akors

By the way, my current "workaround" is to keep my database size very very small. The main offender for DB size was collectd. I cleaned out the store, created retention policies and continuous queries for data downsampling and now the collectd DB currently sits at around 60 megabytes.

This will probably work just fine for me, but is obviously not a solution if you need high-volume, high-resolution data.

That's interesting. I think my culprit is telegraf's system metrics, which log a similar amount of data to collectd (every 10s: cpu, load avg, memory, processes, ctx switches, forks, swap usage, disk i/o). My main use case for influx is to log metrics from ebusd via telegraf, which is ~2 measurements/sec max (20/10s), a lot less than telegraf's system metrics.

Can you give details on the continuous queries you created for data downsampling? I wouldn't want to downsample my own data, but it's fine for system metrics, which aren't as important. Maybe there is a way to shorten the compaction intervals, so each compaction has less uncompacted data to load, but I don't know how and don't have time to research it; it would be highly appreciated if someone did and shared their findings.

@aemondis

@ITguyDave, do you really have the same issue with RPI4? Is it with a 32 or 64 bit OS?

Yes - I am still using Raspbian on it (a 32-bit OS), so the same upper memory limit issue occurs after some time. I've since retasked the RPi 4 for other duties, so I haven't played around with it much more, but the RPi 3 is still running InfluxDB.

However, I am absolutely certain that if I were running a 64-bit OS, this compaction issue would not occur. It would still suffer severe performance degradation over time during compaction if memory usage exceeded the 4GB of physical RAM and started paging, but it would still succeed (eventually). I have yet to hear of a stable and supported 64-bit RPi OS in any case. There are several out there, but many lose key functionality of the Raspberry Pi, such as GPIO support, and require a lot of customisation to get going properly.

That's interesting. I think my culprit is telegraf's system metrics, which log a similar amount of data than collectd (every 10s: cpu, load avg, memory, processes, ctx switches, forks, swap usage, disk i/o). My main use case for influx is to log metrics from ebusd via telegraf, which is ~2 measurements/sec max (20/10s), a lot less than telegraf's system metrics.

Interestingly... my use case for InfluxDB is logging both the telegraf system metrics and messages received via Mosquitto MQTT. All in all, I'm peaking at something like 23 metrics/sec when all my MCUs are in full swing - although it does jump around a bit, since some of the sensors can only poll every ~3 seconds, whereas others are polling >4 times per second. The nature of my logging is that there are tens of different metrics, but I am also tagging them by device and sensor. Maybe that partitioning has something to do with it? I'm not sure how InfluxDB treats data tagged like this behind the scenes, if it does anything different at all...

Currently my data directory is 1.9GB. The wal directory is at 405MB across 11000 files (and growing rapidly). I'm already suffering the dreaded compaction issues the same day after running the last compaction, so it's just a matter of time before it dies again...

@akors

akors commented Sep 14, 2019

Can you detail on the continuous queries you created for data downsampling?

# Create retention policies: retain data for a week, a month and 6 months
CREATE RETENTION POLICY "one_week" ON "collectd" DURATION 1w REPLICATION 1 DEFAULT;
CREATE RETENTION POLICY "one_month" ON "collectd" DURATION 30d REPLICATION 1;
CREATE RETENTION POLICY "six_months" ON "collectd" DURATION 182d REPLICATION 1;

# Create continuous queries: downsample to 1 minute for one month, downsample to 10 minutes for 6 months.
CREATE CONTINUOUS QUERY cq_1m_for_one_month ON "collectd" BEGIN SELECT mean(*) INTO "collectd"."one_month".:MEASUREMENT FROM "collectd"."one_week"./.*/ GROUP BY time(1m), * END
CREATE CONTINUOUS QUERY cq_10m_for_six_months ON "collectd" BEGIN SELECT mean(*) INTO "collectd"."six_months".:MEASUREMENT FROM "collectd"."one_month"./.*/ GROUP BY time(10m), * END

Note that this will create "mean_value" and "mean_mean_value" fields in the one_month and six_months retention policies respectively, due to issue #7332.

@aemondis

aemondis commented Oct 9, 2019

For anyone still battling with this issue... Raspbian now has an experimental 64-bit kernel available. I have seen successful compaction on my RPi 4 (4 GB RAM) since switching to that kernel. Technically the 64-bit kernel works on the 3 series too, but I would probably suggest upgrading to an RPi 4 for the extra memory, as it's more likely to sustain larger databases in the long run.

Info on the 64-bit kernel is here: https://www.raspberrypi.org/forums/viewtopic.php?f=29&t=250730

@jjakob

jjakob commented Oct 19, 2019

@ITguyDave I suspect that the higher amount of RAM (4 GB vs 1 GB) is the key factor, not the 64-bit kernel, as I've detailed the memory usage in my previous post. During the compaction influxd used ~3.8G of RAM on an amd64 OS, so while this isn't directly comparable to ARM64, it's indicative. If someone wants to do testing with a 64-bit kernel and OS build on an RPi3 we'll know for sure, but I doubt it'll improve anything; I suspect it'll make it worse.

@aemondis

aemondis commented May 22, 2020

After trying the unofficial Ubuntu release... every issue I had with InfluxDB seems to have gone away. Despite not actually changing any settings, the memory usage is minimal, the CPU load is nil under standard load, and the upper limit of points per second being sent to the DB is extremely high (I'm seeing about 4300 pps with an ancient Intel 74 GB MLC SSD). I have a 53GB database running currently, and no issues to speak of with compacting any more. I hadn't intended to do this test today, but shortly after I posted the earlier comment, InfluxDB went into the dreaded endless service restart loop due to the failing compactions.

For anyone else who comes across this thread and wants a miniature InfluxDB server that can actually handle a moderately sized DB... you won't get a reliable InfluxDB without a fully 64-bit environment, as InfluxDB does not support 32-bit very well. Just do away with Raspbian and Balena and go to the Ubuntu server image as mentioned by @CJKohler. It seems much more responsive, it is running significantly faster, and the RPi4 is running almost cool to the touch for the first time, with InfluxDB running faster than it ever has!

@JanHBade

JanHBade commented Jun 2, 2020

64-bit Raspberry Pi OS is coming: https://www.raspberrypi.org/forums/viewtopic.php?f=117&t=275370

I ordered an 8GB Pi and will test the setup....

@unreal4u

After trying the unofficial Ubuntu release... every issue I had with InfluxDB seems to have gone. Despite not actually changing any settings, the memory usage is minimal, the CPU load is nil under standard load, and the upper limit of pps being sent to the DB is extremely high (I'm seeing about 4300 pps with an ancient Intel 74 GB MLC SSD). I have a 53GB database running currently, and no issues to speak of with compacting any more. I hadn't intended on doing this test today, but shortly after I replied the earlier comment, InfluxDB went into the dreaded endless service restart loop due to the failing compactions.

For anyone else who comes across this thread and wants a minature InfluxDB server that can actually handle a moderately sized DB... you won't get a reliable InfluxDB without a fully 64-bit environment, as InfluxDB does not support 32-bit very well. Just do away with Raspbian and Balena and go to the Ubuntu server image as mentioned by @CJKohler. It seems much more responsive, is running significantly faster and the RPi4 is running almost cool to the touch for the first time, with InfluxDB running faster than it ever has!

Hi @ITguyDave! I'm seeing the exact same problem here with my Pi 4 4GB, plus having some other issues as well (Xorg crashing randomly, and needing to restart networking after each boot because the Pi loses connectivity; more info here: https://www.raspberrypi.org/forums/viewtopic.php?f=28&t=277231 ).

Can you confirm that the GPIO ports are working with Ubuntu?

Greetings.

@aemondis

@unreal4u I personally don't use the RPi for GPIO, as I predominantly use mine for acting as mini low-power servers and have Arduino-based MCUs sending data to them over the network, but according to the maintainer of the Ubuntu image (see https://jamesachambers.com/raspberry-pi-4-ubuntu-server-desktop-18-04-3-image-unofficial/), the standard Raspbian kernel and utilities are available so I see no reason they shouldn't work.

For InfluxDB also... be sure to set the index to file-based rather than in-memory, since a growing DB will inevitably have issues with memory caching at some point once there is enough data. You might also need to play around a bit with the retention period and shard groups to better tune how the database manages the underlying shards. I played around a lot with mine over countless hours (I'm still not happy with the performance, but it's 90% better immediately simply by going Ubuntu). My server is logging about 47 parameters every 5 seconds into InfluxDB with no issues now, via Mosquitto MQTT into Telegraf and the CPU is almost idle, with very low memory usage now. Just be sure you have a decent MLC-based SSD attached to get the most out of it - avoid the SD card wherever you can, and be sure to move all frequently-accessed log files to the SSD rather than SD to avoid the card dying from excessive writes. I run an ancient Intel X25-M 74GB SSD for my InfluxDB via USB3 and it runs brilliantly in that setup, considering the very low power needs of that setup.

Let us know your experience with GPIO?

@unreal4u

unreal4u commented Aug 19, 2020

Thanks @ITguyDave !

I installed Ubuntu Server during the weekend and finally came around last night to play around with the GPIO ports. And yes, I can confirm they do work without problems!

I haven't played a lot with Influxdb yet, but the compaction process did work without issues and the avg. load has come down from a permanent 2.x to <1.0 (not bad considering I run 16+ docker images AND use the GUI as well to display a magic mirror, all while collecting data through USB ports + GPIO).
The responsiveness of Grafana has increased considerably as well, mainly due to Influxdb's faster response time. I do run Influxdb through Docker, but I'll definitely take a look at tuning it; I don't import that much data, however (yet :) )

I was already using an SSD; the only quirk is that I had to go back to using a microSD card for /boot, and I had to limit the memory amount in order to let the RPi recognize the USB ports. After that point it mounts the SSD and I can take advantage of the full 4GB of RAM. More info on that here: https://www.cnx-software.com/2019/11/04/raspberry-pi-4-4gb-models-usb-ports-dont-work-on-ubuntu-19-10/ (the post cites 19.10, but the same applies to 20.04 LTS, which is what I'm using).

All in all, I'm quite happy so far. The only thing I miss is vcgencmd, mainly because the command to turn the screen on and off through the CLI was super simple and I didn't have to fiddle with xrandr as much, but that is solved now as well :)

Thanks!

@vogler

vogler commented Aug 19, 2020

the only quirk is that I had to go back to using a microsd card for /boot

I don't know about Ubuntu, but on Raspbian this is no longer needed after some update. My RPi4 is running solely from SSD.

@unreal4u

I don't know about Ubuntu, but on Raspbian this is no longer needed after some update. My RPi4 is running solely from SSD.

It does not seem possible yet. I had that same setup with Raspberry Pi OS, but my USB ports were not being recognized at boot, so it has to go through the SD card first. Not a big issue; I had the same setup before it was possible to boot directly from USB.

@markg85

markg85 commented Jan 10, 2021

I'm facing this very same issue (on an Odroid XU4).
I'm a bit amazed that this bug (I got here via one from 2016!) has been open and unresolved for this long. Is influxdb not meant to be run on single-board computers?

I did try to copy the files to a desktop, run influx there, let it do its thing, and copy it back. That solved it for about a day or so.
It's not like my influx instance is logging a whole cluster of machines. It has been running fine for a couple of years.
But it seems like once you trigger this issue you just cannot solve it unless you either start removing data (which I don't want to) or upgrade to a beefier machine (which I don't want either).

Influxdb size:

12K     /var/lib/influxdb/meta
3.2G    /var/lib/influxdb/data
33M     /var/lib/influxdb/wal
3.3G    /var/lib/influxdb

Any advice to solve this?

@JanHBade

JanHBade commented Jan 10, 2021 via email

@somera

somera commented Jan 10, 2021

I'm facing this very same issue (on an Odroid XU4).
I'm a bit amazed that this bug (i got here via one from 2016!) is open and unresolved for this long. Is influxdb not meant to be run on single board computers?

I did try to copy the files to a desktop, run influx there, let it do it's thing and copy it back. That solved it for about a day or so.
It's not like my influx instance is logging a whole cluster of machines. It has been running fine for a couple years.
But it seems like once you trigger this issue you just cannot solve it unless you either start removing data (which i don't want to) or upgrade to a beefier machine (which i don't want to either).

Influxdb size:

12K     /var/lib/influxdb/meta
3.2G    /var/lib/influxdb/data
33M     /var/lib/influxdb/wal
3.3G    /var/lib/influxdb

Any advice to solve this?

You can change the RETENTION POLICY. Example

ALTER RETENTION POLICY autogen ON xxxx DURATION 8w REPLICATION 1 SHARD DURATION 7d DEFAULT

@markg85

markg85 commented Jan 10, 2021

I might not entirely understand retention policies...
But from what I get, won't I effectively just be killing my data after that retention period (8 weeks in your example)?

I have no lack of space, only of memory. I don't want to delete my data.

@somera

somera commented Jan 10, 2021

I might not entirely understand retention policies..
But from what i get, won't i just be effectively killing my data after that retention period (8 weeks in your example)?

I have no lack of space. Only of memory. I don't want to delete my data.

Then you should switch to a 64-bit system with more RAM (Pi 4 4GB/8GB). There is no other solution.

But if you switch to a Pi 4 you will get the same problem as now. Just later. ;)

At some point you have to delete very old data. Or use another (better) system than influxdb. Is there one without these problems?

@somera

somera commented Jan 10, 2021

@markg85 the DURATION depends on your data. For me the example works, because I'm collecting a lot of data, every 10 seconds, sometimes from 10 computers. Perhaps with your data it would work with 10w or 14w.

@markg85

markg85 commented Jan 10, 2021

I might not entirely understand retention policies..
But from what i get, won't i just be effectively killing my data after that retention period (8 weeks in your example)?
I have no lack of space. Only of memory. I don't want to delete my data.

Than you should switch to 64bit system with more RAM (Pi 4 4GB/8GB). There is no other solution.

But if you switch to Pi 4 you will get the same problem like now. But later. ;)

Sometime you have to delete very old data. Or use other (better) system than influxdb. Is there one without those problems?

Is that seriously the way influxdb is developed? Stuffing all its data in memory so that eventually you need to upgrade...
Now it's a bit ironic that they themselves host this blog post: https://www.influxdata.com/blog/the-worlds-smallest-influxdb-server/

I'm not going to remove my data. Switching to another SBC is also difficult (not impossible) because the one this is running on hosts a few more services than just influxdb; I'd hate to migrate all that. Also, the XU4 still has quite a powerful CPU. I'd argue the XU4 is still the more powerful board, just not in terms of memory.

If there is a better alternative out there for 32-bit ARM setups, I'm all ears :)

@markg85

markg85 commented Jan 10, 2021

This looks very promising! https://www.youtube.com/watch?v=C4YV-9CrawA It covers the Prometheus "tsdb" that was new with their 2.0 release (3 years ago). Why isn't influxdb using something like it?

@aemondis

I might not entirely understand retention policies..
But from what i get, won't i just be effectively killing my data after that retention period (8 weeks in your example)?
I have no lack of space. Only of memory. I don't want to delete my data.

Than you should switch to 64bit system with more RAM (Pi 4 4GB/8GB). There is no other solution.

But if you switch to Pi 4 you will get the same problem like now. But later. ;)

Sometime you have to delete very old data. Or use other (better) system than influxdb. Is there one without those problems?

@somera Not entirely true... after going to 64-bit Ubuntu on my RPi 4, I now have a 64GB database on an SSD that has been running flawlessly 24x7 for many months. Prior to that, the database's maximum size was 3GB before compaction issues would prevent the influx service from even starting. The key comes down to disabling the in-memory index by switching the index method in influxdb.conf:
index-version = "tsi1"

By default it is "inmem", meaning the index sits in memory. This will cause issues with compaction, as in order to compact it needs to work on multiple shards at the same time in memory, so depending on the size of the shards they could be too big to fit in memory (especially on a 32-bit OS). This is why your mileage may vary with adjustment of retention periods, a VERY complex topic in InfluxDB in itself, as the number of data points will affect the amount of data requiring compaction. You might see some improvement by adjusting to shorter retention, but it heavily depends on the data being stored. I myself actually have indefinite retention running, and have had no issues at all since moving away from inmem. Performance is slightly lower, but as it's on an SSD over USB3 thanks to the RPi4, it's not all that noticeable. As it's an RPi anyway, the memory isn't blazing fast regardless, so I personally would not be concerned about the performance decrease unless you're running a production workload.

IMPORTANT: do NOT change to tsi1 unless you have a decent external SSD, as the increased disk activity will destroy any standard microSD card: most non-industrial cards don't have any wear-levelling capability, so they will hit the limit of writes the card can handle and the memory chips on the card itself will fail. I can speak to this from experience...

Is that seriously the way how influxdb is developed. Stuffing all it's data in memory and thus eventually you need to upgrade...
Now it's a bit ironic that they themselves host this blog post: https://www.influxdata.com/blog/the-worlds-smallest-influxdb-server/

@markg85 I've said it before: InfluxDB cares not for our plight. To them, it's all about Facebook-scale usage. 32-bit and IoT are definitely not their focus; we don't even show up on the radar, and that much is clear from the lack of any dev response to this thread. Maybe someone who has the skills could create a fork of InfluxDB some day, to actually handle small hardware? I see a massive market for this type of thing, since IoT sensors out in the wild are all too common, and more often than not they don't have the luxury of high-end hardware or network bandwidth. A distributed architecture with Influx running on RPis scattered in remote farm sheds, surrounded by LoRa sensors in a field, is just one example where a solution like this could thrive.

FWIW - the real irony with influx is that after I switched the index from inmem to tsi1, performance actually IMPROVED, because influx was no longer logging thousands of errors every second to syslog. My log files had been rotating on average every hour because of influx error messages. It's definitely not what you would expect when moving away from an in-memory index...

If there is a better alternative out there for 32 bit ARM setups, i'm all ears :)

@markg85 Ubuntu 64-bit (I run this; the RPi3 can technically run it too). 32-bit you should avoid at all costs. If the CPU doesn't have support for 64-bit, try my comments above about switching to tsi1 and you might be able to work around the compaction issue. Compaction for me has been flawless ever since. It actually compacted the data that had been failing to compact for 6 months after I switched it, even though that took 15 minutes to complete... much more reliable than copying back and forth to desktop hardware.

@somera

somera commented Jan 11, 2021

@aemondis thx for the info!

@markg85

markg85 commented Jan 11, 2021

@aemondis, thank you for that detailed reply! That's much appreciated!

My appreciation for influxdb went straight through the floor. There are lots and lots of IoT/sensor projects out there where single-board computers are involved, and influxdb often is too. Then to figure out that you're basically installing a time bomb is disappointing, to say the least.

In these environments it's reasonable to expect an SBC to handle some fancy functionality. If you need a desktop PC or a much higher-end SBC for it, you quickly just don't use it.

In my specific case I'm running the Odroid XU4 with the home server package. That home server adds a daughterboard giving you access to two SATA connections (and a bunch of other stuff). I can't just throw that away, as there isn't a real alternative for it.

I get that it's ill-advised to use 32-bit platforms. I myself am a developer and I too tend to just dismiss them and say "use 64-bit". Truth be told, that's for the desktop and the x86-64 architecture, not ARM.

I'm already running a second SBC (rk3399-based) for media player purposes.
And a third one (a Raspberry Pi 1... it's going to encounter this issue in a couple of years) just for collecting net power statistics.
I'm not going to add a fourth one just for influxdb.

I don't know how I'm going to solve this issue. Very definitely not with a fourth SBC... I might add it to my rk3399. I might search for alternatives... I just don't know. Yet.

@fluffynukeit

fluffynukeit commented Jan 11, 2021

I wrote the #12362 patch to prevent the compaction issue on 32 bit systems, and I believe it does work (at the very least it did). However, my former company, the one for whom I was doing this work, ran into subsequent problems when running influxdb even with that patch, or perhaps because of it. The problem was that as the DB content got larger and larger, influxdb took longer and longer to boot up on a SBC running a microSD card. At one point it was over 5 minutes, and that kind of bootup time is just not acceptable for our application. I tried to mess with the DB configuration, trying different indexing methods and such, but I was unable to find a solution. We had had enough headaches with influx at that point to pull the plug. We eliminated it from our device entirely, which was a real headache because it was the keystone of our software architecture. And why wouldn't it be the keystone? It's a great convenience to use a web service to stick your data into an easily searchable database with great compression that tells the entire story of your system. Sensor data, configuration, syslogd, etc, all together in a neat package. But it just failed for us in enough ways that we had to move on. We don't have a replacement solution, either. If you want to record data on this system, you have to plug it via crossover into a PC that is hosting an influx instance.

I think it's true that influxdb is just targeting a different use case than the one we want. They want to be the DB in the cloud that boots up and runs always, never shutting down, collecting metrics from net-capable devices, and running nearly constant queries for analytics. In many embedded cases, mine included, we just want something that boots up quickly, records data efficiently, is robust to power loss, and might have no or only sporadic internet access. I don't even need to run queries very often; I really only look at the data if something went wrong.

Notice that these are all implementation gripes. I think the influxdb web interface is pretty good, and one reason I chose it originally was that it was one of only a few options that allowed me to specify my own timestamps. An embedded-focused alternative could keep the same interface but make different tradeoff decisions in the implementation behind the scenes.

@markg85

markg85 commented Jan 11, 2021

I wrote the #12362 patch to prevent the compaction issue on 32 bit systems, and I believe it does work (at the very least it did). However, my former company, the one for whom I was doing this work, ran into subsequent problems when running influxdb even with that patch, or perhaps because of it. The problem was that as the DB content got larger and larger, influxdb took longer and longer to boot up on a SBC running a microSD card. At one point it was over 5 minutes, and that kind of bootup time is just not acceptable for our application. I tried to mess with the DB configuration, trying different indexing methods and such, but I was unable to find a solution. We had had enough headaches with influx at that point to pull the plug. We eliminated it from our device entirely, which was a real headache because it was the keystone of our software architecture. And why wouldn't it be the keystone? It's a great convenience to use a web service to stick your data into an easily searchable database with great compression that tells the entire story of your system. Sensor data, configuration, syslogd, etc, all together in a neat package. But it just failed for us in enough ways that we had to move on. We don't have a replacement solution, either. If you want to record data on this system, you have to plug it via crossover into a PC that is hosting an influx instance.

Thank you for your insight!
It's sad to see that influx is, apparently, just not meant to be used by us SBC users.

Notice that these are all implementation gripes. I think the influxdb web interface is pretty good, and one reason I chose it originally was that it was one of only a few options that allowed me to specify my own timestamps. An embedded-focused alternative could keep the same interface but make different tradeoff decisions in the implementation behind the scenes.

I don't think you need a lot of (or any) trade-offs. As a user you just need to be made aware that none of your queries should go over your memory boundary. So for instance, say you have 10GB of collected data spread over years of collecting. If the queries you run never reach back further than roughly the most recent 3.2GB of data, then there would be no issue at all. And I'd be willing to bet that in most influxdb use cases this is enough of a range to go back at least a year.

And if you go further back, say to some data that sits at the 4GB point, you'd simply suffer the performance penalty that comes with it.

But... I'm also guessing that the storage engine isn't as efficient at data storage as the one in prometheus 2.0. And depending on the type of data you collect, you can use other efficient ways of compression for that specific data type. I don't know what influx does and does not do here, but I just have a hunch that it can be optimized a lot.

@aemondis

@markg85 & @fluffynukeit

I wrote the #12362 patch to prevent the compaction issue on 32 bit systems, and I believe it does work (at the very least it did). However, my former company, the one for whom I was doing this work, ran into subsequent problems when running influxdb even with that patch, or perhaps because of it. The problem was that as the DB content got larger and larger, influxdb took longer and longer to boot up on a SBC running a microSD card. At one point it was over 5 minutes, and that kind of bootup time is just not acceptable for our application. I tried to mess with the DB configuration, trying different indexing methods and such, but I was unable to find a solution. We had had enough headaches with influx at that point to pull the plug. We eliminated it from our device entirely, which was a real headache because it was the keystone of our software architecture. And why wouldn't it be the keystone? It's a great convenience to use a web service to stick your data into an easily searchable database with great compression that tells the entire story of your system. Sensor data, configuration, syslogd, etc, all together in a neat package. But it just failed for us in enough ways that we had to move on. We don't have a replacement solution, either. If you want to record data on this system, you have to plug it via crossover into a PC that is hosting an influx instance.

I recall looking at the suggested fix code, and I reckon that might have simply been due to the sheer volume of parsing required at the file-system layer of the database files. Depending on the retention and compaction configuration and volume, the number of files can blow out dramatically. On my InfluxDB RPi4, before compaction I ran a count and it returned 15,000,000 files. After I found the tsi1 approach, that dropped down to about 150,000. SBC hardware is pretty woefully underspecced for handling such volumes.

Thank you for your insight!
It's sad to see influx is, apparently, just not meant to be used for us SBC users.

I think a lot of it comes down to the architecture and distribution. Influx could be made to work, but there will be no "single" solution. Whilst there is technically no reason why Influx shouldn't be able to run on an SBC, the volume of data and the potential overheads involved just make it too heavy for big workloads centrally. As with any IoT-like solution, it makes sense to scale out, such as I mentioned earlier: various LoRa-enabled sensors in a field talking back to a localised SBC that acts as an aggregator. You would then have an upstream "central" console that consolidates the aggregated information or even pulls the raw data periodically. Through this architecture, you would then focus on having very short retention on the regional collectors, and this would work around the underlying limitations of Influx on small systems. I however still firmly believe that Influx has failed the SBC community in simply refusing to acknowledge that SBC is a viable deployment platform for it, so there really does need to be consideration in the solution on how to make it work properly on such hardware; and this wouldn't be hard for someone who is intimately familiar with the code architecture underlying Influx.

Notice that these are all implementation gripes. I think the influxdb web interface is pretty good, and one reason I chose it originally was that it was one of only a few options that allowed me to specify my own timestamps. An embedded-focused alternative could keep the same interface but make different tradeoff decisions in the implementation behind the scenes.

I tried several alternatives, and came to a realisation that most viable alternatives were either not as efficient to query, or had poor/non-existent compression capabilities (e.g. the Postgres-enabled TSDB solutions). Influx is a fantastic product that has simply failed to embrace one of the most potent use cases for it: data logging in the field. This is essentially what underpins SBC and where such solutions are most prominent.

I don't think you need a lot (or any) trade-offs. As a user you just need to be made aware that none of your queries should go over your memory boundary. So for instance, say you have 10GB of collected data spread over years of collecting. If the queries you run never reach a point now till as far back as what 3.2GB is then there would be no issue at all. And i'd be willing to bet that in most of the influxdb usecases this is enough of a range to go back at least a year.
And if you go further back, say to some data that sits at the 4GB point, that you'll simply suffer the performance penalty that comes with it.

If one is querying such a large volume of data, I would suggest aggregation should be more prominently used to avoid such a penalty. Querying such a long range of raw data would be a big no-no and would probably even bring high-end server hardware to its knees. To facilitate this, it would be better to batch-query in loops and hope the memory model of Influx itself can ensure you don't end up utilising all available memory with stale data (i.e. it should age out queries that are no longer being used).

I haven't tested it, but I wonder if using tsi1 would actually change the memory caching behaviour, since the shard should be remaining on disk in this configuration, rather than being loaded into memory?

But... I'm also guessing that the storage engine isn't as efficient in data storage as that one in prometheus 2.0. And depending on the type of data you collect you can use other efficient ways of compression for that specific data type. I don't know what influx does and does not do here, but i just have a hunch that it can be optimized a lot.

I took a look at Prometheus a long time back; it would be interesting to hear your experience with the latest version vs. InfluxDB. I can't recall why I decided against it at the time, but there was a specific reason for it (possibly Grafana-related?). I did try it in the early days though, and for some reason it didn't do what I needed it to. I might play around with it again if I get the time.

@aemondis

@aemondis, thank you for that detailed reply! That's much appreciated!

My appreciation for influxdb went straight through the floor. There are lots and lots of IoT/sensor projects out there where single board computers are involved, influxdb often is too. Then to figure out that you're basically installing a timebomb is disappointing to say the least.

In these environments it can be expected to run a sbc for some fancy functionality. If you need to run a desktop pc or a more higher end sbc you quickly just don't use it.

In my specific case i'm running the odroid XU4 with the home server package. That home server adds a daughter board giving you access to two SATA connections (and a bunch of other stuff). I can't just throw that away as there isn't a real alternative for it.

I get that it's ill advised to use 32bit platforms. I myself am a developer and i too also just discard it and say "use 64 bit". Truth be told, that's for the desktop and the x86-64 architecture, not ARM.

I'm already running a second SBC (rk3399 based) for media player purposes.
And a third one (raspberri pi 1.. it's going to encounter this issue in a couple of years) for just collecting net power statistics.
I'm not going to add a fourth one just for influxdb.

I don't know how i'm going to solve this issue. Very definitely not a fourth SBC.. I might add it to my rk3399. I might search for alternatives.. i just don't know. Yet.

Interesting re: the Odroid XU4; it's not a platform I've looked into (there are so many SBC solutions out there these days...), and time is rarely on my side of late. Totally agree on 32-bit vs. 64-bit: it's not always necessary, but most modern ARM IP is 64-bit enabled, though more often than not it's simply not available due to OS or HW limitations. 64-bit generally is better optimised in modern hardware though, even on ARM, so you can often extract slightly more performance out of it; plus it has the advantage of a seemingly endless amount of memory being available for use (rather than the restrictive 4GB upper limit). Even if you only have 4GB of RAM in the system, the optimisations in memory management can still be beneficial, as I discovered on my RPi 3 (which saw a healthy boost switching from 32-bit Ubuntu to 64-bit Ubuntu).

Give the tsi1 approach a try - you might find it just makes Influx workable on the 32-bit platform. It will still utilise a lot of memory during compaction, but if you are splitting the shards into smaller sizes, you might just find it will fit in a 32-bit footprint. You will definitely need to play around a bit, as it's highly dependent on your data volumes. It took me almost a month of daily tuning to get mine running the way I like it, but it paid off and it has been flawless ever since. Memory usage during compaction peaks at 2.4GB on mine, but it has not failed a compaction since.

Completely agree on the "timebomb". It reminds me of early-day IBM/Dell servers (if I recall) that had a time bomb in the BIOS that after a certain date would simply "disable" the RAID controller. Or even the old Y2K thing. The fact that compaction simply fails at a certain point due to data volume is bad software design, and there are absolutely ways to work around it. In truth, with a large enough volume of data on high-end hardware, in theory you could actually encounter the same issue. This suggests it is a bug that, if resolved, would benefit both the SBC and enterprise markets, and fixing it could also make the code more "graceful" in handling such events (instead of spamming syslog with errors, retrying, failing on the same issue, spamming, etc. until eventually the service can't load any more). Believe me, whilst I was trying to find a solution to this I was frustrated, and even to this day I would not recommend InfluxDB until such an issue is fixed. It's too much of a liability, as it is really a time bomb triggered by data volume.

@aemondis

aemondis commented Jan 12, 2021

FWIW - here's some useful reading material to understand tuning of Influx that helped me get mine working: https://www.influxdata.com/blog/influxdb-shards-retention-policies/
https://www.influxdata.com/blog/simplifying-influxdb-retention-policy-best-practices/
https://docs.influxdata.com/influxdb/v1.8/guides/downsample_and_retain/
https://docs.influxdata.com/influxdb/v1.8/query_language/manage-database/
https://community.influxdata.com/t/points-per-shard-group/7419

I am still using the 1.x variant of Influx and have yet to try out the 2.x version - but it looks to have a very different architecture for downsampling data that is a bit more fine-grained. I might need to get a PhD to get my head around it first though...

This is what I have currently (I have just altered the standard autogen retention policy on the telegraf DB):
> show retention policies on telegraf
name    duration  shardGroupDuration replicaN default
----    --------  ------------------ -------- -------
autogen 9600h0m0s 24h0m0s            1        true

The above is simply 400 days of history. If you query a long range of data, you will get memory issues - but I tend to query specific limited durations from within that window, and when I aggregate results it is grouped by larger averages, thus reducing the load.
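
As an illustration of what I mean by limited durations and larger averages (the measurement and field names below are just examples from a default telegraf setup, so adjust them to your own schema):

-- only the last 7 days, averaged into 10-minute buckets rather than raw points
SELECT mean("usage_idle")
FROM "telegraf"."autogen"."cpu"
WHERE time > now() - 7d
GROUP BY time(10m), "host"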

@markg85

markg85 commented Jan 13, 2021

@aemondis Just a note regarding data aggregation and more powerful hardware. I both agree and disagree :)
In terms of data collection with many sensors (or many servers, whatever one collects), if you aggregate, say, 500MiB or more of data a year, then I agree. You eventually just need to either be smarter with the data or have beefier hardware to handle it.

But in the IoT world, especially in a home environment like mine, I'm aggregating data from my net power meter and, say, about 15 IoT devices plus weather information. That should be very possible on an SBC for years!

Just goes to show that influx isn't designed for home IoT usage. Even though one of their use cases is IoT (https://www.influxdata.com/customers/iot-data-platform/), there too it seems very much tailored to commercial needs. Funny side note: that page of theirs mentions IoT examples. Why the **** do they mention planets???

@lesam
Contributor

lesam commented Mar 8, 2022

We no longer support 32 bit systems, closing.

@lesam lesam closed this as completed Mar 8, 2022