
error compacting TSM files: cannot allocate memory #6975

Closed

denis-arruda opened this issue Jul 7, 2016 · 28 comments

@denis-arruda

denis-arruda commented Jul 7, 2016

Bug report

System info: InfluxDB 0.13.0-1, Linux softsw69 2.6.34.10-0.6-desktop #1 SMP PREEMPT 2011-12-13 18:27:38 +0100 x86_64 x86_64 x86_64 GNU/Linux (SUSE Linux)

Steps to reproduce:

After almost two days of writing around 5,000 metrics per second, InfluxDB crashed. When I start the InfluxDB process again, I get this error in the log:

[cacheloader] 2016/07/07 10:40:44 reading file /home/influxdb/wal/performance/default/2/_02089.wal, size 10485900
[cacheloader] 2016/07/07 10:40:44 reading file /home/influxdb/wal/_internal/monitor/4/_00002.wal, size 0
[shard] 2016/07/07 10:40:44 /home/influxdb/data/_internal/monitor/4 database index loaded in 1.725057ms
[store] 2016/07/07 10:40:44 /home/influxdb/data/_internal/monitor/4 opened in 1.04092455s
[cacheloader] 2016/07/07 10:40:46 reading file /home/influxdb/wal/performance/default/2/_02090.wal, size 10487370
[cacheloader] 2016/07/07 10:40:48 reading file /home/influxdb/wal/performance/default/2/_02091.wal, size 10487395
[cacheloader] 2016/07/07 10:40:50 reading file /home/influxdb/wal/performance/default/2/_02092.wal, size 6489923
[cacheloader] 2016/07/07 10:40:51 reading file /home/influxdb/wal/performance/default/2/_02093.wal, size 10486224
[cacheloader] 2016/07/07 10:40:53 reading file /home/influxdb/wal/performance/default/2/_02094.wal, size 10486823
[cacheloader] 2016/07/07 10:40:54 reading file /home/influxdb/wal/performance/default/2/_02095.wal, size 10486442
[cacheloader] 2016/07/07 10:40:56 reading file /home/influxdb/wal/performance/default/2/_02096.wal, size 6498702
[cacheloader] 2016/07/07 10:40:57 reading file /home/influxdb/wal/performance/default/2/_02097.wal, size 10487287
[cacheloader] 2016/07/07 10:40:59 reading file /home/influxdb/wal/performance/default/2/_02098.wal, size 10487150
[cacheloader] 2016/07/07 10:41:01 reading file /home/influxdb/wal/performance/default/2/_02099.wal, size 10486054
[cacheloader] 2016/07/07 10:41:04 reading file /home/influxdb/wal/performance/default/2/_02100.wal, size 6571972
[cacheloader] 2016/07/07 10:41:05 reading file /home/influxdb/wal/performance/default/2/_02101.wal, size 971433
[cacheloader] 2016/07/07 10:41:05 reading file /home/influxdb/wal/performance/default/2/_02102.wal, size 0
[tsm1] 2016/07/07 10:41:05 beginning full compaction of group 0, 4 TSM files
[tsm1] 2016/07/07 10:41:05 compacting full group (0) /home/influxdb/data/performance/default/2/000000336-000000006.tsm (#0)
[tsm1] 2016/07/07 10:41:05 compacting full group (0) /home/influxdb/data/performance/default/2/000000416-000000005.tsm (#1)
[tsm1] 2016/07/07 10:41:05 compacting full group (0) /home/influxdb/data/performance/default/2/000000464-000000005.tsm (#2)
[tsm1] 2016/07/07 10:41:05 compacting full group (0) /home/influxdb/data/performance/default/2/000000496-000000005.tsm (#3)
[tsm1] 2016/07/07 10:41:05 error compacting TSM files: cannot allocate memory
[tsm1] 2016/07/07 10:41:05 beginning level 1 compaction of group 0, 2 TSM files
[tsm1] 2016/07/07 10:41:05 compacting level 1 group (0) /home/influxdb/data/performance/default/2/000000521-000000001.tsm (#0)
[tsm1] 2016/07/07 10:41:05 compacting level 1 group (0) /home/influxdb/data/performance/default/2/000000522-000000001.tsm (#1)
[tsm1] 2016/07/07 10:41:05 beginning level 3 compaction of group 0, 4 TSM files
[tsm1] 2016/07/07 10:41:05 compacting level 3 group (0) /home/influxdb/data/performance/default/2/000000500-000000003.tsm (#0)
[tsm1] 2016/07/07 10:41:05 compacting level 3 group (0) /home/influxdb/data/performance/default/2/000000504-000000003.tsm (#1)
[tsm1] 2016/07/07 10:41:05 compacting level 3 group (0) /home/influxdb/data/performance/default/2/000000508-000000003.tsm (#2)
[tsm1] 2016/07/07 10:41:05 compacting level 3 group (0) /home/influxdb/data/performance/default/2/000000512-000000003.tsm (#3)
fatal error: runtime: out of memory

runtime stack:
runtime.throw(0xce4740, 0x16)
/usr/local/go/src/runtime/panic.go:547 +0x90
runtime.sysMap(0xc84ef00000, 0x100000, 0x456100, 0x10efab8)

What is happening?

I know there is still space on disk.

@carbolymer

This is the same issue: #5440
InfluxDB is using a ridiculously large amount of RAM.

@jwilder
Contributor

jwilder commented Jul 18, 2016

@denis-arruda Can you test with a current nightly and include the full stack trace as a gist? fatal error: runtime: out of memory indicates you ran out of memory, but the stack trace does not have enough information to indicate where the allocation failed.

@denis-arruda
Author

denis-arruda commented Jul 18, 2016

I cannot test it now.

Meanwhile, I am sending the complete log for version 0.13.0:

data.log.txt

@jwilder
Contributor

jwilder commented Jul 18, 2016

@denis-arruda That trace shows a lot of writes blocked up in the WAL and a compaction run which might have triggered the OOM. The backed up writes are likely the cause of the memory issue though. What kind of disks do you have? Can you run iostat -xzd 1 and attach the output? Your disks might be saturated.

#7024 has a fix that may help you. It will be in tonight's nightly build.
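For reference, a minimal way to capture the requested stats (a sketch, assuming the sysstat package is installed and roughly a minute of samples is enough to attach):

# sample extended device stats once per second, 60 samples, save to a file
iostat -xzd 1 60 > iostat.txt

Consistently high %util and await values in that output would suggest the disks are saturated.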

@denis-arruda
Author

The command you asked for returns "command not found" on my server.

I don't know the disk type. I know the server I am using is a virtual machine on VMware.

@jwilder
Contributor

jwilder commented Jul 18, 2016

You would need to install the sysstat package on your system if iostat is not available.
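For example, installing it would typically look like this (a sketch; the package manager depends on the distro, e.g. SUSE as in the original report or Raspbian as in later comments):

# SUSE / openSUSE
sudo zypper install sysstat

# Debian / Ubuntu / Raspbian
sudo apt-get install sysstat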

@ericb48

ericb48 commented Apr 10, 2017

I have the same messages but InfluxDB doesn't crash. It runs on a Scaleway ARM server.

Apr 10 11:06:29 scw-c24a7e influxd[11431]: [I] 2017-04-10T09:06:29Z beginning optimize compaction of group 0, 8 TSM files engine=tsm1
Apr 10 11:06:29 scw-c24a7e influxd[11431]: [I] 2017-04-10T09:06:29Z beginning optimize compaction of group 0, 8 TSM files engine=tsm1
Apr 10 11:06:29 scw-c24a7e influxd[11431]: [I] 2017-04-10T09:06:29Z compacting optimize group (0) /data/influxdb/data/telegraf/one_year/197/000000033-000000005.tsm (#0) engine=tsm1
Apr 10 11:06:29 scw-c24a7e influxd[11431]: [I] 2017-04-10T09:06:29Z compacting optimize group (0) /data/influxdb/data/telegraf/one_year/197/000000033-000000005.tsm (#0) engine=tsm1
Apr 10 11:06:29 scw-c24a7e influxd[11431]: [I] 2017-04-10T09:06:29Z compacting optimize group (0) /data/influxdb/data/telegraf/one_year/197/000000065-000000005.tsm (#1) engine=tsm1
Apr 10 11:06:29 scw-c24a7e influxd[11431]: [I] 2017-04-10T09:06:29Z compacting optimize group (0) /data/influxdb/data/telegraf/one_year/197/000000065-000000005.tsm (#1) engine=tsm1
Apr 10 11:06:29 scw-c24a7e influxd[11431]: [I] 2017-04-10T09:06:29Z compacting optimize group (0) /data/influxdb/data/telegraf/one_year/197/000000065-000000006.tsm (#2) engine=tsm1
Apr 10 11:06:29 scw-c24a7e influxd[11431]: [I] 2017-04-10T09:06:29Z compacting optimize group (0) /data/influxdb/data/telegraf/one_year/197/000000065-000000006.tsm (#2) engine=tsm1
Apr 10 11:06:29 scw-c24a7e influxd[11431]: [I] 2017-04-10T09:06:29Z compacting optimize group (0) /data/influxdb/data/telegraf/one_year/197/000000081-000000004.tsm (#3) engine=tsm1
Apr 10 11:06:29 scw-c24a7e influxd[11431]: [I] 2017-04-10T09:06:29Z compacting optimize group (0) /data/influxdb/data/telegraf/one_year/197/000000081-000000004.tsm (#3) engine=tsm1
Apr 10 11:06:29 scw-c24a7e influxd[11431]: [I] 2017-04-10T09:06:29Z compacting optimize group (0) /data/influxdb/data/telegraf/one_year/197/000000081-000000005.tsm (#4) engine=tsm1
Apr 10 11:06:29 scw-c24a7e influxd[11431]: [I] 2017-04-10T09:06:29Z compacting optimize group (0) /data/influxdb/data/telegraf/one_year/197/000000081-000000005.tsm (#4) engine=tsm1
Apr 10 11:06:29 scw-c24a7e influxd[11431]: [I] 2017-04-10T09:06:29Z compacting optimize group (0) /data/influxdb/data/telegraf/one_year/197/000000097-000000004.tsm (#5) engine=tsm1
Apr 10 11:06:29 scw-c24a7e influxd[11431]: [I] 2017-04-10T09:06:29Z compacting optimize group (0) /data/influxdb/data/telegraf/one_year/197/000000097-000000004.tsm (#5) engine=tsm1
Apr 10 11:06:29 scw-c24a7e influxd[11431]: [I] 2017-04-10T09:06:29Z compacting optimize group (0) /data/influxdb/data/telegraf/one_year/197/000000113-000000004.tsm (#6) engine=tsm1
Apr 10 11:06:29 scw-c24a7e influxd[11431]: [I] 2017-04-10T09:06:29Z compacting optimize group (0) /data/influxdb/data/telegraf/one_year/197/000000113-000000004.tsm (#6) engine=tsm1
Apr 10 11:06:29 scw-c24a7e influxd[11431]: [I] 2017-04-10T09:06:29Z compacting optimize group (0) /data/influxdb/data/telegraf/one_year/197/000000129-000000004.tsm (#7) engine=tsm1
Apr 10 11:06:29 scw-c24a7e influxd[11431]: [I] 2017-04-10T09:06:29Z compacting optimize group (0) /data/influxdb/data/telegraf/one_year/197/000000129-000000004.tsm (#7) engine=tsm1
Apr 10 11:06:29 scw-c24a7e influxd[11431]: [I] 2017-04-10T09:06:29Z error compacting TSM files: cannot allocate memory engine=tsm1
Apr 10 11:06:29 scw-c24a7e influxd[11431]: [I] 2017-04-10T09:06:29Z error compacting TSM files: cannot allocate memory engine=tsm1

@think-free

Same here on Raspberry Pi 3 + Raspbian; got these two errors:

error adding new TSM files from snapshot: cannot allocate memory engine=tsm1
error writing snapshot: cannot allocate memory engine=tsm1

free -m :

             total       used       free     shared    buffers     cached
Mem:           925        835         89         46         38        351
-/+ buffers/cache:        445        479
Swap:           99          3         96

Influx :

12497 influxdb 20 0 2079960 365808 7948 S 5.9 38.6 167:52.15 influxd

@Lenbok

Lenbok commented Sep 16, 2017

Also getting the same error as @think-free on my Raspberry Pi 3.

Influx had been happily running for ages, then all of a sudden started spamming the logs with the two error messages (and the database in /var/lib/influxdb started rapidly expanding), eventually running out of disk space. After clearing up the logs and rebooting, it seems to be running fine, although it looks like the last few days of data are gone :-(.

Perhaps there is a memory leak somewhere?

@mvillarejo

Same behaviour here

@jpetermans

jpetermans commented Dec 1, 2017

Getting similar messages hitting the log every second. Raspberry Pi 3, Influx 1.3.7. My log files are growing faster than my databases at this stage. Any direction on a fix?

Nov 30 22:36:40 3jas11 influxd[14240]: [I] 2017-12-01T06:36:40Z beginning level 3 compaction of group 0, 4 TSM files engine=tsm1
Nov 30 22:36:40 3jas11 influxd[14240]: [I] 2017-12-01T06:36:40Z compacting level 3 group (0) /var/lib/influxdb/data/collectd/annual/59/000000004-000000003.tsm (#0) engine=tsm1
Nov 30 22:36:40 3jas11 influxd[14240]: [I] 2017-12-01T06:36:40Z compacting level 3 group (0) /var/lib/influxdb/data/collectd/annual/59/000000009-000000003.tsm (#1) engine=tsm1
Nov 30 22:36:40 3jas11 influxd[14240]: [I] 2017-12-01T06:36:40Z compacting level 3 group (0) /var/lib/influxdb/data/collectd/annual/59/000000013-000000003.tsm (#2) engine=tsm1
Nov 30 22:36:40 3jas11 influxd[14240]: [I] 2017-12-01T06:36:40Z compacting level 3 group (0) /var/lib/influxdb/data/collectd/annual/59/000000017-000000003.tsm (#3) engine=tsm1
Nov 30 22:36:40 3jas11 influxd[14240]: [I] 2017-12-01T06:36:40Z error compacting TSM files: cannot allocate memory engine=tsm1
Nov 30 22:36:40 3jas11 influxd[14240]: [I] 2017-12-01T06:36:40Z beginning level 2 compaction of group 0, 3 TSM files engine=tsm1

@Lenbok

Lenbok commented Dec 9, 2017

Mine started doing this again (continuously for the last three days before I noticed). Restarting influxdb didn't help. On the off chance, I tried updating influxdb (from 1.3.5 to 1.4.2), and things were looking hopeful when influxd restarted (i.e. it didn't error straight after startup). A few minutes later I got a couple of similar-looking error messages which don't sound good, although it's not continuing to spam the logs, so maybe influxdb has got itself into a happier place:

Dec  9 14:37:32 nodered influxd[30010]: [I] 2017-12-09T01:37:32Z beginning full compaction, 3 TSM files engine=tsm1
Dec  9 14:37:32 nodered influxd[30010]: [I] 2017-12-09T01:37:32Z compacting full /var/lib/influxdb/data/sensors/autogen/13/000000016-000000004.tsm (#0) engine=tsm1
Dec  9 14:37:32 nodered influxd[30010]: [I] 2017-12-09T01:37:32Z compacting full /var/lib/influxdb/data/sensors/autogen/13/000000032-000000004.tsm (#1) engine=tsm1
Dec  9 14:37:32 nodered influxd[30010]: [I] 2017-12-09T01:37:32Z compacting full /var/lib/influxdb/data/sensors/autogen/13/000000032-000000005.tsm (#2) engine=tsm1
Dec  9 14:37:32 nodered influxd[30010]: [I] 2017-12-09T01:37:32Z beginning full compaction, 3 TSM files engine=tsm1
Dec  9 14:37:32 nodered influxd[30010]: [I] 2017-12-09T01:37:32Z compacting full /var/lib/influxdb/data/sensors/autogen/12/000000032-000000005.tsm (#0) engine=tsm1
Dec  9 14:37:32 nodered influxd[30010]: [I] 2017-12-09T01:37:32Z compacting full /var/lib/influxdb/data/sensors/autogen/12/000000048-000000004.tsm (#1) engine=tsm1
Dec  9 14:37:32 nodered influxd[30010]: [I] 2017-12-09T01:37:32Z compacting full /var/lib/influxdb/data/sensors/autogen/12/000000048-000000005.tsm (#2) engine=tsm1
Dec  9 14:37:32 nodered influxd[30010]: [I] 2017-12-09T01:37:32Z Sending usage statistics to usage.influxdata.com
Dec  9 14:40:53 nodered influxd[30010]: [I] 2017-12-09T01:40:53Z error replacing new TSM files: cannot allocate memory engine=tsm1
Dec  9 14:42:32 nodered influxd[30010]: [I] 2017-12-09T01:42:32Z error replacing new TSM files: cannot allocate memory engine=tsm1

@Lenbok

Lenbok commented Dec 9, 2017

Nope, it's not really any better than before. A few hours after the upgrade, influxdb suddenly started using 200% CPU (on my 4-core Pi 3), doing constant I/O and producing error messages in the log (although much less frequently than prior to the upgrade):

[screenshots: CPU and I/O usage graphs]

Dec  9 18:07:31 nodered influxd[30010]: [I] 2017-12-09T05:07:31Z Retention policy shard deletion check commencing. service=retention
Dec  9 18:37:16 nodered influxd[30010]: [I] 2017-12-09T05:37:16Z beginning full compaction, 3 TSM files engine=tsm1
Dec  9 18:37:16 nodered influxd[30010]: [I] 2017-12-09T05:37:16Z compacting full /var/lib/influxdb/data/sensors/autogen/12/000000032-000000005.tsm (#0) engine=tsm1
Dec  9 18:37:16 nodered influxd[30010]: [I] 2017-12-09T05:37:16Z compacting full /var/lib/influxdb/data/sensors/autogen/12/000000048-000000004.tsm (#1) engine=tsm1
Dec  9 18:37:16 nodered influxd[30010]: [I] 2017-12-09T05:37:16Z beginning full compaction, 7 TSM files engine=tsm1
Dec  9 18:37:16 nodered influxd[30010]: [I] 2017-12-09T05:37:16Z compacting full /var/lib/influxdb/data/sensors/autogen/12/000000048-000000005.tsm (#2) engine=tsm1
Dec  9 18:37:16 nodered influxd[30010]: [I] 2017-12-09T05:37:16Z compacting full /var/lib/influxdb/data/sensors/autogen/11/000000032-000000005.tsm (#0) engine=tsm1
Dec  9 18:37:16 nodered influxd[30010]: [I] 2017-12-09T05:37:16Z compacting full /var/lib/influxdb/data/sensors/autogen/11/000000036-000000003.tsm (#1) engine=tsm1
Dec  9 18:37:16 nodered influxd[30010]: [I] 2017-12-09T05:37:16Z compacting full /var/lib/influxdb/data/sensors/autogen/11/000000040-000000003.tsm (#2) engine=tsm1
Dec  9 18:37:16 nodered influxd[30010]: [I] 2017-12-09T05:37:16Z compacting full /var/lib/influxdb/data/sensors/autogen/11/000000044-000000003.tsm (#3) engine=tsm1
Dec  9 18:37:16 nodered influxd[30010]: [I] 2017-12-09T05:37:16Z compacting full /var/lib/influxdb/data/sensors/autogen/11/000000046-000000002.tsm (#4) engine=tsm1
Dec  9 18:37:16 nodered influxd[30010]: [I] 2017-12-09T05:37:16Z compacting full /var/lib/influxdb/data/sensors/autogen/11/000000047-000000001.tsm (#5) engine=tsm1
Dec  9 18:37:16 nodered influxd[30010]: [I] 2017-12-09T05:37:16Z compacting full /var/lib/influxdb/data/sensors/autogen/11/000000047-000000002.tsm (#6) engine=tsm1
Dec  9 18:37:31 nodered influxd[30010]: [I] 2017-12-09T05:37:31Z Retention policy shard deletion check commencing. service=retention
Dec  9 18:41:32 nodered influxd[30010]: [I] 2017-12-09T05:41:32Z error replacing new TSM files: cannot allocate memory engine=tsm1
Dec  9 18:41:34 nodered influxd[30010]: [I] 2017-12-09T05:41:34Z beginning full compaction, 3 TSM files engine=tsm1
Dec  9 18:41:34 nodered influxd[30010]: [I] 2017-12-09T05:41:34Z compacting full /var/lib/influxdb/data/sensors/autogen/12/000000032-000000005.tsm (#0) engine=tsm1
Dec  9 18:41:34 nodered influxd[30010]: [I] 2017-12-09T05:41:34Z compacting full /var/lib/influxdb/data/sensors/autogen/12/000000048-000000004.tsm (#1) engine=tsm1
Dec  9 18:41:34 nodered influxd[30010]: [I] 2017-12-09T05:41:34Z compacting full /var/lib/influxdb/data/sensors/autogen/12/000000048-000000005.tsm (#2) engine=tsm1
Dec  9 18:41:41 nodered influxd[30010]: [I] 2017-12-09T05:41:41Z error replacing new TSM files: cannot allocate memory engine=tsm1
Dec  9 18:41:43 nodered influxd[30010]: [I] 2017-12-09T05:41:43Z beginning full compaction, 7 TSM files engine=tsm1
Dec  9 18:41:43 nodered influxd[30010]: [I] 2017-12-09T05:41:43Z compacting full /var/lib/influxdb/data/sensors/autogen/11/000000032-000000005.tsm (#0) engine=tsm1
Dec  9 18:41:43 nodered influxd[30010]: [I] 2017-12-09T05:41:43Z compacting full /var/lib/influxdb/data/sensors/autogen/11/000000036-000000003.tsm (#1) engine=tsm1
Dec  9 18:41:43 nodered influxd[30010]: [I] 2017-12-09T05:41:43Z compacting full /var/lib/influxdb/data/sensors/autogen/11/000000040-000000003.tsm (#2) engine=tsm1
Dec  9 18:41:43 nodered influxd[30010]: [I] 2017-12-09T05:41:43Z compacting full /var/lib/influxdb/data/sensors/autogen/11/000000044-000000003.tsm (#3) engine=tsm1
Dec  9 18:41:43 nodered influxd[30010]: [I] 2017-12-09T05:41:43Z compacting full /var/lib/influxdb/data/sensors/autogen/11/000000046-000000002.tsm (#4) engine=tsm1
Dec  9 18:41:43 nodered influxd[30010]: [I] 2017-12-09T05:41:43Z compacting full /var/lib/influxdb/data/sensors/autogen/11/000000047-000000001.tsm (#5) engine=tsm1
Dec  9 18:41:43 nodered influxd[30010]: [I] 2017-12-09T05:41:43Z compacting full /var/lib/influxdb/data/sensors/autogen/11/000000047-000000002.tsm (#6) engine=tsm1
Dec  9 18:46:03 nodered influxd[30010]: [I] 2017-12-09T05:46:03Z error replacing new TSM files: cannot allocate memory engine=tsm1
Dec  9 18:46:05 nodered influxd[30010]: [I] 2017-12-09T05:46:05Z beginning full compaction, 3 TSM files engine=tsm1
Dec  9 18:46:05 nodered influxd[30010]: [I] 2017-12-09T05:46:05Z compacting full /var/lib/influxdb/data/sensors/autogen/12/000000032-000000005.tsm (#0) engine=tsm1
Dec  9 18:46:05 nodered influxd[30010]: [I] 2017-12-09T05:46:05Z compacting full /var/lib/influxdb/data/sensors/autogen/12/000000048-000000004.tsm (#1) engine=tsm1
Dec  9 18:46:05 nodered influxd[30010]: [I] 2017-12-09T05:46:05Z compacting full /var/lib/influxdb/data/sensors/autogen/12/000000048-000000005.tsm (#2) engine=tsm1
Dec  9 18:46:15 nodered influxd[30010]: [I] 2017-12-09T05:46:15Z error replacing new TSM files: cannot allocate memory engine=tsm1

Edit: BTW, memory profile doesn't look like it is running low, unless it's trying to grab a whole bunch at once:
[screenshot: memory usage graph]

@Lenbok

Lenbok commented Feb 6, 2018

And it's doing it again...

Feb  6 16:54:58 nodered influxd[332]: [I] 2018-02-06T03:54:58Z beginning full compaction, 4 TSM files engine=tsm1
Feb  6 16:54:58 nodered influxd[332]: [I] 2018-02-06T03:54:58Z compacting full /var/lib/influxdb/data/sensors/autogen/21/000000016-000000004.tsm (#0) engine=tsm1
Feb  6 16:54:58 nodered influxd[332]: [I] 2018-02-06T03:54:58Z compacting full /var/lib/influxdb/data/sensors/autogen/21/000000021-000000003.tsm (#1) engine=tsm1
Feb  6 16:54:58 nodered influxd[332]: [I] 2018-02-06T03:54:58Z compacting full /var/lib/influxdb/data/sensors/autogen/21/000000026-000000003.tsm (#2) engine=tsm1
Feb  6 16:54:58 nodered influxd[332]: [I] 2018-02-06T03:54:58Z compacting full /var/lib/influxdb/data/sensors/autogen/21/000000027-000000001.tsm (#3) engine=tsm1
Feb  6 16:56:54 nodered influxd[332]: [I] 2018-02-06T03:56:54Z error replacing new TSM files: cannot allocate memory engine=tsm1
Feb  6 16:56:56 nodered influxd[332]: [I] 2018-02-06T03:56:56Z beginning full compaction, 4 TSM files engine=tsm1
Feb  6 16:56:56 nodered influxd[332]: [I] 2018-02-06T03:56:56Z compacting full /var/lib/influxdb/data/sensors/autogen/21/000000016-000000004.tsm (#0) engine=tsm1
Feb  6 16:56:56 nodered influxd[332]: [I] 2018-02-06T03:56:56Z compacting full /var/lib/influxdb/data/sensors/autogen/21/000000021-000000003.tsm (#1) engine=tsm1
Feb  6 16:56:56 nodered influxd[332]: [I] 2018-02-06T03:56:56Z compacting full /var/lib/influxdb/data/sensors/autogen/21/000000026-000000003.tsm (#2) engine=tsm1
Feb  6 16:56:56 nodered influxd[332]: [I] 2018-02-06T03:56:56Z compacting full /var/lib/influxdb/data/sensors/autogen/21/000000027-000000001.tsm (#3) engine=tsm1
Feb  6 16:58:50 nodered influxd[332]: [I] 2018-02-06T03:58:50Z error replacing new TSM files: cannot allocate memory engine=tsm1
Feb  6 16:58:52 nodered influxd[332]: [I] 2018-02-06T03:58:52Z beginning full compaction, 4 TSM files engine=tsm1
Feb  6 16:58:52 nodered influxd[332]: [I] 2018-02-06T03:58:52Z compacting full /var/lib/influxdb/data/sensors/autogen/21/000000016-000000004.tsm (#0) engine=tsm1
Feb  6 16:58:52 nodered influxd[332]: [I] 2018-02-06T03:58:52Z compacting full /var/lib/influxdb/data/sensors/autogen/21/000000021-000000003.tsm (#1) engine=tsm1
Feb  6 16:58:52 nodered influxd[332]: [I] 2018-02-06T03:58:52Z compacting full /var/lib/influxdb/data/sensors/autogen/21/000000026-000000003.tsm (#2) engine=tsm1
Feb  6 16:58:52 nodered influxd[332]: [I] 2018-02-06T03:58:52Z compacting full /var/lib/influxdb/data/sensors/autogen/21/000000027-000000001.tsm (#3) engine=tsm1
Feb  6 17:00:49 nodered influxd[332]: [I] 2018-02-06T04:00:49Z error replacing new TSM files: cannot allocate memory engine=tsm1

@jwilder any tips on how to sort this out?

@Redferne

Having a similar issue.

Apr 25 17:55:10 hassbian influxd[4783]: ts=2018-04-25T15:55:10.364336Z lvl=info msg="Error replacing new TSM files" log_id=07gFbOnW000 engine=tsm1 tsm1_strategy=full tsm1_optimize=false trace_id=07gN~yU0000 op_name=tsm1_compact_group error="cannot allocate memory"

Started out of the blue; it had been running without a hitch for over a year.
InfluxDB v1.5.2 (git: 1.5 02d7d4f)

Please advise.

@dgnorton
Contributor

@Redferne are you still seeing this issue?

@Redferne

Omg, forgot about this. No, my problem was caused by running influxd on a 32-bit system and the database getting bigger than x GByte. Would have been nice to get a warning though ☺️
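For anyone wondering whether they are in the same situation (a 32-bit influxd binary plus a data directory that has grown large), a quick check might look like this (a sketch; the data path assumes the default Linux package layout):

# is the binary / kernel 32-bit?
file $(which influxd)
uname -m

# how big has the data directory grown?
du -sh /var/lib/influxdb/data/*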

@dgnorton
Contributor

@Redferne thanks. Going to close this for now. We'll address it in the future if it turns out to be a frequent problem for users.

@Lenbok

Lenbok commented Oct 17, 2018

Just for future reference for anyone coming across this -- the problem went away for me on my Raspberry Pi 3 after I swapped to a larger/faster SD card. Possibly once compactions were triggered they caused more I/O than the SD card could keep up with, and the I/O kept accumulating faster than it could be serviced.
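If anyone wants to confirm the same theory on their own Pi before swapping cards, watching the I/O wait column during a compaction is one rough check (a sketch using standard tools; the values are only indicative):

# 'wa' is the percentage of CPU time spent waiting on I/O;
# persistently high values while compactions run would point at a saturated SD card
vmstat 1 30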

@Lenbok

Lenbok commented Dec 12, 2018

@dgnorton It seems I spoke too soon. Mine has been doing this again since 3rd December. This is so frustrating; I think I'm just going to delete influxdb altogether.

@gs4711

gs4711 commented Dec 19, 2018

Suddenly started to occur with no visible reason:

influxd[19485]: ts=2018-12-19T12:54:01.007492Z lvl=info msg="Error replacing new TSM files" log_id=0CTdI0HW000 engine=tsm1 tsm1_strategy=full tsm1_optimize=false trace_id=0CTdJ3~l000 op_name=tsm1_compact_group error="cannot allocate memory"
influxd[19485]: ts=2018-12-19T12:54:02.969238Z lvl=info msg="Aborted compaction" log_id=0CTdI0HW000 engine=tsm1 tsm1_strategy=full tsm1_optimize=false trace_id=0CTdKTr0000 op_name=tsm1_compact_group error="compaction in progress: open /var/lib/influxdb/data/openhab/openhab_retention/1673/000000276-000000003.tsm.tmp: file exists"
The existing files do not point to any possible reason:

-rw-r--r--  1 influxdb influxdb  6118411 Dec 13 21:14 000000131-000000004.tsm
-rw-r--r--  1 influxdb influxdb  6177033 Dec 18 06:52 000000260-000000004.tsm
-rw-r--r--  1 influxdb influxdb   402900 Dec 16 15:03 000000268-000000002.tsm
-rw-r--r--  1 influxdb influxdb   402065 Dec 16 19:59 000000276-000000002.tsm
-rw-r--r--  1 influxdb influxdb 12771613 Dec 19 13:54 000000276-000000003.tsm.tmp
-rw-r--r--  1 influxdb influxdb    31047 Dec 11 00:00 fields.idx

Removing the latest TSM files solves this issue only for some days. During the occurrence of the error there is more than enough free memory available.

Any hints?

@think-free

think-free commented Dec 19, 2018 via email

@gs4711

gs4711 commented Dec 19, 2018

DB size? You are running on an RPi, right? I had this problem; moving the database to a 64-bit system solved the issue ...

1.2G /var/lib/influxdb/data
24K /var/lib/influxdb/meta
1.5M /var/lib/influxdb/wal

Hmm, you're pointing in a direction which I really do not like; still hoping that a developer raises a hand and tells us that this is not the case!

@ryan-williams

ryan-williams commented Dec 22, 2018

I'm suddenly seeing this on a (32-bit) Raspberry Pi 3 B+, consistent with the hypothesis that it started when one DB (collectd) reached 1 GB:

$ du -sh /var/lib/influxdb/data/*
1.1G	/var/lib/influxdb/data/collectd
14M	/var/lib/influxdb/data/_internal
204M	/var/lib/influxdb/data/temps

Influx was running fine for weeks, but is now in a crash loop every few minutes, generating a ton of log spam.

Here's a histogram of the msg values from the logs above:

  16 msg="Compacting file"
  18 msg="Opened shard"
  38 msg="Opened file"
7521 msg="Reading file"

7512 of the 7521 Reading file messages are for the collectd database, relating to files spanning basically all successive integers from /var/lib/influxdb/wal/collectd/autogen/22/_10769.wal to /var/lib/influxdb/wal/collectd/autogen/22/_18305.wal.
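For reference, a rough pipeline that produces a histogram like the one above, assuming the messages end up in /var/log/syslog as in the excerpt below:

grep influxd /var/log/syslog | grep -o 'msg="[^"]*"' | sort | uniq -c | sort -n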

The process seems to die during compaction, with the messages:

Dec 21 23:23:19 rpi2 influxd[21364]: ts=2018-12-21T23:23:19.859433Z lvl=info msg="Error adding new TSM files from snapshot. Removing temp files." log_id=0CWl~kg0000 engine=tsm1 trace_id=0CWm70Sl000 op_name=tsm1_cache_snapshot error="cannot allocate memory"
Dec 21 23:23:19 rpi2 influxd[21364]: ts=2018-12-21T23:23:19.896133Z lvl=info msg="Cache snapshot (end)" log_id=0CWl~kg0000 engine=tsm1 trace_id=0CWm70Sl000 op_name=tsm1_cache_snapshot op_event=end op_elapsed=13668.832ms
Dec 21 23:23:19 rpi2 influxd[21364]: ts=2018-12-21T23:23:19.983874Z lvl=info msg="Error writing snapshot" log_id=0CWl~kg0000 engine=tsm1 error="cannot allocate memory"
Dec 21 23:23:19 rpi2 influxd[21364]: ts=2018-12-21T23:23:19.984015Z lvl=info msg="Cache snapshot (start)" log_id=0CWl~kg0000 engine=tsm1 trace_id=0CWm7rBl000 op_name=tsm1_cache_snapshot op_event=start
Dec 21 23:23:23 rpi2 influxd[21364]: runtime: out of memory: cannot allocate 8192-byte block (694747136 in use)
Dec 21 23:23:23 rpi2 influxd[21364]: fatal error: out of memory

followed by a bunch of stack trace/heap dump. Overall memory use on the system doesn't seem particularly high when this is happening (and there's plenty of swap space available), so I'm not sure why the 8 KB memory allocation is failing.

I worked around this issue by wiping out / backing up the existing collectd database:

sudo service influxdb stop
sudo mv /var/lib/influxdb/data/collectd ~/collectd
sudo service influxdb start
influx -execute 'create database collectd'

After this, my collectd database seems fine, and it even has about 30 hours of historical data, which I guess was being buffered by influx in some fashion while the compaction crash-loop was occurring.

Of course, this means I can only ever have Influx save the last few weeks of my collectd metrics, and I have to watch for the 1GB limit and keep "backing up" the collectd db, which is obviously not ideal.
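A crude way to watch for that limit is a small cron script along these lines (a sketch only; the ~900 MB threshold and the collectd data path are assumptions based on the numbers above):

#!/bin/sh
# warn via syslog when the collectd database directory nears 1 GB
LIMIT_KB=900000
SIZE_KB=$(du -sk /var/lib/influxdb/data/collectd | cut -f1)
if [ "$SIZE_KB" -gt "$LIMIT_KB" ]; then
    logger -t influxdb-size "collectd DB at ${SIZE_KB} KB, approaching the size where compactions started failing"
fi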

Cross-linking #10486 which seems like a canonical discussion of this issue.

@fluffynukeit

FYI I have an open PR in #12362 that is the successor to #10486.

@wollew

wollew commented Apr 8, 2019

It is awfully quiet in #12362; what's InfluxData's policy on including it in any release?

@fluffynukeit

I haven't seen any response from the Influx team regarding the PR. I made it a PR off of branch 1.8 because I don't want to adopt version 2.0 any time soon, but I think 2.0 is what the Influx team is focusing on right now. Honestly, it wouldn't be hard for someone familiar with the 2.0 codebase to port the fixes into 2.0, assuming they aren't in there already. I'll request an update on the PR thread.

@mumrau

mumrau commented Mar 16, 2021

2.0 won't ever support 32-bit systems, so we're kind of at a dead end, aren't we?
