Service silently crashes after start #954

Closed
vsviridov opened this issue Sep 19, 2014 · 44 comments

@vsviridov

Today we experienced a situation where the database stopped responding to HTTP queries on the admin interface.
After the restart it stopped responding to all HTTP requests and crashes occasionally.
If I restart it with a clean data folder, it starts normally.
There's nothing in the log that points to any potential issues.

Please advise.

@antage

antage commented Sep 21, 2014

I have the same issue.
After two days of uptime, an instance stopped responding on the admin interface, and queries generated these log messages:

[2014/09/21 14:43:43 UTC] [INFO] (github.com/influxdb/influxdb/coordinator.(*CoordinatorImpl).RunQuery:81) Start Query: db: grafana, u: grafana, q: select dashboard from "grafana.dashboard_ZG9hbnRhZ2VuYW1l"
[2014/09/21 14:43:43 UTC] [EROR] (github.com/influxdb/influxdb/common.RecoverFunc:20) ********************************BUG********************************
Database: grafana
Query: [select dashboard from grafana.dashboard_ZG9hbnRhZ2VuYW1l]
Error: runtime error: invalid memory address or nil pointer dereference. Stacktrace: goroutine 20327 [running]:
github.com/influxdb/influxdb/common.RecoverFunc(0xc20ab02366, 0x7, 0xc2081fad40, 0x38, 0x7f908c427ef0)
        /root/gocodez/src/github.com/influxdb/influxdb/common/recover.go:14 +0xb7
runtime.panic(0xbfba20, 0x100e173)
        /root/.gvm/gos/go1.3.1/src/pkg/runtime/panic.c:248 +0x18d
github.com/influxdb/influxdb/datastore.(*Shard).getIterators(0xc208cb6280, 0xc208866718, 0x1, 0x1, 0xc208280e48, 0x8, 0x8, 0xc208280fb8, 0x8, 0x8, ...)
        /root/gocodez/src/github.com/influxdb/influxdb/datastore/shard.go:500 +0x1fc
github.com/influxdb/influxdb/datastore.(*Shard).executeQueryForSeries(0xc208cb6280, 0xc208dc3d60, 0xc20a30e8a0, 0x22, 0xc208280ed0, 0x1, 0x1, 0x7f9092abd7c8, 0xc20980ecc0, 0x0, ...)
        /root/gocodez/src/github.com/influxdb/influxdb/datastore/shard.go:175 +0x713
github.com/influxdb/influxdb/datastore.(*Shard).Query(0xc208cb6280, 0xc208dc3d60, 0x7f9092abd7c8, 0xc20980ecc0, 0x0, 0x0)
        /root/gocodez/src/github.com/influxdb/influxdb/datastore/shard.go:135 +0x69f
github.com/influxdb/influxdb/cluste

After a restart the instance doesn't open any TCP ports. However, it writes this to the log:

[2014/09/21 14:52:14 UTC] [INFO] (github.com/influxdb/influxdb/server.(*Server).ListenAndServe:135) Starting admin interface on port 8083
[2014/09/21 14:52:14 UTC] [INFO] (github.com/influxdb/influxdb/server.(*Server).ListenAndServe:148) Starting Graphite Listener on port 2003
[2014/09/21 14:52:14 UTC] [INFO] (github.com/influxdb/influxdb/server.(*Server).ListenAndServe:161) UDP server is disabled
[2014/09/21 14:52:14 UTC] [INFO] (github.com/influxdb/influxdb/server.(*Server).ListenAndServe:161) UDP server is disabled
[2014/09/21 14:52:14 UTC] [INFO] (github.com/influxdb/influxdb/server.(*Server).ListenAndServe:190) Starting Http Api server on port 8086

Debian 7 (wheezy) amd64. InfluxDB 0.8.2 installed from deb file.

@selbyk

selbyk commented Sep 22, 2014

I am also having this problem on a Debian Wheezy install.

Things were running fine for about 3 days, then this morning I noticed the graphs had stopped updating and port 8083 was unavailable. I couldn't find anything in the logs and tried updating the package, but I'd rather not lose my data.

I was storing mostly small text blurbs with sentiment analysis data.

@vsviridov
Author

Same behaviour. I can see the open ports in the netstat output, but can't connect to them.
I still have the files with the database, if you need them to try to reproduce this behaviour.

@jvshahid
Contributor

What version of InfluxDB are you guys running?

@selbyk

selbyk commented Sep 22, 2014

I'm not sure which version I was using before, but it was the latest deb package from last week. I then upgraded from the latest deb again yesterday, which didn't help.

I've only been running influx for a few days, so this is a serious disappointment. It's understandable, though. Hopefully this is a simple fix that is just hard to see.

@selbyk

selbyk commented Sep 22, 2014

Here is the full output of /opt/influxdb/shared/log.txt:

http://know.selby.io/grafana/log.txt

@vsviridov
Author

I was running 0.8.1 when it experienced the hang-up. I've updated to 0.8.2 in hopes that it would solve it, but it did not.

@jvshahid
Contributor

Are you using a single node or a cluster?

@vsviridov
Author

Single node.

@selbyk

selbyk commented Sep 22, 2014

I am using a single node, just started learning influx.

@vsviridov Are you running debian also?
Did we all happen to do a recent system upgrade?
@vsviridov & @antage, how long were you writing to the db and how much were you writing on average? And what type of data were you writing?

I thought maybe I had overloaded influxdb with requests and influxdb ran out of memory or didn't like some text being sent (they should have been safe... except I just remembered I forgot to replace " with /" so a lot of broken strings could have been sent to influx).

I am logging IRC messages and my load avg dropped significantly around 7 am, about the time people start waking up and when I imagine my influx server went down.

@jvshahid
Contributor

How many shards do you guys have? You should be able to see all the shards by ls-ing /opt/influxdb/shared/data/db/shard_db_v2. What's the config option max-open-shards set to? Also, are you guys using grafana?
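
For anyone who prefers a number to a long ls listing, here is a minimal Python sketch that counts the shard directories. The helper name `count_shard_dirs` is made up for illustration, and the sketch assumes each shard lives in its own subdirectory of the path mentioned above.

```python
import os


def count_shard_dirs(shard_root):
    """Count immediate subdirectories of shard_root (assumed: one per shard)."""
    with os.scandir(shard_root) as entries:
        return sum(1 for entry in entries if entry.is_dir())


# Path taken from the comment above; adjust it if your storage dir differs.
root = "/opt/influxdb/shared/data/db/shard_db_v2"
if os.path.isdir(root):
    print(count_shard_dirs(root), "shard directories")
else:
    print("no shard directory at", root)
```

Plain files inside the directory are ignored, so the count should match the number of entries that ls shows only if everything in there is a shard folder.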

@selbyk

selbyk commented Sep 22, 2014

I was using grafana when it went down.

@jvshahid
Contributor

Can you also answer the other questions?

@selbyk

selbyk commented Sep 22, 2014

selby@know:~$ ls /opt/influxdb/shared/data/db/shard_db_v2
00001  00012  00023  00034  00045  00056  00067  00078  00089  00100  00111  00122  00133  00144  00155  00166  00177  00188  00199  00210  00221  00232  00243  00254  00265  00276  00287  00298  00309
00002  00013  00024  00035  00046  00057  00068  00079  00090  00101  00112  00123  00134  00145  00156  00167  00178  00189  00200  00211  00222  00233  00244  00255  00266  00277  00288  00299  00310
00003  00014  00025  00036  00047  00058  00069  00080  00091  00102  00113  00124  00135  00146  00157  00168  00179  00190  00201  00212  00223  00234  00245  00256  00267  00278  00289  00300  00311
00004  00015  00026  00037  00048  00059  00070  00081  00092  00103  00114  00125  00136  00147  00158  00169  00180  00191  00202  00213  00224  00235  00246  00257  00268  00279  00290  00301  00312
00005  00016  00027  00038  00049  00060  00071  00082  00093  00104  00115  00126  00137  00148  00159  00170  00181  00192  00203  00214  00225  00236  00247  00258  00269  00280  00291  00302  00313
00006  00017  00028  00039  00050  00061  00072  00083  00094  00105  00116  00127  00138  00149  00160  00171  00182  00193  00204  00215  00226  00237  00248  00259  00270  00281  00292  00303  00314
00007  00018  00029  00040  00051  00062  00073  00084  00095  00106  00117  00128  00139  00150  00161  00172  00183  00194  00205  00216  00227  00238  00249  00260  00271  00282  00293  00304  00315
00008  00019  00030  00041  00052  00063  00074  00085  00096  00107  00118  00129  00140  00151  00162  00173  00184  00195  00206  00217  00228  00239  00250  00261  00272  00283  00294  00305  00316
00009  00020  00031  00042  00053  00064  00075  00086  00097  00108  00119  00130  00141  00152  00163  00174  00185  00196  00207  00218  00229  00240  00251  00262  00273  00284  00295  00306  00317
00010  00021  00032  00043  00054  00065  00076  00087  00098  00109  00120  00131  00142  00153  00164  00175  00186  00197  00208  00219  00230  00241  00252  00263  00274  00285  00296  00307  00318
00011  00022  00033  00044  00055  00066  00077  00088  00099  00110  00121  00132  00143  00154  00165  00176  00187  00198  00209  00220  00231  00242  00253  00264  00275  00286  00297  00308

I don't think I've changed anything from the sample config.

# Welcome to the InfluxDB configuration file.

# If hostname (on the OS) doesn't return a name that can be resolved by the other
# systems in the cluster, you'll have to set the hostname to an IP or something
# that can be resolved here.
# hostname = ""

bind-address = "0.0.0.0"

# Once every 24 hours InfluxDB will report anonymous data to m.influxdb.com
# The data includes raft name (random 8 bytes), os, arch and version
# We don't track ip addresses of servers reporting. This is only used
# to track the number of instances running and the versions which
# is very helpful for us.
# Change this option to true to disable reporting.
reporting-disabled = false

[logging]
# logging level can be one of "debug", "info", "warn" or "error"
level  = "info"
file   = "/opt/influxdb/shared/log.txt"         # stdout to log to standard out

# Configure the admin server
[admin]
port   = 8083              # binding is disabled if the port isn't set
assets = "/opt/influxdb/current/admin"

# Configure the http api
[api]
port     = 8086    # binding is disabled if the port isn't set
# ssl-port = 8084    # Ssl support is enabled if you set a port and cert
# ssl-cert = /path/to/cert.pem

# connections will timeout after this amount of time. Ensures that clients that misbehave
# and keep alive connections they don't use won't end up connection a million times.
# However, if a request is taking longer than this to complete, could be a problem.
read-timeout = "5s"

[input_plugins]

  # Configure the graphite api
  [input_plugins.graphite]
  enabled = false
  # port = 2003
  # database = ""  # store graphite data in this database
  # udp_enabled = true # enable udp interface on the same port as the tcp interface

  # Configure the udp api
  [input_plugins.udp]
  enabled = false
  # port = 4444
  # database = ""

  # Configure multiple udp apis each can write to separate db.  Just
  # repeat the following section to enable multiple udp apis on
  # different ports.
  [[input_plugins.udp_servers]] # array of tables
  enabled = false
  # port = 5551
  # database = "db1"

# Raft configuration
[raft]
# The raft port should be open between all servers in a cluster.
# However, this port shouldn't be accessible from the internet.

port = 8090

# Where the raft logs are stored. The user running InfluxDB will need read/write access.
dir  = "/opt/influxdb/shared/data/raft"

# election-timeout = "1s"

[storage]

dir = "/opt/influxdb/shared/data/db"
# How many requests to potentially buffer in memory. If the buffer gets filled then writes
# will still be logged and once the local storage has caught up (or compacted) the writes
# will be replayed from the WAL
write-buffer-size = 10000

# the engine to use for new shards, old shards will continue to use the same engine
default-engine = "rocksdb"

# The default setting on this is 0, which means unlimited. Set this to something if you want to
# limit the max number of open files. max-open-files is per shard so this * that will be max.
max-open-shards = 0

# The default setting is 100. This option tells how many points will be fetched from LevelDb before
# they get flushed into backend.
point-batch-size = 100

# The number of points to batch in memory before writing them to leveldb. Lowering this number will
# reduce the memory usage, but will result in slower writes.
write-batch-size = 5000000

# The server will check this often for shards that have expired that should be cleared.
retention-sweep-period = "10m"

[storage.engines.leveldb]

# Maximum mmap open files, this will affect the virtual memory used by
# the process
max-open-files = 1000

# LRU cache size, LRU is used by leveldb to store contents of the
# uncompressed sstables. You can use `m` or `g` prefix for megabytes
# and gigabytes, respectively.
lru-cache-size = "200m"

[storage.engines.rocksdb]

# Maximum mmap open files, this will affect the virtual memory used by
# the process
max-open-files = 1000

# LRU cache size, LRU is used by rocksdb to store contents of the
# uncompressed sstables. You can use `m` or `g` prefix for megabytes
# and gigabytes, respectively.
lru-cache-size = "200m"

[storage.engines.hyperleveldb]

# Maximum mmap open files, this will affect the virtual memory used by
# the process
max-open-files = 1000

# LRU cache size, LRU is used by rocksdb to store contents of the
# uncompressed sstables. You can use `m` or `g` prefix for megabytes
# and gigabytes, respectively.
lru-cache-size = "200m"

[storage.engines.lmdb]

map-size = "100g"

[cluster]
# A comma separated list of servers to seed
# this server. this is only relevant when the
# server is joining a new cluster. Otherwise
# the server will use the list of known servers
# prior to shutting down. Any server can be pointed to
# as a seed. It will find the Raft leader automatically.

# Here's an example. Note that the port on the host is the same as the raft port.
# seed-servers = ["hosta:8090","hostb:8090"]

# Replication happens over a TCP connection with a Protobuf protocol.
# This port should be reachable between all servers in a cluster.
# However, this port shouldn't be accessible from the internet.

protobuf_port = 8099
protobuf_timeout = "2s" # the write timeout on the protobuf conn any duration parseable by time.ParseDuration
protobuf_heartbeat = "200ms" # the heartbeat interval between the servers. must be parseable by time.ParseDuration
protobuf_min_backoff = "1s" # the minimum backoff after a failed heartbeat attempt
protobuf_max_backoff = "10s" # the maxmimum backoff after a failed heartbeat attempt

# How many write requests to potentially buffer in memory per server. If the buffer gets filled then writes
# will still be logged and once the server has caught up (or come back online) the writes
# will be replayed from the WAL
write-buffer-size = 1000

# the maximum number of responses to buffer from remote nodes, if the
# expected number of responses exceed this number then querying will
# happen sequentially and the buffer size will be limited to this
# number
max-response-buffer-size = 100

# When queries get distributed out to shards, they go in parallel. This means that results can get buffered
# in memory since results will come in any order, but have to be processed in the correct time order.
# Setting this higher will give better performance, but you'll need more memory. Setting this to 1 will ensure
# that you don't need to buffer in memory, but you won't get the best performance.
concurrent-shard-query-limit = 10

[wal]

dir   = "/opt/influxdb/shared/data/wal"
flush-after = 1000 # the number of writes after which wal will be flushed, 0 for flushing on every write
bookmark-after = 1000 # the number of writes after which a bookmark will be created

# the number of writes after which an index entry is created pointing
# to the offset of the first request, default to 1k
index-after = 1000

# the number of requests per one log file, if new requests came in a
# new log file will be created
requests-per-logfile = 10000

@selbyk

selbyk commented Sep 22, 2014

Should I try to compile master or another branch?

@jvshahid
Contributor

No, master doesn't have a fix for this issue. What's the output of cat /proc/$(pidof influxdb)/limits? You might need to bump up the limits of the process.

@vsviridov
Author

We're writing performance stats every 5 seconds and a grafana instance is also open. The size attained was around 256 MB on disk, so it's very little. At the time when it first stopped responding there was no indication of OOM or of not having enough disk space.

We are running on CentOS 6.4 (x64).

268 shard folders. All default settings on creation.
max-open-shards is set to 0 (default).

@vsviridov
Author

Here are my limits for the currently running instance:

Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            10485760             unlimited            bytes
Max core file size        0                    unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             1024                 30444                processes
Max open files            1024                 4096                 files
Max locked memory         65536                65536                bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       30444                30444                signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us

@selbyk

selbyk commented Sep 22, 2014

selby@know:~$ cat /proc/$(pidof influxdb)/limits
Limit                     Soft Limit           Hard Limit           Units     
Max cpu time              unlimited            unlimited            seconds   
Max file size             unlimited            unlimited            bytes     
Max data size             unlimited            unlimited            bytes     
Max stack size            8388608              unlimited            bytes     
Max core file size        0                    unlimited            bytes     
Max resident set          unlimited            unlimited            bytes     
Max processes             16001                16001                processes 
Max open files            1024                 4096                 files     
Max locked memory         65536                65536                bytes     
Max address space         unlimited            unlimited            bytes     
Max file locks            unlimited            unlimited            locks     
Max pending signals       16001                16001                signals   
Max msgqueue size         819200               819200               bytes     
Max nice priority         0                    0                    
Max realtime priority     0                    0                    
Max realtime timeout      unlimited            unlimited            us        
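
The row that matters in both limits dumps is "Max open files". As a rough sketch (the function name `parse_open_files_limit` is hypothetical, and the sample text is an excerpt of the output posted above), the soft and hard values could be pulled out like this:

```python
def parse_open_files_limit(limits_text):
    """Return (soft, hard) for the 'Max open files' row of /proc/<pid>/limits.

    Each value is an int, or None where the kernel reports 'unlimited'.
    """
    label = "Max open files"

    def to_int(value):
        return None if value == "unlimited" else int(value)

    for line in limits_text.splitlines():
        if line.startswith(label):
            soft, hard = line[len(label):].split()[:2]
            return to_int(soft), to_int(hard)
    raise ValueError("no 'Max open files' row found")


sample = """Limit                     Soft Limit           Hard Limit           Units
Max processes             16001                16001                processes
Max open files            1024                 4096                 files
"""
print(parse_open_files_limit(sample))  # (1024, 4096)
```

On a live system the text would come from reading /proc/<pid>/limits for the influxdb process instead of the sample string.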

@perqa

perqa commented Sep 22, 2014

I have the same problem on Ubuntu 14. I have uninstalled, cleaned up, and reinstalled. It works for a few days, but eventually stops responding. If I restart InfluxDB, it dies by itself within a few seconds. I have not used the database very much, only a few tests. After installation, I inserted some data (approx. 1.5 million points), and I've made a few queries using Grafana and InfluxDB admin.

Both the web interface (port 8083) and API (port 8086) become unresponsive. The API responds with error messages:

Internal Error: runtime error: invalid memory address or nil pointer dereference

Example query:
http://myip.per:8086/db/grafana/series?p=****&q=select+title,+tags+from+%2Fgrafana.dashboard_.*%2F+where++title+%3D~+%2F.*.*%2Fi&time_precision=s&u=****

vagrant@ubuntu-14:~$ service influxdb status
influxdb Process is running [ OK ]

vagrant@ubuntu-14:~$ sudo service influxdb restart
influxdb process was stopped [ OK ]
Starting the process influxdb [ OK ]
influxdb process was started [ OK ]

5 s later:
vagrant@ubuntu-14:~$ service influxdb status
influxdb Process is not running [ FAILED ]

Tail of log file:
vagrant@ubuntu-14:~$ sudo tail /opt/influxdb/shared/log.txt
[2014/09/19 08:39:07 UTC] INFO DATASTORE: opening or creating shard /opt/influxdb/shared/data/db/shard_db_v2/00294
[2014/09/19 08:39:07 UTC] EROR AddShards: error setting local store: %!(EXTRA *os.PathError=open /opt/influxdb/shared/data/db/shard_db_v2/00294/type: too many open files)
[2014/09/19 08:39:07 UTC] [INFO] (github.com/influxdb/influxdb/datastore.(*ShardDatastore).GetOrCreateShard:162) DATASTORE: opening or creating shard /opt/influxdb/shared/data/db/shard_db_v2/00295
[2014/09/19 08:39:07 UTC] EROR AddShards: error setting local store: %!(EXTRA *os.PathError=open /opt/influxdb/shared/data/db/shard_db_v2/00295/type: too many open files)
[2014/09/19 08:39:07 UTC] [INFO] (github.com/influxdb/influxdb/datastore.(*ShardDatastore).GetOrCreateShard:162) DATASTORE: opening or creating shard /opt/influxdb/shared/data/db/shard_db_v2/00296
[2014/09/19 08:39:07 UTC] INFO Adding shard to default: 296 - start: Thu Jul 29 00:00:00 +0000 UTC 2004 (1091059200). end: Thu Aug 5 00:00:00 +0000 UTC 2004 (1091664000). isLocal: true. servers: [1]
[2014/09/19 08:39:07 UTC] INFO DATASTORE: opening or creating shard /opt/influxdb/shared/data/db/shard_db_v2/00297
[2014/09/19 08:39:07 UTC] EROR AddShards: error setting local store: %!(EXTRA *os.PathError=open /opt/influxdb/shared/data/db/shard_db_v2/00297/type: too many open files)
[2014/09/19 08:39:07 UTC] [INFO] (github.com/influxdb/influxdb/datastore.(*ShardDatastore).GetOrCreateShard:162) DATASTORE: opening or creating shard /opt/influxdb/shared/data/db/shard_db_v2/00298
[2014/09/19 08:39:07 UTC] EROR AddShards: error setting local store: %!(EXTRA *os.PathError=open /opt/influxdb/shared/data/db/shard_db_v2/00298/type: too many open files)

This is the data I inserted to InfluxDB and MySQL:
+--------------------+-----------+------------+------------+
| Reported Statistic | Data Size | Index Size | Total Size |
+--------------------+-----------+------------+------------+
| ki_fidelix InnoDB  | 107.83 MB | 16.00 KB   | 107.84 MB  |
+--------------------+-----------+------------+------------+

InfluxDB
vagrant@ubuntu-14:/opt/influxdb/shared$ sudo du -sh data/*
2.0G data/db
120K data/raft
8.0K data/wal

The same data: 108 MB becomes 2 GB in InfluxDB...?

@jvshahid
Contributor

I think this problem is caused by a change in grafana that triggered unexpected behavior in InfluxDB. Grafana is currently storing dashboards in InfluxDB, which caused InfluxDB to create a massive number of shards that aren't being used. Combined with the low open-files limit and the fact that there is no limit on open shards in your configuration, the process ran out of open files. We will try to address this issue sometime today. To work around it, you can limit the number of open shards in the config file, bump the open-files limit, or delete shards that aren't being used.
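
To put rough numbers on that explanation, here is a back-of-the-envelope Python sketch using only figures posted earlier in this thread: the 318 shard directories in the ls listing, max-open-files = 1000 from the sample config (whose comment says the per-shard setting times the number of open shards is the maximum), and the 1024 soft limit on open files. The function name is made up, and the result is a worst-case ceiling, not a measurement of what the process actually had open.

```python
def worst_case_open_files(open_shards, max_open_files_per_shard):
    """Upper bound on descriptors if every open shard used its full allowance."""
    return open_shards * max_open_files_per_shard


shards = 318        # shard directories in the ls listing posted above
per_shard = 1000    # max-open-files in the posted config
soft_limit = 1024   # "Max open files" soft limit posted above

ceiling = worst_case_open_files(shards, per_shard)
print(ceiling)                  # 318000
print(ceiling > soft_limit)     # True
# Number of fully loaded shards that would still fit under the soft limit:
print(soft_limit // per_shard)  # 1
```

With max-open-shards = 0 (unlimited) nothing caps the first factor, which is consistent with the "too many open files" errors in the log tail above.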

@vsviridov
Author

I guess I could move the grafana config back to elasticsearch.

@jvshahid
Contributor

You can try that, but as I said before, the shards are already created. You need to use one of the workarounds mentioned earlier.

@jvshahid
Contributor

Can someone post the shard information in JSON? curl 'localhost:8086/cluster/shards?u=root&p=root'
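
If curl isn't handy, the same request could be made from Python's standard library. This is only a sketch: the helper names are invented, the host, port, and root/root credentials are copied from the curl command above, and the shape of the JSON response is not shown anywhere in this thread, so the code makes no assumption about it beyond being valid JSON.

```python
import json
import urllib.parse
import urllib.request


def shards_url(host="localhost", port=8086, user="root", password="root"):
    """Build the /cluster/shards URL used in the curl command above."""
    query = urllib.parse.urlencode({"u": user, "p": password})
    return f"http://{host}:{port}/cluster/shards?{query}"


def fetch_shards(url, timeout=5):
    """Fetch the URL and decode the body as JSON (needs a running server)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


print(shards_url())  # http://localhost:8086/cluster/shards?u=root&p=root
```

Against a running instance, `print(json.dumps(fetch_shards(shards_url()), indent=2))` would pretty-print whatever the server returns.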

@perqa

perqa commented Sep 22, 2014

Sorry, I can't restart InfluxDB. It crashes immediately.
How can I delete unused shards?

@jvshahid
Contributor

Then try to bump the limit of the process or limit the number of shards.

@vsviridov
Author

I had to delete the database. Also I tried using the lev node app to try to read from the leveldb files, but it's not able to open them.

@vsviridov
Author

I looked at the URL. Grafana did create a ridiculous amount of shards. But this time I've put the grafana config into a separate namespace, so I guess if I delete them it should not affect the main database.

@jvshahid
Contributor

What I pushed to master will stop the automatic creation of shards. That said, if you are already suffering from this problem, you can work around it by doing the following to start influxdb and prevent it from crashing:

  1. Increase the nofiles limit of the process
  2. Set max-open-shards in the config file

In order to get rid of the extra shards, you can do the following:

  1. Drop the database that has the grafana dashboards
  2. Manually delete the shards that look suspicious and outside the range of the data that you are inserting in InfluxDB
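
One way to spot shards "outside the range of the data" is to read the "Adding shard" lines that InfluxDB writes at startup, such as the one in the log tail posted earlier, and compare each shard's time range with the period you actually wrote data for. The sketch below is an illustration only: the regular expression is fitted to that single log line, the function name is invented, and the September 2014 window is an assumption about when data was written. It lists candidates and deletes nothing.

```python
import re
from datetime import datetime, timezone

# Fitted to the one "Adding shard" log line quoted in this thread.
SHARD_LINE = re.compile(
    r"Adding shard to \S+: (?P<id>\d+) - "
    r"start: .*?\((?P<start>\d+)\)\. end: .*?\((?P<end>\d+)\)\."
)


def shards_outside_window(log_lines, window_start, window_end):
    """Yield (shard_id, start, end) for shards not overlapping the window.

    All times are Unix epoch seconds.
    """
    for line in log_lines:
        match = SHARD_LINE.search(line)
        if not match:
            continue
        start, end = int(match["start"]), int(match["end"])
        if end <= window_start or start >= window_end:
            yield int(match["id"]), start, end


line = ("[2014/09/19 08:39:07 UTC] INFO Adding shard to default: 296 - "
        "start: Thu Jul 29 00:00:00 +0000 UTC 2004 (1091059200). "
        "end: Thu Aug 5 00:00:00 +0000 UTC 2004 (1091664000). "
        "isLocal: true. servers: [1]")

# Assumed window: data written during September 2014.
window = (datetime(2014, 9, 1, tzinfo=timezone.utc).timestamp(),
          datetime(2014, 10, 1, tzinfo=timezone.utc).timestamp())
print(list(shards_outside_window([line], *window)))
# [(296, 1091059200, 1091664000)]
```

Shard 296 covers a week in 2004, so it is flagged as a candidate for manual review.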

@antage

antage commented Sep 22, 2014

I confirm the issue is due to grafana creating too many shards.

@vsviridov
Author

Some notion of that in the logs might be nice to have.

@jvshahid
Contributor

It is clear in the log that the process can't open any more files. Normally the process shouldn't die with some random error, but this sometimes happened because we didn't check the error returned from a method call, which was fixed in 78f8c39.

@pauldix
Member

pauldix commented Sep 22, 2014

The file limits are still going to be a problem even with the number of shards from the Grafana DB being dropped down. At least that's what I see from @selbyk's file limits. The soft and hard limits should be set to infinity.

@perqa

perqa commented Sep 23, 2014

@pauldix: How do I set the soft and hard file limits?

@jvshahid
Contributor

There's a good post on how to change the system-wide maximum number of open files (a hard limit for the entire system) as well as the user-level limits that @pauldix mentioned above. Please don't set the nofiles limit to unlimited, since this is not portable; in fact, on my Linux Mint, setting nofiles to unlimited doesn't work. Instead, you can set the nofiles limit to the kernel's maximum number of open files, which on Linux is the output of cat /proc/sys/fs/file-max.
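
To compare the two numbers mentioned here, a small Python sketch can print the kernel-wide value next to the limits of the current process. The helper names are invented for illustration; `resource` is part of the Python standard library on Unix, and the values shown are those of the Python process itself, not of a running influxdb.

```python
import resource


def read_file_max(path="/proc/sys/fs/file-max"):
    """Return the kernel-wide maximum number of open files (Linux)."""
    with open(path) as handle:
        return int(handle.read().split()[0])


def describe(limit):
    """Render an rlimit value, spelling out the 'no limit' sentinel."""
    return "unlimited" if limit == resource.RLIM_INFINITY else str(limit)


soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("this process: soft", describe(soft), "hard", describe(hard))
try:
    print("kernel file-max:", read_file_max())
except OSError:
    print("no /proc/sys/fs/file-max here (probably not Linux)")
```

For the database itself, the per-process numbers are the ones shown by cat /proc/$(pidof influxdb)/limits earlier in the thread.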

@perqa

perqa commented Sep 23, 2014

@jvshahid: the other question is, does increasing the file limit actually solve the problem? Or does it merely postpone the crash?

I followed the instructions in this post:
https://groups.google.com/d/msg/influxdb/dVKvmHwXEo4/vylFV22nJlIJ
where it seems that on Ubuntu, infinity corresponds to 32768.

@pauldix
Member

pauldix commented Sep 24, 2014

32k is still too low. You should be able to set it much higher. Where did you find that infinity maps to 32k?

@perqa

perqa commented Sep 24, 2014

@pauldix: I might have jumped to conclusions there. I thought they were addressing the same problem in the post I referred to. Do you have any recommendation on a minimum level?

@pauldix
Member

pauldix commented Sep 24, 2014

You should set it to infinity, unless there's some compelling reason to do otherwise. Short of that, maybe 100,000?

@perqa

perqa commented Sep 24, 2014

I'm on Ubuntu 14, and from what I can find there is a maximum limit.
For example: http://viewsby.wordpress.com/2013/01/29/ubuntu-increase-number-of-open-files/
"65535 is maximum number of files we can open in any Linux operating system, the number should not exceed 65535."

jvshahid added a commit that referenced this issue Sep 24, 2014
This was causing InfluxDB to create a new shard in the grafana db every
ten minutes. Also we talked about getting rid of this feature a while
ago, so here we go.

Fix #954

Conflicts:
	cluster/cluster_configuration.go
@ThiruKumar

ThiruKumar commented Aug 20, 2016

Hello Friends,

I have a single InfluxDB node holding over 170 GB of data. Things were working well, but today it crashed suddenly. After a restart using "service influxdb restart", the status shows

sometimes "influxdb Process is not running [ FAILED ]"
and after 10 seconds "influxdb Process is not running [ OK ]"
and again after 10 seconds "influxdb Process is not running [ FAILED ]"
again after 10 seconds "influxdb Process is not running [ OK ]"

When I look at the log file, it shows the following error. Please help me, friends. I can't access ports 8083 and 8086.

panic: not ordered: 712 1455942607000000000 >= 1455942607000000000

goroutine 676 [running]:
panic(0xa3c020, 0xc84380d9e0)
        /usr/local/go/src/runtime/panic.go:481 +0x3e6
github.com/influxdata/influxdb/tsdb/engine/tsm1.Values.assertOrdered(0xc857ed4000, 0x3db, 0x3db)
        /root/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/encoding.gen.go:39 +0x2fc
github.com/influxdata/influxdb/tsdb/engine/tsm1.(*tsmKeyIterator).chunk(0xc84d7a3c00, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
        /root/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/compact.go:1026 +0x70
github.com/influxdata/influxdb/tsdb/engine/tsm1.(*tsmKeyIterator).combine(0xc84d7a3c00, 0xc85a6d1201, 0x0, 0x0, 0x0)
        /root/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/compact.go:953 +0x16f
github.com/influxdata/influxdb/tsdb/engine/tsm1.(*tsmKeyIterator).merge(0xc84d7a3c00)
        /root/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/compact.go:901 +0x12f
github.com/influxdata/influxdb/tsdb/engine/tsm1.(*tsmKeyIterator).Next(0xc84d7a3c00, 0xc803f39cef)
        /root/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/compact.go:868 +0xbff
github.com/influxdata/influxdb/tsdb/engine/tsm1.(*Compactor).write(0xc848fe9f40, 0xc8607bd090, 0x42, 0x7f32fa3b8288, 0xc84d7a3c00, 0x0, 0x0)
        /root/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/compact.go:607 +0x53b
github.com/influxdata/influxdb/tsdb/engine/tsm1.(*Compactor).writeNewFiles(0xc848fe9f40, 0x2d4, 0x4, 0x7f32fa3b8288, 0xc84d7a3c00, 0x0, 0x0, 0x0, 0x0, 0x0)
        /root/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/compact.go:554 +0x3a1
github.com/influxdata/influxdb/tsdb/engine/tsm1.(*Compactor).compact(0xc848fe9f40, 0x486200, 0xc8605cce40, 0x4, 0x4, 0x0, 0x0, 0x0, 0x0, 0x0)
        /root/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/compact.go:513 +0x4fe
github.com/influxdata/influxdb/tsdb/engine/tsm1.(*Compactor).CompactFull(0xc848fe9f40, 0xc8605cce40, 0x4, 0x4, 0x0, 0x0, 0x0, 0x0, 0x0)
        /root/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/compact.go:526 +0xfd
github.com/influxdata/influxdb/tsdb/engine/tsm1.(*Engine).compactTSMLevel.func1(0xc8531b32f0, 0xc8411b4180, 0x3, 0x0, 0x0, 0xc8605cce40, 0x4, 0x4)
        /root/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/engine.go:767 +0xfd7
created by github.com/influxdata/influxdb/tsdb/engine/tsm1.(*Engine).compactTSMLevel
        /root/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/engine.go:786 +0x265
[run] 2016/08/20 13:37:44 InfluxDB starting, version 1.0.0-beta1, branch master, commit bf3c226
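
The panic text reads like a check that timestamps are strictly increasing: the two values printed around ">=" are identical, which suggests that two adjacent points sharing one timestamp are enough to trip it. The Python sketch below is a guess at the shape of such a check based on the message alone; it is not the tsm1 code, the function name is invented, and whether the reported index refers to the first or second element of the pair is an assumption.

```python
def first_unordered(timestamps):
    """Return (i, prev, cur) for the first i where timestamps[i-1] >= timestamps[i].

    Returns None when the sequence is strictly increasing.
    """
    for i in range(1, len(timestamps)):
        prev, cur = timestamps[i - 1], timestamps[i]
        if prev >= cur:
            return i, prev, cur
    return None


# Made-up sequence ending in a duplicate of the timestamp from the panic.
ts = [1455942605000000000, 1455942607000000000, 1455942607000000000]
print(first_unordered(ts))
# (2, 1455942607000000000, 1455942607000000000)
```

Under that reading, a duplicate timestamp fails the check just like an out-of-order one.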

@ThiruKumar

Sorry friends, the issue was rectified by my colleague (Thomas Kurz) with a simple version update, and InfluxDB is up again. My InfluxDB was beta1.0 and it has been upgraded to the beta1.3 version; I wasn't aware of it. It is working like a charm again. Thanks for the bug fixes.

@ThiruKumar
The same issue has come back on the "beta1.3 version" as well, so I need to find another way to fix it. I thought everything was OK, but the error persists.

Below is the output of "/var/log/influxdb/influxd.log"

panic: not ordered: 793 1455352203000000000 >= 1455352203000000000

goroutine 2026 [running]:
panic(0xa4ac60, 0xc87f73d340)
/usr/local/go/src/runtime/panic.go:481 +0x3e6
github.com/influxdata/influxdb/tsdb/engine/tsm1.Values.assertOrdered(0xc8c5980000, 0x3e8, 0x3e8)
/root/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/encoding.gen.go:51 +0x2fc
github.com/influxdata/influxdb/tsdb/engine/tsm1.(*tsmKeyIterator).chunk(0xc855124100, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
/root/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/compact.go:1138 +0x70
github.com/influxdata/influxdb/tsdb/engine/tsm1.(*tsmKeyIterator).combine(0xc855124100, 0xc8c620d001, 0x0, 0x0, 0x0)
/root/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/compact.go:1065 +0x16f
github.com/influxdata/influxdb/tsdb/engine/tsm1.(*tsmKeyIterator).merge(0xc855124100)
/root/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/compact.go:1013 +0x12f
github.com/influxdata/influxdb/tsdb/engine/tsm1.(*tsmKeyIterator).Next(0xc855124100, 0xc80425fe9f)
/root/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/compact.go:980 +0xbff
github.com/influxdata/influxdb/tsdb/engine/tsm1.(*Compactor).write(0xc8a309ad20, 0xc8622a4fa0, 0x42, 0x7f98c1f0a3d8, 0xc855124100, 0x0, 0x0)
/root/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/compact.go:719 +0x53b
github.com/influxdata/influxdb/tsdb/engine/tsm1.(*Compactor).writeNewFiles(0xc8a309ad20, 0x17c, 0x4, 0x7f98c1f0a3d8, 0xc855124100, 0x0, 0x0, 0x0, 0x0, 0x0)
/root/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/compact.go:666 +0x3a1
github.com/influxdata/influxdb/tsdb/engine/tsm1.(*Compactor).compact(0xc8a309ad20, 0x0, 0xc822714a00, 0x4, 0x4, 0x0, 0x0, 0x0, 0x0, 0x0)
/root/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/compact.go:602 +0x4fe
github.com/influxdata/influxdb/tsdb/engine/tsm1.(*Compactor).CompactFull(0xc8a309ad20, 0xc822714a00, 0x4, 0x4, 0x0, 0x0, 0x0, 0x0, 0x0)
/root/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/compact.go:615 +0x115
github.com/influxdata/influxdb/tsdb/engine/tsm1.(*Engine).compactTSMLevel.func1(0xc8ba4c8ad0, 0xc893d49930, 0x3, 0xc8724aa800, 0x0, 0xc822714a00, 0x4, 0x4)
/root/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/engine.go:844 +0x113c
created by github.com/influxdata/influxdb/tsdb/engine/tsm1.(*Engine).compactTSMLevel
/root/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/engine.go:866 +0x28e
[run] 2016/08/20 14:42:52 InfluxDB starting, version 1.0.0-beta3, branch master, commit 30efa2d
[run] 2016/08/20 14:42:52 Go version go1.6.2, GOMAXPROCS set to 8
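
Reading the panic message above, the two timestamps it prints are equal, and the comparison shown is `>=`. That suggests the check expects strictly increasing timestamps within a block and tripped at index 793 on a duplicate. The following is only a minimal sketch of that kind of check, not the InfluxDB source; `firstUnordered` is a hypothetical helper and the sample values are made up for illustration.

```go
package main

import "fmt"

// firstUnordered returns the index of the first timestamp that is not
// strictly greater than its predecessor, or -1 if the slice is strictly
// increasing. Hypothetical helper, not part of InfluxDB.
func firstUnordered(ts []int64) int {
	for i := 1; i < len(ts); i++ {
		if ts[i-1] >= ts[i] {
			return i
		}
	}
	return -1
}

func main() {
	// Illustrative data: the last two timestamps are duplicates.
	ts := []int64{1455352202000000000, 1455352203000000000, 1455352203000000000}
	if i := firstUnordered(ts); i >= 0 {
		fmt.Printf("not ordered: %d %d >= %d\n", i, ts[i-1], ts[i])
	}
}
```

Under this reading, a duplicate timestamp is enough to trigger the failure; the values do not have to be out of order.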

@shaunwarman
Same issue, and all of a sudden too. Not sure why; it has been working for a long time (>2 years).

Limits:

stack@influx-7197:/opt/influxdb/shared$ cat /proc/1799/limits
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            8388608              unlimited            bytes
Max core file size        0                    unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             120162               120162               processes
Max open files            65536                65536                files
Max locked memory         65536                65536                bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       120162               120162               signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us

Shards:
max_open_shards=0 // unlimited; my Grafana configs create about 20 shards

Queries:
I also have Heka anomaly detection querying Influx, and I see these iffy queries at startup in log.txt:

[2016/09/02 11:25:20 PDT] [INFO] (github.com/influxdb/influxdb/server.(*Server).ListenAndServe:139) Starting admin interface on port 8083
[2016/09/02 11:25:20 PDT] [INFO] (github.com/influxdb/influxdb/server.(*Server).ListenAndServe:156) Graphite input plugins is disabled
[2016/09/02 11:25:20 PDT] [INFO] (github.com/influxdb/influxdb/server.(*Server).ListenAndServe:178) Collectd input plugins is disabled
[2016/09/02 11:25:20 PDT] [INFO] (github.com/influxdb/influxdb/server.(*Server).ListenAndServe:187) UDP server is disabled
[2016/09/02 11:25:20 PDT] [INFO] (github.com/influxdb/influxdb/server.(*Server).ListenAndServe:187) UDP server is disabled
[2016/09/02 11:25:20 PDT] [INFO] (github.com/influxdb/influxdb/server.(*Server).ListenAndServe:216) Starting Http Api server on port 8086
[2016/09/02 11:25:21 PDT] [INFO] (github.com/influxdb/influxdb/coordinator.(*Coordinator).RunQuery:41) Start Query: db: npmjs, u: root, q: select count(status) from "request" where (time < 1472840700000000000) AND (time > 1472840580000000000) group by time(1m)
[2016/09/02 11:25:21 PDT] [INFO] (github.com/influxdb/influxdb/coordinator.(*Coordinator).RunQuery:41) Start Query: db: npmjs, u: root, q: select count(status) from "request" where (time < 1472840721000000000) AND (time > 1472840623000000000) group by time(1s)
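
The `time` bounds in the two logged queries look like nanosecond Unix timestamps. Assuming that is what they are, a small Go sketch can decode them to check which window each query covers; `nsToUTC` is a hypothetical helper name used only for this illustration.

```go
package main

import (
	"fmt"
	"time"
)

// nsToUTC formats a nanosecond Unix timestamp as RFC 3339 in UTC.
// Hypothetical helper for illustration only.
func nsToUTC(ns int64) string {
	return time.Unix(0, ns).UTC().Format(time.RFC3339)
}

func main() {
	// Bounds copied from the first logged query (group by time(1m)).
	for _, ns := range []int64{1472840580000000000, 1472840700000000000} {
		fmt.Println(ns, "=", nsToUTC(ns))
	}
}
```

Decoded this way, the first query spans 18:23:00 to 18:25:00 UTC on 2016-09-02, which is the two minutes up to 11:25 PDT, the time the server started according to the log.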

Any ideas? I turned off Heka and restarted Influx, and it is still a no-go: it silently dies after about 30 seconds. I changed logging to debug in config.toml and still see nothing obvious.
