
Queue full warning #121

Closed
altvnk opened this issue Aug 6, 2015 · 20 comments
@altvnk

altvnk commented Aug 6, 2015

Hi! I'm getting some problems with cyanite after a few minutes of running. I'm using [graphite-stresser](https://github.com/feangulo/graphite-stresser) to load some data. The heap is set to 512m. Here is a log fragment:

```
WARN [2015-08-06 13:22:23,844] nioEventLoopGroup-3-1 - io.netty.channel.DefaultChannelPipeline An exceptionCaught() event was fired, and it reached at the tail of the pipeline. It usually means the last handler in the pipeline did not handle the exception.
java.lang.IllegalStateException: Queue full
    at java.util.AbstractQueue.add(AbstractQueue.java:98) ~[na:1.8.0_51]
    at java.util.concurrent.ArrayBlockingQueue.add(ArrayBlockingQueue.java:312) ~[na:1.8.0_51]
    at sun.reflect.GeneratedMethodAccessor156.invoke(Unknown Source) ~[na:na]
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[na:1.8.0_51]
    at java.lang.reflect.Method.invoke(Method.java:497) ~[na:1.8.0_51]
    at clojure.lang.Reflector.invokeMatchingMethod(Reflector.java:93) ~[cyanite-0.5.1-standalone.jar:na]
    at clojure.lang.Reflector.invokeInstanceMethod(Reflector.java:28) ~[cyanite-0.5.1-standalone.jar:na]
    at io.cyanite.engine.queue$queue_engine$reify__1599.add_BANG_(queue.clj:40) ~[cyanite-0.5.1-standalone.jar:na]
    at io.cyanite.engine.Engine.accept_BANG_(engine.clj:54) ~[cyanite-0.5.1-standalone.jar:na]
    at io.cyanite.input.carbon$pipeline$fn__11523.invoke(carbon.clj:38) ~[cyanite-0.5.1-standalone.jar:na]
    at io.cyanite.input.carbon.proxy$io.netty.channel.ChannelInboundHandlerAdapter$ff19274a.channelRead(Unknown Source) ~[cyanite-0.5.1-standalone.jar:na]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) [cyanite-0.5.1-standalone.jar:na]
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) [cyanite-0.5.1-standalone.jar:na]
    at io.netty.handler.timeout.ReadTimeoutHandler.channelRead(ReadTimeoutHandler.java:152) ~[cyanite-0.5.1-standalone.jar:na]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) [cyanite-0.5.1-standalone.jar:na]
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) [cyanite-0.5.1-standalone.jar:na]
    at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) [cyanite-0.5.1-standalone.jar:na]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) [cyanite-0.5.1-standalone.jar:na]
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) [cyanite-0.5.1-standalone.jar:na]
    at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:244) [cyanite-0.5.1-standalone.jar:na]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) [cyanite-0.5.1-standalone.jar:na]
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) [cyanite-0.5.1-standalone.jar:na]
    at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846) [cyanite-0.5.1-standalone.jar:na]
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131) [cyanite-0.5.1-standalone.jar:na]
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) [cyanite-0.5.1-standalone.jar:na]
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) [cyanite-0.5.1-standalone.jar:na]
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) [cyanite-0.5.1-standalone.jar:na]
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) [cyanite-0.5.1-standalone.jar:na]
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:110) [cyanite-0.5.1-standalone.jar:na]
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137) [cyanite-0.5.1-standalone.jar:na]
    at java.lang.Thread.run(Thread.java:745) [na:1.8.0_51]
```

Does it mean that cyanite fails due to slow writes into Cassandra?
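
For reference, the trace shows the carbon input handing points to `ArrayBlockingQueue.add`, which throws `IllegalStateException: Queue full` once the queue reaches capacity, unlike `offer`, which rejects quietly. A minimal sketch of that failure mode (a toy queue, not cyanite's actual code):

```clojure
(import '(java.util.concurrent ArrayBlockingQueue))

;; A capacity-1 queue stands in for cyanite's bounded ingest queue.
(let [q (ArrayBlockingQueue. 1)]
  (.add q :point-1)       ;; fills the queue
  (.offer q :point-2)     ;; => false, a quiet rejection
  (try
    (.add q :point-3)     ;; add throws instead of returning false
    (catch IllegalStateException e
      (.getMessage e))))  ;; => "Queue full"
```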

@AnderEnder

+1

@AnderEnder

I have the same error during high load with a 6 GB heap. The strange thing is that graphite-carbon manages to handle the load, while cyanite cannot.

@jeffpierce

Same thing here; it then falls over with an out-of-memory error in under 25 minutes in a low-volume staging environment.

@gmlwall

gmlwall commented Aug 26, 2015

I am also trying to PoC this, and as soon as I send 1/8th of our prod traffic to the cyanite process I receive these errors straight away.
As mentioned by AnderEnder, graphite-carbon can deal with this load (actually 1/4 of the total load), although the caveat is that the 1/4 goes over carbon-c-relay, relaying to 9 carbon processes.

Was this an issue in previous versions?

@nherson

nherson commented Sep 1, 2015

I am in the exact same situation as @gmlwall: this error starts almost immediately after cyanite starts up. Similarly, I know my standard graphite-carbon stack can handle this load. I have also used the cassandra-stress tool, and I don't believe there are issues with my Cassandra setup. Has anyone been looking into this and found anything helpful?

@gmlwall

gmlwall commented Sep 3, 2015

I added the following to the config to try to give the process a bit more headroom:

```yaml
queues:
  default-poolsize: 24
  queue-capacity: 100000
```

where `default-poolsize` is the number of CPU cores on the server, to make sure it wasn't limited by CPU. It lasted a bit longer before the I/O errors appeared, but not much longer :(

@pyr
Owner

pyr commented Oct 2, 2015

@altvnk I will start introducing graphite-stresser to validate further commits and resolve this issue, thanks.

@pyr
Owner

pyr commented Oct 2, 2015

@altvnk would you mind sharing the details of the -Xmx/-Xms params you used for cyanite and the stresser params you used to reproduce the issue?

@altvnk
Author

altvnk commented Oct 4, 2015

Sure.
I used 1024M for both -Xmx and -Xms on the java startup line. As I remember, I used 100 hosts, 128 timers, and a 10 sec reporting interval, like this: `java -jar stresser.jar 127.0.0.1 2003 100 128 10 true`

@altvnk
Author

altvnk commented Oct 4, 2015

Off topic: I've also run into issues with graphite-api, where Grafana cannot connect and reach any metrics. I forget the details, but if you guys are going to work on these projects again I would file issues with pleasure, because I'm seriously interested in this project. Unfortunately I don't have any experience with Clojure or Java at all, so I cannot contribute.

@pyr
Owner

pyr commented Oct 5, 2015

@altvnk Hi, thanks for the feedback. The latest commit helps a lot, but doesn't solve all issues; I'm working on fixing the behavior. The issue with graphite-api is known and registered: I'm currently changing the way cyanite interacts with Grafana.

Cheers!

@pyr
Owner

pyr commented Oct 6, 2015

Hi @altvnk,

I'm nearing the end of my work on wip/instrumented. I haven't tested the index write-path just yet, but going to Cassandra I can now push with `java -jar stresser.jar 127.0.0.1 2003 100 128 2 true` (so the same config, but every 2 sec) with the following config:

```yaml
engine:
  rules:
    default: [ "5s:1h" ]
api:
  port: 8080
input:
  - type: carbon
    port: 2003
index:
  type: empty
queues:
  defaults:
    ingestq:
      pool-size: 100
      queue-capacity: 2000000
    writeq:
      pool-size: 100
      queue-capacity: 2000000
store:
  cluster: 'localhost'
  keyspace: 'metric'
logging:
  level: info
  console: true
```

Note how you can now provide per-queue defaults. The trick here is to give the ingest queue (`ingestq` above) some room to breathe.
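
Back-of-the-envelope, assuming graphite-stresser publishes its usual 15 submetrics per timer: 100 hosts × 128 timers × 15 ≈ 192,000 metrics per 2-second flush, i.e. roughly 96,000 points/s, so a 2,000,000-slot ingest queue buys on the order of 20 seconds of backlog when Cassandra stalls.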

There are now metrics exposed through JMX and /tmp/csv by default, which can help make sure the write-path and input-queues are in good shape; I'll add the ability to flush these to Cassandra as well for good measure.

With this I can successfully run at -Xms512m -Xmx512m with no trouble. When a queue-full error occurs, it doesn't kill the daemon either. This is all in the wip/instrumented branch, and I will modify and clean up the commits before merging.

Once this is done, it will leave room to update the API part to be compatible with graphite-api, and the release will be around the corner.

@altvnk
Author

altvnk commented Oct 6, 2015

Awesome! Will take a look into it again as soon as possible.

@pyr
Owner

pyr commented Oct 6, 2015

@AnderEnder @gmlwall @jeffpierce @nherson

#134 fixes this issue and should be satisfactory. Please pay attention to your index configuration. I recommend opting for the elasticsearch index if you're in a real-world scenario.

I'm leaving this issue open for now.

@altvnk
Author

altvnk commented Oct 9, 2015

Not sure what I'm doing wrong, but I have empty metrics in Cassandra.

```
cqlsh> SELECT * FROM metric.metric ;

 id | time | point
----+------+-------

(0 rows)
```

The schema was created from schema.cql.
The strangest thing is that cyanite receives metrics from python-diamond and the stresser. So I'm wondering: where are they stored, then?
PS: 127.0.0.1:8080/paths returns an empty array as well.

@pyr
Owner

pyr commented Oct 9, 2015

Hi @altvnk,

With the above config, the index will not be provisioned, since I use the "empty" indexer.
But I am seeing metrics pushed to Cassandra. What is your flush period? (Or better, can you paste the config that you use here?)

Cheers!

@altvnk
Author

altvnk commented Oct 9, 2015

Right, I changed the index to memory; I forgot to note this.
Anyway, I'll post my config here in a moment.

@altvnk
Author

altvnk commented Oct 9, 2015

OK, in the example above I've only changed the index to memory and the carbon listen address to 0.0.0.0. Cassandra settings are the defaults. What should I verify?

@altvnk
Author

altvnk commented Oct 9, 2015

Sorry for the confusion; I tested on another Cassandra cluster, and metrics are being written.

Now, about the original issue: performance has improved, and I'm not seeing the error messages anymore.

pyr added a commit that referenced this issue Oct 15, 2015
Improve performance and Graphite compatibility.

- [X] Refactor search interface
- [X] Cassandra search implementation
- [X] Graphite query parser
- [X] Load test procedure

Refactor search interface
-------------------------

The search functionality now puts a lot less responsibility in
the hands of implementers. Three functions are now expected from
implementations:

```clojure
(defprotocol MetricIndex
  (push-segment! [this pos segment path length])
  (by-pos        [this pos])
  (by-segment    [this pos segment]))
```

Based on these primitives, cyanite now builds a simple inverted
index of the following structure, given the input paths:
`collectd.web01.cpu` and `collectd.web02.cpu`:

```json
{
  "segments": {
    0: [ "collectd" ],
    1: [ "web01", "web02" ],
    2: [ "cpu" ]
  },
  "paths": {
    [0, "collectd"]: [["collectd.web01.cpu", 3], ["collectd.web02.cpu", 3]],
    [1, "web01"]:    [["collectd.web01.cpu", 3]],
    [1, "web02"]:    [["collectd.web02.cpu", 3]],
    [2, "cpu"]:      [["collectd.web01.cpu", 3], ["collectd.web02.cpu", 3]]
  }
}
```

Given this structure, cyanite will now split paths into segments, and perform
globbing queries on segments.

- `push-segment!`: Register a new path.
- `by-pos`: Yield all segments at a given position.
- `by-segment`: Yield paths for a position and segment tuple.
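
To make the contract concrete, a minimal in-memory implementation of the protocol could look like this (a hypothetical sketch, not the actual `AgentIndex`, which routes updates through an agent):

```clojure
(defrecord InMemoryIndex [state]
  MetricIndex
  (push-segment! [this pos segment path length]
    ;; Record the segment at its position, and associate the full
    ;; path (with its length) to the [pos segment] tuple.
    (swap! state
           (fn [s]
             (-> s
                 (update-in [:segments pos] (fnil conj (sorted-set)) segment)
                 (update-in [:paths [pos segment]] (fnil conj #{}) [path length])))))
  (by-pos [this pos]
    (get-in @state [:segments pos] #{}))
  (by-segment [this pos segment]
    (get-in @state [:paths [pos segment]] #{})))

(defn in-memory-index []
  (->InMemoryIndex (atom {:segments {} :paths {}})))
```

Registering `collectd.web01.cpu` then means three `push-segment!` calls: position 0 with "collectd", 1 with "web01", 2 with "cpu", each carrying the full path and its length of 3.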

The globbing implementation is somewhat naive and leaves room for improvement;
implementations should aim to sort segments. Subsequent commits will bypass
`by-pos` whenever possible and perform prefix lookups up to the first wildcard
to further reduce lookup times.
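
The literal prefix of a query could be extracted along these lines (a hypothetical helper, not code from the tree):

```clojure
(require '[clojure.string :as str])

(defn literal-prefix
  "Keep the leading segments free of glob characters; only the
   remainder of the query needs actual glob matching."
  [query]
  (->> (str/split query #"\.")
       (take-while #(not (re-find #"[*?\[{]" %)))
       (str/join ".")))

(literal-prefix "collectd.web*.cpu") ;; => "collectd"
```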

Two implementations of this protocol are provided:

- `AgentIndex` stores segments and paths in memory, updates go through an agent.
- `CassandraIndex` provides Cassandra-backed storage for paths and segments.

The ElasticSearch-backed index is now gone, but a compatible implementation will
be provided as a subsequent improvement.

Graphite Query Parser
---------------------

A tokenizer for the Graphite syntax has been living in the tree for a while.
A subset of the syntax is now handled and may be translated to an AST.

A `run-query!` function is also provided: it walks the tokens to extract
paths, queries those paths (handling globs), and then reduces them to the
result of the operation if successful.

The following operations are already implemented:

- `sumSeries`
- `divideSeries`
- `scale`
- `absolute`

Globbing is handled by https://github.com/pyr/globber and adheres to
globbing rules as available in common shells.
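
As an illustration of those rules applied to dotted metric paths, a glob can be pictured as a regex translation where `*` stays within a segment (an illustrative sketch, not globber's actual implementation):

```clojure
(require '[clojure.string :as str])

(defn glob->regex
  "Translate a glob: `*` matches any run of non-dot characters,
   `?` matches a single non-dot character."
  [glob]
  (re-pattern
   (-> glob
       (str/replace "." "\\.")
       (str/replace "*" "[^.]*")
       (str/replace "?" "[^.]"))))

(re-matches (glob->regex "collectd.web*.cpu") "collectd.web01.cpu")
;; => "collectd.web01.cpu"
```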

Load testing procedure
----------------------

Cyanite now integrates https://github.com/feangulo/graphite-stresser for
development. The baseline against which it is tested is a workload of
200,000 metrics per second, flushed at a 5-second interval, with a
maximum heap size of 512m.

Remaining work
--------------

These changes fix & improve the following ongoing issues:

- #119
- The path indexing part of #121.
- Most of the work needed for #136.
@pyr
Owner

pyr commented Oct 19, 2015

Hi @altvnk,

Thanks for your help in testing this. I will close this ticket for now and focus on finishing the direct Grafana integration.

Sounds as if the release is getting really close now :-)
