
Queue full warning #121

Closed
altvnk opened this issue Aug 6, 2015 · 20 comments
@altvnk

altvnk commented Aug 6, 2015

Hi! I'm getting some problems with cyanite after a few minutes of running. I'm using [graphite-stresser](https://github.com/feangulo/graphite-stresser) to load some data. The heap is set to 512m. Here is a log fragment:

```
WARN [2015-08-06 13:22:23,844] nioEventLoopGroup-3-1 - io.netty.channel.DefaultChannelPipeline An exceptionCaught() event was fired, and it reached at the tail of the pipeline. It usually means the last handler in the pipeline did not handle the exception.
java.lang.IllegalStateException: Queue full
    at java.util.AbstractQueue.add(AbstractQueue.java:98) ~[na:1.8.0_51]
    at java.util.concurrent.ArrayBlockingQueue.add(ArrayBlockingQueue.java:312) ~[na:1.8.0_51]
    at sun.reflect.GeneratedMethodAccessor156.invoke(Unknown Source) ~[na:na]
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[na:1.8.0_51]
    at java.lang.reflect.Method.invoke(Method.java:497) ~[na:1.8.0_51]
    at clojure.lang.Reflector.invokeMatchingMethod(Reflector.java:93) ~[cyanite-0.5.1-standalone.jar:na]
    at clojure.lang.Reflector.invokeInstanceMethod(Reflector.java:28) ~[cyanite-0.5.1-standalone.jar:na]
    at io.cyanite.engine.queue$queue_engine$reify__1599.add_BANG_(queue.clj:40) ~[cyanite-0.5.1-standalone.jar:na]
    at io.cyanite.engine.Engine.accept_BANG_(engine.clj:54) ~[cyanite-0.5.1-standalone.jar:na]
    at io.cyanite.input.carbon$pipeline$fn__11523.invoke(carbon.clj:38) ~[cyanite-0.5.1-standalone.jar:na]
    at io.cyanite.input.carbon.proxy$io.netty.channel.ChannelInboundHandlerAdapter$ff19274a.channelRead(Unknown Source) ~[cyanite-0.5.1-standalone.jar:na]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) [cyanite-0.5.1-standalone.jar:na]
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) [cyanite-0.5.1-standalone.jar:na]
    at io.netty.handler.timeout.ReadTimeoutHandler.channelRead(ReadTimeoutHandler.java:152) ~[cyanite-0.5.1-standalone.jar:na]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) [cyanite-0.5.1-standalone.jar:na]
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) [cyanite-0.5.1-standalone.jar:na]
    at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) [cyanite-0.5.1-standalone.jar:na]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) [cyanite-0.5.1-standalone.jar:na]
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) [cyanite-0.5.1-standalone.jar:na]
    at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:244) [cyanite-0.5.1-standalone.jar:na]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) [cyanite-0.5.1-standalone.jar:na]
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) [cyanite-0.5.1-standalone.jar:na]
    at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846) [cyanite-0.5.1-standalone.jar:na]
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131) [cyanite-0.5.1-standalone.jar:na]
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) [cyanite-0.5.1-standalone.jar:na]
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) [cyanite-0.5.1-standalone.jar:na]
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) [cyanite-0.5.1-standalone.jar:na]
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) [cyanite-0.5.1-standalone.jar:na]
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:110) [cyanite-0.5.1-standalone.jar:na]
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137) [cyanite-0.5.1-standalone.jar:na]
    at java.lang.Thread.run(Thread.java:745) [na:1.8.0_51]
```

Does it mean that cyanite fails due to slow writes into Cassandra?
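
For reference, the trace shows the carbon input handing points to `ArrayBlockingQueue.add`, which throws `IllegalStateException: Queue full` once the queue reaches capacity, unlike `offer`, which rejects quietly. A minimal sketch of that failure mode (a toy queue, not cyanite's actual code):

```clojure
(import '(java.util.concurrent ArrayBlockingQueue))

;; A capacity-1 queue stands in for cyanite's bounded ingest queue.
(let [q (ArrayBlockingQueue. 1)]
  (.add q :point-1)       ;; fills the queue
  (.offer q :point-2)     ;; => false, a quiet rejection
  (try
    (.add q :point-3)     ;; add throws instead of returning false
    (catch IllegalStateException e
      (.getMessage e))))  ;; => "Queue full"
```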

@AnderEnder

+1

@AnderEnder

I have the same error during high load with a 6 GB heap. The strange thing is that graphite-carbon manages to handle the load, while cyanite cannot.

@jeffpierce

Same thing here; it then falls over with an out-of-memory error in under 25 minutes in a low-volume staging environment.

@gmlwall

gmlwall commented Aug 26, 2015

I am also trying to PoC this, and as soon as I send 1/8th of our prod traffic to the cyanite process I receive these errors straight away.
As mentioned by AnderEnder, graphite-carbon can deal with this load (actually 1/4 of the total load), although the caveat is that the 1/4 goes over carbon-c-relay, relaying to 9 carbon processes.

Was this an issue in previous versions?

@nherson

nherson commented Sep 1, 2015

I am in the exact same situation as @gmlwall: this error starts almost immediately after cyanite starts up. Similarly, I know my standard graphite-carbon stack can handle this load. I have also used the cassandra-stress tool, and I don't believe there are issues with my Cassandra setup. Has anyone been looking into this and found anything helpful?

@gmlwall

gmlwall commented Sep 3, 2015

I added the following to the config to try to give the process a bit more headroom:

```yaml
queues:
  default-poolsize: 24
  queue-capacity: 100000
```

where `default-poolsize` is the number of CPU cores on the server, to make sure it wasn't limited by CPU. It lasted a bit longer before the I/O errors appeared, but not much longer :(

@pyr
Owner

pyr commented Oct 2, 2015

@altvnk I will start introducing graphite-stresser to validate further commits and resolve this issue, thanks.

@pyr
Owner

pyr commented Oct 2, 2015

@altvnk would you mind sharing the details of the -Xmx/-Xms params you used for cyanite and the stresser params you used to reproduce the issue?

@altvnk
Author

altvnk commented Oct 4, 2015

Sure.
I used 1024M for both -Xmx and -Xms on the java startup line. As I remember, I used 100 hosts, 128 timers, and a 10 sec reporting interval, like this: `java -jar stresser.jar 127.0.0.1 2003 100 128 10 true`

@altvnk
Author

altvnk commented Oct 4, 2015

Off topic: I've also run into issues with graphite-api, where Grafana cannot connect and reach any metrics. I forget the details, but if you guys are going to work on these projects again I would file issues with pleasure, because I'm seriously interested in this project. Unfortunately I don't have any experience with Clojure or Java at all, so I cannot contribute.

@pyr
Owner

pyr commented Oct 5, 2015

@altvnk Hi, thanks for the feedback. The latest commit helps a lot, but doesn't solve all issues; I'm working on fixing the behavior. The issue with graphite-api is known and registered: I'm currently changing the way cyanite interacts with Grafana.

Cheers!

@pyr
Owner

pyr commented Oct 6, 2015

Hi @altvnk,

I'm nearing the end of my work on wip/instrumented. I haven't tested the index write-path just yet, but going to Cassandra I can now push with `java -jar stresser.jar 127.0.0.1 2003 100 128 2 true` (so the same config, but every 2 sec) with the following config:

```yaml
engine:
  rules:
    default: [ "5s:1h" ]
api:
  port: 8080
input:
  - type: carbon
    port: 2003
index:
  type: empty
queues:
  defaults:
    ingestq:
      pool-size: 100
      queue-capacity: 2000000
    writeq:
      pool-size: 100
      queue-capacity: 2000000
store:
  cluster: 'localhost'
  keyspace: 'metric'
logging:
  level: info
  console: true
```

Note how you can now provide per-queue defaults. The trick here is to give the ingest queue (`ingestq` above) some room to breathe.
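
Back-of-the-envelope, assuming graphite-stresser publishes its usual 15 submetrics per timer: 100 hosts × 128 timers × 15 ≈ 192,000 metrics per 2-second flush, i.e. roughly 96,000 points/s, so a 2,000,000-slot ingest queue buys on the order of 20 seconds of backlog when Cassandra stalls.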

There are now metrics exposed through JMX and /tmp/csv by default, which can help make sure the write-path and input-queues are in good shape; I'll add the ability to flush these to Cassandra as well for good measure.

With this I can successfully run at -Xms512m -Xmx512m with no trouble. When a queue-full error occurs, it doesn't kill the daemon either. This is all in the wip/instrumented branch, and I will modify and clean up the commits before merging.

Once this is done, it will leave room to update the API part to be compatible with graphite-api, and the release will be around the corner.

@altvnk
Author

altvnk commented Oct 6, 2015

Awesome! Will take a look into it again as soon as possible.

@pyr
Owner

pyr commented Oct 6, 2015

@AnderEnder @gmlwall @jeffpierce @nherson

#134 fixes this issue and should be satisfactory. Please pay attention to your index configuration. I recommend opting for the elasticsearch index if you're in a real-world scenario.

I'm leaving this issue open for now.

@altvnk
Author

altvnk commented Oct 9, 2015

Not sure what I'm doing wrong, but I have empty metrics in Cassandra.

```
cqlsh> SELECT * FROM metric.metric ;

 id | time | point
----+------+-------

(0 rows)
```

The schema was created from schema.cql.
The strangest thing is that cyanite receives metrics from python-diamond and the stresser. So I'm wondering: where are they stored, then?
PS: 127.0.0.1:8080/paths returns an empty array as well.

@pyr
Owner

pyr commented Oct 9, 2015

Hi @altvnk,

With the above config, the index will not be provisioned, since I use the "empty" indexer.
But I am seeing metrics pushed to Cassandra. What is your flush period? (Or better, can you paste the config that you use here?)

Cheers!

@altvnk
Author

altvnk commented Oct 9, 2015

Right, I changed the index to memory; I forgot to note this.
Anyway, I'll post my config here in a moment.

@altvnk
Author

altvnk commented Oct 9, 2015

OK, in the example above I've only changed the index to memory and the carbon listen address to 0.0.0.0. Cassandra settings are the defaults. What should I verify?

@altvnk
Author

altvnk commented Oct 9, 2015

Sorry for the confusion; I tested on another Cassandra cluster, and metrics are being written.

Now, about the original issue: performance has improved, and I'm not seeing the error messages anymore.

pyr added a commit that referenced this issue Oct 15, 2015
Improve performance and Graphite compatibility.

- [X] Refactor search interface
- [X] Cassandra search implementation
- [X] Graphite query parser
- [X] Load test procedure

Refactor search interface
-------------------------

The search functionality now puts a lot less responsibility in
the hands of implementers. Three functions are now expected from
implementations:

```clojure
(defprotocol MetricIndex
  (push-segment! [this pos segment path length])
  (by-pos        [this pos])
  (by-segment    [this pos segment]))
```

Based on these primitives, cyanite now builds a simple inverted
index of the following structure, given the input paths:
`collectd.web01.cpu` and `collectd.web02.cpu`:

```json
{
  "segments": {
    0: [ "collectd" ],
    1: [ "web01", "web02" ],
    2: [ "cpu" ]
  },
  "paths": {
    [0, "collectd"]: [["collectd.web01.cpu", 3], ["collectd.web02.cpu", 3]],
    [1, "web01"]:    [["collectd.web01.cpu", 3]],
    [1, "web02"]:    [["collectd.web02.cpu", 3]],
    [2, "cpu"]:      [["collectd.web01.cpu", 3], ["collectd.web02.cpu", 3]]
  }
}
```

Given this structure, cyanite will now split paths into segments, and perform
globbing queries on segments.

- `push-segment!`: Register a new path.
- `by-pos`: Yield all segments at a given position.
- `by-segment`: Yield paths for a position and segment tuple.
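
To make the contract concrete, a minimal in-memory implementation of the protocol could look like this (a hypothetical sketch, not the actual `AgentIndex`, which routes updates through an agent):

```clojure
(defrecord InMemoryIndex [state]
  MetricIndex
  (push-segment! [this pos segment path length]
    ;; Record the segment at its position, and associate the full
    ;; path (with its length) to the [pos segment] tuple.
    (swap! state
           (fn [s]
             (-> s
                 (update-in [:segments pos] (fnil conj (sorted-set)) segment)
                 (update-in [:paths [pos segment]] (fnil conj #{}) [path length])))))
  (by-pos [this pos]
    (get-in @state [:segments pos] #{}))
  (by-segment [this pos segment]
    (get-in @state [:paths [pos segment]] #{})))

(defn in-memory-index []
  (->InMemoryIndex (atom {:segments {} :paths {}})))
```

Registering `collectd.web01.cpu` then means three `push-segment!` calls: position 0 with "collectd", 1 with "web01", 2 with "cpu", each carrying the full path and its length of 3.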

The globbing implementation is somewhat naive and leaves room for improvement;
implementations should aim to sort segments. Subsequent commits will bypass
`by-pos` whenever possible and perform prefix lookups up to the first wildcard
to further reduce lookup times.
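
The literal prefix of a query could be extracted along these lines (a hypothetical helper, not code from the tree):

```clojure
(require '[clojure.string :as str])

(defn literal-prefix
  "Keep the leading segments free of glob characters; only the
   remainder of the query needs actual glob matching."
  [query]
  (->> (str/split query #"\.")
       (take-while #(not (re-find #"[*?\[{]" %)))
       (str/join ".")))

(literal-prefix "collectd.web*.cpu") ;; => "collectd"
```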

Two implementations of this protocol are provided:

- `AgentIndex` stores segments and paths in memory, updates go through an agent.
- `CassandraIndex` provides Cassandra-backed storage for paths and segments.

The ElasticSearch-backed index is now gone, but a compatible implementation will
be provided as a subsequent improvement.

Graphite Query Parser
---------------------

A tokenizer for the Graphite syntax has been living in the tree for a while.
A subset of the syntax is now handled and may be translated to an AST.

A `run-query!` function is also provided: it walks the tokens to extract
paths, queries those paths (handling globs), and then reduces them to the
result of the operation if successful.

The following operations are already implemented:

- `sumSeries`
- `divideSeries`
- `scale`
- `absolute`

Globbing is handled by https://github.com/pyr/globber and adheres to
globbing rules as available in common shells.
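
As an illustration of those rules applied to dotted metric paths, a glob can be pictured as a regex translation where `*` stays within a segment (an illustrative sketch, not globber's actual implementation):

```clojure
(require '[clojure.string :as str])

(defn glob->regex
  "Translate a glob: `*` matches any run of non-dot characters,
   `?` matches a single non-dot character."
  [glob]
  (re-pattern
   (-> glob
       (str/replace "." "\\.")
       (str/replace "*" "[^.]*")
       (str/replace "?" "[^.]"))))

(re-matches (glob->regex "collectd.web*.cpu") "collectd.web01.cpu")
;; => "collectd.web01.cpu"
```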

Load testing procedure
----------------------

Cyanite now integrates https://github.com/feangulo/graphite-stresser for
development. The baseline against which it is tested is a workload of
200,000 metrics per second, flushed at a 5-second interval, with a
maximum heap size of 512m.

Remaining work
--------------

These changes fix & improve the following ongoing issues:

- #119
- The path indexing part of #121.
- Most of the work needed for #136.
@pyr
Owner

pyr commented Oct 19, 2015

Hi @altvnk,

Thanks for your help in testing this. I will close this ticket for now and focus on finishing the direct Grafana integration.

Sounds as if the release is getting really close now :-)
