Add support for StatsD style aggregator #39

Closed
pauldix opened this issue Jul 2, 2015 · 41 comments

Comments

@pauldix
Member

pauldix commented Jul 2, 2015

We should support the StatsD protocol and aggregation. However, unlike StatsD, the metric names should follow the conventions of the key section of the InfluxDB line protocol.

The StatsD values should be output as a single field called value. This should be able to flush to any of the output sinks like what is mentioned in #35.

This means that a single Telegraf instance could serve as a StatsD aggregator that works with the InfluxDB schema design of measurements and tags.

@nstott

nstott commented Jul 11, 2015

Looking at the statsd spec from here:

https://github.com/b/statsd_spec

@pauldix are you thinking of a line format something like this?

cpu_load_short,host=server01,region=us-west:2.34|g
cpu_load_short,host=server01,region=us-west:3.42|g
errors,host=server01,region=us-west:1|c

where the server adds the timestamp when it receives the message, or, perhaps more appropriately in the case of counters, when it flushes to a sink.

@liyichao

It would be good if Telegraf could add the hostname as a tag instead of the application sending it, because the application may run in a container.

@pauldix
Member Author

pauldix commented Jul 16, 2015

@nstott yeah, that's exactly what I was thinking. Telegraf should specify timestamps when it flushes. In general when writing to InfluxDB it's best to specify timestamps. That way if there is a partial write in a cluster, you can just write again and it's idempotent.

@liyichao the issue is that you'd have one telegraf server collecting all the metrics for all of your hosts (like what you do with StatsD). Essentially one of your telegraf installs would become your statsd server.

@nstott

nstott commented Jul 16, 2015

I'll see if I can knock something out for this in the next few days.

@alvaromorales
Contributor

+1

@skyrocknroll

This would be an awesome feature to have 👍

@zp-markusp

+1

@rvrignaud

+1

@caquino

caquino commented Sep 20, 2015

+1, having a replacement for StatsD/datadog-agent-statsd will make the migration from other services way easier.

@ranjib
Contributor

ranjib commented Sep 21, 2015

@pauldix is anyone actively working on this? If not, I can take a stab at it. This will be a really useful feature. I'm currently running an additional statsd agent (statsdaemon) alongside telegraf for this.
@sparrc comments?

@sparrc
Contributor

sparrc commented Sep 21, 2015

@ranjib I am hoping to work on this today

@pauldix
Member Author

pauldix commented Sep 21, 2015

With the 0.9.5 release coming we'll have support for many fields and we'll stop pushing people to only have a single field per measurement. We should support writing data to multiple fields. I'm thinking that we can support the StatsD protocol like I mentioned above, but we should also make it possible to write values into different fields. I'm thinking it should look exactly like the line protocol.

@skyrocknroll

+1 @pauldix #39 (comment)

@skyrocknroll

Is somebody working on this? Is there any ETA or target release?

@ranjib
Contributor

ranjib commented Oct 5, 2015

@skyrocknroll #237

@sparrc
Contributor

sparrc commented Oct 5, 2015

It's something I'm working on right now. At the moment I have counters, gauges, and sets working. I still have a ways to go with timers, as they're a bit more complicated.

I'm hoping to have timers working by the end of the week, life permitting ;-)

@skyrocknroll

Thank you @ranjib

@sparrc
Thank you for the kind update. Right now, just to maintain the count, we are inserting a lot of records. Once InfluxDB has statsd-style aggregation, our number of records will drop to about 1/1000th :) and performance will improve a lot.

Eagerly waiting for the release :)

@sparrc
Contributor

sparrc commented Oct 5, 2015

@skyrocknroll Since InfluxDB is a bit more powerful than Graphite, the default behavior is going to be a little different than a typical statsd server.

to give you a little preview, counters would look something like this:

Metrics sent:

$ echo "deploys.test.myservice:1|c" | nc -C -w 1 -u localhost 8125
[10s later...]
$ echo "deploys.test.myservice:1|c" | nc -C -w 1 -u localhost 8125

Telegraf debug output:

> [] statsd_deploys_test_myservice_counter value=1
2015/10/05 11:49:25 Cranking default (10s) interval, gathered 1 metrics from 1 plugins in 142.169µs
> [] statsd_deploys_test_myservice_counter value=2
2015/10/05 11:49:35 Cranking default (10s) interval, gathered 1 metrics from 1 plugins in 99.549µs
> [] statsd_deploys_test_myservice_counter value=2
2015/10/05 11:49:45 Cranking default (10s) interval, gathered 1 metrics from 1 plugins in 59.998µs

As you can see, counters will be maintained and reported at each collection interval, and they will not be cleared by default.
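The behavior in the debug output above can be sketched roughly as follows. This is only an illustration, not the plugin's actual code: the `CounterCache` class and its `reset_on_flush` switch are hypothetical names invented for this example.

```python
from collections import defaultdict


class CounterCache:
    """Hypothetical in-memory store for statsd counters (illustration only)."""

    def __init__(self, reset_on_flush=False):
        self.counters = defaultdict(float)
        self.reset_on_flush = reset_on_flush

    def add(self, line):
        # Expects a plain statsd counter line such as "bucket:1|c".
        bucket, rest = line.split(":", 1)
        value, metric_type = rest.split("|")[:2]
        if metric_type != "c":
            raise ValueError("only counters are handled in this sketch")
        self.counters[bucket] += float(value)

    def flush(self):
        # Report the running totals; keep them unless a reset was requested.
        snapshot = dict(self.counters)
        if self.reset_on_flush:
            self.counters.clear()
        return snapshot


cache = CounterCache()
cache.add("deploys.test.myservice:1|c")
print(cache.flush())  # first interval: total is 1
cache.add("deploys.test.myservice:1|c")
print(cache.flush())  # second interval: total is 2
print(cache.flush())  # third interval: nothing new, total is still 2
```

With `reset_on_flush=True` the same sequence would report 1, 1, and then nothing, which is closer to what a typical statsd server does.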

Since I've never used statsd in production, I'd love to hear what you (and anyone else in this thread) thinks of that behavior.

Thanks a bunch!

@skyrocknroll

@sparrc wherever I have used counters, they have always been associated with time, like requests per second or some action per second. So it would be better if we cleared the counter values after each flush. For gauges, maintaining values across flushes does make sense.

So default behavior

@sparrc
Contributor

sparrc commented Oct 5, 2015

My problem with resetting the counter is this: InfluxDB provides you with the ability to calculate rates of change on counters that are always increasing (like this: SELECT non_negative_derivative(value, 1s) FROM statsd_deploys_myservice_counter)

If the counter reset, this obviously wouldn't work, and calculating rates of change on the counter requires knowledge of the flushing interval. This also means that the flushing interval can never be changed once the data starts being collected. With an ever-increasing counter, you are able to change the collection interval completely arbitrarily, because you simply have timestamps associated with different points in the counters' upward trajectory.

To me this makes more sense because it is also generally how OS-level counters work, ie: network bytes & packets received and sent, CPU ticks, etc.

Let me know what you think. The general idea here is that working with InfluxDB is less limited than working with Graphite, since its query language is more featured. Statsd was a protocol built with Graphite in mind, and I'd like our implementation to support InfluxDB better.
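To make the argument about ever-increasing counters concrete, here is a rough sketch of the kind of calculation a non-negative derivative performs. It is an approximation written for this thread, not InfluxDB's implementation, and the sample points are made up.

```python
def non_negative_derivative(points, unit=1.0):
    """Rate of change between consecutive (timestamp_seconds, value) points.

    Negative rates (for example, after a counter reset) are dropped.
    """
    rates = []
    for (t0, v0), (t1, v1) in zip(points, points[1:]):
        dt = t1 - t0
        if dt <= 0:
            continue
        rate = (v1 - v0) / dt * unit
        if rate >= 0:
            rates.append((t1, rate))
    return rates


# The same cumulative counter, sampled every 10s and then after a 30s gap.
points = [(0, 0), (10, 20), (20, 40), (50, 100)]
print(non_negative_derivative(points))
# [(10, 2.0), (20, 2.0), (50, 2.0)]
```

Because each point carries its own timestamp, the rate comes out the same even though the collection interval changed partway through.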

@skyrocknroll

@sparrc I agree with you. One more question: how are we planning to write data using this?
Would we point an InfluxDB client at the telegraf statsd server, or use a separate influx-statsd client that supports tags and fields along with the measurement?

@sparrc
Contributor

sparrc commented Oct 6, 2015

It will be a "plugin" on one of your telegraf instances. That telegraf instance will open up a port and listen for UDP packets, where you can send your normal statsd-style packets. On the regular telegraf interval, the statsd server will be flushed and all data will be sent to InfluxDB.

@skyrocknroll

@sparrc Will the line format support InfluxDB tags and fields? Right now we are not using any of the statsd InfluxDB writers because they don't understand InfluxDB tags and fields.

@sparrc
Contributor

sparrc commented Oct 6, 2015

yes, it will support a way to create a mapping of a statsd "bucket" to an influxdb measurement with tags: https://github.com/influxdb/telegraf/blob/statsd/plugins/statsd/README.md

@zp-markusp

Why don't you take advantage of InfluxDB and use the line protocol syntax? That way you can define tags on the fly and don't have to rely on any hardcoded, dot-separated order.

Regards, Markus

@skyrocknroll

@sparrc as @zp-markusp said, we were looking for exactly the same feature. We see InfluxDB tags and fields as an unbeatable feature. If we use the same line protocol, then we get all the dynamism of tags and fields, plus counters and gauges at the telegraf level.

Or maybe we need both: plain statsd for the statsd protocol, and statsd features with the line protocol.

Plain statsd strips away all the awesomeness of tags and fields.

Datadog has both plain statsd and datadog-statsd, which supports tags.

@justin8

justin8 commented Oct 6, 2015

It would be very useful to support both. Being able to use it as a drop in replacement for things like datadog would be really useful, with the added benefit that you can alter your apps to utilize tags afterwards. It would make the barrier for entry incredibly low.

@sparrc
Contributor

sparrc commented Oct 6, 2015

Thanks everyone for the input, especially for the datadog-statsd link, that is very useful and it seems like they have created a good system for adding tags to statsd lines.

As I see it, there are two options we can support: datadog-statsd is closer to plain statsd and simply adds a list of tags after a |# character. influx-statsd would be similar to what @nstott wrote above. It is less similar to plain statsd but more similar to the InfluxDB line protocol.

I'm leaning towards only supporting datadog-statsd because then users can more easily migrate between influxdb and datadog, and it also allows people to use existing datadog statsd clients. If we create our own statsd protocol, we're contributing to this problem

@justin8 @skyrocknroll @zp-markusp @pauldix @nathanielc What would you prefer between these two tag formatting options? should we support both?

datadog-statsd

cpu.load.short:2.34|g|#host:server01,region:us-west

influx-statsd

cpu.load.short,host=server01,region=us-west:2.34|g
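For comparison, a minimal sketch of how each of the two example lines above might be parsed. Both function names are hypothetical, and neither is taken from Telegraf or a Datadog client; the point is only that the two formats carry the same information.

```python
def parse_datadog_statsd(line):
    """Parse 'bucket:value|type|#key:val,key:val' (tags after '|#')."""
    parts = line.split("|")
    bucket, value = parts[0].split(":", 1)
    metric_type = parts[1]
    tags = {}
    for part in parts[2:]:
        if part.startswith("#"):
            for pair in part[1:].split(","):
                key, _, val = pair.partition(":")
                tags[key] = val
    return bucket, tags, float(value), metric_type


def parse_influx_statsd(line):
    """Parse 'bucket,key=val,key=val:value|type' (tags inside the bucket)."""
    head, metric_type = line.split("|", 1)
    identifier, value = head.rsplit(":", 1)
    bucket, *tag_pairs = identifier.split(",")
    tags = dict(pair.split("=", 1) for pair in tag_pairs)
    return bucket, tags, float(value), metric_type


print(parse_datadog_statsd("cpu.load.short:2.34|g|#host:server01,region:us-west"))
print(parse_influx_statsd("cpu.load.short,host=server01,region=us-west:2.34|g"))
```

Both calls yield the same bucket, tag set, value, and type. The influx-statsd variant keeps everything before the final `:` in the position where a plain statsd client puts the bucket name.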

@skyrocknroll

@sparrc I would like to go with influx-statsd because it will give us consistency across the whole InfluxDB ecosystem. It looks very similar to the InfluxDB line protocol. Also, @pauldix #39 (comment) mentioned supporting multiple values. If we are going to design influx-statsd, let's also provide a way to support multiple field values.

Right now I don't see supporting multiple field values as very important, but others may have input on this.
If we do support multiple field values, I am thinking of something like this:

temperature,machine=unit42,type=assembly internal=32|g,external=100|c

@zp-markusp

From a gut feeling perspective I would prefer influx-statsd as this could be implemented without changing the statsd library on the application side as it follows the pattern string{identifier}{value}{statsd type}. So just the identifier has to be exchanged.

@skyrocknroll

@zp-markusp +1. One way is to parse the identifier on the telegraf side: if it has tags, use it as measurement and tags; otherwise, use the whole identifier as the measurement in InfluxDB.

@zp-markusp

For example the standard statsd output from logstash could be used.

@nathanielc
Contributor

I say influx-statsd, since it's a subset of the statsd protocol, like @zp-markusp said. It won't require a new client.

I think you should also do something similar to the graphite plugin in InfluxDB that allows you to transform a metric name into a measurement, fields, and tags set. See https://github.com/influxdb/influxdb/tree/master/services/graphite#templates

This will accommodate users who already have lots of tag data in the metric name, e.g.
us-west.server01.cpu.short.load:2.34|g
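A rough sketch of the template idea, loosely modeled on the graphite templates linked above: a dot-separated template names each position of the bucket as a tag key or as part of the measurement. The `apply_template` function and the template string used here are assumptions for illustration, not the InfluxDB graphite parser.

```python
def apply_template(template, bucket, separator="."):
    """Map positions of a dot-separated bucket onto tags and a measurement.

    'measurement*' consumes all remaining parts of the bucket.
    """
    keys = template.split(".")
    parts = bucket.split(".")
    tags = {}
    measurement = []
    for i, key in enumerate(keys):
        if i >= len(parts):
            break
        if key == "measurement*":
            measurement.extend(parts[i:])
            break
        if key == "measurement":
            measurement.append(parts[i])
        elif key:
            tags[key] = parts[i]
    return separator.join(measurement), tags


print(apply_template("region.host.measurement*", "us-west.server01.cpu.short.load"))
# ('cpu.short.load', {'region': 'us-west', 'host': 'server01'})
```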

@sparrc
Contributor

sparrc commented Oct 6, 2015

okay, good point @skyrocknroll about supporting multiple fields, how about this:

measurement[,tag1=key1,tag2=key2]:[field=]value[,field2=value2]|type

so an example would look like:

cpu.usage,host=server01,region=us-west:idle=10.0,user=50.0,system=40.0|g
=> statsd_cpu_usage_gauge,host='server01',region='us-west' idle=10,user=50,system=40

field names and tags are optional, so you could also just do this:

cpu.usage.idle:10.0|g
=> statsd_cpu_usage_idle_gauge value=10

@nathanielc thanks for pointing me to that, I did not realize that we already had a graphite template transformation setup, I was going to have telegraf have a configuration table for transforming the statsd bucket into tags like this: https://github.com/influxdb/telegraf/blob/statsd/plugins/statsd/README.md#statsd-bucket---influxdb-mapping, but I may want to borrow from the influxdb graphite template instead.
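As a sketch of the grammar proposed in this comment, the following parses `measurement[,tags]:[field=]value[,field2=value2]|type` and derives the measurement name the way the two examples show (`statsd_` prefix, dots to underscores, type suffix). The function name and the `TYPE_NAMES` table are hypothetical; only the two types that appear in this thread are mapped.

```python
# Only the types shown in the thread; anything else is out of scope here.
TYPE_NAMES = {"g": "gauge", "c": "counter"}


def parse_proposed_line(line):
    head, metric_type = line.rsplit("|", 1)
    identifier, raw_fields = head.split(":", 1)
    bucket, *tag_pairs = identifier.split(",")
    tags = dict(pair.split("=", 1) for pair in tag_pairs)

    fields = {}
    for item in raw_fields.split(","):
        name, sep, val = item.partition("=")
        if not sep:
            # No field name given: fall back to the single field "value".
            name, val = "value", item
        fields[name] = float(val)

    measurement = "statsd_{}_{}".format(bucket.replace(".", "_"), TYPE_NAMES[metric_type])
    return measurement, tags, fields


print(parse_proposed_line(
    "cpu.usage,host=server01,region=us-west:idle=10.0,user=50.0,system=40.0|g"))
print(parse_proposed_line("cpu.usage.idle:10.0|g"))
```

A line with no tags and no field names still parses, which is what keeps the format backwards compatible with plain statsd.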

@zp-markusp

@sparrc does it make sense to hard-code the type (gauge) as a suffix of the name?
I would propose to either ignore it or add it as a tag (statsd-type=gauge).

@nathanielc
Contributor

I don't see a strong need to support multiple fields either. StatsD is an event counter, seems odd to want to send multiple fields for a single event. But as long as it is backwards compatible with the StatsD protocol (like your example) I don't see an issue supporting it.

@sparrc
Contributor

sparrc commented Oct 6, 2015

@zp-markusp I like that idea more too, I'll change the behavior to add a metric_type tag 👍

@justin8

justin8 commented Oct 6, 2015

Bit late to reply to this one now; but the way it seems to be heading sounds great! Backwards compatible with extra features/tags 👍

@sparrc sparrc closed this as completed in 6977119 Oct 15, 2015
@sparrc
Contributor

sparrc commented Oct 15, 2015

This is now in master and is available by building from source. See the README here for documentation and usage details: https://github.com/influxdb/telegraf/tree/master/plugins/statsd

more feedback is much appreciated, thanks all

@penguincp

penguincp commented Mar 11, 2017

According to #1876 (commented by sparrc on Oct 11, 2016), multiple field support (e.g. cpu.usage,host=server01,region=us-west:idle=10.0,user=50.0,system=40.0|g) was removed and will not be supported in the future, why?

@danielnelson
Contributor

@penguincp The statsd protocol is incompatible with multiple fields. We do support multiple tags, and you can use a stat for each field. If you would like to discuss this further, please open a new issue or ask a question at the InfluxData Community site.
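One way to read "a stat for each field" is to send each former field as its own statsd line with the same tags. The helper below is a hypothetical client-side sketch; appending the field name to the bucket is an assumption made for this example, not documented Telegraf behavior.

```python
def split_into_stats(bucket, tags, fields, metric_type="g"):
    """Emit one statsd line per field, repeating the tags on each line."""
    tag_str = "".join(",{}={}".format(k, v) for k, v in tags.items())
    return [
        "{}.{}{}:{}|{}".format(bucket, name, tag_str, value, metric_type)
        for name, value in fields.items()
    ]


lines = split_into_stats(
    "cpu.usage",
    {"host": "server01", "region": "us-west"},
    {"idle": 10.0, "user": 50.0, "system": 40.0},
)
for line in lines:
    print(line)
# cpu.usage.idle,host=server01,region=us-west:10.0|g
# cpu.usage.user,host=server01,region=us-west:50.0|g
# cpu.usage.system,host=server01,region=us-west:40.0|g
```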
