I figured I would throw this out to the wolves, before working on any of the ideas, in the hope of collecting experience from anyone who has worked with this idea before me. I’m likely to implement at least a proof of concept unless someone points out a glaring, show-stopping logic flaw.
Idea
Collect Graphite (or Graphite-style) centralized metrics (anything you can think of) via syslog. We don’t need a new way to send and collect small UDP messages.
Why
“Why not?” is the question really.
“Why?” is easy:
- Uses as much omnipresent tooling as possible.
- Uses established system library calls instead of requiring developers to write, install, or copy/paste new ones.
- Because I found myself asking the following this past week: “Why am I about to build and package GNU netcat, for distribution to all of our Solaris 10 boxes, to get remarkably simple frigging UDP packets sent from shell commands?”
How: Client (Ideas)
- Standard syslog configuration:
  well-known-facility.well-known-severity @metricshost
- Developers use standard syslog(3) calls specifying the well-known facility and severity (a minimal sketch follows this list).
- For non-real-time metrics from “shell land”, just use logger(1).
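To make the client side concrete, here is a minimal sketch of the developer-facing call, assuming local3/info is picked as the well-known facility/severity and the payload uses Graphite’s plaintext “path value timestamp” format (the ident and metric name are made up):

    import syslog
    import time

    # Assumption: local3/info is the agreed "well-known" facility/severity.
    # Payload is Graphite's plaintext format: "<metric.path> <value> <unix-timestamp>"
    syslog.openlog(ident="metrics", facility=syslog.LOG_LOCAL3)
    syslog.syslog(syslog.LOG_INFO, "app.web.requests 42 %d" % int(time.time()))

From “shell land”, the rough equivalent is logger -p local3.info "app.web.requests 42 $(date +%s)".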
How: Server (Ideas)
Tweak rsyslog or another well-known syslog “collector” product to:
- Parse basic Graphite or Graphite-like metric messages and perform RRD and/or Whisper writes. It could also implement built-in stats aggregation, StatsD-style, quite easily (a rough sketch follows this list).
- Ignore all non-metric-conforming syslog data.
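As a rough illustration of the first bullet (not an actual rsyslog module), the parse-and-forward half could be as small as the sketch below, assuming the collector hands message payloads to a script on stdin and a Carbon plaintext listener (default port 2003) does the actual Whisper writes:

    import re
    import socket
    import sys

    # Graphite plaintext protocol: "<metric.path> <value> <unix-timestamp>"
    METRIC_RE = re.compile(r"(?P<path>[A-Za-z0-9_.\-]+) (?P<value>-?[0-9.]+) (?P<ts>[0-9]+)\s*$")

    def forward(lines, carbon_host="localhost", carbon_port=2003):
        sock = socket.create_connection((carbon_host, carbon_port))
        try:
            for line in lines:
                m = METRIC_RE.search(line)
                if m is None:
                    continue  # ignore all non-metric-conforming syslog data
                sock.sendall(("%s %s %s\n" % m.group("path", "value", "ts")).encode())
        finally:
            sock.close()

    if __name__ == "__main__":
        forward(sys.stdin)

StatsD-style aggregation could slot in between the parse and the write.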
Thanks for any thoughts below or via the original thread that links here.
Hi,
This is basically what is described in this presentation from RubyConf Argentina: https://vimeo.com/38628915
You could write to a pipe from syslog and have your parser read that.
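For what it’s worth, a minimal sketch of that pipe approach might look like this (the FIFO path and the facility are illustrative, and the FIFO has to exist before syslog starts writing to it):

    # Assumes syslogd/rsyslog has a rule directing the metrics facility to a
    # named pipe, e.g.:  local3.info  |/var/run/metrics.pipe
    # and that the FIFO was created beforehand with: mkfifo /var/run/metrics.pipe
    FIFO = "/var/run/metrics.pipe"

    def metric_lines():
        while True:                        # reopen when the writer closes its end
            with open(FIFO) as fifo:
                for line in fifo:
                    yield line.rstrip("\n")

    for line in metric_lines():
        print(line)  # hand off to the parser / Graphite writer from here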
Hi Jeff,
As I said on Twitter, your idea is very sound, especially for those of us who already have large syslog infrastructures in play. One thing to consider is that, depending on the number of metrics, you may be placing a large amount of load on infrastructure that was not necessarily designed for that purpose.
This may be a particular case, but I have seen network routers have a particularly hard time routing large numbers of UDP packets because they were at the bottom of the filter list; that was a packet storm, though, not a well-defined infrastructure. I would be monitoring the use of said infrastructure as you scale up the number of metrics you are collecting.
You would want to ensure that the packets remain under 1536 bytes, as I do not think fragmentation would be a particularly pretty scenario either.
Very eager to see how it turns out, as it is a very viable alternative to polling SNMP through a firewall.
Nathan Dietsch
We already do this at Heroku and it’s quite popular with our engineers. We’re in the process of streamlining many of the emitters and aggregators in our event stream (e.g. syslog is an emitter, Splunk and Graphite are consumers).
In fact you can already do this today via Logster (https://github.com/etsy/logster) and Logstash (http://logstash.net/docs/1.1.0/outputs/graphite).
Hi Jeff,
If you want to get metrics out of logs, I second the mention of Logster (run on all of your servers) or Logstash (grokking against your centralised logs).
Have you seen this? http://joemiller.me/2011/09/21/list-of-statsd-server-implementations/
Still, the syslog-ng approach sounds good, and certainly better than installing Node all over the place for no other purpose.
Regards,
Brian
Hey Jeff,
definitely a sound idea and something I’ve looked at in the past. I’ve actually done something similar with syslog-ng piping to a script, but it won’t scale. The alternative is to write something that understands syslog messages, but then you’re back to running non-standard components in your infra, even though you’re limiting the surface to the aggregating nodes.
As for a Logster-like solution, which basically means tailing log files: we’ve had that scale pretty well using a daemon mode, and it’s a pretty solid choice. Depending on how hard you want to push the envelope, a better choice is, as you suggested, to tweak rsyslog, which to me would translate to writing a new module. There’s already a variety of plugins (to write to DBs, to write to MongoDB, etc.), so I can easily see how you could write something that writes to Graphite with StatsD-like features (rsyslog supports things like batching messages, which would suit this use case).
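To put a shape on the StatsD-like-features bit: a crude aggregating stage, whether it lives in an rsyslog output module or in a standalone daemon, might look roughly like the following (the 10-second flush interval and counters-only handling are purely illustrative):

    import socket
    import time
    from collections import defaultdict

    FLUSH_INTERVAL = 10  # seconds; illustrative, similar to StatsD's default

    def aggregate(lines, carbon=("localhost", 2003)):
        counters = defaultdict(float)
        last_flush = time.time()
        sock = socket.create_connection(carbon)
        for line in lines:
            try:
                path, value = line.split()[:2]
                counters[path] += float(value)
            except ValueError:
                continue  # skip non-conforming lines
            now = time.time()
            if now - last_flush >= FLUSH_INTERVAL:
                # one batched write per interval instead of one write per message
                payload = "".join("%s %s %d\n" % (p, v, now) for p, v in counters.items())
                sock.sendall(payload.encode())
                counters.clear()
                last_flush = now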
So the question to me becomes: why would I want something tightly integrated with rsyslog vs. something generic (Logster) that can feed off any text (log) file?
@spikelab
My cohort Eric does something similar. He’s released it as a Ruby gem called metriks: http://bitmonkey.net/post/18854033582/introducing-metriks (gem https://github.com/eric/metriks).
He’s using it to send metrics about Papertrail’s operations to Librato Metrics and Graphite. Aside from sending metrics elsewhere, the Ruby process has access to the computed rate and can update the process name. Having ps show “someworker: 123/second” has been handy.
Troy, thanks for the reply. Metriks seems to me to be an improved take on StatsD, but with multiple possible output destinations instead of just Graphite. I don’t see where syslog plays any real role in Metriks as the transport mechanism.
Am I overlooking something?
Thanks for the reply, Spike.
I will definitely look into Logster and consider some ways to implement what I am thinking about with it. We have no special feelings toward rsyslog; a plugin for it (or whatever terminology they may use) was just a first idea for how something like this might be done. I didn’t know about Logster.
You’ll note that I played around with this idea in the past: http://www.kickflop.net/blog/2010/12/10/any-metric-graphing-with-cron-some-code-syslog-and-splunk/
Sadly (?), I didn’t really take it any further due to Splunk not providing zero-configuration graphs + dashboards like Graphite-web does. There is also the question of whether or not it really makes sense *for us* to be loading down the host with Splunk indexing of a large volume of metric data. Additionally with Splunk, the product cost is based on the amount of daily indexed data, so you end up paying for metrics. [ Don’t get me wrong, we really like Splunk and have been using it for 4 years. ]
Anyway, I’m rambling now. Thanks again!
Thanks for the reply, Brian. I’ll definitely dig into Logster.
I have in fact seen Joe’s StatsD server implementation index. I commented there with one implementation that he put on the list ;) However, it is a good reminder that I need to make sure I get myself up to speed with all of the various offerings in this problem domain. Maybe it’s time to revisit the “monitoringsucks” GitHub repo as well.
Thanks, Jason. Nice to hear it’s being done and is well-liked. I hadn’t considered Logstash, as we use Splunk and like it, but it may very well be that we’ll use it just in the Graphite-writer role.
Thanks for the reply, Nathan. Anything we implement will certainly be in stages, increasing in metric count and frequency over time, while taking note of the impact.
Thanks for the link! I’ll be watching this later tonight.
I hadn’t considered this, Andrew. Great idea. If I wanted, as you said, I could keep the parser->Graphite part completely out of the rsyslog code.
I haven’t had a chance to blog about it yet, but here’s the missing link:
https://github.com/eric/metriks_log_webhook
It’s a webhook for Papertrail (https://papertrailapp.com/) that I wrote to take the output of the Metriks Logger reporter (Metriks::Reporter::Logger, documented here: https://github.com/eric/metriks/wiki/Reporters) and send it to Librato Metrics (http://metrics.librato.com/).