MongoDB is very popular, but all the (limited) criticisms of it seem to relate to insert performance once the dataset is too big to fit in RAM.
Normally the ease-of-development arguments make up for that, but log files are one of those areas that tend to expand quickly beyond any expectations.
There is a reason why most companies are using HDFS and/or Cassandra for structured log file storage.
This is SUPER helpful! Just the other day I was wondering how someone like me could get involved in the hard scalability problems I read so much about here on Hacker News. But how to make my boring old highly cacheable read-only web traffic into a major scalability problem? Then I read this blog entry, and wow, now each log entry on my site turns into a random btree update in MongoDB made while holding a global write lock. Thanks again Hacker News, and thanks again BIG DATA!
Or think about it in a different way - instead of adding disk IO on the server itself, you're offloading the log processing to another server which does delayed writes (you don't usually need an immediate sync for remote logging) and gives you better log processing capabilities (semi-structured data).
If your workload cannot be handled this way - that's another matter. But how did we get from "mongo is webscale" to "mongo cannot be used for anything at all"? What happened to benchmarking and making serious decisions backed by real data?
For write-only logging from stateless processes bound to a single machine - yes. For analytics, automatically tracking stateful sessions across many nodes, preserving context, dumping binary fragments... no, at least for me it did not always work.
You can reconstruct almost any system flow with good logging. While that's not always ideal (especially if you need to query the data), the more structured your data gets, the less it is a simple log. When you increase the specificity of your tools, they become less useful in the general case, Turing tarpit notwithstanding.
How does fluentd resume tailing the apache log if it crashes? Does it maintain the current file position on disk? What if logs are rotated between a fluentd crash and recovery?
I've had to solve this problem for Yahoo!'s performance team, and ended up setting a very small log rotation timeout and only parsing rotated logs. There's a 5-30 minute delay in getting data out of the logs (depending on how busy the server is), but since we're batch processing anyway, it doesn't matter.
The added advantage is that you just maintain a list of files that you've already parsed, so if the parser/collector crashes, it just looks at the list and restarts where it left off. Smart key selection (i.e., something like IP or userid + millisecond time) is enough to ensure that if you do end up reprocessing the same file (e.g., if a crash occurs mid-file), duplicate records aren't inserted (use the equivalent of a bulk INSERT IGNORE for your db).
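For anyone curious what that looks like in practice, here's a rough Python/PyMongo sketch of the rotated-logs approach - the file paths, the parsed_files.txt bookkeeping file, the parse() helper, and the logs.entries collection are all placeholders of mine, not the actual Yahoo! code:

    import glob
    from pymongo import MongoClient, errors

    entries = MongoClient().logs.entries        # placeholder database/collection

    def load_done(path="parsed_files.txt"):
        # Files we've already parsed; the crash-recovery "list" mentioned above.
        try:
            with open(path) as f:
                return set(line.strip() for line in f)
        except FileNotFoundError:
            return set()

    def mark_done(name, path="parsed_files.txt"):
        with open(path, "a") as f:
            f.write(name + "\n")

    def parse(line):
        # Stand-in parser; real code would pull apart the full access-log format.
        ip, ts, rest = line.split(" ", 2)
        # Deterministic key (IP + timestamp) makes reprocessing a file harmless.
        return {"_id": ip + "-" + ts, "ip": ip, "ts": ts, "raw": rest}

    done = load_done()
    for name in sorted(glob.glob("/var/log/apache2/access.log.*")):  # rotated files only
        if name in done:
            continue
        with open(name) as f:
            docs = [parse(line) for line in f if line.strip()]
        if docs:
            try:
                # ordered=False keeps going past duplicate-key errors - roughly
                # MongoDB's equivalent of a bulk INSERT IGNORE.
                entries.insert_many(docs, ordered=False)
            except errors.BulkWriteError:
                pass  # duplicates from a previous partial run are expected
        mark_done(name)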
I have a syslog-ng -> MongoDB project that I've been working on at my university.
github.com/ngokevin/netshed
It is written in Python and currently parses out fields from several types of logs (such as dhcpd). It is initially set up to read from named pipes (it has a tail function as well). Each type of log is dumped to its own database, and each date has its own collection. I have it set up with a master/slave configuration to overcome the global write lock. It has functions to simulate capped collections by day. It is paired with a Django frontend for querying via PyMongo.
This version is several weeks old and I will push out a new one soon.
Oh sorry, when I say overcome the global write lock, I don't mean getting rid of it, but simply allowing me to query a replicated slave database while the master database is getting hundreds of writes a second...so the writes don't block the reads.
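For reference, that write-to-master / read-from-replica split can be expressed in a few lines of PyMongo. This is just a sketch assuming a modern replica set (the hostname and set name are made up) rather than the old master/slave setup, with the same one-database-per-log-type, one-collection-per-date layout:

    from datetime import date
    from pymongo import MongoClient

    # Writes go to the primary; readPreference=secondary routes queries to a
    # replicated member so heavy write traffic doesn't block reads.
    writer = MongoClient("mongodb://db.example.edu/?replicaSet=rs0")
    reader = MongoClient("mongodb://db.example.edu/?replicaSet=rs0"
                         "&readPreference=secondary")

    def collection_for(client, log_type, day=None):
        # One database per log type, one collection per date.
        return client[log_type][day or date.today().isoformat()]

    # Ingest side: append a parsed dhcpd record.
    collection_for(writer, "dhcpd").insert_one(
        {"mac": "00:11:22:33:44:55", "ip": "10.0.0.42", "event": "DHCPACK"})

    # Query side (e.g. the Django frontend) reads the same day's data.
    for doc in collection_for(reader, "dhcpd").find({"event": "DHCPACK"}):
        print(doc)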
I'd also suggest looking at both Logstash and Graylog2. They both can use MongoDB as the storage engine for logs, and can also do the field extractions.
FWIW, Graylog2 will be switching from the existing MongoDB backend to ElasticSearch, citing performance constraints (and the lack of better FTS functionality) specific to MongoDB.
Find the entire comment here - http://groups.google.com/group/graylog2/browse_thread/thread...
This is something I am working on right now: a centralized logging system for the production servers. Logs will get indexed in ElasticSearch (pretty awesome project, imho!!), where I can run search queries against the indexes.
I am using Logstash for parsing and routing logs from the production servers to the ElasticSearch instance.
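As a side note, once the logs are indexed, the search queries are just HTTP calls to ElasticSearch's _search endpoint. A quick Python sketch (the host, index name, and field are placeholders for whatever Logstash writes in a given setup):

    import json
    import urllib.request

    # Placeholder node and index; the _search endpoint and query DSL are standard ES.
    url = "http://localhost:9200/logstash-2011.11.01/_search"
    query = {"query": {"match": {"message": "Internal Server Error"}}, "size": 10}

    req = urllib.request.Request(
        url,
        data=json.dumps(query).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        hits = json.load(resp)["hits"]["hits"]

    for hit in hits:
        print(hit["_source"].get("message"))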
Great! If we use Fluentd and MongoDB, we can collect realtime events without writing any code, only configuration. I am also thinking about a more flexible aggregation system using them: "An Introduction to Fluent & MongoDB Plugins" http://www.slideshare.net/doryokujin/an-introduction-to-flue... . Please tell me if there are more powerful use-cases for Fluentd & Mongo!
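To illustrate the "configuration only" point, a Fluentd config along these lines tails the Apache access log and pushes each parsed event into MongoDB via fluent-plugin-mongo. Treat this as a sketch - the paths, tag, and database/collection names are placeholders, not a verified config:

    # /etc/fluent/fluent.conf (sketch)
    <source>
      type tail
      path /var/log/apache2/access.log
      format apache
      tag mongo.apache.access
    </source>

    <match mongo.**>
      type mongo
      host localhost
      port 27017
      database apache
      collection access
    </match>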