MongoDB is very popular, but all the (limited) criticisms of it seem to relate to insert performance once the dataset is too big to fit in RAM.
Normally the ease-of-development arguments make up for that, but log files are one of those areas that tend to expand quickly beyond any expectations.
There is a reason why most companies are using HDFS and/or Cassandra for structured log file storage.
This is SUPER helpful! Just the other day I was wondering how someone like me could get involved in the hard scalability problems I read so much about here on Hacker News. But how to make my boring old highly cacheable read-only web traffic into a major scalability problem? Then I read this blog entry, and wow, now each log entry on my site turns into a random btree update in MongoDB made while holding a global write lock. Thanks again Hacker News, and thanks again BIG DATA!
Or think about it in a different way - instead of adding disk IO on the server itself, you're offloading the log processing to another server which does delayed writes (you don't usually need an immediate sync for remote logging) and gives you better log processing capabilities (semi-structured data).
If your workload cannot be handled this way - that's another matter. But how did we get from "mongo is webscale" to "mongo cannot be used for anything at all"? What happened to benchmarking and making serious decisions backed by real data?
For write-only logging from stateless processes bound to a single machine - yes. For analytics, automatically tracking stateful sessions across many nodes, preserving context, dumping binary fragments... no, at least for me it did not always work.
You can reconstruct almost any system flow with good logging. While that's not always ideal (especially if you need to query the data), the more structured your data gets, the less it is a simple log. When you increase the specificity of your tools, they become less useful in the general case, Turing tarpit notwithstanding.
How does fluentd resume tailing the apache log if it crashes? Does it maintain the current file position on disk? What if logs are rotated between a fluentd crash and recovery?
I've had to solve this problem for Yahoo!'s performance team, and ended up setting a very small log rotation timeout and only parsing rotated logs. There's a 5-30 minute delay in getting data out of the logs (depending on how busy the server is), but since we're batch processing anyway, it doesn't matter.
The added advantage is that you just maintain a list of files that you've already parsed, so if the parser/collector crashes, it just looks at the list and restarts where it left off. Smart key selection (i.e., something like IP or userid + millisecond time) is enough to ensure that if you do end up reprocessing the same file (e.g., if a crash occurs mid-file), duplicate records aren't inserted (use the equivalent of a bulk INSERT IGNORE for your db).
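For anyone curious what that looks like in practice, here's a rough Python/PyMongo sketch of the rotated-logs approach - the file paths, the parsed_files.txt bookkeeping file, the parse() helper, and the logs.entries collection are all placeholders of mine, not the actual Yahoo! code:

    import glob
    from pymongo import MongoClient, errors

    entries = MongoClient().logs.entries        # placeholder database/collection

    def load_done(path="parsed_files.txt"):
        # Files we've already parsed; the crash-recovery "list" mentioned above.
        try:
            with open(path) as f:
                return set(line.strip() for line in f)
        except FileNotFoundError:
            return set()

    def mark_done(name, path="parsed_files.txt"):
        with open(path, "a") as f:
            f.write(name + "\n")

    def parse(line):
        # Stand-in parser; real code would pull apart the full access-log format.
        ip, ts, rest = line.split(" ", 2)
        # Deterministic key (IP + timestamp) makes reprocessing a file harmless.
        return {"_id": ip + "-" + ts, "ip": ip, "ts": ts, "raw": rest}

    done = load_done()
    for name in sorted(glob.glob("/var/log/apache2/access.log.*")):  # rotated files only
        if name in done:
            continue
        with open(name) as f:
            docs = [parse(line) for line in f if line.strip()]
        if docs:
            try:
                # ordered=False keeps going past duplicate-key errors - roughly
                # MongoDB's equivalent of a bulk INSERT IGNORE.
                entries.insert_many(docs, ordered=False)
            except errors.BulkWriteError:
                pass  # duplicates from a previous partial run are expected
        mark_done(name)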
I have a syslog-ng -> MongoDB project that I've been working on at my university.
github.com/ngokevin/netshed
It is written in Python and currently parses out fields from several types of logs (such as dhcpd). It is initially set up to read from named pipes (it has a tail function as well). Each type of log is dumped to its own database, and each date has its own collection. I have it set up with a master/slave configuration to overcome the global write lock. It has functions to simulate capped collections by day. It is paired with a Django frontend for querying via PyMongo.
This version is several weeks old and I will push out a new one soon.
Oh sorry, when I say overcome the global write lock, I don't mean getting rid of it, but simply allowing me to query a replicated slave database while the master database is getting hundreds of writes a second...so the writes don't block the reads.
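For reference, that write-to-master / read-from-replica split can be expressed in a few lines of PyMongo. This is just a sketch assuming a modern replica set (the hostname and set name are made up) rather than the old master/slave setup, with the same one-database-per-log-type, one-collection-per-date layout:

    from datetime import date
    from pymongo import MongoClient

    # Writes go to the primary; readPreference=secondary routes queries to a
    # replicated member so heavy write traffic doesn't block reads.
    writer = MongoClient("mongodb://db.example.edu/?replicaSet=rs0")
    reader = MongoClient("mongodb://db.example.edu/?replicaSet=rs0"
                         "&readPreference=secondary")

    def collection_for(client, log_type, day=None):
        # One database per log type, one collection per date.
        return client[log_type][day or date.today().isoformat()]

    # Ingest side: append a parsed dhcpd record.
    collection_for(writer, "dhcpd").insert_one(
        {"mac": "00:11:22:33:44:55", "ip": "10.0.0.42", "event": "DHCPACK"})

    # Query side (e.g. the Django frontend) reads the same day's data.
    for doc in collection_for(reader, "dhcpd").find({"event": "DHCPACK"}):
        print(doc)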
I'd also suggest looking at both Logstash and Graylog2. They both can use MongoDB as the storage engine for logs, and can also do the field extractions.
FWIW, Graylog2 will be switching from the existing MongoDB backend to ElasticSearch, citing performance constraints (and the lack of better FTS functionality) specific to MongoDB.
Find the entire comment here - http://groups.google.com/group/graylog2/browse_thread/thread...
This is something I am working on right now: a centralized logging system for the production servers. Logs will get indexed in ElasticSearch (pretty awesome project, imho!!), where I can run search queries against the indexes.
I am using Logstash for parsing and routing logs from the production servers to the ElasticSearch instance.
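As a side note, once the logs are indexed, the search queries are just HTTP calls to ElasticSearch's _search endpoint. A quick Python sketch (the host, index name, and field are placeholders for whatever Logstash writes in a given setup):

    import json
    import urllib.request

    # Placeholder node and index; the _search endpoint and query DSL are standard ES.
    url = "http://localhost:9200/logstash-2011.11.01/_search"
    query = {"query": {"match": {"message": "Internal Server Error"}}, "size": 10}

    req = urllib.request.Request(
        url,
        data=json.dumps(query).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        hits = json.load(resp)["hits"]["hits"]

    for hit in hits:
        print(hit["_source"].get("message"))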
Great! If we use Fluentd and MongoDB, we can collect realtime events without writing any code, only configuration. I am also thinking about a more flexible aggregation system using them: "An Introduction to Fluent & MongoDB Plugins" http://www.slideshare.net/doryokujin/an-introduction-to-flue... . Please tell me if there are more powerful use-cases for Fluentd & Mongo!
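To illustrate the "configuration only" point, a Fluentd config along these lines tails the Apache access log and pushes each parsed event into MongoDB via fluent-plugin-mongo. Treat this as a sketch - the paths, tag, and database/collection names are placeholders, not a verified config:

    # /etc/fluent/fluent.conf (sketch)
    <source>
      type tail
      path /var/log/apache2/access.log
      format apache
      tag mongo.apache.access
    </source>

    <match mongo.**>
      type mongo
      host localhost
      port 27017
      database apache
      collection access
    </match>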