Firewall Wizards mailing list archives
Re: parsing logs ultra-fast inline
From: Chuck Swiger <chuck () codefab com>
Date: Thu, 02 Feb 2006 13:00:14 -0500
Marcus J. Ranum wrote:
OK, based on some offline discussion with a few people, about doing "large amounts" of system log processing inline at high speeds, I thought I'd post a few code fragments and some ideas that have worked for me in the past.
Thank you, there is good stuff lurking here for those who spend the time to think about your posting carefully. One of the most important points you've made is that you need to pay attention and handle the common cases efficiently, but you should also look for and notice exceptional events which are not expected. Those should stand out from the routine information even if they only happen once. Perhaps especially if the event only happens once, uniqueness can be quite interesting. I have one analysis tool to offer which can do histograms: http://www.pkix.net/~chuck/histogram.py This makes a really good tool to point at a large mass of logdata and see what kind of patterns appear how often. If you've got, say, an httpd logfile in combined format, try the following: histogram.py -F\" -f 2 -n 20 httpd-access.log histogram.py -F\" -f 6 -n 20 httpd-access.log histogram.py -F\" -f 2 -n 20 -P jpg httpd-access.log histogram.py -F\" -f 2 -n 20 -X HTTP/1.0 httpd-access.log histogram.py -F\" -f 2 -n 20 -X HTTP/1.1 httpd-access.log Well, I'm sure you get the idea. Add the -r flag, and you can see the least common events instead.
[... ] If you put some simple counters in your analysis routines (hits versus misses) you can "load balance" your first tree-branch appropriately using a flat percentage. Also, remember, if you standardize your processing, it doesn't matter where it happens; it can happen at the edge/source or back in a central location or any combination of the two. Simply design your analysis as an always 3-stage process consisting of: - weeding out and counting instances of uninteresting events - selecting, parsing sub-fields of, and processing interesting events - retaining events that fell through the first two steps as "unusual"
This is a fine explanation of event parsing. Note that the order is not necessarily best done in sequential fashion: ie, while you want to weed out common uninteresting events quickly, do the most common matches first and have tests matching common interesting events, before you test for and discard uncommon uninteresting events. [ ... ]
// if we fell through to here we have a new message structure // we have never seen before!! vector message to interesting folder; }
...and then you go back a week later and add matching cases for the messages you've never seen before. Lather, rinse, repeat, until you're done for the moment, or until a new version of the software comes out, or someone decides to make syslogd format the date field differently.
Variant messages are a massive pain in the butt; you need to decide whether to deal with variants as separate cases or to make the sub-parser smarter in order to deal with them. This is one of the reasons I keep saying that system log analysis is highly site specific!
I think I can tell which software has been written by people who have had to update log analysis tools, and which has not: the former never change logging output, especially of important messages, between minor program versions. Other software seems to fiddle with the output details from line to line and from version to version. Add fields at the end or add new log lines instead. Please! [ ... ]
A philosophical digression: Why not use regexps? Other than the fact that they look like modem line noise (kids these days probably don't even know what that looks like, I bet... arrgh!) it's hard to look quickly at a regexp and a string and tell if it'll match. So most of the log processing approaches that I've seen resort to having a loop like: while( get a message) { if(regexp #1 matches) { sub_parse format #1; } else if(regexp #2 matches) { ... } What you wind up doing is parsing the string twice or, worse, having loads of regexps in the main loop, which makes your parse tree too broad to execute quickly. The worst case of this is really really ugly because you wind up executing every regexp in your pattern-set against every line of the input, which can suck quite badly if your pattern-set is large.
Absolutely, you should try to use static string matching instead of regex'es for your most common test cases if at all possible. You can make life easy for your own log analysis to do this if your program log messages of something like: EVENTTYPE: ...message... ...where "...message..." is either a human-readable string, or perhaps a set of key-value pairs. Most of your logging messages should correspond to the stuff the program is supposed to be doing. An individual log line should provide enough context to be understandable or at least integratable with related log messages if the logfile is truncated or rotated, or if the log data is processed in a non-sequential fashion. Sendmail queue IDs or PIDs in general logfiles are common examples of data which can be used to make such correlations. The closer your events are to the significant real things or changes in state that are happening in your program, the more effective your logging will be towards the people using the software. (I only repeat the obvious because there is a vast amount of software which misses this point.) If you need to add debugging code to figure out what's going on in some weird or exceptional case that you don't understand, use an event type of "DEBUG", "INFO", "FATAL" or whatever is appropriate followed by a human-oriented message. Once you understand the situation better, classify it and change the event type to the message and add it to your parser. If you use this consistently, you can easily process the messages your code generates into discrete events, and you can easily separate out "normal" logging output from the stuff that a human should review out-of-band. [ ... ]
What _is_ scary is if you wanted to handle _all_ the variant forms of log messages. My summer project 2 summers ago ("what I did on my summer vacation, by Marcus Ranum) was going out to San Diego Supercomputer Center and plowing around in SDSC's syslogs with Abe and a couple bottles of tequila and a ton of gummi bears. The goal of the project was to attempt to automatically generate a parse tree engine that could recognize all the variants of syslog messages seen. So first I wrote something that attempted to enumerate the total # of variants seen. After it ran for a day (much data, but the computer I was using was fast!) my analysis program concluded there were 50,000+ different forms of syslog messages in 10 years of syslog data. Arrrgh. That's a lot. What the hell impels programmers to keep changing the structure of their log messages? There were lots of variants of sendmail messages (dang it!) - the biggest culprit. Anyhow, you might only want a few of them so it's not like you'd need parse rules for all 50,000+ forms of message. The important part of the process is the sitting down and seeing what you've got and deciding what you want to do with it.
Agreed, agreed. But it's never too late to try to become more consistent. (Starting with not breaking things which work now.) -- -Chuck _______________________________________________ firewall-wizards mailing list firewall-wizards () honor icsalabs com http://honor.icsalabs.com/mailman/listinfo/firewall-wizards
Current thread:
- parsing logs ultra-fast inline Marcus J. Ranum (Feb 02)
- Re: parsing logs ultra-fast inline Chuck Swiger (Feb 02)
- RE: parsing logs ultra-fast inline Tina Bird (Feb 02)
- Re: parsing logs ultra-fast inline Adrian Grigorof (Feb 03)
- Re: parsing logs ultra-fast inline Chuck Swiger (Feb 07)
- Re: parsing logs ultra-fast inline Marcus J. Ranum (Feb 07)
- Re: parsing logs ultra-fast inline Brian Loe (Feb 08)
- Message not available
- Re: parsing logs ultra-fast inline Marcus J. Ranum (Feb 08)
- Re: parsing logs ultra-fast inline John Adams (Feb 09)
- Re: parsing logs ultra-fast inline Adrian Grigorof (Feb 03)
- RE: parsing logs ultra-fast inline Paul Melson (Feb 15)
- Re: parsing logs ultra-fast inline Anton Chuvakin (Feb 07)
- Re: parsing logs ultra-fast inline Adrian Grigorof (Feb 07)