Firewall Wizards mailing list archives

Re: parsing logs ultra-fast inline


From: Chuck Swiger <chuck () codefab com>
Date: Thu, 02 Feb 2006 13:00:14 -0500

Marcus J. Ranum wrote:
> OK, based on some offline discussion with a few people, about
> doing "large amounts" of system log processing inline at high
> speeds, I thought I'd post a few code fragments and some ideas
> that have worked for me in the past.

Thank you; there is good stuff lurking here for those who spend the time to
think about your posting carefully.  One of the most important points you've
made is that you need to handle the common cases efficiently, but you should
also look for and notice the exceptional events which were not expected.

Those should stand out from the routine information even if they happen only
once.  Perhaps especially if an event happens only once: uniqueness can be
quite interesting.  I have one analysis tool to offer which can do histograms:

   http://www.pkix.net/~chuck/histogram.py

This makes a really good tool to point at a large mass of log data to see
which patterns appear, and how often.  If you've got, say, an httpd logfile
in combined format, try the following:

  histogram.py -F\" -f 2 -n 20 httpd-access.log
  histogram.py -F\" -f 6 -n 20 httpd-access.log
  histogram.py -F\" -f 2 -n 20 -P jpg httpd-access.log
  histogram.py -F\" -f 2 -n 20 -X HTTP/1.0 httpd-access.log
  histogram.py -F\" -f 2 -n 20 -X HTTP/1.1 httpd-access.log

Well, I'm sure you get the idea.  Add the -r flag, and you can see the least
common events instead.
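
The guts of such a tool are just a field split and a counter.  Here is a
rough sketch of the idea in Python (this is not the actual histogram.py;
field numbers here are 1-based, awk-style):

import sys
from collections import Counter

# Tally one delimiter-separated field per line, then print the N most
# common values with their counts.  Reads log data on stdin.
def histogram(lines, delim='"', field=2, top=20, reverse=False):
    counts = Counter()
    for line in lines:
        parts = line.split(delim)
        if len(parts) >= field:
            counts[parts[field - 1].strip()] += 1
    items = counts.most_common()
    if reverse:                # least common first, like the -r flag
        items.reverse()
    for value, n in items[:top]:
        print("%8d  %s" % (n, value))

if __name__ == "__main__":
    histogram(sys.stdin)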

> [ ... ] If you put some simple counters in your analysis routines
> (hits versus misses) you can "load balance" your first tree-branch
> appropriately using a flat percentage. Also, remember, if you
> standardize your processing, it doesn't matter where it happens;
> it can happen at the edge/source or back in a central location
> or any combination of the two. Simply design your analysis as
> an always 3-stage process consisting of:
> - weeding out and counting instances of uninteresting events
> - selecting, parsing sub-fields of, and processing interesting events
> - retaining events that fell through the first two steps as "unusual"

This is a fine explanation of event parsing.

Note that the stages are not necessarily best run in strict sequence: i.e.,
while you want to weed out common uninteresting events quickly, order your
tests by observed frequency, so that matches for common interesting events
come before the tests which discard uncommon uninteresting events.
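
Something like this, in Python (a sketch only; the match strings are
invented for illustration, and your own hit counters should dictate the
real tests and their order):

# Three-stage processing: (1) count and drop uninteresting events,
# (2) parse sub-fields of interesting events, (3) keep whatever fell
# through as "unusual".  Tests are ordered by observed frequency.
import sys

weeded = 0
for line in sys.stdin:
    if " -- MARK --" in line:                 # commonest uninteresting
        weeded += 1
        continue
    if "sshd" in line and "Failed password" in line:
        # common interesting: pull out a sub-field (crudely)
        user = line.split("for ", 1)[-1].split()[0]
        print("ssh-fail user=%s" % user)
        continue
    if "ntpd" in line:                        # rarer uninteresting
        weeded += 1
        continue
    sys.stderr.write(line)                    # unusual: review by hand

print("weeded %d uninteresting lines" % weeded, file=sys.stderr)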

> [ ... ]
>       // if we fell through to here we have a new message structure
>       // we have never seen before!!
>       vector message to interesting folder;
> }

...and then you go back a week later and add matching cases for the messages
you've never seen before.  Lather, rinse, repeat, until you're done for the
moment, or until a new version of the software comes out, or someone decides to
make syslogd format the date field differently.

> Variant messages are a massive pain in the butt; you need to decide
> whether to deal with variants as separate cases or to make the
> sub-parser smarter in order to deal with them. This is one of the
> reasons I keep saying that system log analysis is highly site
> specific!

I think I can tell which software has been written by people who have had to
update log analysis tools, and which has not: the former never changes its
logging output, especially for important messages, between minor program
versions.  Other software seems to fiddle with the output details from line
to line and from version to version.

Add fields at the end or add new log lines instead.  Please!

> [ ... ]
> A philosophical digression: Why not use regexps? Other than the
> fact that they look like modem line noise (kids these days probably
> don't even know what that looks like, I bet... arrgh!) it's hard to look
> quickly at a regexp and a string and tell if it'll match. So most of
> the log processing approaches that I've seen resort to having a
> loop like:
> while( get a message) {
>       if(regexp #1 matches) {
>               sub_parse format #1;
>       } else
>       if(regexp #2 matches) {
>               ...
> }
>
> What you wind up doing is parsing the string twice or, worse,
> having loads of regexps in the main loop, which makes your
> parse tree too broad to execute quickly. The worst case of this
> is really really ugly because you wind up executing every
> regexp in your pattern-set against every line of the input,
> which can suck quite badly if your pattern-set is large.

Absolutely: you should use static string matching instead of regexes for your
most common test cases if at all possible.  You can make this easy for your
own log analysis if your programs emit log messages of the form:

   EVENTTYPE: ...message...

...where "...message..." is either a human-readable string, or perhaps a set of
key-value pairs.
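
The payoff on the analysis side is that the main loop becomes one split and
one table lookup, with no regexps on the hot path.  A sketch, with made-up
event types and handlers:

import sys

# Hypothetical handlers, one per event type the parser knows about.
def on_login(msg):
    pass

def on_error(msg):
    print("error: %s" % msg)

handlers = {"LOGIN": on_login, "ERROR": on_error}

# Dispatch on the literal EVENTTYPE prefix.  A never-seen-before type
# falls through to stderr for a human to review.
for line in sys.stdin:
    evtype, sep, msg = line.rstrip("\n").partition(": ")
    handler = handlers.get(evtype)
    if handler:
        handler(msg)
    else:
        sys.stderr.write(line)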

Most of your logging messages should correspond to the stuff the program is
supposed to be doing.  An individual log line should provide enough context
to be understandable, or at least correlatable with related log messages, even
if the logfile is truncated or rotated, or if the log data is processed in a
non-sequential fashion.  Sendmail queue IDs, or PIDs in general logfiles, are
common examples of data which can be used to make such correlations.
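
For instance, reassembling multi-line transactions out of an interleaved
logfile is a one-pass job if every line carries such a key (the qid= field
below is made up for illustration):

import sys
from collections import defaultdict

# Group lines by a correlation key (queue ID, PID, ...) so that a
# transaction can be stitched back together even when its lines are
# interleaved with everyone else's.
transactions = defaultdict(list)
for line in sys.stdin:
    for token in line.split():
        if token.startswith("qid="):
            transactions[token[4:]].append(line)
            break

for qid, lines in transactions.items():
    sys.stdout.write("=== %s (%d lines) ===\n" % (qid, len(lines)))
    sys.stdout.writelines(lines)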

The closer your events are to the significant real things or changes in state
that are happening in your program, the more useful your logging will be to
the people using the software.  (I only repeat the obvious because there is a
vast amount of software which misses this point.)

If you need to add debugging code to figure out what's going on in some weird
or exceptional case that you don't understand, use an event type of "DEBUG",
"INFO", "FATAL", or whatever is appropriate, followed by a human-oriented
message.  Once you understand the situation better, classify it: give the
message a proper event type of its own and add it to your parser.

If you use this convention consistently, you can easily process the messages
your code generates into discrete events and separate the "normal" logging
output from the stuff that a human should review out-of-band.

[ ... ]
> What _is_ scary is if you wanted to handle _all_ the variant
> forms of log messages. My summer project 2 summers ago
> ("what I did on my summer vacation," by Marcus Ranum)
> was going out to San Diego Supercomputer Center and
> plowing around in SDSC's syslogs with Abe and a couple
> bottles of tequila and a ton of gummi bears. The goal of the
> project was to attempt to automatically generate a parse
> tree engine that could recognize all the variants of syslog
> messages seen. So first I wrote something that attempted
> to enumerate the total # of variants seen. After it ran for a
> day (much data, but the computer I was using was fast!)
> my analysis program concluded there were 50,000+ different
> forms of syslog messages in 10 years of syslog data.
> Arrrgh. That's a lot. What the hell impels programmers to
> keep changing the structure of their log messages? There
> were lots of variants of sendmail messages (dang it!) - the
> biggest culprit. Anyhow, you might only want a few of them
> so it's not like you'd need parse rules for all 50,000+
> forms of message. The important part of the process is
> the sitting down and seeing what you've got and deciding
> what you want to do with it.

Agreed, agreed.  But it's never too late to try to become more consistent.
(Starting with not breaking things which work now.)

-- 
-Chuck
_______________________________________________
firewall-wizards mailing list
firewall-wizards () honor icsalabs com
http://honor.icsalabs.com/mailman/listinfo/firewall-wizards

