A methodology for logging in production
31 January 2008 | Uncategorized | 1 Comment
This isn’t about how Ops deals with giant syslog servers or logrotate, that stuff is covered elsewhere by people qualified to speak of it. This is a quick crash course on how we as application developers can do logging in a sane way that won’t drive Ops crazy (hopefully) and will help everyone debug issues in production with minimum pain in stressful situations.
The heart of useful logging is using log levels correctly. The reason you have levels is to enable message routing/handling based on importance and also to facilitate easier grepping when it comes to that (and it will). Classic Unix syslog defines these decreasing levels or “priorities”:
- EMERG: I don’t use this one ever. It’s for system-level problems like maybe a RAID event or something.
- ALERT: I don’t know what this is for.
- CRIT: This is a super-bad error I guess. We don’t use this but I could see using it when a service doesn’t start or something.
- ERR: Now we’re getting to the juice.
This is a bread-and-butter level for backend applications. Most every exception should be logged at error level. ERR is where you should see your database integrity violations, runtime type violations, runtime assertion failures, stuff like that. Pretty much anything unexpected should land here, so it’s hopefully closely related to how you are using exceptions.
A good rule of thumb is to figure that anything logged at error level will end up going to somebody’s pager in the middle of the night. If you get that escalation in the night and things are reported “broken”, the error-level logs are the first place to look and they ought to have something for you to at least start with or you’ve failed.
You don’t want a lot of clutter in there though there can easily be cascading errors that are completely natural given particular failure modes. You absolutely have to see something like “database exception: sequence can’t be increased” or “couldn’t create tmp file $filename: No space on device” or similar type of base causes, not a bunch of extraneous junk and not total silence on such an issue. There are plenty of other levels for junkier stuff…
- WARNING: another bread-and-butter level, this one is a little more ambiguous.
Warnings are unexpected events that don’t necessarily mean processing a unit of work was aborted, but that something funky was noticed and might be important in a diagnostic situation. Perl has a built-in warn() function so there’s already a built-in culture there for what proper warnings are. In classic unix apps it’s the type of thing that gets written to stderr but then the program doesn’t exit, it keeps chugging along.
A warning you might see in (say) a spam analysis application is an unrecognized character set that the application is therefore defaulting to treating as Latin1. It’s not important enough to page somebody in the night that one particular spam out of (say) millions couldn’t be processed 100% correctly, but it’s worth noting that there was this recoverable problem in case it helps establish a pattern that is worth addressing (“hey we’re blowing the decode on GB2312 a whole lot, let’s work on that next release”).
- NOTICE: a wishy-washy one, like INFO but “more important”. We use notice for application startup/shutdown and per-run application timing messages. You could definitely just ignore it and live a full life IMHO.
- INFO: this is the standard vanilla almost-every-time log level.
Something totally normal happened and you want to say something in the logs that everyone can see, reach for INFO first unless you have a good reason to pick another level. You processed a message, you handled a web request, whatever. Maybe these logs will be used for post-processing and reporting, maybe they’ll be used for troubleshooting the conditions leading up to an error.
- DEBUG: Our production environments do not generally log debug-level messages to disk unless there are very exceptional circumstances. The audience for these messages are purely QA and Dev and the volume can be very large.
DEBUG is where you can chat mindlessly about every little file you’re opening or closing, every SQL query you’re running, intermediate results for the business logic or any similar crap. Debug level is great for giving QA and Dev “internal views” on what is going on in the system but would be a highly distracting (and possibly taxing for the log server) firehose of unstructured chatter in production.
So those are the levels, but maybe you’re starting from total scratch and you’ve just been scribbling everything into one big file somewhere. Which levels do you really need to start making things way better?
I think ERR and INFO are the minimum you need to be thinking about when you’re writing your app. What are the normal processing messages that show things are flowing around correctly and in timely fashion (INFO) and what are the major problems and showstoppers that are preventing normal processing of your program (ERR).
You’ll quickly realize that DEBUG is handy for helping detect and verify bugs in QA/dev. WARN might or might not make sense for your environment or might be used very sparingly. You might or might not find uses for the other levels based on the culture of your Ops team, these rules are not hard and fast.
That’s all I know. Go forth and log wisely.
-
http://blog.cbcg.net/ Toby DiPasquale