Thursday 24 May 2007

Log Me Gently, Baby!

Hi Guys,

I haven't been here for almost a month. That's kind of a bad thing because I have plans to keep my blog updated each week. Let's fix this a little bit and talk about something strange... for example logging!

You may say there is nothing to discuss. It is not a big deal. Almost every platform and/or framework has a logging infrastructure - just use that. It seems all you need to do is insert a call to a particular logging method/function, with an appropriate level and a message, wherever in the code you find it reasonable and ... that's it! Enjoy the results! And you are absolutely right. Get yourself busy with the business part of the application and don't waste your time reinventing the wheel. End of story.

For those who don't believe in the happy ending, let's add some concrete environment restrictions and see what we really have:
  1. The web application I work on is a heavily loaded one with a lot of simultaneous requests (roughly speaking hundreds per second per machine in the web cluster) - this is the first piece.
  2. One of the primary architecture goals is to serve each request as quickly as possible to get the maximum throughput. According to the latest measurements on the testing platform the average response time for the majority of functions is about 50ms (yes, I don't believe it either ... and this makes me read the performance testing report again and again every day - but it is always pretty much the same :) ). To tell the truth, we have a strict limit of 150ms per request. And of course I don't want to spoil the achieved results - the second piece seems to be in place.
  3. Big applications are complex - complex and expensive to develop, to install, to maintain. That's why we always try hard to avoid unnecessary infrastructure elements, especially if they require deep understanding from each team member in order to be used :). And this is the third piece.

That's what I have... Now let's take a look at what I need.

When you create a highly available solution you need a lot of things you probably don't care much about in other application types. These things include but are not limited to:
  1. live hardware and software reconfiguration,
  2. live data migration,
  3. deployment of new application versions without stopping the service,
  4. carefully planned and quick database updates,
  5. live system patches,
  6. etc.
There are a lot of excellent technical challenges to face as well as growing overall system complexity. Not surprisingly, it is necessary to monitor all this complexity - the most critical parts may even require real-time monitoring. What choice do we have? Can you think of at least a few options here? Probably the conclusion is too obvious. It is just a simple axiom - logs are your closest friends in situations like this. The more complex the system you have, the more time and money logs will save you if applied right.
Here is a brief list of questions which should be addressed to create a good logging subsystem in a highly available application under serious load. And even if you have answered all of them, you still can't be sure that your logging is 100% okay...

Apply each question to the particular situation you have and try to think carefully about any consequences which may follow if you skip the corresponding logging functionality:
  1. How many steps do I need to set up basic logging? Do I need to bring in a complex configuration? How dumb could the system administrators be?
  2. Is my logging synchronous or asynchronous? Can I make the time-consuming logging operations (for example logging to a database or a file) asynchronous?
  3. Is my logging friendly to multiple concurrent threads (or even processes)? Does it rely on intensive locking of some data structure, which may lead to thread blocking? Is it possible to use a non-locking version of the same structure instead?
  4. Can I use batch logging? Is the logging subsystem flexible enough to strike a balance between the speed of batch logging and the loss of log information held in memory between batch writes in case of an unexpected failure?
  5. Can I change the logging configuration without affecting the application (at least without restarting it)?
  6. Can I add some pre-processing logic before the messages are logged, e.g. write filters that keep logs clean of duplicate error notifications?
  7. Is the logging subsystem flexible enough to deliver logs over the network to a specially selected machine (or even a cluster of machines) dedicated to log storage and processing?
  8. Are there good tools for viewing logs of the format I choose? Do they support real-time log acquisition and visualization?
  9. How gracefully can the logging subsystem fail? Can it switch to different log targets in case of failure? Is it self-repairing as soon as the main problem is resolved (for example, if the database temporarily goes down, can it write to a backup location and then redirect messages to the database once it's back online)?
  10. Which "standard" configuration possibilities of the logging subsystem do I have (changeable log message format; thresholds for messages coming from specific parts of the application; pluggable log targets such as files, system eventlog, e-mail, database, network connection, group communication software connections; log levels that can be easily adjusted depending on the situation)?
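
To make questions 2 and 6 a bit more tangible, here is a minimal sketch in Python (my pick for illustration, not the platform the application above runs on). It uses the standard library's QueueHandler and QueueListener so that request threads only put records on a queue while a background thread does the slow writing. DuplicateFilter, ListHandler and build_async_logger are hypothetical names invented for this example, and the list-based handler merely stands in for a real file or database target.

```python
import logging
import logging.handlers
import queue


class DuplicateFilter(logging.Filter):
    """Hypothetical filter: drop a record that repeats the previous one.

    Not strictly thread-safe: under a race an occasional duplicate may
    slip through, which is acceptable for this sketch.
    """

    def __init__(self):
        super().__init__()
        self._last = None

    def filter(self, record):
        key = (record.name, record.levelno, record.getMessage())
        if key == self._last:
            return False  # same as the previous record: suppress it
        self._last = key
        return True


class ListHandler(logging.Handler):
    """Stand-in for a slow target (file, database); it only collects messages."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


def build_async_logger(name, target):
    """Return (logger, listener); callers of the logger only enqueue records."""
    log_queue = queue.Queue(-1)  # unbounded queue between app and writer thread
    queue_handler = logging.handlers.QueueHandler(log_queue)
    queue_handler.addFilter(DuplicateFilter())  # pre-processing before enqueue
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep the demo isolated from the root logger
    logger.addHandler(queue_handler)
    listener = logging.handlers.QueueListener(log_queue, target)
    listener.start()  # background thread performs the slow writes
    return logger, listener


target = ListHandler()
logger, listener = build_async_logger("webapp.demo", target)
logger.error("database unavailable")
logger.error("database unavailable")  # duplicate, filtered out
logger.info("request served")
listener.stop()  # drains the queue and joins the writer thread
```

The unbounded queue is the trade-off from question 4 in disguise: whatever is still sitting in memory is lost if the process dies, so a real setup would have to decide how much it can afford to buffer.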

As you can see, there is a lot of stuff behind such a simple thing as logging, isn't there?
I hope this post helps you to think about your old acquaintance in a new way and lets it reveal its true power.

Talk to you later!
