Thursday 24 May 2007

Log Me Gently, Baby!

Hi Guys,

I haven't been here for almost a month. That's kind of a bad thing because I planned to update my blog every week. Let's fix this a little bit and talk about something strange... for example, logging!

You may say there is nothing to discuss. It is not a big deal. Almost every platform and/or framework has a logging infrastructure - just use that. It seems like all you need to do is insert calls to a particular logging method/function, with an appropriate level and message, wherever in the code you find it reasonable and ... that's it! Enjoy the results! And you are absolutely right. Get yourself busy with the business part of the application and don't waste your time reinventing the wheel. End of story.

For those who don't believe in the happy ending, let's add some concrete environment restrictions and see what we really have:
  1. The web application I work on is a heavily loaded one with a lot of simultaneous requests (roughly speaking hundreds per second per machine in the web cluster) - this is the first piece.
  2. One of the primary architecture goals is to serve each request as quickly as possible to get the maximum throughput. According to the latest measurements on the testing platform, the average response time for the majority of functions is about 50 ms (yes, I don't believe it either ... and this makes me read the performance testing report again and again every day - but it is always quite the same :) ). To tell the truth, we have a strict limit of 150 ms per request. And of course I don't want to spoil the achieved results - the second piece seems to be in place.
  3. Big applications are complex - complex and expensive to develop, to install, to maintain. That's why we always try hard to avoid unnecessary infrastructure elements, especially if they require deep understanding from each team member in order to be used :). And this is the third piece.

That's what I have... Now let's take a look at what I need.

When you create a highly available solution you need a lot of things you probably don't care much about in other types of applications. These things include but are not limited to:
  1. live hardware and software reconfiguration,
  2. live data migration,
  3. deployment of new application versions without stopping the system,
  4. carefully planned and quick database updates,
  5. live system patches,
  6. etc.
There are a lot of excellent technical challenges to face, as well as growing overall system complexity. And not surprisingly it is necessary to monitor all this complexity - the most critical parts can even require real-time monitoring. What choice do we have? Can you think of at least a few options here? Probably the conclusion is too obvious. It is just a simple axiom - logs are your closest friends in situations like this. The more complex the system you have, the more time and money logs will save you if applied right.
Here is a brief list of questions which should be addressed to create a good logging subsystem in a highly available application under serious load. And even if you answer all of them, you still might not be sure that your logging is 100% okay...

Apply each question to the particular situation you have and try to think carefully about any consequences which may follow if you skip the corresponding logging functionality:
  1. How many steps do I need to set up basic logging? Do I need to bring in a complex configuration? How dumb could the system administrators be?
  2. Is my logging synchronous or asynchronous? Can I make the time-consuming logging operations (for example logging to a database or a file) asynchronous?
  3. Is my logging friendly to multiple concurrent threads (or even processes)? Does it use intensive locking of any data structure which may lead to thread blocking? Is it possible to use non-locking versions of the same structure instead?
  4. Can I use batch logging? Is the logging subsystem flexible enough to strike a balance between the speed of batch logging and the loss of log information which is held in memory between batch writes in case of an unpredictable failure?
  5. Can I change the logging configuration without affecting the application (at least without restarting it)?
  6. Can I add some pre-processing logic before the messages are logged, e.g. write filters that keep logs clean of duplicate error notifications?
  7. Is the logging subsystem flexible enough to deliver logs over the network to a specially selected machine (or even a cluster of machines) dedicated to log storage and processing?
  8. Are there good tools for viewing logs of the format I choose? Do they support real-time log acquisition and visualization?
  9. How gracefully can the logging subsystem fail? Can it switch to different log targets in case of failure? Is it self-repairing as soon as the main problem is resolved (for example, if the database temporarily goes down, can it write to a backup location and then redirect messages to the database once it's back online)?
  10. Which "standard" configuration possibilities of the logging subsystem do I have (changeable log message format; thresholds for messages coming from concrete places of the application; pluggable log targets such as files, system event log, e-mail, database, network connection, group communication software connections; log levels that can be easily adjusted depending on the situation)?
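To make questions 2, 4 and 6 a bit more tangible, here is a minimal sketch of how they might be combined. It is not what our application uses; it is just an illustration built on Python's standard logging module. The names DuplicateFilter and build_logger, the logger name "webapp.sketch" and the batch size of 100 are all made up for this example. The idea: the request thread only puts the record on an in-memory queue, a background thread takes it from there, and records are handed to the slow target in batches (or immediately when an ERROR arrives).

```python
import logging
import logging.handlers
import queue
import threading


class DuplicateFilter(logging.Filter):
    """Hypothetical filter: drop a record that repeats the previous one."""

    def __init__(self):
        super().__init__()
        self._last = None
        self._lock = threading.Lock()  # filters run in the calling threads

    def filter(self, record):
        key = (record.levelno, record.getMessage())
        with self._lock:
            if key == self._last:
                return False  # same level and text as the last record: skip
            self._last = key
            return True


def build_logger(target, name="webapp.sketch", batch_size=100):
    """Wire up: logger -> queue -> background thread -> batch -> target.

    `target` is any logging.Handler that does the slow work (file, DB, ...).
    """
    log_queue = queue.Queue(-1)  # unbounded, so callers never block on it
    queue_handler = logging.handlers.QueueHandler(log_queue)
    queue_handler.addFilter(DuplicateFilter())

    # Buffer records in memory; flush when the buffer is full or when a
    # record of level ERROR or higher arrives.
    batch = logging.handlers.MemoryHandler(
        capacity=batch_size, flushLevel=logging.ERROR, target=target
    )

    # The listener thread moves records from the queue into the batch.
    listener = logging.handlers.QueueListener(log_queue, batch)
    listener.start()

    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.addHandler(queue_handler)
    return logger, listener, batch
```

On shutdown you would call listener.stop() and then batch.close(), so that whatever is still sitting in memory gets written. Note the trade-off from question 4: anything buffered but not yet flushed is lost if the process dies, so the batch size is a knob between speed and safety.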

As you can see, there is a lot of stuff behind such a simple thing as logging, isn't there?
I hope this post helps you to think about your old acquaintance in a new way and let it reveal its true power.

Talk to you later!

Tuesday 1 May 2007

Cloning and YouTube.com

A couple of months ago we started a new project. I took the noble position of Software Architect and a small team of developers to create a big and complex solution - although it's not quite true, let's call it a clone of YouTube and Flickr at the same time :). Yes, I know - the internet is full of them and constructing just another one should be a boring thing... But for me it was an excellent opportunity! Why? Let me explain...

After the tremendous business success of the first Web 2.0 applications, a lot of our company's customers started ordering clones of them. At first it was a slow process with a lot of hesitation and second thoughts. Then Google acquired www.youtube.com and every internet business owner started having nightmares. All of a sudden everyone understood how incredibly profitable their business could have been, even without any visible business model, if only they had been more active. As a result everybody rushed forward, but unfortunately with the same level of stupidity - the vast majority decided to catch the same train. Yes, they started dreaming about YouTube clones!

It is not hard to build something similar to YouTube (and even easier to order it from an outsourcing company like the one I work for - you don't even need to get your hands dirty managing developers or coding yourself). The task can easily be done by a team of four web developers in two to three months (plus a tester to make the application stable). So if it is so achievable and cheap, why not try? And they tried, and tried a lot... The only problem is that this estimate holds only as long as the load on the site is small - in other words, when you have hundreds (not thousands) of visitors per day and they don't upload much stuff. But this was sufficient. Despite their wish to succeed, almost all the customers we had until now wanted to bite off just a tiny piece of the YouTube market. To be honest, they merely dreamed of becoming at most a mid-sized internet company in the distant future. They preferred to be born small, hope to grow big and die huge (just kidding ... sure, we all want to get to the top and stay there forever :) ). So we were making blind copies of the same site again and again, but on a very small scale. We were copying the design and the functionality, trying to use the same open source software as YouTube does, and so on.
It was boring, stupid work, usually assigned to teams formed mostly of students, since they were doing work like this for the first time and didn't feel sick of it... Of course the quality of the solutions was not high, but it was good enough to work most of the time and keep customers happy.

But finally one of the customers decided to be born huge at once. He came to a wise and simple conclusion - not to create dumb clones of famous systems but to provide some flexible tooling which helps in the construction of media sites. Unfortunately I can't provide the full description of the business idea, but it is not necessary. The only thing that is worth mentioning is that, due to the specific market orientation, such non-trivial requirements as scalability, high availability (99.999% up-time) and total visibility became the core features of the architecture.
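To get a feeling for what 99.999% up-time really means, here is a tiny back-of-the-envelope calculation. The function name downtime_budget_seconds is made up for this illustration, and it assumes a plain 365-day year.

```python
def downtime_budget_seconds(availability_percent, period_days=365.0):
    """Return how many seconds of downtime a given availability allows.

    Assumes a 365-day year by default (an assumption for this sketch).
    """
    period_seconds = period_days * 24 * 60 * 60
    return period_seconds * (1.0 - availability_percent / 100.0)


if __name__ == "__main__":
    budget = downtime_budget_seconds(99.999)
    print(f"{budget:.0f} seconds, i.e. about {budget / 60:.1f} minutes per year")
```

For "five nines" this comes out at a bit more than five minutes of total downtime per year, which is why things like live reconfiguration and deployment without stopping the system stop being optional.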

Why did I write such a long story? Every technical solution has its business context. I'm planning not only to share the most difficult technical parts and possible solutions on the pages of this blog, but also to discuss their applicability to concrete business areas...

Stay tuned!

Guy Kawasaki - The Art of The Start

When this man speaks, everybody listens! This time Guy Kawasaki is talking to entrepreneurs about The Art of The Start - both the book and the strategy of achievement. If you haven't seen this video yet, I highly recommend that you watch it. It is smart and funny, like everything I have heard from Guy so far ...

As a Matter of Testing

Okay, a few words about testing now...

As I wrote before, I'm working on a huge project involving a lot of technology and hardware:
  1. database clusters,
  2. web clusters,
  3. computational clusters for concrete algorithms,
  4. all of the above distributed across a few data centers,
  5. a mix of Linux and Windows platforms, etc.
To my mind the project is not typical for an outsourcing company. As you might suspect, outsourcing companies usually don't have the resources and processes established for the development of big systems. Also, they don't bother much about quality (no matter what they tell you and which certificates they have), which is crucial in the construction of a 24x7 solution. But this is the core of the business - they do not sell products, they sell people (more on this in separate posts). Nevertheless every company has a QA department. So do we. The only problem is that the QA staff is not technical in nature. In other words, we don't have software engineers in testing at all. To be honest, there are few projects which really need something more advanced than just clicking through the user interface to make sure it works and meets the specification. The small number of exceptional projects which need sophisticated testing tend to have some developers assigned to this activity inside the project team. That doesn't sound bad - at least we have testing in place. The only problem is that there are a few questions each team has to answer first:
  1. how to set up the testing process?
  2. what and how to test?
  3. how to accumulate the results?
  4. what conclusions can be made on the basis of the data collected?
As you have probably understood, a team needs to create the whole testing process from scratch, knowing literally nothing about how to do it efficiently. In our case an additional question arose: how to do it right for a 24x7 system?

Fortunately we are not alone on the planet :). Here is an excellent video from Google which answers some of the questions above. Enjoy!