Yes, I know that I am cheating here. The way of implementation described in this post is probably not the best one. Firstly it is for big guys only. Secondly it can be significantly improved.
Don't try this method unless you have a lot of time and millions of users :). You've been warned!
Let's start from a few simple questions. Why to discuss Tag Cloud at all? It seems like a common feature and it shouldn't be tricky to implement. But have you ever think about how the most noble applications (such as www.flickr.com - yep, I'm a worshiper of this product) construct Tag Cloud and keep it up-to-date taking into account the huge amount of data moving forth and back?
Here are just two problems you could unfold trying to find the answer to this question (in fact there is much more to discuss but it could take a lot of time - more then I have):
1. Relational database.
Yes, relational database is a problem in case of a normalized data model usage (anyone wants a piece of me now? :)). Why?
- Let assume you have 10,000,000 users. They form the first table - "user".
- Let further assume each user upload at least 1,000 photos (not a big deal in the age of digital photography). So we have 10,000,000,000 photos - cool isn't it? And it is the second table - "photo".
- And finally let's add some tags - I usually have from 3 to 6 tags per item. If anyone had been like me we would have about 45,000,000,000 tags in average. These are the third and the forth tables - "tag" and "photo_tags" relationship respectively.
In order to extract the information we would need to make ... JOINS! Could anyone say how many records the Cartesian product of all four tables will have? ;)
2. The system restart.
Okay. There are a lot of relational algebra gurus to hire and you might have bought a database engine capable of performing JOINs of tables with infinite number of rows in a finite amount of time. So far so good.
Let's assume your system was up and running until you decided to upgrade the database schema (for example to reach the 13th normal form :)). You planned a downtime, announced the date to your users and stopped the system. Upgrade was done quickly. You started the application again and ... your application now need to calculate the tag cloud! I can even make things "easier" for you - your system has to calculate a tag cloud for everyone from those 10,000,000 people to build personal tag clouds! According to the data model we saw above the calculation like this requires an intensive rows counting in the relationship table(s). And taking into account the number of rows it seems like your database requires additional super ability now - it should be able to count rows as quickly as it JOINs tables ;).
For those of you who are suspicious of my words I suggest to make a little experiment - take your favorite database engine, generate a table with at least 100,000,000 records, perform the following statement: "SELECT COUNT(*) FROM Table;" and measure the time of execution. I promise you will be surprised. Also I suggest to play with your data a little. Try to add some conditions to the statement above, add indexes and check the same statements again, measure the size of indexes, time of their construction and defragmentation etc. This should give you good feeling of the problem.
(By the way do you know my favorite joke from Chuck Norris Facts? - "Chuck Norris counted to infinity - twice.")
Let's move on now and try to think about alternative ways of tags cloud manipulation and persistence.
Firstly there is no need in relational data storage. It doesn't mean you should throw away that thing you payed so much money for - just leave it for more adequate data. A standard hash table with tags and their respective counters of usage persisted as a binary stream will do the trick. It is pretty much enough to represent the entire tag cloud (as well as a personal tag cloud). Of course you shouldn't keep all tags which exist in the system in a single hash table. The tag cloud consists of the most popular tags only (approximately 200 tags in our system) and it is adjusted all the time to be up-to-date.
You can still use the database by storing hash tables as BLOBs (which is probably the easiest way as you don't need an additional type of data storage).
Secondly there is no need to recalculate the tag cloud (as well as personal tag clouds). It can be created once and adjusted all the time according to changes in tags applied to all entities in the system. Moreover these adjustments can be infrequent. Tag cloud is usually rather static. Just look at any tag cloud of the popular photo sharing applications and you can notice that words like "art", "cool", "wedding" etc. are always somewhere near the top. Unfortunately you still have to have all tags and counters of their usage somewhere in the database (perhaps in the separate table) and update those counters all the time to give "new" tags a chance to enter the tag cloud. But it is nothing comparing to the process of full tag cloud recalculation.
The personal tag clouds can be treated similarly. The same representation of a tag cloud as a persisted hash table works here as well. Although the situation is much simpler as you deal with lower number of tags (a vocabulary of an ordinary person is about 5000 words - the real number of words used as tags will be considerably lower). As a result you can have all tags of a person in a single hash table - not just the most popular ones as in the case of the global tag cloud . Put this persisted hash table in the "user" table in an additional column and you will always have a personal tag cloud at your disposal. Don't forget to update it each time the user changes its tags.
So what did we get so far? We managed to replace the extremely slow process of tag cloud calculation (that must be performed only when tag cloud is requested) to the process of continuous tag cloud adjustment (that should be done each time tags change is requested). In other words the infrequent requests (such as getting the tag cloud) are made very quick but frequent requests (such as adding/deleting/changing tags) are made slower. Strange result, isn't it? Don't worry. In the second part I'll show a concrete distributed algorithm you can use to make tag cloud adjustment quick to the point it wouldn't slow the ordinary tags operations. And we will have the only major drawback - the complexity of the implementation.
That's all for now :). I hope you find these ideas useful.
Don't hesitate to ask questions and express your opinion.