This article does not even contain the words database or MySQL. I still believe it is somewhat interesting.
Mail has, for some reason, always played a big role in my life. I ran mail for two people – my girlfriend and me – in 1988. I ran mail for 20 and 200 people in 1992, setting up a citizens' network. Later I designed and built mail systems for 2 000 and 20 000 person corporations, and planned mail server clusters for 200 000 and 2 million users. And just before I became a consultant at MySQL I was working for a shop that did mail for a living, for 20 million users.
Mail is a very simple and well-defined collection of services. You accept incoming messages for local users, you implement relaying for your local users with POP-before-SMTP and SMTP AUTH, you build POP, IMAP and webmail access, and you deploy spam filters and virus scanners for incoming and outgoing messages. This collection of services hardly changes when you go from 2 to 20 million users – maybe the larger systems will also provide additional services such as portal services, a news server or other more directed offerings, but that is just fluff outside the scope of the mail system. The solutions, though, are very different, and depend very much on the scale of your operations.
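The core of that service collection is one decision made on every message: deliver it locally, relay it for an authenticated user, or reject it. A minimal sketch in Python – the domain names and the flat session model are invented for illustration, not taken from any real mailer:

```python
# Minimal sketch of the relaying decision every mail system makes,
# regardless of scale: accept mail addressed to local users, relay
# outbound mail only for authenticated senders (SMTP AUTH or an
# established POP-before-SMTP session), reject everything else.

LOCAL_DOMAINS = {"example.org"}  # hypothetical local domains

def smtp_action(recipient: str, authenticated: bool) -> str:
    """Decide what to do with a message in an SMTP session."""
    domain = recipient.rsplit("@", 1)[-1].lower()
    if domain in LOCAL_DOMAINS:
        return "deliver"   # incoming mail to a local user
    if authenticated:
        return "relay"     # outbound mail for our own user
    return "reject"        # an open relay? never.
```

Everything that follows in this article changes how and where this decision is executed, but never the decision itself.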
It all starts with a single server that you run just because you can, with hardly any defined processes around it. OK, so you get The Silent Treatment from your girl if you break mail again (!) because you changed over from sendmail to exim without telling her first, and maybe you lost a mail or two in the conversion process, but that is likely easily fixed with a dinner and doing the dishes.
Larger systems, such as the 20 and 200 people systems, are technologically related to the two-person system, but require more serious process and planning. You'll likely have established testing before deployment at this level, and you'll have a service window and planned, announced deployments for such systems. You are likely to have more than one person responsible for the postmaster account, and so hopefully you have acquired a ticket system and built a process around it that prevents duplicated work and lost tickets. The postmasters will hate you for introducing a layer of red tape.
Hardware-wise things may be changing as well, but more slowly than the processes: maybe you have a second server that does only the virus and spam scanning, maybe your storage is of higher quality than on the two-person system and you have acquired knowledge of RAID and the hardware to match.
When you cross over into the realm of 2 000 and 20 000 users, hardware and technology change a lot. You are likely to learn about HA, and you will have a fully redundant mail cluster, often with shared storage or maybe DRBD. Mail tends to be quite vital to corporations, and usually the money for a full HA cluster, including support for machines and software, is easy to come by after the first three-day outage of the mail system due to failed servers and missing support contracts. You'll have institutionalised restores, which presuppose working backups, and you are likely to have some kind of load balancing mechanism as part of your HA solution as well.
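The heart of such an HA pair is a small piece of failover logic: keep the service on the primary while health checks pass, move it to the standby after several consecutive failures so a single missed check does not cause flapping. A toy sketch of that idea – real cluster stacks such as Pacemaker add fencing, quorum and shared-state handling that this deliberately omits, and the threshold value is an arbitrary assumption:

```python
# Toy sketch of active/standby failover for a mail service: the primary
# serves traffic until it fails FAIL_THRESHOLD consecutive health
# checks, then the standby takes over.

FAIL_THRESHOLD = 3  # consecutive failed checks before failing over

def run_failover(checks, threshold=FAIL_THRESHOLD):
    """checks: iterable of booleans (primary health check results).
    Returns which node serves traffic after each check."""
    active, failures, history = "primary", 0, []
    for ok in checks:
        if active == "primary":
            failures = 0 if ok else failures + 1
            if failures >= threshold:
                active = "standby"  # fail over; no automatic failback
        history.append(active)
    return history
```

Note that the sketch never fails back automatically – in practice, moving the service back to a repaired primary is usually a deliberate, announced operation.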
Process and structure change as well, and maybe even more so. That is because you'll need not only to split up work across multiple persons, but now you're also dealing with specialization: no longer will everybody be able to do every job. There will be a storage and backup team, there will be that cluster guy, there will be legal and abuse specializing in nontechnical issues, and then there will be the operating system people providing the platform for your mail servers and the network team providing connectivity, routing and name services as well as firewalling. You are part of an organisation. If it is a good and modern organisation, you'll not only have testing and planned deployments, but also asset and configuration management and a proper change management cycle, and despite the fact that there are more of you than ever before, you are less likely to break the service. You'll also be much slower to roll out a change than ever before, and your users will probably hate you for that.
Going to 200 000 and 2 million users, hardware and architecture will change again. You'll learn that there cannot be one single machine large enough to cater to your needs, and that even with functional separation – dedicated mail, imap, web, spam and virus servers – it is not enough any more. You are hitting the vertical limit, and you need to understand that you need more than one of everything, with a load balancing mechanism in front of it all.
You'll need middleware to standardize inter-service communication between your different services, and a load balancer that can handle your choice of middleware protocol – that is the "we do web services" (or JNI or whatever) moment in your architecture. You'll also want to standardize authentication and authorization, if you haven't already, so welcome to the world of directory services. If you haven't standardized and automated the installation of machines yet, this is where you get the install server just to stay alive.
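Once there is more than one of everything, the simplest thing a load balancer can do is hand out requests round-robin over a pool of identical backends. A minimal sketch – the host names are invented, and real balancers layer health checks and session affinity on top of this:

```python
# Minimal round-robin balancer over a pool of interchangeable backends,
# e.g. a farm of identical IMAP servers behind one service name.
import itertools

class RoundRobinBalancer:
    def __init__(self, backends):
        # cycle() yields the backends in order, forever
        self._cycle = itertools.cycle(backends)

    def pick(self) -> str:
        """Return the backend that should serve the next connection."""
        return next(self._cycle)

imap_pool = RoundRobinBalancer(["imap1.example.org", "imap2.example.org"])
```

The point of the exercise is that clients talk to one stable service name while the pool behind it can grow machine by machine.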
Process and organisation do not change a lot at this level – you'll finally need a dedicated mail help desk, and if you are running services for end users, it will internally actually be two: a technical help desk and a billing/contract management help desk, because these are different problems managed by entirely different teams.
At 2 million users, and with the introduction of middleware, you are also slowly reaching a size where standard solutions may no longer fit you, and if you are an open source user you will have people inside your organisation who start changing the infrastructure software you are using and contributing patches to mailers, spam filter software and the like – at this point, at the latest, you'll split off feature and infrastructure development as R&D from IT operations. You'll need a process for clearing source code before it leaves the house, so that your company can release IP to the outside world, and it had better be fast enough to keep up with OSS development. Your legal team will hate you, and the developers will hate your legal team.
At that level, going from 2 million to 20 million users, there will be a point where you have to realize that everything you did so far was wrong™ and needs to be done differently. Even the middleware thingie is no longer going to cut it, because doubling the number of servers does not gain you twice the speed – mostly, you have been adding waits, network latency and other delays.
You'll need to change the way your application components talk to each other: so far, you have been taking your localized software that ran on a single machine, cutting it up and distributing the parts across many machines, then putting in the middleware to glue the parts back together into a single, distributed system. This has helped a bit, but it has not changed the fundamental way you think about systems. You start converting synchronous communication into asynchronous and parallel operations, which is a hard thing to do, and you'll run into a load of locking issues. R&D will hate you, because "if you had told us to do it that way from the start we wouldn't need to basically reengineer everything from scratch now". As if.
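The sync-to-async conversion, in miniature: instead of performing independent per-message checks one after the other, issue them concurrently and wait for all results, so total latency is the slowest call rather than the sum of all calls. The three lookup functions below are invented stand-ins for real network calls (directory query, spam filter, quota check):

```python
# Sketch of converting serial, synchronous per-message checks into
# concurrent ones with asyncio. The sleeps stand in for network I/O.
import asyncio

async def auth_lookup(user):   # stand-in for a directory query
    await asyncio.sleep(0.01)
    return f"auth:{user}"

async def spam_score(msg):     # stand-in for a spam-filter call
    await asyncio.sleep(0.01)
    return f"score:{msg}"

async def quota_check(user):   # stand-in for a storage query
    await asyncio.sleep(0.01)
    return f"quota:{user}"

async def handle_message(user, msg):
    # The three checks are independent, so run them in parallel;
    # latency is ~one sleep, not three.
    return await asyncio.gather(
        auth_lookup(user), spam_score(msg), quota_check(user)
    )

results = asyncio.run(handle_message("alice", "m1"))
```

The hard part the article alludes to is exactly what this sketch avoids: checks that are *not* independent, where shared state forces ordering and locking back in.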
If you are successful, you are still offering the same basic set of services in the end. At the same time, you have crossed seven powers of ten in magnitude, hopefully without your users noticing any major service interruption in between. On the inside, though, you have rebuilt your system more or less from scratch multiple times.
This kind of change is inevitable. It is impractical and ridiculous to build systems with the construction and communication overhead of the 20 million user system for any smaller-sized deployment. It is also nonsensical to create the organisational and procedural overhead that goes with that kind of system for smaller deployments. On the other hand, it is impossible to cater to 20 million users with a system and an organisation that have been built for a smaller set of users – the architectural decisions and the organisational workflows simply do not apply.
So if you are an organisation that is growing, you will go through that kind of change, and you will have to work hard for your users not to notice. To make matters even more difficult, you'll have to go through that transformation on a deadline, because if you are growing and you are not ready for the next level when it arrives, your organisation will be in very serious trouble. Growth on a deadline is always a crisis for a company.
Another lesson is that there IS a vertical limit, and it is a hard limit. There simply are no larger machines, so even if your software is written in a way that could utilize more CPUs, more memory and more disks, there simply will be no hardware for you to run it on. If you grow, you will ultimately become an architectural clone of Amazon, eBay, Yahoo, Flickr and Google. That does not invalidate the intermediate steps at all – none of the architectures of these large deployments will work on a smaller scale.
So the actual lesson to take away from this is that there is a direction and a target architecture to develop towards, but you may not be there yet, because you are still too small. You can develop with the target architecture in mind, but you cannot develop FOR it, yet. And at every level, given the actual deployed architecture and the actual range of machines available, there is an optimal machine size, taking CPU, memory, disks and price into consideration.
And finally, all of this is at least as much about structural organisational change as it is about hardware and software architecture. Any organisation going through these levels of growth will have to change process and organisation to match. If the growth is fast enough, these changes will not even have time to mature – in terms of a process capability maturity model, you will be forever stuck at the "performed informally" level or slightly above. This, too, contributes a lot to the experience of growth as a crisis.