This is a translation of a German-language article I wrote two weeks ago for my German-language blog.
In 2004, when I was still working for web.de, I gave a little talk on Scaleout at Linuxtag. Even back then, two major messages of the talk were "Every read problem is a cache problem" and "Every write problem is a problem of distribution and batching":
To scale, you have to partition your application into smaller subsystems and replicate the data. The replication should be asynchronous, not using two-phase commit (2PC), or the gain will be small to nothing. Writes must be delayed and batched, so they can be handled more efficiently. To avoid bottlenecks, data storage should be decentralized and a central database should be avoided.
That often contradicts what traditional database classes at university teach: it is insecure, you are working with false or outdated data, and you do not even know at all times whether everything worked out the way you imagined it should have.
It has an important advantage: It works. And it scales.
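The "delay and batch" advice for writes can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `BatchingWriter` class, the `mail_log` table and the batch size are hypothetical, and an in-memory SQLite database stands in for whatever storage a real service would use.

```python
import sqlite3


class BatchingWriter:
    """Buffers single-row writes and flushes them as one transaction."""

    def __init__(self, conn, batch_size=100):
        self.conn = conn
        self.batch_size = batch_size
        self.pending = []
        self.commits = 0

    def write(self, recipient, size):
        # The caller returns immediately; the write is only queued here.
        self.pending.append((recipient, size))
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        if not self.pending:
            return
        # One transaction, and therefore one commit, for the whole batch.
        with self.conn:
            self.conn.executemany(
                "INSERT INTO mail_log (recipient, size) VALUES (?, ?)",
                self.pending,
            )
        self.commits += 1
        self.pending.clear()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mail_log (recipient TEXT, size INTEGER)")
writer = BatchingWriter(conn, batch_size=100)
for i in range(250):
    writer.write(f"user{i % 10}", 1024)
writer.flush()  # flush the remainder
rows = conn.execute("SELECT COUNT(*) FROM mail_log").fetchone()[0]
```

In this sketch 250 writes cost three commits instead of 250. The price is the one described above: until `flush()` runs, the queued rows are not yet stored, and a reader sees outdated data.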
Services
At Amazon they are telling the same story:
One way that we've made this much easier for ourselves here at Amazon is that internally we are a services-oriented architecture, and I don't mean that in terms of the buzzword of the last five years. We've been this long before that word became public.
What I mean by that is that within Amazon, all small pieces of business functionality are run as a separate service. For example, this availability of a particular item is a service that is a piece of software running somewhere that encapsulates data, that manages that data. And when, for example, we hit the gateway page of Amazon, it will go out to somewhere between 100 and 150 different services to construct this page for you.
That means that we've localized the notion of availability and consistency towards each service because each service is responsible for its data encapsulation. And then, depending on what kind of service it is and what kind of consistency guarantees they require, you use different technologies.
But it must be absolutely clear I think to everyone that we're not using two phase commit to update distributed resources all over the world.
This is very similar to what web.de is using: To the end user the system appears more or less as a single system. Internally, the system is built from ~150 services talking to each other through a middleware layer. The construction sometimes looks a bit weird, by traditional OSS architectural standards.
The incoming mailer at web.de was at one time an Exim 3. It was forked internally, because somebody married it to Mico, a Corba layer. The mailer is effectively no longer a single machine. Instead, it is connected to a bunch of machines at the backend, talking to the middleware layers.
Incoming mails require a lot of communication:
- The Userservice will confirm whether the destination user exists and whether it is a paying customer.
- The Quotaservice will tell if the destination user still is capable of receiving mail.
- The User Profile Service will tell what other preferences for mail reception are set by the destination user.
- The IP Blacklist Service will tell if we even talk to the source IP address.
- The Spamfilter Service will judge the mail header and body and classify the mail as spammy or not.
- The Virus Scanner Service will check the mail and judge if it is known dangerous content or not.
- Finally, the Hierarchical Storage Management Service will decide on a storage location for the mail body.
The Diamond Pattern
All of these services are not part of the mailer. The mailer instead has network connections to middleware servers, which perform their services independently of the mailer and offer them to the mailer, but also to other frontend services. None of the middleware services exists as just a single instance: in reality the mailer is talking to a load balancer, which then redirects the request to one of multiple instances of that service.
The result is the diamond pattern: Incoming mail goes to a loadbalancer pair and will be distributed out to some three dozen incoming mailer instances. All of these send their requests to middleware loadbalancer pairs, which concentrate and then spread out the requests to the appropriate middleware instances. The middleware machines concentrate database access in connection pools, which then are dispersed again across the replicated database copies.
The pattern (spread out, concentrate, spread out, concentrate, ...) allows each layer of the architecture to fail independently without affecting the functionality of the entire system. Faults are fenced and contained to very small subsystems.
Also, load problems can be handled at the operating level without disturbing higher layers in IT or even development: If one service is overloaded, operating can simply add a few servers and everything is okay. "We are pope?" "Diana crashed into a tunnel wall?" "Not a problem. Just push a few blades into the news portal."
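One layer of the diamond can be sketched as a pool of interchangeable service instances behind a balancer. This is a toy model, not web.de's implementation: `ServicePool`, the round-robin policy and the instance names are all assumptions chosen for illustration.

```python
import itertools


class ServicePool:
    """Round-robin balancer over interchangeable instances of one service."""

    def __init__(self, instances):
        self.instances = list(instances)
        self._counter = itertools.count()

    def add(self, instance):
        # "Just push a few blades in": callers are not touched.
        self.instances.append(instance)

    def call(self, request):
        # Try each instance at most once, starting at the next slot.
        start = next(self._counter)
        n = len(self.instances)
        for offset in range(n):
            instance = self.instances[(start + offset) % n]
            try:
                return instance(request)
            except ConnectionError:
                continue  # the fault stays fenced inside the pool
        raise ConnectionError("all instances down")


def healthy(name):
    return lambda request: f"{name} handled {request}"


def broken(request):
    raise ConnectionError("instance down")


pool = ServicePool([healthy("quota-1"), broken, healthy("quota-2")])
answers = [pool.call(f"req{i}") for i in range(4)]
pool.add(healthy("quota-3"))  # scaling out at the operating level
```

The caller only ever sees `pool.call()`. A dead instance is skipped inside the pool, and adding capacity changes nothing for the layers above.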
Complexity
When building applications according to these patterns, there are a few inconvenient truths to deal with:
Old and inexact data for example.
When using asynchronous replication and shunning 2PC, it is possible, for example, for the Quota Service to return numbers that are a bit off on the good or the bad side. This can happen whenever a user is receiving a second mail or has just deleted mail at the very moment the system is receiving a mail for that particular user. The second change is already somewhere in the system, but has not yet been distributed to all instances that deal with Quota. These other instances have also not been locked as "within an update", because we do not lock and do not use 2PC.
That is a good thing: Any asynchronous system can be made synchronous by adding waits. With 2PC we make the system more exact, but also a lot slower.
Do we have to? That depends a lot on the requirements. Is it necessary that a user stays below Quota all of the time? If it were a credit card account and we were dealing with money, probably yes. But with mail and storage, the answer is clearly "no". For web.de as a whole it is mandatory that all users on average are below Quota so that the storage management can plan allocations properly. But for any single user it is completely okay to return a Quota value that is "approximately" correct, i.e. one or two transactions late.
This is a much weaker requirement, because it allows us to do away with 2PC and change the system to asynchronous, lockless operation. We spare the waits and the locks and add a lot of scalability.
Another inconvenient fact of life is asynchronous remote procedure calls. To stay with the example, consider a situation where the mailer has to accept a mail. To do so, it has to query six to ten external services over the network. Doing this in sequence, each stage could use information from the previous stage. But we would be piling round trip times, timeouts and other waits on top of each other, and mail delivery times would be unacceptably high. From the mailer's POV the sequence of operations is call, wait, read answer. Call, wait, read answer. ...
The alternative pattern is to shoot a broadside of requests to all services in parallel. This will activate the entire middleware array at once, returning the answers to the mailer as quickly as possible. The sequence of operations is now "Call, call, call, ... wait, read answer, read answer, ..." This is a much faster pattern, and the advantage grows with the number of services you have to query in order to get the job done.
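The difference between the two call sequences can be made concrete with a small sketch using Python's `asyncio`. The service names and the 50 ms latencies are made up for illustration, and `asyncio.sleep` stands in for a real network round trip.

```python
import asyncio
import time

# Hypothetical service names and latencies (seconds), for illustration only.
SERVICES = {
    "user": 0.05,
    "quota": 0.05,
    "profile": 0.05,
    "blacklist": 0.05,
    "spam": 0.05,
    "virus": 0.05,
}


async def call_service(name, latency):
    # Stand-in for a network round trip to a middleware service.
    await asyncio.sleep(latency)
    return name, "ok"


async def sequential():
    # Call, wait, read answer. Call, wait, read answer. ...
    return [await call_service(n, lat) for n, lat in SERVICES.items()]


async def broadside():
    # Call, call, call, ... wait, read answer, read answer, ...
    calls = (call_service(n, lat) for n, lat in SERVICES.items())
    return await asyncio.gather(*calls)


def timed(coro_fn):
    start = time.perf_counter()
    result = asyncio.run(coro_fn())
    return result, time.perf_counter() - start


seq_result, seq_time = timed(sequential)
par_result, par_time = timed(broadside)
```

Both variants return the same answers. Under these assumed latencies the sequential run takes roughly the sum of all round trips, while the broadside takes roughly as long as the slowest single call.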
It also has problems. For example, there are dependencies between the queries. The Hierarchical Storage Management would return different results depending on whether the mail is spammy or not, and whether the receiver is a freemail customer or a paying customer. One solution is to synchronize these dependencies: the query to the HSM would wait until the queries to the Userservice and the Spamservice have returned.
But "Waits Are Evil", we have learned, so we are looking at a different solution: Instead of returning the answer to the mailer, we change the HSM to return all possible answers and then have the mailer logic sort things out internally, throwing all unneeded information away and committing the decision about which storage location has actually been used back to the HSM. Coolness: Speculative Execution.
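A minimal sketch of this speculative pattern, again with `asyncio`. The storage tier names, the spam rule and the way a paying customer is recognized are all invented for illustration. The text does not describe the real HSM policy.

```python
import asyncio


def storage_for(spam, paying):
    # Hypothetical storage tiers, assumed for this sketch only.
    if spam:
        return "cheap-spam-store"
    return "fast-store" if paying else "standard-store"


async def hsm_all_answers():
    # The HSM does not know spam/paying status yet, so it answers every case.
    await asyncio.sleep(0.01)
    return {
        (spam, paying): storage_for(spam, paying)
        for spam in (False, True)
        for paying in (False, True)
    }


async def user_service(user):
    await asyncio.sleep(0.01)
    return user.endswith("@paying.example")  # assumed rule, illustration only


async def spam_service(body):
    await asyncio.sleep(0.01)
    return "viagra" in body.lower()  # toy classifier


async def accept_mail(user, body):
    # All three queries go out at once; none waits for another.
    candidates, paying, spam = await asyncio.gather(
        hsm_all_answers(), user_service(user), spam_service(body)
    )
    chosen = candidates[(spam, paying)]
    # Here the mailer would commit `chosen` back to the HSM
    # and throw the unused candidates away.
    return chosen


location = asyncio.run(accept_mail("alice@paying.example", "Hello Bob"))
```

The HSM does up to four times the work in this sketch, but no query waits for another one. That is the trade: wasted computation in exchange for removed waits.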
We have now achieved 100% Buzzword-Compliance: "Our webmail application is implemented as a distributed service oriented architecture that leverages replicated data, asynchronous remote procedure calls and speculative execution techniques to achieve resiliency and scalability at every level of its architecture." ("Bingo!"). Indeed we now have a system that can go into load regions where no monolithic system has been before.
At what price? Complexity!
Exim is, even without Mico inside, a truly fat mailer and putting Corba in there approximately doubles the codebase in size. If you are now switching to async procedure calls, you are suddenly leaving the realm of the well tested codebase and run into a number of quite exciting and interesting bugs.
In dealing with inexact or outdated information, you have to handle error conditions at the application level that would not be present in synchronous systems and which are sometimes hard to detect or reproduce. Imagine a system built from some 150 services not using 2PC, in which a customer has upgraded himself from Freemail to paying customer. That's a transaction, but without 2PC you cannot handle it as such. There may be subsystems where the customer is already a paying customer and others which have not yet picked up the change. Querying such a system state will yield inconsistent and contradictory results, which are also not repeatable.
Each and every subsystem, each piece of code, must be able to detect such a situation and handle it (for example, by waiting and retrying later). Contradictory results can in some circumstances not be cleared up, and so each service must have a "dead requests queue" that is handled by some higher intelligence after rerunning the request a suitable number of times after some waiting.
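The retry-then-park logic described above might look like the following sketch. The names `InconsistentState` and `process_with_retries`, the limit of three attempts and the simulated users are assumptions for illustration. A real service would also wait between attempts.

```python
from collections import deque


class InconsistentState(Exception):
    """Raised when replicas disagree, e.g. free vs. paying customer."""


def process_with_retries(requests, handler, max_attempts=3):
    """Retry inconsistent requests; park hopeless ones in a dead queue."""
    work = deque((req, 1) for req in requests)
    done, dead = [], []
    while work:
        req, attempt = work.popleft()
        try:
            done.append(handler(req))
        except InconsistentState:
            if attempt < max_attempts:
                # A real service would wait here before retrying.
                work.append((req, attempt + 1))
            else:
                dead.append(req)  # left for "some higher intelligence"
    return done, dead


# Simulated replicas: "bob" succeeds on the third attempt, "eve" never does.
attempts_seen = {}


def handler(user):
    attempts_seen[user] = attempts_seen.get(user, 0) + 1
    if user == "eve" or (user == "bob" and attempts_seen[user] < 3):
        raise InconsistentState(user)
    return f"{user}: delivered"


done, dead = process_with_retries(["alice", "bob", "eve"], handler)
```

After the run, `done` holds the requests that eventually went through and `dead` holds the ones that stayed contradictory after all attempts.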
Do you want such complexity? Certainly not, if you don't absolutely require it. But to cross a certain load boundary or size boundary you do require it, because there will be no single monolithic, synchronous and locking system that can handle this kind of load.
Relaxing requirements and dealing with failure
So the key to growth is in relaxing requirements for certain subsystems of your application: By allowing asynchronous processing and outdated or approximated answers, we are suddenly able to take a lot of shortcuts in our implementation. We are excluding a number of boundary cases and changing the contract between us and the service's users from "guaranteed" to "best effort". We are essentially pushing the boundary cases up in the stack, where they can often be handled more easily because at a global scope more information may be present. In doing this, we are then able to optimize for the common cases instead of the difficult cases.
How is development handling the added complexity? Well, development has help from another side: By building small and autonomous services with localized data storage, we are getting services that can be developed and upgraded independently from each other. Each has a clearly defined API and a clearly defined service level towards service users and upstream services, as well as a clearly definable set of invariants and requirements internally.
The development of a service can now go on independently from the other teams. The contract they have to fulfill within the grander scheme of things is clearly defined and is independent of the actual implementation. The release cycle is no longer tied to other services, also, as long as the contract does not change in an incompatible fashion (and even that can often be relaxed by maintaining versioned interfaces). At Amazon they characterize it like this:
A service is not just a software component in that sense, a distributed software component. It is also the model we use for software development in terms of team structure.
So how the process goes is that you have a particular business problem or a technology problem that you want to solve. You build a small team. You give them this problem. And at Amazon, they're allowed to solve that problem in any way they see fit, as long as it is through this hardened API. They can pick the tools they want. They can do any design methodology they want as long as they deliver the actual functionality that they've been tasked with.
...
I think Pat Halens(sp?) uses the metaphor of using a town as an example. So in a big town, you have zoning requirements. You have some general laws about the roads and things like that, but the way that you build your house and the way that you operate your house is all up to you.
So this is a bit the way that Amazon functions, also. We have some requirements: that services has to be monitorable, that they have to be tractable in all sorts of different ways. But in essence, operation is all up to the service owners themselves.
This allows for a large-let's say controlled chaos-which actually works very well because everybody's responsible for their own services.
Service Oriented Architectures are just one instance of "relaxed requirements". By relaxing the requirements on services we allow the implementors more choice in their architecture and are giving them more room for growth. That is a good thing, if you accept it and are living it.
But from the POV of traditional software development or computer science this seems to be new and highly unusual:
Joel Spolsky talks about these relaxed requirements under the name of Leaky Abstractions. Joel's conclusion is different, though: "Leaky Abstractions are dragging us down" - for him, imperfection is a bug. For the architect of a service oriented architecture it is the key to success, though.
Other uses of the same pattern
Looking at a different level of the architecture, we are seeing the same pattern again ("relaxing requirements without changing the API"). MySQL, for example, is all about the same thing: It can handle the same SQL query in a traditional ACID way, using a storage engine such as InnoDB, or it can relax the operating parameters of InnoDB in a way that it is no longer ACID compliant but much faster. Or it can even swap in another storage engine such as MyISAM or Memory, which may be able to handle the same query in an even faster way, but at the expense of even more relaxed requirements.
Traditional ACID is secure in the way that you know that the data is consistent, persistent and complete when "COMMIT" returns to you. If you need that security, MySQL can deliver it. In many cases - actually, in most cases - you do not need that kind of security, and the waits that are implied by it are costing you performance ("COMMIT" is waiting for the disk, and with a single disk you are in general not getting more than 100-200 commits a second).
MySQL allows you to relax requirements for an entire server instance or at the table level, and in doing so to get rid of waits where you can afford it. You are getting a much higher rate of statements per second without any change in your calls or the application. The API used by the application is invariant: it is still the same SQL sent in the same way. This kind of change can be done at the operating level (though there are of course other applications of the same pattern at the application or architecture level, which may have potential for additional gain).
Links
SOA in Wikipedia
ACM Queuecast with Werner Vogels, CTO Amazon
Database War Stories, O'Reilly Radar, Series of 8 articles