A Blueprint for Designing Complex Architectures: Part 2
In a previous post, we discussed two architectural design principles for building complex systems: the importance of designing for security, and the importance of managing latency. As part of our continuing effort to share what we've learned building products and running operations at scale, we’ll examine two more design principles here.
Principle #3: Design for cloud, with its almost infinite scale and elasticity
Design for elasticity—don't assume your solution is elastic just because your cloud infrastructure is:
Partition horizontally, avoid funnels and single points of failure, and don't compete with the network when aggregating data. Divide the system into components that work together but can keep working in isolation, at least for several minutes
Uniquely identify all system components, make them explicitly addressable, and don't hide them behind global addresses; if you know the explicit address of a system component, use it. Avoid generic shared queues; favor semantically driven queues and efficient event brokers instead
Implement elasticity all the way up to the solution level; if you rely on infrastructure elasticity alone, scaling decisions are based on outside observations of the solution's behavior rather than on the behavior itself
Always remember and apply the CAP theorem to every implemented function and system component
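Horizontal partitioning without a funnel can be as simple as deterministic key hashing, so no central router becomes a single point of failure. The sketch below is illustrative (the function name and partition count are assumptions, not from the post): each component owns a partition and can keep serving its keys even when the others are unreachable.

```python
import hashlib

def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition deterministically, with no central lookup.

    Every component can compute this locally, so routing keeps working
    even if the rest of the system is temporarily unreachable.
    """
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_partitions
```

Because the mapping is a pure function of the key, any component (or client) can route requests in isolation, which is what lets partitions keep working independently for a while.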
Avoid an initial implementation with built-in scale limitations; scalability must be fully integrated into the design of every system component. Cloud is not just hosting: elasticity (scaling both up and down) must be part of a component's early design, and deployments should follow the Twelve-Factor App methodology for building SaaS apps.
Taking an API-centric approach, even at the infrastructure level, leads to solutions built from loosely coupled components that can run in multiple environments.
Principle #4: Instrument and log everything
Know what's happening before you try to improve anything:
Components must be designed to expose rich enough runtime information and to measure their own performance (throughput, latency, and even jitter in latency)
Events must carry correlation information to provide a big-picture view of a system composed of different components; system components must implement simple rules for creating and propagating correlation information. Correlation can be horizontal (between system components of the same type) or vertical (spanning the infrastructure, OS, application, and service levels).
Instrumentation and logging activities must have minimal impact on the operation of the system components, so extract the data out of the components as quickly as possible. When the cost of transporting the data becomes a problem, take a cascading approach: aggregate close to the source and transport the reduced data upward
Not everything needs to happen in real time; implement flexible delivery options to handle both real-time and non-real-time logs
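The correlation rules above can be sketched in a few lines: create an ID at the edge if none exists, propagate it unchanged, and stamp it onto every log event. The header name, function names, and log format below are assumptions for illustration, not part of the original post.

```python
import uuid

CORRELATION_HEADER = "X-Correlation-ID"  # illustrative header name

def ensure_correlation_id(headers: dict) -> dict:
    """Create a correlation ID at the system edge, or propagate the incoming one."""
    if CORRELATION_HEADER not in headers:
        headers = {**headers, CORRELATION_HEADER: str(uuid.uuid4())}
    return headers

def log_event(component: str, message: str, headers: dict) -> str:
    """Tag every log line with the correlation ID so events emitted by
    different components can later be stitched into one end-to-end trace."""
    return f"[{headers[CORRELATION_HEADER]}] {component}: {message}"
```

The simplicity is the point: as long as every component applies these two rules, a single ID threads through logs from the infrastructure level up to the service level.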
Marketing and product managers need data to understand the user experience and service usage. Without proper instrumentation and logging, the wrong product-strategy decisions get made and users get frustrated. The same data can be used to solve "inexplicable" problems between multiple, apparently unrelated parts of the system; it can also be used to detect and prevent system deadlocks, avoiding shotgun debugging.
Since all system components both expose and consume APIs, data generated at the API level becomes a key source of information for business and engineering alike.
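Capturing that API-level data can be done with a thin wrapper around each handler that records call counts and latency with minimal overhead. This is a sketch under stated assumptions: the decorator name, the in-memory metrics list, and the `/orders` endpoint are all hypothetical; a real system would ship these records to a metrics pipeline.

```python
import time
from functools import wraps

api_metrics = []  # illustrative; in practice this would feed a metrics pipeline

def instrumented(endpoint: str):
    """Record latency for every call to an API handler.

    The overhead is one timestamp pair per call, keeping the impact on
    the instrumented component small.
    """
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                api_metrics.append({
                    "endpoint": endpoint,
                    "latency_s": time.perf_counter() - start,
                })
        return wrapper
    return decorator

@instrumented("/orders")  # hypothetical endpoint
def get_orders():
    return ["order-1", "order-2"]
```

The same records serve two audiences: engineering reads the latency distribution, while product and business teams read the per-endpoint usage counts.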
As we said earlier, these are common-sense solutions to common problems. They are not rules to be followed without exception, but they have been observed to work in many cases.