A Blueprint for Designing Complex Architectures
Having dealt with a complex system architecture at Vodafone, helped large customers run Apigee solutions on premises, and helped the Apigee engineering team with our own SaaS deployment, I’ve identified four recurring architecture principles for dealing with the complexity of such systems.
These principles are not rules, laws, or perfect truths; they’re statements on the order of “an apple a day keeps the doctor away.” They give a name to a concept, and make it easier to talk and reason about it. They provide a place to hang the feelings we have about good and bad design.
They attempt to translate those feelings into concrete advice. In that sense, the principles are a kind of anodyne. Given some code or design that you feel bad about, you may be able to find a principle that explains that bad feeling and advises you about how to feel better.
More importantly, these principles are heuristics: common-sense solutions to common problems. But like any heuristic, they are empirical in nature. They have been observed to work in many cases; but there is no proof that they always work, nor is there any proof that they should always be followed. As part of our continuing effort to share what we've learned building products and running operations, we’ll examine two of these four principles here.
Principle #1: Design for security
Security should always be treated as a first-class citizen in every design and deployment:
The owner of the data and the data lifecycle must always be identifiable and traceable
The requestor of the data must be identifiable and its actions must be traceable and auditable
The multi-tenancy level (infrastructure, OS, app, service) must be clearly defined and deployed for each data element manipulated by the system
Self and delegated administration must be part of the original design
Security must be understood as an activity on each system component. Defense needs to be practiced in depth and built to contain intrusions. Don’t take a medieval moat approach; think instead of how a modern house is individually protected.
Employing security at the API level using well-known mechanisms and protocols (OAuth and OpenID Connect, for example) on a multi-tenant platform is generally a very good way to start.
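As a rough illustration of the points above, the sketch below checks a request against the caller's tenant and scopes and writes an audit entry whether or not the request is allowed. The names Requestor, Record, AuditLog, and authorize are hypothetical, and the sketch assumes the caller's identity has already been established upstream (for example, from a validated OAuth access token); it is not a description of how any particular product implements this.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class Requestor:
    subject: str        # identity of the caller, assumed already authenticated
    tenant: str         # tenant the caller belongs to
    scopes: frozenset   # actions the caller has been granted


@dataclass(frozen=True)
class Record:
    record_id: str
    owner: str          # identifiable owner of the data
    tenant: str         # tenant the data belongs to


@dataclass
class AuditLog:
    entries: list = field(default_factory=list)

    def write(self, requestor, action, record, allowed):
        # Every decision is recorded, including denials.
        self.entries.append({
            "at": datetime.now(timezone.utc).isoformat(),
            "subject": requestor.subject,
            "action": action,
            "record": record.record_id,
            "owner": record.owner,
            "allowed": allowed,
        })


def authorize(requestor, action, record, audit):
    """Allow the action only inside the caller's tenant and scopes; always audit."""
    allowed = requestor.tenant == record.tenant and action in requestor.scopes
    audit.write(requestor, action, record, allowed)
    return allowed
```

Because the audit entry is written before the decision is returned, a denied cross-tenant request leaves the same trace as an allowed one, which is what makes the requestor's actions auditable after the fact.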
Principle #2: Don't ignore latency and fault handling
Especially when dealing with cloud resources, it’s important to manage failure and not assume that latency will be zero:
Latency (network, software component) must be managed rather than eliminated; multi-level predictive caching, placed close to the point of consumption, must be in place to improve response time
Every component that implements an API must have a health dashboard
Move from "best effort" to "managed" by knowing that a system element is down before trying to use it
Avoid unnecessary intermediaries and aggregators by employing explicit addressing
Latency and fault handling are the responsibility of each system component
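One common way to move from "best effort" to "managed" is a circuit-breaker style health registry: each component records the outcome of its calls, and a dependency that keeps failing is marked down for a cooldown period so that callers fail fast instead of waiting on it. The sketch below is a minimal version under assumed names (HealthRegistry, ComponentDown, managed_call) and assumed thresholds; a real deployment would also feed this state into the health dashboard.

```python
import time


class ComponentDown(Exception):
    """Raised when a component is known to be down and is not even tried."""


class HealthRegistry:
    def __init__(self, failure_threshold=3, cooldown_seconds=30.0,
                 clock=time.monotonic):
        self._threshold = failure_threshold
        self._cooldown = cooldown_seconds
        self._clock = clock            # injectable so the logic can be tested
        self._failures = {}            # consecutive failures per component
        self._down_until = {}          # time until which a component is down

    def is_up(self, name):
        return self._clock() >= self._down_until.get(name, 0.0)

    def record_success(self, name):
        self._failures[name] = 0

    def record_failure(self, name):
        count = self._failures.get(name, 0) + 1
        self._failures[name] = count
        if count >= self._threshold:
            # Mark the component down and start counting afresh.
            self._down_until[name] = self._clock() + self._cooldown
            self._failures[name] = 0


def managed_call(registry, name, operation):
    """Run operation() only if the named component is believed to be up."""
    if not registry.is_up(name):
        raise ComponentDown(name)
    try:
        result = operation()
    except Exception:
        registry.record_failure(name)
        raise
    registry.record_success(name)
    return result
```

A caller would wrap each outbound request, for example managed_call(registry, "billing", fetch_invoice), where fetch_invoice is whatever function performs the actual request, and treat ComponentDown as a signal to serve a cached or degraded response.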
Current system architectures differentiate READ operations from CREATE, UPDATE, and DELETE operations and make READ operations as direct as possible, thus minimizing latency and faults.
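The separation of reads from writes can be sketched as a service whose read path goes through a cache kept close to the consumer, while writes go to the system of record and invalidate the cached copy. This is a simplified illustration: TtlCache and ProductService are hypothetical names, a plain dictionary stands in for the system of record, and the cache is a simple time-to-live cache rather than the predictive, multi-level caching described earlier.

```python
import time


class TtlCache:
    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock            # injectable so expiry can be tested
        self._items = {}               # key -> (value, expires_at)

    def get(self, key):
        hit = self._items.get(key)
        if hit is None:
            return None
        value, expires_at = hit
        if self._clock() >= expires_at:
            del self._items[key]       # expired entries are dropped lazily
            return None
        return value

    def put(self, key, value):
        self._items[key] = (value, self._clock() + self._ttl)

    def invalidate(self, key):
        self._items.pop(key, None)


class ProductService:
    def __init__(self, store, cache):
        self._store = store            # stand-in for the system of record
        self._cache = cache

    def read(self, key):
        # Read path: answer from the cache when possible.
        # Note: a stored value of None is never cached in this sketch.
        cached = self._cache.get(key)
        if cached is not None:
            return cached
        value = self._store[key]
        self._cache.put(key, value)
        return value

    def write(self, key, value):
        # Write path: update the system of record, then drop the stale copy.
        self._store[key] = value
        self._cache.invalidate(key)
```

Keeping the two paths separate means the read path can be tuned for latency on its own, at the cost of reads being up to one TTL stale when data changes outside the write path.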
Coming up in part two, we'll address two more architecture principles: the importance of designing for the cloud, and why you must instrument and log everything.
Photo: Jack Dorsey/Flickr