Micro G Labs
Technology, Engineering

THE ARCHITECTURE IMPERATIVE

Author

Sharad K

Read Time

21 min read

A Practitioner's Guide to Building Systems That Last


THOUGHT LEADERSHIP SERIES: System Design in Modern Cloud Environments

Why Most Systems Fail Before They're Built

The decisions made in the first 10% of a project determine 90% of its eventual cost. Not budget decisions. Architecture decisions.

There's a quiet crisis playing out inside engineering organisations right now. Teams are shipping faster than ever, yet systems are becoming harder to operate, costlier to change, and more fragile under load. The culprit isn't bad engineers; it's the absence of deliberate architecture.

System design isn't a phase. It isn't a whiteboard session before you "get to the real work." It is the work. Every architectural choice you make, or avoid making, compounds over time. The codebase you write in year one becomes the technical debt you're paying down in year three.

"Good system design is not about using the most technologies. It's about using the right ones, making tradeoffs deliberately, and building something your team can understand, operate, and evolve."

This guide is a practitioner's map. It doesn't pretend there are universal answers, because there aren't. What it offers instead is the vocabulary, the mental models, and the tradeoff frameworks that distinguish engineers who build things that last from those who build things that merely ship.


1. The Axes of Quality: What Does "Good" Actually Mean?

Before you pick a single technology, you need to know what you're optimising for. Most architectural mistakes stem from optimising the wrong thing.

A "good" system is not an absolute. It's a system that makes the right tradeoffs for its specific constraints. The eight dimensions below are the axes along which every design is measured. Crucially, they pull against each other: maximising one often compromises another.

  • Scalability: Can it handle 10x more users without redesigning everything? Scalability is not something you add later; it must be designed in from the start.
  • Fault Tolerance & Resilience: Does it survive failures gracefully? In distributed systems, failure is not an edge case; it is the default condition.
  • Performance: Is it fast enough for the use case? And critically: are you optimising latency (individual request speed) or throughput (total capacity)? These are different problems.
  • Availability: What's your uptime target? The difference between 99.9% and 99.99% is not 0.09%; it's 8.7 hours of downtime per year versus 52 minutes. Each additional nine is dramatically harder to achieve.
  • Consistency: Do all users always see the same data? The CAP theorem tells us we must choose between consistency and availability under network partitions. Most real systems choose a spectrum in between.
  • Maintainability: Can a team evolve it over years without it becoming a mess? The systems that survive longest are those that new engineers can understand and change without fear.
  • Security: Is it protected at every layer? Security is not a checkbox; it's a posture that must be designed into every component.
  • Cost Efficiency: Is it doing all of the above without burning unnecessary money? Engineering decisions are also financial decisions.

The Core Insight: The skill of system design is not knowing which of these properties is best. It's knowing which ones matter most for your specific context, and making deliberate tradeoffs accordingly.
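The availability arithmetic in the list above is worth internalising, and a few lines make it concrete (using a 365-day year):

```python
# Downtime per year implied by an availability target.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_minutes_per_year(availability: float) -> float:
    """Minutes of allowed downtime per year for a given availability."""
    return (1 - availability) * MINUTES_PER_YEAR

for target in (0.999, 0.9999, 0.99999):
    minutes = downtime_minutes_per_year(target)
    print(f"{target:.3%} -> {minutes:.1f} min/year ({minutes / 60:.2f} hours)")
```

Three nines allows roughly 8.8 hours of downtime a year; four nines allows about 53 minutes. Each additional nine divides the budget by ten, which is why each one is dramatically harder to hit.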

2. Start With Requirements, Not Technology

The most common mistake in system design is reaching for tools before understanding the problem. Requirements aren't bureaucracy: they're the foundation everything else is built on.

Engineers are builders by nature. The instinct to reach for a tool, a framework, or a database is almost reflexive. But architecture chosen before requirements are understood is just technical debt wearing a costume.

Every design begins with two categories of requirements:

  • Functional Requirements: What does the system actually do? User authentication, storing posts, sending notifications, processing payments. The 'what.'
  • Non-Functional Requirements (NFRs): How well does it do it? The 'how good.' These are the requirements that most directly drive architectural decisions.

The NFRs that matter most in practice:

  • Expected traffic: requests per second, concurrent users. A system serving 1,000 users looks completely different from one serving 10 million.
  • Data volume: gigabytes today, terabytes next year? Storage and retrieval strategies change dramatically at scale.
  • Read/write ratio: a read-heavy system (social feed) and a write-heavy system (IoT telemetry) require fundamentally different architectures.
  • Latency targets: is 100ms acceptable? 1 second? The answer determines whether you need edge computing, caching strategies, or a complete rethink.
  • Consistency requirements: can users tolerate brief periods of stale data, or must they always see the latest? Eventual consistency unlocks massive scalability.
  • Geographic distribution: single region or global? Global adds latency, compliance complexity, and data residency concerns.
  • Compliance: GDPR, HIPAA, SOC2. These aren't optional; they shape your data model, access controls, and audit requirements.

Estimating these numbers, even roughly, is not an academic exercise. It's the foundation every subsequent architectural decision rests on.
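The estimation step can literally be a few lines of arithmetic. The inputs below are illustrative assumptions for a hypothetical service, not benchmarks:

```python
# Back-of-envelope capacity estimate. Every input here is an assumption
# you would replace with your own product's numbers.
daily_active_users = 1_000_000
requests_per_user_per_day = 50
peak_to_average_ratio = 3        # traffic is rarely flat across the day
avg_payload_bytes = 2_000        # size of each stored write
read_write_ratio = 10            # 10 reads for every write

SECONDS_PER_DAY = 86_400

total_rps = daily_active_users * requests_per_user_per_day / SECONDS_PER_DAY
peak_rps = total_rps * peak_to_average_ratio
write_rps = total_rps / (read_write_ratio + 1)
daily_storage_gb = write_rps * SECONDS_PER_DAY * avg_payload_bytes / 1e9

print(f"average RPS: {total_rps:,.0f}, peak RPS: {peak_rps:,.0f}")
print(f"write RPS: {write_rps:,.0f}, new storage/day: {daily_storage_gb:.1f} GB")
```

Even this crude model answers real questions: a few hundred average RPS rules out exotic infrastructure, while ~9 GB of new data per day tells you how soon a single instance stops being enough.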


3. Architecture Patterns: Choosing Your System's Shape

The highest-level architectural decision determines everything that follows. There's no objectively correct choice: only choices that are correct or incorrect for your context.

The Monolith: Unfairly Maligned

The monolith has become a term of derision in engineering culture. It shouldn't be. Everything in one deployable unit means simpler development, easier debugging, and faster iteration. For early-stage products, teams of fewer than 20, or systems with well-understood bounded domains, the monolith is often the right architecture. Don't let hype push you away from it prematurely.

Microservices: Power With a Price Tag

Breaking a system into small, independently deployable services enables independent scaling, faster deployment cycles, and clear team ownership. But this power comes at a cost: you've traded a simple deployment problem for a distributed systems problem. Network calls replace function calls. Data consistency across services requires careful design. Operational complexity multiplies. Microservices are the right choice when the benefits of independent scaling and deployment outweigh this complexity, which typically means at significant scale.

Event-Driven Architecture: The Scalability Unlock

When services communicate via events rather than direct calls, something remarkable happens: producers and consumers become completely decoupled. Neither knows the other exists. This pattern enables async processing, smooths out traffic spikes, and allows new consumers to be added without touching existing code. Kafka is the backbone of most serious event-driven systems, not because it's simple, but because its guarantees (ordered, durable, replayable events) are uniquely powerful.

Serverless: The Operational Ideal (With Caveats)

No servers to manage, automatic scaling, pay-per-use pricing: serverless sounds like engineering utopia. For bursty, unpredictable workloads and event-driven functions, it often is. The constraints (cold starts, statelessness, execution time limits) are real but manageable with good design. The mistake is applying serverless to workloads that require persistent connections or low-latency responses at high volume.

CQRS: When Reads and Writes Live Different Lives

Command Query Responsibility Segregation separates the read model from the write model. This isn't abstract theory: it's a recognition that reads and writes have fundamentally different performance characteristics. Your write path cares about transactional integrity. Your read path cares about query speed. Treating them as the same problem is often the source of performance bottlenecks.
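The separation can be sketched in a few lines. The class and method names below are illustrative, and a real system would connect the two sides through an event bus and proper storage rather than in-process calls:

```python
# Minimal CQRS sketch: the write side records events with transactional
# intent; the read side maintains a denormalised projection optimised
# for queries. Names are illustrative, not a library API.
from collections import defaultdict

class WriteModel:
    def __init__(self):
        self.events = []                       # append-only event log

    def place_order(self, user: str, amount: float) -> dict:
        event = {"type": "order_placed", "user": user, "amount": amount}
        self.events.append(event)              # integrity lives on this side
        return event

class ReadModel:
    def __init__(self):
        self.total_spend = defaultdict(float)  # pre-aggregated for fast reads

    def apply(self, event: dict) -> None:
        if event["type"] == "order_placed":
            self.total_spend[event["user"]] += event["amount"]

write, read = WriteModel(), ReadModel()
for e in (write.place_order("alice", 30.0), write.place_order("alice", 12.5)):
    read.apply(e)                              # in production: via an event bus

print(read.total_spend["alice"])
```

The read side answers "how much has alice spent?" in constant time, at the cost of the projection lagging the write log slightly; that lag is the eventual-consistency tradeoff CQRS makes explicit.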


4. Getting Traffic In: Networking Fundamentals

Before a single line of your code executes, a request has navigated DNS, CDN edges, load balancers, and API gateways. Each layer is an opportunity for both performance and failure.

DNS: More Than a Phone Book

Most engineers think of DNS as simple name resolution. In cloud architectures, it's a powerful routing layer. Geographic routing sends users to the nearest region. Latency-based routing routes to the lowest-latency endpoint. Failover routing automatically redirects traffic when a region goes dark. AWS Route 53 and Cloudflare make all of this surprisingly accessible.

CDNs: The Most Underutilised Performance Lever

Putting a CDN in front of your application is one of the highest-leverage moves in system design. Static assets, cached API responses, and even personalised content can be served from edge nodes 10–50ms from the user, rather than 150–400ms from your origin. Cloudflare, AWS CloudFront, and Fastly have made edge computing accessible at every scale.

Load Balancers and API Gateways: Your System's Front Door

L4 load balancers route by TCP/IP, fast and dumb, in the best sense. L7 load balancers understand HTTP: they can route by URL path, set headers, enable A/B testing, and orchestrate canary deployments. API Gateways go further: authentication, rate limiting, request transformation, and observability in a single layer. The gateway is where cross-cutting concerns live, so your services don't have to.


5. Compute: Where Your Code Actually Lives

The compute layer has undergone three fundamental shifts in the last decade: from physical servers to virtual machines, from VMs to containers, and from containers to 'serverless' functions. Each shift traded control for operational simplicity.

  • Virtual Machines: Full OS instances. Maximum control, maximum overhead. Still the right choice when you need OS-level customisation or compliance-driven isolation.
  • Containers (Docker): Portable, lightweight application packaging. The standard unit of deployment for modern applications. Fast to start, consistent across environments.
  • Kubernetes: The de facto standard for orchestrating containers at scale. Scheduling, scaling, self-healing, service discovery, all automated. The learning curve is steep; the operational leverage at scale is enormous.
  • Serverless Functions: AWS Lambda, Google Cloud Functions. Run code without managing infrastructure. Auto-scale to zero. The operational ideal for event-driven and bursty workloads.
  • Auto Scaling: The ability to add or remove capacity automatically based on load is a fundamental cloud-native capability. Done correctly, it eliminates over-provisioning and prevents under-provisioning during demand spikes.

6. Data Storage: The Most Consequential Decision

There is no universally best database. There are databases that are well-suited or ill-suited to specific access patterns. Choosing the wrong one is one of the most expensive mistakes in system design, because it's one of the hardest to undo.

The proliferation of database options over the last decade is not complexity for complexity's sake. It reflects a genuine reality: different data shapes and access patterns have fundamentally different optimal storage strategies.

Relational Databases (SQL): The Proven Foundation

Postgres and MySQL have powered the majority of the world's applications for decades. ACID transactions, strong consistency, powerful querying via SQL: these properties are genuinely valuable. For data with clear relationships and where consistency is non-negotiable (financial transactions, user accounts), relational databases remain the gold standard. Their horizontal scaling limitations are real but often irrelevant at the scale of most applications.

NoSQL: Not a Replacement, a Complement

The NoSQL family encompasses radically different systems, united mainly by what they're not. Document stores like MongoDB excel at hierarchical, variable data. Key-value stores like Redis are the fastest data structures you can put a network in front of. Wide-column stores like Cassandra are purpose-built for massive write throughput. Graph databases like Neo4j make relationship traversal trivially fast. The question is never "NoSQL or SQL?" but "What are my access patterns, and which data model best serves them?"

Object Storage: Infinitely Scalable, Surprisingly Powerful

AWS S3, Google Cloud Storage: what started as blob storage for files has become the foundation of modern data architecture. Data lakes, ML training datasets, audit logs, media assets: object storage is infinitely scalable, incredibly cheap, and designed for durability. S3's claimed 11 nines of durability is not marketing; it reflects genuine engineering around redundancy and replication.

The Sharding Question

When a single database instance can no longer handle your data volume or write load, you face the sharding decision. Horizontal sharding (splitting data across multiple instances by key range or hash) is the standard approach. The operational complexity it introduces (cross-shard queries, rebalancing, hotspots) is substantial. Before sharding, exhaust vertical scaling, read replicas, and caching.
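A hash-based shard router is only a few lines. Here `zlib.crc32` stands in for a stable hash function; real systems often use consistent hashing instead, precisely because it makes the rebalancing problem mentioned above cheaper:

```python
# Hash-based shard routing: a stable hash of the shard key picks the
# database instance. Deterministic, so every service instance agrees
# on where a given key lives.
import zlib

NUM_SHARDS = 4

def shard_for(key: str) -> int:
    """Map a shard key (e.g. a user id) to a shard index."""
    return zlib.crc32(key.encode()) % NUM_SHARDS

placement = {user: shard_for(user) for user in ("user-1", "user-2", "user-3")}
print(placement)
```

The weakness is visible in the modulo: changing NUM_SHARDS remaps almost every key, which is why resharding is so operationally painful and why it pays to exhaust replicas and caching first.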


7. Caching: The Highest-Leverage Optimisation

Caching is not an optimisation you add when things are slow. It's an architectural pattern that, designed in from the start, determines whether your system can scale at all.

Every layer of a modern system is an opportunity for caching: the browser caches assets, the CDN caches responses, the application caches query results, and the database caches pages. The question is not whether to cache but where, what, and for how long.

The Cache Invalidation Problem

Phil Karlton's famous observation that there are only two hard problems in computer science (cache invalidation and naming things) remains painfully relevant. Stale data is a bug. Cache misses are a performance hit. The strategies (TTL-based expiry, event-driven invalidation, versioned cache keys) all make tradeoffs between consistency and performance.
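TTL-based expiry combines naturally with the cache-aside pattern: read through the cache, fall back to the store on a miss, and let entries age out. A minimal in-memory sketch, where production would use Redis or similar:

```python
# Cache-aside with TTL expiry. The TTL bounds staleness: data can be
# wrong for at most `ttl_seconds` after a write to the backing store.
import time

class TTLCache:
    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self.entries = {}                     # key -> (value, expires_at)

    def get(self, key):
        entry = self.entries.get(key)
        if entry and entry[1] > time.monotonic():
            return entry[0]                   # fresh hit
        self.entries.pop(key, None)           # expired or missing
        return None

    def set(self, key, value):
        self.entries[key] = (value, time.monotonic() + self.ttl)

def get_user(cache: TTLCache, db: dict, user_id: str) -> dict:
    user = cache.get(user_id)
    if user is None:                          # miss: load from store, populate
        user = db[user_id]
        cache.set(user_id, user)
    return user

cache, db = TTLCache(ttl_seconds=60), {"u1": {"name": "Ada"}}
print(get_user(cache, db, "u1"))              # miss -> loads from db
print(get_user(cache, db, "u1"))              # hit -> served from cache
```

The TTL is the tradeoff dial: shorter means fresher data and more load on the store; longer means better hit rates and a wider staleness window.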

Redis: Beyond Simple Caching

Redis is often introduced as "a fast cache," which significantly undersells it. Sorted sets enable real-time leaderboards. Pub/sub enables lightweight messaging. Streams provide a durable, replayable event log. Lua scripts enable atomic multi-step operations. Redis is a data structure server with caching as one of its many capabilities.


8. Async Messaging: Decoupling as a Superpower

Not everything needs to happen synchronously. The decision to make an operation asynchronous is often the single change that transforms a brittle, tightly-coupled system into a resilient, scalable one.

Synchronous request-response is intuitive but fragile. If the downstream service is slow, the upstream caller is slow. If it's down, the caller fails. Asynchronous patterns break this coupling.

  • Message Queues (SQS, RabbitMQ): Send a message, move on. The worker picks it up when ready. Traffic spikes become queue depth rather than cascading failures. Retry logic becomes a property of the queue, not the caller.
  • Event Streaming (Kafka, Kinesis): Unlike queues, streams retain events for a configurable period. Multiple consumers can independently read and replay the same events. This is the backbone of event-driven architecture and real-time data pipelines.
  • Pub/Sub (SNS, Google Pub/Sub): One event triggers multiple independent downstream actions. Adding a new reaction to an event requires no changes to the publisher.
  • Sagas: The pattern for managing distributed transactions across microservices without a two-phase commit. A sequence of local transactions, each triggering the next via events. When a step fails, compensating transactions roll back the previous steps. Complex to design correctly; essential for any multi-service business process.
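A saga can be sketched as a list of (action, compensation) pairs. The step names below are illustrative, and a real implementation would coordinate the steps via events across services rather than in-process calls:

```python
# Saga sketch: each local step is paired with a compensating action.
# When a step fails, the steps that already succeeded are undone in
# reverse order.
def run_saga(steps) -> bool:
    """steps: list of (action, compensation) pairs of callables."""
    done = []
    try:
        for action, compensate in steps:
            action()
            done.append(compensate)
    except Exception:
        for compensate in reversed(done):     # undo what already succeeded
            compensate()
        return False
    return True

log = []

def fail_shipping():
    raise RuntimeError("carrier unavailable")

steps = [
    (lambda: log.append("reserve stock"), lambda: log.append("release stock")),
    (lambda: log.append("charge card"),   lambda: log.append("refund card")),
    (fail_shipping,                       lambda: log.append("cancel shipment")),
]
ok = run_saga(steps)
print(ok, log)
```

Note the order of the rollback log: the card is refunded before the stock is released, mirroring the forward sequence in reverse. Designing compensations that are safe to run after partial failure is where the real difficulty lives.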

9. Scalability: Patterns That Actually Work

Scalability is not a feature you turn on. It's a property you design for from the start, or pay dearly to retrofit.

The Statelessness Prerequisite

Horizontal scaling (adding more instances to handle more load) only works when your services are stateless. If an instance stores session data locally, you can't freely route requests to any instance. The principle is simple: all state must live externally in databases, caches, or distributed stores. Applying it consistently requires discipline in practice.

Rate Limiting: Protecting Your System From Itself

Rate limiting is often thought of as a security concern. It's equally a scalability concern. Without it, a surge in legitimate traffic (or a single misbehaving client) can overwhelm your system. The Token Bucket algorithm (a bucket fills at a constant rate and each request consumes a token) is the standard approach, offering smooth rate limiting with burst tolerance.
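A deterministic sketch of the Token Bucket follows. The caller supplies timestamps here to keep the example testable; in production you would pass `time.monotonic()`:

```python
# Token Bucket: the bucket refills at `rate` tokens/second up to
# `capacity`; each request spends one token. Enforces the average
# rate while tolerating short bursts.
class TokenBucket:
    def __init__(self, rate: float, capacity: float, now: float = 0.0):
        self.rate, self.capacity = rate, capacity
        self.tokens, self.last = capacity, now

    def allow(self, now: float) -> bool:
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=5, capacity=10)      # 5 req/s average, bursts of 10
burst = [bucket.allow(now=0.0) for _ in range(12)]  # 12 requests at t=0
recovered = bucket.allow(now=1.0)              # one second later: 5 tokens back
print(burst.count(True), recovered)
```

The burst of 12 sees 10 requests allowed and 2 rejected; a second later the refill has restored capacity. `capacity` sets the burst tolerance, `rate` the sustained limit, and the two can be tuned independently.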

Backpressure: The Signal That Saves Systems

When a downstream service is overwhelmed, the correct response is to signal upstream to slow down, not to silently drop requests or crash. Backpressure is the mechanism by which systems maintain stability under overload. Kafka consumers implement this naturally; HTTP APIs require explicit design.
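The simplest form of backpressure is a bounded buffer that rejects new work when full; the HTTP analogue is returning 429 or 503 with a Retry-After header rather than queueing indefinitely:

```python
# Backpressure via a bounded queue: when the buffer is full the
# producer gets an explicit rejection it must handle, instead of the
# queue growing without bound while the consumer drowns.
from queue import Queue, Full

inbox = Queue(maxsize=3)                      # deliberately small bound

def try_submit(item) -> bool:
    try:
        inbox.put_nowait(item)                # non-blocking: fail fast when full
        return True
    except Full:
        return False                          # upstream should back off or shed load

accepted = [try_submit(i) for i in range(5)]
print(accepted)
```

The explicit `False` is the whole point: the overload signal propagates upstream, where the caller can retry with backoff or shed load, instead of disappearing into an unbounded queue.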


10. Resilience: Designing for Inevitable Failure

In distributed systems, failure is not an exception. It is the baseline assumption. The question is not if components will fail, but how your system behaves when they do.

"The systems that survive the longest are not those that prevent failures. They're the ones that contain them."

Circuit Breakers: The Failure Firewall

The circuit breaker pattern is one of the most important resilience patterns in distributed systems. When a downstream service starts failing, stop sending it requests and return a fallback response immediately. This prevents a single failing service from triggering cascading failures across the entire system. Hystrix and Resilience4j implement this pattern; it should be standard in any service-to-service communication.
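A minimal sketch of the core idea follows; libraries like Resilience4j add half-open probing, sliding failure windows, and metrics on top of this. The optional `now` parameter is only there to make the example deterministic:

```python
# Minimal circuit breaker: after `max_failures` consecutive failures
# the circuit opens and calls fail fast to the fallback; after
# `reset_timeout` seconds the next call is allowed through as a trial.
import time

class CircuitBreaker:
    def __init__(self, max_failures: int = 3, reset_timeout: float = 30.0):
        self.max_failures, self.reset_timeout = max_failures, reset_timeout
        self.failures, self.opened_at = 0, None

    def call(self, fn, fallback, now=None):
        now = time.monotonic() if now is None else now
        if self.opened_at is not None and now - self.opened_at < self.reset_timeout:
            return fallback()                 # open: fail fast, no network call
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = now          # trip the breaker
            return fallback()
        self.failures, self.opened_at = 0, None   # success closes the circuit
        return result

def flaky():
    raise TimeoutError("downstream unavailable")

cb = CircuitBreaker(max_failures=3, reset_timeout=30.0)
for _ in range(3):
    cb.call(flaky, fallback=lambda: "cached response", now=0.0)

# The breaker is now open: this call never touches the downstream.
fast = cb.call(lambda: "live", fallback=lambda: "cached response", now=1.0)
print(fast)
```

The crucial behaviour is the last call: even though the downstream has recovered, the open breaker serves the fallback instantly, protecting the struggling service from a thundering herd until the reset timeout elapses.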

Chaos Engineering: Verifying Your Resilience Claims

Netflix's Chaos Monkey deliberately terminated production instances to verify that the system could survive them. This sounds terrifying; it's actually the most rigorous approach to resilience testing available. Resilience mechanisms that are never exercised are resilience mechanisms that may not work when you need them. Chaos engineering moves failure from being a surprise to being a known quantity.

Multi-Region Architecture: The Ultimate Resilience Investment

Active-active multi-region deployment (serving traffic from multiple regions simultaneously) is the highest tier of availability architecture. It eliminates regional failover delays, improves performance for geographically dispersed users, and provides genuine protection against regional outages. The data consistency challenges it introduces are real and require careful design.


11. Performance: Making the Right Things Fast

Performance optimisation without measurement is guessing. The order of operations matters: measure first, identify the bottleneck, fix the bottleneck, measure again. Premature optimisation is a real failure mode; so is failing to optimise when you have a known bottleneck.

  • Connection Pooling: Opening a new database connection for every request is expensive. Connection pooling reuses existing connections. This is one of the most frequently overlooked and high-impact optimisations in application performance.
  • Async I/O: Blocking a thread while waiting for a database response is wasteful. Async/await, reactive frameworks, and event loops (Node.js, Netty) keep threads busy serving other requests during I/O wait. At high concurrency, this difference is transformative.
  • Database Indexes: The single biggest performance lever in database queries. A full table scan that takes 10 seconds on a million-row table takes milliseconds with the right index. Understanding B-tree and hash indexes, when composite indexes help, and the cost of over-indexing on write performance is essential knowledge.
  • Cursor-Based Pagination: Offset pagination (LIMIT 20 OFFSET 10000) requires scanning and discarding rows up to the offset. At scale, this degrades badly. Cursor-based pagination picks up exactly where the last request left off, with consistent performance regardless of depth.
  • Data Denormalisation: Normalisation is the correct default for data integrity. But read models often benefit from deliberate denormalisation (duplicating data to eliminate expensive joins). This is a conscious tradeoff: write complexity for read performance.
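The cursor-pagination idea in miniature, over an in-memory list sorted by id; with a database the fetch becomes `WHERE id > :cursor ORDER BY id LIMIT :n`:

```python
# Cursor-based pagination: the cursor is the last id the client saw,
# so each page starts exactly where the previous one ended, with
# constant cost regardless of how deep the client has paged.
import bisect

rows = [{"id": i, "body": f"post {i}"} for i in range(1, 101)]  # sorted by id

def fetch_page(rows, cursor=None, limit=20):
    ids = [r["id"] for r in rows]
    start = bisect.bisect_right(ids, cursor) if cursor is not None else 0
    page = rows[start:start + limit]
    next_cursor = page[-1]["id"] if page else None
    return page, next_cursor

page1, c1 = fetch_page(rows)             # ids 1..20, cursor becomes 20
page2, c2 = fetch_page(rows, cursor=c1)  # ids 21..40
print(page2[0]["id"], c2)
```

Because the cursor encodes a position rather than a count, rows inserted or deleted between requests don't shift the page boundaries the way OFFSET does, and an index on the cursor column makes every page equally cheap.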

12. Observability: You Can't Fix What You Can't See

The best-designed systems still fail in unexpected ways. Observability is what transforms failure from a crisis into a diagnostic process.

The three pillars of observability are not optional components: they are the foundation of operating any production system.

  • Logs: Structured, searchable records of events. The key word is structured: machine-readable JSON logs can be queried, aggregated, and alerted on. Unstructured logs are archaeology. Centralise with the ELK stack, Datadog, or CloudWatch.
  • Metrics: Numerical measurements over time: request rate, error rate, latency percentiles (p50, p95, p99), CPU, memory. The p99 latency (the experience of your slowest 1% of users) is often more important than the mean. Prometheus + Grafana is the standard open-source stack.
  • Distributed Traces: A request that touches 12 services before returning a response is impossible to debug with logs alone. Distributed tracing (OpenTelemetry, Jaeger, AWS X-Ray) follows a request across all service boundaries, showing exactly where time is spent and where failures occur.
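What "structured" means for the logging pillar is concrete and small: emit one JSON object per event instead of free text, so the collector downstream can filter and aggregate on fields. The field names below are illustrative:

```python
# Structured logging: one machine-readable JSON line per event.
# A query like "count checkout_failed by error over 5m" becomes
# trivial downstream; grepping prose never is.
import json
import time

def log_event(event: str, **fields) -> str:
    record = {"ts": time.time(), "event": event, **fields}
    line = json.dumps(record)
    print(line)                    # stdout -> log shipper -> ELK/Datadog/etc.
    return line

line = log_event("checkout_failed", user_id="u-42", latency_ms=512,
                 error="payment_gateway_timeout")
```

The discipline is in the fields: consistent names (`user_id`, `latency_ms`) across services are what make cross-service aggregation possible at all.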

Alert on Symptoms, Not Causes

The most common alerting mistake is alerting on causes (CPU > 80%) rather than symptoms (users experiencing errors). An alert should answer: "Are users being harmed right now?" SLO burn rate alerting (alerting when your error budget is being consumed faster than acceptable) is the most sophisticated and effective approach. Pair every alert with a runbook documenting the response procedure.


13. Security: The Foundation, Not the Finishing Touch

Security added at the end is security theatre. Security designed in from the start is security that actually works.

Zero Trust: The Modern Security Posture

The perimeter model of security (trust everything inside the network, distrust everything outside) is dead. Modern architectures span multiple clouds, partner networks, remote employees, and third-party services. Zero Trust replaces implicit trust with explicit verification: every service-to-service call is authenticated, every request is authorised, network location confers no trust. Mutual TLS between services, fine-grained RBAC, and short-lived tokens are the building blocks.

Secrets Management: A Non-Negotiable

Secrets in code are breaches waiting to happen. Secrets in environment variables are marginally better but still problematic. The correct approach is a dedicated secrets manager (AWS Secrets Manager, HashiCorp Vault, GCP Secret Manager) with automatic rotation, fine-grained access control, and audit logging. This is not optional at any scale.

Encryption: Everywhere, Always

TLS in transit is table stakes. Encrypted volumes and object storage at rest is table stakes. Application-level encryption for especially sensitive fields (PII, payment data, health records) provides defence-in-depth against database compromise. The question is never whether to encrypt; it's whether you've encrypted at the right layer.


14. Data Pipelines: Moving Data at Scale

Modern applications are not just operational systems: they are data systems. The ability to move, transform, and analyse data at scale is a competitive capability, not an afterthought.

  • Batch Processing: Hadoop and Spark for processing large volumes of data in scheduled jobs. Efficient for non-time-sensitive analytics, model training, and nightly transformations. The workload that doesn't need to be real-time almost always costs less as batch.
  • Stream Processing: Apache Flink, Kafka Streams, and AWS Kinesis Data Analytics for processing data in real-time as it arrives. Essential for fraud detection (where a 10-second lag is too late), real-time personalisation, and live operational dashboards.
  • ETL/ELT Pipelines: Moving data from operational systems to analytical warehouses (Snowflake, BigQuery, Redshift) for analytics. The modern shift is towards ELT (loading raw data first, transforming in the warehouse) because warehouse compute is now cheap enough to make this preferable.
  • Change Data Capture (CDC): Streaming every database change as an event using Debezium. Enables real-time sync between systems, instant cache invalidation, and immutable audit logs. One of the most underused and powerful patterns in data architecture.

15. Deployment (DevOps Engine): Shipping With Confidence

How you deploy is as important as what you deploy. The ability to ship changes quickly, safely, and reversibly is a core engineering capability, not a DevOps concern separate from development.

Infrastructure as Code: The Only Way to Manage Infrastructure

Clicking through cloud consoles to provision infrastructure creates invisible, irreproducible state. Infrastructure as Code (Terraform, AWS CDK, Pulumi) defines infrastructure in version-controlled, reviewable, reproducible code. The question "what changed?" has a git diff as its answer. This is not a best practice; it's the baseline for any serious engineering organisation.

Deployment Strategies: Reducing the Blast Radius

Not all deployments are equal. Blue/green deployments maintain two identical environments and switch traffic instantly, enabling immediate rollback. Canary deployments gradually shift a percentage of traffic to the new version, allowing real-world validation before full rollout. Feature flags decouple deployment from release entirely, as code can be deployed dark and enabled for specific users or cohorts. These patterns collectively transform deployment from a high-stakes event into a routine operation.

GitOps: Git as the Source of Truth

GitOps uses Git as the authoritative source of cluster state. ArgoCD or Flux continuously reconciles the cluster to match what's in Git. Changes go through pull requests (auditable, reversible, reviewable). Rollback is a git revert. Compliance auditors love it. Operations teams sleep better.


16. The Design Process: A Framework for Good Decisions

Good system design is not inspiration. It's a repeatable process that starts with understanding and ends with deliberate choices.

When approaching any system design problem, the following sequence produces consistently better outcomes:

  1. Clarify requirements: functional and non-functional. Estimate scale before designing for it.
  2. Define the API: what are the interfaces? What does the client call? API design constrains the implementation.
  3. Design the data model: what data exists, how is it related, and critically, how is it accessed? Access patterns determine storage choice.
  4. Choose your storage: based on access patterns, consistency needs, and scale. Not based on familiarity or hype.
  5. Design the high-level architecture: services, queues, caches, CDN, load balancers. The boxes and arrows.
  6. Identify bottlenecks: where will this break under load? Where are the single points of failure?
  7. Apply resilience patterns: circuit breakers, retries, redundancy, multi-AZ. Design for failure before it happens.
  8. Add observability: logs, metrics, traces, alerting. If you can't see it, you can't operate it.
  9. Address security: AuthN/AuthZ, encryption, secrets management. At every layer.
  10. Iterate and refine: no design survives first contact with reality unchanged. The design process doesn't end at deployment.

The Closing Principle: Complexity is not sophistication. The most impressive systems are often the simplest designs that correctly address the actual constraints. Master the tradeoffs, resist the hype, and build things that last.