Vectoring AI

System Design Interview QA - 1

Vectoring AI — Thu, 21 May 2026 00:00:00 GMT

Introduction

This is Part 1 of our System Design Interview QA series, focusing on the foundational concepts that underpin every system design interview. System design means defining the overall architecture of a software system: how components fit together at scale, how failures are handled, and how trade-offs are made across scalability, reliability, performance, databases, APIs, networking, and security.

For infrastructure deep dives (load balancing, caching, Kubernetes, etc.), see System Design Interview QA - 2. For hands-on design problems (URL shortener, chat system, etc.), see System Design Interview QA - 3.

Q1: What Is Scalability and How Do You Scale a System?

Answer:

Scalability is the ability of a system to handle growing amounts of work by adding resources. There are two fundamental approaches: vertical scaling (bigger machines) and horizontal scaling (more machines).

graph TD
    subgraph Vertical["Vertical Scaling (Scale Up)"]
        V1["Small Server
4 CPU, 16GB RAM"]
        V1 -->|"Upgrade"| V2["Large Server
64 CPU, 512GB RAM"]
    end

    subgraph Horizontal["Horizontal Scaling (Scale Out)"]
        LB["Load Balancer"]
        LB --> S1["Server 1"]
        LB --> S2["Server 2"]
        LB --> S3["Server 3"]
        LB --> S4["Server N..."]
    end

    style Vertical fill:#6cc3d5,stroke:#333,color:#fff
    style Horizontal fill:#56cc9d,stroke:#333,color:#fff

Vertical vs Horizontal Scaling

Aspect	Vertical Scaling (Scale Up)	Horizontal Scaling (Scale Out)
Approach	Add more CPU/RAM/disk to one machine	Add more machines behind a load balancer
Complexity	Simple — no code changes	Complex — need stateless design, data partitioning
Cost	Superlinear: high-end hardware costs disproportionately more	Roughly linear with commodity hardware
Limit	Hard ceiling (largest machine available)	Practically unlimited
Downtime	Requires restart to upgrade	Zero downtime — add/remove nodes
Failure	Single point of failure	Fault tolerant — one node fails, others serve
Example	Upgrade PostgreSQL server from 32GB to 256GB RAM	Shard PostgreSQL across 8 nodes

Key Scaling Strategies

graph TD
    SCALE["Scaling Strategies"]
    SCALE --> LB["Load Balancing
(distribute traffic)"]
    SCALE --> CACHE["Caching
(reduce DB load)"]
    SCALE --> SHARD["Database Sharding
(split data across DBs)"]
    SCALE --> ASYNC["Async Processing
(message queues)"]
    SCALE --> CDN["CDN
(serve static content at edge)"]
    SCALE --> MS["Microservices
(scale services independently)"]

    style SCALE fill:#56cc9d,stroke:#333,color:#fff
    style LB fill:#ffce67,stroke:#333
    style CACHE fill:#6cc3d5,stroke:#333,color:#fff

Strategy	What It Does	When to Use
Load balancing	Distribute requests across servers	Always, for any multi-server setup
Caching	Store frequently accessed data in memory	Read-heavy workloads (80/20 rule)
Database sharding	Split data across multiple databases	Data too large for single DB, or write throughput limit hit
Async processing	Offload work to background queues	Long-running tasks (email, video transcoding)
CDN	Cache static assets at edge locations	Global user base, static content
Microservices	Break monolith into independently scalable services	When different components have different scaling needs

Stateless vs Stateful Services

Stateless (preferred for horizontal scaling):
  - Server holds NO user session data
  - Any server can handle any request
  - Session stored in external store (Redis, DB)
  - Easy to add/remove servers

Stateful (harder to scale):
  - Server holds user session in memory
  - Requests must be routed to same server (sticky sessions)
  - Server failure loses user state
  - Scaling requires state migration

Q2: How Do You Ensure Reliability and Fault Tolerance?

Answer:

Reliability means a system continues to work correctly even when things go wrong — hardware failures, software bugs, network issues, or traffic spikes. Fault tolerance is achieved through redundancy, replication, graceful degradation, and automatic recovery.

graph TD
    subgraph Redundancy["Redundancy at Every Layer"]
        LB1["Load Balancer
(Active)"]
        LB2["Load Balancer
(Standby)"]
        LB1 --> APP1["App Server 1"]
        LB1 --> APP2["App Server 2"]
        LB1 --> APP3["App Server 3"]
        APP1 --> DB_P["DB Primary"]
        APP2 --> DB_P
        APP3 --> DB_P
        DB_P -->|"Replication"| DB_R1["DB Replica 1"]
        DB_P -->|"Replication"| DB_R2["DB Replica 2"]
    end

    style LB1 fill:#56cc9d,stroke:#333,color:#fff
    style DB_P fill:#6cc3d5,stroke:#333,color:#fff
    style LB2 fill:#ffce67,stroke:#333

Reliability Patterns

Pattern	Description	Example
Replication	Keep multiple copies of data/services	3 DB replicas across availability zones
Failover	Automatically switch to backup when primary fails	Primary DB fails → promote replica
Health checks	Monitor component health, remove unhealthy nodes	Load balancer pings `/health` every 5s
Circuit breaker	Stop calling a failing service, fail fast	If payment API errors >50%, stop calling for 30s
Retry with backoff	Retry failed requests with increasing delay	Retry after 1s, 2s, 4s, 8s (exponential backoff)
Bulkhead	Isolate failures to prevent cascading	Separate thread pools for payment vs catalog
Graceful degradation	Serve partial functionality when subsystems fail	Show cached feed if recommendation service is down
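
The retry-with-backoff row can be sketched in a few lines. This is a minimal illustration, not a production client: `backoff_delays` and `retry_with_backoff` are hypothetical names, and the optional `jitter` parameter is an added assumption (random jitter is commonly used so that many clients do not retry in lockstep).

```python
import random
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def backoff_delays(retries: int, base: float = 1.0, factor: float = 2.0) -> List[float]:
    """Delay before each retry: base, base*factor, base*factor^2, ..."""
    return [base * factor**attempt for attempt in range(retries)]


def retry_with_backoff(
    operation: Callable[[], T],
    retries: int = 4,
    base: float = 1.0,
    jitter: float = 0.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation; on failure wait and retry, re-raising after the last retry."""
    delays = backoff_delays(retries, base)
    for attempt in range(retries + 1):
        try:
            return operation()
        except Exception:
            if attempt == retries:
                raise  # retries exhausted: surface the error to the caller
            # Wait the scheduled delay plus optional random jitter.
            sleep(delays[attempt] + random.uniform(0, jitter))
    raise AssertionError("unreachable")
```

With the defaults, `backoff_delays(4)` yields the 1s, 2s, 4s, 8s schedule from the table. Only retry operations that are safe to repeat (see the idempotency discussion in Q6).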

Availability Levels

Level	Downtime/Year	Downtime/Month	Use Case
99% (two 9s)	3.65 days	7.3 hours	Internal tools
99.9% (three 9s)	8.76 hours	43.8 minutes	SaaS applications
99.99% (four 9s)	52.6 minutes	4.4 minutes	E-commerce, banking
99.999% (five 9s)	5.26 minutes	26.3 seconds	DNS, payment processing
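
The downtime figures follow directly from the availability percentage. A small sketch of the arithmetic, assuming a 365-day year and an average month of 730 hours (the helper name `downtime_seconds` is illustrative):

```python
HOURS_PER_YEAR = 365 * 24               # 8,760 hours
HOURS_PER_MONTH = HOURS_PER_YEAR / 12   # 730 hours on average


def downtime_seconds(availability_pct: float, period_hours: float) -> float:
    """Allowed downtime in seconds for a given availability over a period."""
    unavailable_fraction = 1 - availability_pct / 100
    return period_hours * 3600 * unavailable_fraction


# Three 9s allow about 8.76 hours of downtime per year.
three_nines_hours = downtime_seconds(99.9, HOURS_PER_YEAR) / 3600
# Five 9s allow about 26.3 seconds per month.
five_nines_seconds = downtime_seconds(99.999, HOURS_PER_MONTH)
```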

Failure Handling Flow

Request comes in:
  1. Load balancer routes to healthy server
     - If server unreachable → try next server
  2. Server processes request
     - If downstream service fails → circuit breaker
       - Circuit CLOSED: forward request normally
       - Circuit OPEN: return cached/fallback response immediately
       - Circuit HALF-OPEN: try one request; if success → close, if failure → reopen
  3. Database write
     - Write to primary → replicate to replicas
     - If primary fails → promote replica (automatic failover)
  4. Return response
     - If timeout → client retries with exponential backoff
     - If persistent failure → graceful degradation (partial response)
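
The circuit breaker states in step 2 can be sketched as a small state machine. This is a simplified illustration under stated assumptions: it trips on a count of consecutive failures rather than an error percentage, and the `CircuitBreaker` class, its parameters, and the injectable `clock` are hypothetical names chosen for the example.

```python
import time
from typing import Any, Callable, Optional


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        reset_timeout: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None  # None means the circuit is closed

    @property
    def state(self) -> str:
        if self.opened_at is None:
            return "CLOSED"
        if self.clock() - self.opened_at >= self.reset_timeout:
            return "HALF-OPEN"  # timeout elapsed: allow a trial request
        return "OPEN"

    def call(self, operation: Callable[[], Any], fallback: Callable[[], Any]) -> Any:
        if self.state == "OPEN":
            return fallback()  # fail fast without touching the downstream service
        try:
            result = operation()
        except Exception:
            self.failures += 1
            # A failed trial request, or too many failures, (re)opens the circuit.
            if self.state == "HALF-OPEN" or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            return fallback()
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each downstream request, for example `breaker.call(fetch_recommendations, serve_cached_feed)`, where both function names are placeholders for the live call and the degraded response.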

Q3: What Are the Key Performance Optimization Strategies?

Answer:

Performance optimization reduces latency (time to respond) and increases throughput (requests handled per second). The key principle is to identify and eliminate bottlenecks at each layer: network, application, and database.

graph LR
    subgraph Latency["Latency Numbers Every Engineer Should Know"]
        L1["L1 cache: 0.5 ns"]
        L2["L2 cache: 7 ns"]
        L3["RAM access: 100 ns"]
        L4["SSD read: 150 μs"]
        L5["HDD seek: 10 ms"]
        L6["Same datacenter roundtrip: 0.5 ms"]
        L7["Cross-region roundtrip: 150 ms"]
    end

    style Latency fill:#56cc9d,stroke:#333,color:#fff

Performance Optimization by Layer

Layer	Strategy	Impact
Network	CDN for static assets	Cuts round-trip latency for distant users (e.g., a ~150 ms cross-region trip becomes a short hop to a nearby edge)
Network	HTTP/2 multiplexing, gzip compression	Fewer connections, smaller payloads
Network	Connection pooling	Avoid TCP handshake overhead per request
Application	Caching (Redis/Memcached)	Sub-millisecond reads vs 10-100ms DB queries
Application	Async processing (message queues)	Don’t block user on slow operations
Application	Pagination and lazy loading	Return only what user needs now
Database	Indexing	Speeds up lookups from O(n) table scans to O(log n) with B-tree indexes
Database	Read replicas	Distribute read load across multiple DBs
Database	Query optimization	Avoid N+1 queries, use JOINs efficiently
Database	Denormalization	Trade storage for faster reads (avoid JOINs)

Caching Strategy Deep Dive

graph TD
    REQ["Request"]
    REQ --> APP["Application"]
    APP --> CHECK{"Cache
hit?"}
    CHECK -->|"Hit"| RETURN["Return cached data
(< 1ms)"]
    CHECK -->|"Miss"| DB["Query Database
(10-100ms)"]
    DB --> UPDATE["Update Cache"]
    UPDATE --> RETURN2["Return data"]

    style CHECK fill:#ffce67,stroke:#333
    style RETURN fill:#56cc9d,stroke:#333,color:#fff

Caching Pattern	How It Works	Best For
Cache-aside (lazy loading)	App checks cache → if miss, query DB → write to cache	General purpose, read-heavy
Write-through	Write to cache and DB synchronously as part of the same operation	When reads immediately follow writes
Write-behind (write-back)	Write to cache first, async write to DB	Write-heavy, can tolerate brief inconsistency
Read-through	Cache itself fetches from DB on miss	Simpler app code, cache acts as primary interface

Cache Invalidation Strategies

Strategy	Description	Trade-off
TTL (Time-To-Live)	Cache expires after N seconds	Simple but may serve stale data
Event-driven invalidation	Invalidate on write/update event	Fresh data but more complex
Version-based	Key includes version number, bump on update	No stale data, slight overhead
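
Cache-aside combined with TTL expiry and explicit invalidation can be sketched with an in-process dictionary. This is a minimal sketch, assuming a single process; in practice the cache would usually be an external store such as Redis. `CacheAside`, `load_from_db`, and the injectable `clock` are illustrative names.

```python
import time
from typing import Any, Callable, Dict, Tuple


class CacheAside:
    def __init__(
        self,
        load_from_db: Callable[[str], Any],
        ttl_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._load = load_from_db
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, Any]] = {}  # key -> (expires_at, value)

    def get(self, key: str) -> Any:
        entry = self._entries.get(key)
        if entry is not None and entry[0] > self._clock():
            return entry[1]  # cache hit: entry exists and has not expired
        value = self._load(key)  # miss or expired: fall back to the database
        self._entries[key] = (self._clock() + self._ttl, value)
        return value

    def invalidate(self, key: str) -> None:
        """Event-driven invalidation: call this when the underlying row changes."""
        self._entries.pop(key, None)
```

The TTL bounds how long stale data can be served, while `invalidate` removes an entry immediately after a write at the cost of wiring write paths to the cache.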

Q4: How Do Distributed Systems Work and What Are the Key Challenges?

Answer:

A distributed system is a collection of independent computers that appears to its users as a single coherent system. Such systems become necessary when a single machine cannot meet the load, data volume, or availability requirements.

graph TD
    subgraph Challenges["8 Fallacies of Distributed Computing (stated as corrections)"]
        F1["1. The network is NOT reliable"]
        F2["2. Latency is NOT zero"]
        F3["3. Bandwidth is NOT infinite"]
        F4["4. The network is NOT secure"]
        F5["5. Topology DOES change"]
        F6["6. There is NOT one administrator"]
        F7["7. Transport cost is NOT zero"]
        F8["8. The network is NOT homogeneous"]
    end

    style Challenges fill:#ff7851,stroke:#333,color:#fff

CAP Theorem

Every distributed data store can provide at most two of three guarantees simultaneously:

Property	Definition	Example
Consistency	Every read receives the most recent write	All replicas return the same value
Availability	Every request to a non-failing node receives a non-error response, though it may not reflect the latest write	System never refuses a request
Partition Tolerance	System operates despite network failures between nodes	Nodes can’t communicate but keep serving

Since network partitions WILL happen, you must choose:

  CP (Consistency + Partition Tolerance):
    → During partition: reject requests rather than return stale data
    → Examples: MongoDB, HBase, ZooKeeper (Redis Cluster is often listed here too, but its asynchronous replication does not guarantee strong consistency)
    → Use for: Banking, inventory, leader election

  AP (Availability + Partition Tolerance):
    → During partition: serve requests even if data might be stale
    → Examples: Cassandra, DynamoDB, CouchDB
    → Use for: Social media feeds, shopping carts, analytics

Consensus Algorithms

Algorithm	Purpose	How It Works	Used In
Paxos	Agreement among nodes	Proposer → Acceptors → Learners; majority quorum	Google Chubby
Raft	Leader-based consensus	Elect leader → leader replicates log entries; easier to understand	etcd, CockroachDB
Gossip Protocol	Information dissemination	Nodes periodically exchange state with random peers	Cassandra, DynamoDB

Consistency Models

Model	Guarantee	Latency	Use Case
Strong consistency	Read always returns latest write	High (synchronous replication)	Banking transactions
Eventual consistency	Reads converge to latest write over time	Low (async replication)	Social media likes/counts
Causal consistency	Preserves cause-and-effect ordering	Medium	Comment threads, chat
Read-your-writes	User sees their own writes immediately	Medium	User profile updates

Q5: What Infrastructure Components Make Up a Production System?

Answer:

A production system is composed of multiple infrastructure layers working together. Understanding each component’s role and how they interact is essential for system design.

graph TD
    USERS["Users"]
    USERS --> DNS["DNS
(Route53, Cloudflare)"]
    DNS --> CDN["CDN
(CloudFront, Akamai)"]
    CDN --> LB["Load Balancer
(ALB, Nginx)"]
    LB --> API["API Gateway"]
    API --> SVC1["Service A"]
    API --> SVC2["Service B"]
    API --> SVC3["Service C"]
    SVC1 --> CACHE["Cache
(Redis / Memcached)"]
    SVC1 --> DB["Database
(PostgreSQL / MySQL)"]
    SVC2 --> MQ["Message Queue
(Kafka / RabbitMQ)"]
    MQ --> WORKER["Background Workers"]
    WORKER --> STORE["Object Storage
(S3)"]
    SVC3 --> SEARCH["Search Engine
(Elasticsearch)"]

    subgraph Observability
        LOG["Logging
(ELK Stack)"]
        METRIC["Metrics
(Prometheus / Grafana)"]
        TRACE["Tracing
(Jaeger / Zipkin)"]
    end

    style LB fill:#56cc9d,stroke:#333,color:#fff
    style CACHE fill:#ffce67,stroke:#333
    style DB fill:#6cc3d5,stroke:#333,color:#fff

Infrastructure Components Reference

Component	Purpose	Examples	When to Use
DNS	Domain → IP resolution, geographic routing	Route53, Cloudflare DNS	Always — entry point for all traffic
CDN	Cache static content at edge locations globally	CloudFront, Akamai, Fastly	Static assets, global user base
Load Balancer	Distribute traffic across servers	ALB/NLB (AWS), Nginx, HAProxy	Multiple app servers
API Gateway	Routing, auth, rate limiting, protocol translation	Kong, AWS API Gateway, Envoy	Microservices architecture
Cache	In-memory store for frequently accessed data	Redis, Memcached	Read-heavy workloads
Message Queue	Async communication, decouple producers/consumers	Kafka, RabbitMQ, SQS	Background processing, event-driven
Object Storage	Store blobs (images, videos, backups)	S3, GCS, Azure Blob	Media files, backups, data lake
Search Engine	Full-text search, analytics	Elasticsearch, OpenSearch	Product search, log analysis
Container Orchestration	Deploy, scale, manage containerized services	Kubernetes, ECS	Microservices deployment

Monolith vs Microservices

Aspect	Monolith	Microservices
Deployment	Single deployable unit	Independent services, independent deployments
Scaling	Scale everything together	Scale each service independently
Complexity	Simple to develop and deploy initially	Complex: service discovery, distributed tracing
Data	Single shared database	Database per service (data isolation)
Team	Single team, tight coupling	Small teams own individual services
Failure	One bug can crash entire system	Failure isolated to one service (with proper design)
Best for	Small teams, early-stage products	Large teams, complex domains, different scaling needs

When to Move from Monolith to Microservices

Start with a monolith. Split when:
  1. Team size > 10-15 engineers (coordination overhead)
  2. Different components have vastly different scaling needs
  3. Deployment of one feature blocks another team
  4. Different services need different tech stacks
  5. You need independent failure isolation

Do NOT split prematurely — microservices add operational complexity:
  - Service discovery
  - Distributed transactions (Saga pattern)
  - Network latency between services
  - Distributed debugging and tracing
  - Data consistency across service boundaries

Q6: How Do You Design RESTful APIs and Choose Between API Styles?

Answer:

APIs are the contracts between system components. Choosing the right API style and designing clean, consistent interfaces is a core system design skill.

API Style Comparison

Style	Protocol	Format	Best For
REST	HTTP	JSON	CRUD web services, public APIs
GraphQL	HTTP	JSON	Complex queries, frontend-driven data needs
gRPC	HTTP/2	Protobuf (binary)	Low-latency microservice communication
WebSocket	TCP (upgraded HTTP)	Any	Real-time bidirectional (chat, gaming)
Webhook	HTTP (push)	JSON	Event notifications (payment processed, build complete)

REST API Design Principles

Good REST API design:

  Resources (nouns, not verbs):
    ✅ GET    /api/v1/users              → List users
    ✅ GET    /api/v1/users/123          → Get user 123
    ✅ POST   /api/v1/users              → Create user
    ✅ PUT    /api/v1/users/123          → Update user 123
    ✅ DELETE /api/v1/users/123          → Delete user 123
    ❌ GET    /api/v1/getUser?id=123     → Verb in URL (bad)

  Nested resources:
    GET  /api/v1/users/123/orders       → Orders for user 123
    GET  /api/v1/users/123/orders/456   → Specific order

  Pagination:
    GET /api/v1/users?page=2&limit=20
    GET /api/v1/users?cursor=abc123&limit=20  (cursor-based, preferred)

  Filtering and sorting:
    GET /api/v1/users?role=admin&sort=-created_at

  Versioning:
    /api/v1/users  → URL path versioning (most common)
    Accept: application/vnd.api.v1+json  → Header versioning

HTTP Status Codes

Code	Meaning	When to Use
200	OK	Successful GET, PUT
201	Created	Successful POST (resource created)
204	No Content	Successful DELETE
400	Bad Request	Invalid input, validation error
401	Unauthorized	Missing or invalid authentication
403	Forbidden	Authenticated but insufficient permissions
404	Not Found	Resource doesn’t exist
409	Conflict	Duplicate resource, version conflict
429	Too Many Requests	Rate limit exceeded
500	Internal Server Error	Unhandled server error
503	Service Unavailable	Server overloaded or in maintenance

Idempotency

Method	Idempotent?	Safe?	Notes
GET	Yes	Yes	Retrieves data, no side effects
PUT	Yes	No	Same request produces same result
DELETE	Yes	No	Deleting same resource twice = same outcome
POST	No	No	Use idempotency keys (e.g., `Idempotency-Key: uuid`)
PATCH	No	No	Partial update — result depends on current state
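
The idempotency-key technique for POST can be sketched as a wrapper that remembers the response for each key. This is a simplified, single-process illustration: `IdempotentHandler` is a hypothetical name, and a real service would keep the key-to-response map in a shared store with an atomic check-and-set and an expiry.

```python
from typing import Any, Callable, Dict


class IdempotentHandler:
    def __init__(self, handler: Callable[[dict], Any]) -> None:
        self._handler = handler
        self._responses: Dict[str, Any] = {}  # Idempotency-Key -> stored response

    def handle(self, idempotency_key: str, payload: dict) -> Any:
        # A retried request with a known key returns the original response
        # without running the side effect again.
        if idempotency_key in self._responses:
            return self._responses[idempotency_key]
        response = self._handler(payload)
        self._responses[idempotency_key] = response
        return response
```

A client that times out on a payment request can then resend it with the same `Idempotency-Key` header without risking a double charge.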

Pagination: Offset vs Cursor

Approach	Pros	Cons
Offset (`?page=5&limit=20`)	Simple, can jump to any page	Slow on large datasets (OFFSET scans rows); inconsistent with inserts
Cursor (`?cursor=abc&limit=20`)	Consistent, fast (indexed seek); handles real-time inserts	Can’t jump to arbitrary page
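
The difference between the two approaches shows up clearly in SQL. The sketch below uses an in-memory SQLite table; the `users` schema and function names are assumptions for illustration, and the cursor is simply the last seen `id` rather than an opaque encoded token.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO users (id, name) VALUES (?, ?)",
    [(i, f"user{i}") for i in range(1, 101)],
)


def page_by_offset(conn: sqlite3.Connection, page: int, limit: int) -> list:
    """Offset pagination: the database must step over (page - 1) * limit rows."""
    offset = (page - 1) * limit
    return conn.execute(
        "SELECT id, name FROM users ORDER BY id LIMIT ? OFFSET ?", (limit, offset)
    ).fetchall()


def page_by_cursor(conn: sqlite3.Connection, cursor: int, limit: int) -> tuple:
    """Cursor (keyset) pagination: seek directly past the last seen id."""
    rows = conn.execute(
        "SELECT id, name FROM users WHERE id > ? ORDER BY id LIMIT ?", (cursor, limit)
    ).fetchall()
    next_cursor = rows[-1][0] if rows else None
    return rows, next_cursor
```

If a row on an earlier page is deleted between requests, the offset query shifts and silently skips a row, while the cursor query resumes exactly after the last row the client saw.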

Q7: How Do You Choose the Right Database?

Answer:

Database selection is one of the most impactful decisions in system design. The choice depends on data structure, access patterns, consistency requirements, and scale.

graph TD
    CHOOSE["Choose Your Database"]
    CHOOSE --> REL["Relational (SQL)"]
    CHOOSE --> DOC["Document Store"]
    CHOOSE --> KV["Key-Value Store"]
    CHOOSE --> COL["Wide-Column Store"]
    CHOOSE --> GRAPH["Graph Database"]
    CHOOSE --> TS["Time-Series DB"]
    CHOOSE --> SEARCH["Search Engine"]

    REL --> REL_EX["PostgreSQL, MySQL
ACID, complex queries, JOINs"]
    DOC --> DOC_EX["MongoDB, CouchDB
Flexible schema, nested data"]
    KV --> KV_EX["Redis, DynamoDB
Cache, session, simple lookups"]
    COL --> COL_EX["Cassandra, HBase
Write-heavy, time-series-like"]
    GRAPH --> GRAPH_EX["Neo4j, Neptune
Relationships, social networks"]
    TS --> TS_EX["InfluxDB, TimescaleDB
Metrics, IoT, monitoring"]
    SEARCH --> SEARCH_EX["Elasticsearch
Full-text search, analytics"]

    style CHOOSE fill:#56cc9d,stroke:#333,color:#fff
    style REL fill:#6cc3d5,stroke:#333,color:#fff
    style DOC fill:#ffce67,stroke:#333

Database Selection Guide

Requirement	Best Choice	Why
Complex relationships, ACID transactions	PostgreSQL / MySQL	Strong consistency, JOINs, mature tooling
Flexible schema, nested documents	MongoDB	Schema-less, easy horizontal scaling
Ultra-fast key-value lookups, caching	Redis	In-memory, sub-millisecond latency
Massive write throughput, append-only	Cassandra	Distributed, tunable consistency, linear scaling
Social graph, recommendations	Neo4j	Optimized for traversing relationships
Full-text search, log analytics	Elasticsearch	Inverted index, near real-time search
Time-series data (metrics, IoT)	TimescaleDB / InfluxDB	Optimized for time-bucketed queries
Globally distributed, strong consistency	CockroachDB / Spanner	Distributed SQL, serializable isolation

SQL vs NoSQL Trade-offs

Aspect	SQL (Relational)	NoSQL
Schema	Fixed schema, migrations required	Flexible / schema-less
Consistency	ACID (strong by default)	BASE (eventual, tunable)
Scaling	Vertical (primarily), read replicas	Horizontal (built-in sharding)
Queries	Complex JOINs, aggregations, SQL	Simple lookups, limited JOINs
Transactions	Multi-table transactions native	Limited (single-partition or Saga pattern)
Best for	Financial, e-commerce, complex relationships	High-scale, simple access patterns, flexible data

Database Scaling Strategies

graph TD
    DBSCALE["Database Scaling"]
    DBSCALE --> RR["Read Replicas
(scale reads)"]
    DBSCALE --> SHARD["Sharding
(scale writes + storage)"]
    DBSCALE --> PART["Partitioning
(split tables)"]
    DBSCALE --> POOL["Connection Pooling
(scale connections)"]

    RR --> RR_D["Primary handles writes
Replicas handle reads
Async replication"]
    SHARD --> SHARD_D["Split data by key
(user_id % N shards)
Each shard is a full DB"]

    style DBSCALE fill:#56cc9d,stroke:#333,color:#fff

Sharding Strategies

Strategy	How It Works	Pros	Cons
Hash-based	`shard = hash(key) % N`	Even distribution	Adding shards requires reshuffling
Range-based	`shard 1: A-M, shard 2: N-Z`	Range queries efficient	Hotspots if data is skewed
Directory-based	Lookup table maps key → shard	Flexible, no reshuffling	Lookup table is single point of failure
Consistent hashing	Hash ring, minimal key movement on changes	Add/remove nodes easily	Slightly uneven with few nodes (use virtual nodes)
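
The trade-off between hash-based sharding and consistent hashing can be demonstrated with a short sketch. The `ConsistentHashRing` class, the node names, and the choice of SHA-256 as the hash are all illustrative assumptions, not a reference implementation.

```python
import bisect
import hashlib
from typing import Iterable, List, Tuple


def _hash(value: str) -> int:
    """Stable 64-bit hash (Python's built-in hash() varies between runs)."""
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


def modulo_shard(key: str, num_shards: int) -> int:
    """Hash-based sharding: shard = hash(key) % N."""
    return _hash(key) % num_shards


class ConsistentHashRing:
    def __init__(self, nodes: Iterable[str], vnodes: int = 100) -> None:
        self._vnodes = vnodes
        self._ring: List[Tuple[int, str]] = []  # sorted (position, node) pairs
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        # Each physical node is placed at many positions (virtual nodes)
        # to even out the distribution.
        for i in range(self._vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def remove_node(self, node: str) -> None:
        self._ring = [(pos, n) for pos, n in self._ring if n != node]

    def get_node(self, key: str) -> str:
        if not self._ring:
            raise LookupError("ring is empty")
        # First ring position at or after the key's hash, wrapping around.
        idx = bisect.bisect_right(self._ring, (_hash(key), ""))
        return self._ring[idx % len(self._ring)][1]
```

Going from 3 to 4 shards with `modulo_shard` reassigns most keys, whereas adding a fourth node to the ring only moves the keys that now belong to the new node.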

Q8: How Does Networking Work in Distributed Systems?

Answer:

Understanding networking fundamentals is essential for system design — from how a request reaches your server to how services communicate internally.

How a Web Request Works (End-to-End)

graph LR
    BROWSER["Browser"]
    BROWSER -->|"1. DNS Lookup"| DNS["DNS Server
→ IP address"]
    DNS -->|"2. TCP Handshake"| LB["Load Balancer"]
    LB -->|"3. TLS Handshake
(HTTPS)"| APP["App Server"]
    APP -->|"4. Process Request"| DB["Database"]
    DB -->|"5. Response"| APP
    APP -->|"6. HTTP Response"| BROWSER

    style BROWSER fill:#6cc3d5,stroke:#333,color:#fff
    style LB fill:#56cc9d,stroke:#333,color:#fff
    style APP fill:#ffce67,stroke:#333

Step-by-step breakdown:
  1. DNS resolution: example.com → 93.184.216.34  (~50ms first time, cached after)
  2. TCP handshake: SYN → SYN-ACK → ACK           (~1 RTT = 0.5ms same DC, 150ms cross-region)
  3. TLS handshake: Certificate exchange, key setup (~1-2 RTT additional for HTTPS)
  4. HTTP request: GET /api/users                   (headers + body)
  5. Server processes, queries DB, builds response
  6. HTTP response: 200 OK + JSON payload
  7. Browser renders response

Communication Protocols

Protocol	Layer	Use Case	Key Property
TCP	Transport	Most web traffic, databases	Reliable, ordered delivery
UDP	Transport	Video streaming, gaming, DNS	Fast, no handshake, unreliable
HTTP/1.1	Application	Traditional web APIs	Text-based, persistent connections, one in-flight request per connection
HTTP/2	Application	Modern web APIs	Multiplexing, header compression, binary
HTTP/3 (QUIC)	Application	Next-gen web	UDP-based, faster handshake, 0-RTT on resumed connections
WebSocket	Application	Real-time communication	Full-duplex, persistent connection
gRPC	Application	Microservice calls	HTTP/2 + Protobuf, streaming support

Real-Time Communication Patterns

Pattern	How It Works	Latency	Server Load	Best For
Short polling	Client sends HTTP request every N seconds	High (N sec delay)	High (many requests)	Simple status checks
Long polling	Client sends request, server holds until data available	Medium	Medium	Notifications, chat fallback
Server-Sent Events (SSE)	Server pushes events over single HTTP connection	Low	Low	Live feeds, dashboards
WebSocket	Full-duplex persistent TCP connection	Very low	Low	Chat, gaming, real-time collaboration

DNS and Load Balancing at Network Level

Level	Technology	Purpose
DNS-level	Route53, Cloudflare	Geographic routing, failover between data centers
L4 (Transport)	NLB, HAProxy (TCP mode)	Route based on IP/port, very fast, no content inspection
L7 (Application)	ALB, Nginx, Envoy	Route based on URL path, headers, content; SSL termination

Q9: How Do You Design for Security in Distributed Systems?

Answer:

Security must be designed into every layer of a system — from network perimeter to data at rest. In system design interviews, demonstrating security awareness distinguishes senior candidates.

graph TD
    subgraph Perimeter["Perimeter Security"]
        FW["Firewall / WAF"]
        DDOS["DDoS Protection
(Cloudflare, Shield)"]
    end

    subgraph Network["Network Security"]
        TLS["TLS / HTTPS
(encryption in transit)"]
        VPC["VPC / Private Subnets"]
        SG["Security Groups"]
    end

    subgraph Application["Application Security"]
        AUTH["Authentication
(OAuth 2.0, JWT)"]
        AUTHZ["Authorization
(RBAC, ABAC)"]
        VALID["Input Validation
(prevent injection)"]
        RL["Rate Limiting"]
    end

    subgraph Data["Data Security"]
        ENC["Encryption at Rest
(AES-256)"]
        HASH["Password Hashing
(bcrypt, argon2)"]
        MASK["Data Masking
(PII protection)"]
    end

    Perimeter --> Network --> Application --> Data

    style Perimeter fill:#ff7851,stroke:#333,color:#fff
    style Network fill:#ffce67,stroke:#333
    style Application fill:#56cc9d,stroke:#333,color:#fff
    style Data fill:#6cc3d5,stroke:#333,color:#fff

Authentication vs Authorization

Concept	Question	Mechanism	Example
Authentication (AuthN)	“Who are you?”	Username/password, OAuth, SSO, MFA	Login with Google
Authorization (AuthZ)	“What can you do?”	RBAC, ABAC, ACL, policy engines	Admin can delete users, viewer cannot

Token-Based Authentication Flow (OAuth 2.0 + JWT)

1. User logs in → Auth Server validates credentials
2. Auth Server issues:
   - Access token (JWT, short-lived: 15-60 min)
   - Refresh token (opaque, long-lived: 7-30 days)
3. Client sends Access token in header: Authorization: Bearer <access_token>
4. API Gateway / Service validates JWT:
   - Verify signature (no DB call needed)
   - Check expiration
   - Extract user ID, roles from claims
5. Token expired → client uses Refresh token to get new Access token
6. Refresh token expired → user must log in again

JWT Structure

Header.Payload.Signature

Header:  {"alg": "RS256", "typ": "JWT"}
Payload: {"sub": "user123", "role": "admin", "exp": 1716300000, "iat": 1716296400}
Signature: RSASHA256(base64url(header) + "." + base64url(payload), private_key)

Key design decisions:
  - Use RS256 (asymmetric) for microservices (public key verification, no shared secret)
  - Keep payload small (don't put entire user profile)
  - Set short expiration (15 min) + use refresh tokens
  - Never store sensitive data in JWT (it's base64url-encoded, not encrypted)

Common Security Threats and Mitigations

Threat	Description	Mitigation
SQL injection	Malicious SQL in user input	Parameterized queries, ORM
XSS	Injecting scripts into web pages	Output encoding, input sanitization, CSP headers
CSRF	Forged requests from authenticated browser	CSRF tokens, SameSite cookies
DDoS	Overwhelming system with traffic	Rate limiting, WAF, CDN, auto-scaling
Man-in-the-middle	Intercepting network traffic	TLS everywhere, certificate pinning
Broken authentication	Weak passwords, no MFA	bcrypt/argon2 hashing, MFA, account lockout
Data breach	Unauthorized data access	Encryption at rest, principle of least privilege
API abuse	Scraping, brute force	Rate limiting, API keys, OAuth scopes

Security Checklist for System Design

✅ HTTPS/TLS for all communication (internal and external)
✅ Authentication at the API gateway layer
✅ Authorization checks at the service level
✅ Input validation and sanitization at system boundaries
✅ Rate limiting per client/IP/API key
✅ Encryption at rest for sensitive data (AES-256)
✅ Password hashing with bcrypt or argon2 (never plain text or MD5)
✅ Secrets in vault (HashiCorp Vault, AWS Secrets Manager) — not in code
✅ Audit logging for security-relevant events
✅ Principle of least privilege for service accounts
✅ Network segmentation (private subnets for DBs, no public access)

Q10: What Is Back-of-the-Envelope Estimation and How Do You Do It?

Answer:

Back-of-the-envelope estimation is a quick calculation technique to estimate system capacity and requirements. Interviewers use it to test whether you can reason about scale and make informed design decisions.

Power of 2 Reference

Power	Exact Value	Approximate	Name
2^10	1,024	~1 Thousand	1 KB
2^20	1,048,576	~1 Million	1 MB
2^30	1,073,741,824	~1 Billion	1 GB
2^40	~1.1 × 10^12	~1 Trillion	1 TB
2^50	~1.1 × 10^15	~1 Quadrillion	1 PB

Common Data Sizes

Data Type	Typical Size
Character (ASCII)	1 byte
Character (UTF-8)	1-4 bytes
Integer	4-8 bytes
UUID	16 bytes
Timestamp	8 bytes
Short string (name)	~50 bytes
URL	~100 bytes
Tweet / SMS	~200 bytes
JSON API response	~1-10 KB
Compressed image thumbnail	~10-50 KB
Photo (high quality)	~2-5 MB
Short video (1 min)	~50-100 MB
Database row (typical)	~500 bytes - 2 KB

QPS (Queries Per Second) Estimation

Formula: QPS = DAU × queries_per_user / seconds_per_day

Example: Twitter
  - 500M DAU
  - Each user views feed 5 times/day, each feed = 10 API calls
  - Total queries/day = 500M × 50 = 25B
  - QPS = 25B / 86,400 ≈ 290,000 QPS
  - Peak QPS ≈ 2 × average ≈ 580,000 QPS

Quick shortcut:
  - Seconds in a day ≈ 100,000 (actual: 86,400)
  - 1M requests/day ≈ 10 QPS
  - 100M requests/day ≈ 1,000 QPS
  - 1B requests/day ≈ 10,000 QPS
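
The QPS formula is simple enough to keep as a helper when practicing. A minimal sketch, where the function names and the default peak factor of 2 mirror the example above and are otherwise assumptions:

```python
SECONDS_PER_DAY = 86_400


def average_qps(dau: int, queries_per_user_per_day: float) -> float:
    """QPS = DAU x queries_per_user / seconds_per_day."""
    return dau * queries_per_user_per_day / SECONDS_PER_DAY


def peak_qps(avg_qps: float, peak_factor: float = 2.0) -> float:
    """Peak traffic estimated as a multiple of the average."""
    return avg_qps * peak_factor


# Twitter-style example: 500M DAU, 5 feed views x 10 API calls each per day.
avg = average_qps(500_000_000, 5 * 10)   # roughly 290,000 QPS
peak = peak_qps(avg)                     # roughly 580,000 QPS
```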

Storage Estimation

Formula: Storage = records_per_day × record_size × retention_period

Example: Chat application
  - 500M DAU, 100 messages/user/day
  - Message size: ~100 bytes (text) + ~100 bytes (metadata) = 200 bytes
  - Daily: 500M × 100 × 200 bytes = 10TB/day
  - Yearly: 10TB × 365 = 3.65 PB/year
  - 5 years with replication (3x): ~55 PB total
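
The storage formula can be checked the same way. A small sketch using decimal units (1 TB = 10^12 bytes); the helper name and the `replication_factor` parameter are illustrative:

```python
TB = 10**12
PB = 10**15


def storage_bytes(
    records_per_day: int,
    record_size_bytes: int,
    retention_days: int,
    replication_factor: int = 1,
) -> int:
    """Storage = records_per_day x record_size x retention, times replicas."""
    return records_per_day * record_size_bytes * retention_days * replication_factor


messages_per_day = 500_000_000 * 100                           # 500M DAU x 100 messages
daily = storage_bytes(messages_per_day, 200, 1)                 # 10 TB per day
five_years = storage_bytes(messages_per_day, 200, 5 * 365, 3)   # about 55 PB
```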

Bandwidth Estimation

Formula: Bandwidth = QPS × avg_response_size

Example: Image serving
  - 100K QPS, average image = 200KB
  - Bandwidth = 100,000 × 200KB = 20GB/s = 160 Gbps
  - With CDN absorbing 90%: origin bandwidth ≈ 16 Gbps
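
The bandwidth calculation, including the byte-to-bit conversion that is easy to forget, can be sketched as follows. The function name and the `cdn_hit_ratio` parameter are assumptions for illustration, and 200KB is taken as 200,000 bytes.

```python
def bandwidth_gbps(qps: float, avg_response_bytes: float, cdn_hit_ratio: float = 0.0) -> float:
    """Origin bandwidth in gigabits per second after the CDN absorbs its share."""
    bytes_per_second = qps * avg_response_bytes * (1 - cdn_hit_ratio)
    return bytes_per_second * 8 / 1e9  # 8 bits per byte, 10^9 bits per gigabit


total = bandwidth_gbps(100_000, 200_000)                       # 160 Gbps
origin = bandwidth_gbps(100_000, 200_000, cdn_hit_ratio=0.9)   # about 16 Gbps
```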

Server Estimation

Rule of thumb:
  - 1 web server handles ~1,000-10,000 QPS (depends on complexity)
  - 1 DB server handles ~1,000-5,000 QPS (depends on query complexity)
  - 1 cache server (Redis): ~100,000-500,000 QPS

Example: 500K QPS API
  - App servers: 500K / 5,000 = 100 servers (with headroom: 150)
  - DB (with read replicas): 1 primary + 10 read replicas (assuming the cache absorbs ~90% of reads: 50K QPS / 5,000 per replica)
  - Cache: 500K / 200K = 2.5 → 3 Redis nodes (with replication: 6)
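
Server counts come from dividing load by per-server capacity and rounding up. A minimal sketch, where `servers_needed` and the `headroom` multiplier are illustrative names:

```python
import math


def servers_needed(total_qps: float, qps_per_server: float, headroom: float = 1.0) -> int:
    """Servers required for a load, rounded up, with an optional safety multiplier."""
    return math.ceil(total_qps / qps_per_server * headroom)


app_servers = servers_needed(500_000, 5_000)                      # 100
app_with_headroom = servers_needed(500_000, 5_000, headroom=1.5)  # 150
cache_nodes = servers_needed(500_000, 200_000)                    # 2.5 rounds up to 3
```

Rounding up matters for small counts: 2.5 cache nodes means 3, and replication then doubles that to 6.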

Summary Table

#	Topic	Key Concepts
1	Scalability	Vertical vs horizontal scaling, stateless design, load balancing, caching, sharding
2	Reliability	Replication, failover, circuit breaker, bulkhead, graceful degradation, 99.99% availability
3	Performance	Caching strategies, latency numbers, CDN, indexing, read replicas, denormalization
4	Distributed Systems	CAP theorem, consistency models, consensus (Raft/Paxos), gossip protocol
5	Infrastructure	DNS → CDN → LB → API Gateway → Services → DB; monolith vs microservices
6	APIs	REST vs GraphQL vs gRPC, HTTP status codes, pagination, idempotency, versioning
7	Databases	SQL vs NoSQL, sharding strategies, read replicas, choosing the right DB
8	Networking	TCP/UDP, HTTP/2/3, WebSocket, SSE, DNS, L4 vs L7 load balancing
9	Security	AuthN/AuthZ, JWT/OAuth, TLS, encryption at rest, OWASP threats, zero trust
10	Estimation	QPS, storage, bandwidth, server count, powers of 2, latency numbers

What’s Next?

This article covered foundational system design concepts. Continue with:

Infrastructure deep dives: System Design Interview QA - 2 — load balancing, caching, message queues, Kubernetes, CI/CD, monitoring
Hands-on design problems: System Design Interview QA - 3 — URL shortener, chat system, news feed, video streaming, and more
Design patterns: Design Pattern Interview QA - 1
Enterprise patterns (Spring, CQRS): Design Pattern Interview QA - 2

System Design Interview QA - 2

Vectoring AI — Thu, 21 May 2026 00:00:00 GMT

Introduction

This is Part 2 of our System Design Interview QA series, focusing on infrastructure components and operational systems that power production-grade architectures. While Part 1 covered foundational concepts (scalability, CAP theorem, etc.), this article dives deep into how specific infrastructure components work and how to design them.

For foundational concepts (scalability, CAP theorem, APIs), see System Design Interview QA - 1. For hands-on design problems (URL shortener, chat system), see System Design Interview QA - 3.

Q1: How Does Load Balancing Work and How Do You Design a Load Balancer?

Answer:

A load balancer distributes incoming network traffic across multiple backend servers to ensure no single server is overwhelmed, improving availability, throughput, and fault tolerance.

graph TD
    CLIENTS["Clients"]
    CLIENTS --> DNS["DNS (Round Robin)
→ multiple LB IPs"]
    DNS --> LB_A["Load Balancer (Active)"]
    DNS --> LB_S["Load Balancer (Standby)
heartbeat monitoring"]

    LB_A --> S1["Server 1 ✅"]
    LB_A --> S2["Server 2 ✅"]
    LB_A --> S3["Server 3 ❌ (unhealthy)"]
    LB_A --> S4["Server 4 ✅"]

    LB_A -.->|"Health check fails"| S3

    style LB_A fill:#56cc9d,stroke:#333,color:#fff
    style LB_S fill:#ffce67,stroke:#333
    style S3 fill:#ff7851,stroke:#333,color:#fff

Layer 4 vs Layer 7 Load Balancing

Aspect	Layer 4 (Transport)	Layer 7 (Application)
Operates on	TCP/UDP packets (IP + port)	HTTP headers, URL path, cookies
Speed	Very fast (no content inspection)	Slower (must parse HTTP)
Routing decisions	IP hash, round robin, least connections	URL path, headers, content type
SSL termination	Passes through (or terminates)	Terminates SSL, inspects content
Use case	TCP services, databases, gaming	Web APIs, microservice routing
Examples	AWS NLB, HAProxy (TCP mode)	AWS ALB, Nginx, Envoy

Load Balancing Algorithms Deep Dive

graph TD
    subgraph RR["Round Robin"]
        RR1["Request 1 → Server A"]
        RR2["Request 2 → Server B"]
        RR3["Request 3 → Server C"]
        RR4["Request 4 → Server A"]
    end

    subgraph LC["Least Connections"]
        LC1["Server A: 5 active"]
        LC2["Server B: 2 active ← next request"]
        LC3["Server C: 8 active"]
    end

    subgraph WRR["Weighted Round Robin"]
        WRR1["Server A (weight 5): gets 5 of every 8"]
        WRR2["Server B (weight 2): gets 2 of every 8"]
        WRR3["Server C (weight 1): gets 1 of every 8"]
    end

    style RR fill:#56cc9d,stroke:#333,color:#fff
    style LC fill:#ffce67,stroke:#333
    style WRR fill:#6cc3d5,stroke:#333,color:#fff

Algorithm	Best For	Weakness
Round Robin	Equal-capacity servers, stateless services	Ignores server load
Weighted Round Robin	Mixed hardware capacities	Static weights, doesn’t adapt
Least Connections	Long-lived connections (WebSocket, DB)	May route to slow servers
Least Response Time	Latency-sensitive services	Requires constant measurement
IP Hash	Session affinity without sticky cookies	Uneven with few clients
Consistent Hashing	Cache distribution (e.g., Memcached client libraries); note that Redis Cluster uses fixed hash slots instead	Complex implementation
Random	Large server pools, simplicity	Variance with few servers
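
As a rough illustration of the first three rows, the sketch below implements round robin, weighted round robin, and least connections over in-memory data. The server names, weights, and connection counts mirror the diagram above; the function names are hypothetical, and the weighted variant uses naive list expansion, which sends requests in bursts rather than interleaving them smoothly.

```python
import itertools
from collections import Counter


def round_robin(servers):
    """Yield servers in a fixed repeating order."""
    return itertools.cycle(servers)


def weighted_round_robin(weights):
    """Yield servers in proportion to integer weights (naive expansion)."""
    expanded = [name for name, weight in weights.items() for _ in range(weight)]
    return itertools.cycle(expanded)


def least_connections(active):
    """Pick the server that currently has the fewest active connections."""
    return min(active, key=active.get)


rr = round_robin(["A", "B", "C"])
first_four = [next(rr) for _ in range(4)]          # A, B, C, then back to A

wrr = weighted_round_robin({"A": 5, "B": 2, "C": 1})
share = Counter(next(wrr) for _ in range(8))       # one full cycle of 8 requests

target = least_connections({"A": 5, "B": 2, "C": 8})
```

Over one full cycle of 8 requests, the weighted variant hands out exactly 5, 2, and 1 requests, matching the "5 of every 8" description in the diagram.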

Session Persistence (Sticky Sessions)

Problem: User state (shopping cart, login session) lives on one server.
         If next request goes to different server → state lost.

Solutions (from worst to best):
  1. Sticky sessions (cookie/IP-based routing to same server)
     - Simple but defeats load balancing purpose
     - Server failure = lost sessions

  2. Session replication (broadcast sessions to all servers)
     - Network overhead grows O(n²)
     - Memory wasted on every server

  3. Centralized session store (Redis/Memcached) ← RECOMMENDED
     - Any server can handle any request
     - Session stored in Redis with TTL
     - Server failure has zero impact on sessions
     - Scales independently

Health Check Design

Type	Mechanism	Interval	Use Case
TCP check	Can connect to port?	5-10s	Basic availability
HTTP check	`GET /health` returns 200?	5-10s	Application-level health
Deep health check	Checks DB connectivity, disk space, dependencies	30s	Comprehensive readiness

Health check state machine:
  HEALTHY → 3 consecutive failures → UNHEALTHY (remove from pool)
  UNHEALTHY → 2 consecutive successes → HEALTHY (add back to pool)
  
  Drain mode: stop sending new requests, wait for active to complete
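
The state machine above can be sketched in a few lines. This is a minimal illustration, assuming the thresholds shown (3 failures to mark unhealthy, 2 successes to recover); the class and method names are hypothetical and drain mode is not modeled.

```python
class HealthTracker:
    """Tracks one backend's health from a stream of check results."""

    def __init__(self, fail_threshold=3, rise_threshold=2):
        self.fail_threshold = fail_threshold
        self.rise_threshold = rise_threshold
        self.healthy = True
        self._streak = 0  # consecutive results that contradict the current state

    def record(self, check_passed):
        """Record one check result and return the resulting health state."""
        if check_passed == self.healthy:
            self._streak = 0  # result agrees with current state: reset the streak
            return self.healthy
        self._streak += 1
        threshold = self.fail_threshold if self.healthy else self.rise_threshold
        if self._streak >= threshold:
            self.healthy = not self.healthy  # flip state, start counting afresh
            self._streak = 0
        return self.healthy


tracker = HealthTracker()
results = [False, False, True, False, False, False, True, True]
states = [tracker.record(r) for r in results]
```

Note that the single success after two failures resets the streak, so the backend is only removed after three failures in a row.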

Q2: How Do You Design a Caching System and What Caching Strategies Exist?

Answer:

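Caching stores frequently accessed data in fast storage (memory) to reduce latency and database load. For requests served from cache, a lookup that would cost tens of milliseconds against the database can drop to under a millisecond; tail latency (P99) still depends on the hit rate, because every miss pays the full database cost.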

graph TD
    subgraph Layers["Multi-Layer Caching"]
        CLIENT["Browser Cache
(HTTP cache headers)"]
        CDN["CDN Cache
(static assets, edge)"]
        APP["Application Cache
(Redis / Memcached)"]
        DB_CACHE["Database Cache
(query cache, buffer pool)"]
    end

    CLIENT --> CDN --> APP --> DB_CACHE --> DB["Database"]

    style CLIENT fill:#56cc9d,stroke:#333,color:#fff
    style CDN fill:#ffce67,stroke:#333
    style APP fill:#6cc3d5,stroke:#333,color:#fff

Caching Patterns (Implementation Detail)

graph LR
    subgraph CacheAside["Cache-Aside (Lazy Loading)"]
        A1["App checks cache"]
        A1 -->|"miss"| A2["App queries DB"]
        A2 --> A3["App writes to cache"]
    end

    subgraph WriteThrough["Write-Through"]
        B1["App writes to cache"]
        B1 --> B2["Cache writes to DB"]
        B2 --> B3["Confirm to app"]
    end

    subgraph WriteBehind["Write-Behind (Write-Back)"]
        C1["App writes to cache"]
        C1 --> C2["Return immediately"]
        C2 -.->|"async"| C3["Cache writes to DB later"]
    end

    style CacheAside fill:#56cc9d,stroke:#333,color:#fff
    style WriteThrough fill:#ffce67,stroke:#333
    style WriteBehind fill:#6cc3d5,stroke:#333,color:#fff

Pattern	How It Works	Pros	Cons	Best For
Cache-aside	App manages cache manually; check cache → miss → query DB → populate cache	Only caches hot data; cache failure non-fatal	Initial requests are slow (cold cache); possible stale data	General purpose, read-heavy
Write-through	Every write goes to cache AND DB synchronously	Cache always has latest data	Write latency increases (2 writes); caches data that may never be read	Read-after-write consistency
Write-behind	Write to cache, async flush to DB	Very fast writes; batch DB writes	Data loss risk if cache crashes before flush	Write-heavy workloads
Read-through	Cache fetches from DB on miss (cache is the data interface)	Simpler app code	Cache library must support it	When using cache frameworks
Refresh-ahead	Proactively refresh cache before TTL expires	No cache miss latency	Wastes resources on rarely accessed keys	Predictable access patterns
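
To make the first two patterns concrete, here is a minimal sketch of cache-aside reads and write-through writes. It assumes an in-memory `TTLCache` class standing in for Redis/Memcached and a plain dict standing in for the database; both are hypothetical stand-ins, and in this version the application (not the cache) performs the write to the database.

```python
import time


class TTLCache:
    """Tiny in-memory stand-in for Redis/Memcached with per-key expiry."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            del self._data[key]  # expired: treat as a miss
            return None
        return value

    def set(self, key, value, ttl):
        self._data[key] = (value, time.monotonic() + ttl)


db = {"user:1": {"name": "Ada"}}  # stand-in for the database
db_reads = 0                      # counts how often the "database" is queried
cache = TTLCache()


def read_cache_aside(key, ttl=300):
    """Check the cache first; on a miss, query the DB and populate the cache."""
    global db_reads
    value = cache.get(key)
    if value is not None:
        return value
    db_reads += 1
    value = db.get(key)
    if value is not None:
        cache.set(key, value, ttl)
    return value


def write_through(key, value, ttl=300):
    """Write to the DB and the cache together so reads see the latest value."""
    db[key] = value
    cache.set(key, value, ttl)


read_cache_aside("user:1")                    # miss: reads the DB, fills the cache
read_cache_aside("user:1")                    # hit: served from the cache
write_through("user:1", {"name": "Grace"})    # cache now holds the new value
```

The second read does not touch the database, and a read after `write_through` returns the new value without a miss.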

Cache Eviction Policies

Policy	Evicts	Best For
LRU (Least Recently Used)	Item not accessed longest	General purpose (most common)
LFU (Least Frequently Used)	Item accessed fewest times	Frequency-based workloads
FIFO (First In First Out)	Oldest inserted item	Simple, time-based freshness
TTL (Time-To-Live)	Items past expiration time	Data with known freshness window
Random	Random item	When access patterns are uniform
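
LRU is simple enough to sketch with the standard library. The version below is an illustrative implementation built on `collections.OrderedDict`; the class name and the `keys()` helper are hypothetical, and it is not thread-safe.

```python
from collections import OrderedDict


class LRUCache:
    """Fixed-capacity cache that evicts the least recently used key."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()  # ordered oldest -> most recently used

    def get(self, key, default=None):
        if key not in self._items:
            return default
        self._items.move_to_end(key)  # a read counts as a use
        return self._items[key]

    def put(self, key, value):
        if key in self._items:
            self._items.move_to_end(key)
        self._items[key] = value
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # drop the least recently used

    def keys(self):
        return list(self._items)


lru = LRUCache(capacity=2)
lru.put("a", 1)
lru.put("b", 2)
lru.get("a")        # "a" becomes the most recently used key
lru.put("c", 3)     # capacity exceeded: "b" is evicted, not "a"
```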

Redis vs Memcached

Feature	Redis	Memcached
Data structures	Strings, hashes, lists, sets, sorted sets, streams	Strings only
Persistence	RDB snapshots + AOF (append-only file)	None (pure cache)
Replication	Built-in master-replica	None (client-side)
Clustering	Redis Cluster (automatic sharding)	Client-side sharding
Pub/Sub	Yes	No
Lua scripting	Yes (atomic operations)	No
Memory efficiency	Moderate (overhead per key)	Slab allocator (efficient for uniform sizes)
Threads	Single-threaded command execution (6.0+ adds optional I/O threads)	Multi-threaded
Best for	Complex data, pub/sub, leaderboards, sessions	Simple high-throughput caching

Cache Stampede Prevention

Problem: Cache key expires → hundreds of requests simultaneously hit DB → DB overload

Solutions:
  1. Lock/mutex: Only one request fetches from DB, others wait
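     cache_key = "user:123"
     lock_key = f"lock:{cache_key}"
     data = redis.get(cache_key)
     if data is None:
         # SET NX EX: only one caller acquires the lock; it auto-expires after 5s
         if redis.set(lock_key, "1", nx=True, ex=5):
             try:
                 data = db.query(...)  # placeholder for the real query
                 redis.set(cache_key, data, ex=300)
             finally:
                 redis.delete(lock_key)  # release the lock even if the query fails
         else:
             # hypothetical helper: poll the cache until populated or a timeout elapses
             data = wait_for_cache(cache_key)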

  2. Probabilistic early recomputation:
     - Each read checks: should I refresh? (probability increases near TTL)
     - Spreads refresh across time window

  3. Background refresh (refresh-ahead):
     - Background job refreshes popular keys before expiry
     - No stampede possible
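
Solution 2 can be sketched as a single decision function. The formulation below follows one commonly cited approach (sometimes called XFetch), in which each reader adds a random, exponentially distributed head start scaled by how long recomputation takes. The function name, parameter names, and the `beta` default are assumptions made for illustration.

```python
import math
import random


def should_refresh_early(now, expires_at, recompute_seconds, beta=1.0, rng=random.random):
    """Return True if this reader should recompute the value before it expires.

    The closer `now` is to `expires_at`, the more likely a refresh becomes.
    Larger `beta` or slower recomputation makes refreshes start earlier.
    """
    u = 1.0 - rng()  # in (0, 1], which avoids log(0)
    head_start = -recompute_seconds * beta * math.log(u)  # always >= 0
    return now + head_start >= expires_at


# Usage sketch: with 60s left most readers decline; with 0.5s left many volunteer.
random.seed(7)
far = sum(should_refresh_early(0.0, 60.0, 1.0) for _ in range(1000))
near = sum(should_refresh_early(59.5, 60.0, 1.0) for _ in range(1000))
```

Because each reader draws its own random head start, refreshes are spread over a window before expiry instead of all firing at the moment the key disappears.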

Q3: How Do Message Queues Work and When Should You Use Them?

Answer:

Message queues enable asynchronous communication between services by decoupling producers (senders) from consumers (receivers). They provide buffering, load leveling, and, when configured with persistence and acknowledgments, reliable (typically at-least-once) delivery.

graph LR
    P1["Producer A
(Order Service)"]
    P2["Producer B
(Payment Service)"]
    P1 --> Q["Message Queue
(Kafka / RabbitMQ / SQS)"]
    P2 --> Q
    Q --> C1["Consumer 1
(Email Service)"]
    Q --> C2["Consumer 2
(Analytics Service)"]
    Q --> C3["Consumer 3
(Inventory Service)"]

    style Q fill:#56cc9d,stroke:#333,color:#fff

When to Use a Message Queue

Use Case	Without Queue	With Queue
Async processing	User waits for email to send (slow)	Return immediately, email sends in background
Load leveling	Traffic spike crashes service	Queue absorbs spike, consumers process at their pace
Decoupling	Service A calls Service B directly (tight coupling)	Service A publishes event, B consumes when ready
Retry/DLQ	Failed requests are lost	Failed messages retry with backoff, go to dead-letter queue
Fan-out	One service calls 5 downstream services	Publish once, 5 consumers process independently

Kafka vs RabbitMQ vs SQS

Feature	Apache Kafka	RabbitMQ	AWS SQS
Model	Distributed log (pub/sub + streaming)	Message broker (queues + exchanges)	Managed queue service
Ordering	Per partition (guaranteed)	Per queue (FIFO within a single queue)	FIFO queues only (standard queues are best-effort)
Throughput	Millions msgs/sec	Tens of thousands msgs/sec	Very high for standard queues; FIFO queues are rate-limited
Retention	Configurable (days/weeks/forever)	Until consumed/TTL	14 days max
Consumer model	Pull (consumers poll partitions)	Push (broker delivers to consumers)	Pull (long polling)
Replay	Yes (consumers can re-read from any offset)	No (message gone after ACK)	No
Use case	Event streaming, logs, analytics pipeline	Task queues, RPC, routing	Simple async tasks, serverless
Complexity	High (ZooKeeper/KRaft, partitions, offsets)	Medium (exchanges, bindings)	Low (fully managed)

Kafka Architecture Deep Dive

graph TD
    subgraph Producers
        P1["Producer 1"]
        P2["Producer 2"]
    end

    subgraph Kafka["Kafka Cluster"]
        subgraph Topic["Topic: orders (3 partitions)"]
            PART0["Partition 0
[msg1, msg4, msg7...]"]
            PART1["Partition 1
[msg2, msg5, msg8...]"]
            PART2["Partition 2
[msg3, msg6, msg9...]"]
        end
    end

    subgraph ConsumerGroup["Consumer Group: order-processors"]
        C1["Consumer 1
← Partition 0"]
        C2["Consumer 2
← Partition 1"]
        C3["Consumer 3
← Partition 2"]
    end

    P1 --> PART0
    P2 --> PART1
    PART0 --> C1
    PART1 --> C2
    PART2 --> C3

    style Kafka fill:#56cc9d,stroke:#333,color:#fff
    style ConsumerGroup fill:#ffce67,stroke:#333

Delivery Guarantees

Guarantee	Description	How to Achieve	Trade-off
At-most-once	Message delivered 0 or 1 times	No retries, fire and forget	May lose messages
At-least-once	Message delivered 1 or more times	Retry on failure, ACK after processing	May have duplicates
Exactly-once	Message delivered exactly 1 time	Idempotent consumers + transactional writes	Complex, slower

Exactly-once in practice:
  - Kafka: Idempotent producer + transactions + consumer offset commit
  - Application-level: Idempotency key in each message
    → Consumer checks: "Have I processed message with ID X?"
    → If yes → skip (dedup)
    → If no → process + record ID in DB (same transaction)
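
The application-level approach above can be sketched with SQLite from the standard library. This is a minimal illustration, assuming a hypothetical `processed_messages` table and a toy `balances` table; the key point is that recording the message ID and applying the side effect happen in the same transaction.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE processed_messages (message_id TEXT PRIMARY KEY)")
conn.execute("CREATE TABLE balances (account TEXT PRIMARY KEY, amount INTEGER NOT NULL)")
conn.execute("INSERT INTO balances VALUES ('acct-1', 0)")
conn.commit()


def handle_message(conn, message):
    """Apply a deposit once per message ID; return False for duplicates."""
    try:
        with conn:  # one transaction: commit on success, roll back on error
            conn.execute(
                "INSERT INTO processed_messages VALUES (?)", (message["id"],)
            )
            conn.execute(
                "UPDATE balances SET amount = amount + ? WHERE account = ?",
                (message["amount"], message["account"]),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # primary-key violation: this ID was already processed


msg = {"id": "msg-1", "account": "acct-1", "amount": 50}
results = [handle_message(conn, msg), handle_message(conn, msg)]  # delivered twice
balance = conn.execute(
    "SELECT amount FROM balances WHERE account = 'acct-1'"
).fetchone()[0]
```

The second delivery is rejected by the primary key, so the balance reflects the deposit only once.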

Dead Letter Queue (DLQ)

Message processing flow:
  1. Consumer picks up message
  2. Processing fails → retry (exponential backoff: 1s, 5s, 30s, 5min)
  3. After max retries (e.g., 5 attempts) → move to Dead Letter Queue
  4. DLQ messages are inspected manually or by automated systems
  5. Fix the bug → replay DLQ messages back to original queue

Why DLQ matters:
  - Prevents poison messages from blocking the queue
  - Preserves failed messages for debugging
  - Allows retry after fix is deployed
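
The retry-then-dead-letter flow can be sketched as follows. The `consume` function, its parameters, and the list used as a DLQ are hypothetical; the delays match the backoff schedule above, and the `sleep` hook defaults to a no-op so the example runs instantly (a real consumer would pass `time.sleep` or reschedule the message).

```python
def consume(message, handler, dead_letters, delays=(1, 5, 30, 300), sleep=lambda s: None):
    """Try the handler once, retry after each delay, then dead-letter the message.

    Returns the number of attempts made.
    """
    attempts = 0
    last_error = None
    for delay in (0, *delays):  # first attempt is immediate
        if delay:
            sleep(delay)
        attempts += 1
        try:
            handler(message)
            return attempts
        except Exception as exc:  # any failure triggers the next retry
            last_error = exc
    dead_letters.append(
        {"message": message, "error": repr(last_error), "attempts": attempts}
    )
    return attempts


dlq = []
calls = {"n": 0}


def flaky(message):
    """Fails twice with a transient error, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")


def poison(message):
    """Always fails, like a message the consumer cannot parse."""
    raise ValueError("bad payload")


flaky_attempts = consume("m1", flaky, dlq)
poison_attempts = consume("m2", poison, dlq)
```

The transient failure recovers on the third attempt, while the poison message ends up in the DLQ after five attempts instead of blocking the messages behind it.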

Q4: How Do You Design a Microservices Architecture?

Answer:

Microservices architecture structures an application as a collection of loosely coupled, independently deployable services, each owning its own data and business logic.

graph TD
    CLIENT["Client"]
    CLIENT --> GW["API Gateway"]
    GW --> US["User Service
(PostgreSQL)"]
    GW --> OS["Order Service
(MySQL)"]
    GW --> PS["Payment Service
(MongoDB)"]
    GW --> NS["Notification Service
(Redis)"]

    OS -->|"Event: order_created"| MQ["Message Bus
(Kafka)"]
    MQ --> PS
    MQ --> NS
    OS -->|"gRPC"| US

    subgraph SD["Service Discovery"]
        REG["Service Registry
(Consul / Eureka)"]
    end

    US --> REG
    OS --> REG
    PS --> REG

    style GW fill:#56cc9d,stroke:#333,color:#fff
    style MQ fill:#ffce67,stroke:#333
    style SD fill:#6cc3d5,stroke:#333,color:#fff

Microservices Design Principles

Principle	Description	Example
Single responsibility	Each service does one thing well	User Service only handles user CRUD + auth
Database per service	No shared databases	Order Service has its own MySQL instance
API-first	Define contracts before implementation	OpenAPI spec agreed before coding
Decentralized governance	Teams choose their own tech stack	User svc in Go, Analytics in Python
Design for failure	Assume any service can fail	Circuit breakers, retries, fallbacks
Smart endpoints, dumb pipes	Logic in services, not in the message bus	Services process events, Kafka just delivers

Service Communication Patterns

Pattern	Type	Use Case	Example
REST/HTTP	Synchronous	Simple CRUD operations	GET /users/123
gRPC	Synchronous	Low-latency internal calls	Service-to-service with protobuf
Event-driven (async)	Asynchronous	Decouple services, eventual consistency	OrderCreated event → Payment, Notification
Saga	Choreography/Orchestration	Distributed transactions	Order → Payment → Inventory (compensating on failure)

Saga Pattern for Distributed Transactions

graph LR
    subgraph Happy["Happy Path"]
        O1["Create Order
(PENDING)"] --> P1["Reserve Payment"]
        P1 --> I1["Reserve Inventory"]
        I1 --> O2["Confirm Order
(CONFIRMED)"]
    end

    subgraph Compensate["Compensation (on failure)"]
        I_FAIL["Inventory fails"] --> P_COMP["Refund Payment"]
        P_COMP --> O_COMP["Cancel Order"]
    end

    style Happy fill:#56cc9d,stroke:#333,color:#fff
    style Compensate fill:#ff7851,stroke:#333,color:#fff

Saga Type	Coordination	Pros	Cons
Choreography	Each service listens to events and acts	Decoupled, no central coordinator	Hard to track overall flow, debugging complex
Orchestration	Central orchestrator directs the workflow	Easy to understand and monitor	Orchestrator is a single point of failure
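
An orchestrated saga can be sketched as a loop over steps, each paired with a compensating action. The code below is a simplified in-process illustration; `run_saga`, the step names, and the no-op lambdas are hypothetical placeholders for real service calls, and a production orchestrator would also persist its progress so it can resume after a crash.

```python
def run_saga(steps):
    """Run (name, action, compensation) steps in order.

    On failure, undo the completed steps in reverse order.
    Returns (succeeded, log).
    """
    log = []
    completed = []
    for name, action, compensate in steps:
        try:
            action()
            log.append(f"done:{name}")
            completed.append((name, compensate))
        except Exception:
            log.append(f"failed:{name}")
            for done_name, undo in reversed(completed):
                undo()
                log.append(f"compensated:{done_name}")
            return False, log
    return True, log


def fail_inventory():
    raise RuntimeError("out of stock")


ok, saga_log = run_saga([
    ("create_order", lambda: None, lambda: None),      # compensation: cancel order
    ("reserve_payment", lambda: None, lambda: None),   # compensation: refund payment
    ("reserve_inventory", fail_inventory, lambda: None),
])
```

When the inventory step fails, the payment is refunded first and the order cancelled second, matching the compensation path in the diagram.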

Service Discovery

Problem: Services scale dynamically (pods come and go).
         How does Service A find Service B's current address?

Solution: Service Registry
  1. Service starts → registers itself (IP:port) with registry
  2. Service wants to call another → queries registry for addresses
  3. Registry health-checks registered services, removes dead ones
  4. Client-side load balancing across returned addresses

Tools:
  - Consul (HashiCorp) — service mesh + KV store + health checks
  - Eureka (Netflix) — Java-focused, Spring Cloud native
  - Kubernetes DNS — built-in (service-name.namespace.svc.cluster.local)
  - etcd — distributed KV store (used by Kubernetes internally)
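
The register/lookup/expire cycle can be illustrated with a small in-memory registry. This sketch assumes instances send periodic heartbeats and are dropped after a TTL, which is one possible liveness mechanism (the registry could instead poll each instance). The class name, addresses, and explicit `now` argument are hypothetical and chosen to keep the example deterministic.

```python
class ServiceRegistry:
    """In-memory registry: instances stay listed while their heartbeats are fresh."""

    def __init__(self, ttl_seconds=30):
        self.ttl = ttl_seconds
        self._instances = {}  # service name -> {address: last heartbeat time}

    def heartbeat(self, service, address, now):
        """Register an instance or refresh its liveness timestamp."""
        self._instances.setdefault(service, {})[address] = now

    def lookup(self, service, now):
        """Return live addresses, dropping any whose heartbeat is older than the TTL."""
        known = self._instances.get(service, {})
        live = {addr: seen for addr, seen in known.items() if now - seen <= self.ttl}
        self._instances[service] = live
        return sorted(live)


registry = ServiceRegistry(ttl_seconds=30)
registry.heartbeat("orders", "10.0.0.1:8080", now=0)
registry.heartbeat("orders", "10.0.0.2:8080", now=20)
addresses = registry.lookup("orders", now=40)  # first instance missed its TTL
```

A caller would then load-balance across the returned addresses on the client side, as in step 4 above.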

Q5: What Are Database Replication and Partitioning Strategies?

Answer:

Replication and partitioning are the two fundamental mechanisms for scaling databases beyond a single machine, addressing read throughput, write throughput, storage capacity, and availability.

graph TD
    subgraph Replication["Replication (copies of same data)"]
        PRIMARY["Primary (writes)"]
        PRIMARY -->|"Async/Sync replication"| REP1["Replica 1 (reads)"]
        PRIMARY -->|"Async/Sync replication"| REP2["Replica 2 (reads)"]
        PRIMARY -->|"Async/Sync replication"| REP3["Replica 3 (reads)"]
    end

    subgraph Partitioning["Partitioning / Sharding (split data)"]
        ROUTER["Router"]
        ROUTER --> SHARD1["Shard 1
Users A-H"]
        ROUTER --> SHARD2["Shard 2
Users I-P"]
        ROUTER --> SHARD3["Shard 3
Users Q-Z"]
    end

    style PRIMARY fill:#56cc9d,stroke:#333,color:#fff
    style ROUTER fill:#ffce67,stroke:#333

Replication Strategies

Strategy	How It Works	Consistency	Latency	Use Case
Synchronous	Primary waits for all replicas to ACK	Strong	High (slowest replica)	Financial transactions
Semi-synchronous	Primary waits for at least 1 replica ACK	Committed writes survive primary loss; other replicas may still lag	Medium	Critical data with some tolerance
Asynchronous	Primary doesn’t wait, replicas catch up	Eventual	Low	Read-heavy workloads, analytics

Replication Topologies

Topology	Description	Pros	Cons
Single-leader	One primary (writes), N replicas (reads)	Simple, no conflicts	Write bottleneck on primary
Multi-leader	Multiple primaries, each accepts writes	Write scaling, geo-distributed	Conflict resolution needed
Leaderless	Any node accepts reads/writes (quorum)	High availability, no failover	Complex, conflict resolution

Replication Lag Problems

Scenario: User updates profile (write to primary), immediately reads (from replica)
Problem: Replica hasn't received the update yet → shows stale data

Solutions:
  1. Read-your-writes consistency:
     → After write, read from primary for N seconds
     → Or track last-write timestamp, read from replica only if up-to-date

  2. Monotonic reads:
     → Always route same user to same replica (sticky reads)
     → Prevents seeing data go "backward"

  3. Causal consistency:
     → Track dependencies between writes
     → Replica only serves reads after all causal dependencies are applied
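
Solution 1 (read from the primary for N seconds after a write) can be sketched as a small routing helper. The class name, the 5-second window, and the explicit `now` argument are assumptions for illustration; a real router would also need to share this state across application servers, for example in a session or cookie.

```python
class ReadRouter:
    """Routes a user's reads to the primary shortly after that user writes."""

    def __init__(self, pin_seconds=5.0):
        self.pin_seconds = pin_seconds
        self._last_write = {}  # user_id -> time of that user's most recent write

    def record_write(self, user_id, now):
        self._last_write[user_id] = now

    def choose(self, user_id, now):
        """Return 'primary' inside the pin window, otherwise 'replica'."""
        last = self._last_write.get(user_id)
        if last is not None and now - last < self.pin_seconds:
            return "primary"
        return "replica"


router = ReadRouter(pin_seconds=5.0)
router.record_write("user_42", now=100.0)
right_after = router.choose("user_42", now=101.0)   # within the window
later = router.choose("user_42", now=106.0)         # window has passed
other_user = router.choose("user_7", now=101.0)     # never wrote
```

Only the user who wrote is pinned to the primary, so the extra load on the primary stays proportional to the write rate.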

Sharding (Partitioning) Deep Dive

Shard Key Strategy	Example	Pros	Cons
Hash-based	`shard = hash(user_id) % 4`	Even distribution	Range queries span all shards
Range-based	`shard1: dates Jan-Mar`	Efficient range scans	Hot shards (recent data accessed most)
Geographic	`shard_us, shard_eu, shard_asia`	Data locality, compliance	Uneven if one region dominates
Directory	Lookup table: `user123 → shard2`	Maximum flexibility	Directory is bottleneck/SPOF
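
One cost of the hash-based row that the table does not show is resharding. The sketch below assigns keys with `hash % N` using a stable hash (SHA-256 here, since Python's built-in `hash()` for strings varies between runs) and counts how many keys change shard when going from 4 to 5 shards. The function name and key format are hypothetical.

```python
import hashlib


def shard_for(key, num_shards):
    """Map a key to a shard using a stable hash and modulo."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


keys = [f"user-{i}" for i in range(10_000)]
moved = sum(shard_for(k, 4) != shard_for(k, 5) for k in keys)
moved_fraction = moved / len(keys)
```

A key keeps its shard only when its hash gives the same remainder modulo 4 and modulo 5, which happens for about one hash in five, so roughly four in five keys are expected to move. This is the main motivation for consistent hashing or directory-based schemes when the shard count changes often.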

Cross-Shard Operations

Challenge: Query that spans multiple shards (e.g., "all orders > $100")

Approaches:
  1. Scatter-gather: Query all shards, merge results (expensive)
  2. Denormalize: Copy needed data into each shard (storage trade-off)
  3. Global index: Secondary index service spans all shards
  4. Avoid: Design schema so most queries hit single shard
     → Shard by user_id, and most queries are user-scoped

Q6: How Does Kubernetes Work and How Do You Design for Container Orchestration?

Answer:

Kubernetes (K8s) is a container orchestration platform that automates deployment, scaling, and management of containerized applications. It’s the de facto standard for running microservices in production.

graph TD
    subgraph ControlPlane["Control Plane"]
        API["API Server
(kube-apiserver)"]
        SCHED["Scheduler
(kube-scheduler)"]
        CM["Controller Manager"]
        ETCD["etcd
(cluster state store)"]
    end

    subgraph WorkerNode["Worker Node 1"]
        KUBELET["kubelet"]
        PROXY["kube-proxy"]
        POD1["Pod A
(Container 1)"]
        POD2["Pod B
(Container 2, Container 3)"]
    end

    subgraph WorkerNode2["Worker Node 2"]
        KUBELET2["kubelet"]
        POD3["Pod C"]
        POD4["Pod D"]
    end

    API --> SCHED
    API --> CM
    API --> ETCD
    API --> KUBELET
    API --> KUBELET2

    style ControlPlane fill:#56cc9d,stroke:#333,color:#fff
    style WorkerNode fill:#ffce67,stroke:#333
    style WorkerNode2 fill:#6cc3d5,stroke:#333,color:#fff

Core Kubernetes Objects

Object	Purpose	Example
Pod	Smallest deployable unit (1+ containers)	Single instance of your app
Deployment	Manages desired state of Pods (replicas, rolling updates)	“Run 3 replicas of user-service v2”
Service	Stable network endpoint for a set of Pods	Load-balanced IP for user-service Pods
Ingress	External HTTP routing to services	`api.example.com/users → user-service`
ConfigMap	Non-sensitive configuration	Database host, feature flags
Secret	Sensitive data (base64-encoded by default; encryption at rest must be enabled explicitly)	DB passwords, API keys
HPA	Horizontal Pod Autoscaler	Scale Pods based on CPU/memory/custom metrics
PVC	Persistent Volume Claim	Attach storage to stateful Pods

Deployment Strategies in Kubernetes

graph LR
    subgraph Rolling["Rolling Update (default)"]
        R1["v1 v1 v1"] --> R2["v2 v1 v1"] --> R3["v2 v2 v1"] --> R4["v2 v2 v2"]
    end

    subgraph BlueGreen["Blue-Green"]
        BG1["Blue (v1) ← traffic"] --> BG2["Green (v2) ← traffic"]
    end

    subgraph Canary["Canary"]
        CAN1["v1: 90% traffic
v2: 10% traffic"] --> CAN2["v1: 0%
v2: 100%"]
    end

    style Rolling fill:#56cc9d,stroke:#333,color:#fff
    style BlueGreen fill:#ffce67,stroke:#333
    style Canary fill:#6cc3d5,stroke:#333,color:#fff

Strategy	How It Works	Rollback	Risk
Rolling update	Replace Pods gradually (bounded by maxSurge/maxUnavailable)	Manual (`kubectl rollout undo`); a failing rollout stalls rather than reverting on its own	Brief period with mixed versions
Blue-Green	Run two full environments, switch traffic	Instant (switch back to blue)	2x resources during deployment
Canary	Route small % of traffic to new version	Instant (route all to old)	Complex routing rules
A/B testing	Route by user attributes (region, ID)	Instant	Requires feature flag infrastructure

Resource Management

# Pod resource specification
resources:
  requests:           # Reserved for scheduling (guaranteed minimum)
    cpu: "250m"       # 0.25 CPU cores
    memory: "256Mi"   # 256 MiB RAM
  limits:             # Maximum allowed (CPU is throttled; exceeding memory gets the container OOM-killed)
    cpu: "1000m"      # 1 CPU core
    memory: "512Mi"   # 512 MiB RAM

# HPA (Horizontal Pod Autoscaler)
# Scale between 3-10 pods when CPU > 70%
minReplicas: 3
maxReplicas: 10
targetCPUUtilizationPercentage: 70
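
The scaling decision behind the HPA settings above follows the formula given in the Kubernetes documentation: desired replicas = ceil(current replicas × current metric / target metric), clamped to the min/max bounds. The sketch below applies that formula with the values from the fragment; the function name is hypothetical, and the real controller additionally applies a tolerance band and stabilization windows that are omitted here.

```python
import math


def desired_replicas(current_replicas, current_cpu_pct,
                     target_cpu_pct=70, min_replicas=3, max_replicas=10):
    """Core HPA formula, clamped to the configured replica bounds."""
    raw = math.ceil(current_replicas * current_cpu_pct / target_cpu_pct)
    return max(min_replicas, min(max_replicas, raw))


scale_up = desired_replicas(4, 90)      # ceil(4 * 90 / 70) = ceil(5.14) = 6
scale_down = desired_replicas(4, 35)    # ceil(2.0) = 2, raised to the minimum of 3
capped = desired_replicas(8, 140)       # ceil(16.0) = 16, capped at the maximum of 10
```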

Kubernetes Networking

Concept	Purpose
ClusterIP	Internal service (only within cluster)
NodePort	Expose service on each node’s IP at a static port
LoadBalancer	Provision a cloud load balancer (e.g., AWS NLB) for external traffic; ALBs are typically provisioned through an Ingress controller
Ingress	L7 routing rules (path-based, host-based)
Network Policy	Firewall rules between Pods (default: all-open)
Service Mesh	Sidecar proxy for mTLS, observability, traffic control

Q7: How Do You Design a CI/CD Pipeline?

Answer:

CI/CD (Continuous Integration / Continuous Delivery) automates the process of building, testing, and deploying software. A well-designed pipeline ensures rapid, reliable releases with minimal manual intervention.

graph LR
    DEV["Developer
pushes code"]
    DEV --> CI["CI Pipeline"]

    subgraph CI["Continuous Integration"]
        BUILD["Build
(compile, deps)"]
        LINT["Lint &
Static Analysis"]
        TEST["Unit Tests"]
        INT["Integration Tests"]
        SEC["Security Scan
(SAST, deps)"]
        IMG["Build Container
Image"]
    end

    CI --> CD["CD Pipeline"]

    subgraph CD["Continuous Delivery"]
        STAGE["Deploy to
Staging"]
        E2E["E2E Tests
(staging)"]
        APPROVE["Manual Approval
(optional)"]
        PROD["Deploy to
Production"]
        SMOKE["Smoke Tests
(production)"]
    end

    style CI fill:#56cc9d,stroke:#333,color:#fff
    style CD fill:#6cc3d5,stroke:#333,color:#fff

CI/CD Pipeline Stages

Stage	Purpose	Tools	Feedback Time
Lint / Format	Code style consistency	ESLint, Black, gofmt	< 30s
Unit Tests	Test individual functions/classes	pytest, JUnit, Jest	1-5 min
Build	Compile code, resolve dependencies	Maven, npm, pip	1-3 min
Integration Tests	Test service interactions	Testcontainers, docker-compose	5-15 min
Security Scan (SAST)	Find vulnerabilities in code	Snyk, SonarQube, Semgrep	2-5 min
Container Build	Build Docker image, push to registry	Docker, Buildah, Kaniko	2-5 min
Deploy to Staging	Deploy to pre-production environment	ArgoCD, Helm, Terraform	3-10 min
E2E Tests	Full user flow tests in staging	Playwright, Cypress, Selenium	10-30 min
Deploy to Production	Rolling update / canary / blue-green	ArgoCD, Spinnaker, Flux	5-15 min
Smoke Tests	Verify critical paths in production	Custom health checks, synthetic monitors	1-3 min

CI/CD Best Practices

Pipeline design principles:
  1. Fast feedback: fail early (lint → unit tests → integration)
  2. Immutable artifacts: build once, deploy to all environments
  3. Environment parity: staging mirrors production
  4. Infrastructure as Code: Terraform/Pulumi for infra changes
  5. GitOps: desired state in Git, reconciler applies it (ArgoCD)
  6. Feature flags: decouple deployment from release
  7. Rollback plan: every deployment has automated rollback trigger

Branch strategy:
  - Trunk-based development (preferred for fast teams):
    → Short-lived feature branches (< 1 day)
    → Merge to main frequently
    → Feature flags hide incomplete work
    → Main is always deployable

  - GitFlow (for teams needing release management):
    → develop → feature branches → release branches → main
    → More overhead, longer release cycles

GitOps with ArgoCD

GitOps workflow:
  1. Developer merges PR → main branch
  2. CI pipeline builds image → pushes to registry (e.g., v1.2.3)
  3. CI updates manifest repo (Helm values / kustomize with new image tag)
  4. ArgoCD detects drift between Git manifest and cluster state
  5. ArgoCD applies changes to Kubernetes cluster
  6. If deployment fails health checks → roll back via git revert (or automate it with a progressive-delivery tool such as Argo Rollouts)

Benefits:
  - Git is single source of truth
  - Full audit trail (who changed what, when)
  - Easy rollback (git revert)
  - Declarative (describe desired state, not imperative steps)

Q8: How Do You Design a Monitoring and Observability System?

Answer:

Observability is the ability to understand a system’s internal state by examining its external outputs. The three pillars are metrics, logs, and traces. Together they enable debugging, alerting, and performance optimization.

graph TD
    APPS["Applications / Services"]
    APPS -->|"Metrics"| PROM["Prometheus
(time-series DB)"]
    APPS -->|"Logs"| ELK["ELK Stack
(Elasticsearch + Logstash + Kibana)"]
    APPS -->|"Traces"| JAEGER["Jaeger / Zipkin
(distributed tracing)"]

    PROM --> GRAFANA["Grafana
(dashboards)"]
    ELK --> GRAFANA
    PROM --> ALERT["Alertmanager
(PagerDuty, Slack)"]

    style PROM fill:#56cc9d,stroke:#333,color:#fff
    style ELK fill:#ffce67,stroke:#333
    style JAEGER fill:#6cc3d5,stroke:#333,color:#fff

Three Pillars of Observability

Pillar	What	Why	Tools
Metrics	Numeric measurements over time (counters, gauges, histograms)	Alerting, capacity planning, SLOs	Prometheus, Datadog, CloudWatch
Logs	Structured event records	Debugging specific issues, audit trail	ELK, Loki, Splunk, CloudWatch Logs
Traces	Request journey across services	Find latency bottlenecks across microservices	Jaeger, Zipkin, AWS X-Ray, OpenTelemetry

Key Metrics (The Four Golden Signals)

Signal	What to Measure	Alert Threshold Example
Latency	Time to serve a request (P50, P95, P99)	P99 > 500ms for 5 minutes
Traffic	Requests per second (QPS)	Sudden drop > 50% (indicates failure)
Errors	Error rate (5xx, timeouts, failed operations)	Error rate > 1% for 3 minutes
Saturation	Resource utilization (CPU, memory, disk, connections)	CPU > 80% for 10 minutes

Structured Logging

{
  "timestamp": "2026-05-21T10:30:45.123Z",
  "level": "ERROR",
  "service": "order-service",
  "trace_id": "abc-123-def-456",
  "span_id": "span-789",
  "user_id": "user_42",
  "method": "POST",
  "path": "/api/v1/orders",
  "status_code": 500,
  "duration_ms": 2345,
  "error": "ConnectionRefusedError: payment-service:8080",
  "message": "Failed to process payment for order"
}
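
A log line like the one above can be produced with Python's standard `logging` module and a custom formatter. This is a minimal sketch: the `JsonFormatter` class, the hard-coded service name, and the convention of passing extra fields under a `context` key are assumptions made for this example, not a standard API.

```python
import io
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record):
        entry = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "service": "order-service",  # assumed service name
            "message": record.getMessage(),
        }
        # Fields passed via extra={"context": {...}} are merged into the entry.
        entry.update(getattr(record, "context", {}))
        return json.dumps(entry)


stream = io.StringIO()  # stand-in for stdout or a log shipper
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("order-service")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

logger.error(
    "Failed to process payment for order",
    extra={"context": {"trace_id": "abc-123-def-456", "status_code": 500}},
)
logged = json.loads(stream.getvalue())
```

Carrying `trace_id` in every line is what lets a log search be joined with the corresponding distributed trace.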

Distributed Tracing

Request: User places order
  
  ┌─ API Gateway (12ms) ─────────────────────────────────────┐
  │  ┌─ Order Service (45ms) ──────────────────────────────┐  │
  │  │  ┌─ User Service (8ms) ────┐                        │  │
  │  │  └─────────────────────────┘                        │  │
  │  │  ┌─ Payment Service (320ms) ← BOTTLENECK ─────────┐│  │
  │  │  │  ┌─ Stripe API (280ms) ───────────────────────┐││  │
  │  │  │  └────────────────────────────────────────────┘││  │
  │  │  └────────────────────────────────────────────────┘│  │
  │  │  ┌─ Inventory Service (15ms)──┐                     │  │
  │  │  └────────────────────────────┘                     │  │
  │  └─────────────────────────────────────────────────────┘  │
  └────────────────────────────────────────────────────────────┘
  Total: 392ms (Payment → Stripe is 71% of total time)

SLOs, SLIs, and SLAs

Term	Definition	Example
SLI (Service Level Indicator)	The metric you measure	P99 latency, availability %, error rate
SLO (Service Level Objective)	The target for the SLI	P99 latency < 200ms, 99.9% availability
SLA (Service Level Agreement)	Contract with consequences if SLO breached	99.9% uptime or customer gets credits
Error Budget	How much failure is allowed before violating SLO	99.9% = 43 minutes downtime/month budget
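
The error-budget figure in the last row is plain arithmetic: the allowed downtime is the window length multiplied by (100% minus the SLO). The sketch below reproduces it, assuming a 30-day month; the function name is hypothetical.

```python
def error_budget_minutes(slo_percent, days=30):
    """Minutes of allowed downtime in a window for a given availability SLO."""
    total_minutes = days * 24 * 60  # 43,200 minutes in a 30-day month
    return total_minutes * (100 - slo_percent) / 100


three_nines = round(error_budget_minutes(99.9), 2)    # 43.2 minutes
four_nines = round(error_budget_minutes(99.99), 2)    # 4.32 minutes
```

Each additional nine cuts the budget by a factor of ten, which is why moving from 99.9% to 99.99% usually requires automated failover rather than manual response.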

Alerting Strategy

Alert design principles:
  1. Alert on symptoms, not causes
     ✅ "Error rate > 5% for 3 min"  (symptom)
     ❌ "CPU > 90%"  (may not impact users)

  2. Severity levels:
     - P1 (Critical): Revenue impacting, page immediately
     - P2 (High): Degraded service, page during business hours
     - P3 (Medium): Non-urgent, ticket in queue
     - P4 (Low): Informational, dashboard only

  3. Reduce noise:
     - Group related alerts
     - Require duration threshold (not single spike)
     - Suppress during maintenance windows
     - Escalation: Slack → PagerDuty → phone call

Q9: How Does Event-Driven Architecture Work?

Answer:

Event-Driven Architecture (EDA) is a design pattern where services communicate by producing and consuming events (facts about something that happened). It enables loose coupling, real-time processing, and scalable async workflows.

graph TD
    subgraph Producers["Event Producers"]
        US["User Service
→ UserCreated"]
        OS["Order Service
→ OrderPlaced"]
        PS["Payment Service
→ PaymentProcessed"]
    end

    subgraph EventBus["Event Bus / Broker (Kafka)"]
        T1["Topic: user-events"]
        T2["Topic: order-events"]
        T3["Topic: payment-events"]
    end

    subgraph Consumers["Event Consumers"]
        EMAIL["Email Service"]
        ANALYTICS["Analytics Service"]
        INVENTORY["Inventory Service"]
        SEARCH["Search Indexer"]
    end

    US --> T1
    OS --> T2
    PS --> T3
    T1 --> EMAIL
    T1 --> ANALYTICS
    T2 --> INVENTORY
    T2 --> ANALYTICS
    T3 --> EMAIL
    T3 --> SEARCH

    style EventBus fill:#56cc9d,stroke:#333,color:#fff
    style Consumers fill:#ffce67,stroke:#333

Event Types

Type	Description	Example	Size
Domain Event	Something significant happened in the business	`OrderPlaced`, `UserRegistered`	Small (metadata + IDs)
Integration Event	Event shared between services (bounded contexts)	`PaymentCompleted` consumed by Order service	Small
Event-Carried State Transfer	Event contains full state (eliminates need to query source)	`OrderPlaced { items: [...], total: 99.50, address: {...} }`	Large
Change Data Capture (CDC)	Database changes streamed as events	Debezium captures INSERT/UPDATE/DELETE from DB binlog	Row-level

Event Sourcing

graph LR
    CMD["Command:
PlaceOrder"]
    CMD --> ES["Event Store
(append-only log)"]
    ES --> E1["OrderCreated"]
    ES --> E2["ItemAdded (x3)"]
    ES --> E3["PaymentReceived"]
    ES --> E4["OrderShipped"]

    ES -->|"Replay events"| STATE["Current State:
Order #123
Status: Shipped
Items: 3
Total: $59.99"]

    style ES fill:#56cc9d,stroke:#333,color:#fff
    style STATE fill:#6cc3d5,stroke:#333,color:#fff

Aspect	Traditional (CRUD)	Event Sourcing
Storage	Current state only	Full history of events
State	Mutable (UPDATE/DELETE)	Immutable (append-only)
Audit trail	Requires separate logging	Built-in (every change is an event)
Debugging	“Why is it in this state?”	Replay events to see exactly what happened
Complexity	Simple CRUD operations	Event replay, projections, eventual consistency
Best for	Simple domains	Financial systems, audit-heavy, undo/redo needed
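
Rebuilding state by replaying events amounts to a fold over the event log. The sketch below uses the event names from the diagram above; the event payload fields, the prices, and the `apply_event` function are hypothetical, and amounts are kept in integer cents to avoid floating-point rounding.

```python
from functools import reduce


def apply_event(state, event):
    """Return the new state after applying one event (never mutates the old state)."""
    kind, data = event["type"], event.get("data", {})
    if kind == "OrderCreated":
        return {"order_id": data["order_id"], "status": "Created",
                "items": 0, "total_cents": 0}
    if kind == "ItemAdded":
        return {**state, "items": state["items"] + 1,
                "total_cents": state["total_cents"] + data["price_cents"]}
    if kind == "PaymentReceived":
        return {**state, "status": "Paid"}
    if kind == "OrderShipped":
        return {**state, "status": "Shipped"}
    return state  # unknown event types are ignored


events = [
    {"type": "OrderCreated", "data": {"order_id": 123}},
    {"type": "ItemAdded", "data": {"price_cents": 1999}},
    {"type": "ItemAdded", "data": {"price_cents": 2000}},
    {"type": "ItemAdded", "data": {"price_cents": 2000}},
    {"type": "PaymentReceived"},
    {"type": "OrderShipped"},
]

current_state = reduce(apply_event, events, None)
state_before_shipping = reduce(apply_event, events[:-1], None)  # "time travel"
```

Replaying a prefix of the log yields the state at that point in history, which is what makes the audit and debugging rows in the table possible.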

CQRS (Command Query Responsibility Segregation)

graph TD
    CLIENT["Client"]
    CLIENT -->|"Write (Command)"| WRITE["Write Model
(normalized DB)"]
    CLIENT -->|"Read (Query)"| READ["Read Model
(denormalized views)"]

    WRITE -->|"Events"| PROJ["Projection Service"]
    PROJ --> READ

    style WRITE fill:#56cc9d,stroke:#333,color:#fff
    style READ fill:#6cc3d5,stroke:#333,color:#fff
    style PROJ fill:#ffce67,stroke:#333

Aspect	Description
Why CQRS	Reads and writes have different performance profiles and scaling needs
Write side	Normalized, optimized for consistency and validation
Read side	Denormalized, pre-computed views optimized for queries
Sync mechanism	Events from write side update read projections (async)
Trade-off	Eventual consistency between write and read models
Pairs with	Event Sourcing (events feed both write log and read projections)
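
A small in-memory sketch of the write model, projection, and read model from the diagram. The class names (`WriteModel`, `UserSpendView`) and the `run_projection` helper are hypothetical; in a real system the projection would consume events asynchronously from a broker rather than from a list.

```python
from collections import defaultdict


class WriteModel:
    """Normalized source of truth; validates commands and emits events."""

    def __init__(self):
        self.orders = {}    # order_id -> row
        self.pending = []   # events not yet applied to the read side

    def place_order(self, order_id: str, user_id: str, total_cents: int) -> None:
        if total_cents <= 0:
            raise ValueError("order total must be positive")
        self.orders[order_id] = {"user_id": user_id, "total_cents": total_cents}
        self.pending.append(
            {"type": "OrderPlaced", "user_id": user_id, "total_cents": total_cents}
        )


class UserSpendView:
    """Denormalized read model: one pre-computed row per user."""

    def __init__(self):
        self.rows = defaultdict(lambda: {"orders": 0, "spent_cents": 0})

    def apply(self, event: dict) -> None:
        if event["type"] == "OrderPlaced":
            row = self.rows[event["user_id"]]
            row["orders"] += 1
            row["spent_cents"] += event["total_cents"]


def run_projection(write_model: WriteModel, view: UserSpendView) -> None:
    """Stand-in for the async projection service: drain events into the view."""
    while write_model.pending:
        view.apply(write_model.pending.pop(0))


write_model, view = WriteModel(), UserSpendView()
write_model.place_order("o1", "alice", 2500)
write_model.place_order("o2", "alice", 1000)
stale = dict(view.rows)          # read side has not caught up yet
run_projection(write_model, view)
```

The `stale` snapshot is empty even though two orders were accepted, which is the eventual-consistency trade-off from the table in miniature.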

Idempotent Event Processing

Problem: Network failures → events may be delivered multiple times.
         Consumer must handle duplicates safely.

Solutions:
  1. Idempotency key in every event:
     Event: { "id": "evt_abc123", "type": "PaymentReceived", "data": {...} }
     Consumer: 
       - Before processing, check: "Have I seen evt_abc123?"
       - If yes → skip
       - If no → process + record evt_abc123 in processed_events table

  2. Idempotent operations (naturally safe):
     - SET operations (overwrite): last write wins
     - Upsert with same data: same result regardless of count

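  3. Transactional outbox pattern (prevents lost events on the producer side):
     - Write business data + event to same DB (single transaction)
     - Background process reads outbox table → publishes to Kafka
     - Guarantees: if data saved, event will eventually publish (at-least-once,
       so consumers still need solution 1 or 2 to absorb duplicates)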
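
Solution 1 can be sketched with an in-memory SQLite database standing in for the consumer's store. The `balances` table and `handle_payment` function are hypothetical. The key assumption is that the dedup record and the business update commit in one transaction, so a crash between them cannot leave the event half-processed.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE processed_events (event_id TEXT PRIMARY KEY)")
    conn.execute(
        "CREATE TABLE balances (account TEXT PRIMARY KEY, cents INTEGER NOT NULL)"
    )
    conn.execute("INSERT INTO balances VALUES ('acct_1', 0)")
    conn.commit()
    return conn


def handle_payment(conn: sqlite3.Connection, event: dict) -> bool:
    """Apply the event once. Returns False when it was already processed."""
    try:
        with conn:  # one transaction: commit on success, roll back on error
            # The primary key rejects a second insert of the same event id.
            conn.execute(
                "INSERT INTO processed_events (event_id) VALUES (?)", (event["id"],)
            )
            conn.execute(
                "UPDATE balances SET cents = cents + ? WHERE account = ?",
                (event["data"]["cents"], event["data"]["account"]),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # duplicate delivery: skip


conn = make_db()
event = {
    "id": "evt_abc123",
    "type": "PaymentReceived",
    "data": {"account": "acct_1", "cents": 2500},
}
first = handle_payment(conn, event)
second = handle_payment(conn, event)   # simulated redelivery
balance = conn.execute(
    "SELECT cents FROM balances WHERE account = 'acct_1'"
).fetchone()[0]
```

The second delivery is rejected by the primary key, so the balance is credited only once.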

Q10: How Do You Design for Service Mesh and Inter-Service Communication?

Answer:

A service mesh is an infrastructure layer that handles service-to-service communication, providing observability, security (mTLS), and traffic management without changing application code. It’s typically implemented as sidecar proxies alongside each service.

graph TD
    subgraph PodA["Pod: Order Service"]
        APP_A["Order Service
(application)"]
        PROXY_A["Envoy Sidecar
(proxy)"]
    end

    subgraph PodB["Pod: Payment Service"]
        APP_B["Payment Service
(application)"]
        PROXY_B["Envoy Sidecar
(proxy)"]
    end

    subgraph PodC["Pod: User Service"]
        APP_C["User Service
(application)"]
        PROXY_C["Envoy Sidecar
(proxy)"]
    end

    PROXY_A -->|"mTLS"| PROXY_B
    PROXY_A -->|"mTLS"| PROXY_C

    CP["Control Plane
(Istio / Linkerd)"]
    CP -->|"Config, certs"| PROXY_A
    CP -->|"Config, certs"| PROXY_B
    CP -->|"Config, certs"| PROXY_C

    style CP fill:#56cc9d,stroke:#333,color:#fff
    style PodA fill:#ffce67,stroke:#333
    style PodB fill:#6cc3d5,stroke:#333,color:#fff

What a Service Mesh Provides

Feature	Description	Without Mesh
mTLS	Automatic encryption + identity between all services	Each service manages certs manually
Traffic management	Canary releases, A/B testing, fault injection	Custom load balancer config per service
Observability	Automatic metrics, traces, access logs from proxy	Instrument every service manually
Retries & timeouts	Configurable retry policies per route	Each service implements retry logic
Circuit breaking	Auto-stop traffic to failing services	Library-based (Hystrix, Resilience4j)
Rate limiting	Per-service traffic control	Centralized rate limiter service
Access control	Policy-based authorization (which service can call which)	Manual firewall rules / code checks

Service Mesh Comparison

Feature	Istio	Linkerd	Consul Connect
Proxy	Envoy	Linkerd2-proxy (Rust)	Envoy or built-in
Complexity	High (many CRDs)	Low (lightweight)	Medium
Performance	Moderate overhead	Low overhead	Low overhead
Features	Full-featured (traffic, security, observability)	Core features, simple	Service discovery + mesh
Best for	Large orgs needing full control	Teams wanting simplicity	HashiCorp ecosystem users

Traffic Management Patterns

Pattern	Purpose	Configuration
Canary	Route 5% traffic to v2, 95% to v1	Weight-based routing
Header-based routing	Internal testers get v2 via header `x-version: canary`	Match rules on headers
Fault injection	Inject 500ms delay to test resilience	Delay/abort rules for testing
Mirroring	Copy production traffic to test environment	Traffic shadowing (no impact to users)
Circuit breaking	Max 100 concurrent requests per service	Connection pool limits
Retry budget	Max 20% additional requests as retries	Prevent retry storms
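
In a mesh, the canary and header-based rules above are proxy configuration rather than application code. The routing decision itself can still be sketched in a few lines. The function below is a hypothetical illustration, not an Istio or Linkerd API. It assumes requests carry an ID and hashes it so the same ID always lands on the same version, whereas a proxy may instead pick randomly by weight on each request.

```python
import hashlib
from typing import Mapping, Optional


def pick_version(
    request_id: str,
    headers: Optional[Mapping[str, str]] = None,
    canary_percent: int = 5,
) -> str:
    """Return "v2" for canary traffic and "v1" otherwise."""
    # Header-based routing: internal testers opt in explicitly.
    if headers and headers.get("x-version") == "canary":
        return "v2"
    # Weight-based routing: map the request ID to a stable bucket in 0..99.
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return "v2" if bucket < canary_percent else "v1"


sample = [pick_version(f"req-{i}") for i in range(10_000)]
canary_share = sample.count("v2") / len(sample)
```

Over many request IDs the canary share settles near the configured 5%, and raising `canary_percent` step by step gives a gradual rollout.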

When to Use (and NOT Use) a Service Mesh

Use a service mesh when:
  ✅ Running 10+ microservices in production
  ✅ Need mTLS between all services (zero trust)
  ✅ Want consistent observability without code changes
  ✅ Complex traffic routing (canary, A/B, fault injection)
  ✅ Need policy-based access control

Do NOT use when:
  ❌ Fewer than 5 services (overhead not worth it)
  ❌ Team doesn't have Kubernetes expertise
  ❌ Simple request-response with no special routing
  ❌ Latency-critical paths where sidecar overhead matters (~1-3ms)
  ❌ Monolith or early-stage product

Summary Table

#	Topic	Key Concepts
1	Load Balancing	L4 vs L7, algorithms (round robin, least connections, consistent hashing), health checks, sticky sessions
2	Caching	Cache-aside, write-through, write-behind, eviction policies, Redis vs Memcached, stampede prevention
3	Message Queues	Kafka vs RabbitMQ vs SQS, delivery guarantees, DLQ, partitions, consumer groups
4	Microservices	Service communication, Saga pattern, service discovery, database per service
5	Database Scaling	Replication (sync/async), sharding strategies, replication lag, cross-shard queries
6	Kubernetes	Pods, Deployments, Services, HPA, rolling/blue-green/canary deploys, resource limits
7	CI/CD	Pipeline stages, GitOps, ArgoCD, trunk-based development, immutable artifacts
8	Monitoring	Metrics/Logs/Traces, four golden signals, SLOs, alerting strategy, distributed tracing
9	Event-Driven Architecture	Event sourcing, CQRS, CDC, idempotent processing, transactional outbox
10	Service Mesh	Sidecar proxy (Envoy), mTLS, traffic management, Istio vs Linkerd, when to use

What’s Next?

This article covered infrastructure components and operational patterns. Continue with:

Foundational concepts: System Design Interview QA - 1 — scalability, CAP theorem, APIs, networking, security
Hands-on design problems: System Design Interview QA - 3 — URL shortener, chat system, news feed, video streaming
Design patterns: Design Pattern Interview QA - 1
Enterprise patterns (Spring, CQRS): Design Pattern Interview QA - 2

System Design Interview QA - 3

Vectoring AI — Thu, 21 May 2026 00:00:00 GMT

Introduction

This is Part 3 of our System Design Interview QA series, covering 10 system design questions that come up frequently in interviews at FAANG+ companies. Each question follows the same 4-step framework: Requirements → High-Level Design → Deep Dive → Trade-offs.

For foundational concepts, see System Design Interview QA - 1. For infrastructure deep dives, see System Design Interview QA - 2. For design patterns, see Design Pattern Interview QA - 1.

Q1: Design a URL Shortener (TinyURL)

Answer:

A URL shortener maps long URLs to short, unique aliases (e.g., https://tiny.url/a1b2c3) and redirects users to the original URL.

graph TD
    USER["User"]
    USER -->|"POST /shorten
{url: 'https://very-long-url.com/...'}"| API["API Service"]
    API --> KEYGEN["Key Generator
(Base62 encoding)"]
    API --> DB["Database
(short_code → original_url)"]
    API -->|"Returns: tiny.url/a1b2c3"| USER

    USER2["Visitor"]
    USER2 -->|"GET /a1b2c3"| LB["Load Balancer"]
    LB --> CACHE["Cache (Redis)
(hot URLs)"]
    CACHE -->|"miss"| DB
    LB -->|"301 Redirect"| USER2

    style API fill:#56cc9d,stroke:#333,color:#fff
    style CACHE fill:#ffce67,stroke:#333
    style DB fill:#6cc3d5,stroke:#333,color:#fff

Requirements

Type	Requirement
Functional	Shorten a URL → return short link; Redirect short link → original URL; Optional: custom aliases, expiration, analytics
Non-functional	Low latency redirects (<100ms); High availability (99.99%); 100M URLs/day write, 10:1 read-to-write ratio
Capacity	~36.5B URLs/year; ~1KB per record → ~36.5TB storage/year; ~12K reads/sec average, ~100K reads/sec peak

Key Design Decisions

Short Code Generation:
  Option A: Hash (MD5/SHA256) → take first 7 chars → collision check
  Option B: Pre-generated key service (counter-based, Base62 encoded)
  Option C: Snowflake ID → Base62 encode

  Recommended: Counter-based with Base62 encoding
    - 7 chars of Base62 = 62^7 = ~3.5 trillion unique codes
    - No collision checking needed
    - Monotonically increasing → good for DB indexing
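
The counter-to-code step can be sketched as follows. The alphabet order (digits, then lowercase, then uppercase) is an assumption; any fixed ordering of the 62 symbols works as long as encode and decode agree.

```python
import string

# 62 symbols: 0-9, a-z, A-Z (the ordering is an arbitrary but fixed choice).
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase
BASE = len(ALPHABET)


def base62_encode(n: int) -> str:
    """Encode a non-negative counter value as a Base62 string."""
    if n < 0:
        raise ValueError("counter must be non-negative")
    if n == 0:
        return ALPHABET[0]
    chars = []
    while n:
        n, remainder = divmod(n, BASE)
        chars.append(ALPHABET[remainder])
    return "".join(reversed(chars))


def base62_decode(code: str) -> int:
    """Inverse of base62_encode."""
    n = 0
    for ch in code:
        n = n * BASE + ALPHABET.index(ch)
    return n


largest_7_char_code = base62_encode(BASE ** 7 - 1)
```

Every counter value below 62^7 (3,521,614,606,208) fits in at most 7 characters, so the `VARCHAR(7)` column is sufficient until the counter passes that value.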

Database Schema

-- URLs table
CREATE TABLE urls (
    short_code  VARCHAR(7) PRIMARY KEY,  -- Base62 encoded
    original_url TEXT NOT NULL,
    user_id     BIGINT,
    created_at  TIMESTAMP DEFAULT NOW(),
    expires_at  TIMESTAMP,
    click_count BIGINT DEFAULT 0
);

-- Analytics (separate table for write performance)
CREATE TABLE clicks (
    id          BIGSERIAL PRIMARY KEY,
    short_code  VARCHAR(7),
    clicked_at  TIMESTAMP DEFAULT NOW(),
    user_agent  TEXT,
    ip_address  INET,
    referrer    TEXT
);

Trade-offs

Decision	Option A	Option B	Recommendation
Storage	SQL (PostgreSQL)	NoSQL (DynamoDB)	NoSQL for scale — simple key-value access pattern
Redirect	301 (permanent)	302 (temporary)	302 if you need analytics; 301 for caching
Caching	Cache all	Cache hot URLs only	Cache hot URLs in Redis (80/20 rule)
ID generation	Centralized counter	Distributed (Snowflake)	Distributed for multi-region

Q2: Design a Rate Limiter

Answer:

A rate limiter controls the rate of requests a client can send to an API, protecting against abuse, DDoS attacks, and resource exhaustion.

graph TD
    CLIENT["Client"]
    CLIENT --> RL["Rate Limiter
(middleware / API gateway)"]
    RL -->|"Under limit"| API["API Servers"]
    RL -->|"Over limit"| REJECT["429 Too Many Requests
Retry-After: 30"]
    RL --> STORE["Rules & Counter Store
(Redis)"]

    subgraph Algorithms
        A1["Fixed Window"]
        A2["Sliding Window Log"]
        A3["Sliding Window Counter"]
        A4["Token Bucket"]
        A5["Leaky Bucket"]
    end

    style RL fill:#56cc9d,stroke:#333,color:#fff
    style REJECT fill:#ff7851,stroke:#333,color:#fff

Requirements

Type	Requirement
Functional	Limit requests per client (IP, user ID, API key); Return rate limit headers; Support different limits per endpoint
Non-functional	Ultra-low latency (<1ms overhead); Distributed (works across multiple servers); Highly available; Accurate counting

Algorithm Comparison

Algorithm	How It Works	Pros	Cons
Fixed Window	Count requests in fixed time windows (e.g., per minute)	Simple, low memory	Burst at window boundaries (2x allowed)
Sliding Window Log	Store timestamp of each request, count in sliding window	Accurate	High memory (stores all timestamps)
Sliding Window Counter	Weighted count across current + previous window	Low memory, smooths window-boundary bursts	Approximate (assumes requests were evenly spread across the previous window)
Token Bucket	Tokens added at fixed rate, each request consumes one	Allows controlled bursts	Slightly complex
Leaky Bucket	Requests queue and process at fixed rate	Smooth output rate	Doesn’t allow bursts
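
The sliding window counter row is the least obvious of the five, so here is a single-process sketch of it. The class and its fields are hypothetical, and the caller is assumed to pass a non-decreasing timestamp in seconds. A distributed version would keep the two counters in Redis instead.

```python
class SlidingWindowCounter:
    """Approximate sliding window built from two fixed-window counters."""

    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window = window_seconds
        self.index = 0       # which fixed window we are currently in
        self.current = 0     # requests counted in the current window
        self.previous = 0    # requests counted in the window before it

    def allow(self, now: float) -> bool:
        index = int(now // self.window)
        if index != self.index:
            # Roll over: the old window only matters if it is directly adjacent.
            self.previous = self.current if index == self.index + 1 else 0
            self.current = 0
            self.index = index
        # Fraction of the previous window still inside the sliding window.
        overlap = 1.0 - (now % self.window) / self.window
        estimated = self.previous * overlap + self.current
        if estimated >= self.limit:
            return False
        self.current += 1
        return True


limiter = SlidingWindowCounter(limit=10, window_seconds=60)
first_window = [limiter.allow(float(t)) for t in range(11)]   # t = 0..10
next_window = [limiter.allow(75.0) for _ in range(4)]
```

At t=75 the previous window still overlaps the sliding window by 75%, so its 10 requests count as 7.5 and only 3 more are admitted. A fixed window would have admitted a fresh 10 at that point.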

Token Bucket Design (Recommended)

# Redis-based Token Bucket (distributed)
# Key: rate_limit:{client_id}
# Fields: tokens (float), last_refill (timestamp)
# Assumes `redis` is an async client such as redis.asyncio.Redis.

import time

async def is_allowed(redis, client_id: str, max_tokens: int, refill_rate: float) -> bool:
    """
    max_tokens: bucket capacity (e.g., 100)
    refill_rate: tokens per second (e.g., 10)
    """
    key = f"rate_limit:{client_id}"
    now = time.time()

    # Lua script for atomicity (no race conditions)
    lua_script = """
    local tokens = tonumber(redis.call('hget', KEYS[1], 'tokens') or ARGV[1])
    local last_refill = tonumber(redis.call('hget', KEYS[1], 'last_refill') or ARGV[3])
    local now = tonumber(ARGV[3])
    local max_tokens = tonumber(ARGV[1])
    local refill_rate = tonumber(ARGV[2])

    -- Refill tokens based on elapsed time
    local elapsed = now - last_refill
    tokens = math.min(max_tokens, tokens + elapsed * refill_rate)

    if tokens >= 1 then
        tokens = tokens - 1
        redis.call('hset', KEYS[1], 'tokens', tokens, 'last_refill', now)
        redis.call('expire', KEYS[1], 3600)
        return 1  -- Allowed
    else
        redis.call('hset', KEYS[1], 'tokens', tokens, 'last_refill', now)
        return 0  -- Rejected
    end
    """
    result = await redis.eval(lua_script, 1, key, max_tokens, refill_rate, now)
    return result == 1

Where to Place the Rate Limiter

Location	Pros	Cons
API Gateway (recommended)	Centralized, handles all services	Single point of failure
Middleware (per service)	Fine-grained, service-specific rules	Each service must implement
Client-side	Reduces unnecessary requests	Can be bypassed
CDN/Edge	Stops attacks before reaching origin	Limited rule flexibility

Q3: Design a Chat/Messaging System (WhatsApp)

Answer:

A real-time messaging system supports 1-on-1 and group messaging with delivery guarantees, presence tracking, and message persistence.

graph TD
    SENDER["Sender (Alice)"]
    SENDER -->|"WebSocket"| GW1["Chat Gateway
(maintains connections)"]
    GW1 --> MQ["Message Queue
(Kafka)"]
    MQ --> ROUTER["Message Router
(fan-out service)"]
    ROUTER --> GW2["Chat Gateway
(Bob's server)"]
    GW2 -->|"WebSocket"| RECEIVER["Receiver (Bob)"]

    MQ --> DB["Message Store
(Cassandra)"]
    ROUTER -.->|"Bob offline"| PUSH["Push Notification
(APNs / FCM)"]

    style GW1 fill:#56cc9d,stroke:#333,color:#fff
    style MQ fill:#ffce67,stroke:#333
    style DB fill:#6cc3d5,stroke:#333,color:#fff

Requirements

Type	Requirement
Functional	1-on-1 messaging; Group chat (up to 500 members); Sent/Delivered/Read receipts; Online/offline presence; Message history
Non-functional	Low latency (<300ms end-to-end); High availability (99.99%); Eventual consistency acceptable; 50B messages/day
Capacity	500M DAU; 100 messages/user/day; ~100 bytes/message → ~5TB/day

Key Components

Component	Technology	Purpose
Connection layer	WebSocket servers	Persistent bidirectional connections
Message queue	Kafka	Decouple send/receive, handle spikes
Message store	Cassandra	Write-heavy, append-only, partitioned by chat_id
User presence	Redis	Track online/offline status with TTL
Push notifications	APNs/FCM	Deliver to offline users
Media storage	S3 + CDN	Images, videos, voice notes

Message Delivery Flow

1. Alice sends message via WebSocket → Chat Gateway
2. Gateway publishes to Kafka topic (partitioned by chat_id)
3. Message Router consumes from Kafka:
   a. Persist message to Cassandra (status: "sent")
   b. Look up Bob's connection in Session Store (Redis)
   c. If online → push via WebSocket → update status: "delivered"
   d. If offline → send push notification
4. Bob opens app → fetch undelivered messages → update: "delivered"
5. Bob reads message → client sends ACK → update: "read"
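
Steps 3 to 5 of the flow above can be sketched with in-memory stand-ins. The `MessageRouter` class is hypothetical: dictionaries and lists replace Cassandra, the Redis session store, and the push provider, and the Kafka hop is omitted.

```python
class MessageRouter:
    def __init__(self):
        self.store = {}       # message_id -> record (stand-in for Cassandra)
        self.sessions = {}    # user_id -> inbox list (stand-in for a live WebSocket)
        self.push_queue = []  # (user_id, message_id) pairs handed to APNs/FCM

    def route(self, message_id: str, sender: str, recipient: str, body: str) -> str:
        record = {"from": sender, "to": recipient, "body": body, "status": "sent"}
        self.store[message_id] = record                      # step 3a: persist
        socket = self.sessions.get(recipient)                # step 3b: look up
        if socket is not None:
            socket.append(body)                              # step 3c: deliver
            record["status"] = "delivered"
        else:
            self.push_queue.append((recipient, message_id))  # step 3d: notify
        return record["status"]

    def connect(self, user_id: str) -> list:
        """Step 4: on reconnect, flush messages that are still only 'sent'."""
        socket = self.sessions.setdefault(user_id, [])
        for record in self.store.values():
            if record["to"] == user_id and record["status"] == "sent":
                socket.append(record["body"])
                record["status"] = "delivered"
        return socket

    def ack_read(self, message_id: str) -> None:
        """Step 5: the client acknowledges that the message was read."""
        self.store[message_id]["status"] = "read"


router = MessageRouter()
status_offline = router.route("m1", "alice", "bob", "hi Bob")
inbox = router.connect("bob")
router.ack_read("m1")
status_online = router.route("m2", "alice", "bob", "still there?")
```

Persisting before delivery is the important ordering: if the gateway crashes after step 3a, the message is still recoverable on Bob's next connect.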

Group Messaging Fan-out

Strategy	How It Works	Best For
Fan-out on write	Copy message to each member’s inbox at send time	Small groups (<100 members)
Fan-out on read	Store once, recipients pull on connect	Large groups / channels
Hybrid	Fan-out on write for small groups, on read for large	Production systems (WhatsApp)

Q4: Design a Social Media News Feed (Twitter/Instagram)

Answer:

A news feed system aggregates and ranks posts from users you follow, delivering a personalized, near real-time content stream.

graph TD
    POSTER["User Posts Tweet"]
    POSTER --> PS["Post Service"]
    PS --> DB["Post Store"]
    PS --> FANOUT["Fan-out Service"]
    FANOUT --> CACHE["Feed Cache
(per user, Redis)"]

    READER["User Opens Feed"]
    READER --> FS["Feed Service"]
    FS --> CACHE
    FS --> RANK["Ranking Service
(ML model)"]
    RANK --> FEED["Merged &
Ranked Feed"]

    style FANOUT fill:#56cc9d,stroke:#333,color:#fff
    style CACHE fill:#ffce67,stroke:#333
    style RANK fill:#6cc3d5,stroke:#333,color:#fff

Requirements

Type	Requirement
Functional	Create posts (text, images, video); Follow/unfollow users; View personalized news feed; Like, comment, share
Non-functional	Feed generation <500ms; High availability; 500M DAU; Eventual consistency acceptable
Capacity	2 posts/user/day → 1B posts/day; Average 300 followers; Feed shows top 20 posts

Fan-out Strategy: The Core Decision

graph LR
    subgraph FanOutWrite["Fan-out on Write (Push)"]
        POST1["New Post"] --> COPY["Copy to all
followers' feeds"]
        COPY --> F1["Alice's Feed Cache"]
        COPY --> F2["Bob's Feed Cache"]
        COPY --> F3["Carol's Feed Cache"]
    end

    subgraph FanOutRead["Fan-out on Read (Pull)"]
        OPEN["Open Feed"] --> FETCH["Fetch posts from
all followed users"]
        FETCH --> M1["User A's posts"]
        FETCH --> M2["User B's posts"]
        FETCH --> M3["User C's posts"]
    end

    style FanOutWrite fill:#56cc9d,stroke:#333,color:#fff
    style FanOutRead fill:#6cc3d5,stroke:#333,color:#fff

Aspect	Fan-out on Write (Push)	Fan-out on Read (Pull)
When	At post creation time	At feed request time
Latency	Fast reads (pre-computed)	Slow reads (compute on demand)
Write cost	High (copy to all followers)	Low (store once)
Hot users	Celebrity with 50M followers → 50M writes	No write amplification
Best for	Normal users (<10K followers)	Celebrities / high-follower users

Hybrid Approach (Twitter/Instagram’s Actual Design)

Normal users (<10K followers):
  → Fan-out on write: pre-compute feed for all followers
  → Feed is ready when they open the app

Celebrity users (>10K followers):
  → Fan-out on read: don't pre-compute
  → When user opens feed, merge:
      - Pre-computed feed (from normal users they follow)
      - On-demand fetch (from celebrities they follow)
  → Rank the merged result
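
The read-time merge can be sketched with a k-way merge over timelines that are already sorted newest first. The `build_feed` function is a hypothetical illustration, and plain recency ordering stands in for the ranking step, which in practice would use the signals listed under Feed Ranking.

```python
import heapq
from itertools import islice


def build_feed(precomputed, celebrity_timelines, limit=20):
    """Merge a pre-computed feed with celebrity timelines fetched on demand.

    Every input is a list of (timestamp, post_id) tuples sorted newest first.
    """
    merged = heapq.merge(
        precomputed,
        *celebrity_timelines,
        key=lambda post: post[0],
        reverse=True,  # inputs are in descending timestamp order
    )
    # heapq.merge is lazy, so only the first `limit` posts are materialized.
    return [post_id for _, post_id in islice(merged, limit)]


precomputed_feed = [(105, "a3"), (101, "a1")]          # from normal users
celebrity_posts = [[(110, "c2"), (100, "c1")], [(103, "d1")]]
feed = build_feed(precomputed_feed, celebrity_posts, limit=3)
```

Because the merge is lazy, the cost depends on the feed size (top 20) rather than on how many posts each celebrity has.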

Feed Ranking

Signal	Weight	Source
Recency	High	Post timestamp
Engagement	High	Likes, comments, shares on the post
Relationship	Medium	Interaction history with poster
Content type	Medium	User preference (video vs text)
Diversity	Low	Avoid showing too many posts from one user

Q5: Design a File Storage System (Dropbox/Google Drive)

Answer:

A cloud file storage system lets users upload, download, and sync files across devices with high reliability and global availability.

graph TD
    CLIENT["Desktop / Mobile Client"]
    CLIENT -->|"Upload (chunked)"| API["API Gateway"]
    API --> META["Metadata Service"]
    META --> METADB["Metadata DB
(MySQL)"]
    API --> UPLOAD["Upload Service"]
    UPLOAD --> QUEUE["Upload Queue"]
    QUEUE --> STORE["Object Storage
(S3)"]

    CLIENT2["Another Device"]
    CLIENT2 --> SYNC["Sync Service"]
    SYNC --> NOTIFY["Notification Service
(WebSocket / Long Poll)"]
    SYNC --> CDN["CDN
(Download cache)"]

    style API fill:#56cc9d,stroke:#333,color:#fff
    style STORE fill:#6cc3d5,stroke:#333,color:#fff
    style NOTIFY fill:#ffce67,stroke:#333

Requirements

Type	Requirement
Functional	Upload/download files; Sync across devices; File versioning; Share files/folders
Non-functional	High reliability (no data loss); Low latency downloads; Support files up to 10GB; 100M users, 1M DAU
Capacity	1 file per active user per day (1M DAU), avg 5MB → ~5TB/day; ~1.8PB/year before deduplication

Chunked Upload Design

Why chunking?
  - Resume interrupted uploads (mobile networks)
  - Deduplicate at chunk level (save storage)
  - Parallel upload of chunks (faster)
  - Delta sync: only upload changed chunks

Chunk size: 4MB (balance between overhead and resume granularity)

Upload flow:
  1. Client splits file into 4MB chunks
  2. Client computes SHA-256 hash per chunk
  3. Client asks server: "Do you have chunk with hash X?"
     - Yes → skip (deduplication)
     - No → upload chunk
  4. After all chunks uploaded → server assembles file
  5. Server updates metadata DB with file record
  6. Notification service alerts other devices to sync
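
Steps 1 to 3 of the upload flow can be sketched as below. The `ChunkStore` class is a hypothetical in-memory stand-in for object storage plus the `chunks` table, and the example uses a 4-byte chunk size purely to keep the data readable.

```python
import hashlib

CHUNK_SIZE = 4 * 1024 * 1024  # 4MB, as in the design above


def split_chunks(data: bytes, chunk_size: int = CHUNK_SIZE) -> list:
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


class ChunkStore:
    """Content-addressed storage: chunks are keyed by their SHA-256 hash."""

    def __init__(self):
        self.blobs = {}

    def has(self, chunk_hash: str) -> bool:
        return chunk_hash in self.blobs

    def put(self, chunk_hash: str, chunk: bytes) -> None:
        self.blobs[chunk_hash] = chunk


def upload(data: bytes, store: ChunkStore, chunk_size: int = CHUNK_SIZE):
    """Return (manifest, uploaded_count); the manifest lists hashes in order."""
    manifest, uploaded = [], 0
    for chunk in split_chunks(data, chunk_size):
        chunk_hash = hashlib.sha256(chunk).hexdigest()
        if not store.has(chunk_hash):      # "Do you have chunk with hash X?"
            store.put(chunk_hash, chunk)
            uploaded += 1
        manifest.append(chunk_hash)
    return manifest, uploaded


store = ChunkStore()
v1_manifest, v1_uploaded = upload(b"AAAABBBBCCCC", store, chunk_size=4)
v2_manifest, v2_uploaded = upload(b"AAAAXXXXCCCC", store, chunk_size=4)
rebuilt_v2 = b"".join(store.blobs[h] for h in v2_manifest)
```

The second version uploads only the one changed chunk, which is the delta sync benefit. The manifest corresponds to the rows of the `file_chunks` mapping.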

Data Model

-- Files metadata
CREATE TABLE files (
    file_id     UUID PRIMARY KEY,
    user_id     BIGINT NOT NULL,
    filename    VARCHAR(255),
    path        TEXT,
    size_bytes  BIGINT,
    checksum    VARCHAR(64),
    version     INT DEFAULT 1,
    created_at  TIMESTAMP,
    updated_at  TIMESTAMP
);

-- File chunks (for dedup and resume)
CREATE TABLE chunks (
    chunk_hash  VARCHAR(64) PRIMARY KEY,  -- SHA-256
    size_bytes  INT,
    s3_location TEXT,
    ref_count   INT DEFAULT 1  -- for garbage collection
);

-- File-to-chunk mapping
CREATE TABLE file_chunks (
    file_id     UUID,
    chunk_index INT,
    chunk_hash  VARCHAR(64),
    PRIMARY KEY (file_id, chunk_index)
);

Sync and Conflict Resolution

Scenario	Resolution Strategy
Same file edited on 2 devices	Create conflict copy with timestamp
File deleted on one device, edited on another	Keep the edited version, log deletion
Concurrent uploads of same new file	Last-write-wins or merge (depends on file type)
Offline edits	Queue changes locally, sync when online

Q6: Design a Video Streaming Platform (YouTube/Netflix)

Answer:

A video streaming platform handles upload, transcoding, storage, and adaptive delivery of video content to millions of concurrent viewers.

graph TD
    CREATOR["Content Creator"]
    CREATOR -->|"Upload video"| UPLOAD["Upload Service"]
    UPLOAD --> QUEUE["Transcoding Queue
(SQS/Kafka)"]
    QUEUE --> TRANSCODE["Transcoding Workers
(multiple resolutions)"]
    TRANSCODE --> STORE["Object Storage
(S3 / GCS)"]
    STORE --> CDN["CDN
(Edge servers worldwide)"]

    VIEWER["Viewer"]
    VIEWER -->|"Adaptive bitrate"| CDN
    VIEWER --> API["API Service
(metadata, search, recommendations)"]
    API --> METADB["Metadata DB"]

    style TRANSCODE fill:#56cc9d,stroke:#333,color:#fff
    style CDN fill:#ffce67,stroke:#333
    style STORE fill:#6cc3d5,stroke:#333,color:#fff

Requirements

Type	Requirement
Functional	Upload videos; Stream videos (adaptive bitrate); Search and browse; Like, comment, subscribe
Non-functional	Low startup latency (<2s); Smooth playback (no buffering); Global availability; 1B DAU, 5M videos uploaded/day
Capacity	Avg video: 200MB raw → 500MB transcoded (multiple resolutions); 5M uploads/day → ~1PB raw + ~2.5PB transcoded of new storage per day

Video Processing Pipeline

Upload → Original Storage → Transcoding → CDN Distribution

Transcoding outputs (per video):
  ┌────────────────────────────────────────┐
  │ Resolution   Bitrate    File Size      │
  │ 360p         800 kbps   ~50MB          │
  │ 480p         1.5 Mbps   ~100MB         │
  │ 720p         3 Mbps     ~200MB         │
  │ 1080p        6 Mbps     ~400MB         │
  │ 4K           20 Mbps    ~1.5GB         │
  └────────────────────────────────────────┘
  + Audio tracks (multiple languages)
  + Subtitles (multiple languages)
  + Thumbnail generation (every 10s for preview)

Adaptive Bitrate Streaming (ABR)

graph LR
    CLIENT["Video Player"]
    CLIENT -->|"Measures bandwidth"| ABR["ABR Algorithm
(DASH / HLS)"]
    ABR -->|"Good network"| HD["1080p chunks"]
    ABR -->|"Poor network"| SD["480p chunks"]
    ABR -->|"Very poor"| LOW["360p chunks"]

    style ABR fill:#56cc9d,stroke:#333,color:#fff

Protocol	Used By	Segment Size
HLS (HTTP Live Streaming)	Apple, most platforms	2-10s segments
DASH (Dynamic Adaptive Streaming)	YouTube, Netflix	2-10s segments
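
A throughput-based rendition picker, using the bitrate ladder from the transcoding table. This is a simplified sketch: the `pick_rendition` function and the 0.8 safety factor are assumptions, and real HLS/DASH players typically also take buffer level into account.

```python
# (name, bitrate in kbps), lowest first, taken from the transcoding table above.
LADDER = [
    ("360p", 800),
    ("480p", 1500),
    ("720p", 3000),
    ("1080p", 6000),
    ("4K", 20000),
]


def pick_rendition(measured_kbps: float, safety: float = 0.8) -> str:
    """Pick the highest rendition whose bitrate fits within the bandwidth budget.

    `safety` leaves headroom so small throughput dips do not cause rebuffering.
    """
    budget = measured_kbps * safety
    choice = LADDER[0][0]  # never go below the lowest rendition
    for name, kbps in LADDER:
        if kbps <= budget:
            choice = name
    return choice


good_network = pick_rendition(10_000)   # budget 8000 kbps
poor_network = pick_rendition(2_000)    # budget 1600 kbps
very_poor = pick_rendition(500)         # budget 400 kbps, below every rung
```

The player re-runs this decision for each segment, which is why segments are kept to a few seconds: shorter segments let it react faster to bandwidth changes.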

CDN Strategy

Approach	Description
Push popular content	Pre-load trending videos to edge servers
Pull on demand	Edge fetches from origin on first request, then caches
Regional origin	Multiple origin servers in different regions
Long tail	Less popular content served from fewer / central CDN nodes

Q7: Design a Notification System

Answer:

A notification system delivers timely, relevant notifications across multiple channels (push, SMS, email) to billions of users.

graph TD
    TRIGGER["Event Triggers
(order shipped, new follower, etc.)"]
    TRIGGER --> NS["Notification Service"]
    NS --> PREF["User Preferences
(channels, frequency, opt-outs)"]
    NS --> TEMPLATE["Template Service
(personalize message)"]
    NS --> QUEUE["Priority Queues
(Kafka / SQS)"]

    QUEUE --> PUSH["Push Worker
(APNs / FCM)"]
    QUEUE --> SMS["SMS Worker
(Twilio)"]
    QUEUE --> EMAIL["Email Worker
(SES / SendGrid)"]

    PUSH --> USER["User Device"]
    SMS --> USER
    EMAIL --> USER

    style NS fill:#56cc9d,stroke:#333,color:#fff
    style QUEUE fill:#ffce67,stroke:#333

Requirements

Type	Requirement
Functional	Multi-channel: push, SMS, email, in-app; User preferences and opt-out; Notification templates; Rate limiting per user
Non-functional	Soft real-time (<30s for push, minutes for email); At-least-once delivery; 10B notifications/day; Pluggable providers

Architecture Deep Dive

Event flow:
  1. Service emits event: {"type": "order_shipped", "user_id": 123, "data": {...}}
  2. Notification Service:
     a. Check user preferences (opted-in channels, quiet hours)
     b. Check rate limits (max 5 push/hour per user)
     c. Render template with user data
     d. Enqueue to channel-specific queues with priority
  3. Channel workers:
     a. Dequeue message
     b. Call provider API (APNs, Twilio, SES)
     c. Handle retries with exponential backoff
     d. Log delivery status
  4. Analytics:
     - Track: sent, delivered, opened, clicked, unsubscribed

Handling Scale and Reliability

Challenge	Solution
Provider failures	Retry with exponential backoff + fallback providers
Duplicate notifications	Idempotency key per notification (dedup in Redis)
Quiet hours / time zones	Store user timezone, schedule delivery accordingly
Notification fatigue	Rate limiting + batching (digest emails)
Provider rate limits	Queue with controlled concurrency per provider
Delivery tracking	Webhook callbacks from providers + polling
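
The first row of the table (retry with exponential backoff) can be sketched as follows. The `send_with_retry` function and the `FlakyProvider` test double are hypothetical. The sketch assumes provider failures surface as `ConnectionError`, and it injects `sleep` and `jitter` so the behavior can be checked without waiting.

```python
import random
import time


def send_with_retry(send, payload, max_attempts=5, base_delay=1.0,
                    max_delay=60.0, sleep=time.sleep, jitter=random.random):
    """Call send(payload), retrying on ConnectionError with capped backoff."""
    for attempt in range(max_attempts):
        try:
            return send(payload)
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # retries exhausted; the caller can try a fallback provider
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            # Randomizing the wait spreads out retries from many workers.
            sleep(ceiling * jitter())


class FlakyProvider:
    """Test double that fails a fixed number of times before succeeding."""

    def __init__(self, failures_before_success: int):
        self.remaining_failures = failures_before_success
        self.calls = 0

    def __call__(self, payload):
        self.calls += 1
        if self.remaining_failures > 0:
            self.remaining_failures -= 1
            raise ConnectionError("provider unavailable")
        return f"delivered:{payload}"


delays = []
provider = FlakyProvider(failures_before_success=2)
result = send_with_retry(
    provider, "order_shipped", sleep=delays.append, jitter=lambda: 1.0
)
```

With jitter pinned to 1.0 the waits are 1s then 2s. Because a retry may repeat a send that actually succeeded, this pairs with the idempotency key in the second row.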

Q8: Design a Search Autocomplete System

Answer:

An autocomplete system suggests query completions in real time as users type, based on popularity, personalization, and recency.

graph TD
    USER["User types: 'syst'"]
    USER --> API["Autocomplete API
(<100ms response)"]
    API --> CACHE["Local Cache
(per server)"]
    CACHE -->|"miss"| TRIE["Trie Service
(in-memory)"]
    TRIE --> RESULTS["Top-K results:
1. system design
2. systems programming
3. systematic review"]

    subgraph Offline["Offline Pipeline (hourly)"]
        LOGS["Search Logs"] --> AGG["Aggregation
(MapReduce)"]
        AGG --> BUILD["Build Trie
(top queries)"]
        BUILD --> DEPLOY["Deploy to
Trie Servers"]
    end

    style API fill:#56cc9d,stroke:#333,color:#fff
    style TRIE fill:#6cc3d5,stroke:#333,color:#fff
    style Offline fill:#ffce67,stroke:#333

Requirements

Type	Requirement
Functional	Return top 5-10 suggestions per prefix; Rank by popularity / recency / personalization; Handle misspellings (fuzzy match)
Non-functional	P99 latency <100ms; Support 100K QPS; Update suggestions without downtime

Trie Data Structure

Trie for ["system", "systems", "syslog", "syntax"]:

         root
          |
          s
          |
          y
         / \
        s   n
       / \   \
      t   l   t
      |   |   |
      e   o   a
      |   |   |
      m*  g*  x*
      |
      s*

(* marks the end of a complete query)

Each node stores:
  - Character
  - Top-K queries passing through this prefix
  - Frequency / score for ranking
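
A compact sketch of a trie whose nodes cache their top-K completions, so a lookup only has to walk the prefix. The class names and scores are made up for illustration. It assumes each query is inserted once with its aggregated score, as the offline build would do.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.top = []  # cached (score, query) pairs for this prefix, best first


class AutocompleteTrie:
    def __init__(self, k: int = 3):
        self.k = k
        self.root = TrieNode()

    def insert(self, query: str, score: int) -> None:
        node = self.root
        for ch in query:
            node = node.children.setdefault(ch, TrieNode())
            # Keep only the K best queries that pass through this prefix.
            node.top.append((score, query))
            node.top.sort(key=lambda item: (-item[0], item[1]))
            del node.top[self.k:]

    def suggest(self, prefix: str) -> list:
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        return [query for _, query in node.top]


trie = AutocompleteTrie(k=3)
for query, score in [("system", 900), ("systems", 400),
                     ("syslog", 250), ("syntax", 700)]:
    trie.insert(query, score)
```

Here `trie.suggest("sy")` returns `["system", "syntax", "systems"]` and `trie.suggest("sys")` returns `["system", "systems", "syslog"]`. Lookup cost is proportional to the prefix length, not to the number of stored queries, at the price of extra memory per node.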

Two-Phase Architecture

Phase	Component	Latency	Frequency
Online (serve)	Trie servers + cache	<100ms	Per keystroke
Offline (build)	MapReduce + Trie builder	Minutes	Every 15-60 min

Ranking Signals

Signal	Description	Weight
Query frequency	How often this query is searched	High
Recency	Trending queries weighted higher	Medium
Personalization	User’s past search history	Medium
Freshness	New events (e.g., breaking news)	Variable

Q9: Design a Distributed Key-Value Store

Answer:

A distributed key-value store provides fast, reliable storage and retrieval of data across a cluster of machines, handling partitioning, replication, and failure recovery.

graph TD
    CLIENT["Client"]
    CLIENT --> COORD["Coordinator Node
(routes to correct partition)"]
    COORD --> N1["Node 1
(keys A-H)"]
    COORD --> N2["Node 2
(keys I-P)"]
    COORD --> N3["Node 3
(keys Q-Z)"]

    N1 --> R1["Replica 1a"]
    N1 --> R2["Replica 1b"]

    subgraph Ring["Consistent Hashing Ring"]
        H1["Hash(key) →
walk clockwise →
find node"]
    end

    style COORD fill:#56cc9d,stroke:#333,color:#fff
    style Ring fill:#ffce67,stroke:#333

Requirements

Type	Requirement
Functional	`get(key) → value`; `put(key, value)`; `delete(key)`; Support arbitrary value sizes (up to 1MB)
Non-functional	High availability (AP system); Tunable consistency; Low latency (<10ms P99); Horizontal scaling (add nodes without downtime)

CAP Theorem Trade-offs

graph TD
    CAP["CAP Theorem:
under a partition, pick C or A"]
    CAP --> C["Consistency
(every read gets latest write)"]
    CAP --> A["Availability
(every request gets a response)"]
    CAP --> P["Partition Tolerance
(works despite network failures)"]

    C --- CP["CP Systems:
MongoDB, HBase, etcd"]
    A --- AP["AP Systems:
Cassandra, DynamoDB, CouchDB"]

    style CAP fill:#56cc9d,stroke:#333,color:#fff
    style CP fill:#6cc3d5,stroke:#333,color:#fff
    style AP fill:#ffce67,stroke:#333

Key Design Components

Component	Design Choice	Rationale
Partitioning	Consistent hashing with virtual nodes	Even distribution, minimal reshuffling when nodes join/leave
Replication	Replicate to N=3 clockwise neighbors	Fault tolerance
Consistency	Quorum: W + R > N (configurable)	Tunable: W=1,R=3 (fast writes) or W=2,R=2 (balanced)
Conflict resolution	Vector clocks + last-write-wins	Handle concurrent writes during partitions
Failure detection	Gossip protocol	Decentralized, scalable node health checks
Write path	Write-ahead log → MemTable → SSTable	Fast writes, durable, efficient reads
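
The partitioning and replication rows can be sketched as a hash ring with virtual nodes. The `HashRing` class is a hypothetical illustration, and the choices of SHA-256 and 100 virtual nodes per server are assumptions rather than requirements.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class HashRing:
    def __init__(self, nodes, vnodes: int = 100):
        self.vnodes = vnodes
        self._ring = []  # sorted (position, node) pairs
        for node in nodes:
            self.add(node)

    def add(self, node: str) -> None:
        # Each physical node owns many small arcs, which evens out the load.
        for i in range(self.vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def remove(self, node: str) -> None:
        self._ring = [entry for entry in self._ring if entry[1] != node]

    def preference_list(self, key: str, n: int = 3) -> list:
        """Walk clockwise from hash(key), collecting the first n distinct nodes."""
        if not self._ring:
            return []
        start = bisect.bisect_left(self._ring, (_hash(key), ""))
        owners = []
        for offset in range(len(self._ring)):
            node = self._ring[(start + offset) % len(self._ring)][1]
            if node not in owners:
                owners.append(node)
                if len(owners) == n:
                    break
        return owners


ring = HashRing(["node-1", "node-2", "node-3", "node-4"])
keys = [f"user:{i}" for i in range(1000)]
before = {k: ring.preference_list(k, n=1)[0] for k in keys}
ring.add("node-5")
after = {k: ring.preference_list(k, n=1)[0] for k in keys}
moved = [k for k in keys if before[k] != after[k]]
```

Adding a fifth node relocates only the keys that the new node takes over (roughly a fifth of them), whereas `hash(key) % N` would reshuffle most keys.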

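Consistency Levels (Dynamo/Cassandra Style)

Setting	Write (W)	Read (R)	Behavior
Strong (write-all)	W=N	R=1	Always latest value; writes fail if any replica is down
Strong (read-all)	W=1	R=N	Always latest value; reads fail if any replica is down
Quorum (N=3)	W=2	R=2	Latest value if no concurrent writes
Eventual	W=1	R=1	Fastest, may read stale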
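
The quorum rule W + R > N works because any write set and any read set must then share at least one replica. The brute-force check below is a small illustration of that, with a hypothetical function name.

```python
from itertools import combinations


def quorums_always_overlap(n: int, w: int, r: int) -> bool:
    """True if every W-replica write set intersects every R-replica read set."""
    replicas = range(n)
    return all(
        set(write_set) & set(read_set)
        for write_set in combinations(replicas, w)
        for read_set in combinations(replicas, r)
    )


balanced = quorums_always_overlap(n=3, w=2, r=2)      # W + R = 4 > 3
fast_writes = quorums_always_overlap(n=3, w=1, r=3)   # W + R = 4 > 3
eventual = quorums_always_overlap(n=3, w=1, r=1)      # W + R = 2 <= 3
```

With W=1, R=1 a read can land on a replica the write never reached, which is exactly the "may read stale" case.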

Q10: Design an API Gateway and Load Balancer

Answer:

An API Gateway is the single entry point for all client requests, handling routing, authentication, rate limiting, and protocol translation. A Load Balancer distributes traffic across backend servers for high availability and throughput.

graph TD
    CLIENTS["Clients
(Web, Mobile, Partners)"]
    CLIENTS --> GW["API Gateway"]
    GW --> AUTH["Auth Plugin
(JWT / OAuth)"]
    GW --> RL["Rate Limiter"]
    GW --> ROUTE["Request Router"]
    GW --> TRANSFORM["Protocol Translation
(REST ↔ gRPC)"]

    ROUTE --> LB1["Load Balancer
(User Service)"]
    ROUTE --> LB2["Load Balancer
(Order Service)"]
    ROUTE --> LB3["Load Balancer
(Search Service)"]

    LB1 --> US1["User Svc 1"]
    LB1 --> US2["User Svc 2"]
    LB1 --> US3["User Svc 3"]

    style GW fill:#56cc9d,stroke:#333,color:#fff
    style LB1 fill:#ffce67,stroke:#333

API Gateway Responsibilities

Function	Description
Routing	Route `/api/users/` → User Service, `/api/orders/` → Order Service
Authentication	Validate JWT/OAuth tokens before forwarding
Rate limiting	Per-client, per-endpoint throttling
Request transformation	REST ↔ gRPC, request/response rewriting
Circuit breaker	Stop forwarding to unhealthy services
Caching	Cache GET responses for static/semi-static data
Logging & metrics	Centralized request logging, latency tracking
SSL termination	Handle HTTPS at the edge

Load Balancing Algorithms

Algorithm	How It Works	Best For
Round Robin	Distribute sequentially to each server	Equal-capacity servers
Weighted Round Robin	Higher-capacity servers get more requests	Mixed hardware
Least Connections	Route to server with fewest active connections	Variable request duration
IP Hash	Hash client IP → always same server	Session affinity
Consistent Hashing	Hash-ring-based routing	Cache servers, stateful services
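
Two of the algorithms above, sketched as small in-process selectors. The class names are hypothetical, and ties in least connections are broken by server order, which is an arbitrary choice.

```python
import itertools


class RoundRobin:
    def __init__(self, servers):
        self._cycle = itertools.cycle(list(servers))

    def pick(self) -> str:
        return next(self._cycle)


class LeastConnections:
    def __init__(self, servers):
        self.active = {server: 0 for server in servers}

    def acquire(self) -> str:
        """Pick the server with the fewest in-flight requests."""
        server = min(self.active, key=self.active.get)
        self.active[server] += 1
        return server

    def release(self, server: str) -> None:
        """Call when the request finishes."""
        self.active[server] -= 1


rr = RoundRobin(["a", "b", "c"])
rr_order = [rr.pick() for _ in range(4)]

lc = LeastConnections(["a", "b", "c"])
held = [lc.acquire() for _ in range(3)]   # one long request on each server
lc.release("b")                           # b finishes first
next_pick = lc.acquire()
```

Round robin would have sent the fourth request to `a` regardless of load. Least connections sends it to `b`, the only idle server, which is why it suits variable request durations.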

Health Checks

Active health checks:
  - Gateway pings /health on each backend every 5-10s
  - Unhealthy after 3 consecutive failures
  - Healthy after 2 consecutive successes
  - Remove unhealthy servers from rotation

Passive health checks:
  - Monitor response codes and latency
  - If >50% of requests to a server fail → mark unhealthy
  - Automatic recovery when success rate improves
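
The two kinds of check above can be sketched as small state trackers, one per backend. The thresholds (3 failures, 2 successes, 50% failure rate) come from the list; the class names and the `window` and `min_samples` parameters are assumptions added for illustration, and the actual probing or request handling is left to the caller.

```python
from collections import deque


class ActiveHealthTracker:
    """Flips state after consecutive probe results, as in the active check list."""

    def __init__(self, unhealthy_after=3, healthy_after=2):
        self.unhealthy_after = unhealthy_after
        self.healthy_after = healthy_after
        self.healthy = True
        self._failures = 0
        self._successes = 0

    def record(self, probe_ok: bool) -> bool:
        """Record one /health probe result and return the current state."""
        if probe_ok:
            self._successes += 1
            self._failures = 0  # a success breaks the failure streak
            if not self.healthy and self._successes >= self.healthy_after:
                self.healthy = True
        else:
            self._failures += 1
            self._successes = 0  # a failure breaks the success streak
            if self.healthy and self._failures >= self.unhealthy_after:
                self.healthy = False
        return self.healthy


class PassiveHealthTracker:
    """Judges a backend by the failure rate of its recent real requests."""

    def __init__(self, window=20, min_samples=5, max_failure_rate=0.5):
        self._results = deque(maxlen=window)  # keeps only the newest results
        self.min_samples = min_samples        # assumption: avoid judging on 1-2 requests
        self.max_failure_rate = max_failure_rate

    def record(self, request_ok: bool) -> bool:
        self._results.append(request_ok)
        return self.healthy

    @property
    def healthy(self) -> bool:
        if len(self._results) < self.min_samples:
            return True
        failure_rate = self._results.count(False) / len(self._results)
        return failure_rate <= self.max_failure_rate
```

A load balancer would consult `healthy` before including a server in rotation. Requiring consecutive results in both directions keeps a single slow probe from ejecting a server and a single lucky probe from readmitting one.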

High Availability Design

Layer	Redundancy Strategy
API Gateway	Multiple instances behind DNS round-robin or network LB
Load Balancer	Active-passive pair with virtual IP failover
Backend services	Minimum 3 instances per service, across availability zones
Database	Primary-replica with automatic failover
Cache	Redis Cluster with replication

Summary Table

#	System	Key Concepts
1	URL Shortener	Base62 encoding, key generation, read-heavy caching, 301 vs 302
2	Rate Limiter	Token bucket, sliding window, Redis counters, API gateway placement
3	Chat System	WebSocket, message queues, Cassandra, fan-out, delivery receipts
4	News Feed	Fan-out on write vs read, hybrid approach, feed ranking
5	File Storage	Chunked upload, deduplication, delta sync, conflict resolution
6	Video Streaming	Transcoding pipeline, adaptive bitrate, CDN, HLS/DASH
7	Notification System	Multi-channel, priority queues, rate limiting, template rendering
8	Search Autocomplete	Trie, offline pipeline, top-K ranking, two-phase architecture
9	Key-Value Store	Consistent hashing, CAP theorem, quorum reads/writes, vector clocks
10	API Gateway	Routing, auth, rate limiting, load balancing algorithms, circuit breaker

System Design Interview Framework

Use this framework for any system design question:

Step	Duration	What to Do
1. Requirements	5 min	Clarify functional + non-functional; estimate scale (QPS, storage)
2. High-Level Design	10-15 min	Draw core components; define APIs; identify data flow
3. Deep Dive	15-20 min	Database schema; algorithm choices; scaling strategies
4. Wrap Up	5 min	Review requirements; discuss bottlenecks; suggest improvements

What’s Next?

This article covered the top 10 system design interview questions. For related content:

Foundational concepts: System Design Interview QA - 1
Infrastructure deep dives: System Design Interview QA - 2
Design patterns: Design Pattern Interview QA - 1
Enterprise patterns (Spring, CQRS, MVC): Design Pattern Interview QA - 2
Production API design: Python SWE Interview QA - 4