System Design Notes

Notes about System Design

Last Updated: June 01, 2023 by Pepe Sandoval

Want to show support?

If you find the information in this page useful and want to show your support, you can make a donation

Use PayPal

This will help me create more stuff and fix the existent content...

System Design

System Design Intro & Basics

System Design Diagrams

  • Stacked service boxes represent a load balancer

  • Connection dividers should be labeled with the protocol used

  • DBs are represented as cylinders/barrels

  • Sync requests are represented with a solid arrow; async requests are represented with a queue cylinder or a dotted line

  • Avoid crossing lines; follow a left-to-right, top-to-bottom flow for the diagram and align elements

    System Design Diagram Blocks

Estimates on System Design

  • The important thing to remember for System Design is the access-speed ordering: CPU Cache > RAM > Disk > Local Network > Global Network

  • Use capacity and throughput numbers to estimate whether more than one instance of a certain resource is needed

    Estimated System Design

Load Balancer

  • It may be a physical machine or a virtual machine, but it will be a separate instance from your app

  • A load balancer runs software called a reverse proxy

  • The goal of this software is to distribute the traffic between multiple servers/services

  • Load balancer uses a strategy to distribute traffic. For example:

    • Round-Robin: Distribute requests one per server/service, circling back to the first server/service after reaching the last one
    • Least Connections: Distribute requests to the servers/services with the fewest requests being handled or active connections
    • Resource Based: Distribute requests based on which servers/services have more available resources
    • Random: Randomly pick servers/services to handle requests
    • Weighted: variants of the other strategies
  • Load balancer pros:

    • Makes our system more resilient (if one server/service fails, the load balancer can route to an operational one)
    • Makes the system more scalable (horizontally)
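The two most common strategies above can be sketched in a few lines. This is an illustrative sketch, not a real load balancer: the server names and the connection counters are made up for the example.

```python
from itertools import cycle

# Hypothetical pool of app servers sitting behind the load balancer.
servers = ["app-1", "app-2", "app-3"]

# Round-Robin: hand requests to servers in a fixed rotation.
rotation = cycle(servers)

def round_robin() -> str:
    return next(rotation)

# Least Connections: track active connections and pick the least loaded server.
active = {s: 0 for s in servers}

def least_connections() -> str:
    server = min(active, key=active.get)
    active[server] += 1   # request assigned; decrement when it finishes
    return server

print([round_robin() for _ in range(4)])  # ['app-1', 'app-2', 'app-3', 'app-1']
```

Note how round-robin ignores load entirely, while least-connections needs the balancer to keep per-server state, which is part of why real reverse proxies track connection counts.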


Types of Load Balancers

  • Layer 4

    • Works at the transport layer level
    • Makes decisions based on IP, port, or whether the request is TCP or UDP
    • Simpler load balancers
  • Layer 7

    • Works at the App layer level
    • Has access to everything a Layer 4 load balancer has, plus HTTP headers, cookies and payload
    • They can make decisions based on the data in the request
    • More complex load balancers

CDN - Content Delivery Network

  • You have your app on servers hosted at a main site, and then a Content Delivery Network (other servers spread across the globe) stores the static content of your app so users can fetch it from a closer server
  • A CDN is a cache for certain content (usually static content like images, CSS, JS, HTML files, etc.)
  • The intention of a CDN is to decrease latency by placing assets closer to the users

Types of CDN

  • Push CDN

    • When you upload new data or a file, it is downloaded by all CDN servers
    • Good for when you DON'T have much content. It is expensive because all CDN servers need to have all your data
  • Pull CDN

    • If a user requests content that the CDN doesn't have, only that particular CDN server will download the content from the main app server and have it ready
    • If some content is never requested by any user, a Pull CDN will never have it
    • Good for when you have lots of content. If you are the first user to request something, it will be very slow for you


Cache

  • The goal of a cache is to speed up getting some data from a slower storage by storing part of this data in a faster storage. For example:
    • Store something in RAM instead of accessing the disk
    • Browser storing data on disk so it won't need to access the network
  • In general we use a cache to improve read performance (latency) and reduce load (throughput), but we pay the price with complexity and dealing with inconsistent data

Caching Strategies

  • Cache aside:
    • App has access to both the cache and the main storage; it checks the cache, and if the data is not there, fetches it from storage and updates the cache
    • + We cache only what's needed
    • - Cache misses are expensive (slow) and require someone to implement this complexity
  • Read Through:
    • App doesn't have direct access to the storage, but instead always interacts with the cache API; in case of a miss the API is in charge of fetching data from storage and updating the cache
    • + We cache only what's needed; the cache is abstracted from the app, the API deals with that complexity
    • - Cache misses are expensive (slow) and the API still deals with the complexity of handling a miss
  • Write Through:
    • App interacts with an API that, for each update, immediately also stores the data in the cache.
    • + Data needed from the cache is almost always up-to-date
    • - Writes are expensive and there is a chance that we will write data to the cache that no one ever reads (redundant unused data)
  • Write Behind:
    • App interacts with an API that for each update also stores data in the cache, but not immediately; the API waits for a certain timeout or number of write events before flushing everything to storage.
    • + Writes seem faster to the user because the slow storage is not accessed on every write
    • - Not fully reliable: if the cache crashes before the timeout occurs or before the required events happen we may lose data, and not flushing often enough may create inconsistencies.
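The cache-aside strategy above can be sketched with two plain dictionaries standing in for the cache and the slow storage; the key names and data are made up for the example.

```python
# Hypothetical slow storage and fast cache; both are plain dicts here,
# standing in for a database and something like Redis/memcached.
storage = {"user:1": {"name": "Ada"}}
cache = {}

def get_user(key: str):
    # Cache aside: the app checks the cache first...
    if key in cache:
        return cache[key]            # cache hit: fast path
    value = storage.get(key)         # cache miss: slow fetch from storage
    cache[key] = value               # ...then the app itself updates the cache
    return value

get_user("user:1")   # miss: fetched from storage, now cached
get_user("user:1")   # hit: served from the cache
```

Read-through looks the same from the outside; the difference is that the miss-handling logic lives inside the cache API rather than in the app.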


Eviction Policies

  • These are the strategies a cache can follow once it becomes too large

  • Eviction policies help us decide which elements we should remove from the cache

  • LRU (Least Recently Used):

    • A linked list where the head of the list points to the next element to be removed
    • When an element is accessed (cache hit) it's moved to the end of the list; this makes the least recently used elements end up at the start of the list
    • When an element is not found (cache miss) and needs to be fetched from storage, it is added to the end of the list and pushes one element out
    • Suffers from False Cache Eviction, which means that if a lot of new elements are requested at once it may evict some other frequently accessed ones
  • LFU (Least Frequently Used)

    • Every element has a counter that is incremented every certain time (a higher counter signals a less-used element)
    • When there is a cache hit (element accessed) the counter is reset to 0
    • When there is a cache miss (element needs to be fetched) the algorithm evicts the element with the highest counter
    • Suffers less from False Cache Eviction but adds the complexity and overhead of keeping track of the counters
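A minimal LRU cache can be built on an ordered dictionary instead of a hand-rolled linked list: insertion order doubles as recency order. This is a sketch of the policy, not a production cache (no thread safety, no TTLs).

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU cache: the least recently used entry is evicted first."""
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.data = OrderedDict()    # insertion order doubles as recency order

    def get(self, key):
        if key not in self.data:
            return None                  # cache miss
        self.data.move_to_end(key)       # hit: move to the "recent" end
        return self.data[key]

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)   # evict the least recently used

cache = LRUCache(2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")      # "a" is now the most recently used
cache.put("c", 3)   # evicts "b", the least recently used
```

Accessing "a" before inserting "c" is exactly the move-to-end step the notes describe; without it, "a" would have been the one evicted.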


Redis

  • Redis is a popular cache solution
  • It is an in-memory key-value store, which means it acts similar to a dictionary that lives in RAM
  • It supports TTL (Time To Live): you specify a key-value pair and tell it for how long it should be stored, in seconds
  • It supports persistence, so it can save to disk every certain amount of time
  • Usually it is a remote service, so there is some network latency
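The TTL behavior can be sketched with a small in-memory store that expires keys lazily on access. This is a toy stand-in for Redis-style expiry, not the real thing (the class name and its lazy-expiry design are choices made for the example).

```python
import time

class TTLStore:
    """Toy in-memory key-value store with per-key expiry, Redis-style."""
    def __init__(self):
        self.data = {}   # key -> (value, expiry timestamp)

    def set(self, key, value, ttl_seconds: float):
        self.data[key] = (value, time.monotonic() + ttl_seconds)

    def get(self, key):
        entry = self.data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            del self.data[key]   # lazily expire the key on access
            return None
        return value

store = TTLStore()
store.set("session:42", "alice", ttl_seconds=0.05)
store.get("session:42")   # "alice" while the TTL has not elapsed
time.sleep(0.1)
store.get("session:42")   # None: the key has expired
```

Real Redis combines this lazy check with a background sweep, so expired keys do not linger in memory just because nobody reads them.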


Queues

  • In system design, when we refer to a queue people usually mean a piece of software that runs on dedicated hardware to process async requests
  • One service puts messages on a queue (producer) and another service processes those messages (consumer); messages are usually stored in the queue for some time so that, in case the producer or consumer crashes, messages can be preserved while one or both of the services recover
  • Queues increase latency

Messaging Models

  • Message Queue
    • The queue receives a message from a producer and chooses only one consumer to process the message
    • In this model the queue makes sure that in the end the message is processed by only one consumer
    • If the chosen consumer is not available the queue will try to assign the message to another
    • Message queues indicate an asynchronous action, so messages could be unordered
    • RabbitMQ is a popular queue service best used as a message queue
  • Publisher/Subscriber (Pub/Sub)
    • Used when you want to notify multiple consumers about an event
    • Services subscribe to a message queue, and every time a producer/publisher puts a message on the queue all the subscribers/consumers are notified about it
    • Pub/sub queues are used for notification delivery and ensure ordered messages
    • Kafka is a pub/sub system with high throughput
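The producer/consumer shape of the message-queue model can be sketched in-process with a thread-safe queue. This is an illustration of the pattern only; a real broker like RabbitMQ adds persistence, acknowledgements and redelivery, none of which appear here.

```python
import queue
import threading

# In-process stand-in for a message queue; the message names are made up.
q = queue.Queue()
processed = []

def consumer():
    while True:
        message = q.get()        # blocks until a message is available
        if message is None:      # sentinel value: shut the consumer down
            break
        processed.append(f"handled {message}")

worker = threading.Thread(target=consumer)
worker.start()

# The producer puts messages; the consumer processes them asynchronously.
for i in range(3):
    q.put(f"order-{i}")
q.put(None)                      # tell the consumer to stop
worker.join()
print(processed)                 # ['handled order-0', 'handled order-1', 'handled order-2']
```

With a single consumer the messages stay ordered; add a second consumer thread reading the same queue and each message is still handled exactly once, but the overall completion order is no longer guaranteed, which is the unordered behavior the notes mention.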

Queue Messaging Models



TCP

  • TCP is the most common protocol when developing distributed systems. It's the backbone of the Internet. (HTTP is based on TCP)
  • Reliable: establishes a connection and sends the message in packets, each acknowledged and retried in case it is not ACKed by the receiver
  • Ordered: numbers the packets sent (sequence number) to preserve order
  • Error-Checked: a checksum is added to each packet sent, so the receiver can calculate the checksum and verify the packet is correct
  • Slow & Complex: all the logic to preserve order and make sure packets arrive makes it slow and complex to implement


UDP

  • Fast & Simple: it has none of the complexity of TCP, making it faster and easier to implement
  • Unreliable, Unordered, Error-prone: packets of a message may be lost, arrive out of order or be corrupted
  • UDP is good for a constant stream of data (because you'll probably receive the next message soon enough, it doesn't matter much if some packets are lost or corrupted)


HTTP

  • HyperText (text with links to other documents) Transfer Protocol.
  • It is a request-response protocol based on TCP
    • An HTTP Request consists of 4 parts: Method, URL, Headers and Body (Optional)
    • An HTTP Response consists of 3 parts: Status, Headers and Body (Optional)
  • HTTP Methods:
    • POST - Create
    • GET - Read
    • PUT - Update
    • DELETE - Delete
    • PATCH - Partial Update
  • Common Status Codes
    • 100-199 Informational
      • 100 Continue (Used to validate headers before actually sending something)
    • 200-299 Successful
      • 200 OK
      • 201 Created
    • 300-399 Redirection
      • 301 Moved Permanently
    • 400-499 Client Error (Client did something wrong)
      • 401 Unauthorized
      • 403 Forbidden
      • 404 Not Found
    • 500-599 Server Error
      • 500 Internal Server Error
      • 503 Service Unavailable


REST API

  • Try to use pairs: the first part is the resource name in plural, the second is the ID of the resource, i.e. GET /users/123
  • For nested resources repeat the pairs pattern, i.e. GET /users/123/books/567
  • Use PUT to switch state, i.e. PUT /users/567/disable or PUT /users/567/enable
  • Do NOT use GET to change an entity or resource, i.e. DON'T: GET /users/123/enable
  • PUT usually is idempotent (like a boolean: setting it twice to the same value doesn't matter)
  • POST usually is not idempotent (like a counter: every time you act on it, it changes to a different value)
  • Provide pagination and sorting, i.e. GET /books?limit=10&offset=100
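The URL conventions above can be shown concretely; the host, resource names and IDs here are hypothetical examples, not a real API.

```python
from urllib.parse import urlencode

BASE = "https://api.example.com"   # hypothetical API host

# Resource pairs: plural resource name followed by an ID, nested pairs repeat.
user_books = f"{BASE}/users/123/books/567"

# Pagination and sorting belong in the query string.
params = urlencode({"limit": 10, "offset": 100, "sort": "title"})
listing = f"{BASE}/books?{params}"

print(listing)   # https://api.example.com/books?limit=10&offset=100&sort=title
```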


WebSockets

  • It is a duplex (two-way) protocol built on top of TCP
  • The client connection is negotiated with the server only once to establish the connection; after that, client and server can send messages simultaneously
  • Creates a persistent connection between client and server, needed when you need to receive async messages from the server
  • Usually more complicated than HTTP because WebSockets need statefulness

RPC (Remote Procedure Call)

  • The idea of invoking another service as if it were a local function

  • Process

    1. You describe the function in an IDL (Interface Description Language)
    2. You feed this description to a generator along with a chosen language
    3. The generator creates the implementation in that particular language, returning you a stub function with the definition of your function in the specific language
    4. You implement the actual functionality (the body) of the function
  • Adds a level of abstraction: instead of using a specific language you use an IDL

  • gRPC: It is an RPC protocol that uses Protobuf as IDL and HTTP2 as transport, not supported by browsers

    • Protobuf is a binary protocol that stores descriptions in .proto files; it is not human-readable and is smaller/faster than JSON or XML

GraphQL (Graph Query Language)

  • It is a protocol based on HTTP that is used to avoid overfetching and underfetching (requests and responses are in JSON)
  • Overfetching: getting more info from a request than you actually need
    • You do a GET of a resource and it returns all the info about that user, but you only need 1 or 2 fields
  • Underfetching: getting less info from a request than you need, forcing you to make extra requests using the information from the first request
    • You do a GET of a resource and it returns the IDs of other resources that have the information you need, so you need to make more GET requests to actually get all the info you need
  • Query: the method to fetch data; allows you to define which fields and nested entities you want, so you get only what you need
  • Mutation: the method to modify data in the protocol; allows you to modify only what you need

Protocols Usage Summary

Protocol/Usage   External API   Bi-directional   High Throughput   Web browser support
REST             Yes            No               No                Yes
WebSockets       Yes            Yes              Yes               Yes
gRPC             No             No               Yes               No
GraphQL          Yes            No               No                Yes
UDP              Yes            Yes              Yes               No


Parallelism & Concurrency

  • Parallelism means actually doing more than one task at the same time
    • e.g. two processors, each one doing a separate calculation at the same time
  • Concurrency means doing multiple tasks by giving each a slice of time and switching between the tasks
    • Gives the illusion we are doing more than one task at the same time
    • e.g. a single processor first doing some calculations for one process, then doing a different calculation for another process


Processes

  • Each process has a separate memory space
  • One process is not allowed to access memory from another process
  • To communicate with each other, processes use interprocess communication like: Files, Signals (a specific command/message like kill -9 PID), Pipes (stdout, stdin, etc.), Sockets (establish a connection)
  • A process has its memory in the heap, which is shared between threads


Threads

  • The one in charge of actually running code
  • A process creates one or more threads to run
  • Threads can run on separate CPUs
  • A thread must have a stack to store local data/variables and the call stack
  • Threads overhead: spawning a thread per user accessing your system may cause resource competition problems (CPU usage, memory, access to locks, etc.) or bottleneck issues because OSs usually limit the number of threads

Process & Threads Anatomy

Thread Pools

  • It is a way to limit the number of threads to avoid hitting OS limitations while still creating multiple threads to run something
  • If you haven't reached the limit of the pool, a new thread is created to run a task; once a thread is done it frees a space in the pool
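Python's standard library ships a thread pool that behaves exactly as described: tasks beyond the worker limit wait for a free slot instead of spawning new threads. The request-handling function below is a made-up stand-in for real per-request work.

```python
from concurrent.futures import ThreadPoolExecutor

def handle_request(user_id: int) -> str:
    # Stand-in for real per-request work (I/O, DB calls, etc.).
    return f"served user {user_id}"

# The pool caps concurrency at 4 threads; the 10 tasks share those 4 workers
# rather than getting one thread each.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(handle_request, range(10)))

print(results[0])   # served user 0
```

`pool.map` also preserves input order in its results even though the tasks may finish out of order, which is convenient when responses must match requests.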



DB Indexes

  • They are created for each of the elements of a column in a table to make searches on this column faster
  • When you index a column a B-Tree+ is created (not the same as a binary tree); this type of tree allows more than 2 children per node (unlike binary trees, which allow at most 2 children per node)
    • A B-Tree+ has all the values in the leaves, and the nodes hold ranges
    • In a B-Tree+ all the leaves are ordered and connected like a linked list

B Tree+


Sharding

  • Relational databases have a limitation on how big a single instance can get; sharding provides a solution to this
  • The idea is to split a large DB into smaller DBs (shards).
  • Some strategies to split and access the split DBs are: Tenant Based Sharding and Hash Based Sharding
  • Usually only implemented for very large systems; it is complex to implement
  • It distributes the keys, not the values (i.e. in a social network app, if you use the ID of the user as the key and some users post a lot while others don't, you will end up with some shards larger than others)
Tenant Based Sharding (Geo Sharding)
  • Good for systems that have a clear separation between entities (i.e. Uber: a driver works in a certain country, and if this same driver works in another country they need to be registered as a separate user in that country)
  • Good when you know that a certain set of users will only access one of the databases while another set of users will access a different one (drivers from France access the FR DB and drivers from the US use the US DB)
  • If we need a user to access data from different DBs, the app needs to handle this with references to the different DBs or by copying the same user to different DBs
  • Usually used with DBs per country, so when we need to operate in a new country we just add a new DB for that country, although single shards may still be too big to handle (lots of users in one country but few users in others)

Geo Sharding

Hash Based Sharding (Map Sharding)
  • Multiple DBs/shards; each time we need to perform an operation on the database we first calculate in which shard the entity is located (use a hash function on the user ID)
  • Creates an even distribution of users, but adding shards is usually difficult because we need a new hash function, which means resharding data (moving data between shards to keep everything running as expected)
  • It's harder to store relational data, because entities that are related may end up in different shards and it would be hard to connect between them
  • Using a routing layer or locator service that helps locate the needed shard can help, but at the same time the locator becomes a single point of failure
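The hash-then-modulo routing step can be sketched as follows; the shard names are hypothetical, and a stable digest is used because Python's built-in `hash()` is salted per process.

```python
import hashlib

SHARDS = ["db-shard-0", "db-shard-1", "db-shard-2"]   # hypothetical shard names

def shard_for(user_id: str) -> str:
    # Hash the key deterministically, then take it modulo the shard count.
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

shard_for("user-123")   # always routes the same ID to the same shard
```

The resharding pain the notes describe falls out of the `% len(SHARDS)` term: adding a fourth shard changes the modulus, so most existing keys suddenly map to a different shard.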

Map Sharding

Consistent Hashing
  • Instead of assigning each shard a single value, we assign shards a range of values
  • Makes it easier to add shards because we need to move fewer entities (resharding is less expensive in this strategy)
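A minimal consistent-hash ring can be sketched with a sorted list of shard positions and a binary search; the class name, the small ring size of 1000 and the single point per shard are simplifications made for the example (real rings use many virtual nodes per shard).

```python
import bisect
import hashlib

def _hash(value: str) -> int:
    # Deterministic position on a small ring of 1000 slots (toy size).
    return int(hashlib.sha256(value.encode()).hexdigest(), 16) % 1000

class ConsistentHashRing:
    """Each shard owns a range of the ring; a key maps to the next shard clockwise."""
    def __init__(self, shards):
        self.ring = sorted((_hash(s), s) for s in shards)
        self.points = [p for p, _ in self.ring]

    def shard_for(self, key: str) -> str:
        # Find the first shard position past the key's position, wrapping around.
        idx = bisect.bisect(self.points, _hash(key)) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing(["shard-a", "shard-b", "shard-c"])
ring.shard_for("user-123")   # stable assignment for the same key
```

Adding a shard inserts one new point on the ring, so only the keys in the range that new point takes over have to move, which is why resharding is cheaper than with plain modulo hashing.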

Consistent Hashing


Partitioning

  • The idea is to split a large table in a DB into smaller tables
  • DB tables and indexes are files, so smaller files & indexes result in faster queries
  • Adds complexity because it requires maintenance: we must be aware of which tables are partitioned in case we need to add new partitions (due to new values, after certain dates, etc.)
  • It is hard to maintain uniqueness because each of the partitions is a separate physical table, so a unique key will be per particular table and not global.
  • More useful and also more used than sharding because it can be applied at a smaller scope
  • Some strategies to split and access DB tables are: list of values, range of dates or hash of a key
    • List of Values
      • Depending on the values of an entity we place it in a different table
      • Usually produces small tables (fast access)
      • Can also produce uneven data, and usually requires moving data between tables when a value of an entity changes
    • Range of Dates
      • Depending on the date value of an entity we place it in a different table
      • Usually produces small (fast access) tables and is good for delete operations on old data
      • Can still produce uneven data
    • Hash of a Key
      • Entities need an ID as primary key (i.e. UUID); we define the number of tables to use, then we use the modulo operator (%) to know in which table we should store the data
      • Usually produces more evenly distributed data
      • Queries based on date or other values can be slow because we may need to query multiple tables
      • Increasing the number of tables is hard because you will need to redistribute data

Partitioning & Strategies

Sharding is breaking one large database into smaller databases. Partitioning is breaking one large table into smaller tables.

Partitioning vs Sharding

CAP (Consistency, Availability, Partition Tolerance) Theorem

  • In case of a network partition, i.e. if our network nodes become disconnected from each other, the system must choose whether it stays consistent or highly available
  • In summary, during a partition a system can only be either consistent or highly available, not both
  • If you favor consistency
    • Writes/access to the DB during a disconnection of one or more of the nodes are not allowed because they could create inconsistent data
    • The user will need to retry the operation(s) at a later time
    • All nodes see the same data but may not be able to make changes
  • If you favor high availability
    • Writes/access to the DB during a disconnection are allowed, causing inconsistent data for a while until conflicts are resolved
    • Needs a conflict resolution algorithm, like majority-based or timestamp-based
    • All nodes are able to make changes but may not see the same data

ACID (Atomic, Consistency, Isolation, Durability) Transactions

  • Atomic

    • Cannot be broken or interrupted, either everything inside an atomic transaction happens or nothing happens at all.
    • If a block of code or commands is atomic and one of the operations fails, the whole block needs to be rolled back
    • Requires memory to store data until it is committed
  • Consistency

    • If you try to commit something that is wrong, the transaction must fail
    • Operations must be validated before being committed
    • i.e. if each element of a DB table has a unique key and you try to insert a repeated key, the transaction must be rejected by the DB
    • Requires CPU consumption and IO accesses to make sure data is valid
  • Isolation

    • Until a transaction is committed, other users cannot see the results of that transaction.
    • Nothing is seen by other users until commit
    • Requires memory and CPU consumption to make sure transactions won't interfere with each other
  • Durability

    • After a transaction is committed, data must be saved regardless of any error/exception scenarios (i.e., DB crashes, app crashes, power loss on servers, etc.)
    • Everything you commit is backed up to storage
    • Requires IO access to guarantee data is stored in non-volatile storage and/or replicated for backup
  • Having an ACID implementation is complex and anything complex is usually slow
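Atomicity and consistency can be demonstrated with SQLite, which ships with Python: a failed constraint inside a transaction rolls back everything, including the operations that had already succeeded. The table and data are made up for the example.

```python
import sqlite3

# In-memory DB; the UNIQUE constraint gives us a consistency rule to violate.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT UNIQUE)")

try:
    with conn:   # one atomic transaction: commits on success, rolls back on error
        conn.execute("INSERT INTO users (name) VALUES ('ada')")
        conn.execute("INSERT INTO users (name) VALUES ('ada')")  # violates UNIQUE
except sqlite3.IntegrityError:
    pass   # the whole transaction rolled back: not even the first row remains

count = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
print(count)   # 0
```

The first insert was valid on its own, yet the table ends up empty: that is atomicity (all or nothing) enforcing consistency (the unique-key rule).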

Session Management

  • It is the concept used to preserve state between HTTP requests

  • Cookies are a client side solution to support state on HTTP

    • Save resources on the server since everything is on the client side
    • Cookies have size limitations and max number of cookies is also limited
    • Cookies are transferred as HTTP headers on each request
  • Sticky sessions:

    • A session cookie is used as a key to access information stored on the server side (the info is stored in the memory of a specific server)
    • Uses a load balancer (with Layer 7 support) which always picks the same server based on the session cookie (it needs to pick the same server that has access to the stored information)
  • Key-Value store:

    • Use a key-value store instead of storing information in the memory of a particular server
    • Servers need to access this key-value storage to get information
    • Adds complexity to implement accesses to key-value storage and this storage can be a single point of failure

Serialization & Deserialization

  • Data Deserialization means taking bytes and bits in raw form (for example from a file) and creating an object or data structure from them
  • Data Serialization means taking a data structure or object and converting it into something that can be stored on disk (for example a file)
  • When doing Serialization and Deserialization in an app we want to avoid interleaving and have just a single request processed at a time, from start to finish
  • To avoid interleaving we can have one component that receives requests from users and sends events over a queue to another component (that does Serialization & Deserialization)
  • The component that does Serialization & Deserialization must process requests one by one (Request Serialization) and is the only one that can write to the storage
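Data serialization and deserialization can be shown in two lines with JSON (the format and the sample record are arbitrary choices for the example):

```python
import json

# A data structure in memory...
order = {"id": 42, "items": ["book", "pen"], "total": 12.5}

# Serialization: data structure -> raw bytes that can be stored or sent.
raw = json.dumps(order).encode("utf-8")

# Deserialization: raw bytes -> an equivalent data structure.
restored = json.loads(raw.decode("utf-8"))

assert restored == order
```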

Data serialization is converting between raw bits & bytes and data structures, while Request Serialization means processing requests sequentially from start to finish.

Serialization & Deserialization

CQRS (Command Query Responsibility Segregation)

  • Refers to the segregation of query commands: splitting the storage into two parts, Command and Query (each part optimized for its particular use)
  • Query is optimized for reads (has indexes) and Command is optimized for writes (no indexes)
  • We have a scheduled task that syncs/updates both parts accordingly
  • The implementation can be complex since we need to monitor the task that syncs both parts, and users may see some inconsistent data due to the syncing process

