Software Architecture Design Notes

Software Architecture Design


Last Updated: June 01, 2023 by Pepe Sandoval





Software Architecture

1. Introduction to Software Architecture

  • The software architecture of a system is a high-level description of the system's structure, its different components and how those components communicate with each other

    • High-level description in this context means it is an abstraction that shows only the important parts and hides implementation details
    • Components are black-box elements that are defined by their behavior and APIs.
  • The structure of our system describes both the intent of our product and its qualities.

  • There are infinite ways to organize code but different organizations will give us different properties, like:

    • Performance, scalability, ease of adding features, response to failures and/or security attacks
  • SW Arch abstractions

    • Classes/structs: communication between objects inside a program.
    • Modules/packages/libraries: how they interact with each other.
    • Services (processes/groups of processes): separate services potentially on different computers.
  • A distributed multi-service approach allows us to architect systems that can handle large numbers of requests and process and store very large amounts of data


  • SW architecture cannot be proven to be both correct and optimal, but we can follow a methodical design process, apply industry proven architectural patterns and best practices

2. System Requirements

  • Always focus on figuring out and narrowing down what exactly we need to build

  • Translate high level request to technical requirements

  • Classifying requirements into categories can help. Types of requirements:

    • Features of the system:
      • Functional requirements that describe what the system must do
      • Describe the system as a black box, with user actions or events as input and what the system output must be
      • Example: A user logs into the ride share app and the system must display a map with the nearby drivers within 5km
    • Quality Attributes:
      • Non-Functional requirements that describe properties the system must have
      • Expected quality metrics that measure how well the system performs, like average response time and availability of resources
        • e.g. download speeds should be at least 50 Mbit/sec, active no later than 1 second after, able to deploy a new version every week, get a response within 500 ms, available 99% of the time, etc.
      • Like: Scalability, Availability, Reliability, Maintainability, Security, etc.
      • Quality attributes need to be measurable and testable
      • Different architectures provide us with different quality attributes
      • No single architecture can provide all quality attributes because certain quality attributes conflict with each other.
    • System Constraints:
      • Describe limitations and boundaries of the system (technical, business, legal)
      • Define resource constraints like:
        • Time constraints (deadlines)
        • Max throughput and max latency
        • Country laws and/or regulations
        • Specific technologies to be used (HW, SW, OS, cloud vendor, use of a third party, programming languages), etc.
  • Use Case: It is a particular situation or scenario in which our system is used to achieve a goal.

  • Use Flow: It is a step-by-step or graphical representation of a use case

  • Gather Functional requirements

    1. Identify all users in our system
    • e.g. User and Driver on a ride share app
    2. Document all possible use cases/scenarios in which a user can use our system
    • e.g. use cases: rider first-time registration, rider login, request ride, successful match for a ride or unsuccessful match, etc.
    3. Expand each use case through a flow of events, usually using a sequence diagram of the interactions between users; it can include other objects like the entity in charge of processing user actions
    • e.g. a sequence diagram (each arrow to the system represents an API call)
    4. Interactions and/or events should be documented specifying the action and data involved
    • e.g. a list of notes in the sequence diagram documenting this (data flowing between users and the system represents the arguments and return values of our API)

3. Quality Attributes for Large Scale Systems

Performance

  • Performance in terms of Speed == Response time (Response Time = Processing Time + Waiting Time): The time between a client sending a request and receiving a response.

    • Processing Time: The time it takes our system to process the request and to build and send the response. The amount of time spent actively in our system (our code, databases, waiting for IO, etc.)
    • Waiting Time: The time the request and response spend inactively in our system, usually spent in the network or in a SW queue waiting to be handled
    • When measuring response time we can take the average response time obtained from sending multiple requests
    • Ideally we create a response time distribution histogram to map how many of the requests sent took a certain amount of time (3 requests = 10ms each, 2 requests = 20ms each, etc.)
    • Tail latency is the small percentage of response times from our system that take the longest in comparison to the rest of the values.
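The measurements above can be sketched in a few lines. This is a minimal illustration, not a monitoring tool: the function name `summarize_response_times`, the 10 ms bucket size and the sample values are all assumptions made for the example, and the percentile uses the simple nearest-rank method.

```python
import math
from collections import Counter

def summarize_response_times(samples_ms, bucket_ms=10):
    """Return the average, p50, p99 and a histogram of response times (ms)."""
    ordered = sorted(samples_ms)

    def percentile(p):
        # Nearest-rank percentile: the smallest sample such that at least
        # p percent of all samples are less than or equal to it.
        rank = max(1, math.ceil(p / 100 * len(ordered)))
        return ordered[rank - 1]

    # Group samples into fixed-width buckets keyed by the bucket's lower bound.
    histogram = Counter((t // bucket_ms) * bucket_ms for t in ordered)
    return {
        "average": sum(ordered) / len(ordered),
        "p50": percentile(50),
        "p99": percentile(99),
        "histogram": dict(sorted(histogram.items())),
    }

# Hypothetical measurements: nine fast responses and one very slow one.
samples = [10, 10, 10, 20, 20, 30, 30, 40, 50, 500]
summary = summarize_response_times(samples)
print(summary)
```

With these samples the average is 72 ms even though half of the requests finished in 20 ms or less. The single 500 ms response is the tail, which is why looking at high percentiles is more informative than looking at the average alone.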


  • Performance in terms of processing == Throughput (Throughput = Tasks/second or Bytes/second): Refers to the ability to ingest and process data in a given period of time.

    • Throughput: the amount of work performed by our system per unit of time
    • Degradation point: the point in our performance graph where performance starts to get significantly worse as the load increases.

Scalability

  • Scalability: refers to the ability to handle a growing amount of work in a cost-effective way by adding resources to the system.
  • For the same amount of effort and cost, a scalable system is able to achieve much better results.
  • Vertical scalability: increase or upgrade processing/storage resources (better/newer servers/databases) to process the load faster on each single machine
    • Any application can benefit from it, without any additional code changes.
    • Migration of system to other machines should be simpler
    • There will be a physical limit regardless of how much money we spend
    • We are locking ourselves into an inherently centralized system, so it can't provide high availability or fault tolerance
  • Horizontal scalability: increase the number of instances of processing/storage resources (more servers/databases) to spread the load among a group of machines
    • No physical limit to scalability
    • Can provide high availability and fault tolerance
    • Needs code changes or needs to be designed to scale horizontally
    • Increases complexity and adds coordination overhead
  • Organizational scalability: increase the number of developers/engineers working on the project
    • Break the system into separate services that different teams of developers can work on independently
    • A service has its own code base, its own stack, and its own release schedule, and communicates with other services through loosely coupled protocols over the network.

Availability

  • Availability refers to the fraction of time/percentage that our service is functional and accessible to users

  • Uptime: time that our system is functional and accessible

  • Downtime: time that our system is unavailable

  • Availability % = Uptime/(Uptime + Downtime) = MTBF/(MTBF + MTTR)

  • MTBF (Mean Time Between Failures) represents the average time the system is operational

  • MTTR (Mean Time to Recovery) represents the average time the system takes to detect and recover from a failure (average downtime)

  • The MTBF/(MTBF + MTTR) formula shows that fast detection and recovery can help us achieve high availability

  • Availability industry standards set by cloud vendors are generally anywhere between 99% and 99.9%
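As a quick worked example of the availability formula, the sketch below computes availability from MTBF and MTTR and converts an availability target into the downtime it allows per year. The function names and the input numbers are assumptions chosen only for illustration.

```python
def availability(mtbf_hours, mttr_hours):
    """Availability as a fraction: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

def allowed_downtime_minutes_per_year(availability_fraction):
    """Downtime budget per (non-leap) year for a given availability target."""
    minutes_per_year = 365 * 24 * 60  # 525,600 minutes
    return (1 - availability_fraction) * minutes_per_year

# Hypothetical system: fails on average every 999 hours, recovers in 1 hour.
print(f"{availability(999, 1):.3%}")

# Halving the recovery time raises availability without changing the MTBF.
print(f"{availability(999, 0.5):.3%}")

# Downtime budgets for the two targets mentioned above.
for target in (0.99, 0.999):
    minutes = allowed_downtime_minutes_per_year(target)
    print(f"{target:.1%} -> about {minutes:.0f} minutes of downtime per year")
```

A 99% target leaves roughly 5,256 minutes (about 3.65 days) of downtime per year, while 99.9% leaves roughly 526 minutes (under 9 hours).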


  • What errors prevent us from having high availability

    • SW Bugs: Introduced by developers
    • SW Errors: out of memory, segfault (read/write invalid memory), null pointers
    • HW Errors: crashing HW due to hitting usage limits, power outage, network infrastructure issues, general congestion
  • Fault Tolerance means our system can remain operational and available to the users despite failures within one or multiple of its components.

  • Fault Tolerance Tactics:

    • Failure prevention: Eliminate single points of failure using replication and redundancy
      • Spatial redundancy: running replicas of our application on different computers
      • Time redundancy: repeat/retry operations until we succeed or until a timeout occurs
      • Active-Active Architecture: a redundancy strategy where requests are served by all replicas, so all replicas must be kept in sync; if one goes down, the other replicas step in automatically
        • Keeping all replicas in sync is not trivial, requires additional coordination and overhead
        • We can spread load across all replicas so we can process more traffic
      • Active-Passive Architecture: a redundancy strategy where a primary (active) replica takes all requests while the passive replicas take periodic snapshots of its state; if the primary fails, one of the passive replicas must take its place
        • Does not improve performance
        • Easier to implement
    • Failure detection and isolation: have the ability to detect faulty components and isolate them from the rest of the group
      • Usually implemented using a monitoring service that checks the system and makes decisions
        • Monitoring systems can have false positives due to network issues or other scenarios
        • It monitors the number of errors and the response time; if either metric is high, it can interpret this as a faulty component
    • Recovery: Refers to actions taken to try to recover a component for example:
      • Stop sending traffic and/or stop processing on that component
      • Restart component
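The time redundancy tactic listed above (retry until success or until a timeout) can be sketched as follows. This is only an illustration under stated assumptions: `call_with_retries` and `FlakyService` are hypothetical names, failures are modeled as `ConnectionError`, and the delay between attempts is fixed.

```python
import time

def call_with_retries(operation, timeout_s=2.0, delay_s=0.1):
    """Retry `operation` until it succeeds or `timeout_s` seconds have passed.

    Returns a (result, attempts) tuple; raises TimeoutError when out of time.
    """
    deadline = time.monotonic() + timeout_s
    attempts = 0
    while True:
        attempts += 1
        try:
            return operation(), attempts
        except ConnectionError:
            # Give up if waiting for another attempt would pass the deadline.
            if time.monotonic() + delay_s > deadline:
                raise TimeoutError(f"gave up after {attempts} attempts")
            time.sleep(delay_s)

class FlakyService:
    """Hypothetical component that fails a fixed number of times, then works."""
    def __init__(self, failures_before_success):
        self.remaining_failures = failures_before_success

    def __call__(self):
        if self.remaining_failures > 0:
            self.remaining_failures -= 1
            raise ConnectionError("replica unavailable")
        return "ok"

result, attempts = call_with_retries(FlakyService(2), timeout_s=1.0, delay_s=0.01)
print(result, attempts)
```

Blind retries are only safe when repeating the operation has no additional effect, so this tactic works best with operations that can be repeated without side effects.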

SLA, SLO, SLI

  • SLA (Service Level Agreement)

    • An agreement between a service and user, that defines the promises we make to our customers in terms of quality
    • Defines metrics for availability, performance, data durability, time to respond to failures
    • It can explicitly state penalties if promises are breached
  • SLO (Service Level Objective)

    • Used to refer to the individual goals that we set for our system (e.g. a resolution time objective of between 24 and 48 hours)
    • Represents the target values for the quality attributes we aspire to have in our system
    • An SLA is composed of multiple SLOs
  • SLI (Service Level Indicator)

    • It is the actual measure of our compliance with a service-level objective
    • The actual data measurements obtained with a monitoring system or from other calculations (e.g. from our logs we calculate the % of user requests receiving a successful response)
  • To define an SLO we should think about the metrics that users care about the most, then find the right SLIs to track those SLOs. We should make sure our SLA does not contain too many SLOs, because with many SLOs it's hard to prioritize and set realistic goals when designing our SW architecture. Finally, we should create a recovery plan for when we fail to meet our SLOs
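To make the SLI/SLO relationship concrete, here is a small sketch that computes a success-rate SLI from logged status codes and compares it with an SLO target. The function names and log data are hypothetical, and treating every non-5xx response as "successful" is an assumption of this example rather than a general rule.

```python
def success_rate_sli(status_codes):
    """SLI: fraction of requests that did not end in a server error (5xx)."""
    successful = sum(1 for code in status_codes if code < 500)
    return successful / len(status_codes)

def meets_slo(sli, slo_target):
    """An SLO is met when the measured SLI reaches the target value."""
    return sli >= slo_target

# Hypothetical log extract: 997 successful responses and 3 server errors.
logged_status_codes = [200] * 997 + [503] * 3
sli = success_rate_sli(logged_status_codes)

print(f"SLI = {sli:.1%}")
print("meets 99% SLO:", meets_slo(sli, 0.99))
print("meets 99.9% SLO:", meets_slo(sli, 0.999))
```

The same measurement satisfies a 99% objective but misses a 99.9% one, which is why the SLO target has to be chosen before the SLI can tell us anything useful.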

4. API

  • APIs can be seen as contracts between engineers who implement the system and the client apps that use the system
  • Clients calling our API: front-end clients like web browsers, backend systems external to our organization or internal systems within our organization
  • For large-scale systems we should define APIs for each of the components of our system

API categories

  • Public APIs
    • Exposed to the general public and can be used/called by any application
    • Usually only require registration on the system, in order to control who uses the API and how; this also allows us to blacklist users if needed
  • Private/Internal/Partner APIs
    • Private/internal APIs are exposed only internally within our company.
    • Partner APIs are exposed only to certain users or are subscription-based

API Best practices

  • Provide a complete encapsulation of the internal design and implementation
  • Be completely decoupled from our internal design, so we can change the internals without changing the interface
  • Be easy to use (keep things consistent), easy to understand and hard to misuse
  • Have only one way to get certain data or perform a task and not many
  • Have descriptive names for our actions and resources
  • Expose only the information and the actions that the user needs, and not more than that.
  • Keep the operations idempotent as much as possible to allow users to resend requests without consequences
    • idempotent: something that doesn't have any additional effect on the result if it's performed more than once (with the same data).
      • idempotent example: update user's address to: Baker street 1230
      • NOT idempotent example: add 100 to a user's balance
  • Support API pagination for large payload responses, so our system returns only a small segment of the response at a time, based on a max size for each response and an offset
    • Pagination divides a large set of data into discrete chunks, for example social media shows you the last 10 posts, if you scroll down you get another 10 and so on, but you don't get all the posts on your first request
  • Support Async Operations for requests that need one single big result/response
    • An async operation responds to the client immediately with an identifier that allows the client to track/poll the status of the operation and eventually receive the final result
  • Support Versioning, allowing clients to know which version of the API they are currently using
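Two of the practices above, idempotency and pagination, can be sketched in a few lines. All function names, the dictionary-based "user" and the default page sizes are assumptions made up for this example; a real API would sit behind a network layer and a data store.

```python
def set_address(user, address):
    # Idempotent: repeating the call leaves the user in the same state.
    user["address"] = address

def add_to_balance(user, amount):
    # NOT idempotent: every repeated call changes the state again.
    user["balance"] += amount

def get_posts_page(posts, offset=0, limit=10, max_limit=50):
    """Return one page of `posts` and the offset of the next page (or None)."""
    limit = min(limit, max_limit)  # the server caps the size of each response
    end = offset + limit
    next_offset = end if end < len(posts) else None
    return {"items": posts[offset:end], "next_offset": next_offset}

user = {"address": "", "balance": 0}
for _ in range(2):  # simulate a client resending the same two requests
    set_address(user, "Baker street 1230")
    add_to_balance(user, 100)
print(user)  # address is unchanged by the resend, balance was credited twice

posts = [f"post-{i}" for i in range(25)]
first = get_posts_page(posts, offset=0, limit=10)
last = get_posts_page(posts, offset=20, limit=10)
print(len(first["items"]), first["next_offset"])
print(len(last["items"]), last["next_offset"])
```

Returning the next offset (or `None` on the last page) is one possible design choice; it lets the client keep requesting pages without knowing the total size of the collection.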

RPC (Remote Procedure Calls)

  • A Remote Procedure Call is the ability of a client application to execute a subroutine on a remote server.
  • The RPC API style revolves around methods that are exposed to the client; the system is abstracted through a set of methods the client can call
  • For the developer the remote method invocation looks like calling a normal local method
  • The API, as well as the data types that are used in the API methods, are declared using a special interface description language
  • We use a special code generation tool to create two separate implementations, one on the server (Server Stub) and one on the client (Client Stub)
  • DTOs (Data Transfer Objects): custom objects or data structures that are also generated on the client/server side


  • In general, when using RPC, as developers we pick an appropriate framework, define the API and the relevant data types using the framework's interface description language, and publish that description.
  • Usually on the server side the developer generates the Server Stub. When a new client wants to integrate with us and make API calls, all they have to do is use the framework's publicly available tools to generate their Client Stub based on the API definition that we published earlier.
  • Remote calls are much slower than local method calls, so we may need to introduce async versions for slow methods
  • RPC can be unreliable because the client uses a network to communicate with our system; in some cases the client can't know if the server received the message, which can be mitigated by making operations idempotent
  • RPC is commonly used between two backend systems to completely abstract the network, or for communication between different components internally in a large system
  • RPC is usually not used for end-user web pages/apps when we want to take advantage of HTTP headers and cookies
  • RPC revolves more around actions, so every action is simply a new method with a different name and a different signature.
  • RPC frameworks: gRPC (Google), Thrift (Facebook), RMI (Java Remote Method Invocation: allows one Java virtual machine to invoke methods on an object running in another Java virtual machine.)
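The client stub / server stub idea can be illustrated with a toy, in-process sketch. It is not how gRPC, Thrift or RMI work internally: real frameworks generate the stubs from an interface description and send the bytes over a network, while here the stubs are hand-written, the "transport" is a plain function call, and `RideService` and `estimate_fare` are hypothetical names.

```python
import json

class ServerStub:
    """Unpacks a request, calls the real method and packs the response."""
    def __init__(self, service):
        self.service = service

    def handle(self, request_bytes):
        request = json.loads(request_bytes)
        method = getattr(self.service, request["method"])
        result = method(*request["args"])
        return json.dumps({"result": result}).encode()

class ClientStub:
    """Makes a remote method look like a normal local method call."""
    def __init__(self, transport):
        self.transport = transport  # callable taking bytes and returning bytes

    def __getattr__(self, method_name):
        # Only called for names that are not regular attributes of the stub.
        def remote_call(*args):
            request = json.dumps({"method": method_name, "args": args}).encode()
            response = self.transport(request)
            return json.loads(response)["result"]
        return remote_call

class RideService:
    """Hypothetical server-side implementation of the API."""
    def estimate_fare(self, distance_km, rate_per_km):
        return distance_km * rate_per_km

server = ServerStub(RideService())
# A real transport would send the bytes over the network to another machine.
client = ClientStub(transport=server.handle)

fare = client.estimate_fare(5, 2)  # reads like a local call
print(fare)
```

The caller never sees the serialization or the transport, which is the abstraction RPC is meant to provide. It is also why slowness and network failures can come as a surprise at the call site.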

REST (Representational State Transfer) API

  • Not a standard or a protocol, it is an architectural style: a set of constraints and best practices for defining APIs for the web
  • It is a resource-oriented approach, the main abstraction exposed to the user is a named resource
  • A resource encapsulates an entity in our system, and a REST API allows the user to manipulate those resources only through a small number of methods
  • The user requests a resource and the system responds with a representation of the current state of that resource
  • A system that provides a REST API is stateless because it does not maintain any session information about a client; each message is served in isolation, without any information about previous requests
  • Resources are organized in hierarchy where each resource is either simple or a collection
    • Simple: it has a state and can contain one/more sub-resources
    • Collection: Contains a list of resources of the same type


  • The representation of each resource state can be expressed in different ways (JSON, html page, an image, a video stream, etc.)

  • REST API best practices

    • Name resources using clear, specific and meaningful nouns and verbs for actions on those resources
    • Use singulars for simple resources and plurals for collections
  • The REST API limits the number of methods to predefined CRUD operations which are usually mapped to HTTP methods:

    • Create -> POST : Create new resource
    • Read/Get -> GET : Don't change state just get current state of resource or list of sub-resources for collections
    • Update -> PUT : Idempotent; updates a resource
    • Delete -> DELETE : Idempotent; deletes a resource
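The CRUD-to-HTTP mapping above can be sketched with a toy in-memory handler. This is an illustration only: `MovieApi`, the `/movies` collection and the returned status codes are assumptions for the example, and no real HTTP server or framework is involved.

```python
class MovieApi:
    """Toy in-memory REST-style handler for /movies and /movies/{id}."""
    def __init__(self):
        self.movies = {}
        self.next_id = 1

    def handle(self, method, path, body=None):
        """Return a (status_code, representation) tuple for one request."""
        parts = path.strip("/").split("/")
        if parts[0] != "movies" or len(parts) > 2:
            return 404, None
        if len(parts) == 1:  # collection resource: /movies
            if method == "GET":
                return 200, list(self.movies.values())
            if method == "POST":  # create a new resource in the collection
                movie = {"id": self.next_id, **body}
                self.movies[self.next_id] = movie
                self.next_id += 1
                return 201, movie
            return 405, None
        if not parts[1].isdigit() or int(parts[1]) not in self.movies:
            return 404, None
        movie_id = int(parts[1])  # simple resource: /movies/{id}
        if method == "GET":  # read the current state, no changes
            return 200, self.movies[movie_id]
        if method == "PUT":  # replace the state; repeating gives the same state
            self.movies[movie_id] = {"id": movie_id, **body}
            return 200, self.movies[movie_id]
        if method == "DELETE":
            del self.movies[movie_id]
            return 204, None
        return 405, None

api = MovieApi()
print(api.handle("POST", "/movies", {"title": "Alien"}))
print(api.handle("PUT", "/movies/1", {"title": "Aliens"}))
print(api.handle("GET", "/movies/1"))
print(api.handle("DELETE", "/movies/1"))
print(api.handle("GET", "/movies"))
```

Note that the handler keeps no information about the client between calls; every request carries everything needed to serve it, which matches the stateless constraint described earlier.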


5. Large Scale Systems Architectural Building Blocks

Load Balancer

  • Its role is to balance the traffic load among a group of servers in our system
  • A load balancer provides
    • High scalability: add instances when load increases
    • High availability: the load balancer can be configured to stop sending traffic to a server that is unreachable
    • Performance: Adds little latency but increases throughput
    • Maintainability: allows us to take down some instances for debug/maintenance/upgrade tasks
  • Types of load balancers:
    • DNS Load balancing:
      • DNS returns a list of IP addresses corresponding to different servers for the same app and can return this list in a different order for each client request; this way the DNS functions as a load balancer
      • Simple to implement but not robust
      • Client apps get the direct IP addresses of our servers, which is not very secure
      • It doesn't provide monitoring
      • It is always round robin, even if one of the instances is more heavily loaded than the others
    • HW and SW load balancers: HW load balancers are dedicated devices designed for load balancing, while SW load balancers perform the same function on a general-purpose server or processing unit
      • All traffic goes to this dedicated load balancer entity
      • Can monitor servers connected to the balancer
      • It can balance traffic based on current load or other criteria
      • Doesn't solve load balancing across instances in different geographic locations
  • Global Server Load Balancer (GSLB): a hybrid of a DNS load balancer and a HW/SW load balancer
    • It can provide DNS service but also figure out user's location
    • Can monitor servers connected to the balancer
    • Returns address of closest load balancer that will serve the client
    • Can be configured to route traffic based on response time, current load/traffic in a geo location, etc.
    • Users can be easily routed to different locations
  • Load balancer examples: HAProxy, NGINX, AWS - Elastic Load Balancing (ELB), GCP - Google Cloud Platform Load Balancer, Microsoft Azure Load Balancer
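A minimal sketch of what a SW load balancer does, assuming a round-robin policy plus a health flag per server. The class name, method names and IP addresses are hypothetical; products like the ones listed above support many more policies and perform the health checks themselves.

```python
import itertools

class RoundRobinBalancer:
    """Toy load balancer: round robin over the servers marked as healthy."""
    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(self.servers)
        self._cycle = itertools.cycle(self.servers)

    def mark_down(self, server):
        # Would normally be triggered by a failed health check.
        self.healthy.discard(server)

    def mark_up(self, server):
        if server in self.servers:
            self.healthy.add(server)

    def pick(self):
        """Return the next healthy server, skipping the unhealthy ones."""
        if not self.healthy:
            raise RuntimeError("no healthy servers available")
        for server in self._cycle:
            if server in self.healthy:
                return server

balancer = RoundRobinBalancer(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
before_failure = [balancer.pick() for _ in range(3)]

balancer.mark_down("10.0.0.2")  # e.g. the server stopped responding
after_failure = [balancer.pick() for _ in range(3)]

print(before_failure)
print(after_failure)
```

After `10.0.0.2` is marked down, traffic keeps flowing to the remaining two servers. This is the high availability and maintainability benefit described above: an instance can be removed without clients noticing.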

Message Broker

  • The fundamental building block of async software

  • Usually used inside our system

  • Implementation uses a queue to store messages between senders and receivers

  • Used to route messages to appropriate modules

  • Can be used to validate messages

  • The user sends a message to the front end, which returns success if the request was added correctly to the queue (message broker). Later on, when the backend receives the message and fully completes the request, it can send a confirmation message that will be delivered to the user

  • The message broker can serve multiple services or modules subscribed to the queue


  • A Message Broker provides:

    • Fault tolerance & Availability: allows services to communicate with each other even if some of them are temporarily unavailable
    • Scalability: it can queue up messages during traffic spikes
    • Performance impact: adds latency to the system
  • Message Broker examples: RabbitMQ, Apache Kafka, Amazon Simple Queue Service (SQS), Microsoft Azure (Service Bus, Event Hubs, Event Grid)
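The queue-based decoupling described in this section can be sketched in-process. This is a toy model under stated assumptions: `MessageBroker`, the `ride-requests` topic and the message contents are hypothetical, everything runs in a single process, and none of the persistence, acknowledgement or delivery guarantees of the products listed above are modeled.

```python
from collections import defaultdict, deque

class MessageBroker:
    """Toy broker: one FIFO queue per topic, held in memory."""
    def __init__(self):
        self.queues = defaultdict(deque)

    def publish(self, topic, message):
        self.queues[topic].append(message)
        return True  # the message was stored; it has NOT been processed yet

    def consume(self, topic):
        """Return the oldest pending message for `topic`, or None if empty."""
        queue = self.queues[topic]
        return queue.popleft() if queue else None

broker = MessageBroker()

# The front end accepts requests even though no backend is consuming yet,
# for example during a traffic spike or while the backend restarts.
for rider in ("rider-1", "rider-2", "rider-3"):
    accepted = broker.publish("ride-requests", {"rider": rider})

# Later, the backend comes back and drains the queue in arrival order.
processed = []
while (message := broker.consume("ride-requests")) is not None:
    processed.append(message["rider"])

print(processed)
```

The sender only learns that the message was queued, not that it was handled, which is why the flow described above ends with a separate confirmation message to the user.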
