Supervision Tree

A supervision tree is a hierarchical structure of processes designed to build fault-tolerant systems through organized error handling and automatic recovery. This pattern, pioneered by Erlang/OTP (Open Telecom Platform), lets systems achieve high reliability by treating failure as a normal part of operation rather than trying to prevent it entirely.

Core Philosophy: Let It Crash

The supervision tree embodies the “let it crash” philosophy:

Processes should fail fast when encountering errors
Failed processes are restarted by supervisors
System integrity is maintained through isolation
Complex error handling is moved to the supervision layer

This approach simplifies code by separating business logic from error recovery logic.
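
The separation can be sketched in plain TypeScript, without any actor runtime. Both `parsePositive` (the worker) and `supervise` (the recovery layer) are hypothetical names used only for illustration: the worker contains no recovery code and throws on bad input, while the wrapper decides what happens after a crash.

```typescript
// Hypothetical worker: no defensive code, it throws on bad input.
function parsePositive(input: string): number {
  const n = Number(input);
  if (!Number.isFinite(n) || n <= 0) {
    throw new Error(`invalid input: ${input}`);
  }
  return n;
}

// Hypothetical recovery layer: runs the worker once per job. When the worker
// crashes, the failure is recorded and the next job gets a fresh invocation.
function supervise(jobs: string[]): { results: number[]; crashes: string[] } {
  const results: number[] = [];
  const crashes: string[] = [];
  for (const job of jobs) {
    try {
      results.push(parsePositive(job));
    } catch (err) {
      crashes.push((err as Error).message);
    }
  }
  return { results, crashes };
}

const outcome = supervise(["1", "oops", "3"]);
```

Here `outcome.results` holds the two successful jobs and `outcome.crashes` holds the single failure; the worker itself never had to know how failures are handled.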

Architecture

Process Hierarchy

        [Application Supervisor]
               /        \
    [DB Supervisor]  [Web Supervisor]
         /    \           /    \
   [Writer] [Reader] [Handler] [Cache]

Each level in the tree represents a supervision boundary with specific responsibilities and restart strategies.
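
As a rough illustration, the hierarchy in the diagram can be modelled as a small tree data structure. The `TreeNode` type and the `supervisorsOf` helper are assumptions made for this sketch, not part of any supervision library.

```typescript
// Hypothetical node type: a supervisor has children, a worker does not.
type TreeNode =
  | { kind: "supervisor"; name: string; children: TreeNode[] }
  | { kind: "worker"; name: string };

// The hierarchy from the diagram above.
const tree: TreeNode = {
  kind: "supervisor",
  name: "Application Supervisor",
  children: [
    {
      kind: "supervisor",
      name: "DB Supervisor",
      children: [
        { kind: "worker", name: "Writer" },
        { kind: "worker", name: "Reader" },
      ],
    },
    {
      kind: "supervisor",
      name: "Web Supervisor",
      children: [
        { kind: "worker", name: "Handler" },
        { kind: "worker", name: "Cache" },
      ],
    },
  ],
};

// Return the supervisors above a worker, nearest first. This is the order
// in which a failure of that worker would travel up the tree.
function supervisorsOf(
  node: TreeNode,
  target: string,
  path: string[] = []
): string[] | undefined {
  if (node.kind === "worker") {
    return node.name === target ? [...path].reverse() : undefined;
  }
  for (const child of node.children) {
    const found = supervisorsOf(child, target, [...path, node.name]);
    if (found) return found;
  }
  return undefined;
}

const chain = supervisorsOf(tree, "Cache");
```

For the `Cache` worker, `chain` lists `Web Supervisor` first and `Application Supervisor` second, which mirrors the supervision boundaries a failure crosses.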

Process Types

Supervisors

Monitor child processes
Restart failed children according to strategy
Don’t perform business logic
Can supervise other supervisors (creating depth)

Workers

Perform actual work (business logic)
Report to exactly one supervisor
Can fail and be restarted
Should be designed for clean startup/shutdown

Special Processes

Application: Root of supervision tree
Dynamic Supervisors: Create children on demand
Task Supervisors: Manage short-lived tasks

Restart Strategies

One-for-One

When a child process fails, only that specific process is restarted.

Before: [A] [B] [C] [D]
B fails: [A] [X] [C] [D]
After:  [A] [B'] [C] [D]

Use when processes are independent.

One-for-All

When any child fails, all children are terminated and restarted.

Before: [A] [B] [C] [D]
B fails: [A] [X] [C] [D]
After:  [A'] [B'] [C'] [D']

Use when processes are strongly dependent.

Rest-for-One

When a child fails, that child and all children started after it are restarted.

Before: [A] [B] [C] [D]
B fails: [A] [X] [C] [D]
After:  [A] [B'] [C'] [D']

Use when there’s a startup dependency chain.
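
The three strategies described so far differ only in which children are restarted after a failure. The sketch below captures that rule in TypeScript; `childrenToRestart` is a hypothetical helper, and children are assumed to be listed in start order.

```typescript
type Strategy = "one_for_one" | "one_for_all" | "rest_for_one";

// Given the children in start order and the index of the failed child,
// return the names of the children that get restarted.
function childrenToRestart(
  strategy: Strategy,
  children: string[],
  failedIndex: number
): string[] {
  switch (strategy) {
    case "one_for_one":
      // Only the failed child.
      return children.slice(failedIndex, failedIndex + 1);
    case "one_for_all":
      // Every child, regardless of which one failed.
      return [...children];
    case "rest_for_one":
      // The failed child and everything started after it.
      return children.slice(failedIndex);
  }
}

const kids = ["A", "B", "C", "D"];
const oneForOne = childrenToRestart("one_for_one", kids, 1);
const oneForAll = childrenToRestart("one_for_all", kids, 1);
const restForOne = childrenToRestart("rest_for_one", kids, 1);
```

With `B` failing, the three results match the "After" rows of the diagrams: only `B`, all four children, and `B` through `D` respectively.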

Simple-One-for-One

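Special strategy for dynamically created children of the same type. Erlang's supervisor module provides it as simple_one_for_one; in Elixir the strategy is deprecated in favor of DynamicSupervisor.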

All children run the same code
Children added/removed dynamically
Efficient for thousands of similar processes
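
A minimal sketch of the idea, assuming a hypothetical `WorkerPool` class: every child comes from the same factory, and children are added or removed at runtime by id. It is not a model of the OTP implementation.

```typescript
// Hypothetical pool: every child is built from the same factory function.
class WorkerPool<W> {
  private nextId = 1;
  private readonly workers = new Map<number, W>();
  private readonly factory: (id: number) => W;

  constructor(factory: (id: number) => W) {
    this.factory = factory;
  }

  // Start a new child on demand and return its id.
  startChild(): number {
    const id = this.nextId++;
    this.workers.set(id, this.factory(id));
    return id;
  }

  // Remove a child; returns false if the id is unknown.
  terminateChild(id: number): boolean {
    return this.workers.delete(id);
  }

  // Replace a crashed child with a fresh instance from the same factory.
  restartChild(id: number): boolean {
    if (!this.workers.has(id)) return false;
    this.workers.set(id, this.factory(id));
    return true;
  }

  get size(): number {
    return this.workers.size;
  }
}

const pool = new WorkerPool((id) => ({ name: `conn-${id}` }));
const first = pool.startChild();
const second = pool.startChild();
pool.terminateChild(first);
```

Because all children share one factory, the pool needs no per-child specification, which is what keeps this arrangement cheap for large numbers of similar processes.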

Restart Intensity and Period

Supervisors track restart frequency to prevent infinite restart loops:

Max Restarts: Maximum number of restarts allowed
Time Period: Time window for counting restarts
Escalation: If the threshold is exceeded, the supervisor terminates its children and then fails itself

Example configuration:

%% Allow at most 3 restarts within any 5-second window
{RestartStrategy, MaxRestarts, Period} = {one_for_one, 3, 5}.
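
The bookkeeping behind such a limit can be sketched as a sliding window over restart timestamps. `RestartLimiter` is a hypothetical name, times are passed in explicitly as milliseconds to keep the example deterministic, and the handling of a restart that falls exactly on the window boundary is an assumption of this sketch.

```typescript
// Sliding-window restart limiter. Names are illustrative, not an OTP API.
class RestartLimiter {
  private timestamps: number[] = [];
  private readonly maxRestarts: number;
  private readonly periodMs: number;

  constructor(maxRestarts: number, periodMs: number) {
    this.maxRestarts = maxRestarts;
    this.periodMs = periodMs;
  }

  // Record a restart at time `nowMs`. Returns true while the supervisor is
  // within its limit, false once more than `maxRestarts` restarts fall
  // inside the window, which is when the supervisor should give up.
  recordRestart(nowMs: number): boolean {
    const windowStart = nowMs - this.periodMs;
    this.timestamps = this.timestamps.filter((t) => t > windowStart);
    this.timestamps.push(nowMs);
    return this.timestamps.length <= this.maxRestarts;
  }
}

// Mirrors the configuration above: 3 restarts in 5 seconds.
const limiter = new RestartLimiter(3, 5000);
const withinLimit = [0, 1000, 2000].map((t) => limiter.recordRestart(t));
const fourthRestart = limiter.recordRestart(3000);
```

The first three restarts stay within the limit; the fourth one inside the same 5-second window exceeds it, which is the point at which the supervisor would escalate.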

Implementation Examples

Erlang/OTP

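-module(my_supervisor).
-behaviour(supervisor).

-export([start_link/0, init/1]).

%% Start the supervisor and register it locally under the module name.
start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

%% Child spec tuple: {Id, StartFunc, Restart, Shutdown, Type, Modules}.
init([]) ->
    Children = [
        {worker1, {worker_module, start_link, []},
         permanent, 5000, worker, [worker_module]},
        {worker2, {other_worker, start_link, []},
         temporary, 5000, worker, [other_worker]}
    ],
    %% one_for_one strategy, at most 3 restarts within 5 seconds.
    {ok, {{one_for_one, 3, 5}, Children}}.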

Elixir

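defmodule MyApp.Supervisor do
  use Supervisor

  def start_link(opts) do
    Supervisor.start_link(__MODULE__, :ok, opts)
  end

  @impl true
  def init(:ok) do
    children = [
      # {module, arg}: arg is handed to MyApp.Worker.child_spec/1,
      # which by default forwards it to start_link/1. [] is a placeholder.
      {MyApp.Worker, []},
      {MyApp.Cache, name: MyApp.Cache}
    ]

    Supervisor.init(children, strategy: :one_for_one)
  end
end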

Akka (Scala)

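import akka.actor.{Actor, OneForOneStrategy, Props}
import akka.actor.SupervisorStrategy.{Escalate, Restart, Stop}
import scala.concurrent.duration._

class MySupervisor extends Actor {
  // At most 3 restarts per child within one minute.
  // The decider maps each exception type to a directive.
  override val supervisorStrategy =
    OneForOneStrategy(maxNrOfRetries = 3, withinTimeRange = 1.minute) {
      case _: ArithmeticException  => Restart
      case _: NullPointerException => Stop
      case _: Exception            => Escalate
    }

  // Create a child from the received Props and reply with its ActorRef.
  def receive = {
    case props: Props =>
      sender() ! context.actorOf(props)
  }
}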

Effect-TS (TypeScript)

Effect-TS brings supervision tree patterns to TypeScript with type safety and structured concurrency:

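import { Effect, Scope, Exit } from "effect"

// Retry-style one-for-one: rerun the child when it fails, up to maxRestarts times.
const oneForOneStrategy = <E, A>(
  childFactory: () => Effect.Effect<A, E>,
  maxRestarts: number = 3
) =>
  Effect.gen(function* (_) {
    // Restart count is local to this supervisor run
    let restarts = 0

    const startChild = (): Effect.Effect<A, E> =>
      childFactory().pipe(
        Effect.catchAll(error => {
          if (restarts < maxRestarts) {
            restarts++
            console.log(`Restarting child, attempt ${restarts}`)
            return startChild()
          }
          // Restart budget exhausted: propagate the failure to the caller
          return Effect.fail(error)
        })
      )

    return yield* _(startChild())
  })

// Resource-aware supervision with automatic cleanup.
// Note: childProcess and createSupervisor are not defined in this snippet;
// they are assumed to be provided elsewhere.
const resourceAwareSupervisor = Effect.gen(function* (_) {
  yield* _(Effect.acquireUseRelease(
    // Acquire: open a scope that owns the supervisor's resources
    Effect.gen(function* (_) {
      const scope = yield* _(Scope.make())
      return scope
    }),
    // Use: run the supervisor with its children
    (scope) => Effect.gen(function* (_) {
      const children = [
        () => childProcess("worker-1"),
        () => childProcess("worker-2")
      ]

      return yield* _(createSupervisor({
        maxRestarts: 3,
        restartWindow: 5000,
        strategy: "one_for_one"
      }, children))
    }),
    // Release: always close the scope (Exit.unit is named Exit.void in Effect 3.x)
    (scope) => Scope.close(scope, Exit.unit)
  ))
})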

See Effect-TS Supervision Patterns for comprehensive implementation examples.

Child Specifications

Each child in a supervision tree has specifications defining:

Restart Type

Permanent: Always restarted on termination
Temporary: Never restarted
Transient: Restarted only on abnormal termination
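
The three restart types reduce to a small decision rule. The sketch below uses a simplified, hypothetical `ExitReason` with three values; real runtimes distinguish more exit reasons.

```typescript
type RestartType = "permanent" | "temporary" | "transient";

// Simplified exit reasons, assumed for this sketch: "normal" and "shutdown"
// count as clean terminations, "crash" stands for any abnormal one.
type ExitReason = "normal" | "shutdown" | "crash";

// Decide whether a terminated child should be restarted.
function shouldRestart(restart: RestartType, reason: ExitReason): boolean {
  switch (restart) {
    case "permanent":
      return true; // always restarted
    case "temporary":
      return false; // never restarted
    case "transient":
      return reason === "crash"; // only after abnormal termination
  }
}

const decisions = {
  permanentNormal: shouldRestart("permanent", "normal"),
  temporaryCrash: shouldRestart("temporary", "crash"),
  transientNormal: shouldRestart("transient", "normal"),
  transientCrash: shouldRestart("transient", "crash"),
};
```

Transient is the only type whose outcome depends on the exit reason, which makes it a natural fit for workers that are expected to finish on their own.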

Shutdown Strategy

Timeout: Grace period for cleanup (milliseconds)
Brutal Kill: Immediate termination
Infinity: Wait indefinitely (typically used when the child is itself a supervisor)
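
How the three shutdown settings play out can be sketched as a pure function. `shutdownOutcome` is a hypothetical helper; it assumes the child needs a known `cleanupMs` to stop gracefully and that a child given unlimited time does eventually stop.

```typescript
// A timeout in milliseconds, or one of the two special settings.
type Shutdown = number | "brutal_kill" | "infinity";

// Decide how a child that needs `cleanupMs` to clean up actually ends.
function shutdownOutcome(
  shutdown: Shutdown,
  cleanupMs: number
): "graceful" | "killed" {
  if (shutdown === "brutal_kill") return "killed"; // no grace period at all
  if (shutdown === "infinity") return "graceful"; // supervisor waits as long as needed
  // Timeout: graceful only if cleanup finishes inside the grace period.
  return cleanupMs <= shutdown ? "graceful" : "killed";
}

const fastChild = shutdownOutcome(5000, 200);
const slowChild = shutdownOutcome(5000, 8000);
const subtree = shutdownOutcome("infinity", 60000);
```

A child that cleans up in 200 ms stops gracefully under a 5000 ms timeout, while one that needs 8000 ms is killed when the grace period runs out.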

Child Type

Worker: Leaf node performing work
Supervisor: Branch node managing children

Error Propagation

Isolation Boundaries

Errors are contained at process level
Supervisors create failure domains
Critical and non-critical paths separated
Cascading failures prevented through hierarchy

Escalation Path

Worker crashes → caught by immediate supervisor
Supervisor attempts restart per strategy
If restart threshold exceeded → supervisor crashes
Parent supervisor handles failed supervisor
Continues up tree to application level
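
The escalation steps above can be traced with a small simulation. `SupervisorNode` is a hypothetical class, and the restart limit here is a plain counter with no time window, which keeps the example short.

```typescript
// Hypothetical supervisor with a fixed restart budget and an optional parent.
class SupervisorNode {
  readonly name: string;
  readonly maxRestarts: number;
  readonly parent: SupervisorNode | undefined;
  private restarts = 0;

  constructor(name: string, maxRestarts: number, parent?: SupervisorNode) {
    this.name = name;
    this.maxRestarts = maxRestarts;
    this.parent = parent;
  }

  // Handle a child failure and return a log of what happened on the way up.
  childFailed(child: string, log: string[] = []): string[] {
    if (this.restarts < this.maxRestarts) {
      this.restarts++;
      log.push(`${this.name} restarts ${child}`);
      return log;
    }
    log.push(`${this.name} exceeds its restart limit and fails`);
    this.restarts = 0; // a restarted supervisor would begin with a fresh count
    if (this.parent) {
      // The parent treats the failed supervisor like any other failed child.
      return this.parent.childFailed(this.name, log);
    }
    log.push("no parent left: the application stops");
    return log;
  }
}

const root = new SupervisorNode("Application Supervisor", 1);
const web = new SupervisorNode("Web Supervisor", 1, root);
const firstCrash = web.childFailed("Handler");
const secondCrash = web.childFailed("Handler");
```

The first crash is absorbed by `Web Supervisor`. The second exhausts its budget, so the failure moves one level up and `Application Supervisor` restarts the whole web subtree.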

Design Patterns

Application Supervision Tree

           [Application]
                |
         [Root Supervisor]
         /      |       \
   [Core]   [Features]  [Support]
     |         |           |
  [Workers] [Services]  [Caches]

Service Supervision

Each service gets its own supervisor subtree:

      [Service Supervisor]
       /       |        \
  [Listener] [Pool]  [State]
              |
        [Workers 1..N]

Dynamic Worker Pool

    [Pool Supervisor]
    (simple_one_for_one)
           |
    [Dynamic Workers]
    Created on demand

Benefits

Fault Tolerance

Automatic recovery from failures
System continues operating during partial failures
Graceful degradation under load

Simplicity

Business logic separated from error handling
Cleaner code with less defensive programming
Predictable failure behavior

Visibility

Clear system structure
Easy to understand failure domains
Simplified debugging and monitoring

Scalability

Add/remove workers dynamically
Hierarchical organization scales naturally
Independent failure domains

Best Practices

Design Principles

Fail Fast: Don’t try to handle unexpected errors in workers
Isolate State: Keep state in separate processes when possible
Layer Supervisors: Create multiple supervision levels for complex systems
Match Strategy to Dependencies: Choose restart strategy based on process relationships

Common Patterns

Split Critical/Non-Critical: Separate essential from optional functionality
Database Connections: Pool supervisor with worker connections
Web Servers: Acceptor pool with request handlers
Background Jobs: Task supervisor for async work

Anti-Patterns to Avoid

Deep Nesting: Too many supervision levels add complexity
Defensive Programming: Over-engineering error handling in workers
Shared State: Processes sharing mutable state break isolation
Ignoring Escalation: Not handling supervisor failures

Monitoring and Observability

Metrics to Track

Restart frequency per supervisor
Process uptime and lifetime
Message queue lengths
Memory usage per process tree
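
As an illustration of the first metric, restart frequency per supervisor can be derived from a list of restart timestamps. `RestartMetrics` is a hypothetical in-memory recorder; a real system would export these values to its metrics backend.

```typescript
// Hypothetical in-memory recorder of restart events, keyed by supervisor name.
class RestartMetrics {
  private readonly events = new Map<string, number[]>();

  // Record that `supervisor` restarted a child at time `atMs`.
  record(supervisor: string, atMs: number): void {
    const list = this.events.get(supervisor) ?? [];
    list.push(atMs);
    this.events.set(supervisor, list);
  }

  // Restarts per minute for one supervisor over the window ending at `nowMs`.
  ratePerMinute(supervisor: string, nowMs: number, windowMs: number): number {
    const list = this.events.get(supervisor) ?? [];
    const recent = list.filter((t) => t > nowMs - windowMs && t <= nowMs).length;
    return recent / (windowMs / 60_000);
  }
}

const metrics = new RestartMetrics();
metrics.record("db", 10_000);
metrics.record("db", 50_000);
metrics.record("db", 200_000);

const dbRate = metrics.ratePerMinute("db", 60_000, 60_000);
const webRate = metrics.ratePerMinute("web", 60_000, 60_000);
```

A rising rate for one supervisor while its siblings stay at zero points to a problem inside that failure domain rather than in the system as a whole.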

Debugging Tools

Observer (Erlang/Elixir): Visual supervision tree explorer
Process registry: Named process tracking
Crash dumps: Post-mortem analysis
Distributed tracing: Following failures and restarts across nodes

Use Cases

Telecommunications

Call routing with automatic failover
Session management with recovery
Network element supervision

Web Applications

Request handler pools
WebSocket connection management
Background job processing

IoT Systems

Device connection management
Sensor data pipeline supervision
Command and control hierarchies

Financial Systems

Transaction processing supervision
Market data feed handlers
Risk calculation pipelines

Related Concepts

OTP - Origin and primary implementation
Actor Model - Underlying concurrency model
Distributed Systems - Application domain
Byzantine Fault Tolerance - Advanced fault tolerance
Fail-Stop Model - Failure detection approach

🌿 Alternef Digital Garden
