Supervision Tree
A supervision tree is a hierarchical structure of processes designed to build fault-tolerant systems through organized error handling and automatic recovery. This pattern, pioneered by OTP, enables systems to achieve exceptional reliability by embracing failure as a normal part of operation rather than trying to prevent it entirely.
Core Philosophy: Let It Crash
The supervision tree embodies the “let it crash” philosophy:
- Processes should fail fast when encountering errors
 - Failed processes are restarted by supervisors
 - System integrity is maintained through isolation
 - Complex error handling is moved to supervision layer
 
This approach simplifies code by separating business logic from error recovery logic.
Architecture
Process Hierarchy
        [Application Supervisor]
               /        \
    [DB Supervisor]  [Web Supervisor]
         /    \           /    \
   [Writer] [Reader] [Handler] [Cache]
Each level in the tree represents a supervision boundary with specific responsibilities and restart strategies.
Process Types
Supervisors
- Monitor child processes
 - Restart failed children according to strategy
 - Don’t perform business logic
 - Can supervise other supervisors (creating depth)
 
Workers
- Perform actual work (business logic)
 - Report to exactly one supervisor
 - Can fail and be restarted
 - Should be designed for clean startup/shutdown
 
Special Processes
- Application: Root of supervision tree
 - Dynamic Supervisors: Create children on demand
 - Task Supervisors: Manage short-lived tasks
 
Restart Strategies
One-for-One
When a child process fails, only that specific process is restarted.
Before: [A] [B] [C] [D]
B fails: [A] [X] [C] [D]
After:  [A] [B'] [C] [D]
Use when processes are independent.
One-for-All
When any child fails, all children are terminated and restarted.
Before: [A] [B] [C] [D]
B fails: [A] [X] [C] [D]
After:  [A'] [B'] [C'] [D']
Use when processes are strongly dependent.
Rest-for-One
When a child fails, that child and all children started after it are restarted.
Before: [A] [B] [C] [D]
B fails: [A] [X] [C] [D]
After:  [A] [B'] [C'] [D']
Use when there’s a startup dependency chain.
Simple-One-for-One
Special strategy for dynamically created children of the same type.
- All children run the same code
 - Children added/removed dynamically
 - Efficient for thousands of similar processes
 
Restart Intensity and Period
Supervisors track restart frequency to prevent infinite restart loops:
- Max Restarts: Maximum number of restarts allowed
 - Time Period: Time window for counting restarts
 - Escalation: If threshold exceeded, supervisor itself fails
 
Example configuration:
%% Allow max 3 restarts in 5 seconds
{RestartStrategy, MaxRestarts, Period} = {one_for_one, 3, 5}Implementation Examples
Erlang/OTP
-module(my_supervisor).
-behaviour(supervisor).
 
init([]) ->
    Children = [
        {worker1, {worker_module, start_link, []},
         permanent, 5000, worker, [worker_module]},
        {worker2, {other_worker, start_link, []},
         temporary, 5000, worker, [other_worker]}
    ],
    {ok, {{one_for_one, 3, 5}, Children}}.Elixir
defmodule MyApp.Supervisor do
  use Supervisor
 
  def start_link(opts) do
    Supervisor.start_link(__MODULE__, :ok, opts)
  end
 
  def init(:ok) do
    children = [
      {MyApp.Worker, arg1},
      {MyApp.Cache, name: MyApp.Cache}
    ]
 
    Supervisor.init(children, strategy: :one_for_one)
  end
endAkka (Scala)
class MySupervisor extends Actor {
  override val supervisorStrategy = 
    OneForOneStrategy(maxNrOfRetries = 3, 
                      withinTimeRange = 1 minute) {
      case _: ArithmeticException => Restart
      case _: NullPointerException => Stop
      case _: Exception => Escalate
    }
  
  def receive = {
    case Props(cls, args) => 
      sender() ! context.actorOf(Props(cls, args))
  }
}Effect-TS (TypeScript)
Effect-TS brings supervision tree patterns to TypeScript with type safety and structured concurrency:
import { Effect, Fiber, Scope, Exit } from "effect"
 
const oneForOneStrategy = <E, A>(
  childFactory: () => Effect.Effect<A, E>,
  maxRestarts: number = 3
) =>
  Effect.gen(function* (_) {
    let restarts = 0
    
    const startChild = (): Effect.Effect<A, E> =>
      childFactory().pipe(
        Effect.catchAll(error => {
          if (restarts < maxRestarts) {
            restarts++
            console.log(`Restarting child, attempt ${restarts}`)
            return startChild()
          }
          return Effect.fail(error)
        })
      )
    
    return yield* _(startChild())
  })
 
// Resource-aware supervision with automatic cleanup
const resourceAwareSupervisor = Effect.gen(function* (_) {
  yield* _(Effect.acquireUseRelease(
    Effect.gen(function* (_) {
      const scope = yield* _(Scope.make())
      return scope
    }),
    (scope) => Effect.gen(function* (_) {
      const children = [
        () => childProcess("worker-1"),
        () => childProcess("worker-2")
      ]
      
      return yield* _(createSupervisor({
        maxRestarts: 3,
        restartWindow: 5000,
        strategy: "one_for_one"
      }, children))
    }),
    (scope) => Scope.close(scope, Exit.unit)
  ))
})See Effect-TS Supervision Patterns for comprehensive implementation examples.
Child Specifications
Each child in a supervision tree has specifications defining:
Restart Type
- Permanent: Always restarted on termination
 - Temporary: Never restarted
 - Transient: Restarted only on abnormal termination
 
Shutdown Strategy
- Timeout: Grace period for cleanup (milliseconds)
 - Brutal Kill: Immediate termination
 - Infinity: Wait indefinitely (for supervisors)
 
Child Type
- Worker: Leaf node performing work
 - Supervisor: Branch node managing children
 
Error Propagation
Isolation Boundaries
- Errors are contained at process level
 - Supervisors create failure domains
 - Critical and non-critical paths separated
 - Cascading failures prevented through hierarchy
 
Escalation Path
- Worker crashes → caught by immediate supervisor
 - Supervisor attempts restart per strategy
 - If restart threshold exceeded → supervisor crashes
 - Parent supervisor handles failed supervisor
 - Continues up tree to application level
 
Design Patterns
Application Supervision Tree
           [Application]
                |
         [Root Supervisor]
         /      |       \
   [Core]   [Features]  [Support]
     |         |           |
  [Workers] [Services]  [Caches]
Service Supervision
Each service gets its own supervisor subtree:
      [Service Supervisor]
       /       |        \
  [Listener] [Pool]  [State]
              |
        [Workers 1..N]
Dynamic Worker Pool
    [Pool Supervisor]
    (simple_one_for_one)
           |
    [Dynamic Workers]
    Created on demand
Benefits
Fault Tolerance
- Automatic recovery from failures
 - System continues operating during partial failures
 - Graceful degradation under load
 
Simplicity
- Business logic separated from error handling
 - Cleaner code with less defensive programming
 - Predictable failure behavior
 
Visibility
- Clear system structure
 - Easy to understand failure domains
 - Simplified debugging and monitoring
 
Scalability
- Add/remove workers dynamically
 - Hierarchical organization scales naturally
 - Independent failure domains
 
Best Practices
Design Principles
- Fail Fast: Don’t try to handle unexpected errors in workers
 - Isolate State: Keep state in separate processes when possible
 - Layer Supervisors: Create multiple supervision levels for complex systems
 - Match Strategy to Dependencies: Choose restart strategy based on process relationships
 
Common Patterns
- Split Critical/Non-Critical: Separate essential from optional functionality
 - Database Connections: Pool supervisor with worker connections
 - Web Servers: Acceptor pool with request handlers
 - Background Jobs: Task supervisor for async work
 
Anti-Patterns to Avoid
- Deep Nesting: Too many supervision levels add complexity
 - Defensive Programming: Over-engineering error handling in workers
 - Shared State: Processes sharing mutable state break isolation
 - Ignoring Escalation: Not handling supervisor failures
 
Monitoring and Observability
Metrics to Track
- Restart frequency per supervisor
 - Process uptime and lifetime
 - Message queue lengths
 - Memory usage per process tree
 
Debugging Tools
- Observer (Erlang/Elixir): Visual supervision tree explorer
 - Process registry: Named process tracking
 - Crash dumps: Post-mortem analysis
 - Distributed tracing: Cross-node supervision
 
Use Cases
Telecommunications
- Call routing with automatic failover
 - Session management with recovery
 - Network element supervision
 
Web Applications
- Request handler pools
 - WebSocket connection management
 - Background job processing
 
IoT Systems
- Device connection management
 - Sensor data pipeline supervision
 - Command and control hierarchies
 
Financial Systems
- Transaction processing supervision
 - Market data feed handlers
 - Risk calculation pipelines
 
Related Concepts
- OTP - Origin and primary implementation
 - Actor Model - Underlying concurrency model
 - Distributed Systems - Application domain
 - Byzantine Fault Tolerance - Advanced fault tolerance
 - Fail-Stop Model - Failure detection approach