Circuit breaker

When a service is overloaded, additional interaction may only worsen its overloaded state. This is especially true when combined with retry mechanisms such as Schedule: during peak traffic, a back-off retry policy alone may not be sufficient. To keep requests from piling onto an already overloaded resource, a circuit breaker protects the service by failing fast. This helps us achieve stability and prevents cascading failures in distributed systems.

Circuit breaker protocol

A circuit breaker is named after a similar concept in electrical engineering. It may be in one of three states.

🔀Closed
  • This is the state in which the circuit breaker starts.
  • Requests are made normally in this state:
    • When an exception occurs, it increments the failure counter.
      • When the failure counter reaches the given maxFailures threshold, the breaker moves to the Open state.
    • A successful request will reset the failure counter to zero.
⏚ī¸Open
  • In this state, the circuit breaker short-circuits/fails-fast all requests.
    • This is done by throwing the ExecutionRejected exception.
  • If a request is made after the configured resetTimeout, the breaker moves to the Half Open state, allowing one request to go through as a test.
⤴ī¸Half Open
  • The circuit breaker is in this state while allowing a request to go through as a test request.
    • All other requests made while test request` is still running short-circuit/fail-fast.
  • If the test request succeeds, the circuit breaker is tripped back into Closed, with the resetTimeout and the failures count also reset to initial values.
  • If the test request fails, the circuit breaker moves back to Open, and the resetTimeout is multiplied by the exponentialBackoffFactor up to the configured maxResetTimeout.
Additional context for this pattern

Circuit Breaker pattern in Cloud Design Patterns.
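The protocol above can be sketched as a small state machine. This is a minimal illustration of the idea, not Arrow's implementation; all names here (SketchBreaker, maxFailures, resetTimeoutMillis, backoffFactor) are hypothetical, and real implementations must also deal with concurrency, which this sketch ignores.

```kotlin
// Hypothetical sketch of the Closed/Open/HalfOpen protocol described above.
// Not Arrow's API: illustrative names and simplified, non-concurrent logic.
class SketchBreaker(
    private val maxFailures: Int,
    private val resetTimeoutMillis: Long,
    private val backoffFactor: Double = 1.0,
    private val clock: () -> Long = System::currentTimeMillis,
) {
    sealed interface State
    object Closed : State
    data class Open(val since: Long, val timeout: Long) : State
    object HalfOpen : State

    var state: State = Closed
        private set
    private var failures = 0
    private var currentTimeout = resetTimeoutMillis

    fun <A> protect(action: () -> A): A = when (val s = state) {
        Closed -> run(action)
        is Open ->
            if (clock() - s.since >= s.timeout) {
                state = HalfOpen // allow one request through as a test
                run(action)
            } else throw IllegalStateException("ExecutionRejected: breaker is Open")
        HalfOpen -> throw IllegalStateException("ExecutionRejected: test request in flight")
    }

    private fun <A> run(action: () -> A): A = try {
        val a = action()
        // success: reset the failure counter and timeout, close the breaker
        failures = 0
        currentTimeout = resetTimeoutMillis
        state = Closed
        a
    } catch (e: Throwable) {
        when (state) {
            Closed -> {
                failures += 1
                if (failures >= maxFailures) state = Open(clock(), currentTimeout)
            }
            HalfOpen, is Open -> {
                // failed test request: back to Open with a longer timeout
                currentTimeout = (currentTimeout * backoffFactor).toLong()
                state = Open(clock(), currentTimeout)
            }
        }
        throw e
    }
}
```

The injected clock makes the transitions easy to exercise deterministically in tests; Arrow's real CircuitBreaker handles the same transitions atomically for concurrent callers.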

Opening strategies

Arrow offers different strategies to determine when the circuit breaker should open and short-circuit all incoming requests. The currently available ones are:

  • Count. This strategy sets a maximum number of failures. Once this threshold is reached, the circuit breaker moves to Open. Note that every time a request succeeds, the counter is set back to zero; the circuit breaker only moves to Open when the maximum number of failures happen consecutively.
  • Sliding Window. This strategy counts the number of failures within a given time window. Unlike the Count approach, the circuit breaker will only move to Open if the number of failing requests tracked within the given period exceeds the threshold. As the time window slides, the failures out of the window limits are ignored.

Arrow's CircuitBreaker

Let's create a circuit breaker that allows only two consecutive failing calls to a remote service. After two requests fail with an exception, the circuit breaker starts short-circuiting/failing-fast.

A new instance of CircuitBreaker is created using of; there, we specify the different options.

Deprecation in Arrow 1.2

The of constructor function has been deprecated in favor of exposing the CircuitBreaker constructor.

Then we wrap every call to the service that may potentially fail with protectOrThrow or protectEither, depending on how we want the error to be communicated back. If an error arises, the internal state of the circuit breaker changes accordingly.

import arrow.core.Either
import arrow.resilience.CircuitBreaker
import arrow.resilience.CircuitBreaker.OpeningStrategy
import kotlinx.coroutines.delay
import kotlin.time.Duration.Companion.seconds
import kotlin.time.ExperimentalTime

@ExperimentalTime
suspend fun main(): Unit {
  val circuitBreaker = CircuitBreaker(
    openingStrategy = OpeningStrategy.Count(2),
    resetTimeout = 2.seconds,
    exponentialBackoffFactor = 1.2,
    maxResetTimeout = 60.seconds,
  )

  // normal operation
  circuitBreaker.protectOrThrow { "I am in Closed: ${circuitBreaker.state()}" }.also(::println)

  // simulate service getting overloaded
  Either.catch {
    circuitBreaker.protectOrThrow { throw RuntimeException("Service overloaded") }
  }.also(::println)
  Either.catch {
    circuitBreaker.protectOrThrow { throw RuntimeException("Service overloaded") }
  }.also(::println)
  circuitBreaker.protectEither { }
    .also { println("I am Open and short-circuit with ${it}. ${circuitBreaker.state()}") }

  // simulate reset timeout
  println("Service recovering . . .").also { delay(2000) }

  // simulate test request success
  circuitBreaker.protectOrThrow {
    "I am running test-request in HalfOpen: ${circuitBreaker.state()}"
  }.also(::println)
  println("I am back to normal state closed ${circuitBreaker.state()}")
}

A common pattern for building resilient systems is to compose a circuit breaker with a backing-off retry policy that prevents the resource from overloading. Schedule alone is not sufficient to make your system resilient, because you also have to consider parallel calls to your functions. In contrast, a circuit breaker tracks failures across every function call, even across different functions hitting the same resource or service.

import arrow.core.Either
import arrow.resilience.CircuitBreaker
import arrow.resilience.CircuitBreaker.OpeningStrategy
import arrow.resilience.Schedule
import arrow.resilience.retry
import kotlinx.coroutines.delay
import kotlin.time.Duration.Companion.seconds
import kotlin.time.ExperimentalTime

@ExperimentalTime
suspend fun main(): Unit {
  suspend fun apiCall(): Unit {
    println("apiCall . . .")
    throw RuntimeException("Overloaded service")
  }

  val circuitBreaker = CircuitBreaker(
    openingStrategy = OpeningStrategy.Count(2),
    resetTimeout = 2.seconds,
    exponentialBackoffFactor = 1.2,
    maxResetTimeout = 60.seconds,
  )

  suspend fun <A> resilient(schedule: Schedule<Throwable, *>, f: suspend () -> A): A =
    schedule.retry { circuitBreaker.protectOrThrow(f) }

  // simulate getting overloaded
  Either.catch {
    resilient(Schedule.recurs(5), ::apiCall)
  }.let { println("recurs(5) apiCall twice and 4x short-circuit result from CircuitBreaker: $it") }

  // simulate reset timeout
  delay(2000)
  println("CircuitBreaker ready to half-open")

  // retry once,
  // and when the CircuitBreaker opens after 2 failures
  // retry with exponential back-off with same time as CircuitBreaker's resetTimeout
  val fiveTimesWithBackOff = Schedule.recurs<Throwable>(1) andThen
    Schedule.exponential(2.seconds) and Schedule.recurs(5)

  Either.catch {
    resilient(fiveTimesWithBackOff, ::apiCall)
  }.let { println("exponential(2.seconds) and recurs(5) always retries with actual apiCall: $it") }
}