Adaptive load balancing

At work, I've recently run up against the classic challenge faced by anyone running a high-availability service: Load balancing in the face of failures. I'm not sure the right solution has been written in software yet, but after a good deal of hammock time and chatting with coworkers, I think I've put together an algorithm that might work.

Let's say you have a goodly sized collection of API servers each talking to a handful of backend servers, load-balancing between them. The API servers receive high request rates that necessitate calls to the backend and must be kept highly available, even if backend servers unexpectedly go down or intermediary network conditions degrade. Backpressure is not an option; you can't just send HTTP 429 Too Many Requests. Taking the load off of a backend server that is suffering is good, but that can put more pressure on the others. How do you know what failure rate means you should be shedding load? How do you integrate both latency/timeout issues and explicit errors?

Generally: How do you maximize successful responses to your callers while protecting yourself from cascading failures? How can a load-balancer understand the cluster-level health of the backend?

The short version: Track an exponentially decaying health measure for each backend server based on error rates, distribute requests proportionally to health, and skip over servers that have reached an adaptive concurrency limit based on latency measures.

Update 2019-07-30: While I no longer think this precise approach is what I want, the general outlines are still good. You can read my conclusions about traffic-informed load balancing. The experimental code that I'm still working on is an evolution of the algorithm outlined here, but it replaces the buckets with a single exponentially decaying average and discards the entire fallback cascade in favor of a single weighted random selection.
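As a taste of that simplification, here is a minimal sketch of what a single exponentially decaying success average per node could look like. The class and the `alpha` value are illustrative assumptions, not the actual experimental code:

```python
class DecayingHealth:
    """Exponentially decaying success-rate estimate for one node.

    Hypothetical sketch: `alpha` controls how quickly old
    observations fade (higher = faster forgetting).
    """

    def __init__(self, alpha=0.1):
        self.alpha = alpha
        self.health = 1.0  # start optimistic: a clean-slate node is assumed healthy

    def record(self, success):
        # Blend the newest observation (1.0 or 0.0) into the running average.
        outcome = 1.0 if success else 0.0
        self.health = (1 - self.alpha) * self.health + self.alpha * outcome
```

A stream of failures drags `health` toward 0 geometrically, and a stream of successes pulls it back toward 1, which is the same fast-detect/slow-trust shape the bucket scheme below aims for.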

Sample scenarios

Just to get a sense of the variety of conditions that a load balancer has to handle, here are some scenarios I've encountered:

  • Everything's fine! All backend servers (simply "nodes", from here on in) are responding as expected in a reasonable amount of time.
  • One node suddenly goes down, and all requests time out or fail fast.
  • One node starts coming into and out of a usable state with high frequency (randomly per-request, usually called "flaky") or low frequency (on the scale of seconds to minutes, termed "flapping").
  • One or more nodes start showing high latency due to CPU contention, and with reduced load their latency improves.
  • Network conditions degrade such that all nodes show high latency, and reduced load does not help.
  • Nodes are replaced through dynamic configuration of the API servers or load balancer, and must be brought into service.
  • Combinations: One node of 3 is succeeding only 50% of the time, and therefore should receive ~0% of the traffic. Then the other two go hard-down, at which point the first node should receive the majority of the traffic (and possibly some load-shedding should occur.) Health is relative.

And here are several general issues, just so we're all working from the same baseline:

  • Cascading failure: All the nodes are running near their max capacity but appear healthy from the outside. One node has a CPU hiccup, and the contention pushes it over the threshold, and it goes down. The load balancer takes it out of service and redistributes request load to the other servers, sending them over their threshold and causing them to fail.

    Latency issues can propagate up the call chain, too. If the backend servers start timing out, the API servers can spend too much time waiting for them to respond, and the incoming requests pile up. The increased contention can lead to failure. Load-shedding can help, here; if the backend service becomes unhealthy, the API server can choose to fail-fast on some of the requests rather than waiting to see if they will succeed. This is the flip side of best-effort serving: Sometimes you have to fail pessimistically on some requests to ensure the service stays up for the remaining requests.

    One other note: When a call to one node fails, it might be tempting to make a second call to a different node. But if the failure occurred due to elevated load or an otherwise stressed system, you're now potentially doubling the request volume, which could send you into a downward spiral. Multiplying your request count might help smooth over transient glitches but could make things go from bad to worse when service is already degraded.

  • Thundering herd: Cache stampedes are the classic example—a cache entry expires, and thousands of servers try to recompute the cache entry simultaneously, making the same call to the same backend at the same time, overwhelming it. In general, any sort of highly correlated behavior can produce a thundering-herd-like problem, and in general the solution is to add randomness (e.g. carefully jittered cache expiry, such as x-fetch.)
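    The x-fetch idea (probabilistic early expiration) fits in a few lines. This is a hedged illustration of the published technique, not code from any particular cache library:

    ```python
    import math
    import random

    def should_refresh_early(now, expiry, delta, beta=1.0):
        """Decide whether to recompute a cache entry before it expires.

        `delta` is the (estimated) time the recomputation takes, and
        `beta` > 1 makes early refreshes more aggressive. Because each
        caller rolls its own random number, refreshes are decorrelated
        instead of all firing at the expiry instant.
        """
        # 1.0 - random.random() is in (0, 1], so the log is always defined.
        return now - delta * beta * math.log(1.0 - random.random()) >= expiry
    ```

    The logarithm makes early refreshes exponentially more likely as the deadline approaches, so in expectation exactly one caller recomputes shortly before expiry rather than thousands recomputing just after it.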
  • "What's a failure, anyway?" Sometimes your service starts returning huge numbers of failures, but it's behaving within spec. Why? Because your caller is making weird requests, such as repeatedly asking for a non-existent resource. This can look like a service failure, but you don't want to start shedding load (as long as responses are fast enough.) With HTTP, generally 5XX errors are a better indicator than 4XX, but again it's possible that one caller is making requests that hit a corner-case, and it is only affecting their own traffic. Perhaps with disciplined use of HTTP status codes your load balancer can tell the difference, but you might want to only treat connection failures and a select few error codes as true health indicators.

Existing approaches

Here are ways people deal with some of these issues:

  • Send all your backend requests to a single DNS entry populated by a cluster of basic load balancer servers, which then try to distribute the requests evenly. They may use random or round-robin selection, or may use health-aware balancing such as routing to the node with the least number of outstanding requests, the lowest recent latency, or other health measures.
    • Lowest-latency has one curious failure mode that rachelbythebay calls the load-balanced capture effect in which one broken machine can capture and reject most of the traffic. Use with caution!
    • Alternatively, each API server knows about all the backend servers, and chooses where to send each request. This is a client-side load-balancer.
  • Load-balancers may also use health-checks to decide whether a node should be in rotation. AWS's ELB is one such load-balancer service. It makes out-of-band health-check calls to each backend server, brings them into service after a few consecutive successes, and takes them out after a few consecutive failures. Load is distributed to in-service nodes by round-robin, hashing, or least-outstanding depending on whether an ALB, NLB, or CLB is used (and in what capacity). This is a health-aware load-balancer, but it does not have a notion of cluster health.
    • Nodes that are flapping may not be detected as unhealthy.
    • Nodes that are partially degraded (capable of some operations but not others) are either entirely in service or entirely out.
    • Even if flapping were detected, the atomistic health-check approach falls short. If one node is flapping or partially degraded, it should be removed from service. If 90% of nodes are flapping or partially degraded, they should (arguably!) all be used, barring concurrency limits.
  • Hystrix is a client library by Netflix which implements a circuit-breaker approach using the successes and failures of requests to determine whether to break the circuit. It implements recovery by occasionally allowing a request through to see if it succeeds. This is not a load-balancer, but it illustrates the use of the requests themselves as an in-band health-check.
    • It's possible to put Hystrix in front of each node and round-robin between them, or even try each node once until one doesn't fast-fail.
  • concurrency-limits is a library that Netflix announced as part of a newer strategy for dealing with some of these issues. It uses algorithms similar to TCP congestion control to limit request concurrency per-node. This is again not a load-balancer, and is similar to Hystrix in that it provides a per-node fast-fail. Unlike Hystrix, it deals in latency, rather than errors.

None of these, by themselves, address the problem of how to best allocate requests between nodes of varying health, while both maximizing the success rate in the short term and protecting the nodes from being crushed under excessive load (and going down completely.)

So here's an approach I've come up with that combines several of the ideas. My hope is that it covers the given scenarios, but that it's not so complicated that it's difficult to model and analyze.

Success-weighted, concurrency-limited instant fallback

(Man, what a mouthful. I'll come up with a better name if it works.)

The basic idea is to use a rolling window of health statistics per-node, and use that for weighted-random selection in a client-side load-balancer. Secondarily, use the concurrency-limits library to protect each node, but not for fast-failure; instead, when it rejects a request for a node, fall back to the next node. This fallback cascade is prioritized by the same health weighting, and random permutation is used to encourage even distribution.

Here's the algorithm; all numbers are subject to later tuning.


For each node, store a sliding window of 6 historical stat buckets, an additional "sticky" bucket, and a concurrency-limits Limiter object. Stats on calls to the backend are written to the newest bucket; each bucket consists of a counter of finished requests and a counter of how many of those succeeded. (These are updated together, atomically.) Every 5 seconds, the historical buckets shift by one: a new, zeroed bucket is inserted at the beginning and the oldest bucket is removed. If the oldest bucket had a non-zero request count, it is copied over the sticky bucket.
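A sketch of that per-node bucket structure, assuming some background task calls `shift()` every 5 seconds (names and representation are illustrative; the Limiter is omitted):

```python
import threading
from collections import deque

BUCKET_COUNT = 6

class NodeStats:
    """Sliding-window call stats for one backend node.

    Each bucket is a [finished, succeeded] pair; buckets[0] is newest.
    """

    def __init__(self):
        self.lock = threading.Lock()
        self.buckets = deque([[0, 0] for _ in range(BUCKET_COUNT)],
                             maxlen=BUCKET_COUNT)
        self.sticky = None  # last aged-out bucket that actually saw traffic

    def record(self, success):
        # Finished-count and success-count are updated together, under a lock.
        with self.lock:
            self.buckets[0][0] += 1
            if success:
                self.buckets[0][1] += 1

    def shift(self):
        # Called every 5 seconds: age out the oldest bucket, keeping a
        # sticky copy if it had any data, and open a fresh newest bucket.
        with self.lock:
            oldest = self.buckets[-1]
            if oldest[0] > 0:
                self.sticky = list(oldest)
            self.buckets.appendleft([0, 0])  # maxlen drops the oldest
```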

Awful pixelated hand-drawn diagram showing 6 sliding-window buckets and a sticky bucket, once each for 3 nodes

I'm not sure why I bothered including this diagram, except maybe to showcase how bad I am at drawing with a trackpad.

Node selection

When selecting a node to make a call to, first the nodes are prioritized by health, and then used in an instant fallback cascade.

  1. Prioritize the nodes by weighted random shuffle on health:
    1. Assign each node a success rate:
      • If there are finished requests in the historical buckets: Divide the success count sum by the finished count sum, with the sums weighted exponentially by a factor of 3 (most recent bucket highest)
      • No finished requests in historical buckets:
        • If there is data in the sticky bucket: Compute success rate from that data, with a floor of 0.0001 divided by node count
        • No data in sticky bucket either: Use value of 1
    2. Derive a weight for each node: Success rate cubed
    3. Perform a weighted shuffle of nodes (e.g. iterated weighted random sampling without replacement)
  2. Walk the node list until you find one able to accept a request, as defined by whether the Limiter will return a lease.
    • If none available, fail-fast: Return error to caller.
    • Otherwise, make the call to that node
  3. Update statistics:
    • Tell the Limiter whether it was a success, a timeout, or an unrelated error
    • Atomically increment node's finished-count and (if applicable) success-count


Some notes on how this design handles the scenarios:

  • Decorrelation: The algorithm uses weighting to prefer healthier nodes, but weighted-random selection to keep from sending all requests to the current-healthiest one.
    • When a node is completely down, it takes no requests unless the concurrency limit is reached on the healthy nodes.
    • When all nodes are down, fail fast (either due to concurrency limits or due to fast error responses).
    • When all nodes are suffering, distribute requests proportionally.
  • Fast detection: Each stat bucket is weighted more heavily than the next older one, so newer failures are weighted more than older successes. (The opposite is true for recovery.) Narrower buckets could enhance this effect by reducing dilution. The cubic penalty on the success rate also ensures that the system is biased towards trusting failures more than successes.

    There is likely a more precise algorithm that takes fully into account the number of outstanding requests and more quickly detects a sudden timeout condition. However, the concurrency-limits library should help with this.
  • Recovery: Accomplished by aging-out of buckets combined with the no-data minimum. The sticky bucket helps us distinguish between a newly configured node and one that has simply not received any requests recently, either due to bad health or a background low request rate. Even if a node has no recent requests, and the last recorded data was 0% success, we give it a slight chance to be picked again. 0.0001 divided by number of nodes keeps the damage low if the node is still unhealthy, but could bring it back into service quickly at a high request rate.
  • Fast-fail: Under non-timeout error conditions, we always try to make a request. This assumes that if the backends are truly struggling under load, they will show a latency increase, at which point the concurrency-limits will kick in.
    • If this assumption is challenged, consider implementing a circuit breaker (or perhaps use Hystrix) and include it in the cascade alongside concurrency-limits.
  • Startup: Initially, all nodes are tabula rasa and have equal chance of being selected. However, we don't want a positive feedback loop that leads to one node being considered healthy and the others being untrusted by default, which is why a clean-slate node is considered completely healthy—it needs to be "competitive" with nodes that already do have data.
    • However: Incrementing request stats on the tail of the request means there's a delay before a bad node is uncovered at startup or after a dynamic config change. If we're willing to wait 6 seconds before timing out and considering it an error, that's 6 seconds of requests that might fail before we have stats. We'll rely on concurrency-limits to limit the damage from this slow-failure scenario.

Feedback welcome!

I haven't actually implemented this yet, since I'm still finishing up a load-balancer test harness to analyze performance. I'm also a little unsure if it's too complicated. The problem itself is complicated, but as mentioned earlier, the solution needs to be kept simple enough to model and analyze, even if that means the performance falls short of the perfect ideal.

I keep thinking that someone has to have solved this already, but I haven't turned up any leads so far. (Maybe my search terms are wrong!) I'm open to any and all suggestions, including "you're solving the wrong problem, and here's why". :-)

Commenting is not yet reimplemented after the Wordpress migration, sorry! For now, you can email me and I can manually add comments.