Load balancing: Beyond healthchecks

July 21st, 2019

I became interested in finding The Perfect Load Balancer when we had a series of incidents at work involving a service talking to a database that was behaving erratically. While our first focus was on making the database more stable, it was clear to me that there could have been a vastly reduced impact to service if we had been able to load-balance requests more effectively between the database's several read endpoints.

The more I looked into the state of the art, the more surprised I was to discover that this is far from being a solved problem. There are plenty of load balancers, but many use algorithms that only work for one or two failure modes—and in these incidents, we had seen a variety of failure modes.

This post describes what I learned about the current state of load balancing for high availability, my understanding of the problematic dynamics of the most common tools, and where I think we should go from here.

(Disclaimer: This is based primarily on thought experiments and casual observations, and I have not had much luck in finding relevant academic literature. Critiques are very welcome!)


Points I'd like you to take away from this:

  • Server health can only be understood in the context of the cluster's health
  • Load balancers that use active healthchecks to kick out servers may unnecessarily lose traffic when healthchecks fail to be representative of real traffic health
  • Passive monitoring of actual traffic allows latency and failure rate metrics to participate in equitable load distribution
  • If small differences in server health produce large differences in load balancing, the system may oscillate wildly and unpredictably
  • Randomness can inhibit mobbing and other unwanted correlated behaviors
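
The last two points can be illustrated with a small sketch. It is not taken from the post: the health scores, the server names, and both selection functions are hypothetical, chosen only to show how a "pick the healthiest" rule turns a tiny difference in health into a total difference in load, while a weighted random choice spreads requests roughly in proportion to health.

```python
import random
from collections import Counter

# Hypothetical health scores (1.0 = perfectly healthy); the differences are tiny.
health = {"a": 0.99, "b": 0.98, "c": 0.97}

def pick_best(health):
    """Always choose the server with the highest health score."""
    return max(health, key=health.get)

def pick_weighted(health, rng):
    """Choose a server at random, with probability proportional to health."""
    names = list(health)
    weights = [health[n] for n in names]
    return rng.choices(names, weights=weights, k=1)[0]

rng = random.Random(42)  # seeded only so the sketch is repeatable
best = Counter(pick_best(health) for _ in range(3000))
weighted = Counter(pick_weighted(health, rng) for _ in range(3000))

print("pick_best:    ", dict(best))      # every request lands on "a"
print("pick_weighted:", dict(weighted))  # load is spread across all three
```

With `pick_best`, a one-percent difference in health sends all traffic to a single server, which is the kind of mobbing that can then degrade that server and flip the ranking. The weighted version changes load only slightly when health changes slightly.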

Adaptive load balancing

March 20th, 2019

At work, I've recently run up against the classic challenge faced by anyone running a high-availability service: Load balancing in the face of failures. I'm not sure the right solution has been written in software yet, but after a good deal of hammock time and chatting with coworkers, I think I've put together an algorithm that might work.

Let's say you have a goodly sized collection of API servers each talking to a handful of backend servers, load-balancing between them. The API servers receive high request rates that necessitate calls to the backend and must be kept highly available, even if backend servers unexpectedly go down or intermediary network conditions degrade. Backpressure is not an option; you can't just send HTTP 429 Too Many Requests. Taking the load off of a backend server that is suffering is good, but that can put more pressure on the others. How do you know what failure rate means you should be shedding load? How do you integrate both latency/timeout issues and explicit errors?

Generally: How do you maximize successful responses to your callers while protecting yourself from cascading failures? How can a load-balancer understand the cluster-level health of the backend?

The short version: Track an exponentially decaying health measure for each backend server based on error rates, distribute requests proportionally to health, and skip over servers that have reached an adaptive concurrency limit based on latency measures.
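
As a rough illustration of that summary, here is a minimal sketch. It makes several assumptions that are not in the post: the `Backend` class and `choose` function are hypothetical names, health is a per-request exponentially weighted average of success, the concurrency limit shrinks multiplicatively when latency exceeds a made-up target and otherwise grows slowly, and when every backend is at its limit the chooser falls back to all of them rather than refusing the request.

```python
import random

class Backend:
    """Per-backend state, updated passively from real traffic (hypothetical)."""

    def __init__(self, name, alpha=0.1, limit=10.0, target_latency=0.2):
        self.name = name
        self.alpha = alpha                    # weight given to the newest sample
        self.health = 1.0                     # decayed success rate in [0, 1]
        self.limit = limit                    # adaptive concurrency limit
        self.target_latency = target_latency  # seconds; an assumed threshold
        self.in_flight = 0

    def start(self):
        self.in_flight += 1

    def finish(self, ok, latency):
        self.in_flight -= 1
        # Exponentially decaying health: old outcomes fade by (1 - alpha) per request.
        sample = 1.0 if ok else 0.0
        self.health += self.alpha * (sample - self.health)
        # Assumed limit rule: back off quickly on slow responses, recover slowly.
        if latency > self.target_latency:
            self.limit = max(1.0, self.limit * 0.9)
        else:
            self.limit += 0.1

def choose(backends, rng=random):
    """Weighted random choice by health, skipping backends at their limit."""
    eligible = [b for b in backends if b.in_flight < b.limit]
    if not eligible:
        # Assumption: since backpressure is not an option, still pick someone.
        eligible = list(backends)
    # A small floor keeps unhealthy servers receiving a trickle of traffic,
    # so their recovery can be observed, and avoids an all-zero weight list.
    weights = [max(b.health, 0.01) for b in eligible]
    return rng.choices(eligible, weights=weights, k=1)[0]
```

A caller would wrap each request as `b = choose(backends); b.start()`, send the request, then call `b.finish(ok, latency)` with the observed outcome, so that the health and limit values are driven entirely by real traffic.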

Update 2019-07-30: While I no longer think this precise approach is what I want, the general outlines are still good. You can read my conclusions about traffic-informed load balancing. The experimental code that I'm still working on is an evolution of the algorithm outlined here, but it replaces the buckets with a single exponentially decaying average and discards the entire fallback cascade in favor of a single weighted random selection.


My own Creepy Facebook Surveillance Moment

February 17th, 2019

I've heard any number of stories from people about creepy things Facebook or other ad systems have done. "I was talking about X with a friend, and that evening an ad for X popped up on a web page!" The insidious thing is that it *could* have just been coincidence. You can't prove anything.

Well, this week it happened to me, and I don't even use Facebook. I can't prove anything. But it's deeply disturbing. TL;DR: Blank Facebook account I opened 8.5 years ago and never used receives recommendation, out of the blue, to check out a small store I only just learned existed and started patronizing.


Image descriptions on Mastodon

January 9th, 2019

I'd like to talk a bit about 1) why image descriptions on Fedi are so great, and 2) why I sometimes reply to people's posts with a description of the image they posted without one. I was worried the latter might come across as passive-aggressive, hence this explainy-post. This was originally a series of toots, but it got too long, so it's over here. Also, I want to be able to find it again, ever.

(This post is intended for an audience of people using Mastodon or other Fedi clients, but most of it also applies to image descriptions on the internet in general.)


Work in progress: Cavern, a decentralized social media protocol

December 26th, 2018

If you follow my blog or have spent more than 5 minutes around me in the past 6 months, you know that I've been spending a lot of time thinking about social media software. Some of that thinking has been crystallizing into a prototype with the working name "cavern". In brief, I'm hoping to create an application and protocol that supports social journaling—like a blog, but with optional privacy filters; think Livejournal or Dreamwidth. Here are some of the properties I want the finished product to include:

  • Give people control over their own writing and media: Everything is created locally, and published out to friends and other contacts. Server disappears? No problem, you still have all your stuff.
  • Decentralized or distributed: Spread the software out over everyone's computers, so there's no central authority to interfere in people's digital lives. People can make their own decisions about nudity, political expression, and appropriate conduct in general.
  • Allow custom privacy levels: World-public, socially-local (n-degrees of separation), access list only, and custom access lists (Google Plus's circles, or Dreamwidth's custom filters).
  • Make use of social accountability: Encourage posting at a socially-local privacy level (not world-public) so that any adverse behavior occurs within a social context, where people already have tools for handling conflict. Bonus: Not publishing (mostly) to the entire world means a lower risk of surveillance and targeted disinformation campaigns.
  • Everyone takes on moderation duty within their own journal, rather than being subjected to the impersonal and overburdened moderation system of a central authority (e.g. Tumblr, Facebook, or Twitter).
  • Trust-less hosting: To the extent that the system relies on servers not under the user's control, it must not trust those servers to refrain from censoring or spying on the user's posts. (Cryptography is employed to this end.)

If you're interested in this vision, or even in just these general topics, I encourage you to come participate in the new Social Media Design community I'm organizing on Dreamwidth.