An experiment in repopping popcorn

July 29th, 2019

We use an air popper to make popcorn at home, and there are always a few unpopped kernels at the bottom. Far fewer than for microwave popcorn, and not enough to worry about waste-wise, but a few. I became curious about whether these were just unpopped, or actually unpoppable.

Verdict in my N=1 experiment: Yes, almost all of them can be repopped! The easiest thing is to just toss 'em back in the popper for next time. Throw 'em back, they're not big enough yet. ;-)

Read full entry »

Load balancing: Beyond healthchecks

July 21st, 2019

I became interested in finding The Perfect Load Balancer when we had a series of incidents at work involving a service talking to a database that was behaving erratically. While our first focus was on making the database more stable, it was clear to me that there could have been a vastly reduced impact to service if we had been able to load-balance requests more effectively between the database's several read endpoints.

The more I looked into the state of the art, the more surprised I was to discover that this is far from being a solved problem. There are plenty of load balancers, but many use algorithms that only work for one or two failure modes—and in these incidents, we had seen a variety of failure modes.

This post describes what I learned about the current state of load balancing for high availability, my understanding of the problematic dynamics of the most common tools, and where I think we should go from here.

(Disclaimer: This is based primarily on thought experiments and casual observations, and I have not had much luck in finding relevant academic literature. Critiques are very welcome!)

TL;DR

Points I'd like you to take away from this:

  • Server health can only be understood in the context of the cluster's health
  • Load balancers that use active healthchecks to kick out servers may unnecessarily lose traffic when healthchecks fail to be representative of real traffic health
  • Passive monitoring of actual traffic allows latency and failure rate metrics to participate in equitable load distribution
  • If small differences in server health produce large differences in load balancing, the system may oscillate wildly and unpredictably
  • Randomness can inhibit mobbing and other unwanted correlated behaviors
Read full entry »

Adaptive load balancing

March 20th, 2019

At work, I've recently run up against the classic challenge faced by anyone running a high-availability service: Load balancing in the face of failures. I'm not sure the right solution has been written in software yet, but after a good deal of hammock time and chatting with coworkers, I think I've put together an algorithm that might work.

Let's say you have a goodly sized collection of API servers each talking to a handful of backend servers, load-balancing between them. The API servers receive high request rates that necessitate calls to the backend and must be kept highly available, even if backend servers unexpectedly go down or intermediary network conditions degrade. Backpressure is not an option; you can't just send HTTP 429 Too Many Requests. Taking the load off of a backend server that is suffering is good, but that can put more pressure on the others. How do you know what failure rate means you should be shedding load? How do you integrate both latency/timeout issues and explicit errors?

Generally: How do you maximize successful responses to your callers while protecting yourself from cascading failures? How can a load-balancer understand the cluster-level health of the backend?

The short version: Track an exponentially decaying health measure for each backend server based on error rates, distribute requests proportionally to health, and skip over servers that have reached an adaptive concurrency limit based on latency measures.
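To make that summary a little more concrete, here is a minimal sketch in Python. It is not the code from the full entry. The names (`Backend`, `record_result`, `choose`), the smoothing factor `ALPHA`, and the `MIN_WEIGHT` floor are all assumptions made up for illustration, and the concurrency limit is a fixed number here rather than one adapted from latency measures.

```python
import random
from dataclasses import dataclass

ALPHA = 0.1        # assumed smoothing factor for the decaying average
MIN_WEIGHT = 0.01  # assumed floor so an unhealthy server still sees a trickle of traffic


@dataclass
class Backend:
    name: str
    health: float = 1.0  # exponentially decaying success rate, in [0, 1]
    in_flight: int = 0   # requests currently outstanding
    limit: int = 10      # concurrency limit (fixed here; adapting it from latency is not shown)


def record_result(backend, ok):
    """Fold one request outcome into the backend's decaying health measure."""
    sample = 1.0 if ok else 0.0
    backend.health += ALPHA * (sample - backend.health)


def choose(backends, rng=random):
    """Pick a backend at random, weighted by health, skipping any at their limit.

    Returns None when every backend is at its concurrency limit.
    """
    candidates = [b for b in backends if b.in_flight < b.limit]
    if not candidates:
        return None
    weights = [max(b.health, MIN_WEIGHT) for b in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]
```

A caller would increment `in_flight` before sending a request, decrement it when the request finishes, and call `record_result` with the outcome. The floor on the weight is one possible way to let a recovered server earn its health back. Without some traffic, its health measure would never be updated.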

Update 2019-07-30: While I no longer think this precise approach is what I want, the general outlines are still good. You can read my conclusions about traffic-informed load balancing. The experimental code that I'm still working on is an evolution of the algorithm outlined here, but it replaces the buckets with a single exponentially decaying average and discards the entire fallback cascade in favor of a single weighted random selection.

Read full entry »

My own Creepy Facebook Surveillance Moment

February 17th, 2019

I've heard any number of stories from people about creepy things Facebook or other ad systems have done. "I was talking about X with a friend, and that evening an ad for X popped up on a web page!" The insidious thing is that it *could* have just been coincidence. You can't prove anything.

Well, this week it happened to me, and I don't even use Facebook. I can't prove anything. But it's deeply disturbing. TL;DR: Blank Facebook account I opened 8.5 years ago and never used receives recommendation, out of the blue, to check out a small store I only just learned existed and started patronizing.

Read full entry »

Image descriptions on Mastodon

January 9th, 2019

I'd like to talk a bit about 1) why image descriptions on Fedi are so great, and 2) why I sometimes reply to people's posts with a description of the image they posted without one. I was worried the latter might come across as passive aggressive, hence this explainy-post. This was originally a series of toots, but it got too long, so it's over here. Also, I want to be able to find it again, ever.

(This post is intended for an audience of people using Mastodon or other Fedi clients, but most of it also applies to image descriptions on the internet in general.)

Read full entry »