Exponential Backoff in Ruby

ruby algorithms
Posted on: 2015-02-25

Anytime you have client programs connecting to a server, the clients need to know what to do if the server doesn't respond. The naive answer is something like "keep trying, once every second".

But this is awfully hard on the poor server. It had trouble fulfilling one request, and now you're saying "What about now? What about now? What about now!!?" The more it fails, the bigger its workload gets.

A nicer solution is this: if the server can't respond, try again in 2 seconds. If it still can't, try again in 4 seconds so it has some time to recover. If it still can't, wait 8 seconds, then 16. At some point, give up.

Here's a simple implementation in Ruby.

#!ruby
retries = 0
begin
  response = make_request
  do_stuff_with(response)
rescue SomeKindOfRequestError => e
  if retries <= max_retries
    retries += 1
    sleep 2 ** retries
    retry
  else
    raise "Giving up on the server after #{retries} retries. Got error: #{e.message}"
  end
end

There's still a problem here: suppose you've got 100 clients, and the server goes down for a few seconds. After 2 seconds, 100 requests come in simultaneously and the poor server is overloaded. It gets a short rest, but then, after 4 seconds, 100 more requests come in simultaneously and it's overloaded again.

What you'd really like is to spread those out, so that you get 100 requests spread over 2 seconds, then 100 requests spread over 4 seconds, then 8, etc. You could do that something like this:

#!ruby
retries = 0
begin
  response = make_request
  do_stuff_with(response)
rescue SomeKindOfRequestError => e
  if retries <= max_retries
    retries += 1
    max_sleep_seconds = Float(2 ** retries)
    sleep rand(0..max_sleep_seconds)
    retry
  else
    raise "Giving up on the server after #{retries} retries. Got error: #{e.message}"
  end
end

Now, on the first round, one client may sleep 1.99 seconds, one 0.38 seconds, one 1.2 seconds, etc. Each round will be spread out similarly. This gives the server more and more time to get back on its feet and doesn't keep hitting it with synchronized waves of requests.