Older posts...
Older posts...
23 Sep
Designing APIs in a resource-oriented architecture
23 Sep 2014
Designing APIs in a resource-oriented architecture
28 Sep
How I'm going to land my dream job
28 Sep 2014
How I'm going to land my dream job
1 Oct
Neural net training fail
1 Oct 2014
Neural net training fail
13 Oct
Pow + SSL without the hassle
13 Oct 2014
Pow + SSL without the hassle
17 Oct
Using machine learning to rank search results (part 1)
17 Oct 2014
Using machine learning to rank search results (part 1)
23 Oct
Using machine learning to rank search results (part 2)
23 Oct 2014
Using machine learning to rank search results (part 2)
9 Nov
Managing complexity in Go
9 Nov 2014
Managing complexity in Go
25 Nov
Remote work: an engineering leader's perspective
25 Nov 2014
Remote work: an engineering leader's perspective
19 Sep
Running A/B tests on our hosting infrastructure
19 Sep 2016
Running A/B tests on our hosting infrastructure
27 Mar
Every service is an island
27 Mar 2017
Every service is an island

Timing Rack Middlewares with metaprogramming, recursive monkey-patching, and a sprinkle of statistics

Analysing performance of your Rails or Sinatra is easy enough with New Relic, but figuring out whether the soft outer shell of your stack is under-performing is more of a challenge. We’ve written rack-timer to figure things out.

This article was crosss-posted from the HouseTrip engineering blog.

The most typical tool Ruby web stacks use to monitor runtime performance is New Relic. It does a great job at spotting what happens inside transaction, which database queries are slow, etc. It also reports on something called queue time, as the infamous green slab at the bottom of its graphs.

Misleadingly, queue time it not request queuing: it’s the delay between when a requests hits your web frontend (Apache, NGinx) and the beginning of your action. Thus, it includes three things:

  • time spent in your HTTP dispatcher (Passenger, Unicorn)
  • time spent in your Rack middlewares
  • time spent in before/after_filters (if using Rails).

We had a hunch that something awry was going on in the outer layers of our stack, and neither New Relic nor any beautiful gem gave us any intel.

The infamous green “queue time”</figcaption> </figure>

rack-timer aims at solving that by providing timings for dispatch time and middlewares.

Inside Rack-Timer

It works in a borg-like way: add it to the top of your middleware stack (if using Rails, even before Rack::Lock), and it will “assimilate” every other middleware by injecting its code recursively into every other middleware.

The RackTimer::Middleware initializer looks like this:

  def initialize(app)
    @app = app
    self.extend(Borg)
  end

Starting the chain reaction, it injects the Borg in itself; then once part of the collective (simplified code):

  module Borg
    def self.extended(object)
      object.singleton_class.class_eval do
        alias_method :call_without_timing, :call
        alias_method :call, :call_with_timing
      end

      object.instance_eval do
        recursive_borg
      end
    end

    def recursive_borg
      @app.extend(Borg)
    end

The Borg wraps the call method of each middleware it’s injected into with time logging, the injects itself in the next middleware down the stack (conventionally, @app).

At runtime, the call wrappers will transparently call the originally call then output timing information.

Outcomes

Within our biggest application, we let 40 out of 520 workers work with the Borg in place until they’d collected information about 10,000 requests—enough to provide us with statistically significant data.

With rack-timer defaults, logs go to standard error, i.e. Apache’s error.log. We grabbed those, grep’d for the timer’s output, whipped some sed(1) magic along the lines of

$ bzcat error.web[1-4].bz2 | \
  grep '^\[rack-timer\]' | \
  sed -e 's/^.rack-timer. //; s/ took /,/; s/ us$//' \
  > middlewares.csv

and voilà, raw data ready to be digested. Good old Excel then graphed the middleware timings for us:

Middleware timings</figcaption> </figure>

The reason we graph both median and mean is that the latter is sensitive to outliers, whereas the median is a “robust metric”. A significant discrepancy between the two usually hints either at a skewed, non-normal distribution, or more typically presence of extreme outlier.

None of this here—not only the middleware all respond consistently, but the aggregate median time spent in middlewares is (despite their number) a low 7.5ms.

Note the left-most pair of bars in that graph—that’s the tail of the middleware chain, i.e. the application itself (including filters).

Moving one to the queue timings with another handful of sed magic. This time Excel doesn’t cut it, but R is probably an old friend to anyone serious with performance analysis, and stats in general.

Distribution of the queueing timings used to look like this:

(horizontally: log10 of the queuing time in microseconds, ie. 3 is 1ms and 6 is 1 second)</figcaption> </figure>

This is clearly bimodal: the left mode (clustered around 1ms) is expected. Passenger does need to do some work do move requests around, and 1ms is fine.

The second mode is much more worrisome: its overall area (i.e. the overall number of requests spent in that failure mode) is about half of the total, and it’s fairly progressive (not a nice bell curve at all), hinting at some kind of step/non-linear phenomenon.

The conclusion was that something what causing progressively more queuing in some cases.

That’s when we remembered that

  1. We’d added out-of-band garbage collection to relieve the Ruby VM back in the 1.8 days
  2. We’d recently upgraded Passenger to 3.0+

It turned out our out-of-band garbage collection hack (based on rack_after_reply) was no longer compatible with Passenger: what used to be run out-of-band was now run in-band with th next request on a particular worker.

Removing the out-of-band garbage collection solved the issue:

On top of that we’ve cut our average response time by a further 15 to 20%, and obviously suffer less from the spikiness in response time due to random GC hits.

And finally, having a proper excuse to write an arguably dangerous metaprogramming / recursive monkey-patching combo was fun!