Rigorous Performance Testing - Modern Testing Tools

Grant Ellis

This is my second blog post in a series of three. If you haven’t already read my prior post, Rigorous Performance Testing – How We Got Here, then mosey on over for some extra context.

A quick recap

We all know that the Internet has gone through some serious evolution over the past 25 years (really! 1989!). Data centers and hosting technologies have changed; media has changed (copper to fiber!); switches and peering points have changed; addressing and routing have changed (IP Anycast); devices have changed; content has changed.

In the last five years alone, we have seen a transition to rich, interactive, and dynamic sites and applications. Clients are accessing those applications on handheld devices instead of computers. Connectivity is largely wireless instead of wired. These are great technologies, and our lives are better for it — but these same technologies do horrible things to web performance.

Similarly, measuring performance has become quite complicated. The simple, venerable ping was sufficient while the web was in its infancy. Then, as bandwidth demands grew, we needed to use HTTP-aware testing tools like cURL. With the adoption of the commercial web, paradigms changed and it became important to measure whole pages with tools like Mercury LoadRunner (now HP LoadRunner).

When CDNs started helping the middle-mile with decentralized infrastructure, the testing tools themselves needed to decentralize in order to capture performance data with the CDNs in-line. Gomez (now Compuware) and Keynote stepped in with browser-based testing agents distributed all over the middle-mile (“backbone”) of the Internet.

User experience metrics for the modern web

Now, the web is filled with super-dynamic sites and applications. All of these applications are dynamic on the client-side as well as the server-side. The browser mechanics of a modern application are complicated in themselves, and so testing methodologies have become more sophisticated. One huge differentiator is which performance metrics are tracked.

Fully Loaded

Prior testing tools would simply start a timer, initiate the page load, and then stop the timer once the underlying internet connection went idle. In the “Web 1.0” world, this was a sufficient test — the browser needed all the content in order to actually render the page and get that user happily “using.” On the modern Web “2.0+,” pages don’t need everything in order to be actually functional. Secondary and/or personalized content may be loaded asynchronously (for example, below-the-fold loading), but the page may be fully functional beforehand. Tertiary backend functions like analytics beacons have no bearing on function from the user’s perspective. With these points in mind, internet connection idleness is no reflection of user experience, and Fully Loaded has become less relevant.

Document Complete

The Document Complete event is fired in the browser when, well, when the document is complete. Generally, this means that the page is visually complete, responsive to the user (user can search, scroll, click links, etc.). However, the browser may still be loading asynchronous content or firing beacons — see Fully Loaded above.

Beware: this metric is imperfect as well. Some sites deliberately defer loading of prominent content until after Document Complete.


Some Front-End Optimization (FEO) packages can defer execution of JavaScript until after Document Complete.

Script deferral can be hugely misleading. Visual completeness may occur sooner, and Document Complete may be significantly improved as well. Testers will even see evidence of the visual completeness in videos, filmstrips, and screen shots.

However, despite visual completeness, the page may not be responsive until long after Document Complete — users may not be able to click links, scroll, or search. From a user's perspective, this is hugely frustrating and contributes to bounce rates. Imagine if someone swapped your browser window for a screenshot, and you kept trying to click links but nothing happened!

Perhaps more importantly, this tactic improves Document Complete, but only at the cost of making the metric meaningless altogether! One of the primary tenets of Document Complete is that the page is ready for the user. With script deferral, the page is not ready for the user — even if it looks ready.  

Visually Complete

Visually Complete is the moment that all visual elements are painted on the screen and visible to the user. Note that visually complete is not the same as functionally complete. See the “beware” block above!

Start Render (or Render Start)

The Start Render event is fired in the browser when something (anything!) is first painted on the screen. The paint event may be the whole page — but it could instead be a single word, single image, or single pixel. That may not sound significant — after all, if the content is not there and the user can’t interact, then what is the value?

Keep in mind that, before Start Render fires, the user is staring at a blank white browser screen or, worse, the prior page they just tried to navigate away from. From the user’s perspective, Start Render is the moment that the web site is clearly working properly.

There is significant evidence that Abandonment (bounce rate) is correlated very strongly with slow Start Render timings. Arguably, Start Render is the most important metric of all.
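In modern browsers, the closest programmatic analogue to Start Render is the first-paint entry from the Paint Timing API. A minimal sketch, assuming a browser that supports that API; the helper function below is ours, not part of any tool:

```javascript
// Given an array of PerformanceEntry-like paint records (the shape returned
// by performance.getEntriesByType('paint') in supporting browsers), return
// the Start Render approximation: the 'first-paint' timestamp in ms.
function startRenderTime(paintEntries) {
  var entry = paintEntries.find(function (e) { return e.name === 'first-paint'; });
  return entry ? entry.startTime : null; // null if nothing has painted yet
}

// In a browser you would call:
//   startRenderTime(performance.getEntriesByType('paint'));
```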

First Byte

When the browser requests the base page, that request must traverse the Internet (whether or not a CDN is in play), then the hosting facility must fetch (or assemble) the page, then the response must traverse the Internet again back to the device requesting the page. First Byte is the time it takes for the first byte of the response to reach the browser. So, First Byte is a function of twice network latency plus server latency. Other factors, like packet loss, may also impact this metric.

First Byte is invisible to your users. However, the metric is still important because it is on the critical path for all browser functions.
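The arithmetic above can be sketched directly. The helper names here are illustrative, not from any tool; the measured form uses Navigation Timing fields as exposed by `performance.timing` in the browser:

```javascript
// Expected First Byte from its components (all times in ms).
// oneWayLatencyMs: time for a packet to traverse the network one way.
// The request crosses the network, the server works, the response crosses back.
function expectedFirstByte(oneWayLatencyMs, serverLatencyMs) {
  return 2 * oneWayLatencyMs + serverLatencyMs;
}

// Measured First Byte from a Navigation Timing–style object.
function measuredFirstByte(timing) {
  return timing.responseStart - timing.requestStart;
}
```

So a site 40 ms away (one way) with a 120 ms server response shows roughly a 200 ms First Byte, before accounting for packet loss and other factors.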

Speed Index

The Speed Index is a metric peculiar to WebPageTest (more on that below). Loosely speaking, the Speed Index is the average time at which visual components are painted on the screen. More technically: if we plotted visual completeness over time and measured the area above the curve, we would have the Speed Index. That is, the Speed Index is the integral of (1 − visual completeness) over time.

Pages with a faster Start Render and a faster Visually Complete have a greater percentage of the screen painted at any given time, so the area above the curve is smaller, and the Speed Index is smaller (lower is better).
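As a sketch, that integral can be approximated from a series of (time, visual completeness) samples, treating completeness as a step function between samples. This mirrors the idea, not WebPageTest's exact implementation:

```javascript
// samples: array of [timeMs, completeness 0..1], sorted by time, starting at t=0.
// endMs: the time at which the page reaches full visual completeness.
// Returns the Speed Index: the area above the visual completeness curve.
function speedIndex(samples, endMs) {
  var area = 0;
  for (var i = 0; i < samples.length; i++) {
    var t0 = samples[i][0];
    var t1 = (i + 1 < samples.length) ? samples[i + 1][0] : endMs;
    area += (1 - samples[i][1]) * (t1 - t0); // incomplete fraction × interval
  }
  return area;
}
```

A page that paints 80% of the viewport at 1 second and finishes at 2 seconds scores 1200; painting everything only at 2 seconds would score 2000, so earlier paints yield a lower (better) index.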

WebPageTest has excellent technical documentation on the Speed Index.

Note again that a fast Speed Index is not the same as a functional page. See the “beware” block above!

Tools that support user experience metrics

Real User Monitoring (RUM) Tools

Middle-mile (or “backbone”) testing tools are great for measuring availability from the broader Internet, but they never reflect the experience your users are actually seeing — especially those using wireless connectivity (even Wi-Fi!).

RUM Tools are the best way to fill this gap. Basically, performance data is collected from your end users as they browse your site. RUM tools track all of the above metrics (except Speed Index) and represent exactly what your users are seeing (with one or two exceptions; see below). RUM tools are really easy to install: just paste in a JavaScript tag.
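At its core, a RUM tag reads the browser's Navigation Timing data and beacons it home. A minimal sketch; the payload shape and the /rum-beacon endpoint are hypothetical, and real tools like Boomerang.js do considerably more:

```javascript
// Build a metrics payload from a Navigation Timing–style object
// (performance.timing in the browser). All values in ms since navigation start.
function buildRumPayload(timing) {
  var start = timing.navigationStart;
  return {
    firstByte: timing.responseStart - start,
    documentComplete: timing.loadEventStart - start, // onload ≈ Document Complete
    fullyLoaded: timing.loadEventEnd - start // rough stand-in; true Fully Loaded waits for network idle
  };
}

// In a browser, the tag would then ship the payload after onload, e.g.:
//   window.addEventListener('load', function () {
//     setTimeout(function () {
//       navigator.sendBeacon('/rum-beacon', JSON.stringify(buildRumPayload(performance.timing)));
//     }, 0);
//   });
```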


Pros:

  • True user experience.
  • Easy set-up.
  • Support for a broad range of browsers and devices.
  • Collects data from various real-world connection types, including high-latency wireless and packet-loss scenarios.
  • Open-source tools are available (Boomerang.js).


Cons:

  • The act of measuring performance itself hurts performance: inserting a third-party tag adds some overhead.
  • Safari doesn’t support the browser APIs on which RUM tools depend. Data for Safari browsers will be a subset of the metrics above, and remaining metrics are approximated using JavaScript timers rather than hyper-accurate native browser code.
  • Outliers can be extreme and must be removed before interpreting aggregate data.
  • RUM requires live traffic. It is not possible to use RUM to measure the performance of a site pre-launch.
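One common approach to that outlier problem is interquartile-range (IQR) fencing: discard samples beyond 1.5 × IQR outside the quartiles. A sketch; the 1.5 multiplier is the conventional choice, not a universal rule:

```javascript
// Remove extreme outliers from an array of timing samples using IQR fencing.
function trimOutliers(samples) {
  var sorted = samples.slice().sort(function (a, b) { return a - b; });
  // Percentile by linear interpolation over the sorted array.
  function percentile(p) {
    var idx = p * (sorted.length - 1);
    var lo = Math.floor(idx), hi = Math.ceil(idx);
    return sorted[lo] + (sorted[hi] - sorted[lo]) * (idx - lo);
  }
  var q1 = percentile(0.25), q3 = percentile(0.75);
  var iqr = q3 - q1;
  return samples.filter(function (x) {
    return x >= q1 - 1.5 * iqr && x <= q3 + 1.5 * iqr;
  });
}
```

For example, a batch of page-load times like [1, 2, 3, 4, 100] seconds keeps the first four samples and drops the 100-second straggler before aggregation.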

Synthetic tools

RUM tools are excellent for measuring performance, but sometimes we really need synthetic measurements — especially for evaluating performance of pre-production environments (code/stack).


WebPageTest

WebPageTest is an open-source, community-supported, and widely endorsed tool for measuring and analyzing performance. The testing nodes are community-sponsored and freely available; however, it is possible to set up private testing nodes for your own dedicated use. Scripting capabilities are vastly improved on private nodes.


Pros:

  • Measures user experience metrics, albeit from backbone locations.
  • Supports traffic shaping, so testers can configure specific bandwidth, latency, or packet-loss scenarios. The traffic shaping is, of course, synthetic and thus less variable than true user connections, but this is still an excellent feature and reasonably representative of real-world conditions.
  • Supports a subset of mobile clients, and a wide array of browsers.


Cons:

  • Limited testing agent geographies available.
  • Great analysis overall, but very limited statistical support.
  • Extremely difficult to monitor performance on an ongoing basis or at regular intervals for a fixed period. Testers must set up private instances and WebPageTest Monitor in order to monitor performance.
  • Nodes are not centrally managed and therefore have inconsistent base bandwidth and hardware spec. Furthermore, they can sometimes be unstable or unavailable.
  • Supports multi-step transactions only on private nodes.


Catchpoint

Catchpoint is a commercial synthetic testing package. Catchpoint has a massive collection of domestic and international testing nodes available, and a powerful statistical analysis package.


Pros:

  • Tracks user experience metrics.
  • Supports ongoing performance monitoring.
  • Easy to provision complicated tests.
  • Supports multi-step transactions.
  • Captures waterfall diagrams for detailed analysis.
  • Supports true mobile connection testing. The agents themselves are desktop machines, but they operate on wireless (Edge/3G/4G/LTE) modems.
  • Excellent statistical package.


Cons:

  • No traffic shaping available. All backbone tests have very high bandwidth and very low latency, so results are not necessarily representative of end-user performance.
  • No support for mobile devices (note that mobile connections are supported).

Keynote Systems

Keynote is also a commercial synthetic testing package. Keynote has existed for a LONG time, and formerly measured only the Fully Loaded metric. However, they have recently revised their service to measure user experience metrics like Document Complete and Start Render.


Pros:

  • Tracks user experience metrics.
  • Supports ongoing performance monitoring.
  • Easy to provision complicated tests.
  • Supports multi-step transactions.
  • Captures waterfall diagrams for detailed analysis.


Cons:

  • No traffic shaping available. All backbone tests have very high bandwidth and very low latency, so results are not necessarily representative of end-user performance.
  • No support for mobile devices.

Performance data analysis

So, you’ve picked your performance metrics and your tool, and now you have plenty of data. What are the next steps?

In the final installment of this series, we will discuss statistical analysis and interpretation of performance data sets.