Load Testing at Netflix: Virtual Interview with Coburn Watson

I exchanged several e-mails with Coburn Watson (@coburnw), Cloud Performance Engineering Manager at Netflix, and he was kind enough to share some very interesting information about load testing at Netflix. I believe this information is too valuable to keep hidden, so I decided to share it in the form of a virtual interview (with Coburn’s permission, of course).

AP: I had a conversation with Adrian Cockcroft (@adrianco) at the Performance and Capacity conference about load testing at Netflix – his presentation included a slide about using JMeter in Continuous Integration (CI). He suggested contacting you for details.

I would be interested in any details you can share. I saw your presentations on SlideShare, but you don’t talk about load testing there.

It is interesting that people don’t talk much about load testing – even when they do interesting things in that area – so it is difficult to understand what is going on in the industry. I still believe that load testing is an important part of performance risk mitigation and should be used in any mature company, but it looks like there are other opinions. Any thoughts?

CW: We do have an in-house utility for load testing that leverages JMeter as the driver and Jenkins as the coordinating framework.
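To make that setup a bit more concrete, here is a minimal sketch of what a coordinating CI job driving JMeter might look like. The test plan name, results file, thresholds, and pass/fail rule are my assumptions for illustration; Netflix’s in-house utility is not public.

```python
# Hypothetical sketch of a CI step (e.g., triggered by Jenkins) that drives
# JMeter in non-GUI mode and evaluates the results. File names and thresholds
# are assumptions, not Netflix's actual tooling.
import csv
import subprocess
import sys

TEST_PLAN = "service_load_test.jmx"   # assumed JMeter test plan
RESULTS = "results.jtl"               # JMeter results file (CSV format)

# Run JMeter in non-GUI mode, as a CI job typically would.
subprocess.run(["jmeter", "-n", "-t", TEST_PLAN, "-l", RESULTS], check=True)

# Evaluate the results against simple, assumed pass criteria.
samples, errors, total_elapsed = 0, 0, 0
with open(RESULTS, newline="") as f:
    for row in csv.DictReader(f):
        samples += 1
        total_elapsed += int(row["elapsed"])
        if row["success"].lower() != "true":
            errors += 1

error_rate = errors / samples
avg_ms = total_elapsed / samples
print(f"samples={samples} error_rate={error_rate:.2%} avg_latency={avg_ms:.0f}ms")

# Fail the build if the assumed thresholds are exceeded.
if error_rate > 0.01 or avg_ms > 500:
    sys.exit(1)
```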

The load testing we perform at Netflix takes two primary forms, both driven by the service-oriented push model we operate under. With such a model, teams push anywhere from every two days to every three weeks, running canaries to identify both functional and performance risk that manifests under production load characteristics. In such a push model the time required to execute a formal performance test with each deployment simply isn’t there. This leads us to use the load testing framework to test a new, or significantly refactored, service under production load to characterize its performance and scalability. Typically it lets us validate whether we need EVCache (memcached) in addition to Cassandra, and perhaps additional optimizations. Of the multiple new or refactored services deployed last year (we have over a hundred distinct services), most were load tested by the service teams who developed them, and through such testing all of them made it into production without a hitch. The second use of our load testing framework is to evaluate architectural or configuration changes (such as cross-coast load proxying) and quantify their effect on system performance.

One additional use case involves an in-house benchmark suite used to quantify the variability of performance under identical load across multiple instances of the same type or, in some cases, across instance types (e.g., m2.2xl vs. m3.2xl).
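To illustrate the kind of analysis such a suite enables, here is a small sketch that summarizes run-to-run variability per instance. The latency numbers are made up; Netflix’s actual benchmark suite and metrics are not public.

```python
# Illustrative sketch (not Netflix's actual benchmark suite) of quantifying
# variability for identical load across instances. The sample data is
# hypothetical; in practice the latencies would come from the benchmark runs.
from statistics import mean, stdev

# Hypothetical p99 latencies (ms) from the same benchmark on several instances.
results = {
    "m2.2xlarge-a": [41.0, 43.5, 40.2, 44.1],
    "m2.2xlarge-b": [47.9, 52.3, 49.0, 51.1],
    "m3.2xlarge-a": [33.4, 34.0, 33.1, 35.2],
}

for instance, latencies in results.items():
    m, s = mean(latencies), stdev(latencies)
    cv = s / m  # coefficient of variation: spread relative to the mean
    print(f"{instance}: mean={m:.1f}ms stdev={s:.1f}ms cv={cv:.1%}")
```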

Having been through a few companies with quarterly or annual release cycles before landing at Netflix, I feel I have now seen the full spectrum of software development approaches. In the former model (quarterly or annual releases) it definitely makes sense to have formalized load testing as part of the development cycle, particularly when it’s a shipped product. Without such testing the risk to the customer is too great, and a regression might not be caught until the product is in the customers’ hands. Running a SaaS gives Netflix great flexibility. In the Netflix model releases are backwards compatible, so if a push into production results in a significant performance regression (one that escaped the canary analysis phase), we simply spin up instances on the old code base and take the new code out of the traffic path. I also believe the fact that we don’t “upgrade” our systems extends our flexibility and is only possible because we run on the cloud.

AP: Thank you very much for the detailed reply! So you don’t include load testing as a step in everyday CI, using canaries instead? And you believe that load testing doesn’t add value for incremental service changes (on top of canaries)? A couple of concerns I see: 1) small performance changes won’t be seen due to variation in the production workload; 2) you accept the risk that users routed to the canary will be exposed to performance issues or failures.

CW: Our production workload tends to be quite consistent week over week, and running a canary as part of the production cluster guarantees it sees exactly the same workload traffic pattern and behavior – something very difficult to get right (or even close, in most cases) with a load test. We have an “automated canary analysis” (ACA) framework that many services adopt. As part of a push it deploys a multi-instance canary cluster alongside the production cluster (considered the baseline). Approximately 300 metrics (in the base configuration) are evaluated over a period of many hours, comparing the canary and baseline data. It then scores the canary from 0 to 1. The score is a sliding scale and represents the risk of pushing the code base into production; higher scores indicate minimal risk to a full push. These scoring guidelines and their interpretation vary by service and are constantly evolving. When applied effectively we have seen it identify many performance problems that would have been difficult to detect in a load test.
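To make the scoring idea concrete, here is a minimal sketch of comparing per-metric canary data against the baseline and rolling the results up into a 0–1 score. The real ACA framework is far more sophisticated (hundreds of metrics, hours of data, statistical classification); the simple tolerance rule below is purely an assumption for illustration.

```python
# Minimal sketch of the idea behind automated canary analysis: compare each
# metric between canary and baseline and report the fraction that look healthy.
# The tolerance-based rule is an assumption, not Netflix's actual algorithm.
from statistics import mean

def score_canary(baseline: dict, canary: dict, tolerance: float = 0.10) -> float:
    """Return the fraction of metrics where the canary mean stays within
    `tolerance` (relative) of the baseline mean."""
    passed = 0
    for name, baseline_samples in baseline.items():
        b, c = mean(baseline_samples), mean(canary[name])
        if b == 0 or abs(c - b) / b <= tolerance:
            passed += 1
    return passed / len(baseline)

# Hypothetical metric samples collected over the analysis window.
baseline_metrics = {"latency_p99_ms": [120, 125, 118], "error_rate": [0.002, 0.003, 0.002]}
canary_metrics   = {"latency_p99_ms": [124, 130, 121], "error_rate": [0.002, 0.004, 0.003]}

score = score_canary(baseline_metrics, canary_metrics)
print(f"canary score: {score:.2f}")  # closer to 1.0 => lower risk for a full push
```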

One practice which isn’t yet widely adopted but is used consistently by our edge teams (who push most frequently) is automated squeeze testing. Once the canary has passed the functional and ACA analysis phases, production traffic is differentially steered against the canary at an increased rate, ramping up in well-defined steps. As the request rate goes up, key metrics are evaluated to determine effective carrying capacity, automatically determining whether that capacity has decreased as part of the push.
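Here is a sketch of that stepped-ramp idea: increase the rate against the canary in defined steps and record the highest rate that still meets service-level thresholds. The step sizes, SLO limits, and the measurement helper are hypothetical; the actual traffic steering and metric collection live in Netflix’s internal tooling.

```python
# Sketch of squeeze testing: steer progressively more traffic at the canary in
# well-defined steps and find its effective carrying capacity. Thresholds and
# the measurement function are assumptions for illustration only.
from dataclasses import dataclass

@dataclass
class StepResult:
    rps: int                # requests/second steered at the canary
    p99_latency_ms: float
    error_rate: float

def within_slo(r: StepResult) -> bool:
    # Assumed SLO: p99 under 200 ms and error rate under 0.5%.
    return r.p99_latency_ms < 200 and r.error_rate < 0.005

def squeeze_test(measure_canary, steps=(100, 200, 400, 600, 800, 1000)) -> int:
    """Increase load step by step; return the highest rate that still met SLOs."""
    capacity = 0
    for rps in steps:
        result = measure_canary(rps)   # steer traffic, wait, collect metrics
        if not within_slo(result):
            break
        capacity = result.rps
    return capacity

# Example with a fake measurement function standing in for real traffic steering.
def fake_measure(rps: int) -> StepResult:
    return StepResult(rps, p99_latency_ms=50 + rps * 0.2, error_rate=0.001)

print("estimated carrying capacity:", squeeze_test(fake_measure), "req/s")
```

Comparing this estimated carrying capacity against the value from the previous push is what flags a capacity regression automatically.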

A key factor that makes such a canary-based performance analysis model work well is that we are on the cloud. If you are in a fixed-footprint deployment model, your deployed capacity is your total capacity. Given that we autoscale our services – in aggregate, many thousands of instances a day, based on traffic rates – if we have either an increase in workload or a change in the performance profile, we can easily absorb it by adding a few more instances to the cluster (the push of a button). We also run slightly over-provisioned in most configurations to absorb the loss of an availability zone (AZ) within a given AWS region, so for most services we already have flex carrying capacity in place to absorb minor performance regressions. Overall, the architecture, supporting frameworks, and flexibility of the cloud make all this possible for us. So even though I will gladly stand up in support of such a performance risk mitigation strategy, it’s not for everyone – unless, of course, they move to the cloud and adopt our architecture. It also works for a SaaS provider, but probably not for someone who is shipping off-the-shelf software.
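As a back-of-the-envelope illustration of that flex capacity (the instance counts and AZ count below are hypothetical, not Netflix figures): if a region must survive the loss of one AZ, the surviving AZs have to carry the full load, and the resulting headroom is what also cushions minor performance regressions.

```python
# Hypothetical over-provisioning arithmetic for surviving the loss of one AZ.
import math

def instances_per_az(required_instances: int, num_azs: int) -> int:
    """Instances to run in each AZ so that losing one AZ still leaves enough."""
    surviving_azs = num_azs - 1
    return math.ceil(required_instances / surviving_azs)

required = 90   # assumed instances needed to serve peak regional traffic
azs = 3
per_az = instances_per_az(required, azs)
total = per_az * azs
print(f"{per_az} per AZ, {total} total ({total - required} instances of flex capacity)")
```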

I cover the ACA framework in my Surge presentation from 2013. If you have time I would recommend watching it, as it provides much more context around how Netflix optimizes for engineering velocity, even if it results in occasional failures. The benefit of our model is that failures tend to be very limited in scope, are identified quickly, and in many cases our robust retry and fallback frameworks fully insulate end users from any negative experience.

AP: Actually, canary testing is performance testing that uses real users to create load instead of creating synthetic load with a load testing tool. It makes sense in your case when 1) you have very homogeneous workloads and can control them; 2) potential issues have minimal impact on user satisfaction and the company’s image, and you can easily roll back the changes; 3) you have a fully parallel and scalable architecture. You just trade the need to generate (and validate) the workload for the possibility of minor issues and minor load variability. I guess the further you are from these conditions, the more questionable such a practice would be.

By the way, I guess a major part of ACA could be used with generated load too – the analysis should be the same regardless of how you apply the load. Is there more information available about ACA anywhere (beyond your presentation)? Are there any plans to open it to the public in any way?

CW: For us canary testing represents both a functional and a performance testing phase. I do agree that if a customer’s architecture and push model differ significantly from ours, then canary testing might not be a great approach, though it could still bring value. It wasn’t an accident that we arrived at such a model; many intentional choices were made to get here, and the benefits to both engineering velocity and system reliability are incredible.

I am hopeful that more details on our ACA framework will make it out to the public, but there are no guaranteed timelines.
