Caching for a Global Netflix

#CachesEverywhere

Netflix Technology Blog

Netflix members have come to expect a great user experience when interacting with our service. There are many things that go into delivering a customer-focused user experience for a streaming service, including an outstanding content library, an intuitive user interface, relevant and personalized recommendations, and a fast service that quickly gets your favorite content playing at very high quality, to name a few.

The Netflix service heavily embraces a microservice architecture that emphasizes separation of concerns. We deploy hundreds of microservices, with each focused on doing one thing well. This allows our teams and the software systems they produce to be highly aligned while being loosely coupled. Many of these services are stateless, which makes it easier to (auto)scale them. They often achieve the stateless loose coupling by maintaining state in caches or persistent stores.

EVCache is an extensively used data-caching service that provides the low-latency, high-reliability caching solution that the Netflix microservice architecture demands.

It is a RAM store based on memcached, optimized for cloud use. EVCache typically operates in contexts where consistency is not a strong requirement. Over the last few years, EVCache has been scaled to significant traffic while providing a robust key-value interface. At peak, our production EVCache deployments routinely handle upwards of 30 million requests/sec, storing hundreds of billions of objects across tens of thousands of memcached instances. This translates to just under 2 trillion requests per day globally across all EVCache clusters.
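
To make the key-value interface concrete, here is a minimal sketch of the kind of client API an application sees, along with the common pattern of fronting a persistent store with the cache. The EvCacheClient interface, method names, and TTL values below are simplified illustrations, not the exact open-source EVCache API.

```java
// Hypothetical, simplified view of a key-value cache interface in the spirit of EVCache.
// The real open-source client exposes a richer API (TTL defaults, futures, batch operations).
public interface EvCacheClient {
    void set(String key, byte[] value, int ttlSeconds);   // write or overwrite a value
    byte[] get(String key);                                // returns null on a cache miss
    void delete(String key);                               // invalidate a key
    void touch(String key, int ttlSeconds);                // extend a key's TTL
}

class ViewingHistoryLookup {
    private final EvCacheClient cache;

    ViewingHistoryLookup(EvCacheClient cache) {
        this.cache = cache;
    }

    byte[] lookup(String memberId) {
        byte[] cached = cache.get("viewinghistory:" + memberId);
        if (cached != null) {
            return cached;                                 // fast path: served from RAM
        }
        byte[] fresh = loadFromPersistentStore(memberId);  // slow path: the backing store
        cache.set("viewinghistory:" + memberId, fresh, 900);
        return fresh;
    }

    private byte[] loadFromPersistentStore(String memberId) {
        // placeholder for the persistent-store read
        return new byte[0];
    }
}
```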

Earlier this year, Netflix launched globally in 130 additional countries, making it available in nearly every country in the world. In this blog post we talk about how we built EVCache’s global replication system to meet Netflix’s growing needs. EVCache is open source and has been in production for more than 5 years. To read more about EVCache, check out one of our earlier blog posts.

Motivation

Netflix’s global, cloud-based service is spread across three Amazon Web Services (AWS) regions: Northern Virginia, Oregon, and Ireland. Requests are mostly served from the region the member is closest to. But network traffic can shift around for various reasons, including problems with critical infrastructure or region failover exercises (“Chaos Kong”). As a result, we have adopted a stateless application server architecture which lets us serve any member request from any region.

The hidden requirement in this design is that the data or state needed to serve a request is readily available anywhere. High-reliability databases and high-performance caches are fundamental to supporting our distributed architecture. One use case for a cache is to front a database or other persistent store. Replicating such caches globally helps with the “thundering herd” scenario: without global replication, member traffic shifting from one region to another would encounter “cold” caches for those members in the new region. Processing the cache misses would lengthen response times and overwhelm the databases.

Another major use case for caching is to “memoize” data which is expensive to recompute, and which doesn’t come from a persistent store. When the compute systems write this kind of data to a local cache, the data has to be replicated to all regions so it’s available to serve member requests no matter where they originate. The bottom line is that microservices rely on caches for fast, reliable access to multiple types of data like a member’s viewing history, ratings, and personalized recommendations. Changes and updates to cached data need to be replicated around the world to enable fast, reliable, and global access.

EVCache was designed with these use cases in mind. When we embarked upon the global replication system design for EVCache, we also considered non-requirements. One non-requirement is strong global consistency. It’s okay, for example, if Ireland and Virginia occasionally have slightly different recommendations for you as long as the difference doesn’t hurt your browsing or streaming experience. For non-critical data, we rely heavily on this “eventual consistency” model for replication, where local or global differences are tolerated for a short time. This simplifies the EVCache replication design tremendously: it doesn’t need to deal with global locking, quorum reads and writes, transactional updates, partial-commit rollbacks, or other complications of distributed consistency.

We also wanted to make sure the replication system wouldn’t affect the performance and reliability of local cache operations, even if cross-region replication slowed down. All replication is asynchronous, and the replication system can become latent or fail temporarily without affecting local cache operations.

Replication latency is another loose requirement. How fast is fast enough? How often does member traffic switch between regions, and what is the impact of inconsistency? Rather than demand the impossible from a replication system (“instantaneous and perfect”), what Netflix needs from EVCache is acceptable latency while tolerating some inconsistency — as long as both are low enough to serve the needs of our applications and members.

Cross-Region Replication Architecture

EVCache replicates data both within a region and globally. The intra-region redundancy comes from a simultaneous write to all server groups within the region. For cross-region replication, the key components and their interactions are described below.

The following steps trace the replication path for a SET operation. An application calls set() on the EVCache client library, and from there the replication path is transparent to the caller; a code sketch of the first two steps follows the list.

  1. The EVCache client library sends the SET to the local region’s instance of the cache
  2. The client library also writes metadata (including the key, but not the data) to the replication message queue (Kafka)
  3. The “Replication Relay” service in the local region reads messages from this queue
  4. The Relay fetches the data for the key from the local cache
  5. The Relay sends a SET request to the remote region’s “Replication Proxy” service
  6. In the remote region, the Replication Proxy receives the request and performs a SET to its local cache, completing the replication
  7. Local applications in the receiving region will now see the updated value in the local cache when they do a GET

This is a simplified picture, of course. For one thing, it refers only to SET — not other operations like DELETE, TOUCH, or batch mutations. The flows for DELETE and TOUCH are very similar, with some modifications: they don’t have to read the existing value from the local cache, for example.

It’s important to note that the only part of the system that reaches across region boundaries is the message sent from the Replication Relay to the Replication Proxy (step 5). Clients of EVCache are not aware of other regions or of cross-region replication; reads and writes use only the local, in-region cache instances.

Component Responsibilities

Replication Message Queue

The message queue is the cornerstone of the replication system. We use Kafka for this. The Kafka stream for a fully-replicated cache has two consumers: one Replication Relay cluster for each destination region. By having separate clusters for each target region, we de-couple the two replication paths and isolate them from each other’s latency or other issues.

If a target region goes wildly latent or completely blows up for an extended period, the buffer for the Kafka queue will eventually fill up and Kafka will start dropping older messages. In a disaster scenario like this, the dropped messages are never sent to the target region. Netflix services which use replicated caches are designed to tolerate such occasional disruptions.
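
One way to picture the one-Relay-cluster-per-destination-region layout is as independent Kafka consumer groups on the same topic: each group tracks its own offsets, so a slow or failed destination never holds back replication to the other region. The group naming below is an assumed convention for illustration.

```java
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;

// One consumer group per destination region on the same replication topic.
// Each group keeps its own offsets, so latency to one region never delays the other.
class RelayConsumerFactory {
    static KafkaConsumer<String, String> forDestination(String bootstrapServers, String destinationRegion) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        // Assumed naming convention: a distinct group id per target region.
        props.put("group.id", "evcache-relay-to-" + destinationRegion);
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        return new KafkaConsumer<>(props);
    }
}
```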

Replication Relay

The Replication Relay cluster consumes messages from the Kafka cluster. Using a secure connection to the Replication Proxy cluster in the destination region, it writes the replication request (complete with data fetched from the local cache, if needed) and awaits a success response. It retries requests which encounter timeouts or failures.

Temporary periods of high cross-region latency are handled gracefully: Kafka continues to accept replication messages and buffers the backlog when there are delays in the replication processing chain.
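
Below is a minimal sketch of the Relay loop under the same assumptions as before (a hypothetical EvCacheClient, an HTTPS endpoint on the remote Proxy, and a simple wire format); the real service batches requests, backs off between retries, and handles many more edge cases.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.Collections;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

// Sketch of the Relay: read metadata from Kafka, fetch the value locally, forward to the remote Proxy.
class ReplicationRelaySketch {
    private final KafkaConsumer<String, String> consumer;             // e.g. built by RelayConsumerFactory above
    private final EvCacheClient localCache;                           // hypothetical local cache client
    private final HttpClient httpClient = HttpClient.newHttpClient(); // reuses connections by default
    private final URI proxyEndpoint;                                  // assumed HTTPS endpoint of the remote Proxy

    ReplicationRelaySketch(KafkaConsumer<String, String> consumer, EvCacheClient localCache, URI proxyEndpoint) {
        this.consumer = consumer;
        this.localCache = localCache;
        this.proxyEndpoint = proxyEndpoint;
    }

    void run(String topic) {
        consumer.subscribe(Collections.singletonList(topic));
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
            for (ConsumerRecord<String, String> record : records) {
                String key = record.key();
                byte[] value = localCache.get(key);   // step 4: read the data from the local cache
                sendWithRetry(key, value);            // step 5: forward to the remote region
            }
        }
    }

    private void sendWithRetry(String key, byte[] value) {
        HttpRequest request = HttpRequest.newBuilder(proxyEndpoint)
                .timeout(Duration.ofSeconds(2))
                .POST(HttpRequest.BodyPublishers.ofString(encode(key, value)))
                .build();
        for (int attempt = 0; attempt < 3; attempt++) {               // retry timeouts and failures
            try {
                HttpResponse<String> response = httpClient.send(request, HttpResponse.BodyHandlers.ofString());
                if (response.statusCode() == 200) {
                    return;
                }
            } catch (Exception e) {
                // fall through and retry; a real implementation would back off and emit metrics
            }
        }
    }

    private String encode(String key, byte[] value) {
        // placeholder wire format; the actual format is not described here
        return key + ":" + java.util.Base64.getEncoder().encodeToString(value == null ? new byte[0] : value);
    }
}
```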

Replication Proxy

The Replication Proxy cluster for a cache runs in the target region for replication. It receives replication requests from the Replication Relay clusters in other regions and synchronously writes the data to the cache in its local region. It then returns a response to the Relay clusters, so they know the replication was successful.

When the Replication Proxy writes to its local region’s cache, it uses the same open-source EVCache client that any other application would use. The common client library handles all the complexities of sharding and instance selection, retries, and in-region replication to multiple cache servers.
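
Here is a hedged sketch of the Proxy side: it accepts a forwarded write and applies it through the regular cache client, so in-region sharding, retries, and redundancy are handled exactly as they would be for any application write. The HTTP endpoint, payload format, and EvCacheClient interface are illustrative assumptions.

```java
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Sketch of the Replication Proxy: accept a forwarded SET and apply it via the normal cache client.
class ReplicationProxySketch {
    private final EvCacheClient localCache;  // the same client any local application would use

    ReplicationProxySketch(EvCacheClient localCache) {
        this.localCache = localCache;
    }

    void start(int port) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/replicate", this::handle);
        server.start();
    }

    private void handle(HttpExchange exchange) throws IOException {
        // Assumed wire format from the Relay sketch: "<key>:<base64 value>".
        String body = new String(exchange.getRequestBody().readAllBytes(), StandardCharsets.UTF_8);
        int sep = body.indexOf(':');
        String key = body.substring(0, sep);
        byte[] value = Base64.getDecoder().decode(body.substring(sep + 1));

        // Synchronous local write; the client library handles sharding and in-region replication.
        localCache.set(key, value, 900 /* assumed default TTL in seconds */);

        exchange.sendResponseHeaders(200, -1);  // acknowledge so the Relay knows replication succeeded
        exchange.close();
    }
}
```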

As with many Netflix services, the Replication Relay and Replication Proxy clusters have multiple instances spread across Availability Zones (AZs) in each region to handle high traffic rates while being resilient against localized failures.

Design Rationale and Implications

The Replication Relay and Replication Proxy services, and the Kafka queue they use, all run separately from the applications that use caches and from the cache instances themselves. All the replication components can be scaled up or down as needed to handle the replication load, and they are largely decoupled from local cache read and write activity. Our traffic varies on a daily basis because of member watching patterns, so these clusters scale up and down all the time. If there is a surge of activity, or if some kind of network slowdown occurs in the replication path, the queue might develop a backlog until the scaling occurs, but latency of local cache GET/SET operations for applications won’t be affected.

As noted above, the replication messages on the queue contain just the key and some metadata, not the actual data being written. We get various efficiency wins this way. The major win is a smaller, faster Kafka deployment which doesn’t have to be scaled to hold all the data that exists in the caches. Storing large data payloads in Kafka would make it a costly bottleneck, due to storage and network requirements. Instead, the Replication Relay fetches the data from the local cache, with no need for another copy in Kafka.

Another win we get from writing just the metadata is that sometimes, we don’t need the data for replication at all. For some caches, a SET on a given key only needs to invalidate that key in the other regions — we don’t send the new data, we just send a DELETE for the key. In such cases, a subsequent GET in the other region results in a cache miss (rather than seeing the old data), and the application will handle it like any other miss. This is a win when the rate of cross-region traffic isn’t high — that is, when there are few GETs in region A for data that was written from region B. Handling these occasional misses is cheaper than constantly replicating the data.
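
A short sketch of that per-cache choice follows, with an assumed invalidateOnly configuration flag and hypothetical send methods; it is meant only to show the branch between shipping data and shipping an invalidation.

```java
// Sketch of the per-cache replication policy: full data replication vs. invalidation-only.
abstract class ReplicationPolicySketch {
    private final EvCacheClient localCache;   // hypothetical local cache client (see earlier sketch)
    private final boolean invalidateOnly;     // assumed per-cache configuration flag

    ReplicationPolicySketch(EvCacheClient localCache, boolean invalidateOnly) {
        this.localCache = localCache;
        this.invalidateOnly = invalidateOnly;
    }

    void replicate(String key) throws Exception {
        if (invalidateOnly) {
            // Cheaper when cross-region reads of this data are rare: remote regions simply
            // drop the key, and a later GET there is an ordinary cache miss for the application.
            sendRemoteDelete(key);
        } else {
            byte[] value = localCache.get(key);   // fetch the payload from the local cache
            sendRemoteSet(key, value);            // ship the data to the other region
        }
    }

    abstract void sendRemoteDelete(String key) throws Exception;
    abstract void sendRemoteSet(String key, byte[] value) throws Exception;
}
```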

Optimizations

We have to balance latency and throughput based on the requirements of each cache. The 99th percentile of end-to-end replication latency for most of our caches is under one second. Some of that time comes from a delay to allow for buffering: we try to batch up messages at various points in the replication flow to improve throughput at the cost of a bit of latency. The 99th percentile of latency for our highest-volume replicated cache is only about 400ms because the buffers fill and flush quickly.

Another significant optimization is the use of persistent connections. We found that the latency improved greatly and was more stable after we started using persistent connections between the Relay and Proxy clusters. It eliminates the need to wait for the 3-way handshake to establish a new TCP connection and also saves the extra network time needed to establish the TLS/SSL session before sending the actual replication request.

We improved throughput and lowered the overall communication latency between the Relay cluster and Proxy cluster by batching multiple messages into a single request to fill a TCP window. Ideally the batch size would vary to match the TCP window size, which can change over the life of the connection. In practice we tune the batch size empirically for good throughput. While this batching can add latency, it allows us to get more out of each TCP packet and reduces the number of connections we need to set up on each instance, thus letting us use fewer instances for a given replication demand profile.
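
Here is a rough sketch of that batching logic: accumulate encoded replication messages until either a size threshold or a small linger window is reached, then flush them as one request over an already-established connection. The thresholds and message encoding are illustrative; in practice we tune them empirically per cache.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

// Sketch of time-and-size based batching of replication messages before sending to the remote Proxy.
class BatchingSenderSketch {
    private final List<String> pending = new ArrayList<>();
    private final int maxBatchSize;        // tuned empirically, e.g. to roughly fill a TCP window
    private final Duration maxLinger;      // small buffering delay traded for throughput
    private Instant batchStartedAt = Instant.now();

    BatchingSenderSketch(int maxBatchSize, Duration maxLinger) {
        this.maxBatchSize = maxBatchSize;
        this.maxLinger = maxLinger;
    }

    synchronized void add(String encodedMessage) {
        if (pending.isEmpty()) {
            batchStartedAt = Instant.now();
        }
        pending.add(encodedMessage);
        boolean sizeReached = pending.size() >= maxBatchSize;
        boolean lingerElapsed = Duration.between(batchStartedAt, Instant.now()).compareTo(maxLinger) >= 0;
        if (sizeReached || lingerElapsed) {
            flush();
        }
    }

    private void flush() {
        if (pending.isEmpty()) {
            return;
        }
        // One request carrying many messages, sent over a persistent (kept-alive) connection
        // so no TCP or TLS handshake is paid per batch.
        sendBatchOverPersistentConnection(new ArrayList<>(pending));
        pending.clear();
    }

    private void sendBatchOverPersistentConnection(List<String> batch) {
        // placeholder for the actual network send (see the Relay sketch above)
    }
}
```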

With these optimizations we have been able to scale EVCache’s cross-region replication system to routinely handle over a million RPS at daily peak.

Challenges and Learnings

The current version of our Kafka-based replication system has been in production for over a year and replicates more than 1.5 million messages per second at peak. We’ve had some growing pains during that time. We’ve seen periods of increased end-to-end latencies, sometimes with obvious causes like a problem with the Proxy application’s autoscaling rules, and sometimes without — due to congestion on the cross-region link on the public Internet, for example.

Before we moved to Amazon VPC, one of our biggest problems was the implicit packets-per-second limit on our AWS instances. Cross that limit, and the instance experiences a high rate of TCP timeouts and dropped packets, resulting in high replication latencies, TCP retries, and failed replication requests which need to be retried later. The solution is simple: scale out. Using more instances means there is more total packets-per-second capacity. Sometimes two “large” instances are a better choice than a single “extra large,” even when the costs are the same. Moving into VPC significantly raised some limits, like packets per second, while also giving us access to other enhanced networking capabilities which allow the Relay and Proxy clusters to do more work per instance.

In order to be able to diagnose which link in the chain is causing latency, we introduced a number of metrics to track and monitor the latencies at different points in the system: from the client application to Kafka, in the Relay cluster’s reading from Kafka, from the Relay cluster to the remote Proxy cluster, and from Proxy cluster to its local cache servers. There are also end-to-end timing metrics to track how well the system is doing overall.
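
As an illustration of that hop-by-hop instrumentation, the sketch below records the worst-case latency observed at each measurement point; the hop names and the in-memory bookkeeping are hypothetical stand-ins for the timers a real metrics system would provide.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical hop-by-hop latency tracking; a real deployment would publish timers
// (with percentiles) to a metrics system rather than keep them in memory like this.
class ReplicationLatencyMetricsSketch {
    // Hops we time, mirroring the measurement points described above.
    static final String APP_TO_KAFKA   = "appToKafka";
    static final String KAFKA_TO_RELAY = "kafkaToRelay";
    static final String RELAY_TO_PROXY = "relayToProxy";
    static final String PROXY_TO_CACHE = "proxyToCache";
    static final String END_TO_END     = "endToEnd";

    private final ConcurrentHashMap<String, AtomicLong> maxMillis = new ConcurrentHashMap<>();

    void record(String hop, long startNanos) {
        long elapsed = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startNanos);
        maxMillis.computeIfAbsent(hop, h -> new AtomicLong())
                 .accumulateAndGet(elapsed, Math::max);   // track the worst case, not just the average
    }

    long maxObservedMillis(String hop) {
        AtomicLong value = maxMillis.get(hop);
        return value == null ? 0L : value.get();
    }
}
```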

At this point, we have a few main issues that we are still working through:

  • Kafka does not scale up and down conveniently. When a cache needs more replication-queue capacity, we have to manually add partitions and configure the consumers with matching thread counts and scale the Relay cluster to match. This can lead to duplicate/re-sent messages, which is inefficient and may cause more than the usual level of eventual consistency skew.
  • If we lose an EVCache instance in the remote region, latency increases as the Proxy cluster tries and fails to write to the missing instance. That latency propagates back to the Relay side, which is awaiting confirmation for each (batched) replication request. We’ve worked to reduce the time spent in this state: we detect the lost instance earlier, and we are investigating reconciliation mechanisms to minimize the impact of these situations. We have also made changes in the EVCache client that allow the Proxy instances to cope more easily with cache instances disappearing.
  • Kafka monitoring, particularly for missing messages, is not an exact science. Software bugs can cause messages not to appear in the Kafka partition, or not to be received by our Relay cluster. We monitor by comparing the total number of messages received by our Kafka brokers (on a per topic basis) and the number of messages replicated by the Relay cluster. If there is more than a small acceptable threshold of difference for any significant time, we investigate. We also monitor maximum latencies (not the average), because the processing of one partition may be significantly slower for some reason. That situation requires investigation even if the average is acceptable. We are still improving these and other alerts to better detect real issues with fewer false-positives.

Future

We still have a lot of work to do on the replication system. Future improvements might involve pipelining replication messages on a single connection for better and more efficient connection use, optimizations to take better advantage of the network TCP window size, or transitioning to the new Kafka 0.9 API. We hope to make our Relay clusters (the Kafka consumers) autoscale cleanly without significantly increasing latencies or increasing the number of duplicate/re-sent messages.

EVCache is one of the critical components of Netflix’s distributed architecture, providing globally replicated data at RAM speed so any member can be served from anywhere. In this post we covered how we took on the challenge of providing reliable and fast replication for caching systems at a global scale. We look forward to improving more as our needs evolve and as our global member base expands. As a company, we strive to win more of our members’ moments of truth, and our team helps in that mission by building highly available distributed caching systems at scale. If this is something you’d enjoy too, reach out to us — we’re hiring!

— The EVCache Team (Shashi Madappa, Vu Nguyen, Scott Mansfield, Sridhar Enugula, Allan Pratt, Faisal Zakaria Siddiqi)

Originally published at techblog.netflix.com on March 1, 2016.
