Netflix at AWS re:Invent 2017

Netflix Technology Blog · Published in Netflix TechBlog · Nov 6, 2017 · 11 min read


by Jason Chan

Netflix is excited to be heading back to Las Vegas for AWS re:Invent at the end of the month! Many Netflix engineers and recruiters will be in attendance, and we’re looking forward to meeting and reconnecting with cloud enthusiasts and Netflix OSS users. We’re posting the schedule of Netflix talks here to make it a bit easier to find our speakers at re:Invent. We’ll also have a booth on the expo floor, so please stop by and say hello!

Monday — November 27

10:45am ARC208: Walking the Tightrope: Balancing Innovation, Reliability, Security, and Efficiency
Coburn Watson, Director, Cloud Performance and Reliability Engineering

Abstract: At Netflix, we make explicit tradeoffs to balance our four key focus domains of innovation, reliability, security, and efficiency to ensure our customers, shareholders, and internal engineering stakeholders are happy. In this talk, learn the strategies behind each of our focus domains to optimize for one without detracting from another.

12:15pm SID206: Best Practices for Managing Security on AWS
Will Bengtson, Senior Security Engineer and Armando Leite of AWS

Abstract: To help prevent unexpected access to your AWS resources, it is critical to maintain strong identity and access policies and to track, effectively detect, and react to changes. In this session, you will learn how to use AWS Identity and Access Management (IAM) to control access to AWS resources and integrate your existing authentication system with IAM. We will cover how to deploy and control AWS infrastructure using code templates, including change management policies with AWS CloudFormation. Further, effectively detecting and reacting to changes in posture or adverse actions requires the ability to monitor and process events. Several AWS services enable this kind of monitoring, such as CloudTrail, CloudWatch Events, and the AWS service APIs. We will learn how Netflix uses a combination of these services to operationalize monitoring of its deployments at scale, and discuss changes made as Netflix’s deployment has grown over the years.
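
For readers who want to experiment before the session, here is a minimal sketch (not the speakers’ code) of the monitoring pattern the abstract describes: a CloudWatch Events rule that matches CloudTrail-recorded IAM changes and forwards them to a handler. The Lambda ARN and event names are illustrative assumptions.

```python
# Minimal sketch: route CloudTrail-recorded IAM changes to a Lambda handler
# via a CloudWatch Events rule. The handler ARN below is a placeholder.
import json
import boto3

events = boto3.client("events")

HANDLER_ARN = "arn:aws:lambda:us-east-1:123456789012:function:iam-change-handler"  # hypothetical

pattern = {
    "source": ["aws.iam"],
    "detail-type": ["AWS API Call via CloudTrail"],
    "detail": {
        "eventSource": ["iam.amazonaws.com"],
        "eventName": ["AttachRolePolicy", "PutRolePolicy", "CreatePolicyVersion"],
    },
}

# Create (or update) the rule, then point it at the handler.
events.put_rule(Name="watch-iam-changes", EventPattern=json.dumps(pattern), State="ENABLED")
events.put_targets(
    Rule="watch-iam-changes",
    Targets=[{"Id": "iam-change-handler", "Arn": HANDLER_ARN}],
)
```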

1:00pm MAE403: OTT: State of Play: Lessons Learned from the Big 3, Hulu, and Amazon Video
Vinod Viswanathan, Director, Media Cloud Engineering, with Robert Post of Hulu, BA Winston of Amazon Video, and Lee Atkinson of Amazon Web Services UK Ltd (MGM)

Abstract: Every evening, video streaming consumes over 70% of the internet’s bandwidth, and demand is only expected to increase as young households forego traditional pay TV for OTT services (whether live, on-demand, ad-supported, transactional, subscription, or a combination thereof). In this session, senior tech architects from Netflix, Hulu, and Amazon Video discuss lessons and best practices for hosting the largest-scale video distribution workloads and serving high traffic under demanding reliability requirements. We will dive deep into using AWS compute services more effectively for video processing workloads, using the AWS network for large-scale content distribution, and using AWS storage services to actively manage large content libraries.

Tuesday — November 28

10:45am ARC209: A Day in the Life of a Netflix Engineer III
Dave Hahn, Senior SRE

Abstract: Netflix is a large, ever-changing ecosystem serving millions of customers across the globe through cloud-based systems and a globally distributed CDN. This entertaining romp through the tech stack serves as an introduction to how we think about and design systems, the Netflix approach to operational challenges, and how other organizations can apply our thought processes and technologies. In this session, we discuss the technologies used to run a global streaming company, scaling at scale, billions of metrics, the benefits of chaos in production, and how culture affects your velocity and uptime.

11:30am CMP204: How Netflix Tunes EC2 Instances for Performance
Brendan Gregg, Senior Performance Architect

Abstract: At Netflix, we make the best use of Amazon EC2 instance types and features to create a high-performance cloud, achieving near bare-metal speed for our workloads. This session summarizes the configuration, tuning, and activities for delivering the fastest possible EC2 instances, and helps you improve performance, reduce latency outliers, and make better use of EC2 features. We show how to choose EC2 instance types, how to choose between Xen modes (HVM, PV, or PVHVM), and the importance of EC2 features such as SR-IOV for bare-metal performance. We also cover basic and advanced kernel tuning and monitoring, including the use of Java and Node.js flame graphs and performance counters.
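
As a small taste of the EC2 feature checks the talk covers, the sketch below (standard boto3; the instance ID is a placeholder) verifies whether SR-IOV and ENA enhanced networking are enabled on an instance.

```python
# Quick check: is enhanced networking (SR-IOV / ENA) enabled on an instance?
import boto3

ec2 = boto3.client("ec2")
instance_id = "i-0123456789abcdef0"  # hypothetical instance

sriov = ec2.describe_instance_attribute(InstanceId=instance_id, Attribute="sriovNetSupport")
ena = ec2.describe_instance_attribute(InstanceId=instance_id, Attribute="enaSupport")

# "simple" means Intel 82599 VF SR-IOV is enabled; EnaSupport is a boolean.
print("sriovNetSupport:", sriov.get("SriovNetSupport", {}).get("Value"))
print("enaSupport:", ena.get("EnaSupport", {}).get("Value"))
```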

8:00pm Tuesday Night Live
Greg Peters, Chief Product Officer

Wednesday — November 29

11:30am MCL317: Orchestrating Model Training for Netflix Recommendations
Faisal Siddiqi, Engineering Manager, Personalization and Data Infrastructure, Eugen Cepoi, Senior Software Engineer, and Davis Shepherd, Senior Software Engineer

Abstract: At Netflix, we use machine learning algorithms extensively to recommend relevant titles to our 100+ million members based on their tastes. Everything on the member homepage is an evidence-driven, A/B-tested experience backed by machine-learned models. These models are trained using Meson, our workflow orchestration system. Meson distinguishes itself from other workflow engines in that it can handle more sophisticated execution graphs, such as loops and parameterized fan-outs. Meson is able to schedule Spark jobs, Docker containers, bash scripts, Scala gists, and more. Meson also provides a rich visual interface for monitoring active workflows, inspecting execution logs, and so on. It has a powerful Scala DSL for authoring workflows as well as a REST API. In this talk, we focus on how Meson orchestrates the training of recommendation ML models in production and how we have re-architected it to scale up for the growing need for broad ETL applications within Netflix. As a driver for this change, we had to evolve the persistence layer for Meson. We will talk about how we migrated from Cassandra to AWS RDS backed by Aurora.
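
Meson’s Scala DSL isn’t shown in this post, so the snippet below is only a rough, hypothetical Python illustration of what a “parameterized fan-out” execution graph looks like: one step emits parameters, a training step runs per parameter, and a join step gathers the results.

```python
# Hypothetical shape of a parameterized fan-out -- not Meson's DSL.
from concurrent.futures import ThreadPoolExecutor

def select_regions():
    # Upstream step: emits the fan-out parameters (placeholder values).
    return ["us-east-1", "us-west-2", "eu-west-1"]

def train_model(region):
    # One branch of the fan-out; a real system would launch a Spark job
    # or a Docker container here instead.
    return {"region": region, "status": "trained"}

def join(results):
    # Downstream step that depends on every branch completing.
    return {r["region"]: r["status"] for r in results}

if __name__ == "__main__":
    params = select_regions()
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(train_model, params))
    print(join(results))
```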

12:15pm NET303: A Day in the Life of a Cloud Network Engineer at Netflix
Donavan Fritz, Senior Network SRE and Joel Kodama, Cloud Network SRE

Abstract: Netflix is big and dynamic. At Netflix, IP addresses mean nothing in the cloud. This is a big challenge with Amazon VPC Flow Logs. VPC Flow Log entries only present network-level information (L3 and L4), which is virtually meaningless. Our goal is to map each IP address back to an application, at scale, to derive true network-level insight within Amazon VPC. In this session, the Cloud Network Engineering team discusses the temporal nature of IP address utilization in AWS and the problem with looking at OSI Layer 3 and Layer 4 information in the cloud.
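
To make the enrichment idea concrete, here is a hypothetical sketch (not the team’s pipeline) that joins a raw VPC Flow Log record, which only carries L3/L4 fields, against an IP-to-application map; building and refreshing that map at scale is the hard part the talk addresses.

```python
# Join a raw VPC Flow Log record against an IP-to-application map.
# The mapping below is a hypothetical stand-in for a real service registry.
IP_TO_APP = {
    "10.0.1.15": "api-gateway",
    "10.0.2.201": "playback-service",
}

# Default VPC Flow Log record layout (version 2).
FLOW_LOG_FIELDS = [
    "version", "account_id", "interface_id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "packets", "bytes",
    "start", "end", "action", "log_status",
]

def enrich(record_line: str) -> dict:
    record = dict(zip(FLOW_LOG_FIELDS, record_line.split()))
    record["src_app"] = IP_TO_APP.get(record["srcaddr"], "unknown")
    record["dst_app"] = IP_TO_APP.get(record["dstaddr"], "unknown")
    return record

sample = "2 123456789012 eni-0a1b2c3d 10.0.1.15 10.0.2.201 443 34542 6 10 8400 1509840000 1509840060 ACCEPT OK"
print(enrich(sample))
```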

1:00pm ARC312: Why Regional Reservations are a Game Changer for Netflix
Rajan Mittal, Senior Analyst and Andrew Park, Senior Manager, Technology Planning and Analysis

Abstract: Learn how Netflix efficiently manages the costs associated with 150K instances spread across multiple regions and heterogeneous workloads. By leveraging internal Netflix tools, the Netflix capacity team is able to provide deep insights into how to optimize our end users’ workload placements based on financial and business requirements. In this session, we discuss the efficiency strategies and practices we have picked up operating at scale on AWS since 2011, along with best practices used at Netflix. Because many of our strategies revolve around Reserved Instances, we focus on the evolution of our Reserved Instance strategy and the recent changes after the launch of regional reservations. Regional Reserved Instances provide tremendous financial flexibility by being agnostic to instance size and Availability Zone. However, it’s anything but simple to adopt regional Reserved Instances in an environment with over 1,000 services of varying criticality combined with a global failover strategy.
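
As a starting point for exploring your own footprint, the short boto3 sketch below (not Netflix’s internal tooling) lists active Reserved Instances with regional scope, the kind that float across Availability Zones.

```python
# List active Reserved Instances whose scope is "Region".
import boto3

ec2 = boto3.client("ec2")
resp = ec2.describe_reserved_instances(
    Filters=[
        {"Name": "scope", "Values": ["Region"]},
        {"Name": "state", "Values": ["active"]},
    ]
)
for ri in resp["ReservedInstances"]:
    print(ri["ReservedInstancesId"], ri["InstanceType"], ri["InstanceCount"])
```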

1:00pm SID304: SecOps 2021 Today: Using AWS Services to Deliver SecOps
Alex Maestretti, Manager, Security Intelligence and Response Team and Armando Leite of AWS

Abstract: This talk dives deep on how to build end-to-end security capabilities using AWS. Our goal is to orchestrate AWS security services with other AWS building blocks to deliver enhanced security. We cover using AWS CloudWatch Events as a queueing mechanism for processing security events, using Amazon DynamoDB as a stateful layer that enables tailored responses to events and other ancillary functions, using DynamoDB as an attack signature engine, and using analytics with AWS Lambda to derive tailored signatures for detection. Log sources include available AWS sources as well as more traditional logs, such as syslog. The talk aims to keep slides to a minimum and demo live as much as possible. The demos come together to demonstrate an end-to-end architecture for SecOps. You’ll get a toolkit consisting of code and templates so you can hit the ground running.
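
One piece of the architecture the abstract mentions, DynamoDB as a stateful layer behind CloudWatch Events, can be sketched roughly as a Lambda handler like the one below; the table name and attribute layout are assumptions for illustration only.

```python
# Hedged sketch: a Lambda handler that records per-resource security-event
# state in DynamoDB. Table name and attributes are hypothetical.
import boto3

TABLE_NAME = "secops-event-state"  # hypothetical
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table(TABLE_NAME)

def handler(event, context):
    detail = event.get("detail", {})
    resource = detail.get("userIdentity", {}).get("arn", "unknown")
    # Upsert the latest observation for this principal.
    table.put_item(
        Item={
            "resource_arn": resource,
            "event_name": detail.get("eventName", "unknown"),
            "event_time": event.get("time", "unknown"),
            "raw_source": event.get("source", "unknown"),
        }
    )
    return {"recorded": resource}
```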

1:45pm DEV334: Performing Chaos at Netflix Scale
Nora Jones, Senior Chaos Engineer

Abstract: Chaos Engineering is described as “the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.” Going beyond Chaos Monkey, this session covers the specifics of designing a Chaos Engineering solution, how to grow your solution incrementally, both technically and culturally, the socialization and evangelism pieces that tend to get overlooked in the process, and how to get developers excited about purposefully injected failure. This session provides examples of getting started with Chaos Engineering at startups, performing chaos at Netflix scale, integrating your tools with AWS, and the road to cultural acceptance within your company. There are several different “levels” of chaos you can introduce before unleashing a full-blown chaos solution. We focus on each of these levels, so you can leave this session with a game plan you can introduce both culturally and technically.
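
As an illustration of the lowest such level, the toy decorator below injects failures into a single dependency call behind a probability knob; it is a hypothetical sketch, not Netflix’s chaos tooling.

```python
# Toy failure injection: a decorator that occasionally raises instead of
# calling the wrapped dependency. Purely illustrative.
import random
from functools import wraps

def inject_failure(probability: float, exc=RuntimeError("injected failure")):
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            if random.random() < probability:
                raise exc
            return fn(*args, **kwargs)
        return wrapper
    return decorator

@inject_failure(probability=0.05)
def call_recommendation_service(member_id: str) -> list:
    # Placeholder for a real downstream call.
    return ["title-a", "title-b"]
```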

4:45pm SID316: Using Access Advisor to Strike the Balance Between Security and Usability
Travis McPeak, Senior Security Engineer and Patrick Kelley, Senior Security Engineer

Abstract: AWS provides a killer feature for security operations teams: Access Advisor. In this session, we discuss how Access Advisor shows the services to which an IAM policy grants access and provides a timestamp for the last time that the role authenticated against that service. At Netflix, we use this valuable data to automatically remove permissions that are no longer used. By continually removing excess permissions, we can achieve a balance of empowering developers and maintaining a best-practice, secure environment.
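
IAM also exposes Access Advisor data programmatically through its service-last-accessed API; the sketch below is a generic example of listing the services a role has never used, not the automation described in the talk. The role ARN is a placeholder.

```python
# List services an IAM role has never authenticated against,
# using the service-last-accessed (Access Advisor) API.
import time
import boto3

iam = boto3.client("iam")
role_arn = "arn:aws:iam::123456789012:role/example-service-role"  # hypothetical

job = iam.generate_service_last_accessed_details(Arn=role_arn)
while True:
    details = iam.get_service_last_accessed_details(JobId=job["JobId"])
    if details["JobStatus"] != "IN_PROGRESS":
        break
    time.sleep(1)

for svc in details["ServicesLastAccessed"]:
    # LastAuthenticated is absent when the role never used the service.
    if "LastAuthenticated" not in svc:
        print("never used:", svc["ServiceNamespace"])
```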

Thursday — November 30

8:30am Werner Vogels’ Keynote
Nora Jones, Senior Chaos Engineer

11:30am NET402: Elastic Load Balancing Deep Dive and Best Practices
Andrew Spyker, Manager, Netflix Container Cloud and David Pessis of AWS

Abstract: Elastic Load Balancing automatically distributes incoming application traffic across multiple Amazon EC2 instances for fault tolerance and load distribution. In this session, we go into detail about Elastic Load Balancing configuration and day-to-day management, and also its use with Auto Scaling. We explain how to make decisions about the service and share best practices and useful tips for success.
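
As one example of the day-to-day management the session covers, the snippet below tightens health checks on an existing target group; the target group ARN and values are placeholders.

```python
# Tighten health checks on an existing ALB/NLB target group.
import boto3

elbv2 = boto3.client("elbv2")

elbv2.modify_target_group(
    TargetGroupArn="arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-service/0123456789abcdef",  # hypothetical
    HealthCheckPath="/healthcheck",
    HealthCheckIntervalSeconds=10,
    HealthyThresholdCount=2,
    UnhealthyThresholdCount=2,
)
```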

12:15pm CMP311: Auto Scaling Made Easy: How Target Tracking Scaling Policies Hit the Bullseye
Vadim Filanovsky, Performance and Reliability Engineer, and Tara Van Unen and Anoop Kapoor of AWS

Abstract: Auto Scaling allows cloud resources to scale automatically in reaction to the dynamic needs of customers, which helps to improve application availability and reduce costs. New target tracking scaling policies for Auto Scaling make it easy to set up dynamic scaling for your application in just a few steps. With target tracking, you simply select a load metric for your application, set the target value, and Auto Scaling adjusts resources as needed to maintain that target. In this session, you will learn how you can use target tracking to set up sound scaling policies “without the fuss” and improve availability under fluctuating demand. Netflix is spending $6 billion on original content this year, with shows like The Crown, House of Cards, and Stranger Things, and plenty more in the future. Hear how they’re using target tracking scaling policies to improve performance, reliability, and availability around the world at prime time, without over-provisioning and without guesswork. They will share best-practice examples of how target tracking allows their infrastructure to automatically adapt to changing traffic patterns in order to keep their audience entertained and their costs on target.
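
Setting up a target tracking policy really is a few lines; the hedged example below holds an Auto Scaling group at 50% average CPU, with the group and policy names as placeholders.

```python
# Target tracking policy: keep the group's average CPU at 50%.
import boto3

autoscaling = boto3.client("autoscaling")

autoscaling.put_scaling_policy(
    AutoScalingGroupName="my-service-v042",  # hypothetical group
    PolicyName="cpu-target-50",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        "TargetValue": 50.0,
    },
)
```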

12:15pm DAT308: A Story of Netflix and A/B Testing in the User Interface Using DynamoDB
Alex Liu, Senior Software Engineer

Abstract: Netflix runs hundreds of multivariate A/B tests a year, many of which help personalize the experience in the UI. This causes an exponential growth in the number of user experiences served to members, with each unique experience resulting in a unique JS/CSS bundle. Pre-publishing millions of permutations to the CDN for each build of each UI simply does not work at Netflix scale. In this session, we discuss how we built, designed, and scaled a brand new Node.js service, Codex. Its sole responsibility is to build personalized JS/CSS bundles on the fly for members as they move through the Netflix user experience. We’ve learned a ton about building a horizontally scalable Node.js microservice using core AWS services. Codex depends on Amazon S3 and Amazon DynamoDB to meet the streaming needs of our 100 million customers.

12:55pm CMP309: How Netflix Encodes at Scale
Rick Wong, Senior Software Engineer

Abstract: The Netflix encoding team is responsible for transcoding different types of media sources into a large number of media formats to support all Netflix devices. Transcoding these media sources has compute needs ranging from running compute-intensive video encodes to low-latency, high-volume image and text processing. The encoding service may require hundreds of thousands of compute hours to be distributed at a moment’s notice to where they are needed most. In this session, we explore the various strategies employed by the encoding service to automate management of a heterogeneous collection of Amazon EC2 Reserved Instances, resolve compute contention, and distribute capacity based on priority and workload.

4:00pm ABD401: How Netflix Monitors Applications in Real Time with Kinesis
John Bennett, Senior Software Engineer

Abstract: Thousands of services work in concert to deliver millions of hours of video streams to Netflix customers every day. These applications vary in size, function, and technology, but they all make use of the Netflix network to communicate. Understanding the interactions between these services is a daunting challenge, both because of the sheer volume of traffic and the dynamic nature of deployments. In this session, we first discuss why Netflix chose Kinesis Streams to address these challenges at scale. We then dive deep into how Netflix uses Kinesis Streams to enrich network traffic logs and identify usage patterns in real time. Lastly, we cover how Netflix uses this system to build comprehensive dependency maps, increase network efficiency, and improve failure resiliency. From this session, you’ll learn how to build a real-time application monitoring system using network traffic logs and get real-time, actionable insights.
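
A production consumer would use the Kinesis Client Library or Lambda, but the bare-bones sketch below (the stream name and IP map are hypothetical) shows the enrichment step: read flow-log records from a Kinesis stream and tag each with an application name.

```python
# Bare-bones Kinesis consumer that enriches flow-log records.
import json
import time
import boto3

kinesis = boto3.client("kinesis")
STREAM = "network-flow-logs"              # hypothetical stream name
IP_TO_APP = {"10.0.1.15": "api-gateway"}  # hypothetical mapping

shard_id = kinesis.describe_stream(StreamName=STREAM)["StreamDescription"]["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(
    StreamName=STREAM, ShardId=shard_id, ShardIteratorType="LATEST"
)["ShardIterator"]

for _ in range(10):  # bounded loop for the sketch; a real consumer runs continuously
    out = kinesis.get_records(ShardIterator=iterator, Limit=100)
    for record in out["Records"]:
        flow = json.loads(record["Data"])  # one flow-log entry as JSON
        flow["src_app"] = IP_TO_APP.get(flow.get("srcaddr"), "unknown")
        print(flow)                        # stand-in for publishing downstream
    iterator = out["NextShardIterator"]
    time.sleep(1)
```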

4:00pm ARC321: Models of Availability
Casey Rosenthal, Traffic, Chaos, and Intuition Engineering Manager

Abstract: When engineering teams take on a new project, they often optimize for performance, availability, or fault tolerance. More experienced teams can optimize for these variables simultaneously. Netflix adds an additional variable: feature velocity. Most companies try to optimize for feature velocity through process improvements and engineering hierarchy, but Netflix optimizes for feature velocity through explicit architectural decisions. Mental models of approaching availability help us understand the tension between these engineering variables. For example, understanding the distinction between accidental complexity and essential complexity can help you decide whether to invest engineering effort into simplifying your stack or expanding the surface area of functional output. The Chaos team and the Traffic team interact with other teams at Netflix under an assumption of Essential Complexity. Incident remediation, approaches to automation, and diversity of engineering can all be understood through the perspective of these mental models. With insight and diligence, these models can be applied to improve availability over time and drift into success.

Friday — December 1

8:30am ABD319: Tooling Up For Efficiency: DIY Solutions @ Netflix
Andrew Park, Senior Manager, Technology Planning and Analysis and Sebastien de Larquier, Manager, Science and Analytics

Abstract: At Netflix, we have traditionally approached cloud efficiency from a human standpoint, whether it be in-person meetings with the largest service teams or manually flipping reservations. Over time, we realized that these manual processes are not scalable as the business continues to grow. Therefore, in the past year, we have focused on building out tools that allow us to make more insightful, data-driven decisions around capacity and efficiency. In this session, we discuss the DIY applications, dashboards, and processes we built to help with capacity and efficiency. We start at the ten-thousand-foot view to understand the unique business and cloud problems that drove us to create these products, and discuss implementation details, including the challenges encountered along the way. Tools discussed include Picsou, the successor to our AWS billing file cost analyzer; Libra, an easy-to-use reservation conversion application; and cost and efficiency dashboards that relay useful financial context to 50+ engineering teams and managers.

10:00am ABD320: Netflix Keystone SPaaS — Real-time Stream Processing as a Service
Monal Daxini, Engineering Manager

Abstract: Over 100 million subscribers from over 190 countries enjoy the Netflix service. This leads to over a trillion events, amounting to 3 PB, flowing through the Keystone infrastructure to help improve customer experience and glean business insights. The self-serve Keystone stream processing service processes these messages in near real-time with at-least once semantics in the cloud. This enables the users to focus on extracting insights, and not worry about building out scalable infrastructure. In this session, I share the benefits and our experience building the platform.

Learn more about how Netflix designs, builds, and operates our systems and engineering organizations.