Netflix Cloud Security SIRT releases Diffy: A Differencing Engine for Digital Forensics in the Cloud

Published in Netflix TechBlog, Jul 17, 2018

Forest Monsen and Kevin Glisson, Netflix Security Intelligence and Response Team

Can you spot the difference? Hint: it’s not the bow tie.

The Netflix Security Intelligence and Response Team (SIRT) announces the release of Diffy under an Apache 2.0 license. Diffy is a triage tool that helps digital forensics and incident response (DFIR) teams quickly identify the compromised hosts on which to focus their response during a security incident in cloud architectures.

Features

  • Efficiently highlights outliers in security-relevant instance behavior. For example, you can use Diffy to tell you which of your instances are listening on an unexpected port, are running an unusual process, include a strange crontab entry, or have inserted a surprising kernel module.
  • Uses one, or both, of two methods to highlight differences: 1) Collection of a “functional baseline” from a “clean” running instance, against which your instance group is compared, and 2) a “clustering” method, in which all instances are surveyed, and outliers are made obvious.
  • Uses a modular plugin-based architecture. We currently include plugins for collection using osquery via AWS EC2 Systems Manager (formerly known as Simple Systems Manager or SSM).

Why is Diffy useful?

DFIR teams work in a variety of environments to quickly address threats to the enterprise. When operating in a cloud environment, our ability to work at scale and at speed becomes critical. Can we still operate? Do we have what we need?

When moving through systems, attackers may leave artifacts — signs of their presence — behind. As an incident responder, if you’ve found one or two of these on disk or in memory, how do you know you’ve found all the instances touched by the attackers? Usually this is an iterative process; after finding the signs, you’ll search for more on other instances, then use what you find there to search again, until it seems like you’ve got them all. For DFIR teams, quickly and accurately “scoping a compromise” is critical, because when it’s time to eradicate the attackers, it ensures you’ll really kick them out.

Since we don’t yet have a system query tool (such as osquery) broadly deployed to quickly and easily interrogate large groups of individual instances, we realized that in cases like these we would have difficulty determining exactly which instances needed closer examination and which we could leave for later.

We’ve scripted solutions using SSH, but we also wanted to create an easier, more repeatable way to address the issue.

How does Diffy work?

Diffy finds outliers among a group of very similar hosts (e.g. the instances in an AWS Auto Scaling Group, or ASG) and highlights those for a human investigator, who can then examine those hosts more closely. More importantly, Diffy helps an investigator avoid wasting time on forensics for hosts that don’t need close examination.

How does Diffy do this? Diffy implements two methods to find outliers: a “functional baseline” method (implemented now), and a “clustering” method (to be implemented soon).

Functional baseline

How does the “functional baseline” method work?

  • Osquery table output representing system state is collected from a single newly-deployed representative instance and stored for later comparison.
  • During an incident, osquery table output is collected from all instances in an application group.
  • Instances are compared to the baseline. Suspicious differences are highlighted for the investigator’s follow-up.
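
The comparison step above can be sketched as a set difference. This is a minimal illustration, not Diffy's actual code: the function names (`normalize`, `diff_against_baseline`), the instance IDs, and the row layout (one dict of string columns per osquery result row) are assumptions made for the example.

```python
from typing import Dict, List, Set, Tuple

# One osquery-style result row, converted to a hashable, order-independent form.
Row = Tuple[Tuple[str, str], ...]


def normalize(rows: List[Dict[str, str]]) -> Set[Row]:
    """Turn a list of row dicts into a set of hashable rows."""
    return {tuple(sorted(row.items())) for row in rows}


def diff_against_baseline(
    baseline: List[Dict[str, str]],
    observed: Dict[str, List[Dict[str, str]]],
) -> Dict[str, List[Dict[str, str]]]:
    """Return, per instance, the rows that are absent from the baseline.

    Instances whose rows all appear in the baseline are omitted, so the
    result lists only the hosts that may deserve a closer look.
    """
    base = normalize(baseline)
    findings: Dict[str, List[Dict[str, str]]] = {}
    for instance_id, rows in observed.items():
        extra = sorted(normalize(rows) - base)
        if extra:
            findings[instance_id] = [dict(row) for row in extra]
    return findings


# Hypothetical data: listening ports from a clean baseline and two instances.
baseline_ports = [
    {"port": "22", "protocol": "6"},
    {"port": "443", "protocol": "6"},
]
observed_ports = {
    "i-aaa": list(baseline_ports),
    "i-bbb": baseline_ports + [{"port": "4444", "protocol": "6"}],
}
print(diff_against_baseline(baseline_ports, observed_ports))
```

A real comparison would also need to ignore columns that legitimately vary between healthy instances (process IDs, start times), which this sketch sidesteps by assuming only stable columns are collected.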

When is the functional baseline useful?

  • When you have very few instances in an application group / low n.
  • When you have had the foresight or established process to successfully collect the baseline beforehand.

Clustering

How does the “clustering” method work?

  • During an incident, osquery table output is collected from all instances in an application group.
  • No pre-incident baseline need be collected.
  • A clustering algorithm is used to identify dissimilar elements in system state (for example, an unexpected listening port, or a running process with an unusual name).
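
Since the clustering algorithm is not yet settled, the idea can only be illustrated with a stand-in. The sketch below uses a simple frequency heuristic rather than a true clustering algorithm: a row is flagged when it appears on only a small share of the instances in the group. The function name `rare_rows`, the `max_share` threshold, and the sample data are all assumptions for the example.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

Row = Tuple[Tuple[str, str], ...]


def rare_rows(
    observed: Dict[str, List[Dict[str, str]]],
    max_share: float = 0.2,
) -> Dict[str, List[Dict[str, str]]]:
    """Flag rows seen on at most `max_share` of the instances.

    No baseline is needed: each instance is compared with its peers.
    """
    counts: Counter = Counter()
    per_instance: Dict[str, Set[Row]] = {}
    for instance_id, rows in observed.items():
        unique = {tuple(sorted(row.items())) for row in rows}
        per_instance[instance_id] = unique
        counts.update(unique)  # each instance counts a given row once

    limit = max_share * len(observed)
    flagged: Dict[str, List[Dict[str, str]]] = {}
    for instance_id, unique in per_instance.items():
        rare = sorted(row for row in unique if counts[row] <= limit)
        if rare:
            flagged[instance_id] = [dict(row) for row in rare]
    return flagged


# Hypothetical data: five instances run the same processes, one runs an extra.
common = [{"name": "nginx"}, {"name": "sshd"}]
observed_processes = {f"i-{n}": list(common) for n in range(5)}
observed_processes["i-4"] = common + [{"name": "kworkerd"}]
print(rare_rows(observed_processes))
```

A heuristic like this depends on the group being large and homogeneous. With only a handful of instances, a single outlier may exceed the threshold and go unflagged.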

When is the clustering method useful?

  • When you have many instances in an application group.
  • When instances in an application group are expected to be very similar (in which case outliers will stick out quite noticeably).
  • When you are not able to collect a baseline for the application group beforehand.

Integrating into the CI/CD pipeline

In environments supporting continuous integration or continuous delivery (CI/CD) such as ours, software is frequently deployed through a process involving first the checkout of code from a source control system, followed by the packaging of that code into a form combined (“baked”) into a system virtual machine (VM) image. The VM is then copied to a cloud provider, and started up as a VM instance in the cloud architecture. You can read more about this process in “How We Build Code at Netflix.”

Diffy provides an API for application owners to call at deploy time, after those virtual machine instances begin serving traffic. When activated, Diffy deploys an operating system instrumentation tool called osquery to the instance (if it isn’t already present) and collects a baseline set of observations from the system by issuing SQL queries. We do this on virtual machines, but osquery can do this on containers as well.
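
To give a sense of what such observations might look like, here is a small sketch of queries against osquery tables that match the examples mentioned earlier (listening ports, processes, crontab entries, kernel modules). The tables and columns exist in osquery's schema, and `osqueryi --json` is a real invocation. The choice of queries, the `BASELINE_QUERIES` name, and the helper functions are assumptions, not Diffy's actual configuration.

```python
import json
from typing import Dict, List

# Hypothetical query set; which tables and columns Diffy really collects
# is not stated in this post.
BASELINE_QUERIES: Dict[str, str] = {
    "listening_ports": "SELECT port, protocol, address FROM listening_ports;",
    "processes": "SELECT name, path FROM processes;",
    "crontab": "SELECT command, path FROM crontab;",
    "kernel_modules": "SELECT name FROM kernel_modules;",
}


def osqueryi_command(sql: str) -> List[str]:
    """Build an argument list that runs one query and prints JSON rows."""
    return ["osqueryi", "--json", sql]


def parse_rows(stdout: str) -> List[Dict[str, str]]:
    """Parse osqueryi's JSON output (a list of objects) into row dicts."""
    rows = json.loads(stdout)
    return [{key: str(value) for key, value in row.items()} for row in rows]


# Sample output in the shape osqueryi --json produces (values are strings).
sample = '[{"port": "22", "protocol": "6", "address": "0.0.0.0"}]'
print(osqueryi_command(BASELINE_QUERIES["listening_ports"]))
print(parse_rows(sample))
```

On a host with osquery installed, the command list could be passed to `subprocess.run(..., capture_output=True, text=True)` and the resulting stdout to `parse_rows`. Diffy itself collects through AWS EC2 Systems Manager rather than a local subprocess, so this only illustrates the shape of the data.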

State diagram for Diffy’s functional baselining method

During an incident

When an incident occurs, an incident responder can use Diffy to interrogate an ASG: first pulling the available baseline, next gathering current observations from all instances there, and finally comparing all instances within the ASG against that baseline. Instances that differ from the baseline in interesting, security-relevant ways are highlighted, and presented to the investigator for follow-up. If the functional baseline wasn’t previously collected, Diffy can rely solely on the clustering method. We’re not settled on the algorithm yet, but we see Diffy collecting observations from all instances in an ASG, and using the algorithm to identify outliers.

Summary

In today’s cloud architectures, automation wins. Digital forensics and incident response teams need straightforward tools that help them respond to compromises swiftly and quickly identify the work ahead. Diffy can help those teams.

We’ve characterized Diffy as one of our “Skunkworks” projects, meaning that the project is under active development and we don’t expect to be able to provide support, or a public commitment to improve the software. To download the code, visit https://github.com/Netflix-Skunkworks/diffy. If you’d like to contribute, take a look at our Contributor Guidelines at https://diffy.readthedocs.io/ to get started on your plugin and send us a pull request. Oh, and we’re hiring — if you’d like to help us solve these sorts of problems, take a look at https://jobs.netflix.com/teams/security, and reach out!
