The Quiet Victories and False Promises of Machine Learning in Security
Contrary to what you might have read on the Internet, machine learning (ML) is not magic pixie dust. It’s a broad collection of statistical techniques that allows us to train a computer to estimate an answer to a question even when we haven’t explicitly coded the correct answer into the program.
There are classes of problems where ML shines, and when a well-designed machine learning system is applied to the right type of problem, you can unlock insights and scale that were not attainable otherwise.
To ML or not to ML, that is the question
ML is not a panacea, and most security problems neither require nor benefit from ML solutions. In fact, many experts in the field, such as the folks at Google and other large tech companies, suggest that when attempting to solve a complex problem, you should exhaust all other possible approaches before you start trying to apply machine learning.
ML is relatively difficult and expensive compared to heuristic methods and should not be used when a simpler approach is sufficient. A common example in threat detection is a rule that alerts when a connection is initiated to a known bad IP address. There is no need for ML here, and trying to use it would likely be ineffective anyway.
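To make the point concrete, here is a minimal sketch of such a rule in Python. The blocklist contents and the function name are assumptions for illustration (the addresses come from ranges reserved for documentation), not a real threat feed or product API.

```python
import ipaddress

# Hypothetical blocklist; these addresses are from documentation-only ranges.
KNOWN_BAD_IPS = {
    ipaddress.ip_address("203.0.113.7"),
    ipaddress.ip_address("198.51.100.23"),
}

def should_alert(destination_ip: str) -> bool:
    """Return True when a connection targets an address on the blocklist."""
    return ipaddress.ip_address(destination_ip) in KNOWN_BAD_IPS

# Hypothetical outbound connections observed on a host.
connections = ["192.0.2.10", "203.0.113.7"]
alerts = [ip for ip in connections if should_alert(ip)]
print(alerts)  # ['203.0.113.7']
```

The whole rule is a set lookup: it is exact, cheap, and easy to explain, which is why a model adds nothing here.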
A good example of successful ML is natural language processing (NLP). NLP allows computers to “understand” human language through text or audio, but human language is incredibly complex. Imagine trying to teach a computer idioms, sarcasm, metaphors, or grammatical irregularities. When we say that ML is good for “narrowly scoped problems,” we mean that you need a combination of models to simulate the nuances of any one language, even for a very specific task like identifying spam email. Most NLP involves a hybrid approach that includes hand-written rules, statistical methods, and/or neural networks or deep learning to address the ambiguous nature of language.
In many ways, cybersecurity faces the same challenges as language processing. Attackers may not use idioms, but many of their techniques are analogous to homonyms (words that sound or are spelled alike but mean different things): they closely resemble actions a system administrator might take for perfectly benign reasons. Like languages across nations, IT environments vary across organizations in purpose, architecture, prioritization, and risk tolerance. As such, it’s impossible to create algorithms, ML or otherwise, that broadly address security use cases in all scenarios. This is why most successful applications of ML in security combine multiple methods to address a very specific issue. Some good examples include spam filters, DDoS or bot mitigation, and malware detection.
Garbage in, garbage out
Regardless of the flavor of ML we are considering, by far the biggest challenge we’ll face is the availability of a sufficient quantity of relevant, usable data to solve our problem. In a supervised ML scenario, you need a large, correctly labeled data set. For example, if you wanted an ML model to identify photos of cats, you should train it on a data set that contains lots of photos of cats labeled “cat” along with lots of photos of other things labeled “not cat.” If you don’t have enough photos of cats or the photos are not correctly labeled, your model won’t work well.
A supervised story
In security, a well-known supervised ML use case is signatureless malware detection. The best endpoint protection vendors today use ML for this purpose. They accomplish it by labeling huge quantities of malicious and benign samples (or downloading such a data set) and training a model on “what malware looks like.” This is useful because the model can correctly identify evasive, mutating malware and other trickery where a file is altered just enough to no longer match a signature but remains malicious. ML doesn’t match the signature. It predicts malice using some other feature set, and can thus often catch malware that signature-based methods miss.
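The difference between matching a signature and predicting from features can be sketched with a toy example. Everything here is an assumption made for illustration: the byte strings stand in for files, the two features and the marker string are invented, and a nearest-centroid rule stands in for whatever model a real vendor would train.

```python
import hashlib

MARKER = b"stratum+tcp"  # hypothetical indicator chosen for this toy example

def extract_features(sample: bytes):
    """Toy feature vector: fraction of non-printable bytes, marker present or not."""
    non_printable = sum(1 for b in sample if b < 32 or b > 126) / len(sample)
    return (non_printable, 1.0 if MARKER in sample else 0.0)

def centroid(samples):
    """Average feature vector of a list of labeled samples."""
    vectors = [extract_features(s) for s in samples]
    return tuple(sum(column) / len(vectors) for column in zip(*vectors))

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Hypothetical labeled training data standing in for real files.
malicious = [b"\x00\x01\x02stratum+tcp://a", b"\x03\x04stratum+tcp://b\x05"]
benign = [b"hello world", b"config=1\n"]

# Signature approach: exact hashes of known malicious samples.
signatures = {hashlib.sha256(s).hexdigest() for s in malicious}

# "Training": one centroid per label.
centroids = {"malicious": centroid(malicious), "benign": centroid(benign)}

def signature_match(sample: bytes) -> bool:
    return hashlib.sha256(sample).hexdigest() in signatures

def predict(sample: bytes) -> str:
    """Assign the label whose centroid is closest to the sample's features."""
    vector = extract_features(sample)
    return min(centroids, key=lambda label: squared_distance(vector, centroids[label]))

# A one-byte mutation changes the hash but barely moves the features.
mutated = malicious[0] + b"\x00"
print(signature_match(malicious[0]), signature_match(mutated))  # True False
print(predict(mutated))  # malicious
```

The appended byte defeats the hash lookup, while the feature-based prediction is unaffected. The same sketch also shows the dependence on labels: swap the labels in the training lists and the prediction flips with them.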
However, because ML models are probabilistic, that is to say, not exact, there is a tradeoff. ML can catch malware that signatures miss, but may also miss malware that signatures catch. This is why modern endpoint protection platform (EPP) tools use hybrid methods that combine ML and signature-based techniques for optimal coverage.
An unsupervised story
Unsupervised ML is often referred to as “anomaly detection,” although a lot of anomaly detection isn’t ML at all, but rather basic statistics. When used correctly, unsupervised ML can enable dynamic baselining, which is often much more effective than static thresholds.
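The contrast between a static threshold and a dynamic baseline can be shown with basic statistics alone, which, as noted above, is what much anomaly detection amounts to. This is a minimal sketch, not an unsupervised ML model; the traffic numbers, window size, and multiplier `k` are assumptions chosen for illustration.

```python
from statistics import mean, stdev

def static_alerts(values, threshold):
    """Indices where the value exceeds a fixed threshold."""
    return [i for i, v in enumerate(values) if v > threshold]

def baseline_alerts(values, window=5, k=3.0):
    """Indices where the value exceeds mean + k * stdev of the previous `window` values."""
    alerts = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        limit = mean(history) + k * stdev(history)
        if values[i] > limit:
            alerts.append(i)
    return alerts

# Hypothetical requests per minute: a quiet period with one spike (index 5),
# then a shift to a busier but normal level (index 8 onward).
traffic = [10, 12, 11, 9, 10, 40, 11, 10, 100, 104, 98, 101, 99, 102]

print(static_alerts(traffic, threshold=50))  # [8, 9, 10, 11, 12, 13]
print(baseline_alerts(traffic))              # [5, 8]
```

The fixed threshold misses the spike during the quiet period and then alerts on every busy-period value. The rolling baseline flags the spike and the level shift, then adapts. It also shows the weakness: once unusual values enter the window, they widen the baseline and become part of "normal."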
One successful security use case for unsupervised ML is network anomaly detection. Modern network security tools can even identify patterns in encrypted traffic to catch potential attacks. However, for anomaly detection to work, the target network must be very consistent and predictable or the false positive rates will be unbearable. Additionally, unsupervised methods rely on your organization’s data for the relevant patterns. This means that it takes some time for the tool to learn your environment before it really works properly. It also means that the tool can unintentionally baseline malicious activity that is already present in your network as normal, so that it never flags that activity.
Like malware detection, most network detection and response tools combine a variety of methods, ML and otherwise, to achieve the best detection they can offer.
Something, something, false positives
Aside from the struggle of selecting a suitable use case and tailoring a good model to a hard-to-find enormous data set, ML presents some additional challenges when it comes to interpreting the output.
The result is a probability. The ML model outputs the likelihood of something. So if your model is designed to identify cats, you’ll get a result that looks like “this thing is 80% likely to be a cat.” This uncertainty is an inherent characteristic of these types of systems, and it can make it difficult to interpret the result. Is 80% cat enough?
The model can’t be tuned, at least typically not by the end user. To deal with the probabilistic outcomes, a tool might have thresholds set by the vendor that collapse them to a binary result. For example, the cat-identification model may be tuned to report that anything 90% or more likely to be a cat IS a cat, and anything else is not. The problem is that your business’s or security team’s tolerance for cat-ness may be higher or lower than what the vendor set. Usually, though not always, it’s not possible for you to alter this tolerance because it’s tuned during model development. Furthermore, if those thresholds are set by someone who is not a very good threat (or cat) expert, they can be as good as arbitrary.
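A minimal sketch of how a threshold collapses a probability into a binary answer, using the cat example above. The 90% cutoff is the vendor setting from the text; the 75% cutoff is a hypothetical alternative that a different team might prefer.

```python
def classify(probability: float, threshold: float) -> str:
    """Collapse a model's probability into a binary label."""
    return "cat" if probability >= threshold else "not cat"

score = 0.80             # "this thing is 80% likely to be a cat"
vendor_threshold = 0.90  # cutoff fixed by the vendor during model development
team_threshold = 0.75    # hypothetical cutoff your team would have chosen

print(classify(score, vendor_threshold))  # not cat
print(classify(score, team_threshold))    # cat
```

The model's output is identical in both cases. Only the cutoff differs, and it is usually the one setting the end user cannot reach.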
False negatives (FN), or the failure to alert on real scary things, are one painful consequence of ML models, especially poorly tuned ones. We hear a lot about false positives (FP) because they waste time, contribute to team burnout, and are generally frustrating. But there is an inherent tradeoff between FP and FN rates: lowering one generally raises the other. ML models are usually tuned to optimize the tradeoff, which means the developers select a model that has the “best” FP and FN rates possible. However, how the FP or FN rate is weighted in such an optimization depends very much on the use case. The right thresholds may be very different for different types of organizations, depending on their individual threat and risk assessments. When using an ML-based product, your team usually cannot provide any input regarding your tolerance for FP and FN rates, and you must trust the vendor to select the appropriate thresholds for you.
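The tradeoff can be seen by sweeping the alert threshold over a small set of scored events. The scores and labels below are made up for illustration; they are not output from any real detection model.

```python
# Hypothetical (score, is_malicious) pairs, as if produced by a detection model.
scored_events = [
    (0.95, True), (0.85, True), (0.81, True), (0.60, True),
    (0.90, False), (0.70, False), (0.40, False), (0.20, False),
]

def count_errors(events, threshold):
    """Return (false positives, false negatives) when alerting on score >= threshold."""
    fp = sum(1 for score, malicious in events if score >= threshold and not malicious)
    fn = sum(1 for score, malicious in events if score < threshold and malicious)
    return fp, fn

for threshold in (0.5, 0.75, 0.92):
    fp, fn = count_errors(scored_events, threshold)
    print(f"threshold={threshold}: FP={fp}, FN={fn}")
# threshold=0.5: FP=2, FN=0
# threshold=0.75: FP=1, FN=1
# threshold=0.92: FP=0, FN=3
```

No threshold in this data eliminates both error types. Which row is "best" depends on whether a missed threat or a wasted investigation costs your organization more.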
Not enough context for alert triage. Part of the magic of ML is that it can extract “features” from a data set that may have useful predictive power, but may not be human-perceivable or make much sense. For example, imagine if, for some reason, whether or not something is a cat happened to be highly correlated with the color of the nearest building. This seems arbitrary, and it would never occur to a human being to include the color of a building in their decision about what is or isn’t a cat. But this is part of the point of ML, to find patterns we can’t otherwise find, and to do this at scale. There are two problems here. One is that the features are rarely exposed to the user of the model, so you wouldn’t even know your cat prediction was based on building color – you would get a prediction with no context. The second is that even if the reason for the prediction can be exposed, it is often either not human-readable or seemingly arbitrary and useless in an actual alert triage or incident response situation. This is because the “features” that ultimately define the ML system’s decision are optimized for their predictive power, not their practical relevance to a security analyst looking at the output of that model.
By golly, have we done it?!
In early November 2022, Sysdig’s machine learning miner detection system alerted on a potential threat. Our alert had a probability of 81%, and while that is still deemed high confidence, we typically see miners trigger an alert from this detection system with a confidence probability of 96% or more. We took the alert data and put it in the hands of human analysts to determine whether the system was correct or had produced a false positive for it to learn from. Following the Sysdig Threat Research Team’s investigation, we confirmed that this was indeed a miner with a complex toolchain, demonstrating the effectiveness of machine learning in detecting malicious activity as a complement to traditional rule-based detectors.
Would “statistics” by any other name smell as sweet?
Beyond the pros and cons, there’s one more catch – not all “ML” is really ML. Statistics gives you some conclusions about data you have. ML gives you estimates about related data you didn’t have, or makes a prediction, based on the data you did have. Marketing teams far and wide have enthusiastically latched onto the terms “machine learning” and “artificial intelligence” to signal a modern, innovative, advanced thing of some kind. But there’s often very little regard for whether the tech in question even uses ML, never mind if ML is the right approach in the first place.
ML includes a broad range of techniques, many of which rely on basic statistical methods that have been around for decades (the term ML originated in the 1950s). A linear fit is ML because it has predictive capabilities, but do you want to pay a premium for that grand innovation? The real question is why should you care that a tool uses or claims to use ML at all?
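To show how little machinery a linear fit involves, here is a minimal least-squares sketch in plain Python. The daily alert counts are hypothetical, chosen only so the arithmetic is easy to follow.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: alerts observed on days 1 through 5.
days = [1, 2, 3, 4, 5]
alerts_per_day = [12, 14, 16, 18, 20]

slope, intercept = fit_line(days, alerts_per_day)
predicted_day_6 = slope * 6 + intercept  # an estimate for data we did not have
print(slope, intercept, predicted_day_6)  # 2.0 10.0 22.0
```

Summarizing days 1 through 5 is statistics. Estimating day 6 from them is, by the definition above, a prediction, and it takes about a dozen lines.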
So can ML detect evil or not?
Machine learning can help in detecting certain aspects of evil when you have a pretty good idea of what “evil” looks like and you can define your problem scope to capture those specific aspects. It can also help in detecting deviations from expected behavior in highly predictable systems. The more stable the environment in question is, the more likely ML is to correctly identify anomalies. However, this doesn’t mean that every anomaly is malicious, nor does it mean that the operator will be equipped with enough contextual information to act upon the alert.
As more and more legitimately impressive machine learning systems proliferate in cybersecurity, consider how you can position your organization to derive the most value from true innovation and avoid wasting time and money on the buzzword noise. Like any good tool, ML tools must fit seamlessly into your existing workflows to avoid creating additional friction. ML’s superpower is not in replacing, but in extending the capabilities of existing methods, systems, and teams for optimal coverage and efficiency.
Originally published on Dark Reading