Sunday, January 1, 2017

Some Thoughts on Machine Learning Antivirus

For a while I've been reading that signature-based antivirus is dead and machine learning is the future. We're seeing the first wave of machine learning and math-based antivirus products, such as those made by Cylance, SentinelOne, and CrowdStrike. In my limited experience with these products, they're pretty successful at recognizing malicious code, especially new malicious code. But signature-based detection has a few perks that the machine learning companies haven't yet recognized and captured.

Once we know that a malicious file is on a system, do we care what it's doing or what kind of malware it is? The average user at home may not care who or what is attacking their system; they just want it fixed. In the business space, however, especially in large corporations with sensitive intellectual property, it is critical to know what the malware is. Why do they care? To correctly assess the risk of an infected computer, you have to know whether you're dealing with something relatively innocuous like adware or with malware that indicates an active attacker, like a remote access trojan (RAT). The first requires cleaning a computer, while the second may require a full-blown incident response investigation. The problem with machine learning AV at present is that it tells you a file is malicious, but not *why* it is malicious. There's no signature telling you this is GhostRAT. So what are your options when you're trying to determine what the malware is? You're going to copy the hash from the console, paste it into VirusTotal, and find out what the 55 signature-based products detect this malware as. That's not ideal, and it's certainly not scalable if you run a Security Operations Center and receive dozens or hundreds of alerts per day.
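To make the scaling problem concrete, here is a minimal Python sketch of the triage step an analyst performs by eye after that lookup. It assumes you have already collected the per-engine detection labels for a hash; how you fetch them is out of scope. The engine names, labels, keyword lists, and response names are all made up for illustration, and a real keyword list would need careful curation.

```python
import re
from collections import Counter

# Hypothetical keyword sets; vendors name malware inconsistently,
# so a production list would be far larger and hand-curated.
KEYWORDS = {
    "rat": {"rat", "backdoor", "gh0st", "ghostrat"},
    "adware": {"adware", "pup", "toolbar"},
}
# Ties are broken toward the more severe category.
SEVERITY = {"rat": 2, "adware": 1}
# Hypothetical response names for each category.
RESPONSE = {"rat": "incident-response", "adware": "clean-up"}


def categorize(label):
    """Map one engine's detection label to a category, or None."""
    tokens = set(re.split(r"[^a-z0-9]+", label.lower()))
    for category, words in KEYWORDS.items():
        if tokens & words:
            return category
    return None


def triage(detections):
    """Vote across engines; detections maps engine name -> label or None."""
    votes = Counter()
    for label in detections.values():
        if not label:  # this engine did not flag the file
            continue
        category = categorize(label)
        if category:
            votes[category] += 1
    if not votes:
        return "manual-review"
    winner = max(votes, key=lambda c: (votes[c], SEVERITY[c]))
    return RESPONSE[winner]


# Made-up labels for a single file hash.
sample = {
    "EngineA": "Backdoor.Win32.Gh0st",
    "EngineB": "Trojan.GhostRAT",
    "EngineC": None,
    "EngineD": "Trojan.Generic",
}
print(triage(sample))
```

Even a crude vote like this depends entirely on the family names that signature-based engines supply, which is exactly the information a bare "malicious" verdict leaves out.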

Another major difference is that current machine learning AV products only check files when they are executed. Files that are merely written to the file system are not checked. This could be problematic if you have an attacker moving laterally inside your network and planting tools. AV would be running, but it wouldn't notify you of the infection until the attacker actually used the tools. Adversaries could essentially prepare the battlefield without being detected. In comparison, signature-based AV is capable of scanning files on access, so the attacker's tools would conceivably be detected when they're dropped on the file system (assuming a signature exists for those files).
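As a rough illustration of checking files at rest rather than at execution, here is a small Python sketch that walks a directory and flags any file whose SHA-256 hash appears in a set of known-bad hashes. This is only the simplest possible stand-in for a signature: real on-access scanning hooks file system events instead of walking directories, and real signatures match far more than whole-file hashes. The function names and the `known_bad_hashes` set are assumptions for the example.

```python
import hashlib
import os


def sha256_of(path, chunk_size=65536):
    """Hash a file in chunks so large files are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def scan_tree(root, known_bad_hashes):
    """Return paths under root whose hash is in known_bad_hashes.

    Files are checked whether or not they have ever been executed.
    """
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if sha256_of(path) in known_bad_hashes:
                    hits.append(path)
            except OSError:
                # Unreadable or vanished file; skip it in this sketch.
                continue
    return hits
```

Calling `scan_tree` on a staging directory with a hypothetical set of hashes would surface a planted tool before anyone runs it, which is the window an execute-only check misses.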

Traditional signature-based AV suites originally included only malware protection. Over time these suites have evolved (bloated?) to include a wide variety of functionality such as firewalls, access control, encryption, and more. Some of them provide increased protection through cloud-based signatures and/or blocking of known-bad network traffic. Machine learning AV, on the other hand, is presently just antivirus and nothing more. I've personally seen infections detected by machine learning AV while the command and control (C2) traffic continued unabated. If strict proxy and firewall rules are not in place, this could be a problem. Defense in depth is important.

Machine learning antivirus has only been around for a few years, so the technology is still maturing and evolving. The detection potential is great, but there are tradeoffs to consider and risks to mitigate. I hope the experiences I've shared here help inform your decisions.