Machine Learning for Malware Detection

Written by

Gaurav Khuntale

Published on

January 17, 2023

According to cybersecurity experts, the cost of cybercrime is increasing at a rate of 15% per year; in 2021 alone, financial losses of up to $6 trillion were recorded due to cyber attacks. Malware is growing at an unprecedented pace, and traditional signature scanning cannot keep up. For this reason, we propose a modern solution: using Artificial Intelligence to classify PE (Portable Executable) files as clean or malware. By refining this method over time, we have become one of the few vendors to receive the ICSA certificate for 100% malware classification and detection with zero false positives.

Overview

Our client was developing endpoint security software, in which we had to implement static file scanning as the first line of defence. If a file is identified as malware before it runs, its execution can be blocked. That can avert financial losses and, in critical settings, help protect lives and national security.

Challenges

A machine's intelligence is only as good as the data it is trained on. Finding a relevant dataset is one of the primary challenges in Artificial Intelligence.

We can develop a model with good accuracy, but not at the cost of system performance.

False positives, where clean files are flagged as malware, can be a real headache.

Deploying AI models in a live environment.

Solution

Our contribution to this solution started with choosing the right model for the problem, followed by understanding how malware works, looks and behaves.

For the malware dataset, we found some good resources and downloaded a large number of samples from VirusShare and other repositories. For clean files, we wrote our own program and ran it on as many machines as we could find to collect data.

Understanding the features used to build the model is a crucial stage. It makes it much easier to examine false-positive and false-negative files and to analyse them in detail.
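To make the idea of static features concrete, here is a minimal Python sketch that reads a few header fields from raw PE bytes and computes byte entropy without executing the file. The chosen fields, the function names and the synthetic header are illustrative assumptions; the article does not list the features actually used in the models.

```python
import math
import struct
from collections import Counter

IMAGE_FILE_DLL = 0x2000  # COFF Characteristics flag: the image is a DLL


def byte_entropy(data: bytes) -> float:
    """Shannon entropy of the byte histogram, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    return sum(-(n / total) * math.log2(n / total) for n in Counter(data).values())


def pe_static_features(data: bytes) -> dict:
    """Extract a few illustrative static features from raw PE bytes."""
    # DOS header: starts with b"MZ"; the little-endian DWORD at offset 0x3C
    # (e_lfanew) holds the file offset of the PE signature.
    if len(data) < 0x40 or data[:2] != b"MZ":
        raise ValueError("missing MZ (DOS) header")
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\x00\x00":
        raise ValueError("missing PE signature")
    if len(data) < pe_offset + 24:
        raise ValueError("truncated COFF header")
    # The 20-byte COFF file header follows the 4-byte signature.
    (machine, n_sections, timestamp, _sym_ptr, _n_syms,
     opt_header_size, characteristics) = struct.unpack_from(
        "<HHIIIHH", data, pe_offset + 4)
    return {
        "machine": machine,
        "number_of_sections": n_sections,
        "timestamp": timestamp,
        "optional_header_size": opt_header_size,
        "is_dll": bool(characteristics & IMAGE_FILE_DLL),
        "file_size": len(data),
        "entropy": byte_entropy(data),
    }


def make_minimal_header() -> bytes:
    """Build a synthetic, non-executable header purely to exercise the parser."""
    dos = bytearray(0x40)
    dos[:2] = b"MZ"
    struct.pack_into("<I", dos, 0x3C, 0x40)  # PE signature right after DOS header
    coff = struct.pack("<HHIIIHH", 0x8664, 3, 0, 0, 0, 240, 0x0022)
    return bytes(dos) + b"PE\x00\x00" + coff


features = pe_static_features(make_minimal_header())
print(features)
```

A dictionary like this maps directly to one row of a training table, which is also what makes a misclassified file easy to inspect field by field.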

First, we focused on developing a model that gives good results, irrespective of performance. The initial model's scan time was high: to give a glimpse, a 1.5 GB file took 8 seconds to scan. Using high-performance computing with CUDA, we reduced the scan time to 2 seconds. This made scanning 4x faster, a 75% reduction in scan time.
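The scan-time figures above can be turned into a speedup factor and a throughput with a few lines of arithmetic. The sketch below uses only the numbers quoted in the article (1.5 GB, 8 seconds, 2 seconds); the function name is hypothetical.

```python
def speedup_summary(size_gb: float, before_s: float, after_s: float) -> dict:
    """Summarise a scan-time improvement for a single file."""
    return {
        # How many times faster the new scan is.
        "speedup_factor": before_s / after_s,
        # Share of the original scan time that was eliminated.
        "time_reduction_pct": (before_s - after_s) / before_s * 100,
        # Data scanned per second, before and after.
        "throughput_before_gb_s": size_gb / before_s,
        "throughput_after_gb_s": size_gb / after_s,
    }


summary = speedup_summary(size_gb=1.5, before_s=8.0, after_s=2.0)
print(summary)
```

For these inputs the speedup factor is 4, the scan time drops by 75%, and throughput rises from 0.1875 GB/s to 0.75 GB/s.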

As per our client's requirements, the models trained in Python were ported to C++ to remove the dependency on a Python environment on the customer side.

Combining all this, we generated multiple models. Since malware detection is a supervised learning problem, the best results came from Neural Network and LightGBM models, which reached around 98% accuracy on public datasets.
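Accuracy alone does not show how a classifier trades detections against false positives. The Python sketch below computes accuracy, false-positive rate and detection rate from model scores at a chosen threshold. The labels, scores and thresholds are made-up illustration data, not results from the models described here.

```python
def confusion_counts(labels, scores, threshold):
    """Count TP, FP, TN, FN where label 1 = malware and 0 = clean."""
    tp = fp = tn = fn = 0
    for label, score in zip(labels, scores):
        flagged = score >= threshold  # the model calls the file malware
        if flagged and label == 1:
            tp += 1
        elif flagged:
            fp += 1
        elif label == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def metrics(labels, scores, threshold):
    """Accuracy, false-positive rate and detection rate at one threshold."""
    tp, fp, tn, fn = confusion_counts(labels, scores, threshold)
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else 0.0,
        "detection_rate": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Made-up validation data: four clean files followed by four malware files.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
scores = [0.05, 0.10, 0.40, 0.62, 0.55, 0.80, 0.90, 0.97]

at_050 = metrics(labels, scores, threshold=0.5)
at_070 = metrics(labels, scores, threshold=0.7)
print(at_050)
print(at_070)
```

On this toy data both thresholds give the same accuracy (0.875), but the stricter threshold removes the single false positive at the price of one missed malware sample. This is why false positives need to be tracked separately from accuracy.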

Outcome

With these models integrated into the endpoint security product, it provides real-time as well as on-access protection against both known and zero-day malware. We have also integrated the model into email security to scan attachments and stop an attack before the attachment is even downloaded. Using our expertise in this domain, we have protected our customers' businesses for more than 2 years against Trojans, ransomware and many other known and unknown virus families.

Future Scope

New models are currently in the R&D phase, where we are implementing new features and exploring new AI techniques.

Technical Understandings and Expertise

Understanding of malware analysis.

In-depth knowledge of ML and DL algorithms.

Understanding of the PE file format.

Deployment of AI applications in the real world.

Have Any Thoughts...

Let us know if you have any thoughts on the article. We would like to discuss them, hear your point of view and resolve any queries you have about the case study.

Write To Us

VoidStarIndia