Malware Detection using Machine Learning Algorithms performed in Google Cloud Platform (GCP)

Sibani Sankar Panigrahi
12 min read · Aug 19, 2021


The malware industry continues to be a well-organized, well-funded market dedicated to evading traditional security measures. Once a computer is infected by malware, criminals can hurt consumers and enterprises in many ways.

In this blog, I will try to explain:

  1. How I tackled the problem statement and its dataset, which contains approximately 200 GB of data.
  2. How I created image features.
  3. How I prepared the advanced features that achieved a log loss of 0.01.

Please visit my GitHub for the complete code, and upvote the article if you liked it. Thank you! I hope you enjoy the rest of the article.

Content:

  1. Business Problem
  2. Dataset/Machine Learning Problem
  3. Using Google Cloud Platform (GCP)
  4. Downloading and extracting the files
  5. Exploratory Data Analysis
  6. Feature Engineering
  7. First Cut Approach
  8. Random Forest Classifier
  9. XGBoost Classifier
  10. Model Comparisons
  11. Summary
  12. References

1. Business Problem

1.1. What is Malware?

The term malware is a contraction of malicious software. Put simply, malware is any piece of software that was written with the intent of doing harm to data, devices or to people.
Source: https://www.avg.com/en/signal/what-is-malware

1.2. Problem Statement

In the past few years, the malware industry has grown very rapidly; its syndicates invest heavily in technologies that evade traditional protection, forcing anti-malware groups and communities to build more robust software to detect and terminate these attacks. A major part of protecting a computer system from a malware attack is identifying whether a given file or piece of software is malware.

1.3 Source/Useful Links

Microsoft has been very active in building anti-malware products over the years, and it runs its anti-malware utilities on over 150 million computers around the world. This generates tens of millions of daily data points to be analyzed as potential malware. To analyze and classify such large amounts of data effectively, we need to be able to group the files and identify their respective malware families.

The dataset provided by Microsoft contains 9 classes of malware.

Source: https://www.kaggle.com/c/malware-classification

1.4. Real-world/Business objectives and constraints.

  1. Minimize multi-class error.
  2. Multi-class probability estimates.
  3. Malware detection should not take hours and block the user’s computer. It should finish in a few seconds or a minute.

2. Dataset/Machine Learning Problem

2.1. Data

2.1.1. Data Overview

1. ‘.asm’ file (read more: https://www.reviversoft.com/file-extensions/asm)

2. ‘.bytes’ file (the raw data contains the hexadecimal representation of the file’s binary content, without the PE header)

  • The total train dataset consists of 200 GB of data, of which 50 GB is .bytes files and 150 GB is .asm files.
  • That is a lot of data for a single box/computer.
  • There are 10,868 .bytes files and 10,868 .asm files, 21,736 files in total.
  • There are 9 types of malware (9 classes) in the given data.
  • Types of malware:
  • Ramnit
  • Lollipop
  • Kelihos_ver3
  • Vundo
  • Simda
  • Tracur
  • Kelihos_ver1
  • Obfuscator.ACY
  • Gatak

2.1.2. Example Data Point

.asm file

.text:00401000                                       assume es:nothing, ss:nothing, ds:_data,    fs:nothing, gs:nothing
.text:00401000 56 push esi
.text:00401001 8D 44 24 08 lea eax, [esp+8]
.text:00401005 50 push eax
.text:00401006 8B F1 mov esi, ecx
.text:00401008 E8 1C 1B 00 00 call ??0exception@std@@QAE@ABQBD@Z ; std::exception::exception(char const * const &)
.text:0040100D C7 06 08 BB 42 00 mov dword ptr [esi], offset off_42BB08
.text:00401013 8B C6 mov eax, esi
.text:00401015 5E pop esi
.text:00401016 C2 04 00 retn 4
.text:00401016 ; ---------------------------------------------------------------------------
.text:00401019 CC CC CC CC CC CC CC align 10h
.text:00401020 C7 01 08 BB 42 00 mov dword ptr [ecx], offset off_42BB08
.text:00401026 E9 26 1C 00 00 jmp sub_402C51
.text:00401026 ; ---------------------------------------------------------------------------
.text:0040102B CC CC CC CC CC align 10h
.text:00401030 56 push esi
.text:00401031 8B F1 mov esi, ecx
.text:00401033 C7 06 08 BB 42 00 mov dword ptr [esi], offset off_42BB08
.text:00401039 E8 13 1C 00 00 call sub_402C51
.text:0040103E F6 44 24 08 01 test byte ptr [esp+8], 1
.text:00401043 74 09 jz short loc_40104E
.text:00401045 56 push esi
.text:00401046 E8 6C 1E 00 00 call ??3@YAXPAX@Z ; operator delete(void *)
.text:0040104B 83 C4 04 add esp, 4
.text:0040104E
.text:0040104E loc_40104E: ; CODE XREF: .text:00401043j
.text:0040104E 8B C6 mov eax, esi
.text:00401050 5E pop esi
.text:00401051 C2 04 00 retn 4
.text:00401051 ; ---------------------------------------------------------------------------

.bytes file

00401000 00 00 80 40 40 28 00 1C 02 42 00 C4 00 20 04 20
00401010 00 00 20 09 2A 02 00 00 00 00 8E 10 41 0A 21 01
00401020 40 00 02 01 00 90 21 00 32 40 00 1C 01 40 C8 18
00401030 40 82 02 63 20 00 00 09 10 01 02 21 00 82 00 04
00401040 82 20 08 83 00 08 00 00 00 00 02 00 60 80 10 80
00401050 18 00 00 20 A9 00 00 00 00 04 04 78 01 02 70 90
00401060 00 02 00 08 20 12 00 00 00 40 10 00 80 00 40 19
00401070 00 00 00 00 11 20 80 04 80 10 00 20 00 00 25 00
00401080 00 00 01 00 00 04 00 10 02 C1 80 80 00 20 20 00
00401090 08 A0 01 01 44 28 00 00 08 10 20 00 02 08 00 00
004010A0 00 40 00 00 00 34 40 40 00 04 00 08 80 08 00 08
004010B0 10 00 40 00 68 02 40 04 E1 00 28 14 00 08 20 0A
004010C0 06 01 02 00 40 00 00 00 00 00 00 20 00 02 00 04
004010D0 80 18 90 00 00 10 A0 00 45 09 00 10 04 40 44 82
004010E0 90 00 26 10 00 00 04 00 82 00 00 00 20 40 00 00
004010F0 B4 00 00 40 00 02 20 25 08 00 00 00 00 00 00 00
00401100 08 00 00 50 00 08 40 50 00 02 06 22 08 85 30 00
00401110 00 80 00 80 60 00 09 00 04 20 00 00 00 00 00 00
00401120 00 82 40 02 00 11 46 01 4A 01 8C 01 E6 00 86 10
00401130 4C 01 22 00 64 00 AE 01 EA 01 2A 11 E8 10 26 11
00401140 4E 11 8E 11 C2 00 6C 00 0C 11 60 01 CA 00 62 10
00401150 6C 01 A0 11 CE 10 2C 11 4E 10 8C 00 CE 01 AE 01
00401160 6C 10 6C 11 A2 01 AE 00 46 11 EE 10 22 00 A8 00
00401170 EC 01 08 11 A2 01 AE 10 6C 00 6E 00 AC 11 8C 00
00401180 EC 01 2A 10 2A 01 AE 00 40 00 C8 10 48 01 4E 11
00401190 0E 00 EC 11 24 10 4A 10 04 01 C8 11 E6 01 C2 00

2.2. Mapping the real-world problem to an ML problem

2.2.1. Type of Machine Learning Problem

There are nine different classes of malware into which we need to classify a given data point => multi-class classification problem

2.2.2. Performance Metric

Source: https://www.kaggle.com/c/malware-classification#evaluation

Metric(s):

  • Multi class log-loss
  • Confusion matrix

2.2.3. Machine Learning Objectives and Constraints

Objective: Predict the probability of each data-point belonging to each of the nine classes.

Constraints:

  • Class probabilities are needed.
  • Penalize the errors in class probabilities => metric is log loss.
  • Some latency constraints.

2.3. Train and Test Dataset

Split the dataset randomly into three parts: train, cross-validation, and test, with 64%, 16%, and 20% of the data respectively.

2.4. Useful blogs, videos and reference papers

http://blog.kaggle.com/2015/05/26/microsoft-malware-winners-interview-1st-place-no-to-overfitting/
https://arxiv.org/pdf/1511.04317.pdf
First place solution in Kaggle competition: https://www.youtube.com/watch?v=VLQTRlLGz5Y
https://github.com/dchad/malware-detection
http://vizsec.org/files/2011/Nataraj.pdf
https://www.dropbox.com/sh/gfqzv0ckgs4l1bf/AAB6EelnEjvvuQg2nu_pIB6ua?dl=0
“Cross-validation is more trustworthy than domain knowledge.”

3. Using Google Cloud Platform (GCP)

I created a GCP instance with 500 GB of storage and 8 vCPU cores with 32 GB RAM. Using the instance's SSH shell directly may exhaust the storage if we do multiple file operations. As this case study demands heavy file operations, I worked around this with a port-forwarding mechanism.

Step -1

Create the instance as shown below. Before that, make sure to create a firewall rule. Once both are done, you can reserve an external IP address of the ‘Static’ type and assign it to the instance.

fig: Google Cloud Instance created

Step -2

Install the Google Cloud SDK on your local machine; it is available from https://cloud.google.com/sdk.

Step -3

This is the crucial step, where we set up SSH port forwarding so that Jupyter traffic bypasses the browser-based SSH shell. The format and an example are sketched below.
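A sketch of what the Cloud SDK port-forwarding command can look like; the instance name, zone, and ports are placeholders, not my actual values:

gcloud compute ssh my-instance --zone us-central1-a -- -L 8888:localhost:8888

Everything after the bare -- is handed to the underlying ssh client, so -L 8888:localhost:8888 forwards local port 8888 to port 8888 on the instance.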

Step -4

Now use the command below to start the Jupyter notebook server.
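A sketch of the launch command, run on the instance; the port is a placeholder and should match the forwarded/firewalled one:

jupyter notebook --no-browser --port=8888 --ip=0.0.0.0

The --ip=0.0.0.0 flag makes the server listen on all interfaces, so it is reachable through the instance's external IP once the firewall rule from Step 1 allows that port.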

Now open the Chrome browser and enter <instance external IP>:<TCP port>.

Once you get the token in the console after running the jupyter notebook command, you can use it to set a permanent password. Once all the steps are done, the Jupyter notebook served from the Google Compute Engine instance appears as below.

As you can see from the storage usage in the figure below, I have used 56% of the 492 GB assigned. All in the cloud! Isn't it great?

4. Downloading and extracting the files

4.1 Download the dataset from Kaggle

I used the CurlWget extension, which is available in the Chrome Web Store: start downloading the dataset from Kaggle and then cancel the download.

Open CurlWget, copy the generated command, and open a Jupyter notebook.

In a cell, type below.

! <link>

Run the cell and you will be amazed at how fast the download completes. No storage is used on your local machine.

4.2 Decompress/extract the compressed file

You can install 7-Zip support via pip in a Jupyter notebook cell and extract only specific file types, as sketched below. Here I extract the .bytes files from the train.7z archive.
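A minimal sketch, assuming the pip-installable py7zr package; the exact package and paths I used may differ:

# Extract only the .bytes files from the compressed train archive.
import py7zr

with py7zr.SevenZipFile('train.7z', mode='r') as archive:
    bytes_files = [f for f in archive.getnames() if f.endswith('.bytes')]
    archive.extract(path='train_bytes/', targets=bytes_files)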

5. Exploratory Data Analysis

Once I had imported all the required libraries, I first plotted the distribution of malware classes. As this is a multi-class classification problem with 9 classes, I used code along the lines of the sketch below.
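A sketch of that plot, assuming the competition's trainLabels.csv with Id and Class (1-9) columns:

import pandas as pd
import matplotlib.pyplot as plt

# Count how many files fall into each of the 9 malware classes.
labels = pd.read_csv('trainLabels.csv')
labels['Class'].value_counts().sort_index().plot(kind='bar')
plt.xlabel('Malware class')
plt.ylabel('Number of files')
plt.title('Distribution of malware classes')
plt.show()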

As we can see, class 3 has the highest number of files and class 5 the lowest.

Next, I calculated the size of each byte file and plotted a box plot to get a brief overview of the maximum file sizes per class, as sketched below.
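A sketch of the size computation and box plot; the train_bytes/ directory is a placeholder for wherever the extracted .bytes files live:

import os
import pandas as pd
import matplotlib.pyplot as plt

# File size in MB for each .bytes file, grouped by class in a box plot.
labels = pd.read_csv('trainLabels.csv')
labels['size_mb'] = labels['Id'].apply(
    lambda i: os.path.getsize(os.path.join('train_bytes', i + '.bytes')) / 1e6)
labels.boxplot(column='size_mb', by='Class', figsize=(10, 5))
plt.ylabel('File size (MB)')
plt.show()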

As we can see from the box plot above, the byte files of class 2 have the largest sizes, approximately 8 MB.

6. Feature Engineering

I computed the following feature sets and saved the resulting datasets in Google Cloud:

6.1 Unigram of byte files with sizes

Byte unigrams are computed directly from the .bytes files; they can easily be saved in a CSV file and read back later, along the lines of the sketch below.
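A sketch of the unigram extraction; this is not my exact code, but it shows the idea of counting the 256 hex byte values (plus the '??' placeholder) per file:

import os
from collections import Counter
import pandas as pd

vocab = [format(i, '02X') for i in range(256)] + ['??']

rows = []
for fname in sorted(os.listdir('train_bytes')):
    counts = Counter()
    with open(os.path.join('train_bytes', fname)) as f:
        for line in f:
            counts.update(line.split()[1:])  # skip the address column
    rows.append([fname.replace('.bytes', '')] + [counts[t] for t in vocab])

unigrams = pd.DataFrame(rows, columns=['Id'] + vocab)
unigrams.to_csv('byte_unigrams.csv', index=False)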

6.2 Unigram of asm files with sizes

The asm unigrams are computed similarly from the .asm files; they can easily be saved in a pickle file and read back later, along the lines of the sketch below.
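A sketch of the same idea for .asm files; the token list here is purely illustrative, not the set I actually counted:

import os
import pickle
from collections import Counter

tokens = ['.text', '.data', '.rdata', 'mov', 'push', 'call', 'jmp', 'lea']

features = {}
for fname in sorted(os.listdir('train_asm')):
    counts = Counter()
    with open(os.path.join('train_asm', fname), errors='ignore') as f:
        for line in f:
            counts.update(t for t in tokens if t in line)
    features[fname.replace('.asm', '')] = [counts[t] for t in tokens]

with open('asm_unigrams.pkl', 'wb') as f:
    pickle.dump(features, f)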

6.3 Bigram of byte files (selected top 2000 among 66308 features)

While preparing the byte-file bigrams, I knew the Python kernel would be killed every now and then, because nearly 66.5k features have to be computed over the dataset. Hence I used multiprocessing and computed sparse matrices, roughly as sketched below.
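A sketch of the multiprocessing idea: each worker turns one file into a sparse row of bigram counts, and the rows are stacked and saved. Paths, the pool size, and the exact vocabulary are placeholders (the post counts 66,308 features; 257 x 257 ordered byte pairs is my assumption):

import os
from collections import Counter
from multiprocessing import Pool

import numpy as np
from scipy import sparse

VOCAB = [format(i, '02X') for i in range(256)] + ['??']
# All ordered pairs of byte tokens: roughly 66k bigram features.
BIGRAM_INDEX = {a + ' ' + b: k
                for k, (a, b) in enumerate((a, b) for a in VOCAB for b in VOCAB)}

def bigram_row(fname):
    counts = Counter()
    with open(os.path.join('train_bytes', fname)) as f:
        for line in f:
            toks = line.split()[1:]  # skip the address column
            counts.update(a + ' ' + b for a, b in zip(toks, toks[1:]))
    row = np.zeros(len(BIGRAM_INDEX), dtype=np.int32)
    for bg, c in counts.items():
        if bg in BIGRAM_INDEX:
            row[BIGRAM_INDEX[bg]] = c
    return sparse.csr_matrix(row)

if __name__ == '__main__':
    files = sorted(os.listdir('train_bytes'))
    with Pool(8) as pool:
        matrices = pool.map(bigram_row, files)
    sparse.save_npz('byte_bigrams.npz', sparse.vstack(matrices))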

Once the sparse matrices were created, I merged them and prepared the final dataset. Similarly, I merged the class labels created alongside, to avoid assigning incorrect classes to the files.

Using the chi-square test, I selected the best 2,000 features and prepared the dataset, along the lines of the sketch below.
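A sketch of the selection step with scikit-learn's chi-square scoring; it assumes the sparse matrix rows and the labels are both sorted by file Id so they line up:

import pandas as pd
from scipy import sparse
from sklearn.feature_selection import SelectKBest, chi2

X = sparse.load_npz('byte_bigrams.npz')
y = pd.read_csv('trainLabels.csv').sort_values('Id')['Class'].values

# Keep the 2000 bigram columns with the highest chi-square scores.
selector = SelectKBest(chi2, k=2000)
X_top = selector.fit_transform(X, y)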

The resulting bigram dataset has 2,000 features.

6.4 Bigram of asm files (selected top 500 among 676 features)

I prepared the bigrams of asm opcodes as described below.

While preparing the dataset, I removed the columns with only one unique value (see the sketch below) and then selected the best 500 features using the chi-square test.
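A sketch of the constant-column cleanup; the pickle path and DataFrame layout are assumptions:

import pandas as pd

df = pd.read_pickle('asm_bigrams.pkl')  # placeholder path
# Drop every column holding a single unique value; it carries no signal.
df = df.loc[:, df.nunique() > 1]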

6.5 Trigram of asm files (selected top 800 among 17k features)

I read the sparse matrix created for the opcode trigrams and prepared a dataset of the 800 best features using the chi-square test.

6.6 Image features of asm files (selected top 800)

I created an image from each .asm file and saved it to a directory. Once all the images were saved, I extracted the top 800 features, as sketched below.
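A sketch of the image-feature idea, in the spirit of the malware-images paper linked in section 2.4: read each .asm file's raw bytes as 8-bit grayscale pixels, save the image, and keep the first 800 intensities as features. The image width and the "first 800 pixels" choice are assumptions, not necessarily my exact settings:

import os
import numpy as np
from PIL import Image

os.makedirs('asm_images', exist_ok=True)

rows = []
for fname in sorted(os.listdir('train_asm')):
    raw = np.fromfile(os.path.join('train_asm', fname), dtype=np.uint8)
    width = 512  # arbitrary image width
    img = raw[:(len(raw) // width) * width].reshape(-1, width)
    Image.fromarray(img).save(os.path.join('asm_images', fname + '.png'))
    rows.append(raw[:800])  # first 800 pixel intensities as features

img_features = np.vstack(rows)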

These are the final features, merged together. I tried several other feature combinations to reach a log loss of 0.01 but could not; the features above are the ones that finally got me there.

You may refer the GitHub link provided above for complete experiments.

I used the chi-square test to select the top features. This reduces the dimensionality of the dataset, since pandas cannot perform well on very large datasets.

6.7 Merging all the features

I merged all the features and then split the result into train and test datasets, as sketched below.
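A sketch of the merge and the 64/16/20 split from section 2.3; the feature DataFrame names are hypothetical, standing in for the outputs of sections 6.1 to 6.6, each keyed by the file Id:

import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical frames built in the earlier feature-engineering steps.
final = (byte_unigrams.merge(asm_unigrams, on='Id')
                      .merge(byte_bigrams_top, on='Id')
                      .merge(asm_ngrams_top, on='Id')
                      .merge(img_features_top, on='Id'))

X = final.drop(columns=['Id', 'Class'])
y = final['Class']

# 80/20 first, then 80/20 again: 64% train, 16% CV, 20% test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
X_tr, X_cv, y_tr, y_cv = train_test_split(
    X_train, y_train, test_size=0.2, stratify=y_train, random_state=42)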

7. First Cut Approach

Considering that the objective of this case study is to achieve a log loss of 0.01, the best approach is to prepare advanced N-gram features and merge them.

In the sections below, we apply machine learning models to the final prepared dataset.

8. Random Forest Classifier

Using a Random Forest classifier, we achieved a log loss of 0.02, along the lines of the sketch below.
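A sketch of that experiment, reusing the X_tr/X_cv/X_test split from section 6.7; the hyperparameters are illustrative, not my tuned values:

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss

rf = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=42)
rf.fit(X_tr, y_tr)
print('CV log loss  :', log_loss(y_cv, rf.predict_proba(X_cv)))
print('Test log loss:', log_loss(y_test, rf.predict_proba(X_test)))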

9. XGBoost Classifier

This is it! Using XGBoost, sketched below, we achieved a log loss of 0.01. Hence we select the XGBoost model, as it gives the best performance and accuracy with the lowest log loss.
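A sketch of the XGBoost run, again with illustrative hyperparameters:

from xgboost import XGBClassifier
from sklearn.metrics import log_loss

xgb = XGBClassifier(objective='multi:softprob', n_estimators=500,
                    max_depth=6, learning_rate=0.1, n_jobs=-1)
# XGBoost expects zero-based labels, so shift the 1-9 classes to 0-8.
xgb.fit(X_tr, y_tr - 1)
print('CV log loss  :', log_loss(y_cv - 1, xgb.predict_proba(X_cv)))
print('Test log loss:', log_loss(y_test - 1, xgb.predict_proba(X_test)))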

10. Model Comparisons

In this section, I compare all the models I experimented with, across the different feature sets, including the final selected features.

For remaining models, please visit my GitHub profile.

11. Summary

  1. As mentioned above, I used GCP because of the limitations of the system I have locally. I used the port-forwarding trick with the Google Cloud SDK, which kept me from losing the assigned storage, because doing file operations through the GCP SSH shell drastically consumes Google storage.
  2. First, I completed the byte-file and asm unigrams for the training data.
  3. I used multiprocessing while creating the byte bigrams (66,308 unique features).
  4. With multiprocessing, I created the sparse matrices and their corresponding class labels (i.e., Id and Class). When they were complete, I concatenated them and prepared the byte bigram dataset.
  5. While preparing the bigrams and other advanced features, I started with data cleaning.
  6. Hence I removed the columns with only one unique value and deduplicated the columns as well.
  7. I used the chi-square test to compute the best features, after first applying step 4.
  8. Once the features were selected, I merged all the advanced features and applied machine learning models such as Random Forest and XGBoost.
  9. We achieved a log loss of 0.01 with the XGBoost classifier.
  10. One thing I learned: collecting the y values can easily be neglected, as we are more focused on the other features. Hence I sorted every prepared dataset in ascending order and saved it in GCP for future use; this way we avoid incorrect concatenation. (Merging sometimes throws errors, hence I opted for concatenation.)

12. References

  1. Applied AI Course: https://appliedaicourse.com/
  2. https://www.kaggle.com/c/malware-classification

Thanks for spending your valuable time reading this article.

If you liked this article, please like, share, and follow my channel, as it will encourage me to post a variety of concepts and problems in ML and DL.
