2- Malware Detection using Deep Learning Approach – Dataset & Flow Diagrams

What is Portable Executable (PE)?

The PE file format describes the predominant executable format for Microsoft Windows operating systems, and includes executables, dynamically-linked libraries (DLLs), and FON font files. The format is currently supported on Intel, AMD and variants of ARM instruction set architectures.

The file format is arranged with a number of standard headers (see Fig. 1 for PE-32 format), followed by one or more sections [20]. Headers include the Common Object File Format (COFF) file header, which contains important information such as the type of machine for which the file is intended, the nature of the file (DLL, EXE, OBJ), the number of sections, and the number of symbols. The optional header identifies the linker version, the size of the code, the size of initialized and uninitialized data, the address of the entry point, etc. Data directories within the optional header provide pointers to the sections that follow it. This includes tables for exports, imports, resources, exceptions, debug information, certificate information, and relocation tables. As such, it provides a useful summary of the contents of an executable [30]. Finally, the section table outlines the name, offset and size of each section in the PE file.

Portable Executable File Format

PE sections contain code and initialized data that the Windows loader is to map into executable or readable/writeable memory pages, respectively, as well as imports, exports and resources defined by the file. Each section contains a header that specifies the size and address. An import address table instructs the loader which functions to statically import. A resources section may contain resources such as those required for user interfaces: cursors, fonts, bitmaps, icons, menus, etc. A basic PE file would normally contain a .text code section and one or more data sections (.data, .rdata or .bss). Relocation tables are typically stored in a .reloc section, used by the Windows loader to reassign a base address from the executable’s preferred base. A .tls section contains a special thread local storage (TLS) structure for storing thread-specific local variables, which has been exploited to redirect the entry point of an executable to first check whether a debugger or other analysis tool is being run. Section names are arbitrary from the perspective of the Windows loader, but specific names have been adopted by precedent and are overwhelmingly common. Packers may create new sections; for example, the UPX packer creates UPX1 to house packed data and an empty section UPX0 that reserves an address range for runtime unpacking.

For more information, see EMBER: An Open Dataset for Training Static PE Malware Machine Learning Models (arxiv.org).
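The header layout described above can be read with nothing more than Python's standard `struct` module. The following is a minimal sketch, not a full PE parser: it reads only the COFF file header and the section table, and it runs against a tiny synthetic buffer produced by `build_toy_pe`, a hypothetical helper written for this illustration (its field values are made up).

```python
import struct

def parse_pe_headers(data: bytes) -> dict:
    """Read the COFF file header and the section table from raw PE bytes."""
    if data[:2] != b"MZ":
        raise ValueError("missing DOS 'MZ' signature")
    # The 4-byte field at offset 0x3C gives the file offset of the PE signature.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("missing PE signature")

    # COFF file header: 20 bytes directly after the signature.
    coff_offset = pe_offset + 4
    (machine, n_sections, _timestamp, _symtab_ptr, n_symbols,
     opt_size, _characteristics) = struct.unpack_from("<HHIIIHH", data, coff_offset)

    # The section table follows the optional header; each entry is 40 bytes.
    table_offset = coff_offset + 20 + opt_size
    sections = []
    for i in range(n_sections):
        (raw_name, virt_size, virt_addr, raw_size, raw_ptr,
         _reloc_ptr, _line_ptr, _n_reloc, _n_line, _flags) = struct.unpack_from(
            "<8sIIIIIIHHI", data, table_offset + 40 * i)
        sections.append({
            "name": raw_name.rstrip(b"\0").decode("ascii", errors="replace"),
            "virtual_size": virt_size,
            "virtual_address": virt_addr,
            "raw_size": raw_size,
            "raw_offset": raw_ptr,
        })
    return {"machine": machine, "num_sections": n_sections,
            "num_symbols": n_symbols, "sections": sections}

def build_toy_pe() -> bytes:
    """Hypothetical helper: assemble a header-only buffer for this demo."""
    dos = bytearray(64)
    dos[:2] = b"MZ"
    struct.pack_into("<I", dos, 0x3C, 64)  # PE signature sits right after the DOS header
    # Machine 0x14C is 32-bit x86; two sections; no optional header in this toy file.
    coff = struct.pack("<HHIIIHH", 0x14C, 2, 0, 0, 0, 0, 0x0102)
    text = struct.pack("<8sIIIIIIHHI", b".text", 0x1000, 0x1000, 0x200, 0x200,
                       0, 0, 0, 0, 0x60000020)
    # An "empty" section: address range reserved in memory, no bytes on disk.
    upx0 = struct.pack("<8sIIIIIIHHI", b"UPX0", 0x3000, 0x2000, 0, 0,
                       0, 0, 0, 0, 0xE0000080)
    return bytes(dos) + b"PE\0\0" + coff + text + upx0

info = parse_pe_headers(build_toy_pe())
for section in info["sections"]:
    print(section["name"], section["virtual_size"], section["raw_size"])
```

The toy `UPX0` entry has a non-zero virtual size but a raw size of zero, which is how a section can reserve an address range without occupying space in the file.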

Dataset:

The EMBER dataset is a collection of features from PE files that serve as a benchmark dataset for researchers. The EMBER2017 dataset contains features from 1.1 million PE files scanned in or before 2017, and the EMBER2018 dataset contains features from 1 million PE files scanned in or before 2018. The accompanying EMBER repository makes it easy to reproducibly train the benchmark models, extend the provided feature set, or classify new PE files with the benchmark models.

This paper describes many more details about the dataset: https://arxiv.org/abs/1804.04637

Flow of Our Approach:

The flowchart of Malware Detection using our Model is below:

Flowchart: PE detection using Convolutional Neural Networks

A CNN will be used for training, evaluation, and testing. The next section explains what a CNN is.

Convolutional Neural Network(CNN):

A convolutional neural network (CNN) is a specific type of artificial neural network, built from perceptron-like units and trained with supervised learning, that is used to analyse data. CNNs are applied to image processing, natural language processing and other kinds of cognitive tasks.

A convolutional neural network is also known as a ConvNet.

Like other kinds of artificial neural networks, a convolutional neural network has an input layer, an output layer and various hidden layers. Some of these layers are convolutional: they apply a convolution operation to their input and pass the result on to successive layers. This loosely simulates some of the actions in the human visual cortex.

CNNs are a fundamental example of deep learning, where a more sophisticated model pushes the evolution of artificial intelligence by offering systems that simulate different types of biological human brain activity.

Technically, to train and test a deep learning CNN model, each input image is passed through a series of convolution layers with filters (kernels), pooling layers, and fully connected (FC) layers, and a softmax function is then applied to classify an object with probabilistic values between 0 and 1. The figure below shows the complete flow of a CNN that processes an input image and classifies the objects based on these values.
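The softmax step mentioned above can be illustrated in a few lines of plain Python. This is a minimal sketch: the two input scores are made-up values standing in for the raw outputs of a network's last layer.

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up raw scores for two classes.
probs = softmax([2.0, 0.5])
print(probs)
```

Each output lies strictly between 0 and 1, and the larger raw score receives the larger probability.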

The Graphical Diagram of CNN will be:

Convolutional Neural Networks

Most Common Convolutional Neural Network types are:

  • CNN-1D (1 Dimensional)
  • CNN-2D (2 Dimensional)

In our case, we will be using a 1-dimensional convolutional neural network.

Convolutional Neural Networks (1-Dimensional):

A one-dimensional CNN is a CNN model that has a convolutional hidden layer that operates over a 1D sequence. This is followed by perhaps a second convolutional layer in some cases, such as very long input sequences, and then a pooling layer whose job it is to distil the output of the convolutional layer to the most salient elements.
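To make the two operations concrete, here is a minimal pure-Python sketch of a single 1D convolution followed by max pooling. The input sequence and the kernel are toy values chosen for illustration; in a trained CNN the kernel weights are learned, and there are many kernels per layer.

```python
def conv1d(sequence, kernel, stride=1):
    """Slide `kernel` over `sequence` (no padding) and return the dot products."""
    k = len(kernel)
    return [
        sum(sequence[start + j] * kernel[j] for j in range(k))
        for start in range(0, len(sequence) - k + 1, stride)
    ]

def max_pool1d(sequence, size=2):
    """Keep the largest value from each non-overlapping window."""
    return [
        max(sequence[i:i + size])
        for i in range(0, len(sequence) - size + 1, size)
    ]

signal = [0, 0, 1, 1, 1, 0, 0, 0]
edge_kernel = [-1, 1]  # responds where the sequence steps upward

features = conv1d(signal, edge_kernel)
pooled = max_pool1d(features, size=2)

print(features)  # [0, 1, 0, 0, -1, 0, 0]
print(pooled)    # [1, 0, 0]
```

The convolution marks where the step up occurs, and pooling shortens the result while keeping that strongest response.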

1D CNN – Diagram with Hidden Layers

Neural Network Architecture Design:

In our case, we will be using a 1-dimensional CNN, because the dataset has a sequential pattern.

Our CNN – Model Diagram will be as follows for 1 Dimensional Case:

CNN-1D Architecture

We will be using CNN-1D for our computational model. Our architecture will be as follows:

CNN1D Layer => 128 Filters => Kernel_size 64 => Strides 64 => Activation ReLU

BatchNormalization

CNN1D Layer => 128 Filters => Kernel_size 3 => Strides 2 => Activation ReLU

BatchNormalization

Flattening Layer

Dense Layer => 256 Neurons => Activation ReLU

BatchNormalization

Dense Layer => 32 Neurons => Activation ReLU

BatchNormalization

Dense Layer => 2 Neurons => Activation ReLU
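The effect of these kernel sizes and strides on the data shape can be worked out by hand. The sketch below traces the sequence length through the two convolutional layers, assuming an input vector of 2381 values (the length of an EMBER feature version 2 vector, stated here as an assumption) and no padding. BatchNormalization layers are omitted because they do not change the shape.

```python
def conv1d_output_length(length, kernel_size, stride):
    """Output length of a 1D convolution without padding."""
    return (length - kernel_size) // stride + 1

INPUT_LENGTH = 2381  # assumption: EMBER feature version 2 vector length
FILTERS = 128        # both convolutional layers use 128 filters

after_conv1 = conv1d_output_length(INPUT_LENGTH, kernel_size=64, stride=64)
after_conv2 = conv1d_output_length(after_conv1, kernel_size=3, stride=2)
flattened = after_conv2 * FILTERS

layer_shapes = [
    ("Input", (INPUT_LENGTH, 1)),
    ("Conv1D, kernel 64, stride 64", (after_conv1, FILTERS)),
    ("Conv1D, kernel 3, stride 2", (after_conv2, FILTERS)),
    ("Flatten", (flattened,)),
    ("Dense 256", (256,)),
    ("Dense 32", (32,)),
    ("Dense 2", (2,)),
]
for name, shape in layer_shapes:
    print(f"{name:32s} {shape}")
```

Under these assumptions the first layer, whose stride equals its kernel size, splits the input into 37 non-overlapping windows, the second layer reduces that to 18 positions, and flattening yields 18 × 128 = 2304 values for the first dense layer.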

Our Approach:

Our approach uses a 1-dimensional convolutional neural network as the computational model for our project and its deployment.

We will first extract the EMBER 2018 dataset (feature version 2), which is made available as described in the paper, and convert it into vectors stored in .dat files to create the data objects. The dataset can also be converted into CSV files, but for faster access and better performance we chose the default method of data objects.
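One reason a flat binary file can be faster than CSV is that any row can be fetched by seeking directly to its byte offset, with no text parsing. The sketch below illustrates that idea with the standard library only. It assumes fixed-length float32 rows stored back to back; the file name `X_toy.dat`, the helper functions and the 4-value rows are hypothetical and are not the EMBER library's own API.

```python
import os
import tempfile
from array import array

DIM = 4  # toy feature-vector length; real feature vectors are much longer

def write_rows(path, rows):
    """Store fixed-length float32 rows back to back in one binary file."""
    with open(path, "wb") as fh:
        for row in rows:
            array("f", row).tofile(fh)

def read_row(path, index, dim=DIM):
    """Fetch one row by seeking to its byte offset instead of parsing the file."""
    row = array("f")
    with open(path, "rb") as fh:
        fh.seek(index * dim * row.itemsize)
        row.fromfile(fh, dim)
    return list(row)

rows = [
    [0.0, 1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0, 7.0],
    [8.0, 9.0, 10.0, 11.0],
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "X_toy.dat")
    write_rows(path, rows)
    second = read_row(path, 1)

print(second)
```

With a CSV file the same lookup would require reading and parsing every preceding line as text.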

After creating the data objects, we will build our neural network architecture, which contains two Conv1D layers, each followed by BatchNormalization(), and a flattening layer that feeds the dense layers.

After the dataset extraction step, we will train the neural network and plot graphs to analyse the accuracy results.

For cross-validation, we will load the best-performing saved model, evaluate it, and measure its accuracy through graph visualization.

The testing phase on .exe files will take place after evaluation. An input file is passed to the Python code, and the EMBER feature extractor extracts its features, which are then matched and compared against the trained model's results. A condition is then applied to determine the nature of the file and label it as having either malicious or non-malicious features.
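The final labelling condition could look like the following minimal sketch. The function name `label_file` and the 0.5 threshold are assumptions made for illustration, and the input is taken to be the model's probability for the malicious class.

```python
def label_file(malicious_probability, threshold=0.5):
    """Map a model score to a label; the 0.5 threshold is an assumed default."""
    if not 0.0 <= malicious_probability <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    if malicious_probability >= threshold:
        return "Malicious"
    return "Non-Malicious"

# Made-up scores for two hypothetical input files.
for score in (0.93, 0.08):
    print(score, label_file(score))
```

Raising the threshold would flag fewer benign files by mistake at the cost of missing more malicious ones, so the value is worth tuning on the evaluation results.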

For Part III, Please click the Link Below:

1 – Malware Detection using Deep Learning Approach – Introduction – CodexLearner
