1 – Malware Detection using Deep Learning Approach – Introduction

We will re-implement the approach from a paper published in 2017 by Edward Raff, Jon Barker, Jared Sylvester, Robert Brandon, Bryan Catanzaro, and Charles Nicholas [1]. Step by step, we will go through the implementation concept of the malware detection system.

This article series is divided into the following initial parts:

1- Introduction

2- Dataset & Flow Diagrams

3- Science & Environment of Implementation

This part gives a basic introduction to malware detection using state-of-the-art artificial intelligence methods, and to the prior work evaluated in the paper.

Abstract:

In this work we introduce malware detection from raw byte sequences as a fruitful research area to the larger machine learning community. Building a neural network for such a problem presents a number of interesting challenges that have not occurred in tasks such as image processing or NLP (Natural Language Processing). In particular, we note that detection from raw bytes presents a sequence problem with over two million time steps and a problem where batch normalization appears to hinder the learning process. We present our initial work in building a solution to tackle this problem, which has linear complexity dependence on the sequence length, and allows for interpretable sub-regions of the binary to be identified. In doing so we will discuss the many challenges in building a neural network to process data at this scale, and the methods we used to work around them.

Introduction:

The detection of malicious software (malware) is an important problem in cyber security, especially as more of society becomes dependent on computing systems. Already, single incidents of malware can cause millions of dollars in damages (Anderson et al. 2013). Anti-virus products provide some protection against malware, but are growing increasingly ineffective for the problem. Current anti-virus technologies use a signature-based approach, where a signature is a set of manually crafted rules in an attempt to identify a small family of malware. These rules are generally specific, and cannot usually recognize new malware even if it uses the same functionality. This approach is insufficient as most environments will have unique binaries that will have never been seen before (Li et al. 2017) and millions of new malware samples are found every day. The limitations of signatures have been recognized by the anti-virus providers and industry experts for many years (Spafford 2014). The need to develop techniques that generalize to new malware would make the task of malware detection a seemingly perfect fit for machine learning, though there exist significant challenges.
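To make the brittleness of signatures concrete, here is a deliberately simplified sketch. It assumes a signature is just a fixed byte pattern (real anti-virus rules are richer), and the family name and pattern below are hypothetical values chosen for illustration only.

```python
from typing import Dict, List

# Hypothetical signature database: family name -> byte pattern.
# These values are made up for illustration; they are not real signatures.
SIGNATURES: Dict[str, bytes] = {"demo-family": bytes.fromhex("deadbeef90")}


def match_signatures(raw: bytes, signatures: Dict[str, bytes] = SIGNATURES) -> List[str]:
    """Return the names of all signatures whose pattern occurs in the raw bytes."""
    return [name for name, pattern in signatures.items() if pattern in raw]


# A sample containing the pattern, and a variant with a single byte changed.
original = b"\x00\x01" + bytes.fromhex("deadbeef90") + b"\x02"
variant = b"\x00\x01" + bytes.fromhex("deadbeef91") + b"\x02"

print(match_signatures(original))  # the pattern is found
print(match_signatures(variant))   # the one-byte change is enough to evade it
```

A one-byte change is enough to defeat this kind of exact rule, which is the generalization problem the paragraph above describes.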

To build a malware detection system, we must first determine a feature set to use. One intuitive choice is to use features obtained by monitoring program execution (APIs called, instructions executed, IP addresses accessed, etc.). This is referred to as dynamic analysis. While intuitively appealing, there are many issues with dynamic analysis in practice. To conduct dynamic analysis, malware must be run inside a specially instrumented environment, such as a customized Virtual Machine (VM), which introduces high computational requirements. Furthermore, in some cases it is possible for malware to detect when it is being analysed. When the malware detects an attempt to analyse it, the malware can alter its behaviour, allowing it to avoid discovery (Raffetseder, Kruegel, and Kirda 2007; Garfinkel et al. 2007; Carpenter, Liston, and Skoudis 2007). Even when malware does not exhibit this behaviour, the analysis environment may not reflect the target environment of the malware, creating a discrepancy between the training data collected and real life environments (Rossow et al. 2012). While a dynamic analysis component is likely to be an important component for a long term solution, we avoid it at this time due to its added complexity.

We instead take a static analysis approach, where we look at information from the binary program that can be obtained without running it. In particular, we look at the raw bytes of the file itself, and build a neural network to determine maliciousness. Neural nets have excelled in learning features from raw inputs for image (Szegedy et al. 2015), signal (Graves, Mohamed, and Hinton 2013), and text (Zhang and LeCun 2015) problems. Replicating this success in the malware domain may help to simplify the tools used for detecting malware and improve accuracy. Because malware may frequently exploit bugs and ignore format specifications, parsing malicious files and using features that require domain knowledge can require significant and nontrivial effort. Since malware is written by a real live adversary, such code will also require maintenance and improvement to adjust to changing behaviour of the malware authors.
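As a rough illustration of what "looking at the raw bytes" can mean in practice, the sketch below turns a file's bytes into a fixed-length integer sequence that a network could consume. The two-million cap, the padding index, and the function names are assumptions made for this example, not details taken from the paper's code.

```python
from pathlib import Path
from typing import List

MAX_LEN = 2_000_000  # assumed cap, chosen to match the "two million time steps" scale
PAD_ID = 0           # hypothetical padding index; byte values are shifted to 1..256


def bytes_to_sequence(raw: bytes, max_len: int = MAX_LEN) -> List[int]:
    """Map raw bytes to integer tokens, truncated or right-padded to max_len."""
    # Shift every byte by one so that index 0 stays reserved for padding.
    tokens = [b + 1 for b in raw[:max_len]]
    tokens.extend([PAD_ID] * (max_len - len(tokens)))
    return tokens


def file_to_sequence(path: str, max_len: int = MAX_LEN) -> List[int]:
    """Read a file from disk (without executing it) and encode its bytes."""
    return bytes_to_sequence(Path(path).read_bytes(), max_len)
```

The file is only read, never executed, which is what separates this from dynamic analysis. Reserving index 0 for padding lets a downstream embedding layer distinguish "no data" from a genuine zero byte.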

Since we desire to learn a system from raw byte inputs, from which higher level representations will be constructed, we choose to use a neural network based approach. However, there exist a number of challenges and differences for this domain that have not been encountered in other tasks. These challenges make research in malware detection intrinsically interesting and relevant from a machine learning perspective beyond merely introducing these techniques to a novel domain. For Microsoft Windows Portable Executable (PE) malware, these challenges include but are not limited to:

  1. The bytes in malware can have multiple modalities of information. The meaning of any particular byte is context sensitive, and could be encoding human-readable text (e.g., function names from the import table), binary code, arbitrary objects such as images (from the resource/data sections of a binary), and more.
  2. The content of a binary exhibits multiple types of spatial correlation. Code instructions in a function are intrinsically correlated spatially, but this correlation has discontinuities from function calls and jump commands. Further, the contents at a function level can be arbitrarily re-arranged if addresses are properly corrected.
  3. Treating each byte as a unit in a sequence, we are dealing with a sequence classification problem on the order of two million time steps. To our knowledge, this far exceeds the length of input to any previous neural network based sequence classifier.
  4. Our problem has multiple levels of concept drift over time. The applications, build tools, and libraries developers use will naturally be updated, and alternatives will fall in and out of favour. This alone causes concept drift. But malware is written by a real-life adversary, and is often intentionally adjusted to avoid detection.
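The first challenge in the list, mixed modalities, can be seen with a few lines of code. The sketch below pulls runs of printable ASCII out of a byte buffer, similar in spirit to the Unix `strings` tool. The sample buffer is made up for illustration: it interleaves a few x86-style opcode bytes with two names of the kind found in an import table.

```python
import re
from typing import List


def printable_runs(raw: bytes, min_len: int = 4) -> List[str]:
    """Return every run of at least min_len printable ASCII bytes."""
    # 0x20-0x7e is the printable ASCII range; shorter runs are treated as noise.
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [m.group().decode("ascii") for m in pattern.finditer(raw)]


# Hypothetical buffer: opcode-like bytes mixed with human-readable names.
sample = (
    b"\x55\x8b\xec"        # code-like bytes
    + b"KERNEL32.dll\x00"  # text, as it might appear in an import table
    + b"\x90\x90\xc3"      # more code-like bytes
    + b"CreateFileA\x00"   # text again
)

print(printable_runs(sample))
```

The same flat byte stream carries both text and code, and nothing in the bytes themselves marks where one ends and the other begins. That context sensitivity is what the network has to learn.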

Our contributions in this work are the development of the first, to our knowledge, network architecture that can successfully process a raw byte sequence of over two million steps. Others have attempted this task, but failed to outperform simpler baselines or successfully process the entire file (Anderson 2017), in part because the techniques developed for signal and image processing do not always transfer to this new domain. We identify the challenges involved in making a network detect malware from raw bytes, and the initial methods one can employ to successfully train such a model. We show that this model learns a wider breadth of information types compared to previous domain-knowledge-free approaches to malware detection. Our work also highlights a failure case for batch normalization, which initially rendered our model unable to learn.

For this work, we will use a standard dataset called the “EMBER dataset”, which stands for Elastic Malware Benchmark for Empowering Researchers (EMBER).

The EMBER dataset is a collection of features from PE files that serves as a benchmark dataset for researchers. The EMBER2017 dataset contains features from 1.1 million PE files scanned in or before 2017, and the EMBER2018 dataset contains features from 1 million PE files scanned in or before 2018. The EMBER repository makes it easy to reproducibly train the benchmark models, extend the provided feature set, or classify new PE files with the benchmark models.
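As a first sanity check on such a dataset, one might count how many samples fall into each class. The sketch below assumes the raw features are stored as JSON Lines (one JSON object per line) and that each record has a `label` field with 0 for benign, 1 for malicious, and -1 for unlabeled. Both are assumptions about the file layout and should be checked against the downloaded data. The sample records are made up.

```python
import json
from collections import Counter
from typing import Iterable

# Assumed label encoding; verify against the dataset documentation.
LABEL_NAMES = {0: "benign", 1: "malicious", -1: "unlabeled"}


def count_labels(lines: Iterable[str]) -> Counter:
    """Count records per class in an iterable of JSON Lines strings."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        counts[LABEL_NAMES.get(record.get("label"), "unknown")] += 1
    return counts


# Made-up records standing in for lines of a feature file.
sample_lines = [
    '{"sha256": "aa", "label": 1}',
    '{"sha256": "bb", "label": 0}',
    '{"sha256": "cc", "label": -1}',
    '{"sha256": "dd", "label": 1}',
]
print(count_labels(sample_lines))
```

Because an open file object is itself an iterable of lines, the same function can be passed a file handle directly, so a large feature file never has to be loaded into memory at once.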

This paper describes many more details about the dataset: https://arxiv.org/abs/1804.04637

We will use a static analysis approach: PE header and section information is computed from each file, and detection is performed by a neural network architecture that we adapt for this work from prior work.
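To give a feel for what static PE header information looks like, here is a minimal sketch that reads a few header fields using only the standard library. The offsets follow the published PE/COFF layout (the `MZ` magic, the 4-byte pointer at offset 0x3C, the `PE\0\0` signature, then the 20-byte COFF file header). The function name and the returned dictionary keys are choices made for this example, and this is not the feature extraction used later in the series.

```python
import struct
from typing import Dict


def parse_pe_header(raw: bytes) -> Dict[str, int]:
    """Extract a few COFF file header fields from the bytes of a PE file."""
    if len(raw) < 0x40 or raw[:2] != b"MZ":
        raise ValueError("not a PE file: missing MZ magic")
    # The 4 bytes at offset 0x3C hold the file offset of the PE signature.
    (pe_offset,) = struct.unpack_from("<I", raw, 0x3C)
    if raw[pe_offset:pe_offset + 4] != b"PE\x00\x00":
        raise ValueError("not a PE file: missing PE signature")
    if len(raw) < pe_offset + 24:
        raise ValueError("truncated COFF file header")
    # COFF file header: Machine, NumberOfSections, TimeDateStamp,
    # PointerToSymbolTable, NumberOfSymbols, SizeOfOptionalHeader, Characteristics.
    machine, n_sections, timestamp, _, _, opt_size, characteristics = struct.unpack_from(
        "<HHIIIHH", raw, pe_offset + 4
    )
    return {
        "machine": machine,
        "number_of_sections": n_sections,
        "timestamp": timestamp,
        "size_of_optional_header": opt_size,
        "characteristics": characteristics,
    }


# Build a tiny synthetic header for demonstration (not a runnable executable).
dos_stub = bytearray(0x40)
dos_stub[:2] = b"MZ"
struct.pack_into("<I", dos_stub, 0x3C, 0x40)
coff = struct.pack("<HHIIIHH", 0x8664, 3, 0, 0, 0, 240, 0x22)
synthetic = bytes(dos_stub) + b"PE\x00\x00" + coff

print(parse_pe_header(synthetic))
```

Malformed or deliberately corrupted headers are common in malware, so a real extractor needs far more defensive checks than this. That parsing effort is part of what the raw-byte approach described above tries to avoid.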

The next section gives an overview of the project, with some additional diagrams.
