Monthly Archives: July 2015

Frequency Analysis of Binary Executables

A set of posts by Mark Baggett on the SANS ISC page covered using frequency analysis to detect DNS names and Base64-encoded data. This started me thinking about what else could be learned with frequency analysis. I settled on using Mark's Python code to see whether it could identify executable files.

By importing Mark's original code, I can write a simple script that builds a frequency table from a binary file I pass to it.

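from freq import FreqCounter  # assumes Mark's code is saved locally as freq.py

fc = FreqCounter()
fc.ignorecase = True
fc.ignorechars = ""
# Build the frequency table from the reference binary (read as raw bytes)
with open(freq_file, 'rb') as f:
    fc.tally_str(f.read())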
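To make the idea of a frequency table more concrete, here is a minimal sketch of one way such scoring can work. This is not Mark's implementation. The class name PairFrequency and its methods are hypothetical, and I am assuming a scheme that counts adjacent byte pairs and scores new data by how often its pairs appeared in the reference file.

```python
from collections import Counter, defaultdict


class PairFrequency:
    """Toy stand-in for a frequency table: counts adjacent byte pairs."""

    def __init__(self):
        # pair_counts[a][b] = how often byte b followed byte a in the reference data
        self.pair_counts = defaultdict(Counter)

    def tally(self, data: bytes) -> None:
        """Add every adjacent byte pair in data to the table."""
        for first, second in zip(data, data[1:]):
            self.pair_counts[first][second] += 1

    def probability(self, data: bytes) -> float:
        """Average likelihood (0-100) of each pair in data, given its first byte."""
        if len(data) < 2:
            return 0.0
        total = 0.0
        for first, second in zip(data, data[1:]):
            followers = self.pair_counts.get(first)
            if followers:
                # Pairs never seen in the reference data contribute 0
                total += followers[second] / sum(followers.values())
        return 100.0 * total / (len(data) - 1)


table = PairFrequency()
table.tally(b"abababab")              # stands in for the reference binary
seen = table.probability(b"abab")     # every pair was seen in the reference
unseen = table.probability(b"zzzz")   # no pair was seen in the reference
print(seen, unseen)
```

Data whose byte pairs resemble the reference file scores high, and data with unfamiliar pairs scores low, which is the property the rest of this experiment relies on.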

With a binary file serving to build my frequency table, I can loop through a directory of files and compute a score for each one.
I first created a dictionary to hold the filename and score of each file so that I can evaluate the results afterwards.

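import glob
import os

score_data = {}

# Match every entry inside the directory, not just the directory itself
files = glob.glob(os.path.join(directory, '*'))
for filename in files:
    if os.path.isfile(filename):
        print(filename)
        # Read as raw bytes; 'read' is not a valid mode string
        with open(filename, 'rb') as f:
            score = fc.probability(f.read())
        print(score)
        score_data[filename] = score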

Now to fill a directory with random executables, Excel spreadsheets, PDFs, and images.
For this test I decided to throw both PE and Mach-O file formats into the directory to see what would happen.

For the frequency map I decided to use Explorer.exe. I assumed that this would help to build a pretty comprehensive map that would (hopefully) be representative of a binary executable.

To analyze the results, I simply write the data out to a CSV file, which allows for graphing and sorting.

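import csv

# Write our results out to a CSV, one "filename,score" row per file
# ('wb' is the Python 2 convention for csv; on Python 3 use 'w' with newline='')
with open(out_file, 'wb') as f:
    w = csv.writer(f)
    w.writerows(score_data.items())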
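As a rough illustration of the sorting step, the CSV can be read back and ranked from highest to lowest score. This is only a sketch in Python 3: the rank_scores helper is hypothetical, and the sample rows are made up purely to show the shape of the data, not taken from my results.

```python
import csv
import io


def rank_scores(csv_file):
    """Return (filename, score) rows sorted from highest to lowest score."""
    rows = [(name, float(score)) for name, score in csv.reader(csv_file)]
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Made-up sample rows standing in for the real output file
sample = io.StringIO("notes.pdf,1.5\ncalc.exe,14.2\nphoto.png,0.8\n")
ranked = rank_scores(sample)
for name, score in ranked:
    print(name, score)
```

With a real output file, passing an open file handle in place of the StringIO object should work the same way.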

All right, before we get to the results: I noticed that this processing was taking a long time. Reading and scoring each file one after another is not efficient, so I did a little refactoring to take advantage of multiprocessing.

import multiprocessing as mp
from os import listdir
from os.path import isfile, join

# Get list of files
files = [join(path, f) for f in listdir(path) if isfile(join(path, f))]

def compute_probability(filename):
    # fc is the frequency table built earlier
    with open(filename, 'rb') as f:
        score = fc.probability(f.read())
    return filename, score

# Multiprocess our frequency comparison
pool = mp.Pool(processes=4)
score_data = dict(pool.map(compute_probability, files))
pool.close()
pool.join()

Previous execution time: 382.1 seconds
Multiprocessing execution time: 164.9 seconds
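To put those two timings in perspective, the speedup and the per-process efficiency can be worked out directly. The variable names here are my own, and the worker count of 4 comes from the Pool call above.

```python
sequential_seconds = 382.1
parallel_seconds = 164.9
workers = 4  # processes=4 in the Pool call

# Speedup: how many times faster the multiprocessing run was
speedup = sequential_seconds / parallel_seconds
# Efficiency: fraction of the ideal 4x speedup actually achieved
efficiency = speedup / workers

print("speedup: %.2fx" % speedup)
print("efficiency: %.0f%%" % (efficiency * 100))
```

That works out to roughly a 2.3x speedup, or a bit under 60% of the ideal for four processes. I did not measure where the rest went, but disk reads and process start-up overhead are plausible candidates.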

Now that execution is MUCH faster (roughly 2.3 times), we can start looking at the results.
[Image: Screen Shot 2015-07-10 at 4.44.42 PM]

The image is a little hard to see, so I have embedded a snippet of the results below. Interestingly enough, the probability-based frequency analysis rated executables (PE and Mach-O), DLLs, and dylibs fairly high. One anomalous Excel spreadsheet starts to break the streak with a probability of 11.

[Image: Screen Shot 2015-07-10 at 4.54.32 PM]
After that, the results start to intermingle, with dylibs falling in probability. (Yes, I realize that when I grabbed a random assortment of libraries I also copied the linked files, which created duplicates.)

These results also show that images and PDFs scored low, meaning they were NOT rated as likely executables. So, just for fun, I decided to see if we could use frequency analysis to find images. I added some PNGs and BMPs to the directory and used a random PNG file as my frequency map.

I won’t bore you with the results, but nothing in the test came back with a probability higher than 3, and a PNG did not show up until the 51st result. Images are not necessarily structured data, so I was not surprised by these results.

Frequency analysis is fun to experiment with. Feel free to check out the full code on my GitHub, along with Mark's frequency code.