July | 2015 | Michael Evans

A set of posts by Mark Baggett on the SANS ISC page covered using frequency analysis to detect DNS names and Base64-encoded data. This started me thinking about what else could be learned with frequency analysis. I settled on using Mark's Python code to see whether it could identify executable files.

By importing Mark's original code, I can write a simple script that takes a binary file and builds a frequency table from it.

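# assumes Mark's code is saved locally as freq.py
from freq import FreqCounter

fc = FreqCounter()
fc.ignorecase = True
fc.ignorechars = ""
# read the reference file in binary mode so no bytes are translated or dropped
fc.tally_str(open(freq_file, 'rb').read())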
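To make the idea of a frequency table concrete, here is a minimal sketch of pair-frequency scoring over raw bytes. It is not Mark's implementation: the class name `PairFrequencyTable` and its methods are hypothetical, and it assumes the score is the average, over consecutive byte pairs, of how often the second byte followed the first in the reference data, expressed as a percentage. The real code may differ in its details.

```python
from collections import Counter


class PairFrequencyTable:
    """Hypothetical, simplified stand-in for a byte-pair frequency counter."""

    def __init__(self):
        # pair_counts[(a, b)] = times byte b directly followed byte a
        self.pair_counts = Counter()
        # first_counts[a] = times byte a was followed by any byte
        self.first_counts = Counter()

    def tally(self, data):
        """Add every consecutive byte pair in data to the table."""
        for first, second in zip(data, data[1:]):
            self.pair_counts[(first, second)] += 1
            self.first_counts[first] += 1

    def probability(self, data):
        """Average per-pair likelihood (0 to 100) under the tallied table."""
        probs = []
        for first, second in zip(data, data[1:]):
            total = self.first_counts[first]
            if total:
                probs.append(100.0 * self.pair_counts[(first, second)] / total)
            else:
                # the first byte never appeared in the reference data
                probs.append(0.0)
        return sum(probs) / len(probs) if probs else 0.0


table = PairFrequencyTable()
table.tally(b"ABABAB")
familiar = table.probability(b"ABAB")    # only pairs seen in the reference data
unfamiliar = table.probability(b"AAAA")  # "AA" never occurred in the reference data
```

Input whose byte pairs resemble the reference file scores high, and input made of unseen pairs scores near zero, which is the property the rest of this experiment relies on.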

With a binary file serving to build my frequency table, I can loop through a directory of files and compute a score for each one.
I first created a dictionary to hold each filename and its score so that I can evaluate the results afterwards.

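import glob
import os

score_data = {}

# 'directory' must be a glob pattern (for example, one ending in /*)
path = directory
files = glob.glob(path)
for file in files:
    if os.path.isfile(file):
        print(file)
        # binary mode, since these are not text files
        f = open(file, 'rb')
        score = fc.probability(f.read())
        f.close()
        print(score)
        score_data[file] = score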

Next, I filled a directory with random executables, Excel spreadsheets, PDFs, and images.
For this test I decided to throw both PE and Mach-O file formats into the directory to see what would happen.

For the frequency map I decided to use Explorer.exe. I assumed that this would help to build a pretty comprehensive map that would (hopefully) be representative of a binary executable.

For analysis of the results, I am simply writing the data out to a CSV file to allow for graphing and sorting.

# write our results out to a csv
with open(out_file, 'wb') as f:
    w = csv.writer(f)
    w.writerows(score_data.items())

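The `'wb'` mode above is the Python 2 convention for the `csv` module. As a hedged sketch of the same step on Python 3, the helper below (the name `write_scores` and the header row are my own additions for illustration) opens the file in text mode with `newline=""` and sorts the rows so the highest scores come first.

```python
import csv


def write_scores(score_data, out_file):
    """Write (filename, score) rows to out_file, highest score first."""
    rows = sorted(score_data.items(), key=lambda item: item[1], reverse=True)
    # Python 3: text mode with newline="" lets the csv module control line endings
    with open(out_file, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["filename", "score"])
        writer.writerows(rows)
    return rows
```

Sorting before writing is optional, since a spreadsheet can sort the column just as well, but it makes the raw file easier to skim.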
All right, before we get to the results: I noticed that this processing was taking a long time. Reading and scoring each file sequentially is not efficient, so I did a little refactoring to take advantage of multiprocessing.

import multiprocessing as mp
from os import listdir
from os.path import isfile, join

# Get list of files
files = [join(path, f) for f in listdir(path) if isfile(join(path, f))]

def compute_probability(filename):
    # score one file against the frequency table (fc) built earlier
    f = open(filename, 'rb')
    score = fc.probability(f.read())
    f.close()
    return filename, score

# multiprocess our frequency comparison
pool = mp.Pool(processes=4)
score_data = dict(pool.map(compute_probability, files))

Previous execution time: 382.1 seconds
Multiprocessing execution time: 164.9 seconds
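
One caveat with a process pool: on platforms that start workers by spawning a fresh interpreter (Windows, and macOS on recent Python 3 versions), the pool has to be created under an `if __name__ == "__main__":` guard, and the worker has to be a module-level function so it can be pickled. The sketch below shows that layout on Python 3 with timing included. The scorer `byte_diversity` is a placeholder I made up so the example is self-contained; in the real script the worker would call `fc.probability` instead.

```python
import multiprocessing as mp
import sys
import time
from os import listdir
from os.path import isfile, join


def byte_diversity(filename):
    """Placeholder scorer (an assumption): percent of the 256 byte values present."""
    with open(filename, "rb") as f:
        data = f.read()
    return filename, 100.0 * len(set(data)) / 256


def score_directory(path, worker=byte_diversity, processes=4):
    """Score every regular file in path in parallel; return (scores, seconds)."""
    files = [join(path, f) for f in listdir(path) if isfile(join(path, f))]
    start = time.perf_counter()
    with mp.Pool(processes=processes) as pool:
        scores = dict(pool.map(worker, files))
    return scores, time.perf_counter() - start


if __name__ == "__main__" and len(sys.argv) > 1:
    # usage: python script.py <directory>
    scores, elapsed = score_directory(sys.argv[1])
    print("scored %d files in %.1f seconds" % (len(scores), elapsed))
```

How much a pool helps depends on whether the work is dominated by disk reads or by the scoring itself, so the speedup will vary from machine to machine.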

Now that we have made execution MUCH faster (roughly 2.3 times), we can start looking at the results.

The image is a little hard to see, so I have embedded a snippet of the results below. Interestingly enough, the probability-based frequency analysis rated executables (PE and Mach-O), DLLs, and dylibs fairly high. One anomalous Excel spreadsheet starts to break the streak with a probability of 11.

After that, the results start to intermingle, with dylibs falling in probability. (Yes, I realize that when I grabbed a random assortment of libraries I also copied the linked files, thus creating duplicates.)

These results also show that images and PDFs were NOT scored as likely executables. So, just for fun, I decided to see if we could use frequency analysis to find images. I added some PNGs and BMPs to our directory and used a random PNG file as my frequency map.

I won’t bore you with the results, but nothing in the test came back with a probability higher than 3, and a PNG did not show up until the 51st result. Images are not necessarily structured data, so I was not surprised by these results.

Frequency analysis is fun to experiment with. Feel free to check out the full code on my GitHub, along with Mark's frequency code.


Frequency Analysis of Binary Executables