A couple of weeks ago I was doing some binary analysis for a client. It's not as mainstream as some of the other work I've taken on in recent months, since reverse engineering isn't exactly in demand among typical clients in the tech space, but it's something I enjoy doing, so I was more than happy to take the project on.

There are probably tools out there that could handle this already: a hex editor with a feature like this built in natively, or a *nix command-line tool that would make this a cinch. I couldn't think of one off the top of my head, though, and I figured I could code something up myself faster than I could track down such a tool.
A knee-jerk solution would be to load the binary file into memory and use a nested loop to iterate through the bytes, with a HashSet to keep track of the unique bytes we've seen. The size of the HashSet at any given time tells us the number of unique bytes for that block of X bytes.
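Here's a rough sketch of what that might look like, assuming the whole file fits in memory (the method name and the windowSize parameter are my own):

```java
import java.util.HashSet;
import java.util.Set;

// Naive approach: re-scan every block of `windowSize` bytes from scratch.
static int maxUniqueBytesNaive(byte[] data, int windowSize) {
    int maxUnique = 0;
    for (int start = 0; start + windowSize <= data.length; start++) {
        Set<Byte> seen = new HashSet<>();
        for (int i = start; i < start + windowSize; i++) {
            seen.add(data[i]); // HashSet ignores duplicates
        }
        maxUnique = Math.max(maxUnique, seen.size());
    }
    return maxUnique;
}
```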
Let's say we're looking at subsequences of 10 bytes. The bytes scanned per iteration would look like this: [INSERT ANIMATION]
Notice how in each iteration we're looking at redundant data? This wastes time, and it's also unnecessary.
A nested-loop solution for this type of problem is typically a good sign that we need to step back and think carefully about what we're doing.
[INSERT OUTPUT]
Instead, we'll use a sliding window to preserve the information gathered from each previous iteration. Then we only need to analyze the new bytes we haven't seen yet and adjust our previous metrics accordingly, rather than starting from scratch.
We'll ditch the HashSet and swap it out for a data structure we can use to look up a value (a count) for each key (each byte we encounter). A HashMap would work perfectly fine for this, since it lets us work with key/value pairs with ease. However, because we're dealing with individual bytes, which can only take one of 256 possible values, we can actually use an even simpler data structure with a lighter footprint: an array.
We need to initialize an int array with a length of 256:
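```java
// One counter per possible byte value (0x00 through 0xFF);
// the name `counts` is just a placeholder.
int[] counts = new int[256];
```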
All elements are zeroed out by default in Java - so that's all we have to do there.
Next, we'll declare two variables, "left" and "right", which define the left and right edges of our window:
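We'll also want the two counters that the steps below refer to, numUniqueBytes and maxNumUniqueBytes:

```java
int left = 0;               // index of the oldest byte in the window
int right = 0;              // index of the next byte entering the window
int numUniqueBytes = 0;     // distinct byte values in the current window
int maxNumUniqueBytes = 0;  // best value of numUniqueBytes seen so far
```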
We'll set up a loop that iterates until _right_ reaches the length of our data buffer. Inside the loop, we apply the following logic at each iteration:
-> Look up the count for the byte at index _left_ of the buffer in the lookup table and decrement it by 1 (because we are "dropping" that byte from the left edge of the current window). If the count was 1 and is now 0, that byte no longer appears in the current subsequence of X bytes, so decrease numUniqueBytes by 1. If the count is still 1 or more, we do nothing. (We skip this step while the window is still growing toward its full size of X bytes.)
-> Look up the count for the byte at index _right_ of the buffer in the lookup table and increment it by 1. If the count was 0 and is now 1, it's a byte we haven't seen yet in this subsequence of X bytes, so increase numUniqueBytes by 1. If the count was already 1 or more, we do nothing.
-> If numUniqueBytes is greater than maxNumUniqueBytes, we've hit a new maximum of unique bytes, so we set maxNumUniqueBytes to numUniqueBytes. If numUniqueBytes is less than or equal to maxNumUniqueBytes, we leave maxNumUniqueBytes alone.
-> Finally, advance _right_ by 1 (and _left_ as well, once the window is at full size) and move on to the next iteration.
Below is a Java method that contains the general logic behind the algorithm.
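This is a sketch assembled from the steps above; the name findMaxUniqueBytes and the windowSize parameter are my own, and the & 0xFF mask converts Java's signed bytes into array indices. It adds the incoming byte first and then drops the outgoing one, which is equivalent to the drop-then-add ordering described above once the window has filled:

```java
static int findMaxUniqueBytes(byte[] data, int windowSize) {
    int[] counts = new int[256];   // occurrences of each byte value in the window
    int numUniqueBytes = 0;
    int maxNumUniqueBytes = 0;
    int left = 0;

    for (int right = 0; right < data.length; right++) {
        // Bring in the byte entering the window on the right edge.
        int incoming = data[right] & 0xFF;
        if (counts[incoming] == 0) {
            numUniqueBytes++;      // first occurrence within this window
        }
        counts[incoming]++;

        // Once the window would exceed its full size, drop the byte
        // leaving on the left edge.
        if (right - left + 1 > windowSize) {
            int outgoing = data[left] & 0xFF;
            counts[outgoing]--;
            if (counts[outgoing] == 0) {
                numUniqueBytes--;  // that byte no longer appears in the window
            }
            left++;
        }

        // Record a new maximum, but only for full windows.
        if (right - left + 1 == windowSize && numUniqueBytes > maxNumUniqueBytes) {
            maxNumUniqueBytes = numUniqueBytes;
        }
    }
    return maxNumUniqueBytes;
}
```

Each byte is added and removed at most once, so this runs in linear time regardless of the window size. Feeding it a buffer read with, say, Files.readAllBytes(Paths.get("target.bin")) and a window size of 10 reproduces the example above.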
The gist contains code that builds on the above logic and extends it into a fully functional command-line tool. I also added logic to handle edge cases, as well as support for analyzing bytes in chunks to accommodate large files that can't be stuffed into memory all at once.