Cracking a weak hash function


A few years ago, I found a job offer from a software company that required you to break a hash function in order to be able to apply. The hash function itself was pretty weak, and the constraints were sensible enough for the requirement to make sense. Since then, the company has taken the offer down, and as far as I know, they do not use the same exercise anymore, which is why I decided to post my solution.

The task

I cannot find an archive page of the job offer description, so I’ll paraphrase from memory.

Given this hash function:

def hash(s):
    h = 7
    letters = "acdegilmnoprstuw"
    for c in s:
        h = h * 37 + letters.find(c)
    return h

(notice that Python’s .find() returns the index at which the character is found, or -1 if it isn’t, like indexOf does in other languages)

Find a 9-letter string of characters such that hash(string) is 945924806726376, where string can only contain letters from acdegilmnoprstuw.

The weakness

For this explanation, I will ask you to assume that I’m a 5 year old. No advanced math here, just pure logic that anybody can understand and follow.

If you look at the hash closely, you will notice that it starts with a fixed value 7, and that it then multiplies that value by 37 and adds some (negligible for now) offset. This means that, for every character there is in the string to be hashed, the value of h will be multiplied by 37 and some negligible value, up to 15, will be added.

With this information alone, you know that, as long as the length of the input string increases, the resulting hash will also increase, i.e. the value of the hash is linearly proportional (somewhat) to the input string. We do know that the length will be 9, which means there will be a fixed amount of iterations in hash()’s loop.

The value that’s added after multiplying is simply an offset, a value from 0 to 15, and that’s the only thing that can change between different possible values to be fed into the hash() function. That is, the smallest possible value we can get with these constraints is hash("aaaaaaaaa") -> 909732178565539, and the biggest possible value is hash("wwwwwwwww") -> 963882903480154.

A necessary consequence of all this is that hash("bbbbbbbbb") < hash("bbbbbbbbc") < hash("bbbbbbbbd"). Another thing is that, because of the cumulative design of the hash, characters at the end will produce much smaller differences than characters at the beginning; that is, hash("wwwwwwwww") - hash("wwwwwwwwa") will produce a much smaller difference than hash("wwwwwwwww") - hash("awwwwwwww") will:

>>> hash("wwwwwwwww") - hash("wwwwwwwwa")
>>> hash("wwwwwwwww") - hash("awwwwwwww")

With this knowledge, we now know that, as long as the string “increases” (from aaaaaaaaa to wwwwwwwww), the value of the hash will increase as well, linearly. Also, each character has less and less impact on the final value of the hash as we progress over the string - i.e. the first character will make the biggest difference in value, while the last will make the least.

Therefore, the only thing necessary for this task is to progressively increase characters at the start of the string, and as soon as we surpass that value, we got the proper character (which is the previous one, as the further characters in the string will always add some numerical value to the final hash).

The solution

My code is at the end of this post1. However, the necessary steps can be very easily reproduced from this output:

uaaaaaaaa: too big (958906890920433) # first char is 'u'? no - too big, try something that would make the hash smaller
taaaaaaaa: too big (955394411466512) # 't' makes the hash smaller, but it is still too big
paaaaaaaa: too small (944856973104749) # the first letter is 'p' - move on to the next, because now the hash is too small
puaaaaaaa: too big (946186019384611)
praaaaaaa: too small (945901223753212) # the next one is 'r' - move on to the next
pruaaaaaa: too big (945937143922938)
proaaaaaa: too small (945924315290893) # next one is 'o' - move on to the next


/r/hackcasual on reddit pointed out that you don’t even need to guess like this. That is, for the expression:

a * 37 + x = b

you can solve for x with modulus even though you only know b. The reasoning, I quote, is:

Let’s say I give you an expression a * 37 + x = b. Even though you only know b, you can solve for x with modulus. Think of it like this, let’s say I give you a * 10 + x = 74. As long as a and x are integers, you know x must be 4.

def crackHash(hv):
    while hv > 37:
        print(hv % 37)
        hv /= 37

that cracks it in reverse order

  1. Click here for my code, or here to run it in your browser ↩︎