Hashsets & Hashmaps
What's a Hashset?
Generally, when you want to check if an item is in an array you iterate over each element and check whether it matches your item. For small arrays this is fine, but it scales quite poorly, as in the worst case you have to check every individual element. This is where a hashset comes in: it’s a clever way to quickly determine whether an item is in an array.
How they Work
Before you add an element to your array, the first thing you need to do is represent it as a hash. A poor, but illustrative, example of a hash is representing a string as the sum of its ASCII values, mod your array size. For example, hashing the string “hello” with an array size of 25 gives (104 + 101 + 108 + 108 + 111) % 25 = 7 (taking the hash mod the array size keeps the resulting index from falling outside the array).
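As a rough sketch, that toy hash might look something like this in Python (the array size of 25 is just the example value from above):

```python
def toy_hash(s, array_size):
    # Sum the ASCII values of every character, then take the result
    # mod the array size so the index stays within the array's bounds.
    return sum(ord(c) for c in s) % array_size

print(toy_hash("hello", 25))  # 7, since (104 + 101 + 108 + 108 + 111) % 25 == 7
```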
You then store your string, in this case “hello”, at that index, so array[7] = “hello”. Now if you want to check whether “hello” is in your array you don’t have to check each element: you simply hash “hello” and check the array at that index.
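Ignoring collisions for the moment, a minimal sketch of adding and looking up an item could look like this (reusing the toy_hash function from the previous snippet):

```python
capacity = 25
array = [None] * capacity

def add(item):
    # Store the item at the index its hash points to.
    array[toy_hash(item, capacity)] = item

def contains(item):
    # Only one index needs to be checked, no matter how large the array is.
    return array[toy_hash(item, capacity)] == item

add("hello")
print(contains("hello"))  # True
print(contains("world"))  # False
```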
Their Main Problem
The reason the hash I gave as an example is bad is that every permutation of a given string hashes to the same value, so “hello”, “olleh” and “ehllo” all hash to 7. This is called a hash collision, and while there are many ways of dealing with collisions, the method I will be going over is called chaining.
Instead of storing the items directly in the array, you store an array of linked lists. Now when you have a hash collision you simply insert the item into the linked list at that index. Unfortunately, this does mean that when you search for an item whose index has had hash collisions, you once again have to check your item against each element of the linked list. The good news is that, assuming you choose a good hashing algorithm, you shouldn’t have many collisions, so the list you’re searching through will be quite small.
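Building on the toy_hash function from the first snippet, a sketch of chaining might look like this (using ordinary Python lists in place of linked lists to keep the example short):

```python
capacity = 25
buckets = [[] for _ in range(capacity)]  # one (possibly empty) chain per slot

def add(item):
    chain = buckets[toy_hash(item, capacity)]
    if item not in chain:  # scan the (hopefully short) chain to avoid duplicates
        chain.append(item)

def contains(item):
    # Hash once to find the right bucket, then scan only that bucket's chain.
    return item in buckets[toy_hash(item, capacity)]

add("hello")
add("olleh")              # collides with "hello", so it lands in the same chain
print(contains("olleh"))  # True
```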
There is, however, one more wrinkle to consider: what happens when the amount of data you’re storing approaches or outstrips the initial array capacity. Because we’re using an array of linked lists we can of course store more elements than the array size, but the fuller the hashtable gets the more likely we are to have hash collisions, as there are simply fewer empty spots in the array.
To combat this, once our array is deemed to be “too full” we increase the array size by some amount (typically doubling it) and rehash the elements into the new array. The question is then: when is the array “too full”? Rehashing is an expensive operation, but on the flip side, as hash collisions increase, efficiency decreases.
The answer is to compute what’s referred to as the load factor, which is: number of entries / array capacity.
When the load factor is greater than some threshold (typically 0.75), the array is deemed “too full”. This strikes a good balance between performance and memory efficiency.
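Putting the load factor and rehashing together, a self-contained sketch of a chained hashset that grows itself might look like this (the 0.75 threshold and the doubling strategy are just the typical choices mentioned above):

```python
LOAD_FACTOR_THRESHOLD = 0.75

class ChainedHashSet:
    def __init__(self, capacity=8):
        self.capacity = capacity
        self.size = 0
        self.buckets = [[] for _ in range(capacity)]

    def _index(self, item):
        # Same toy hash as before: sum of character codes, mod the capacity.
        return sum(ord(c) for c in item) % self.capacity

    def add(self, item):
        chain = self.buckets[self._index(item)]
        if item not in chain:
            chain.append(item)
            self.size += 1
            # Load factor = number of entries / array capacity.
            if self.size / self.capacity > LOAD_FACTOR_THRESHOLD:
                self._resize()

    def contains(self, item):
        return item in self.buckets[self._index(item)]

    def _resize(self):
        # Double the capacity and rehash every existing item into the new array.
        old_buckets = self.buckets
        self.capacity *= 2
        self.buckets = [[] for _ in range(self.capacity)]
        for chain in old_buckets:
            for item in chain:
                self.buckets[self._index(item)].append(item)
```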
Hashset to Hashmap
So far all I’ve talked about are hashsets, but if you’ve heard anything about hashtables before, it was probably in reference to hashmaps. Fortunately the two data structures are essentially the same, and converting one into the other is quite simple.
To turn a hashset into a hashmap, instead of just storing an item and a pointer to the next node in your linked list, you also store some other value alongside it. This means each node holds a pair of elements: the item that gets hashed is typically called the key, and the other is usually just referred to as the value.
Now when you look up an element in the hashtable, you verify that the node is the correct one using the key, and you spit the value out as your output. At a high level this works just like a function in the mathematical sense, which is why it’s referred to as a hashmap: you’re mapping an input (the key) to an output (the value).
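Turning the chained hashset sketch above into a hashmap really just means storing (key, value) pairs in each chain and matching on the key. A minimal version (resizing omitted for brevity) might look like this:

```python
class ChainedHashMap:
    def __init__(self, capacity=8):
        self.capacity = capacity
        self.buckets = [[] for _ in range(capacity)]

    def _index(self, key):
        # Same toy hash as before: sum of character codes, mod the capacity.
        return sum(ord(c) for c in key) % self.capacity

    def put(self, key, value):
        chain = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:
                chain[i] = (key, value)  # key already present, overwrite its value
                return
        chain.append((key, value))

    def get(self, key):
        # Verify the key matches before returning the mapped value.
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        raise KeyError(key)

m = ChainedHashMap()
m.put("hello", 42)
print(m.get("hello"))  # 42
```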