Bootcamper Question #1: Hash efficiency

A friend forwarded me this question from a friend working thru a programming bootcamp. It seemed like a great opportunity to write a post, and I would welcome similar questions:

Question

“Can you clarify how classes (and really hashes) can more efficiently use keys when the initialization definition is established?”

Response

This is a great question and a fairly deep topic to be diving into if you’re early in your coursework. Which will make it extra fun to try answering! However, I would caution that it may not be important to know at this phase, as a fair amount of modern programming consists of trusting that the little black boxes work the way they say they do & moving on to the larger task at hand. Performance and efficiency concerns are precisely where that approach can break down, & it’s very valuable to be willing to open up every box, so let’s attempt it. This whole topic is best addressed by a course or a good book on data structures & algorithms. Since I’ve never taken a formal course, I’ll share the intuitions that have helped me despite that.

Let’s start with what makes hashes (and classes represented as hashes) efficient in the first place, and then work up to why initializing them up-front might aid in that efficiency.

A Naive Dictionary

When you use a hash table, or any dictionary structure, you want to be able to store a value associated with a key and to retrieve that value when all you have is the key. You could do that using arrays[1]:

// initialize our "dictionary"
const arrayDictionary = [];

// store some key,value pairs
arrayDictionary.push(['apples', 'red']);
arrayDictionary.push(['bananas', 'yellow']);
arrayDictionary.push(['oranges', 'orange']);

// get an arbitrary value from an array dictionary
function get(key) {
  for (let i = 0; i < arrayDictionary.length; i++) {
    if (arrayDictionary[i][0] === key) {
      return arrayDictionary[i][1];
    }
  }
}

const bananaColor = get('bananas');

Notice that when we get() certain elements, we have to walk thru the whole array to find the value associated with the last-inserted key. This means that in the worst case we walk the whole length of the array. We would say that the get operation for an arrayDictionary has linear-time search, or O(n) complexity in Big-O Notation. For collections of elements which can grow extremely large, we want to avoid any operation which in the worst case is linear with respect to the size of the collection.
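
To make that worst case concrete, here’s a small variation of the get() function above (the countingGet name is mine, purely for illustration) that counts how many key comparisons the lookup performs:

// same lookup as get(), but counting how many comparisons it takes
function countingGet(key) {
  let comparisons = 0;
  for (let i = 0; i < arrayDictionary.length; i++) {
    comparisons++;
    if (arrayDictionary[i][0] === key) {
      console.log(`found ${key} after ${comparisons} comparison(s)`);
      return arrayDictionary[i][1];
    }
  }
  console.log(`${key} not found after ${comparisons} comparison(s)`);
}

countingGet('apples');  // found after 1 comparison
countingGet('oranges'); // found after 3 comparisons -- the whole array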

To make this & other performance concerns more tangible, I like to imagine the computer executing my instructions as a little man who has to run around doing the work of getting values and incrementing counters and comparing things. Even tho I can write a simple for loop which hides[2] the little man, the little man is still doing a lot of running around on my behalf, and how much depends on how big the collection is. I have a rough heuristic of how much work each operation is for the little man to do.[3] At each level the amount of work increases 10-100X over the previous level:

  1. Numerical addition, subtraction, comparison, and bit shift/rotation/manipulation operations
  2. Multiplication & division
  3. Accessing a value from memory which hasn’t recently been used
  4. Allocating memory

Briefly, how Hash Tables work

So we’ve seen that stepping thru each element of the array to get a key’s value is inefficient; what can we do instead? What would be really nice is if something could cheaply tell us, given a key, what position in the array we should look at. This is essentially what a hash function does. It takes a data structure (the key) as input, and returns as output a numerical value which evenly distributes the inputs over a defined numerical range. Because hash functions are often implemented with a combination of multiplication and bit manipulation operations (levels 1 & 2 above), they should be much faster than iterating thru the array of elements (level 3).
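
Here’s a sketch of what a hash function for string keys might look like (purely illustrative; real engines use more carefully designed functions), turning a key into a position in a fixed-size table using only cheap multiply and bit operations:

// a rough, illustrative string hash function
function hashKey(key, tableSize) {
  let hash = 0;
  for (let i = 0; i < key.length; i++) {
    // mix each character code into the running hash with cheap
    // multiply & bit operations (levels 1 & 2 on the list above)
    hash = (hash * 31 + key.charCodeAt(i)) | 0; // | 0 keeps it a 32-bit integer
  }
  // map the hash onto a position within the fixed-size table
  return Math.abs(hash) % tableSize;
}

hashKey('bananas', 8); // some index between 0 and 7, without looking at any other keys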

A hash table implementation will start life by initializing a fixed-size table to be used to store values. Inside its set(key, value) implementation it will use a hash function to determine where in that table to store the key, value pair. In order to account for the probability of a hash collision (when the hash function gives two different keys the same numerical position in the table) it will store them in a list much like the one shown in our arrayDictionary example. Iterating over this list is, of course, less efficient than calculating the precise position in the table with a hash function, which is why it’s very important that the hash function evenly distributes the keys over the table: if it didn’t, the values would clump in one section of the table and you would have the same performance characteristics as our naive example. This risk of collision & clumping has to be balanced against our desire to not waste space with a table that’s too large. We also don’t want to have to rehash the table too frequently.
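
Putting those pieces together, here’s a minimal sketch of such a table (the SimpleHashTable name & details are mine, not how any real language implements it), built on the hashKey sketch above and resolving collisions with the list-per-position approach just described:

// a minimal sketch of a hash table that resolves collisions by chaining
class SimpleHashTable {
  constructor(tableSize = 8) {
    this.tableSize = tableSize;
    // one small list ("bucket") per position in the fixed-size table
    this.buckets = Array.from({ length: tableSize }, () => []);
  }

  set(key, value) {
    const bucket = this.buckets[hashKey(key, this.tableSize)];
    for (const pair of bucket) {
      if (pair[0] === key) { pair[1] = value; return; } // key already stored: overwrite
    }
    bucket.push([key, value]); // new key (or collision): append to this bucket's list
  }

  get(key) {
    const bucket = this.buckets[hashKey(key, this.tableSize)];
    // like the naive arrayDictionary, but only over one (hopefully tiny) bucket
    for (const pair of bucket) {
      if (pair[0] === key) {
        return pair[1];
      }
    }
  }
}

const table = new SimpleHashTable();
table.set('bananas', 'yellow');
table.get('bananas'); // 'yellow'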

Initialization efficiency & rehashing

When the hash table is created, the size of the fixed-size table is chosen, and often users are able to hint what size it should be initialized with. This is useful because the table size affects the probability of collision, and therefore the probability that for a given key you will once again be iterating over a list (the thing we don’t want to do). In order for hash tables to remain performant as more elements get added, some implementations will re-size the table when the number of elements stored exceeds twice the size of the table. This ensures that, on average, you’re never iterating over much more than 2 elements per lookup. However, this rehashing process will itself require iterating over all elements of the collection. So declaring up-front to the hash table implementation how many elements you expect to be storing helps minimize the frequency of having to perform this rehashing process. That, finally, is I hope a decent answer to your question: by declaring the properties and methods of your class you are effectively declaring the expected size of that hash table. This assumes you are using Ruby or Python or some other language that models classes as hash tables; if you’re using the V8 engine in Javascript the answer is more complex, because it only converts objects to a hash table as a last resort.
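
To illustrate, here’s how that size hint and rehashing might look if we extended the SimpleHashTable sketch above (the expectedSize parameter and the 2X threshold are illustrative choices of mine, not what Ruby, Python, or V8 actually do):

// a sketch of hinting the table size up-front & rehashing when the table gets too full
class ResizingHashTable extends SimpleHashTable {
  constructor(expectedSize = 8) {
    super(expectedSize); // size the table up-front so early inserts don't trigger a rehash
    this.count = 0;
  }

  set(key, value) {
    super.set(key, value);
    this.count++; // for simplicity this counts overwrites too
    // once we hold more than twice as many elements as table positions,
    // rebuild with a bigger table -- this walks every element once
    if (this.count > this.tableSize * 2) {
      this.rehash(this.tableSize * 2);
    }
  }

  rehash(newSize) {
    const oldBuckets = this.buckets;
    this.tableSize = newSize;
    this.buckets = Array.from({ length: newSize }, () => []);
    for (const bucket of oldBuckets) {
      for (const [key, value] of bucket) {
        super.set(key, value); // re-insert every element under the new table size
      }
    }
  }
}

// "I expect to store roughly 100 entries" -- fewer rehashes as the table fills up
const fruitColors = new ResizingHashTable(100);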


  1. Technically in Javascript this example is kind of bogus because Arrays can themselves be modeled as hash tables inside the interpreter. A Uint8Array would be more accurate, but it would also be more code; the wider point is that you’re using some structure that you can only access by an integer index. ↩︎

  2. Another way of saying this is that the for loop “abstracts away” the little man. The little man himself is just a more-fun analogy for what the CPU is actually doing when presented with instructions in its native machine code. There are enough layers of abstraction between what you write & the CPU that it’s safer to assume you don’t know what the little man is doing until you actually measure it using profilers or microbenchmarks. ↩︎

  3. This is an extremely rough estimate that is massively affected by the language chosen & the architecture being run on. You can check the first two tiers of the list against the raw performance characteristics of assembly language instructions. The later tiers are affected by something called Data Locality, which has a wonderful treatment in this chapter from Game Programming Patterns. ↩︎