Bootcamper Question #1: Hash efficiency
A friend forwarded me this question from a friend who is working thru a programming bootcamp. It seemed like a great opportunity to write a post, and I would welcome similar questions:
Question
“Can you clarify how classes (and really hashes) can more efficiently use keys when the initialization definition is established?”
Response
This is a great question and a fairly deep topic to be diving into if you’re early in your coursework. Which will make it extra fun to try answering! However, I would caution that it may not be important to know at this phase: a fair amount of modern programming consists of trusting that the little black boxes work the way they say they do & moving on to the larger task at hand. Performance and efficiency concerns are precisely where that approach can break down, & it’s very valuable to be willing to open up every box, so let’s attempt it. This whole topic is best addressed by a course or a good book on data structures & algorithms. Since I’ve never taken a formal course, I’ll share the intuitions that have helped me despite that.
Let’s start with what makes hashes (and classes represented as hashes) efficient to begin with, and then move on to why initializing them first might aid in this efficiency.
A Naive Dictionary
When you use a hash table, or any dictionary structure, you want to be able to store a value associated with a key, and to retrieve that value when all you have is the key. You could do that using arrays^{1}:
```javascript
// initialize our "dictionary"
const arrayDictionary = [];

// store some key,value pairs
arrayDictionary.push(['apples', 'red']);
arrayDictionary.push(['bananas', 'yellow']);
arrayDictionary.push(['oranges', 'orange']);

// get an arbitrary value from an array dictionary
function get(key) {
  for (let i = 0; i < arrayDictionary.length; i++) {
    if (arrayDictionary[i][0] === key) {
      return arrayDictionary[i][1];
    }
  }
}

const bananaColor = get('bananas');
```
Notice that when we get() certain elements, we have to walk thru the whole array to find the value associated with the last-inserted key. This means in the worst case we walk the whole length of the array. We would say that the get operation for an arrayDictionary has linear-time search, or O(n) complexity in Big-O notation. For collections of elements which can grow extremely large, we want to avoid any operation which in the worst case will be linear with respect to the size of the collection.
To make this & other performance concerns more tangible, I like to imagine the computer executing my instructions as a little man who has to run around doing the work of getting values and incrementing counters and comparing things. Even tho I can write a simple for loop which hides^{2} the little man, the little man is still doing a lot of running around on my behalf, depending on how big the collection is. I have a rough heuristic of how much work each operation is for the little man to do.^{3} At each level the amount of work increases by 10–100X that of the previous level:
1. Numerical addition, subtraction, comparison, and bit shift/rotation/manipulation operations
2. Multiplication & division
3. Accessing a value from memory which hasn’t recently been used
4. Allocating memory
Briefly, how Hash Tables work
So we’ve seen that stepping thru each element of the array to get a key’s value is inefficient; what can we do instead? What would be really nice is if something could cheaply tell us, given a key, what position in the array we should look at. This is essentially what a hash function does. It takes a data structure – the key – as input, and returns as output a numerical value which evenly distributes the input over a defined numerical range. Because hash functions are often implemented with a combination of multiplication and bit manipulation operations (tiers 1 & 2 above), they should be much faster than iterating thru the array of elements (tier 3).
A hash table implementation will start life by initializing a fixed-size table to be used to store values. Inside its set(key, value) implementation it will use a hash function to determine where in that table to store the combined key, value pairs. To account for the probability of a hash collision – when two different keys give the same numerical position in the table – it will store them in a list much like the one shown in our arrayDictionary example. Iterating over this list is, of course, less efficient than calculating the precise position in the table with a hash function, which is why it’s very important that the hash function evenly distribute the keys over the table – if it didn’t, the values would clump in one section of the table and you would have the same performance characteristics as our naive example. This risk of collision & clumping has to be balanced against our desire not to waste space with a table that’s too large. We also don’t want to have to rehash the table too frequently.
Initialization efficiency & rehashing
When the hash table is created, the size of the fixed-size table is chosen, and often users are able to hint what size it should be initialized with. This is useful because the size affects the probability of collision, and therefore the probability that for a given element you will be iterating over a list – the thing we don’t want to do. In order for hash tables to remain performant as more elements get added, some algorithms will resize the table when the number of elements added exceeds twice the size of the table. This ensures that you’re never iterating over much more than 2 elements per slot. However, this rehashing process will itself require iterating over all elements of the collection. So declaring upfront to the hash table implementation how many elements you expect to be storing helps minimize the frequency of having to perform this rehashing process. That, finally, is I hope a decent answer to your question: by declaring the properties and methods of your class, you are effectively declaring the expected size of that hash table. This assumes you are using Ruby or Python or some other language that models classes as hash tables – if you’re using the V8 interpreter in Javascript, the answer is more complex, because it only converts objects to a hash table as a last resort.

Technically in Javascript this example is kind of bogus, because Arrays can themselves be modeled as hash tables inside the interpreter. A Uint8Array would be more accurate, but it would also be more code – the wider point is that you’re using some structure that you can only access by an integer index. ↩︎

Another way of saying this is that the for loop “abstracts away” the little man. The little man himself is just a more fun analogy for what the CPU is actually doing when presented with instructions in its native machine code. There are enough layers of abstraction between what you write & the CPU that it’s safer to assume you don’t know what the little man is doing until you actually measure it using profilers or microbenchmarks. ↩︎
This is an extremely rough estimate that is massively affected by the language chosen & the architecture being run on. You can check the first two tiers of this table against the raw performance characteristics of assembly-language instructions. The later tiers are affected by something called data locality, which has a wonderful treatment in this chapter from Game Programming Patterns. ↩︎