Implement a Hash Table in Python

Python is all about the namespaces. Namespaces in Python are really just Python dicts with some special sauce. Python’s dict is implemented as a hash table.

In this assignment, you’ll add a simple implementation of a Hash Table to your data-structures repository.

Tasks

The key to a solid implementation of a hash table is a good hashing function. Read quickly over this list of hashing functions to get some sense of the variety available, and the issues involved in designing one. For the purposes of this assignment you will use two different hashing functions. The first should be a naive hashing function, such as the “Additive Hash” described in that reading. For the second, try one of the more complex hashing functions instead.

To simplify testing with your two different hashing functions, design your hash table class to accept different hashing functions at construction time. Add tests demonstrating the differences between them. You might consider testing the distribution of keys among the buckets in your hash table.

Your table should have the following properties:

  • It should be of fixed size. The number of slots in the table should be determined when the table is initialized, by passing an argument: foo = HashTable(1024)
  • It should handle hash collisions by using ‘buckets’ to contain any values that share a hash
  • It should accept only strings as keys. If a non-string is provided, the ‘set’ method should raise an appropriate Python exception.

Your table should implement the following methods:

  • get(key) - should return the value stored with the given key
  • set(key, val) - should store the given val using the given key
  • _hash(key) - should hash the key provided (note that this is an internal api)

Add tests to your data-structures repository to verify that your hash table works properly.

Add notes to the README explaining your approach and listing any resources and collaborations you used.

Note

A note on testing

Testing a hash table consists largely of inserting values with keys and then using the keys to retrieve the values. Since it is difficult to provide enough tests manually to ensure that your bucket system works properly, I propose the following strategy:

Every Unix-like operating system provides a list of dictionary words in a file at /usr/share/dict/words. This very long list (over 250,000 on my system) can provide a reasonable test bed for a hash table. For your tests, use this file. Insert all the words from the file into your hash table, with the key and the value being the same (in other words, my_table.set('pear', 'pear')). Then, you can test by using each word as a key, and verify that the result you get back is the same word.

Time complexity is an important concept to consider when working in software development. Each piece of code you write will take a certain number of steps to execute. Knowing not only where bottlenecks occur but also sections in which optimization will yield very little increase in performance will allow you, as a developer, to identify the areas of most importance.

  • Each method you are required to implement must also include the time complexity in Big O notation in your README.

Submitting Your Work

As usual, do your work on a branch. When you are finished (and all tests are passing) create a new Pull Request containing only the work on balancing. Submit the URL of your pull request. When that is done, you may merge the pull request.

Add documentation of your changes to your README file. Include any sources or collaborations.