In computer science, a hashtable is a data structure that stores data as a set of keys, each paired with a corresponding value held in an array. For example, a business name might get paired with its address. Typically, each position in the array is identified by a number referred to as a hash. The hash function is a set of instructions, or an algorithm, that maps each key to a hash, connecting the business name to its address, its phone number and its business category, for instance. The purpose of the hash function is to assign each key to a unique position in the array; this process is commonly referred to as hashing. Hash functions must be carefully designed for a hashtable to work properly.
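The sketch below illustrates this idea in Python. The table size, the character-sum hash method, and the business-name example are illustrative assumptions, not any particular library's implementation; the point is only that the hash function turns a key into a position in the array where the corresponding value is stored.

    TABLE_SIZE = 8                      # number of positions in the backing array
    table = [None] * TABLE_SIZE         # the array of corresponding values

    def hash_index(key: str) -> int:
        """Map a key to a position (its hash) in the array."""
        return sum(ord(ch) for ch in key) % TABLE_SIZE

    def put(key: str, value: str) -> None:
        """Store the key's value at the position given by its hash."""
        table[hash_index(key)] = (key, value)

    def get(key: str):
        """Recompute the hash to find the value for a key."""
        entry = table[hash_index(key)]
        return entry[1] if entry is not None and entry[0] == key else None

    put("Acme Hardware", "12 Main St.")
    print(get("Acme Hardware"))         # -> 12 Main St.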
The performance of a hashtable on a set of data depends on the efficiency of its hash function. A good hash function typically provides fast key lookups and an even distribution of mappings across the corresponding array. A hash collision occurs when two keys are assigned to the same position. When a collision occurs, the table usually has to probe for another open position, for example by applying the hash function again or scanning forward until a free position is found, and this results in longer hashing times. Even when the set of keys is fixed and each key is distinct, two keys can still be mapped to the same position. Even so, a well-designed hashtable relies on an effective hash function that maps nearly every key to a unique position in the array.
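Below is a hedged sketch of one common collision strategy, linear probing: when two keys hash to the same position, the table scans forward for the next free one. The table size, hash method, and the colliding keys "ab" and "ba" (which have the same character sum) are assumptions chosen only to force a collision.

    TABLE_SIZE = 8
    slots = [None] * TABLE_SIZE

    def hash_index(key: str) -> int:
        return sum(ord(ch) for ch in key) % TABLE_SIZE

    def put(key: str, value: str) -> None:
        """Insert at the hashed position, probing forward past collisions."""
        start = hash_index(key)
        for step in range(TABLE_SIZE):               # examine at most every slot
            j = (start + step) % TABLE_SIZE
            if slots[j] is None or slots[j][0] == key:
                slots[j] = (key, value)
                return
        raise RuntimeError("table is full")

    def get(key: str):
        """Follow the same probe sequence used when the key was inserted."""
        start = hash_index(key)
        for step in range(TABLE_SIZE):
            j = (start + step) % TABLE_SIZE
            if slots[j] is None:
                return None                          # empty slot: key not present
            if slots[j][0] == key:
                return slots[j][1]
        return None

    put("ab", "first")
    put("ba", "second")          # same character sum as "ab", so the keys collide
    print(get("ab"), get("ba"))  # both values are still retrievable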
An inefficient hash function may also produce clusters of mappings, that is, long runs of occupied positions in the array. If a hash function clusters the mappings for existing keys, it can take more time to look up their corresponding values. Clustering also slows down the hashing of future keys, since most tables resolve collisions by scanning for the next available position in the array. If a large cluster of values has already been assigned, finding an unassigned position for a new key typically takes much longer.
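This small, self-contained sketch shows the effect under the linear-probing scheme described above. The table size and the assumption that five keys all hash to position 3 are purely illustrative; each new key must probe past the entire cluster that the earlier keys formed.

    TABLE_SIZE = 16
    slots = [None] * TABLE_SIZE

    def insert_and_count(index: int, key: str) -> int:
        """Insert at the first free slot at or after `index`; return probes used."""
        for step in range(TABLE_SIZE):
            j = (index + step) % TABLE_SIZE
            if slots[j] is None:
                slots[j] = key
                return step + 1
        raise RuntimeError("table is full")

    # Five keys that all hash to position 3 form a growing cluster,
    # so each insertion needs one more probe than the last.
    for key in ["a", "b", "c", "d", "e"]:
        print(key, insert_and_count(3, key))   # prints 1, 2, 3, 4, 5 probes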
The load factor is another concept related to the efficiency of a hash function; it is the proportion of the corresponding array in a hashtable that is already occupied. It is defined by dividing the number of already assigned keys by the number of positions in the array. As the load factor increases, a good hash function will normally still keep the number of collisions and clusters roughly constant up to a certain point. This threshold is often used to judge how efficient a hash function is for a given number of keys and to decide when a new hash function, or a larger array, may be needed.
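A minimal sketch of the calculation follows. The 0.75 threshold is a common default (it is, for example, the default load factor of Java's HashMap) and is used here only as an illustrative assumption for when a table might grow its array and rehash its keys.

    def load_factor(num_entries: int, capacity: int) -> float:
        """Assigned keys divided by the number of positions in the array."""
        return num_entries / capacity

    capacity = 16
    num_entries = 12
    alpha = load_factor(num_entries, capacity)   # 12 / 16 = 0.75

    if alpha >= 0.75:                            # assumed resize threshold
        capacity *= 2                            # grow the array, then rehash every key
    print(alpha, capacity)                       # -> 0.75 32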
Many computer science researchers have striven to produce a perfect hash function, one that produces no collisions or clusters even as the load factor increases. In theory, the key to producing a perfect hashtable is to produce a perfect hash function. In general, a perfect hash function is expected to keep its performance, measured by the number of collisions and clusters, constant as the load factor increases. Even in the worst case, a perfect hash function would still allow for constant hashing without reaching the threshold at which a new hash function is needed.
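For a small, fixed set of keys, a collision-free hash function can sometimes be found by simple search, as in the hedged sketch below. The key set, table size, seed range, and polynomial hash family are all illustrative assumptions; real perfect-hashing schemes use more sophisticated constructions.

    def _hash(key: str, seed: int, size: int) -> int:
        """A simple seeded polynomial hash; the family searched over is an assumption."""
        h = 0
        for ch in key:
            h = (h * seed + ord(ch)) % size
        return h

    def make_perfect_hash(keys, size):
        """Search for a seed that gives every key in the fixed set a distinct position."""
        for seed in range(2, 10_000):
            if len({_hash(k, seed, size) for k in keys}) == len(keys):
                return lambda k, s=seed: _hash(k, s, size)
        raise ValueError("no collision-free seed found for this table size")

    KEYS = ["Acme Hardware", "Best Bakery", "City Florist", "Delta Diner"]
    h = make_perfect_hash(KEYS, 8)
    print({k: h(k) for k in KEYS})   # four distinct positions, no collisions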