A good hash function should also be fast, simple, and deterministic (i.e. it should always produce the same output for the same input).
In hashing, a simple mapping refers to the process of converting a key into an index in a hash table. This process is typically done using a hash function.
Simple mapping
Here’s a simple example of how you can map a string (key) to an index:
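Below is a minimal sketch; the table size of 10 is just an illustrative choice.

```c
#include <stdio.h>

#define TABLE_SIZE 10   /* illustrative size; any positive size works */

/* Sum the ASCII values of the characters and reduce the sum modulo TABLE_SIZE. */
unsigned int simple_hash(const char *key) {
    unsigned int sum = 0;
    for (int i = 0; key[i] != '\0'; i++) {
        sum += (unsigned char)key[i];
    }
    return sum % TABLE_SIZE;
}

int main(void) {
    printf("\"apple\"  -> index %u\n", simple_hash("apple"));
    printf("\"banana\" -> index %u\n", simple_hash("banana"));
    return 0;
}
```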
The simple_hash function calculates the sum of the ASCII values of each character in the string and returns the remainder when divided by TABLE_SIZE. This remainder is the index in the hash table.
Hash Function
A hash function is an algorithm that maps input data of arbitrary size to fixed-size values. A good hash function minimizes collisions and is fast, simple, and deterministic.
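For integer keys the mapping is even simpler. A sketch along the same lines (again with an illustrative TABLE_SIZE, assuming non-negative keys):

```c
#include <stdio.h>

#define TABLE_SIZE 10

/* Map a non-negative integer key to a table index with the modulo operation. */
int hash_function(int key) {
    return key % TABLE_SIZE;
}

int main(void) {
    printf("key 42 -> index %d\n", hash_function(42));   /* 42 % 10 = 2 */
    printf("key 17 -> index %d\n", hash_function(17));   /* 17 % 10 = 7 */
    return 0;
}
```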
The hash_function takes an integer key and returns an index by taking the key modulo TABLE_SIZE. This is a simple and commonly used hash function.
Hash Table
A hash table is a data structure that uses hash functions to store and retrieve values quickly. The key is hashed into an index, where the associated value is stored.
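One possible sketch of such a table, reusing the simple_hash idea from above; the exact HashEntry fields (a fixed-size key buffer and an is_occupied flag) are assumptions for illustration.

```c
#include <stdio.h>
#include <string.h>
#include <stdbool.h>

#define TABLE_SIZE 10

/* One slot of the table: a key, its value, and whether the slot is in use. */
typedef struct {
    char key[32];
    int  value;
    bool is_occupied;
} HashEntry;

HashEntry table[TABLE_SIZE];   /* the hash table: an array of HashEntry structures */

unsigned int simple_hash(const char *key) {
    unsigned int sum = 0;
    for (int i = 0; key[i] != '\0'; i++)
        sum += (unsigned char)key[i];
    return sum % TABLE_SIZE;
}

/* Compute the index with simple_hash and store the key/value pair there.
 * A collision here would simply overwrite the slot; handling collisions
 * properly is the subject of the next section. */
void insert(const char *key, int value) {
    unsigned int index = simple_hash(key);
    strncpy(table[index].key, key, sizeof(table[index].key) - 1);
    table[index].value = value;
    table[index].is_occupied = true;
}

int main(void) {
    insert("apple", 3);
    insert("banana", 7);
    unsigned int index = simple_hash("apple");
    printf("\"apple\" is stored at index %u with value %d\n", index, table[index].value);
    return 0;
}
```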
The hash table is an array of HashEntry structures.
The insert function uses the simple_hash function to compute the index for the given key and stores the key-value pair in the table.
Collision Handling in Hash Tables
In a hash table, collisions occur when two or more keys hash to the same index. Several methods can be used to handle collisions. These include:
Example
Separate Chaining: Uses linked lists at each table index to handle collisions.
Linear Probing: Resolves collisions by checking subsequent indices.
Quadratic Probing: Uses a quadratic increment to find the next open index.
Double Hashing: Uses a second hash function to compute the step size for probing.
Rehashing: Involves resizing the table and recalculating hash values when the table becomes too full.
Using Separate Chaining
Separate chaining resolves collisions by storing multiple elements at the same index, typically using a linked list. If multiple keys hash to the same index, they are stored in the list at that index.
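A sketch of separate chaining with integer keys; the HashNode fields shown here are illustrative.

```c
#include <stdio.h>
#include <stdlib.h>

#define TABLE_SIZE 10

/* One chained entry: an integer key/value pair and a link to the next node. */
typedef struct HashNode {
    int key;
    int value;
    struct HashNode *next;
} HashNode;

HashNode *table[TABLE_SIZE];   /* each slot is the head of a (possibly empty) list */

int hash_function(int key) {
    return key % TABLE_SIZE;
}

/* Insert by prepending the new node to the list at the hashed index,
 * so colliding keys simply share the same bucket. */
void insert(int key, int value) {
    int index = hash_function(key);
    HashNode *node = malloc(sizeof(HashNode));
    node->key = key;
    node->value = value;
    node->next = table[index];   /* new node goes to the front of the list */
    table[index] = node;
}

/* Walk the chain at the hashed index looking for the key. */
HashNode *search(int key) {
    for (HashNode *cur = table[hash_function(key)]; cur != NULL; cur = cur->next)
        if (cur->key == key)
            return cur;
    return NULL;
}

int main(void) {
    insert(5, 50);
    insert(15, 150);   /* 15 % 10 == 5, so this collides with key 5 */
    HashNode *found = search(15);
    if (found)
        printf("key 15 -> value %d\n", found->value);
    return 0;
}
```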
Each index in the hash table holds a linked list (HashNode).
When a collision occurs, new elements are added to the front of the list at the corresponding index.
Using Linear/Quadratic Probing
Linear probing and quadratic probing are methods of open addressing, where colliding elements are placed in the next available slot within the hash table.
Important
Linear Probing: If a collision occurs, the algorithm checks the next consecutive index until an empty slot is found.
Quadratic Probing: Similar to linear probing but with a quadratic increment (i.e., checking i+1, i+4, i+9, etc.).
When a collision occurs, linear probing moves to the next index in a circular manner (wrapping around with modulo TABLE_SIZE) until a free slot is found.
An is_occupied flag distinguishes filled slots from empty ones, so probing knows when to keep moving and when it can stop, as in the sketch below.
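A sketch of linear probing along these lines; the HashEntry fields are illustrative and the insert assumes the table is not full. Quadratic probing would differ only in how the next index is computed.

```c
#include <stdio.h>
#include <stdbool.h>

#define TABLE_SIZE 10

typedef struct {
    int  key;
    int  value;
    bool is_occupied;   /* distinguishes filled slots from empty ones */
} HashEntry;

HashEntry table[TABLE_SIZE];

int hash_function(int key) {
    return key % TABLE_SIZE;
}

/* Linear probing: if the home slot is taken, step to the next index
 * (wrapping around with modulo) until a free slot is found.
 * Assumes the table is not full. */
void insert(int key, int value) {
    int index = hash_function(key);
    while (table[index].is_occupied)
        index = (index + 1) % TABLE_SIZE;   /* quadratic probing would add 1, 4, 9, ... instead */
    table[index].key = key;
    table[index].value = value;
    table[index].is_occupied = true;
}

/* Probe the same sequence of slots until the key or an empty slot is found. */
int search(int key, int *value_out) {
    int index = hash_function(key);
    for (int probes = 0; probes < TABLE_SIZE && table[index].is_occupied; probes++) {
        if (table[index].key == key) {
            *value_out = table[index].value;
            return 1;
        }
        index = (index + 1) % TABLE_SIZE;
    }
    return 0;
}

int main(void) {
    insert(5, 50);
    insert(15, 150);   /* collides with key 5, placed at the next free index */
    int value;
    if (search(15, &value))
        printf("key 15 -> value %d\n", value);
    return 0;
}
```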
Using Double Hashing and Rehashing
Double hashing uses a second hash function to compute the step size for probing. It is used to resolve collisions and is often more effective than linear or quadratic probing.
Rehashing involves resizing the hash table and recalculating hash values when the load factor exceeds a certain threshold.
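A sketch of double hashing; the particular second hash function used here (7 - key % 7, a common textbook choice) is an assumption for illustration. In practice a prime table size is preferred so every probe sequence can reach every slot.

```c
#include <stdio.h>
#include <stdbool.h>

#define TABLE_SIZE 10

typedef struct {
    int  key;
    int  value;
    bool is_occupied;
} HashEntry;

HashEntry table[TABLE_SIZE];

int hash1(int key) {
    return key % TABLE_SIZE;
}

/* Second hash: must never return 0, so a collision always moves forward. */
int hash2(int key) {
    return 7 - (key % 7);
}

/* On a collision, step forward by step_size (taken from the second hash)
 * until an empty slot is found. Assumes the table is not full. */
void insert(int key, int value) {
    int index = hash1(key);
    int step_size = hash2(key);
    while (table[index].is_occupied)
        index = (index + step_size) % TABLE_SIZE;
    table[index].key = key;
    table[index].value = value;
    table[index].is_occupied = true;
}

int main(void) {
    insert(5, 50);
    insert(15, 150);   /* collides with key 5 at index 5, but probes with its own step size */
    insert(25, 250);
    for (int i = 0; i < TABLE_SIZE; i++)
        if (table[i].is_occupied)
            printf("index %d: key %d -> value %d\n", i, table[i].key, table[i].value);
    return 0;
}
```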
Because the step size comes from a second hash function, keys that collide at the same index follow different probe sequences, which reduces the clustering problems seen with linear and quadratic probing.
When a collision occurs, the algorithm moves step_size positions forward until an empty slot is found.
Tries
A Trie (or prefix tree) is a specialized tree-like data structure that stores a set of strings, where each node represents a single character of the string. Tries are particularly useful for tasks involving prefix matching, such as autocomplete or spell-checking.
Info
Nodes: Each node contains a character and a reference to its child nodes.
Root: The root of a trie is an empty node.
Leaf Nodes: A leaf node represents the end of a word, usually marked with a special flag; internal nodes can also carry this flag when one stored word is a prefix of another (e.g. "car" and "card").
Edge: The edge from one node to another represents a character in the string.
Efficiency: Tries provide search, insertion, and deletion in time proportional to the length of the word, independent of how many words are stored, which makes them efficient compared to scanning a plain list of strings.
Applications of Tries
Word Searching: Tries provide an efficient way to store and search for words in a dictionary.
Autocomplete: Tries are often used in applications like search engines and text editors for autocomplete functionality.
Spell Checking: Tries help verify whether a word exists in the dictionary by following its characters down from the root.
Insert, Search, and Delete Operations
Example
Insert: Adds a word to the trie by creating nodes for each character.
Search: Looks for a word in the trie by following nodes.
Delete: Removes a word from the trie by deleting nodes as necessary.
Display Words in Lexicographical Order: Traverse the trie in depth-first order and print words when reaching the end of a word.
Word Completion (Autocomplete): Use a given prefix to find matching words in the trie and suggest completions.
Insert Operation
To insert a word into a trie, we start from the root and follow the characters of the word. If a character doesn’t exist in the current node, we create a new node for it.
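A sketch of insertion for lowercase words; the TrieNode layout (an array of 26 children plus an is_end_of_word flag) is an assumption for illustration.

```c
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>

#define ALPHABET_SIZE 26

typedef struct TrieNode {
    struct TrieNode *children[ALPHABET_SIZE];   /* one slot per lowercase letter */
    bool is_end_of_word;                        /* marks that a word ends at this node */
} TrieNode;

TrieNode *create_node(void) {
    return calloc(1, sizeof(TrieNode));         /* all children NULL, flag false */
}

/* Walk the word character by character, creating missing nodes,
 * and mark the last node as the end of the word. */
void insert(TrieNode *root, const char *word) {
    TrieNode *node = root;
    for (int i = 0; word[i] != '\0'; i++) {
        int index = word[i] - 'a';
        if (node->children[index] == NULL)
            node->children[index] = create_node();
        node = node->children[index];
    }
    node->is_end_of_word = true;
}

int main(void) {
    TrieNode *root = create_node();
    insert(root, "car");
    insert(root, "card");   /* shares the path c-a-r with "car" */
    printf("inserted \"car\" and \"card\"\n");
    return 0;
}
```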
Search Operation
To search for a word, we start at the root and follow the characters of the word, checking each corresponding child node. If we reach the end of the word and the node is marked as the end of the word, the word exists in the trie.
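A sketch of the search, repeating the same TrieNode layout and insert as above so the example stays self-contained.

```c
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>

#define ALPHABET_SIZE 26

typedef struct TrieNode {
    struct TrieNode *children[ALPHABET_SIZE];
    bool is_end_of_word;
} TrieNode;

TrieNode *create_node(void) { return calloc(1, sizeof(TrieNode)); }

void insert(TrieNode *root, const char *word) {
    TrieNode *node = root;
    for (int i = 0; word[i] != '\0'; i++) {
        int index = word[i] - 'a';
        if (node->children[index] == NULL)
            node->children[index] = create_node();
        node = node->children[index];
    }
    node->is_end_of_word = true;
}

/* Follow the word's characters from the root; the word exists only if
 * every child is present and the final node is marked as a word ending. */
bool search(const TrieNode *root, const char *word) {
    const TrieNode *node = root;
    for (int i = 0; word[i] != '\0'; i++) {
        int index = word[i] - 'a';
        if (node->children[index] == NULL)
            return false;                /* path breaks: word is not stored */
        node = node->children[index];
    }
    return node->is_end_of_word;         /* "ca" is a prefix of "car" but not a word */
}

int main(void) {
    TrieNode *root = create_node();
    insert(root, "car");
    printf("car: %s\n", search(root, "car") ? "found" : "not found");
    printf("ca:  %s\n", search(root, "ca") ? "found" : "not found");
    return 0;
}
```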
Delete Operation
To delete a word, we first perform a search to check if the word exists. Then, we recursively remove the nodes that are no longer needed.
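A sketch of a recursive delete using the same TrieNode layout; a node is freed only once it neither marks the end of a word nor leads to one.

```c
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>

#define ALPHABET_SIZE 26

typedef struct TrieNode {
    struct TrieNode *children[ALPHABET_SIZE];
    bool is_end_of_word;
} TrieNode;

TrieNode *create_node(void) { return calloc(1, sizeof(TrieNode)); }

void insert(TrieNode *root, const char *word) {
    TrieNode *node = root;
    for (int i = 0; word[i] != '\0'; i++) {
        int index = word[i] - 'a';
        if (node->children[index] == NULL)
            node->children[index] = create_node();
        node = node->children[index];
    }
    node->is_end_of_word = true;
}

/* Returns true if this node is now useless (no children and not a word ending),
 * meaning the caller may free it and clear its pointer. */
bool delete_word(TrieNode *node, const char *word, int depth) {
    if (node == NULL)
        return false;
    if (word[depth] == '\0') {
        node->is_end_of_word = false;          /* unmark the end of the word */
    } else {
        int index = word[depth] - 'a';
        if (delete_word(node->children[index], word, depth + 1)) {
            free(node->children[index]);       /* child is no longer needed */
            node->children[index] = NULL;
        }
    }
    if (node->is_end_of_word)
        return false;                          /* still the end of another word */
    for (int i = 0; i < ALPHABET_SIZE; i++)
        if (node->children[i] != NULL)
            return false;                      /* still on the path of other words */
    return true;
}

int main(void) {
    TrieNode *root = create_node();
    insert(root, "car");
    insert(root, "card");
    delete_word(root, "card", 0);              /* removes the 'd' node, keeps "car" */
    printf("deleted \"card\", \"car\" is still stored\n");
    return 0;
}
```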
Application: Display Words in Trie in Lexicographical Order
To display words in lexicographical order, we perform a depth-first traversal of the trie, visiting children in alphabetical order, and print the accumulated word at every node marked as the end of a word.
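A sketch using the same TrieNode layout; visiting children from 'a' to 'z' during the traversal is what produces lexicographical order.

```c
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>

#define ALPHABET_SIZE 26
#define MAX_WORD_LEN 64

typedef struct TrieNode {
    struct TrieNode *children[ALPHABET_SIZE];
    bool is_end_of_word;
} TrieNode;

TrieNode *create_node(void) { return calloc(1, sizeof(TrieNode)); }

void insert(TrieNode *root, const char *word) {
    TrieNode *node = root;
    for (int i = 0; word[i] != '\0'; i++) {
        int index = word[i] - 'a';
        if (node->children[index] == NULL)
            node->children[index] = create_node();
        node = node->children[index];
    }
    node->is_end_of_word = true;
}

/* Depth-first traversal that visits children in alphabetical order,
 * printing the accumulated path whenever a node marks the end of a word. */
void display(const TrieNode *node, char *buffer, int depth) {
    if (node->is_end_of_word) {
        buffer[depth] = '\0';
        printf("%s\n", buffer);
    }
    for (int i = 0; i < ALPHABET_SIZE; i++) {
        if (node->children[i] != NULL) {
            buffer[depth] = (char)('a' + i);
            display(node->children[i], buffer, depth + 1);
        }
    }
}

int main(void) {
    TrieNode *root = create_node();
    insert(root, "cat");
    insert(root, "car");
    insert(root, "bat");
    char buffer[MAX_WORD_LEN];
    display(root, buffer, 0);   /* prints: bat, car, cat */
    return 0;
}
```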
Application: Word Completion (Autocomplete Feature)
To implement autocomplete, we find the node corresponding to the prefix and display all words that can be formed from that node.
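A sketch using the same TrieNode layout; the helper names (collect, autocomplete), the buffer size, and the sample words are illustrative.

```c
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>

#define ALPHABET_SIZE 26
#define MAX_WORD_LEN 64

typedef struct TrieNode {
    struct TrieNode *children[ALPHABET_SIZE];
    bool is_end_of_word;
} TrieNode;

TrieNode *create_node(void) { return calloc(1, sizeof(TrieNode)); }

void insert(TrieNode *root, const char *word) {
    TrieNode *node = root;
    for (int i = 0; word[i] != '\0'; i++) {
        int index = word[i] - 'a';
        if (node->children[index] == NULL)
            node->children[index] = create_node();
        node = node->children[index];
    }
    node->is_end_of_word = true;
}

/* Print every word stored in the subtree below node; buffer holds the
 * characters already matched (the prefix plus the path taken so far). */
void collect(const TrieNode *node, char *buffer, int depth) {
    if (node->is_end_of_word) {
        buffer[depth] = '\0';
        printf("%s\n", buffer);
    }
    for (int i = 0; i < ALPHABET_SIZE; i++) {
        if (node->children[i] != NULL) {
            buffer[depth] = (char)('a' + i);
            collect(node->children[i], buffer, depth + 1);
        }
    }
}

/* Walk down to the node for the prefix, then list all words in its subtree. */
void autocomplete(const TrieNode *root, const char *prefix) {
    const TrieNode *node = root;
    char buffer[MAX_WORD_LEN];
    int depth = 0;
    for (int i = 0; prefix[i] != '\0'; i++) {
        int index = prefix[i] - 'a';
        if (node->children[index] == NULL) {
            printf("no completions for \"%s\"\n", prefix);
            return;
        }
        buffer[depth++] = prefix[i];
        node = node->children[index];
    }
    collect(node, buffer, depth);   /* every word below this node shares the prefix */
}

int main(void) {
    TrieNode *root = create_node();
    insert(root, "car");
    insert(root, "card");
    insert(root, "care");
    insert(root, "dog");
    autocomplete(root, "car");   /* prints: car, card, care */
    return 0;
}
```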
Suffix Tree
A Suffix Tree for a string S of length n is a compressed trie that contains all the suffixes of the string. Each leaf node represents a suffix, and the internal nodes represent common prefixes among the suffixes. The tree structure allows for efficient substring searches and many other string processing operations.
NOTE
Definition: A suffix tree is a compressed trie that stores all suffixes of a given string.
Properties:
Stores all suffixes.
Efficient for substring searching, pattern matching, and finding repeated substrings.
Each leaf node corresponds to a suffix.
Construction:
Naive approach builds the tree by inserting each suffix one by one.
Ukkonen’s Algorithm provides an optimal construction in O(n) time.
Applications:
Substring search
Pattern matching
Finding repeated substrings
Text compression
Longest common substring problem
Properties of a Suffix Tree
Compact Representation: It stores all suffixes of a string in a compressed manner.
Leaf Nodes: Each leaf node corresponds to a suffix of the string, and it contains the starting index of the suffix in the original string.
Internal Nodes: Internal nodes represent common prefixes between two or more suffixes.
Edge Labels: Edges in the suffix tree are labeled with substrings of the original string.
Time Complexity: Construction of the suffix tree can be done in O(n) time (where n is the length of the string) with the help of advanced algorithms like Ukkonen’s algorithm.
Construction of a Suffix Tree
There are multiple algorithms for constructing a suffix tree, with the most efficient being Ukkonen’s Algorithm, which constructs the tree in O(n) time. However, constructing a suffix tree can be complex, so we’ll start with a basic, conceptual approach before delving into more optimized algorithms.
Add a Special Character: First, we append a special character (usually ’$’) to the string to ensure that no suffix is a prefix of another. This special character helps avoid ambiguity when constructing the tree.
Insert Suffixes: We then insert each suffix of the string into the tree. Each suffix is inserted by following the characters one by one, creating nodes as necessary.
Example: Construction of a Suffix Tree (Naive Approach)
Node Structure: Each node has a start_index to mark the beginning of a substring and an end_index that points to the end of the substring.
Building the Tree: The build_suffix_tree function inserts each suffix of the string into the tree by iterating through the string and creating new nodes as necessary.
Printing the Tree: The print_suffix_tree function recursively prints each suffix stored in the leaf nodes of the tree.
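A minimal sketch in this spirit; note that, unlike a true suffix tree, this simplified version just hangs every suffix off the root as its own leaf (no edges are shared or split), which is enough to reproduce the listing below.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_CHILDREN 64

/* Each node covers the substring text[start_index .. end_index]. */
typedef struct SuffixNode {
    int start_index;
    int end_index;
    struct SuffixNode *children[MAX_CHILDREN];
    int child_count;
} SuffixNode;

SuffixNode *new_node(int start_index, int end_index) {
    SuffixNode *node = calloc(1, sizeof(SuffixNode));
    node->start_index = start_index;
    node->end_index = end_index;
    return node;
}

/* Naive construction: hang every suffix text[i..n-1] off the root as its own leaf.
 * A real suffix tree would additionally split edges that share a common prefix. */
SuffixNode *build_suffix_tree(const char *text) {
    int n = (int)strlen(text);
    SuffixNode *root = new_node(-1, -1);            /* root covers the empty string */
    for (int i = 0; i + 1 < n; i++)                 /* one leaf per suffix (skipping the lone "$") */
        root->children[root->child_count++] = new_node(i, n - 1);
    return root;
}

/* Recursively print the substring stored at each leaf. */
void print_suffix_tree(const SuffixNode *node, const char *text) {
    if (node->child_count == 0) {                   /* leaf: print its suffix */
        for (int i = node->start_index; i <= node->end_index; i++)
            putchar(text[i]);
        putchar('\n');
        return;
    }
    for (int i = 0; i < node->child_count; i++)
        print_suffix_tree(node->children[i], text);
}

int main(void) {
    const char *text = "banana$";                   /* '$' appended so no suffix is a prefix of another */
    SuffixNode *root = build_suffix_tree(text);
    printf("Suffix Tree (in order of suffixes):\n");
    print_suffix_tree(root, text);
    return 0;
}
```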
Suffix Tree (in order of suffixes):
banana$
anana$
nana$
ana$
na$
a$
Optimized Suffix Tree Construction
While the above naive method builds a simple suffix tree, its O(n²) behaviour makes it impractical for large strings. The optimal construction algorithm is Ukkonen’s Algorithm, which builds the suffix tree in O(n) time by using the following ideas:
Implicit Suffix Tree: Instead of building the full suffix tree at once, Ukkonen builds an implicit suffix tree where only the necessary suffixes are stored at each step.
Active Point: The algorithm uses an active point (a combination of an active node, edge, and position) to traverse the tree efficiently.
Suffix Links: A suffix link points from the node representing a string of the form x·s (a character x followed by a string s) to the node representing s. Following these links lets the algorithm jump to the next suffix to extend instead of re-walking from the root, which speeds up construction.
Info
Implementing Ukkonen’s Algorithm is more involved and requires careful handling of active points and suffix links. The general steps of the algorithm include:
Iterating over each character in the string.
Maintaining an active point, which helps to minimize redundant work when inserting new suffixes.
Handling edge cases, such as when a new character introduces a new suffix or when an internal node is split.
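A complete Ukkonen implementation is beyond the scope of this section, but a sketch of the bookkeeping state it maintains can make the terms concrete. The field names below are illustrative, and this is not a working construction.

```c
#include <stddef.h>

#define SUFFIX_ALPHABET 256

/* An internal node or leaf; edges are labelled by index ranges into the text,
 * so splitting an edge never copies characters. */
typedef struct STNode {
    struct STNode *children[SUFFIX_ALPHABET];
    struct STNode *suffix_link;   /* from the node for "x + s" to the node for "s" */
    int start;                    /* edge label covers text[start .. *end] */
    int *end;                     /* shared end pointer lets all leaves grow in O(1) per phase */
} STNode;

/* The "active point": where the next character will be inserted. */
typedef struct {
    STNode *active_node;   /* node we are currently extending from */
    int     active_edge;   /* which outgoing edge, identified by its first character */
    int     active_length; /* how far along that edge we are */
} ActivePoint;

int main(void) {
    /* The algorithm processes the text one character per phase, updating the
     * active point and following suffix links instead of re-walking from the root. */
    ActivePoint ap = { NULL, -1, 0 };
    (void)ap;
    return 0;
}
```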