Trie¶
*
/ \
/ \
i t
/| | \
/ | \
/ | | o
n t e \
| | ____/|\____ $
| | / | \
t $ a d n
/ \ | | |
o $ $ $ $
|
$
- adapted from an NYU course: http://cs.nyu.edu/~kshitij/courses/ds12/index_files/notes-trie.txt
In the illustration, each node is marked by a character. The $ marks where a word ends at a specific node.
In this trie, the word ‘int’ and ‘into’ are captured. The word ‘in’ is not.
Der weg ist das Ziel: “Life is the journey, not the destination. ~Ralph Waldo Emerson
Motivation¶
- Store data effciently without collisions (hash)
- Sorted ordering of the entries by key (related to the Radix search): https://en.wikipedia.org/wiki/Radix_tree
- Autocomplete
- Matching Algorithms (Spell Check & Hyphenation)
- minus (Speed dependent on storage - RAM faster than hash, HD perhaps slower)
- minus (storing in RAM will eventually crush your system)
- Computational Linguists dig this data structure.
Definition¶
A trie (from information reTRIEval) is a tree data structure that is used to store an associative array where we care about the path to the leaf, rather than any individual content of a node.
All the descendents of a particular node have a common prefix. This prefix is the path that lead to the particular node.
An associative array has a key and a value.
Wikipedia https://en.wikipedia.org/wiki/Trie
The way to read (or think) of the tree is as follows. For each node (denoted by * or $) in the tree we can potentially have 26 children (unlike a binary tree, in which we have at most two children) – one corresponding to each letter (which is written next to the “line” in the figure). The string represented by a node is obtained by following the path from root to that node and writing down the characters along the path.
*
/ \
i / \ t
/ \
* *
/| |\
n / | t e | \ o
/ $ | $
* *
| ____/|\____
t| a/ |d \n
$ $ $ $
o|
$
http://cs.nyu.edu/~kshitij/courses/ds12/index_files/notes-trie.txt
Attributes¶
Key
At each level a trie can have (in US English) 27 characters.
What non-letter character might I have chosen to keep & why?
In fact there’s one more character that we need to account for....(just a sec).
Value
We will not be storing any values for this class.
In the past, I have stored weights or counts as values.
End Of Word Token
You will need to choose an end of word token.
After C’s example, I chose 0 for my implementation.
$ indicates the end of a line in RegEx. Perhaps the lecturer chose it for that reason.
This is character 28.
Operations¶
Insert
For each character in the token, we add a node (if it does not already exist) and navigate to the next level of the trie.
. <- level 0 (root)
|
a <- level 1
|
m <- level 2
|
y <- level 3
|
\0 <- level 4
. <- level 0 (root)
|
a <- level 1
/ \
m n <- level 2
| |
y n <- level 3
| |
\0 \0 <- level 4
Traverse/Search
We will only utilize a depth first search.
- Search current level for first character in key
- If none: return false
- If the matched character is :raw-latex:`\0`: return true
- Move to subtrie that matched this character
- Advance to next character in key
- Go to step 1
Get all words
AutoComplete
Starting with the first letter character, print a list of the top four words.
With each successive character, print out the next top four words.
Google implements this algorithm by the top querys (weighting your queries heavier than the world’s).
Delete
Most simply remove the root of the sub-tree with the unneeded path.
There are more complex methods to clean up your nodes.
We will not implement delete for this class.
For comp. ling, I was reading in corpora, so didn’t really have a need for delete.