A TEXT POST

Suffix Trees: Part I : Intro, Tries

Intro

A Suffix tree is a data structure which is so ingenious in many ways. Its primary application is in string algorithms. I will primarily talk about its application in substring searching. Before going into the subject, I will give an intro to the basic problem at hand.

The Problem

Given a string x, search for the existence of a substring y. A substring is defined as a contiguous sequence of characters in the search string that matches a given string. 

Naive Solution

The obvious solution to the given problem would be to start at every character of the search string and try to match every character of the input string through a sequential scan. The algorithmic complexity for this would be O(|x|*|y|). [Why?] Its rather clear that this is not the most efficient of algorithms if my search list of strings is large. 

Tries

Before we go into suffix trees, we need to talk about a data structure which forms the basis for suffix trees, a Trie. Lets first see the problem a Trie tries to solve.

The Problem

Given a set of n strings, build and maintain a data structure to efficiently find, insert and delete the string associated with a query string q. 

What is a trie?

A trie a tree where every path from root to leaf defines one of the strings of the dictionary, where each edge along the path defines the character of a string in increasing order. The root represents a null string. Any two words in the dictionary branch off at the first distinguishing character i.e they share a common prefix. 

Here is an example of a trie of the set of string ‘a’, ‘an’, ‘the’, ‘there’ and ‘us’

Trie

As you can see in the above figure, “an” and “and” share the common prefix “an” and branch off from there. The black nodes mark the end of a string.

As you can see here, the time taken to search for a string in the dictionary can be done in O(|q|) comparisons. Insertions and deletions can also be done in O(|q|) time. 

Drawbacks of Tries

Implementing tries is not straightforward because of the complexity involved in it. Also, there is a one time overhead of constructing the trie which takes O(|n|*k), where k is the average length of the string. 

In the ensuing post, I will talk more in detail about Suffix Trees and their applications. 

References

[1] The Algorithm design manual, Steven Skiena

  1. varrunr posted this