A TEXT POST

Suffix Trees: Part II Substring search

OK, so lets recall the substring search problem that we discussed in the previous post. We are now going to see an awesome application of Tries to solve this problem.

What is a suffix Tree?

A suffix tree is a Trie of all possible suffixes of a string.

Suffix Tree of the string "test"



.

E.g. For the string “test”, the suffix tree is a trie constructed from the set of strings(suffixes) “test”,”est”,”st” and “t”.

How this solve our substring problem?

There is an interesting property to note about substrings. Every substring of a given string, is a prefix of some suffix of the string. Hence, searching for a string q in the suffix tree will take only O(|q|) time(since its a Trie). Isn’t that mind blowing? 

Implementation

My implementation of suffix trees can be found at 

https://github.com/varrunr/algo-lib/tree/master/src/stree

Drawbacks

The drawbacks of Tries apply here as well.

1.The implementation complexity is higher

2. The cost of construction of a suffix tree is of the order O(n^2) in terms of time and space. Of course the space complexity can be reduced by a few tricks, like collapsing a path containing 1-degree nodes into references to the original string. There also do exist sophisticated algorithms that can build the entire tree in linear time.(Weiner’s, Ukkonen’s, McCreight’s). I will not delve into these, they are a topic for another day. 

PS: My next couple of posts will cover some extra applications of suffix trees like finding the longest Common subsequence and longest palindrome. 

A TEXT POST

Suffix Trees: Part I : Intro, Tries

Intro

A Suffix tree is a data structure which is so ingenious in many ways. Its primary application is in string algorithms. I will primarily talk about its application in substring searching. Before going into the subject, I will give an intro to the basic problem at hand.

The Problem

Given a string x, search for the existence of a substring y. A substring is defined as a contiguous sequence of characters in the search string that matches a given string. 

Naive Solution

The obvious solution to the given problem would be to start at every character of the search string and try to match every character of the input string through a sequential scan. The algorithmic complexity for this would be O(|x|*|y|). [Why?] Its rather clear that this is not the most efficient of algorithms if my search list of strings is large. 

Tries

Before we go into suffix trees, we need to talk about a data structure which forms the basis for suffix trees, a Trie. Lets first see the problem a Trie tries to solve.

The Problem

Given a set of n strings, build and maintain a data structure to efficiently find, insert and delete the string associated with a query string q. 

What is a trie?

A trie a tree where every path from root to leaf defines one of the strings of the dictionary, where each edge along the path defines the character of a string in increasing order. The root represents a null string. Any two words in the dictionary branch off at the first distinguishing character i.e they share a common prefix. 

Here is an example of a trie of the set of string ‘a’, ‘an’, ‘the’, ‘there’ and ‘us’

Trie

As you can see in the above figure, “an” and “and” share the common prefix “an” and branch off from there. The black nodes mark the end of a string.

As you can see here, the time taken to search for a string in the dictionary can be done in O(|q|) comparisons. Insertions and deletions can also be done in O(|q|) time. 

Drawbacks of Tries

Implementing tries is not straightforward because of the complexity involved in it. Also, there is a one time overhead of constructing the trie which takes O(|n|*k), where k is the average length of the string. 

In the ensuing post, I will talk more in detail about Suffix Trees and their applications. 

References

[1] The Algorithm design manual, Steven Skiena

A TEXT POST

Intro

Hi everyone,

I am pursuing my Masters in Computer Science a Stony Brook University, NY. Following today, I plan to use this as a notepad for all the cool stuff I am learning in my course work and otherwise. I hope it may be of use to people. Please feel free to suggest any mistakes in the content, or that I add in some stuff.