Shortest superstring homework
Given a set of words, find the shortest string that contains every word as a substring.
Run directly:
scala main.scala input/words.txt
(Also try words2.txt to words5.txt.)
Or compile:
scalac main.scala
and run:
scala Main input/words.txt
As explained by e.g. Mucha (2013), the shortest superstring problem is NP-hard. As noted by Alanko and Norri (2017), the most recent theoretical advance is by Paluch (2014), whose algorithm produces a superstring at most 2 11/30 times the length of the true shortest superstring. Earlier, Mucha (2013) had presented an algorithm with an approximation ratio of 2 11/23. These papers don't state the time complexity of their algorithms, but an earlier work by Kaplan et al. (2005) achieved an approximation ratio of 2 1/2 with runtime complexity worse than O(n^4), so I would assume the later ones didn't markedly improve on that; otherwise I think they would have mentioned it.
Assuming the word length is bounded, the number of input words and the total length of the input in characters are proportional to each other, so in the runtime analysis we can just talk about n without specifying which of the two we mean; both give the same result. (This should be true for words from human languages, but not necessarily when the shortest superstring problem is applied in DNA sequencing.)
The common approach to the shortest superstring problem is to first remove duplicate words and words that are proper substrings of some other word. Then calculate the pairwise overlaps: how many characters the end (suffix) of a word has in common with the prefix (beginning) of another word. Now we can see this as a directed graph: the words are the nodes, and the edge from word a to word b has a weight equal to how much the end of word a overlaps with the beginning of word b. The problem becomes searching for a path that visits every node and maximizes the sum of edge weights. This path then gives the order in which to join the words to build the superstring.
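For concreteness, the suffix-prefix overlap of two words can be computed like this (a minimal sketch; the function name is mine, not from my program):

```scala
// Length of the longest suffix of `a` that is also a prefix of `b`.
def overlap(a: String, b: String): Int = {
  val max = math.min(a.length, b.length)
  // Try the longest candidate length first and stop at the first match.
  (max to 1 by -1).find(k => a.endsWith(b.take(k))).getOrElse(0)
}
```

For example, overlap("over", "error") is 2 (they share "er") and overlap("error", "over") is 0.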
A path that visits every node of a graph exactly once is called a Hamiltonian path. So we have transformed the shortest superstring problem into the maximum asymmetric Traveling Salesman Problem (asymmetric because the edge weights depend on direction). This doesn't by itself do us much good, since MAX-ATSP is equally NP-hard.
I wrote my code in Scala.
However, the overlap graph leads to a greedy algorithm: we simply pick edges from the graph, starting with the largest weight (longest word overlap). The only exception: if an edge we are about to pick would close the path into a cycle before all the nodes (words) have been visited, we skip that edge.
My program follows these ideas, as explained by Martinová (2014).
- Remove duplicate words and words that are proper substrings of some other word.
- Calculate the directed pairwise overlap for each ordered word pair. For example, the directed pair (over, error) has an overlap of 2 characters ("er"), but in the other direction (error, over) the overlap is 0. For n words, allocate an n-by-n integer matrix so that element (i,j) gives the weight of the edge from word i to word j.
- Store the triplets (weight, j, i) into a list and sort this list from the largest weight to the smallest. (We could also store them in a heap-based priority queue and save a little time: we're not going to use all of the edges, so sorting the whole list is slightly wasteful.)
- Allocate another n-by-n matrix for marking edge statuses. Initially mark all edges as free.
- Start taking edges from the list. If the edge (i,j) is marked as free in the matrix, mark it as part of the path. Then mark all other edges starting from node i as not-free, and all other edges ending in node j as not-free.
Also, follow the existing path backward from node i as far as the path goes, and call the first node a. (If i is the first node in this fragment of the path, then a is i.) Likewise, follow the path forward from node j as far as it goes, and call the last node b. Then mark the edge (b,a) as cycle-forming: if that edge were later added to the path, it would close the path into a cycle before the path has visited all the nodes.
Stop when we have added n-1 edges to the path, one less than the number of words (counted after duplicates and substrings were removed).
- Finally, start from any path node in the path matrix, follow the path both backward and forward, and construct the whole path of n nodes and n-1 edges.
Start from the first node in the path, take its word, and join it with the next word on the path: I look up the overlap from the overlap matrix and remove that many characters from the beginning of the second word. (Thinking about cache locality, it might be slightly faster to just recalculate the overlap, at least with human-language words, where the overlap is never large.) Repeating this along the path builds the greedy approximate superstring.
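The whole pipeline above can be sketched in Scala roughly as follows. This is a simplified version of the idea, not my actual program: instead of marking cycle-forming edges in the status matrix, it detects cycles by walking the current path fragment, but the resulting superstring is the same.

```scala
// Greedy superstring construction. Assumes duplicates and substring
// words have already been removed from the input.
def greedySuperstring(words: Vector[String]): String =
  if (words.isEmpty) ""
  else {
    val n = words.length

    def overlap(a: String, b: String): Int = {
      val max = math.min(a.length, b.length)
      (max to 1 by -1).find(k => a.endsWith(b.take(k))).getOrElse(0)
    }

    // Element (i)(j) is the edge weight from word i to word j.
    val w = Array.tabulate(n, n)((i, j) =>
      if (i == j) -1 else overlap(words(i), words(j)))

    // All candidate edges, heaviest first (sortBy is stable).
    val edges = (for { i <- 0 until n; j <- 0 until n if i != j }
      yield (w(i)(j), i, j)).sortBy(t => -t._1)

    val next = Array.fill(n)(-1) // successor on the path, -1 if none
    val prev = Array.fill(n)(-1) // predecessor on the path, -1 if none
    var added = 0
    val it = edges.iterator
    while (added < n - 1 && it.hasNext) {
      val (_, i, j) = it.next()
      if (next(i) == -1 && prev(j) == -1) {
        // Walk forward from j; if that fragment ends at i, the edge
        // (i,j) would close a cycle, so skip it.
        var end = j
        while (next(end) != -1) end = next(end)
        if (end != i) {
          next(i) = j; prev(j) = i; added += 1
        }
      }
    }

    // Find the start of the path and join the words, dropping overlaps.
    val start = (0 until n).find(k => prev(k) == -1).get
    val sb = new StringBuilder(words(start))
    var cur = start
    while (next(cur) != -1) {
      val nxt = next(cur)
      sb.append(words(nxt).drop(w(cur)(nxt)))
      cur = nxt
    }
    sb.toString
  }
```

For example, greedySuperstring(Vector("catg", "atgc", "gcat")) gives "gcatgc", which contains all three words.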
Because I fill in the whole n-by-n matrix (two of them, in fact), my program's runtime complexity is O(n^2), with n the number of input words. The way I implemented the initial step of removing duplicate words and words that are substrings of other words, that step also runs in O(n^2).
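The preprocessing step could be sketched like this (not my exact code; the name is mine). The exists scan over all pairs is what makes it O(n^2):

```scala
// Drop duplicate words, then drop words that are proper substrings
// of some other word. O(n^2) pairwise containment checks.
def preprocess(words: Seq[String]): Vector[String] = {
  val distinct = words.distinct.toVector
  distinct.filter(w => !distinct.exists(other => other != w && other.contains(w)))
}
```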
In a way, the problem is already solved after step 3. (There wouldn't even be a need to sort the list; just find an element with the largest weight.) We could join the two words with the longest overlap, remove the two original words from the data, add the joined word, and start over. But this would result in an O(n^3) algorithm.
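That O(n^3) variant would look roughly like this (a sketch for comparison, not part of my program):

```scala
// The "merge and start over" variant: find the pair with the largest
// overlap, join it, put the joined word back, repeat. Roughly O(n^3):
// n rounds, each scanning O(n^2) pairs (ignoring string lengths).
def mergeRepeat(input: Vector[String]): String = {
  def overlap(a: String, b: String): Int = {
    val max = math.min(a.length, b.length)
    (max to 1 by -1).find(k => a.endsWith(b.take(k))).getOrElse(0)
  }
  var ws = input
  while (ws.length > 1) {
    val pairs = for { i <- ws.indices; j <- ws.indices if i != j }
      yield (overlap(ws(i), ws(j)), i, j)
    val (ov, i, j) = pairs.maxBy(t => t._1)
    val merged = ws(i) + ws(j).drop(ov)
    ws = ws.zipWithIndex.collect { case (word, k) if k != i && k != j => word } :+ merged
  }
  ws.headOption.getOrElse("")
}
```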
Alanko and Norri (2017) give a faster greedy algorithm that runs in O(n log s) time, where n is the number of words and s is the size of the alphabet the words consist of. They also have a C++ implementation in their GitHub repo.
I tested their code: it processes the 500 words in input/words5.txt in about 0.06 seconds, whereas my compiled Scala program takes about 1.3 seconds, of which maybe around 0.5 seconds is the startup time of the Java runtime.
I have to say, both the original algorithm by Ukkonen (1990) and the improvements by Alanko and Norri (2017) are complicated, and beyond what I can implement in a couple of days and evenings. But if one really wanted fast code, this would obviously be the way to go.
Both my code and that of Alanko and Norri produce a superstring of length 2361 from the longest example input, input/words5.txt.
I also tried two stochastic versions of my program:
- When there are several candidate edges with the same overlap value, choose one of them at random.
However, over several runs this always produced a superstring of the same length with the provided test inputs. It looks like these test cases of English words contain only a small number of good (large) word overlaps, and the greedy algorithm is always able to make use of them all. The order in which 1-character and 0-character overlap words are then joined doesn't matter; the resulting length is always the same.
- Instead of always taking the edge with the largest word overlap, take a number S of edges from the ordered list, shuffle this sample randomly, and start taking edges from the shuffled sample. This means we are not always joining words greedily by the largest possible overlap; sometimes slightly smaller overlaps get chosen first.
I needed to increase the sample size S to over 100 before this had any effect: smaller sample sizes always produced the same superstring length as the greedy algorithm, and at around sample size 150 the results only got longer, not better.
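One way to implement this sampling is to shuffle fixed windows of S consecutive edges from the weight-sorted list. This is a simplification of what I described (fixed windows rather than a moving sample), with the triplets in (weight, i, j) form:

```scala
import scala.util.Random

// Shuffle each window of `s` consecutive edges from the weight-sorted
// list, so edges are mostly, but not strictly, consumed in descending
// overlap order.
def shuffledEdgeOrder(sorted: List[(Int, Int, Int)], s: Int, rng: Random): List[(Int, Int, Int)] =
  sorted.grouped(s).flatMap(window => rng.shuffle(window)).toList
```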
I pretty much believe that for the given test inputs, the greedy algorithm actually is able to find the true shortest superstring.
As summarized by Alanko and Norri, the greedy algorithm has been proven to produce a superstring at most 3 1/2 times the length of the true shortest superstring. There is also a conjecture that the greedy algorithm never produces a superstring more than 2 times that length, and nobody has ever observed it to perform worse than that.
The complicated research algorithms mentioned above have been proven to have approximation ratios 2 1/2, 2 11/30 and 2 11/23. However, I'd like to point out that the main design motivation in constructing those algorithms has probably been that the designers can prove these properties for these algorithms. So they have a provable upper bound for the resulting superstring. But they are not necessarily designed to even try to search for the shortest possible superstring.
Also, insofar as their runtime complexity is O(n^4) or worse, they would probably be impractically slow already for the longest example input here, which is 500 words (449 after removing duplicates and substrings).
Also, Martinová (2014) notes (in Conclusions) that she tried another algorithm, based on finding cycle covers on a prefix graph. This algorithm has a proven worst-case approximation ratio of 3, whereas the greedy algorithm can be 3 1/2 at worst. Yet she observed that the cycle cover algorithm performed worse than the greedy algorithm, except when the input data came from the binary alphabet. So an algorithm with a better proven bound doesn't necessarily produce better results in practice; only its worst-case bound is provably better.
Alanko J., Norri T. (2017): Greedy Shortest Common Superstring Approximation in Compact Space. In: Fici G., Sciortino M., Venturini R. (eds) String Processing and Information Retrieval. SPIRE 2017. Lecture Notes in Computer Science, vol 10508.
https://arxiv.org/abs/1707.07727
https://github.com/tsnorri/compact-superstring
Kaplan H., Lewenstein M., Shafrir N. and Sviridenko M. (2005): Approximation algorithms for asymmetric TSP by decomposing directed regular multigraphs. Journal of the ACM 52(4): 602-626.
http://www.cs.tau.ac.il/~haimk/papers/journal6.pdf
Martinová K. (2014): Algorithms for shortest superstring problems. Diploma Thesis, Faculty of Informatics, Masaryk University, Czech Republic.
https://is.muni.cz/th/nrg7j/diploma_thesis_final.pdf
Mucha M. (2013): Lyndon words and short superstrings. In: SODA '13 Proceedings of the twenty-fourth annual ACM-SIAM symposium on Discrete algorithms. p. 958-972.
https://arxiv.org/abs/1205.6787
Paluch K. (2014): Better Approximation Algorithms for Maximum Asymmetric Traveling Salesman and Shortest Superstring.
https://arxiv.org/abs/1401.3670
Ukkonen E. (1990): A linear-time algorithm for finding approximate shortest common superstrings. Algorithmica 5(1-4): 313–323.
(For a pdf, see theory/ in this repo.)