Not sure why the Myers bit-vector algorithm from 1999 is not included in these comparisons. Since it operates on whole 32-bit or 64-bit registers at a time, it is really O(NM/w), which for patterns that fit in a machine word is in effect just O(N), and it gives you the whole edit distance.
Heikki Hyyrö has a few papers and a PhD thesis from circa 2001..2005 that walk you through this in pseudo-code you could port to any low-level language, and that also generalize the edit distance. Multi-precision arithmetic further lets you extend it to words over the alphabet in question that are longer than one machine register.
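Roughly, Hyyrö's formulation of Myers' recurrence looks like this in Python, as a sketch only (Python's big ints stand in for machine words, and the function name is just illustrative):

    # Sketch: Levenshtein distance via the Myers/Hyyro bit-parallel recurrence.
    # On real hardware each text character costs O(ceil(m/w)) word operations,
    # hence O(NM/w) overall.
    def myers_distance(a: str, b: str) -> int:
        m = len(a)
        if m == 0:
            return len(b)
        full = (1 << m) - 1            # keep only the low m bits
        last = 1 << (m - 1)            # bit for the final row of each DP column
        peq = {}                       # peq[c]: bit i set iff a[i] == c
        for i, c in enumerate(a):
            peq[c] = peq.get(c, 0) | (1 << i)
        pv, mv, score = full, 0, m     # +1 / -1 vertical deltas, current distance
        for c in b:                    # one DP column per character of b
            eq = peq.get(c, 0)
            xv = eq | mv
            xh = ((((eq & pv) + pv) & full) ^ pv) | eq
            ph = mv | (~(xh | pv) & full)
            mh = pv & xh
            if ph & last:
                score += 1
            elif mh & last:
                score -= 1
            ph = ((ph << 1) | 1) & full
            mh = (mh << 1) & full
            pv = mh | (~(xv | ph) & full)
            mv = ph & xv
        return score

    assert myers_distance("kitten", "sitting") == 3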
It seems to at least not have escaped the attention of biologists/DNA analyzers in bioinformatics [1]. There are also "banded" algorithms that can exit early, even with the classic dynamic programming edit distance, if you just limit distances to < k, after Ukkonen 1985 (no, NOT the suffix tree algorithm).

[1] https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5408825/
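A minimal sketch of that cutoff idea, assuming you only care whether the distance is at most some bound k (the function name and layout are illustrative, not taken from Ukkonen's paper):

    # Sketch: classic DP restricted to the diagonal band |i - j| <= k.
    # Cells outside the band always exceed k, so they are treated as k+1,
    # and the loop bails out once an entire row of the band exceeds k.
    def bounded_edit_distance(a: str, b: str, k: int):
        if abs(len(a) - len(b)) > k:   # lengths alone already rule it out
            return None
        INF = k + 1
        prev = [min(j, INF) for j in range(len(b) + 1)]
        for i in range(1, len(a) + 1):
            lo, hi = max(1, i - k), min(len(b), i + k)
            cur = [INF] * (len(b) + 1)
            cur[0] = min(i, INF)
            for j in range(lo, hi + 1):
                cost = 0 if a[i - 1] == b[j - 1] else 1
                cur[j] = min(prev[j - 1] + cost,  # match / substitution
                             prev[j] + 1,         # deletion
                             cur[j - 1] + 1)      # insertion
            if min(cur) > k:           # every cell in the band is already > k
                return None
            prev = cur
        return prev[-1] if prev[-1] <= k else None

This is O(kN) per pair rather than O(NM), and it gives up as soon as the bound is provably exceeded.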
> It seems to at least not have escaped the attention of biologists/DNA analyzers in bioinformatics
TBF, the sequences they use are probably large enough that I wouldn't be surprised if they redeemed some (formerly) galactic algorithms for this task. It's really no wonder that they've scoured the literature for every optimization available.
Agreed, proteins as "words" encoded in a 2-bit DNA alphabet surely do get much bigger than even the words of human languages like German. :-)
That said, "galactic algorithms" suggests impracticality/inefficiency "in the small" which is not really true of the Myers bit-vector approach. So, I don't think "only faster for big problems" is a great excuse for the wider world. Myers bit-vector is faster all the time (compared to many other things in far more frequent use in the wild). It's also not that involved. Here is a link to an implementation in about 34 lines of Nim code: https://github.com/c-blake/suggest/blob/master/suggest.nim#L...
Fair enough. IMO, Hyyrö's description is easier to follow than Myers' own, which may be the real explanation for its relative obscurity.
Re: concision - an incomplete summary of why I like Nim is that it is more concise than Python, as fast as C, and safe, both in its metaprogramming and otherwise.
(2010), in case you couldn’t tell from the tiny, low-contrast text and use of cleartext HTTP! (12px, #666 on white, and even lower contrast for links.)
One of the most efficient ways (that I know of) is to create a trie [0] from your corpus, then incorporate the Levenshtein algorithm into the trie traversal, giving you something like a Levenshtein automaton over a trie. Because the trie shares prefixes across the whole corpus and lets you abandon a subtree as soon as the running distance exceeds your threshold, a lookup touches only a small fraction of the nodes rather than every word (a rough sketch follows below).
A less efficient way is to traverse the trie and calculate the Levenshtein distance for each word you encounter.
An even less efficient solution might be to sort all strings in your corpus by length and calculate distance for all of them.
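A rough sketch of the trie approach above in Python; it uses the simple one-DP-row-per-node walk with subtree pruning rather than a precompiled Levenshtein automaton, and all names are illustrative:

    # Sketch: fuzzy lookup over a trie. Each node visit computes one row of
    # the Levenshtein DP table against the query; a whole subtree is pruned
    # as soon as every entry in its row exceeds max_dist.
    class TrieNode:
        __slots__ = ("children", "word")
        def __init__(self):
            self.children = {}   # char -> TrieNode
            self.word = None     # full word stored at terminal nodes

    def insert(root, word):
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.word = word

    def fuzzy_search(root, query, max_dist):
        results = []
        first_row = list(range(len(query) + 1))   # DP row for the empty prefix

        def walk(node, ch, prev_row):
            row = [prev_row[0] + 1]
            for j in range(1, len(query) + 1):
                row.append(min(row[j - 1] + 1,                           # insertion
                               prev_row[j] + 1,                          # deletion
                               prev_row[j - 1] + (query[j - 1] != ch)))  # substitution
            if node.word is not None and row[-1] <= max_dist:
                results.append((node.word, row[-1]))
            if min(row) <= max_dist:               # otherwise no descendant can match
                for next_ch, child in node.children.items():
                    walk(child, next_ch, row)

        for ch, child in root.children.items():
            walk(child, ch, first_row)
        return results

    root = TrieNode()
    for w in ("hello", "help", "held", "shell"):
        insert(root, w)
    print(fuzzy_search(root, "helo", 1))   # hits within distance 1 of "helo"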