Damn Cool Algorithms: Levenshtein Automata

cb321 · on March 7, 2022

Not sure why the Myers bit vector algorithm from 1999 is not included in these comparisons. Since it can operate on whole 32-bit or 64-bit registers at a time, it is really O(NM/w) which is in effect just O(N) and just gives you the whole edit distance.

Heikki Hyyrö has a few papers and a PhD thesis from circa 2001..2005 that walk you through this in pseudo-code that you could port to any low level language and also generalize the edit distance. Multi-precision arithmetic also allows extending to languages with big words over the alphabet in question.

It seems to at least not have escaped the attention of biologists/DNA analyzers in bioinformatics [1]. There are also "banded" algorithms that can exit early even with the classic dynamic programming edit distance if you just limit distances to < k after Ukkonen 1985 (No, NOT the suffix tree algorithm).

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5408825/

vanderZwan · on March 7, 2022

> It seems to at least not have escaped the attention of biologists/DNA analyzers in bioinformatics

TBF, the sequences they use are probably large enough that I wouldn't be surprised if they redeemed some (formerly) galactic algorithms for this task. It's really no wonder that they've scoured the literature for every optimization available.

cb321 · on March 7, 2022

Agreed proteins as "words" encoded in a 2-bit DNA alphabet surely do get much bigger than even human languages like German. :-)

That said, "galactic algorithms" suggests impracticality/inefficiency "in the small" which is not really true of the Myers bit-vector approach. So, I don't think "only faster for big problems" is a great excuse for the wider world. Myers bit-vector is faster all the time (compared to many other things in far more frequent use in the wild). It's also not that involved. Here is a link to an implementation in about 34 lines of Nim code: https://github.com/c-blake/suggest/blob/master/suggest.nim#L...

vanderZwan · on March 7, 2022

Oh no, sorry for the confusion: I wasn't referring to Myers algorithm, you made it clear in your comment that it is generally faster!

That's impressively concise, although I get the feeling that I should read the papers you alleged to earlier to actually grok it :)

cb321 · on March 7, 2022

Fair enough. IMO, Hyyro's description is easier to follow than Myers' own which may be the real explanation for its relative obscurity.

Re: concision - an incomplete summary of why I like Nim is that it is more concise than Python, as fast as C, and safe, both in its metaprogramming and otherwise.

chrismorgan · on March 7, 2022

(2010), in case you couldn’t tell from the tiny, low-contrast text and use of cleartext HTTP! (12px, #666 on white, and even lower contrast for links.)

efitz · on March 7, 2022

Great content, sucky web page- saved by reader view.

amelius · on March 7, 2022

How would you use this to find a small string (or string close to it) in a large fixed corpus?

Usually you want to find many small strings in a large corpus.

kreeben · on March 7, 2022

One of the most efficient ways (that I know of) is to create a trie [0] from your corpus, then incorporate the levenshtein algo into the tree traversal algo to give you a levenshtein automata over a trie. Since a trie is a binary tree, if it's properly balanced, this timing of this operation is O(log(n)).

A less efficient way is to traverse the trie and calculate the Levenshtein distance for each word you encounter.

An even less efficient solution might be to sort all strings in your corpus by length and calculate distance for all of them.

[0] https://en.wikipedia.org/wiki/Trie

TrapLord_Rhodo · on March 7, 2022

For those not so technically inclined but still want to implement a bare version of levenstein distancing, I found the fuzzy wuzzy[1] python library.

I used it while at tesla to compare part numbers and obtain pricing variance.

https://pypi.org/project/fuzzywuzzy/

atonse · on March 7, 2022

How many people looked this up after the recent meili and typesense posts? :-)

I have to say, I truly feel more at home at HN than anywhere else on the internet.

cb321 · on March 7, 2022

Also mentioned only 12 hours ago: https://news.ycombinator.com/item?id=30578267

brightball · on March 7, 2022

Very cool. I spent a lot of time with Levenshtein distance a few years back on an antispam system (detecting similar messages). Works well.

Ygg2 · on March 7, 2022

Is there any other blogpost that deals with cool algorithm?

Blog is great, but pretty old at this point.

lewisjoe · on March 7, 2022

Very interesting! Is there any library that uses this algorithm underneath?

kreeben · on March 7, 2022

Lucene: https://lucene.apache.org/core/4_6_0/core/org/apache/lucene/...

TrapLord_Rhodo · on March 7, 2022

https://pypi.org/project/fuzzywuzzy/