PostgreSQL will not perform well on genome data compared to specialized algorith...

panabee · on July 12, 2022

thanks for the suggestion.

the reason to create a BLAST alternative is to gain more granular control over matches.

unless i'm mistaken, BLAST won't help on searches akin to this reg ex: GTC[ATGC]{3,10}AACCGT

(i.e., GTC followed by any sequence of 3-10 nucleotides then ending with AACCGT.)

another way of framing the goal is to enable reg ex searches over genome/protein data.

do you know if anything allows this today?

tveita · on July 12, 2022

In theory you can support this search with a suffix tree [1] or n-gram index [2], (but you will need n>3 because your alphabet is small.)

I'm not sure what ready-made tools or libraries you might find for this - last time I checked there were a lot more papers than production-quality open source code.

Or you could always just try a fast RE engine like Hyperscan and see if it is fast enough for your use case.

[1] https://dl.acm.org/doi/10.1145/235809.235810

[2] https://swtch.com/~rsc/regexp/regexp4.html

hnfong · on July 12, 2022

I had some friends who worked on research of data structures specifically for genome sequences. No idea about the details, but they're in the same tune as this paper: https://hub.hku.hk/handle/10722/60628

(the first author was their supervisor, they published a series of similar computational biology papers that you might be interested in, so if you check his other publications you (GP) might find something interesting or a least find an entry point to dig into references and citations that may result in something you'd be interested in -- that said if you find any data structure useful you'll probably have to implement it yourself...)

panabee · on July 12, 2022

thanks for sharing. will check this out!

panabee · on July 12, 2022

thanks for the suggestions.

to confirm, it sounds like you agree that BLAST wouldn't work for regex searches?

prefer not creating a specialized search tool if avoidable, but there doesn't seem to be any alternative.