Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

PostgreSQL will not perform well on genome data compared to specialized algorithms like BLAST and its alternatives. For exact search a suffix array should allow fast lookup, but DNA search usually requires approximate matches.

I don't know what is state of the art these days, but this is probably a good place to start: https://en.wikipedia.org/wiki/BLAST_(biotechnology)#Alternat...



thanks for the suggestion.

the reason to create a BLAST alternative is to gain more granular control over matches.

unless i'm mistaken, BLAST won't help on searches akin to this reg ex: GTC[ATGC]{3,10}AACCGT

(i.e., GTC followed by any sequence of 3-10 nucleotides then ending with AACCGT.)

another way of framing the goal is to enable reg ex searches over genome/protein data.

do you know if anything allows this today?


In theory you can support this search with a suffix tree [1] or n-gram index [2], (but you will need n>3 because your alphabet is small.)

I'm not sure what ready-made tools or libraries you might find for this - last time I checked there were a lot more papers than production-quality open source code.

Or you could always just try a fast RE engine like Hyperscan and see if it is fast enough for your use case.

[1] https://dl.acm.org/doi/10.1145/235809.235810

[2] https://swtch.com/~rsc/regexp/regexp4.html


I had some friends who worked on research of data structures specifically for genome sequences. No idea about the details, but they're in the same tune as this paper: https://hub.hku.hk/handle/10722/60628

(the first author was their supervisor, they published a series of similar computational biology papers that you might be interested in, so if you check his other publications you (GP) might find something interesting or a least find an entry point to dig into references and citations that may result in something you'd be interested in -- that said if you find any data structure useful you'll probably have to implement it yourself...)


thanks for sharing. will check this out!


thanks for the suggestions.

to confirm, it sounds like you agree that BLAST wouldn't work for regex searches?

prefer not creating a specialized search tool if avoidable, but there doesn't seem to be any alternative.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: