I'd be curious as well to see how the performance holds up getting into the terabytes, as I haven't tested that. Remember too that there are a lot of parameters for the matching algorithm here (https://github.com/worldveil/dejavu/blob/master/dejavu/finge...) which allow you to trade off accuracy, speed, and storage in different ways. I've tried to document them thoroughly.
Finding duplicates is a great one! Generating a checksum for each audio file (minus the header and ID3 tags) and storing it as a column in the songs table, for all the different filetypes Dejavu supports (mp3, wav, etc.), would probably be the best way to do this.
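A rough sketch of that idea for MP3s, using only the standard library: strip a leading ID3v2 block (10-byte header with a synchsafe size) and a trailing 128-byte ID3v1 tag, then hash what's left. This is a simplification, not Dejavu code; real files can carry padding, APE tags, or multiple tags, and wav/flac would need their own header handling.

```python
import hashlib

def audio_checksum(path):
    """SHA-256 of an MP3's audio payload, skipping ID3v1/ID3v2 tags.

    A sketch for duplicate detection: two files with identical audio
    but different tags should produce the same checksum.
    """
    with open(path, "rb") as f:
        data = f.read()

    # ID3v2: "ID3" + 2 version bytes + 1 flag byte + 4-byte synchsafe size,
    # followed by `size` bytes of tag data.
    if len(data) >= 10 and data[:3] == b"ID3":
        size = 0
        for b in data[6:10]:
            size = (size << 7) | (b & 0x7F)
        data = data[10 + size:]

    # ID3v1: fixed 128-byte block at the very end, starting with "TAG".
    if len(data) >= 128 and data[-128:-125] == b"TAG":
        data = data[:-128]

    return hashlib.sha256(data).hexdigest()
```

Storing that hex digest alongside each song row makes duplicate lookup a simple indexed equality query.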
I say this because so many songs today are built on sampling. Mashups and EDM tracks often sample from other works, and as such, the fingerprints and their alignment can be shared across different songs. Something more clever, like computing the percentage of hashes per song that are shared and comparing it to a threshold, might do the trick, though.
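That thresholding could look something like the following. This is a hypothetical helper, not part of Dejavu's API; the hash lists stand in for whatever fingerprint hashes you'd pull from the database per song, and the 0.9 cutoff is an arbitrary example you'd want to tune.

```python
def hash_overlap(hashes_a, hashes_b):
    """Fraction of song A's fingerprint hashes that also appear in song B."""
    a, b = set(hashes_a), set(hashes_b)
    if not a:
        return 0.0
    return len(a & b) / len(a)

def likely_duplicate(hashes_a, hashes_b, threshold=0.9):
    """Flag near-duplicates; a shared sample alone should stay below the cutoff."""
    return hash_overlap(hashes_a, hashes_b) >= threshold
```

A song that merely samples another would share only a small slice of its hashes, while a true duplicate (or re-encode) would share nearly all of them.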
Happy hacking, and feel free to send in a PR! :)