If I were less lazy I could probably find this answer online, but how do you find the battery life these days? I'd love to make the switch, but that's the only thing holding me back...
I'd love to see some benchmarks for this on some common genomic formats (fa, fq, sam, vcf). Will be doubly interesting to see its applicability to nanopore data - lots of useful data is lost because storing FAST5/POD5 is a pain.
OpenZL compressed SAM/BAM vs. CRAM is the interesting comparison. It would really test the flexibility of the framework. Can OpenZL reach the same level of compression, and how much effort does it take?
I would not expect much improvement in compressing nanopore data. If you have a useful model of the data, creating a custom compressor is not that difficult. It takes some effort, but those formats are popular enough that compressors using the known models should already exist.
Do you happen to have a pointer to a good open source dataset to look at?
Naively, and knowing little about CRAM, I would expect OpenZL to beat Zstd handily out of the box, but to need additional capabilities to match the performance of CRAM, since genomics hasn't been a focus as of yet. It would be interesting to see how much of what we need to add is generic to all compression (but useful for genomics) versus specific to genomics.
We're planning on setting up a blog on our website to highlight use cases of OpenZL. I'd love to make a post about this.
I will take a look as soon as I get a chance. Looking at the BAM format, it looks like the tokenization portion will be easy. Which means I can focus on the compression side, which is more interesting.
Another format that might be worth looking at in the bioinformatics world is HDF5. It's sort of a generic file format, often used for storing multiple related large tables. It has some built-in compression (gzip IIRC) but supports plugins. There may be an opportunity to integrate the self-describing nature of the HDF5 format with the self-describing decompression routines of OpenZL.
I'd also like to see a comparison between CRAM and OpenZL on a SAM/BAM file. Is OpenZL indexable, i.e. can you extract and decompress just the data you need from a file if you know where it is?
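To make "indexable" concrete, here is a minimal sketch of what I mean: compress the data in independent blocks and keep a small index of where each compressed block starts, so a reader only decompresses the blocks it needs. This is roughly the idea behind block-compressed formats such as BGZF. It uses plain zlib from the Python standard library and is not OpenZL's API; the function names `compress_blocks` and `read_range` are made up for illustration.

```python
import zlib


def compress_blocks(data: bytes, block_size: int = 1 << 16):
    """Compress `data` in independent blocks and return (payload, index).

    index[i] is (compressed_offset, compressed_length) for block i, so a
    reader can locate any block without touching the others.
    """
    payload = bytearray()
    index = []
    for start in range(0, len(data), block_size):
        block = zlib.compress(data[start:start + block_size])
        index.append((len(payload), len(block)))
        payload += block
    return bytes(payload), index


def read_range(payload: bytes, index, block_size: int, start: int, length: int) -> bytes:
    """Return uncompressed bytes [start, start + length), decompressing
    only the blocks that overlap that range."""
    if length <= 0:
        return b""
    first = start // block_size
    last = (start + length - 1) // block_size
    out = bytearray()
    for offset, size in index[first:last + 1]:
        out += zlib.decompress(payload[offset:offset + size])
    skip = start - first * block_size  # bytes to drop from the first block
    return bytes(out[skip:skip + length])
```

The trade-off is that smaller blocks give cheaper random access but a worse compression ratio, since each block is compressed without knowledge of the others.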
Unfortunately, when you write a program that doesn't wrap output FASTAs, you have a bunch of people telling you off because SOME programs (cough bioperl cough) have hard limits on line length :)
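For anyone who does need wrapped output, a minimal sketch of the wrapping step in Python follows. The 60-column default is just a common convention I've assumed here, not a requirement of any particular tool, and `wrap_fasta` is a made-up helper name.

```python
def wrap_fasta(header: str, sequence: str, width: int = 60) -> str:
    """Return one FASTA record with the sequence wrapped to `width` columns."""
    if width <= 0:
        raise ValueError("width must be positive")
    lines = [f">{header}"]
    # Slice the sequence into fixed-width chunks; the last one may be shorter.
    lines.extend(sequence[i:i + width] for i in range(0, len(sequence), width))
    return "\n".join(lines) + "\n"
```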
I really want to like typer, and frequently go down the rabbit hole of rewriting all my argparse into typer, but I keep getting put off by its high import cost and the fact that development seems to be a bit up in the air (see https://github.com/fastapi/typer/issues/678#issuecomment-319...). A shame, because otherwise it's a really nice library!
Agreed, there have been some interesting developments in this space recently (e.g. AgroNT). Very excited for it, particularly as genome sequencing gets cheaper and cheaper!
I’d pitch this paper as a very solid demonstration of the approach, and I’m sure it will lead to some pretty rapid developments (similar to what RoseTTAFold/AlphaFold did).
Something I found useful is that you can create a much more minimal pandoc template for typst than for LaTeX. Obviously, if you're familiar with LaTeX it probably won't be an issue, but when I tried to make my own barebones pandoc template for LaTeX (i.e., stripping out beamer) I gave up.
I've similarly found the combination of pandoc + typst to be quite exciting. I've found it particularly useful for typesetting academic papers: I'm quite averse to Word in general, don't require extensive mathematical typesetting support, and find LaTeX to generally be quite unapproachable (just look at the size of the default pandoc template!), so it gives me a method of making a decent PDF whilst simultaneously producing a .docx for my collaborators. Being able to track changes with git is also a huge advantage, although I've never had the chance to work with someone who is comfortable using git :(
The recently added support for PDF/A is also quite exciting, as I've never found a satisfactory solution to this with LaTeX. Now I just wish journals would support markdown submissions...
As an example of a more informative map of income/deprivation, I recently encountered the Scottish Index of Multiple Deprivation website (https://simd.scot). Only applicable to Scotland (obviously), but it is interesting to see how each city is a mosaic of social status. From personal experience, it is extremely accurate down to the street level!