Yeah, mmap'ing the weights and leaving caching up to the OS or storage drivers isn't going to match model-informed placement. Observing data access patterns, anticipating what inference will need next, and accounting for hardware latency when distributing the model can yield some pretty significant gains, just as that same approach does in other domains. It's adjacent to compiler optimization.
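For concreteness, here's a minimal sketch of the weakest version of that idea: still using mmap, but feeding the kernel the access pattern instead of leaving it to guess. Assumes Python 3.8+ on Linux (where mmap exposes madvise), a single model file laid out layer by layer, and a hypothetical layer_ranges list of (offset, length) pairs recording where each layer's tensors live in the file:

```python
import mmap
import os

PAGE = mmap.PAGESIZE

def open_model(path):
    """Map the model file read-only and declare the dominant access
    pattern up front instead of letting the kernel infer it."""
    fd = os.open(path, os.O_RDONLY)
    size = os.fstat(fd).st_size
    mm = mmap.mmap(fd, size, prot=mmap.PROT_READ)
    os.close(fd)  # the mapping keeps the file alive
    mm.madvise(mmap.MADV_SEQUENTIAL)  # weights are streamed layer by layer
    return mm

def layer_weights(mm, layer_ranges, i):
    """Return layer i's bytes, prefetching layer i+1 so the read
    overlaps with compute instead of stalling on a page fault."""
    if i + 1 < len(layer_ranges):
        noff, nlen = layer_ranges[i + 1]
        start = noff - (noff % PAGE)  # madvise offsets must be page-aligned
        mm.madvise(mmap.MADV_WILLNEED, start, nlen + (noff - start))
    off, length = layer_ranges[i]
    return mm[off:off + length]
```

A real loader would go further: learn the ranges and their order from observed access traces rather than hardcoding them, and pin the hottest tensors in faster tiers outright instead of just hinting.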
It would be very meta to use AI to observe these access patterns and distribute the model accordingly based on usage, so that it can optimize placement for your given context domain.