I don't know how fast SuperMalloc is for other people, but in my experience it's not as fast as some other mallocs (like Hoard or tcmalloc) and, in the end, it's still just another malloc().
Giving up the malloc() interface allows a very simple page-long allocator to outperform every malloc() I've ever tried (including SuperMalloc), and it would truly surprise me to learn there was some exotic memory technique (HTM, caching, etc) that could beat me.
Basically, you ask for an amount of memory, and you get back a pointer to memory. But the interface does not allow you to communicate anything but the amount on an allocation, and you can only give back a pointer on freeing. If, for example, malloc had an interface like:
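void* malloc(size_t size, hint_t hint); /* hint_t is a hypothetical hint type: say, expected lifetime or access pattern */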
One can imagine the allocator optimizing based on the given hint. And once you open up one kind of hint, you can imagine many other things users could communicate explicitly to the memory allocator about the allocation and access patterns of their data.
Every implementation of free also needs to find the metadata associated with the given pointer. We could imagine an interface for free that required the user to maintain that metadata, which could perhaps speed things up:
void free(void* ptr, meta_t d); /* caller hands back the metadata the allocator would otherwise have to look up */
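For that to work, the allocation side would presumably have to hand the metadata out in the first place; a hypothetical counterpart (malloc_ex and meta_t are made-up names) might be:

void* malloc_ex(size_t size, meta_t* d); /* fills *d; the caller keeps it and passes it back to free */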
Or, we could also imagine communicating how important it is to free this memory:
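void free(void* ptr, urgency_t urgency); /* urgency_t is hypothetical: say, FREE_IMMEDIATE vs FREE_DELAYABLE */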
In the latter case, maybe the memory allocator could put off doing delayable requests, and do them in a batch later. (Making it a bit more like garbage collection.)
I'm not sure any of these ideas would help, but the point is that the limited interface means we can't really explore them. Well, we can, but then convincing the rest of the world to change such a basic building block of C code is quite hard.
> In the latter case, maybe the memory allocator could put off doing delayable requests, and do them in a batch later. (Making it a bit more like garbage collection.)
Amusingly, when I've used an IMMEDIATE/DELAYABLE style hint, it was for the opposite purpose: I had some batched deallocs that I would either delay (to spread out over multiple frames instead of handling as a single batch, to eliminate the framerate hitch we were getting), or perform immediately as a single batch (to achieve greater throughput when switching scenes as delayed deallocation was adding untenable amounts of overhead.)
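A minimal sketch of that kind of deferred-free scheme, assuming a fixed-size queue drained against a per-frame budget (the names here are made up, not from any particular engine):

    #include <stddef.h>
    #include <stdlib.h>

    #define DEFER_CAP 4096
    static void* deferred[DEFER_CAP];
    static size_t deferred_len = 0;

    /* DELAYABLE: queue the pointer instead of freeing it now */
    void free_delayable(void* ptr) {
        if (deferred_len < DEFER_CAP)
            deferred[deferred_len++] = ptr;
        else
            free(ptr); /* queue full: just free immediately */
    }

    /* Call once per frame: spreads the cost out to avoid a framerate hitch */
    void drain_budgeted(size_t budget) {
        while (deferred_len > 0 && budget-- > 0)
            free(deferred[--deferred_len]);
    }

    /* IMMEDIATE: e.g. at a scene switch, take the whole batch for throughput */
    void drain_all(void) {
        while (deferred_len > 0)
            free(deferred[--deferred_len]);
    }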
> We can, but then convincing the rest of the world to change such a basic building block of C code is quite hard.
Changing such a fundamental building block of C is impossible.
However, providing a second, alternative interface for those applications that could really benefit from such fiddly high-performance tweaks already happens a good bit, in games at least. Pool allocators, allocators with extra debug information, allocation of entirely different styles of memory (e.g. write-combined memory for texture uploads)... lots of stuff out there. Some low-level graphics APIs now make you decide whether, e.g., you want to put shader constants in their own GPU buffers, or just interleave them into the command buffers themselves...
"jemalloc has an alternative API that allows specifying the size of the allocation"
I would like to see an API for malloc where you don't need to specify the size of the allocation :-)
For those who wonder: it takes additional flags specifying alignment, whether to zero memory, whether to store data in a thread-specific cache, or an arena to use.
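For the curious, a small usage sketch against jemalloc's documented non-standard API (mallocx/sdallocx and the MALLOCX_* flags are from the jemalloc man page; details may vary by version):

    #include <jemalloc/jemalloc.h>

    int main(void) {
        /* 64-byte-aligned, zeroed allocation that bypasses the thread cache */
        void* p = mallocx(1024, MALLOCX_ALIGN(64) | MALLOCX_ZERO | MALLOCX_TCACHE_NONE);
        /* ... */
        sdallocx(p, 1024, 0); /* sized deallocation: the caller supplies the size */
        return 0;
    }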
Another avenue of improvement is to use alloca() for small objects that are local to a function, but there is no direct way to know whether a variable stays local (in C and C++ at least).
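To illustrate why that's hard to automate: alloca() storage dies when the function returns, so it is only safe if you can prove the pointer never escapes. (alloca is non-standard but widely available; the functions below are made-up examples.)

    #include <alloca.h>
    #include <string.h>

    void fine(const char* s) {
        char* tmp = alloca(strlen(s) + 1); /* released automatically on return */
        strcpy(tmp, s);
        /* ... use tmp only inside this function ... */
    }

    char* broken(const char* s) {
        char* tmp = alloca(strlen(s) + 1);
        strcpy(tmp, s);
        return tmp; /* BUG: tmp escapes, but its storage is gone after return */
    }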
> We can, but then convincing the rest of the world to change such a basic building block of C code is quite hard.
It can be done with static analysis or binary instrumentation.
malloc(sizeof(foo)) is slower than alloc_foo() because the latter can simply be a pointer increment, but there is no way to tell the malloc() interface that you're going to be doing nothing but allocating foo for a while.
malloc() can guess this with heuristics, but a good malloc() needs to perform well for a wide variety of use cases. Surely you can appreciate that balance has a cost that the specialised allocator simply doesn't have to pay.
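To make the comparison concrete, here is roughly what such an alloc_foo() bump/pool allocator looks like (a sketch with made-up names; growth and thread safety elided):

    #include <stddef.h>

    typedef struct foo { int x, y; } foo; /* stand-in payload type */

    #define POOL_CAP 4096
    static foo pool[POOL_CAP];
    static size_t pool_next = 0;

    foo* alloc_foo(void) {
        if (pool_next == POOL_CAP)
            return NULL; /* pool exhausted */
        return &pool[pool_next++]; /* allocation is just a pointer increment */
    }

    void free_all_foos(void) {
        pool_next = 0; /* everything is released at once; no per-object free */
    }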