I think what OP is hinting at here is cache invalidation/eviction caused by the additional thread's processing and memory operations. If your requirements are down to nanosecond granularity then cache misses are probably being measured and are noticeable. A third-party logging thread doing memory copies and other log processing sounds like a fine way to unintentionally evict a bunch of cache entries. You might be able to mitigate this to some extent with CPU affinities for threads, I suppose. Another option would be to move logging to the network and record packets in/out with an out-of-band monitoring solution.
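For what it's worth, here is a minimal sketch of the CPU affinity idea. It is Linux-only and assumes glibc's `sched_setaffinity`/`sched_getaffinity`; the helper names `first_allowed_cpu` and `pin_current_thread` are made up for illustration. Which core you actually reserve for logging depends on your machine's topology, which this does not try to work out.

```cpp
// Linux/glibc only. g++ defines _GNU_SOURCE by default, which exposes the CPU_* macros.
#include <sched.h>

// Hypothetical helper: lowest-numbered CPU the calling thread may run on, or -1 on error.
inline int first_allowed_cpu() {
    cpu_set_t set;
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof set, &set) != 0) return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
        if (CPU_ISSET(cpu, &set)) return cpu;
    }
    return -1;
}

// Hypothetical helper: restrict the calling thread to a single CPU.
// Passing 0 as the first argument to sched_setaffinity means "the calling thread".
inline bool pin_current_thread(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof set, &set) == 0;
}
```

The logging thread would call `pin_current_thread` once at startup with a core that the latency-sensitive threads are kept off. Note that pinning alone does not stop other processes from being scheduled on that core.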
Logging the raw structured log entry and off-loading any string formatting to another system is a very interesting idea. Just push the whole "friendly message" stuff off to other logging infrastructure where the latencies matter less.
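Roughly what that could look like, as a sketch only: the hot path writes a fixed-size record holding a format ID and raw integer arguments, and some other process turns it into text later. `LogRecord`, `kFormats`, `log_raw` and `decode` are all names invented for this example, and the `std::vector` sink is just a stand-in for whatever buffer or socket you would really write to.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical fixed-size record: only an ID and raw values, no text.
struct LogRecord {
    std::uint64_t timestamp_ns;  // supplied by the caller
    std::uint32_t format_id;     // index into kFormats
    std::int64_t  args[2];       // raw arguments, not yet formatted
};

// Format table known to the decoder (assumed here; it could be generated at build time).
static const char* const kFormats[] = {
    "order %lld filled at price %lld",
    "queue depth %lld, dropped %lld",
};
static const std::size_t kFormatCount = sizeof kFormats / sizeof kFormats[0];

// Hot path: append the raw bytes of the record. No string formatting happens here.
inline void log_raw(std::vector<unsigned char>& sink, const LogRecord& rec) {
    const unsigned char* p = reinterpret_cast<const unsigned char*>(&rec);
    sink.insert(sink.end(), p, p + sizeof rec);
}

// Cold path (another thread, process, or machine): decode records and format them.
inline std::vector<std::string> decode(const std::vector<unsigned char>& bytes) {
    std::vector<std::string> lines;
    for (std::size_t off = 0; off + sizeof(LogRecord) <= bytes.size(); off += sizeof(LogRecord)) {
        LogRecord rec;
        std::memcpy(&rec, bytes.data() + off, sizeof rec);
        if (rec.format_id >= kFormatCount) continue;  // skip records with an unknown ID
        char line[128];
        std::snprintf(line, sizeof line, kFormats[rec.format_id],
                      static_cast<long long>(rec.args[0]),
                      static_cast<long long>(rec.args[1]));
        lines.emplace_back(line);
    }
    return lines;
}
```

The point is that `log_raw` is a plain memory copy, while all the `snprintf` work sits in `decode`, which can run wherever latency matters less.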
At a few places I've worked, logging and util stuff had its own core to prevent L1/L2 cache pollution (either with threads or shared memory, as long as you got it out of the hot path).
Yes, but you need a location to copy them to. They can’t be on the stack, since the logging task runs asynchronously. So each captured argument must be heap allocated in some form. E.g. passing 2 strings and one integer as arguments might require 3 additional heap allocations, where the malloc/free overhead might outweigh the string formatting costs. You can try to optimize here with specialized allocators (e.g. arenas), or by generating a dedicated structure for each logging callsite so that all arguments are carried inside a single heap-allocated struct (instead of 3 here). But that might lead to other disadvantages, e.g. code bloat.
That would be a terrible implementation strategy. You would simply copy each argument into the communication ring buffer. You can also use specialised strategies for some types: for example, string contents can be copied inline into the buffer instead of copying the string objects themselves.
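A rough sketch of the ring buffer approach, assuming a single producer and a single consumer and fixed-size slots. All the names here (`Slot`, `SpscRing`, `Reader`, the slot sizes) are invented for illustration. Strings go into the slot as a length prefix followed by their bytes, so nothing is heap allocated on the producer side.

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>

constexpr std::size_t kSlotBytes = 128;  // assumed maximum payload per log call
constexpr std::size_t kSlots = 64;       // assumed ring size (power of two)

struct Slot {
    std::size_t used = 0;
    unsigned char bytes[kSlotBytes];

    bool put(const void* src, std::size_t n) {
        if (n > kSlotBytes - used) return false;  // does not fit
        std::memcpy(bytes + used, src, n);
        used += n;
        return true;
    }
    bool put_int(std::int64_t v) { return put(&v, sizeof v); }
    // String contents are stored inline: 4-byte length, then the characters.
    bool put_str(const std::string& s) {
        const std::uint32_t len = static_cast<std::uint32_t>(s.size());
        if (sizeof len + s.size() > kSlotBytes - used) return false;  // all or nothing
        put(&len, sizeof len);
        return put(s.data(), s.size());
    }
};

class SpscRing {
public:
    // Producer side: returns the next free slot, or nullptr if the ring is full.
    Slot* begin_write() {
        const std::size_t head = head_.load(std::memory_order_relaxed);
        if (head - tail_.load(std::memory_order_acquire) == kSlots) return nullptr;
        Slot* s = &slots_[head % kSlots];
        s->used = 0;
        return s;
    }
    void commit_write() {
        head_.store(head_.load(std::memory_order_relaxed) + 1, std::memory_order_release);
    }
    // Consumer side: returns the oldest unread slot, or nullptr if the ring is empty.
    const Slot* begin_read() const {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        if (tail == head_.load(std::memory_order_acquire)) return nullptr;
        return &slots_[tail % kSlots];
    }
    void commit_read() {
        tail_.store(tail_.load(std::memory_order_relaxed) + 1, std::memory_order_release);
    }
private:
    std::array<Slot, kSlots> slots_{};
    std::atomic<std::size_t> head_{0};  // written only by the producer
    std::atomic<std::size_t> tail_{0};  // written only by the consumer
};

// Consumer-side decoder; must read fields in the same order they were written.
struct Reader {
    explicit Reader(const Slot& s) : slot(s), pos(0) {}
    std::int64_t get_int() {
        std::int64_t v;
        std::memcpy(&v, slot.bytes + pos, sizeof v);
        pos += sizeof v;
        return v;
    }
    std::string get_str() {
        std::uint32_t len;
        std::memcpy(&len, slot.bytes + pos, sizeof len);
        pos += sizeof len;
        std::string s(reinterpret_cast<const char*>(slot.bytes + pos), len);
        pos += len;
        return s;
    }
    const Slot& slot;
    std::size_t pos;
};
```

The producer calls `begin_write`, a few `put_*` calls, then `commit_write`; the logging thread does `begin_read`, decodes with a `Reader`, formats, then calls `commit_read`. A real logger would also carry a format or callsite ID in the slot so the consumer knows which fields to expect.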
In my logger the thread queue was (and is) node-based, so it was just a matter of making one bigger allocation and placing the log entry and the data copies in it contiguously. One allocation.
Unless you do something dumb (like copy a vector when you only need to print the size), copying is significantly faster than string formatting.