Another thing to check is how nginx was compiled. Using generic optimizations vs. x86_64-specific ones can do interesting things on VMs vs. bare metal. nginx and haproxy specifically should be compiled generic for VMs. I don't have any links, just my own performance testing in the past.
A binary running in a VM is still executing native machine code, so compiler optimizations should have the same effect whether running on bare metal or a VM.
Should being the key word. In truth, the implementation of each hypervisor varies. Try it on each hypervisor that you use. I found KVM to have the closest parity with bare-metal performance.
I'm struggling to think of a situation when running virtualized vs. bare metal where compiler optimizations would matter.
Certain hypervisors can disable features on the virtual CPU to allow live migration between different generations of physical CPUs. In that case, a binary that depends on a disabled virtual CPU feature (e.g., AVX-512) will simply crash (or otherwise fail) when it executes an unsupported instruction.
Other than that, I'm drawing a blank.
Hypervisor performance will vary, but I can't envision any scenario where a binary optimized for the processor's architecture would perform worse than one without any optimizations when running on a VM vs bare metal.
That's the point. It shouldn't matter, but in fact it does. You can see this for yourself if you benchmark x86_64-optimized builds on VMs. You will see varying results depending on the hypervisor and the application, and the results will even change over time as each hypervisor is updated. What I am describing is exactly what is not supposed to happen, which is why you are struggling to think of a situation where this should matter. You are being entirely logical.
IIRC, hypervisors have to preserve CPU registers and other processor-related state when switching between "worlds". This is why mitigations for CPU vulnerabilities affect hypervisors too.
Most compilers assume that code emitted for certain modes (SSE/AVX etc.) has a particular cost. That cost may change drastically depending on how a given hypervisor's implementation handles the registers in question.
The point of a hypervisor is that any instruction can potentially be trapped and either emulated or substituted with others. If your application uses a lot of instructions that get trapped and end up using slower emulation it will hurt performance.
Yeah, but most of the instructions that get emulated are not used in applications. They are things that operating systems do, like sending interrupts to physical cores (which of course need the hypervisor).