Moreover, the actual GPU doesn't work like that either. GPUs do not have the capability to run more than one arbitrary workload per core at a time. They have thousands of cores, yes, but those operate in a SIMD-style fashion, not as a bunch of independent parallel threads. They cannot split those cores up into logical chunks that can then individually do independent things.
The whole post isn't just oversimplified, it's just wrong. Across the board wrong wrong wrong. The point of Mantle, of Metal, and of DX12 is to expose more of the low level guts. The key thing is that those low level guts aren't that low level. The threading improvements come because you can build the GPU objects on different threads, not because you can talk to a bunch of GPU cores from different threads.
The majority of CPU time these days in OpenGL/DirectX is spent validating and building state objects. DX12 and the others now let you take lifecycle control of those objects: re-use them across frames, build them on multiple threads, etc. Then talking to the GPU is a simple matter of handing over an already-validated, immutable object. Which is fast. Very fast.
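To make the lifecycle idea concrete, here's a hypothetical Python sketch (all names invented, no real graphics API): state objects pay their validation cost once at construction, possibly on several worker threads, and per-draw submission is then just a cheap handoff of the immutable result.

```python
# Toy model of the DX12-era pattern: validate/build state objects up front
# (on multiple CPU threads), then reuse them across frames. Names are
# invented; this is not a real graphics API.
from concurrent.futures import ThreadPoolExecutor

class PipelineState:
    def __init__(self, desc):
        self.desc = desc
        # stand-in for the expensive driver-side validation, paid once
        self.validated = sorted(desc.items())

def build_pipelines(descs):
    # DX12-style: build state objects in parallel on CPU worker threads
    with ThreadPoolExecutor() as pool:
        return list(pool.map(PipelineState, descs))

def submit(queue, pso):
    # per-draw cost is now just handing over a pre-validated object
    queue.append(pso)

pipelines = build_pipelines([{"shader": "a"}, {"shader": "b"}])
frame_queue = []
for pso in pipelines:  # reused frame after frame, never re-validated
    submit(frame_queue, pso)
```

The point of the sketch is where the cost lives: `_expensive_` work happens in `build_pipelines`, off the submission path, which is the threading win the comment describes.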
Yeah, I'd have to agree it's hard to describe this post in more generous terms than just flat out wrong. DX12 is making it more efficient to spread CPU side rendering work across multiple cores but it's not about letting individual CPU cores talk to individual GPU cores. That isn't even really a coherent concept.
The whole digression on lighting is mostly just wrong too. Deferred renderers have been rendering with 100s of dynamic lights for years. DX12 may make it a bit more efficient to deal with the large amount of constant data that needs to be updated when dealing with 100s of dynamic lights, but it isn't introducing any fundamental changes to dynamic lighting.
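For what it's worth, the reason deferred renderers scale to that many lights fits in a few lines. This is a toy Python sketch (everything here is invented, nothing from a real engine): geometry is rasterized once into a G-buffer, and lighting is then a per-pixel loop over the lights, so adding lights never re-renders geometry.

```python
# Minimal deferred-shading light loop. The "G-buffer" holds per-pixel
# (position, normal) from a single geometry pass; lighting iterates over
# all lights per pixel, which is why 100s of dynamic lights are feasible.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def shade(gbuffer, lights):
    out = []
    for pos, normal in gbuffer:           # one entry per pixel
        intensity = 0.0
        for lpos, lpower in lights:       # per-pixel loop over ALL lights
            d = [l - p for l, p in zip(lpos, pos)]
            dist2 = dot(d, d) or 1e-9     # avoid divide-by-zero
            ldir = [c / dist2 ** 0.5 for c in d]
            ndotl = max(0.0, dot(normal, ldir))
            intensity += lpower * ndotl / dist2
        out.append(intensity)
    return out
```

Cost is pixels × lights for the lighting pass, independent of scene geometry; that separation is the whole trick.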
Most GPUs group together individual ALUs into larger units (sometimes called compute units or clusters or SMs) and each compute unit can run independent work.
You can have multiple CUDA kernels running simultaneously on the same GPU, but you have no direct control over which SM(X) is assigned to which kernel, AFAIK. So it actually works pretty similarly to multithreaded programming on a CPU if you take away thread affinity. In general I find a good way to approximate a top-of-the-line NVIDIA GPU is to think of it as an ~8-core CPU with a vector length of 192 for single precision and 96 (half the full length) for double precision. It has high memory bandwidth, with the limitation that 32 neighbouring memory locations must be accessed simultaneously to make full use of that performance. The CUDA programming model is set up precisely so that the programmer doesn't have to handle this manually - (s)he just needs to be aware of it. I.e. you program everything scalar, map the thread indices to your data accesses, and make sure that the first thread index (x) maps to the fastest-varying index of your data.
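That last sentence is the whole trick. A plain-Python illustration of the index mapping (`WIDTH` is a made-up row width; indices only, no GPU involved):

```python
# threadIdx.x (the fastest-varying thread index) should map to the
# fastest-varying index of the data, so the 32 threads of a warp touch
# 32 adjacent elements in one coalesced memory transaction.

WIDTH = 8  # hypothetical row width of a row-major 2D array

def flat_index(x, y, width=WIDTH):
    # x varies fastest -> consecutive x gives consecutive addresses
    return y * width + x

# four consecutive "threads" (x = 0..3, y = 0) touch adjacent addresses:
coalesced = [flat_index(x, 0) for x in range(4)]   # adjacent: full bandwidth
# swapping the roles of x and y strides by WIDTH instead:
strided = [flat_index(0, y) for y in range(4)]     # strided: wasted bandwidth
```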
> In general I find a good way to approximate a top-of-the-line NVIDIA GPU is to think of it as an ~8-core CPU with a vector length of 192 for single precision and 96 (half the full length) for double precision.
But that's not entirely correct either. Yes you can use it like that, but you can also use it as a single core with a vector length of 1536.
In the context of over-simplification these are better thought of as single-core processors. The reason being: if you have a method foo() that you need to run 10,000 times, it doesn't matter if you use 1 thread, 2 threads, or 8 threads - the total time it takes to complete the work will be identical. This is very different from an 8-core CPU, where using 8 threads will be 8x faster than using 1 thread (blah blah won't be perfectly linear, etc, etc).
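A back-of-envelope model of that claim (all numbers made up): splitting the launch across more "threads" divides both the per-thread work and the per-thread share of aggregate throughput, so total time is unchanged.

```python
# On a throughput machine, total time for N independent calls to foo()
# depends on N and aggregate throughput, not on how the launch is split.
import math

def total_time(n_items, throughput_per_sec, n_threads):
    per_thread_items = math.ceil(n_items / n_threads)
    per_thread_rate = throughput_per_sec / n_threads  # shared hardware
    return per_thread_items / per_thread_rate

# 10,000 calls at an aggregate 1,000 calls/sec (hypothetical figures):
t1 = total_time(10_000, 1_000, 1)   # 10.0 seconds
t8 = total_time(10_000, 1_000, 8)   # 10.0 seconds as well
```

Contrast with a CPU, where each thread brings its *own* throughput, so the per-thread rate doesn't shrink as you add threads.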
You still have to be aware of it when optimizing the shaders and workloads though. On consoles where the hardware is fixed this is easily profiled.
The GPU is creating threads and tasks internally, and it's not always easy to balance this workload so that no part of the GPU becomes saturated while later stages of the chip's pipeline sit idle waiting for work.
The PowerVR chips we're working with have dozens and dozens of different profile metrics corresponding to the different areas of its pipeline, each one being a potential bottleneck.
You could do something as silly as rendering a ball with 12k vertices instead of 24 and expect the vertex processing to be much slower, but after profiling you find out it's the fragment stage lagging way behind, because the data sequencer is overloaded trying to generate fragment tasks. In both cases you're rendering about the same number of pixels.
With unified shader architectures, it's very common for vertex and fragment tasks from different draw calls to overlap. We're even seeing tasks from different render targets overlapping, such as fragment tasks from the shadow pass still running while the solid geometry pass is processing its vertices.
That was the point of the blog post. Maybe described wrong, but that seems to be the point: the possibility of uploading stuff to the GPU from multiple threads on the CPU side, and the ability of the GPU to store those uploads in parallel. Maybe even render shadow maps in parallel?
I wonder, is there an OpenGL equivalent of this already? One of the main hurdles of OGL was issuing all the calls from the main thread; if that is gone now, that would be awesome.
Uploads are already done asynchronously; that's driver optimization 101 level stuff. It's also one of the very few operations that has an independent core to handle it (the copy engine).
Rendering shadow-maps in parallel would be pointless. If you render 2 at the same time, then each map gets half the GPU so an individual render takes twice as long, resulting in the same total time as if you gave each render 100% of the GPU and rendered in sequence.
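The arithmetic, with made-up numbers:

```python
# Rendering two shadow maps concurrently halves each map's share of the
# GPU, doubling each map's render time, so wall-clock time equals the
# sequential case. WORK and RATE are hypothetical figures.

WORK = 100.0   # work units per shadow map
RATE = 50.0    # GPU throughput, work units per ms

sequential = 2 * (WORK / RATE)   # one after the other
parallel = WORK / (RATE / 2)     # both at once, half the GPU each

assert sequential == parallel    # same 4.0 ms either way
```

Parallelism across the CPU/GPU boundary (building command lists while the GPU renders) is where the win is; parallelism between two GPU-bound renders of the same size buys nothing on a saturated GPU.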