Moreover the actual GPU doesn't work like that either. GPUs do not have the capa...

mattnewport · on Feb 2, 2015

Yeah, I'd have to agree it's hard to describe this post in more generous terms than just flat out wrong. DX12 is making it more efficient to spread CPU side rendering work across multiple cores but it's not about letting individual CPU cores talk to individual GPU cores. That isn't even really a coherent concept.

The whole digression on lighting is mostly just wrong too. Deferred renderers have been rendering with 100s of dynamic lights for years. DX12 may make it a bit more efficient to deal with the large amount of constant data that needs to be updated when dealing with 100s of dynamic lights but it isn't introudcing any fundamental changes to dynamic lighting.

jra101 · on Feb 2, 2015

Most GPUs group together individual ALUs into larger units (sometimes called compute units or clusters or SMs) and each compute unit can run independent work.

http://www.anandtech.com/show/8526/nvidia-geforce-gtx-980-re...

bhouston · on Feb 2, 2015

I believe that is true but I haven't seen this exposed via DX. Does dx12 expose this functionality? Does cuda or OpenGL expose this type of xontrol.?

m_mueller · on Feb 3, 2015

You can have multiple CUDA kernels running simultaneously on the same GPU, but you have no direct control over which SM(X) core is assigned to which kernel AFAIK. So it actually works pretty similarly to multi threaded programming on CPU if you take away thread affinity. In general I find a good way to approximate a top-of the line NVIDIA GPU is to think of it as an ~8 core with a vector length of 192 for single precision and 96 (half of full length) for double precision. It has a high memory bandwidth which has the limitation of requiring memory accesses to 32 neighbours simultaneously in order to make full use of the performance. The CUDA programming model is particularly set up in a way that the programmer doesn't have to handle this manually - (s)he just needs to be aware of it. I.e. you program everything scalar, map the thread indices to your data accesses and make sure that the first thread index (x) maps to the fastest varying index of your data.

kllrnohj · on Feb 3, 2015

> In general I find a good way to approximate a top-of the line NVIDIA GPU is to think of it as an ~8 core with a vector length of 192 for single precision and 96 (half of full length) for double precision.

But that's not entirely correct either. Yes you can use it like that, but you can also use it as a single core with a vector length of 1536.

In the context of over-simplification these are better thought of as single-core processors. The reason being if you have method foo() that you need to run 10,0000 times, it doesn't matter if you use 1 thread, 2 threads, or 8 threads - the total time it will take to complete the work will be identical. This is very different from an 8-core CPU where using 8 threads will be 8x faster than using 1 thread (blah blah won't be perfectly linear etc, etc).

jra101 · on Feb 2, 2015

No, this is something the GPU front end controls.

jeremiep · on Feb 2, 2015

You still have to be aware of it when optimizing the shaders and workloads though. On consoles where the hardware is fixed this is easily profiled.

The GPU is creating threads and tasks internally and it's not always easy to balance this workload so no parts of the GPU becomes saturated while following parts in the chip's pipeline are idly waiting for work.

The PowerVR chips we're working with have dozens and dozens of different profile metrics corresponding to the different areas of its pipeline, each one being a potential bottleneck.

You could do something as silly as render a ball with 12k vertices instead of 24 and expecting the vertex processing to be much slower, but after profiling you find out its the fragment part lagging way behind because the data sequencer is overloaded trying to generate fragment tasks. In both cases you're rendering about the same amount of pixels.

With unified shader architectures, its very frequent for vertex and fragment tasks from different draw calls to overlap simultaneously. We're even seeing tasks from different render targets overlapping! Such as fragment tasks from the shadow pass still running when the solid geometry pass is processing its vertices.

bhouston · on Feb 3, 2015

This is fascinating. I would love to know more. You should write an article about advanced optimization for mobile GPUs or something.

Arwill · on Feb 2, 2015

>build them on multiple threads

That was the point of the blog post. Maybe described wrong, but that seems to be the point. The possibility of uploading stuff to the GPU on multiple threads on both CPU side, and the ability of the GPU to store those uploads in parallel. Maybe even render shadow-maps in parallel?

I wonder is there already the OpenGL equivalent of this? One of the main hurdles of OGL was issuing all the calls from the main thread, if that is gone now, that would be awesome.

kllrnohj · on Feb 2, 2015

Uploads are already done asynchronously, that's driver optimization 101 level stuff. It's also one of the very few operations that has an independent core to handle it (the copy engine)

Rendering shadow-maps in parallel would be pointless. If you render 2 at the same time, then each map gets half the GPU so an individual render takes twice as long, resulting in the same total time as if you gave each render 100% of the GPU and rendered in sequence.

> I wonder is there already the OpenGL equivalent of this? One of the main hurdles of OGL was issuing all the calls from the main thread, if that is gone now, that would be awesome.

Yes, NV_command_list extension:

http://www.slideshare.net/tlorach/opengl-nvidia-commandlista...

bhouston · on Feb 2, 2015

It isn't possible to render shadow maps in parallel in dx12 I believe.