The architecture of GPUs is less exotic than people would have you believe. A given graphics card has a couple dozen relatively slow cores (arranged in a NUMA hierarchy), each with around 10 logical threads and 32- or 64-wide SIMD. The logical threads let each core keep many memory operations in flight at once, hiding latency, while the wide SIMD provides the massive parallelism.
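To make the scale concrete, here is a back-of-the-envelope sketch of that hardware layout. The specific numbers are illustrative assumptions in the spirit of the description above, not the spec of any particular GPU:

```python
# Hypothetical GPU configuration (illustrative, not a real vendor spec).
CORES = 24             # "a couple dozen" relatively slow cores
THREADS_PER_CORE = 10  # logical threads per core, to hide memory latency
SIMD_WIDTH = 32        # SIMD lanes per logical thread (64 on some hardware)

# Each logical thread drives a full SIMD unit, so saturating the chip
# requires one software thread per lane:
threads_needed = CORES * THREADS_PER_CORE * SIMD_WIDTH
print(threads_needed)  # 7680 -- thousands of concurrent threads
```

Even with these modest assumed numbers, full occupancy already demands several thousand threads, which is why GPU workloads are written to expose far more parallelism than a CPU program would.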
There's some cleverness in the programming model, however: the programmer writes code as if it runs on a single SIMD lane, and the hardware runs 32 or 64 copies of it in lockstep across the lanes. In total, keeping every lane of every logical thread of every core busy requires thousands of concurrent threads.
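A minimal sketch of that lockstep model, with Python standing in for the hardware (the warp width, `kernel` function, and `run_warp` loop are all hypothetical illustrations, not a real GPU API): the programmer writes scalar code from one lane's point of view, and the machine conceptually executes one copy per lane under a shared program counter.

```python
WARP_WIDTH = 32  # assumed SIMD width; 64 on some hardware

def kernel(lane_id, xs, ys, out):
    # Scalar code, written as if it runs on a single SIMD lane.
    out[lane_id] = xs[lane_id] + ys[lane_id]

def run_warp(kernel, *args):
    # The hardware runs WARP_WIDTH copies of the kernel in lockstep,
    # one per lane, all sharing a single instruction stream.
    for lane_id in range(WARP_WIDTH):
        kernel(lane_id, *args)

xs = list(range(WARP_WIDTH))
ys = [1] * WARP_WIDTH
out = [0] * WARP_WIDTH
run_warp(kernel, xs, ys, out)
print(out[:4])  # [1, 2, 3, 4]
```

The sequential loop is only a model, of course; on real hardware all the lanes advance together in a single cycle per instruction, which is why divergent branches between lanes are costly.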
(There is also some special purpose hardware for graphics related tasks, but that is less relevant to GPGPU workloads)