with the latest (eg TSMC) processes, someone could build a regular array of 32-bit FP transputers (T800 equivalent):
- 8000 CPUs in the same die area as an Apple M2 (16 TIPS, ie ~36x faster than an M2)
- 40000 CPUs in a single reticle (80 TIPS)
- 4.5M CPUs per 300mm wafer (10 PIPS)
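a back-of-envelope sketch in Go, just to show how those three figures hang together; the ~2 GIPS per core rate is my assumption, implied by the 8000 CPU / 16 TIPS pairing, not a measured number:

    package main

    import "fmt"

    func main() {
        const gipsPerCPU = 2.0 // assumed per-core rate in GIPS (implied by 8000 CPUs = 16 TIPS)

        configs := []struct {
            name string
            cpus float64
        }{
            {"M2-sized die", 8_000},
            {"single reticle", 40_000},
            {"300mm wafer", 4_500_000},
        }
        for _, c := range configs {
            tips := c.cpus * gipsPerCPU / 1_000 // GIPS -> TIPS
            fmt.Printf("%-15s %9.0f CPUs  ~%.0f TIPS\n", c.name, c.cpus, tips)
        }
        // prints 16, 80 and 9000 TIPS (ie ~9-10 PIPS for the full wafer)
    }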
the transputer async link (and C004 switch) allows for decoupled clocking, CPU-level redundancy and agricultural interconnect
heat would be the biggest issue ... but >>50% of each CPU is low power (local) memory
Before you know it you'll be going down the compute fabric and 'fleet' rabbit hole. For a long time I thought that was the future (I even worked with Transputers back in the day) but now I'm not so sure. GPUs have gotten awfully powerful and are relatively easy to work with compared to trying to harness a large number of independently operating CPUs. Debugging such a setup is really hard. That said, I still have this hope that maybe one day such an architecture will pay off in a bigger way than what has happened so far. If someone cracks the software nut in a decisive manner then it may well happen.
well - yes ... that's the point of occam[1] ... if it can hang, it will hang deterministically
we have to zoom out from the 1980s, when 4 CPUs were a lot ... now that you can build 40,000 CPUs (ie a 200 x 200 array) within the single-reticle limit (ie the same area as a big NVIDIA die), a big MIMD must be coded with algorithmic patterns like map-reduce, pipelining, etc.
but the general-purpose CPU nature and HLL coding mean it is far easier than with CUDA to get close to theoretical max performance
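to make that concrete, here's a minimal sketch of the farm/pipeline pattern in Go (standing in for occam, since both give you CSP-style processes over rendezvous channels); the worker/farm names and the trivial map-reduce are illustrative only:

    package main

    import "fmt"

    // worker is one "CPU" in the farm: it maps a function over whatever
    // arrives on in and sends results on out.
    func worker(in <-chan int, out chan<- int) {
        for x := range in {
            out <- x * x // the "map" step
        }
    }

    func main() {
        const nJobs = 100
        const nWorkers = 4 // stand-in for the 200 x 200 array

        in := make(chan int)  // unbuffered: a send blocks until some worker takes it
        out := make(chan int) // ditto for results

        for i := 0; i < nWorkers; i++ {
            go worker(in, out)
        }

        go func() { // feed the farm
            for x := 1; x <= nJobs; x++ {
                in <- x
            }
            close(in)
        }()

        sum := 0 // the "reduce" step: collect exactly nJobs results
        for i := 0; i < nJobs; i++ {
            sum += <-out
        }
        fmt.Println("sum of squares 1..100 =", sum) // 338350
    }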
[1] or any CSP with both input and output descheduling - ie no queueing
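a minimal Go sketch of what "no queueing" buys you: with unbuffered channels both sides deschedule until the rendezvous, so a cyclic wait can't "sometimes" work depending on buffer state, it hangs the same way every run (Go channels standing in for occam channels here):

    package main

    func main() {
        a := make(chan int) // unbuffered: send and receive must rendezvous
        b := make(chan int)

        go func() {
            a <- 1 // deschedules until main receives on a ...
            <-b    // ... which never happens
        }()

        b <- 2 // main deschedules here: the goroutine is stuck on a <- 1
        <-a    // never reached
        // every run ends the same way, with the runtime reporting
        // "fatal error: all goroutines are asleep - deadlock!"
    }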