Hacker News
Moore's Law, AI, and the pace of progress (lesswrong.com)
107 points by shelfchair on Dec 11, 2021 | hide | past | favorite | 35 comments


Here is a (semi)recent paper in Science about the end of Moore's law. As I understand it (but I'm not an expert), Figure 2 gives pretty compelling evidence that Dennard scaling (i.e., the phenomenon that historically allowed smaller transistors to run at higher clock speeds without an increase in power density) stopped around 2005, and that subsequent speedups have largely come from on-chip parallelism.

https://www.science.org/doi/10.1126/science.aam9744


That’s what I was taught back in school: the end of Dennard scaling pushed the industry to multi-core, along with briefly exceeding sustained power limits to race a task to the finish when that gives better average efficiency. IIRC the power density of modern chips is near that of a nuclear reactor.

Edit: yep, this was the source of the claim, slide 4: https://www.glsvlsi.org/archive/glsvlsi10/pant-GLSVLSI-talk.... A 3090 will spike to 500 W+, which works out to ~80 W/cm2.
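As a rough sanity check on that figure (a sketch; the ~628 mm2 GA102 die area for the 3090 is my assumption, not from the slides):

    # rough power-density check; die area is an assumed figure, not from the slides
    power_w = 500            # transient spike quoted above
    die_area_mm2 = 628       # approx. GA102 (RTX 3090) die area (assumption)
    die_area_cm2 = die_area_mm2 / 100
    print(power_w / die_area_cm2)   # ~80 W/cm^2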


I didn't say anything about algorithms improving faster than hardware. With better initialization and regularization, more efficient layers discovered by AutoML, sparsity, lower precision, or by directly combining models trained separately on diverse tasks instead of training sequentially, we could have better models at the same hardware cost.

Even by improving the way people collaborate on model training we could make a huge leap. Instead of retraining from scratch we could reuse and update existing models. [1]

[1] A Call to Build Models Like We Build Open-Source Software https://colinraffel.com/blog/a-call-to-build-models-like-we-...


I keep reading about how AutoML will make things more efficient, but in my experience it's been hideously expensive in both time and cost (and, I'm therefore guessing, energy), and has yielded marginal improvements at best. Often I have encountered AutoML in the form of an overfit model with test-data leaks that underperformed in prod after overperforming in test.

So - I don't see how AutoML is going to discover more efficient layers without expending loads of energy. Can you describe how it will contribute to lowering the whole energy budget?

On the models-as-OSS thing - I think everyone likes this idea, but models are not the same as code, and OSS doesn't always produce good code (cf. desktop Linux).


Very naive question, but why do we need to go smaller and smaller? Why can we not go bigger in terms of processor size? Why would doubling the size of the processor (== doubling the number of transistors) be worse than packing more transistors into the same space? What are the limiting factors in going bigger?


Yield and reticle size!

You can have more chiplets in your processor, but the manufacturing process won't let you make one really big monolith.

You can have something like Cerebras' wafer-scale chip. It's a whole wafer!

But with that approach you still really have a network of identical processor subunits, not one big processor.


Also, for a given resistivity, dielectric, and thickness, resistance and capacitance both scale with wire length. That means that RC settling time goes as length squared (wire width roughly cancels, since widening a wire lowers R but raises C)... so even ignoring any transistor scaling, the wiring alone limits speed very quickly.

Reticle size (limited by lens DoF) is roughly 30 mm on a side, and GPUs have been close to that for years.
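A minimal sketch of the wiring argument, with made-up per-unit values: both R and C grow linearly with wire length, so the RC product grows as length squared.

    # illustrative only: wire RC delay vs. length (per-mm values are made up)
    r_per_mm = 1.0        # ohms per mm of wire
    c_per_mm = 0.2e-12    # farads per mm of wire
    for length_mm in (1, 2, 4, 8):
        rc_seconds = (r_per_mm * length_mm) * (c_per_mm * length_mm)
        print(length_mm, rc_seconds)   # delay roughly quadruples each time length doubles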


Thus the thought experiment near the end of the article:

> assume two iPhones floating in space were simulating two connected neurons, with direct laser links between them. In order for the two to communicate with worse than the 1/200 second latency as neighboring neurons in our brains do...the two phones would need to be over 1000 miles away from each other, about the radius of the moon.

> Thus, for the silicon advantage to start hitting scale out limits relative to what we know is biologically necessary, we would need to be building computers about the size of the moon.
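The arithmetic behind that quote is just speed-of-light delay (a sketch, ignoring switching and routing overhead):

    # distance light travels in the ~1/200 s neuron-to-neuron latency quoted above
    c_miles_per_s = 186_282      # speed of light in vacuum
    latency_s = 1 / 200
    print(c_miles_per_s * latency_s)   # ~930 miles, the same order as the moon's ~1080-mile radius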


I suppose that in the brain most neurons talk to their relatively close neighbors.

To get something comparable to brain tissue, we need to be able to disperse the computing power and memory across a large number of small nodes. Currently both CPUs and RAM are hyper-centralized.


Consider what we started with - vacuum tubes - and the tasks being optimized, and learn about the first few iterations as processors were being developed. If each transistor holds a bit, then the number of transistors is the number of bits you can process per cycle. Each time you cut the transistor size in half, you can fit 4 times as many transistors in the same space, and it takes less time for a signal to propagate between individual transistors. At a certain point, physics limitations will prevent shrinking the 2D area over which you can arrange transistors, so you'll have to start building into 3D arrangements.

Bigger is slower and fits fewer transistors per unit area. It's also hotter, since the electrical connections are longer and thus have more resistance.
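A toy illustration of the scaling point above (purely illustrative; real process nodes do not shrink this cleanly):

    # toy model: halving linear dimensions quadruples transistors per unit area
    for halvings in range(4):
        linear_scale = 0.5 ** halvings
        relative_density = 1 / linear_scale ** 2
        print(linear_scale, relative_density)   # 1x, 4x, 16x, 64x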



Cost. Bigger is more expensive, so going bigger improves performance but does not improve performance per dollar.


Power and speed. Bigger transistors need more electrons to switch and have larger threshold voltages. More voltage and current sloshing back and forth means more time per transition and more power.
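A sketch of the first-order relation behind this, with illustrative values for switching activity, capacitance, voltage, and frequency: dynamic power goes roughly as C * V^2 * f, so larger devices at higher voltage pay twice.

    # first-order CMOS dynamic power: P ~= alpha * C * V^2 * f (illustrative values)
    def dynamic_power(alpha, c_farads, v_volts, f_hz):
        return alpha * c_farads * v_volts ** 2 * f_hz

    print(dynamic_power(0.1, 1e-9, 1.0, 3e9))   # baseline: ~0.3 W
    print(dynamic_power(0.1, 2e-9, 1.2, 3e9))   # bigger device, higher voltage: ~0.86 W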


>why do we need to go smaller and smaller

Cost, performance, energy efficiency.


You can't make a CPU faster by making it bigger. All you can do is have more CPUs working at the same time. But for the most part computers don't need more parallel processors; they need to figure out how to use the ones they already have.


> You can't make a CPU faster by making it bigger.

Modern CPUs identify and take advantage of potential parallelism in apparently sequential code. When you compare Apple’s large cores with the small ones on the same SoC the extra transistors are used to do this.

So you can make a CPU faster by making it bigger.


Well, the last decade of graphics processing and ML is pretty solidly on the side of more+better parallelism. And brains, for that matter.


Human cells/neurons also run in parallel.


bigger = components are farther away from each other = electrical signals have to travel for a longer time


People don't understand that peak energy is more important for progress than anything else.

The economy is dependent on eternal growth and will fail unless you climb the tree of progress faster than the rate of energy decline, which is accelerating VERY fast.

Once customers realize that the low-hanging fruit has been poached and that they won't get better hardware when their old hardware dies (or, most likely, no hardware at all), they will look at buying quality that is open (not gatekept with software), and the inflationary rent-seeking cycle will spark.

Buy what you can while it's still open enough to last you 100 years.

EUV is a dead end because the costs of that complexity cannot be maintained when the energy supply is failing, production speed is too slow, and I suspect quality will deteriorate quickly with heat and time!

The Raspberry Pi 4 gets 2 Gflops/W at 28nm and the M1 gets 2.5 Gflops/W at 5nm; you do the math!
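For what it's worth, the "math" at those quoted figures (taken at face value; they are disputed downthread) is just:

    # comparison at the figures quoted above (taken at face value, not verified)
    rpi4_gflops_per_w = 2.0    # at 28nm, per the comment
    m1_gflops_per_w = 2.5      # at 5nm, per the comment
    node_ratio = 28 / 5
    efficiency_ratio = m1_gflops_per_w / rpi4_gflops_per_w
    print(node_ratio, efficiency_ratio)   # ~5.6x node-number shrink vs. only 1.25x efficiency gain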

RAM latency will never improve, and increased bandwidth is meaningless for all interesting applications (i.e., the ones that are not embarrassingly parallelizable).


Can somebody rephrase this comment? I don't understand it.


I'm not quite sure I follow it myself, but here are the key points as I understood them:

* Energy decline is accelerating.

* Perf-per-watt is stagnating (hence the comparison of the RPi @ 28nm vs the M1 @ 5nm).

* This trend leads to economic decline (Note: I don't understand why, maybe it's explained but I didn't get it).

* Eventually hardware won't be replaceable.

* Software should be improved to get better efficiency (Note: I'm not sure this is a claim made by OP, but it's something I understood).

Edit: having taken the time to summarize what I understood from the comment, I must say I don't quite agree. The RPi vs M1 comparison seems way, way off: they both have a similar power envelope, but the M1 has about 10x better results in various benchmarks. I don't agree that hardware won't be replaceable (save for a civilization-shattering event, which is not what we're discussing here).

I do believe software _could_ be more optimized, but then again I'm not sure that's a point OP was making (they did mention something about "software gatekeeping", which I took that way, but I probably misunderstood what they really meant). Now that I've thought about this some more, this is probably something I conflated into OP's post based on my own opinions.

I'm not quite sure what to think about energy decline. I think they meant that things now take more energy than they used to? If that is the correct interpretation, I'm not sure what we're talking about. On micro scales (e.g., computers) I don't think that holds, but maybe it is happening on macro scales (e.g., large industries). But even for macro scales, I think there's been a shift to efficiency, if only to reduce energy costs. Then again, that's not an area where I have the faintest idea, so maybe I'm blatantly wrong and OP is right.


I think OP's argument is: if you have less energy available, then you will manufacture fewer things. Hence the advice to seek long-lasting, energy-efficient, replaceable/repairable hardware.


The argument as you've written it here is interesting and makes sense, but I have trouble getting to that conclusion from OP's comment.


Spot on!

What I meant by gatekeeping was closed software (and hardware and firmware too) that limits what the owner can and cannot do.

Use Linux!


Got it. What about the RPi vs M1 numbers? I've not found anything in that regard that puts both chips in the same ballpark; the M1 seems to be getting at least twice (and in general about 10x) the performance for a similar power envelope. Then again, I did have some trouble finding benchmarks that had been run on both; I ended up going with a few Phoronix benchmarks that showed the same tests [0, 1].

[0] RPi: https://www.phoronix.com/scan.php?page=article&item=raspberr...

[1] M1: https://www.phoronix.com/scan.php?page=article&item=apple-ma...


There are no CPU Gflops/W benchmarks online for M1, but my friends who have M1 laptops used this code to test it: https://github.com/brianolson/flops

The M1 is only good at highly specialized tasks that they have custom-designed hard/firm/soft-ware for, all of which is locked both legally and technically into a tomb where it will remain until the end of humanity.

If you spend even one second thinking about them, you are wasting time for eternity! Infinite waste!

RISC-V with an open GPU and Linux is the only saving grace. I doubt we will manage, before it's too late, to reach a price/performance that can compete with the Raspberry Pi 2/4 for server duty and the Jetson Nano for client duty.

Fingers crossed and may you spend your money wisely!


Check, not a fan of M1 being super closed, but "it's the Apple way" (and I don't buy Apple products). RISC-V is very interesting, but so far it seems to have been too niche and not exploited to its full capacity. Here's hoping that changes in the (near?) future!


>This is not merely a populist view by the uninformed.

Ok I will bite.

It is always easy to suggest Moore's Law isn't dead using a logarithmic scale. But look only at recent data. Take TSMC 10nm, the moment at which TSMC took leading-edge status from Intel; you then have 7nm, 5nm (which we are currently on), 3nm which might [0] ship in early 2023, and the expected 2nm in 2025. That is 2017 to 2025. There is nothing 2x / 2 years within this period, even if you only use the best / peak quoted [1] density metrics.

Let me just give the quoted density numbers by node name [2], in million transistors per mm2.

2015 - Intel 14nm - 44.67 [3]

2017 - TSMC 10nm - 52.51

2018 - TSMC 7nm - 91.20

2019 - 178.68 ( Hypothetical of Intel 14nm lineage at 2x / 2 year )

2020 - TSMC 5nm - 171.30

2023 - TSMC 3nm - 292.21 ( EST )

2025 - IBM Research GAA 2nm - 333.33 ( EST ), TSMC GAA 2nm - ~500 ( EST ),

2025 - 1430 ( Hypothetical of Intel 14nm lineage at 2x / 2 year )

Notice where the trend starts to break? 2019 - 2020. [4]

And unless Intel or TSMC could adjust their 2025 - 2030 roadmap to somehow increase transistor density by 2.9x every 2 years, they would not get back onto the same trajectory as the original trend. So it either follows a power law, or the next 5 - 10 years will be a blip / outlier in Moore's law history.
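A quick sketch of that check, using the densities listed above (the 2023 figure is an estimate; the 2-year normalization is mine):

    # transistor density (MTr/mm^2) by year, as listed above; the last entry is an estimate
    density = {2015: 44.67, 2017: 52.51, 2018: 91.20, 2020: 171.30, 2023: 292.21}
    years = sorted(density)
    for a, b in zip(years, years[1:]):
        total = density[b] / density[a]
        per_two_years = total ** (2 / (b - a))   # normalize to a 2-year cadence
        print(f"{a}->{b}: {total:.2f}x total, {per_two_years:.2f}x per 2 years")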

And it is not only Jensen; AMD's CEO Dr. Lisa Su has made similar comments on Moore's Law. And they are not wrong (or uninformed; in fact they are too well informed). In order to achieve a 2x performance increase or a 50% reduction in die size, their cost of die and cost of R&D, purely in terms of design and fabrication, are increasing. Their total unit costs are increasing. GPU vendors are much more sensitive to this, since their performance scales extremely well with transistor count. That is why chiplets and packaging have become important to address these cost issues. (They are not a silver bullet.)

There is also a problem with 3D stacking and layering which I have seen far too many people be completely dismissive of: thermals. You can't have a hundred layers of compute with each layer using 10W, if not more. It wasn't until AMD made it absolutely clear with their V-Cache implementation that you can't put your SRAM layer on top of your compute layer due to heat that people started to realise their dream of a hundred-layer GPU might not actually work. At least not between now and 2030.
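To put rough numbers on the thermal objection (a sketch; the footprint is an assumption, the per-layer power echoes the figure above):

    # illustrative: stacked compute layers multiply power in the same footprint
    layers = 100
    watts_per_layer = 10          # per the figure above
    footprint_cm2 = 6.0           # roughly a large-GPU-die footprint (assumption)
    total_w = layers * watts_per_layer
    print(total_w, total_w / footprint_cm2)   # 1000 W, ~167 W/cm^2 through one face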

Not only has DRAM gotten no price / bit reduction in the past 10 years; NAND may see a similar fate. We are getting faster and lower-power DRAM, but we are certainly not getting cheaper DRAM [5]. And that is ignoring that a majority of DRAM revenue comes from LPDDR and not normal DRAM, which has a higher price per GB. NAND may have one or two generations to go in terms of cost reduction. (Also worth looking at HDD cost / GB, which shows a similar trend.) But those hundred layers of NAND are done by string stacking: stacking up multiples of 60 / 70-layer stacks, which has higher yield. Currently Samsung is the only one doing a 128-layer single stack. Cell sizes hadn't shrunk much either, due to error rates and cost, until they moved to EUV. Moving from TLC to QLC and later PLC has diminishing returns. It may be worth pointing out the obvious: DRAM and NAND are commodities, and follow the rules of any commodity market.

It is not that we are not getting any more cost / performance or IO or storage improvements. It's just that the rate is slowing.

[0] Originally scheduled for the 2022 iPhone launch (as usual), but TSMC announced they had a three-month (one quarter) delay. Assuming yields are good enough and there is no contractual obligation for Apple to be the first to use their 3nm, you might see other vendors launch on 3nm in early 2023.

[1] Peak quoted transistor densities - different fabs may have different counting methodologies; these are estimated logic densities, but they give the best number from a marketing perspective.

[2] You should know these node numbers are marketing numbers. Every time we have a node discussion on HN there are people jumping in over how node numbers are marketing numbers and how these numbers are wrong. And Samsung since 2020 has (again) put its own marketing spin on node numbers post-4nm.

[3] I already factor in Intel's 14nm being a 2.7x density increase over their 22nm, so the dates start in 2015 instead of what should have been 2014. Intel's marketing at one point was eager to push this narrative in 2019 during their 10nm fiasco.

[4] Which happens to be where the graph ends. In the past ~3 years there has been some sort of shift in the definition of Moore's Law. Some people (PR) now use it to mean "transistor improvements". Whether you agree or disagree with that definition, no one is arguing we are not getting any more improvements.

[5] Until China joins the game and starts dumping DRAM and NAND onto the market, which is what they are already doing within China. They just haven't succeeded in catching up to the latest DRAM / NAND quality yet.


They've been saying Moore's law is dead for 20+ years. Reasons usually quoted for Moore's law's death are quite narrowly focused on a particular technology stack, whereas semiconductor fabrication happens on a complex web of technologies all moving forward with their own S-curves.

It is worth watching Jim Keller's lecture at UC Berkeley, literally titled "Moore's law is not dead". He talks on a different level altogether, and is far more convincing than a pedantic look at the details that enable Moore's law.

https://www.youtube.com/watch?v=oIG9ztQw2Gc


> Reasons usually quoted for Moore's law's death are quite narrowly focused on a particular technology stack

To give just one example, here's a paper[0] from 2002 titled "End of Moore's law: thermal (noise) death of integration in micro and nano electronics" which claims "increasing thermal noise voltage (Johnson–Nyquist noise) ... has the potential to break abruptly Moore's law within 6–8 years, or earlier."

[0] https://www.sciencedirect.com/science/article/abs/pii/S03759...


And if you watch his later talks and interviews, his definition of "Moore's law" is basically transistor improvement, not anything scaling at 2x.


Was about to come in and poast Keller, I see it's done, gj & have a nice day.


So Moore stated his law in 1965 as a doubling of components per circuit per dollar every year, revised it to approximately every two years in 1975, and now 'lesswrong.com' seems to think doubling every 2.5 years is on track with that.
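For scale, those doubling periods compound very differently (a quick illustration over a 10-year span):

    # cumulative growth over 10 years at different doubling periods
    years = 10
    for doubling_period_years in (1, 2, 2.5):
        print(doubling_period_years, 2 ** (years / doubling_period_years))
    # 1-year: 1024x, 2-year: 32x, 2.5-year: 16x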

What a waste of my time.


Some facts and opinions (the first is not ad hominem, just letting people know who runs the website):

Eliezer Yudkowsky never graduated high school or attended college. While there are genuine, well-educated autodidacts, there are also people in the world who have perfected the art of sounding like they know what they are talking about to the right crowd while lacking any real depth of understanding in the subjects on which they speak. Having read Yudkowsky, and read about him, I believe he is more the latter than the former. So now I take lesswrong articles with a gigantic grain of salt. They may be entertaining and thought provoking, but they can hardly be considered to contain expert or learned opinions, or projections that correspond to the real world as it actually is, as opposed to some puerile scifi fantasy or pseudo-religion dressed up with fancy language.

My personal opinion is that there will be no "singularity" and "AI" is massively over-hyped and not intelligence. Intelligence is a part-adaptive, part-emergent phenomenon of sentient biological life, and computers are certainly neither sentient nor biological yet. The adaptive part is more important than the emergent part. General intelligence does not just emerge accidentally. Rather, it needs to be molded by the constraints of Darwinian goal seeking in diverse, hostile environments.

Second, Moore's law was never a law. Rather, it was the extrapolation of an observation. It held for some time, but as the discussion of multi-core chips below intimates, it is not really holding any more. Adding more cores or more speed does not make a computer more intelligent, because a computer lacks the motivation to reproduce that biological creatures have, which is the root of intelligence as an adaptive phenomenon.

Third, every growing thing, including technological progress, is ultimately bound by the constraints of the relatively simple laws of physics: limits to energy, space, time, etc. Thus what appear to be exponential growth functions turn, in the long run, into logistic functions. There is no evidence that advancing technology can indefinitely increase the returns to engineering applications of the laws of physics.
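A sketch of that last point, with illustrative parameters: a logistic curve tracks an exponential early on, then flattens toward its ceiling.

    import math

    # logistic growth tracks an exponential early on, then saturates (illustrative parameters)
    K, r, t0 = 1000.0, 1.0, 10.0          # ceiling, growth rate, inflection time
    A = K * math.exp(-r * t0)             # prefactor matching the exponential at early times
    for t in (0, 4, 8, 10, 12, 16):
        exponential = A * math.exp(r * t)
        logistic = K / (1 + math.exp(-r * (t - t0)))
        print(t, round(exponential, 2), round(logistic, 2))
    # the two agree while t << t0, then the logistic flattens toward K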



