Intel delays chip-making changes (bbc.com)
145 points by willvarfar on July 16, 2015 | 175 comments


Bunnie Huang on Moore's Law, http://www.wired.co.uk/magazine/archive/2015/08/features/moo...

“.. WIRED buys the iPhone 6 schematic book (above) for 25 RMB (£2.70) .. It's like getting a master class on circuit design .. for £2.70, engineers in China can get a leg up on the best and brightest university-educated kids by studying these designs ..

.. there are established, legitimate businesses that earn their keep creating schematics from circuit boards. As the pace of Moore's law diminishes, learning through reverse engineering will become increasingly effective, as me-too products will have a larger market window to amortise reverse-engineering efforts before the next new thing comes along.

.. Even a modest deceleration of Moore's law can have a dramatic effect: a five per cent reduction in the pace of gate-length shrinkage -- from 16 percent to 11 per cent per year -- increases the available time to develop products within a technology generation by 50 per cent, from two years up to three.

.. Instead of running in fear of ­obsolescence, open-source hardware developers now have time to build communities around platforms; we can learn from each other, share blueprints and iterate prototypes before committing to a final design.


.. Even a modest deceleration of Moore's law can have a dramatic effect: a five per cent reduction in the pace of gate-length shrinkage -- from 16 percent to 11 per cent per year -- increases the available time to develop products within a technology generation by 50 per cent, from two years up to three.

Great observation, except that a reduction from 16% to 11% a year is closer to a 30% reduction in pace. That only makes the point stronger, though: if we are indeed on a three-year cadence instead of two, that is a significant change in the rate of Moore's law.
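As a back-of-the-envelope check of both numbers (a rough sketch; the 0.7x linear shrink per node is the usual rule of thumb, not a figure from the article):

    #include <math.h>
    #include <stdio.h>

    int main(void) {
        /* Years to reach a 0.7x linear shrink (one "node") at a given
           annual gate-length reduction rate. */
        double node = 0.7;
        double fast = 0.16, slow = 0.11;              /* 16%/yr vs 11%/yr */
        printf("years per node: %.1f -> %.1f\n",
               log(node) / log(1.0 - fast),           /* ~2.0 years */
               log(node) / log(1.0 - slow));          /* ~3.1 years */
        printf("reduction in pace: %.0f%%\n",
               100.0 * (fast - slow) / fast);         /* ~31%       */
        return 0;
    }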


Does that not also imply that this "freedom" to clone would start coming under increasing pressure from IP owners?

Till now, chip companies were too busy upgrading and innovating their products - as long as that was profitable (because clones were always a few steps behind), they weren't too intent on IP action. But with the clones catching up, maybe we will see increasing RIAA-style actions? No more easy imports of clones from aliexpress, etc?


We already went through that scare and got the Semiconductor Chip Protection Act of 1984 to supplement what was already protected by patents.


Which is another way of saying, even more value will shift to the software side and away from hardware.


Where did the designers who designed the iPhone 6 learn?


House for the Feeble Minded.


This is a reference to the classic Asimov short story Profession (http://www.inf.ufpr.br/renato/profession.html) and should not have been downvoted.


Thank you for that story, akin to China Mieville's concept of "breach" in The City and The City.


Monty Python, "Four Yorkshiremen", https://youtube.com/watch?v=Xe1a1wHxTyo


If you owned a printer and reduced the size of your printing by half (in each direction), you would quadruple your efficiency - it is an area equation, not a linear one. I'm surprised that slipped past Wired...


Well, they did mention that elsewhere:

> Shrinking gate lengths have meant 30 per cent more transistors per year for the same-sized fleck of silicon (transistors are laid out in a two-dimensional array, so gate-length scaling improves density in two ­dimensions).

I'm guessing they deliberately went with a not 100% accurate statement just to get their point across more elegantly - especially since the statement is still correct if you take "font size" to mean "font area" (even though that's not the typical understanding of font size).


Putting huge numbers of ordinary CPUs on a chip only helps until the memory bus runs out of bandwidth. For GPU-type devices, the computation/memory request ratio is higher, so GPUs can have many more compute units. GPUs are now used for lots of other parallel computations, and there's room for expansion in pure-computation GPU-like devices that don't drive a display.

Historically, unusual massively parallel architectures have been commercial failures. "Build it and they will come" doesn't work there. There's a long history of weird supercomputer designs, starting with the ILLIAC IV and continuing through the Ncube. The only mass-market product in that space was the PS3's Cell, which was too hard to program and didn't have enough memory per CPU. (The Cell had only 256KB (not MB) per Cell CPU.)

The next big thing may be parts optimized for machine learning. That's massively parallel. We may, at last, see "AI chips".


Die size isn't really about packing more cores into a CPU; it's about packing more CPUs onto a wafer. The material cost is largely the same. Smaller die sizes are about your quality control with masks and deposits.

Given a 300mm diameter wafer and a 10mm square CPU die at 14nm you get about 700 CPUs per wafer (Pi * diameter * diameter) / (4 * die * die) (https://en.wikipedia.org/wiki/Wafer_(electronics)#Analytical...). If you change that 10mm CPU to 10nm, you'll have about a 7.2mm square CPU die. That's nearly 1400 CPUs per wafer. Even if you get 100% of CPUs from the 14nm die, once your yield hits 50% on 10nm the 10nm process produces more for the same cost. Now you can either reap the profits, or reduce costs.
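A minimal sketch of that gross-die estimate, ignoring edge losses and scribe lines (the linked Wikipedia formulas handle those more carefully):

    #include <stdio.h>

    /* Gross dies per wafer: wafer area / die area, edge loss ignored. */
    static double dies_per_wafer(double wafer_d_mm, double die_mm) {
        const double pi = 3.141592653589793;
        return (pi * wafer_d_mm * wafer_d_mm) / (4.0 * die_mm * die_mm);
    }

    int main(void) {
        double die14 = 10.0;                  /* 10 mm square die at 14nm      */
        double die10 = 10.0 * 10.0 / 14.0;    /* ~7.1 mm after a linear shrink */
        printf("14nm: ~%.0f dies, 10nm: ~%.0f dies per 300mm wafer\n",
               dies_per_wafer(300.0, die14),  /* ~707  */
               dies_per_wafer(300.0, die10)); /* ~1386 */
        return 0;
    }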


I'm no expert, but isn't there another technique in play as well?

As you get smaller you can duplicate CPU components to make your chip fabrication more robust against errors. If a component is faulty on the die, the CPU can be patched to use the other (identical) component.


The Cell did that. They really had 9 Cell processors on the chip, but only 8 of them were enabled on any given PS3.

Patching at a lower level has been tried, but it's usually more trouble in manufacturing than it's worth. There's a long history of workarounds for low yield, but the fab industry has usually been able to fix the fab problems and get the yield up.

Except for memory devices, where patching out bad columns is standard.


You're off by one there: there were 8 SPUs on a Cell chip, 1 was disabled to account for manufacturing defects, and another was reserved for the exclusive use of the OS, leaving the developer with 6 SPUs to use.


I think he's including the PPE in his count of 9 processors.


Unfortunately, you end up having to make the interconnects longer to accommodate the additional redundant components, which ends up slowing everything down.

It's a tradeoff that's not always worth it.


This is done in quite a few places including GPUs.


That only works for things like on-board SRAM


I mainly agree with that sentiment. What do you think about the TOMI tech, though?

http://www.venraytechnology.com/Making%20a%20Frosted%20Cake%...

I think it was a brilliant reframing of the problem: move CPU's to memory with its cost-benefit analysis rather than vice versa (status quo). The design also achieves what some exascale proposals are trying to achieve with R&D in terms of better integrating CPU & memory with lower energy. It's also massively parallel (128 cores) and optimized for big data. Close to your next big thing.

Its main risk right now is that DRAM vendors are more conservative and mass-market than most fabs. There's not all this MOSIS, multi-project runs, and so on. Their low-volume cost is currently high (tens of thousands). They might be facing a chicken-and-egg problem in terms of hitting enough volume to get a nice production deal. I do like their tech and think it has far more potential than what they're doing right now.


>> Its main risk right now is that DRAM vendors are more conservative and mass-market than most fabs.

Recently Micron released a memory-based processor/state-machine architecture called "Automata". This might be a good sign that the problem you mention will be solved.


That was a really neat processor. That's what happens when hardware vendors look at FSM problems their own way instead of through the software developer's way (e.g. the C language). The best bet might be for those using memory fab tech to haggle with one of the fabs to do MPWs on, say, one production line. The fab as a whole can keep cranking out tons of memory chips, new players can crank out theirs, and any risk is very limited.

The problem has already been solved outside memory fabs several times over. The memory fabs just need to take some steps themselves. If I were them, I'd push I.P. vendors to follow the path of Micron and Venray just to get more fab customers.


Except you still need a fast bus for the CPUs to talk to each other and to access shared memory. So for all but the most embarrassingly parallel workloads, you just move the bottleneck from the memory bus to the shared cache bus, do you not?


A memory bus has long delays to set up a transfer, is typically only 64 bits wide, and only achieves good bandwidth on large burst operations.

The Venray design allows single-cycle random access to full 4096 bit cache lines, at least as described in the earlier iterations. Contention is far less an issue in this model, for many cores on 1 large memory chip. Multi-chip sticks are then akin to multi-socket motherboards.


Good answer.


Intel CPUs resemble GPUs more and more over time. I think the only things missing in Skylake (Xeon) are scatter, a GPU-style ultra-slow (high-latency) but wide memory interface, and texture lookup.

Gather was already added in Haswell, although it performs badly so far.

Skylake (Xeon AVX-512) handles 16 float wide vectors (512 bits) and can dual issue per clock, bringing effective rate to 32. That's definitely comparable to modern GPUs.

Wasn't Nvidia WARP just 16 float wide per clock cycle? Or 32? For comparison, high end Nvidia 980 GTX GPU has only 16 of such SIMD execution cores. However, they count those 16 cores as 2048 in their marketing literature.

I do wonder if Intel is planning to unify CPU and GPU in 10 years or less. Things sure seem to be moving that way.

If Intel can add significant amounts of eDRAM in package, x86 CPUs aren't that far from being capable of handling GPU duties as well.


Vector Instructions != Scalar Instructions

"WARP Scheduler" gives you a hint.

Okay, so how this works is you have a processor that is 16 scalar cores wide. Each scalar core is really just an out-of-order scheduler for 32 in-order, pipelined, boring ALUs. These ALUs can each execute the same instruction together, giving you the illusion that the scalar core is doing vector processing.

The reality is far weirder. E.g.: if you encounter a branch, the scalar processor can, and will, execute both branches on different ALUs, and execute the branch statement on another, allowing a 10-instruction section of code to run in ~3 instructions' time. Try doing that with a vector processor.

Technically in CUDA you can schedule each ALU itself, hence the marketing numbers.

Would you like to know more? http://haifux.org/lectures/267/Introduction-to-GPUs.pdf


> The reality is far weirder. E.g.: if you encounter a branch, the scalar processor can, and will, execute both branches on different ALUs

That's not so different from x86 SSE/AVX. You'd execute both sides of the branch (dual issue) and blend / mask away the results you don't want. This is typically much faster than having a data-dependent, unpredictable branch.

Another way is to SIMD sort data according to criteria to different registers and process them separately. This completely sidesteps having to execute both sides of the branch, although some computational resources are still wasted.
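For the curious, here is roughly what the first approach (compute both sides, then blend/mask) looks like with plain AVX intrinsics -- a hedged sketch with made-up arithmetic; real code might instead use AVX-512 mask registers:

    #include <immintrin.h>

    /* out[i] = (x[i] > 0) ? x[i] * 2 : x[i] + 1, eight floats per iteration.
       Both "branches" are computed, then a comparison mask selects per lane.
       Tail elements (n % 8) are left to a scalar loop, omitted here. */
    void both_sides_blend(const float *x, float *out, int n) {
        for (int i = 0; i + 8 <= n; i += 8) {
            __m256 v    = _mm256_loadu_ps(x + i);
            __m256 then = _mm256_mul_ps(v, _mm256_set1_ps(2.0f));  /* if-side   */
            __m256 els  = _mm256_add_ps(v, _mm256_set1_ps(1.0f));  /* else-side */
            __m256 mask = _mm256_cmp_ps(v, _mm256_setzero_ps(), _CMP_GT_OQ);
            _mm256_storeu_ps(out + i, _mm256_blendv_ps(els, then, mask));
        }
    }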


>That's not so different from x86 SSE/AVX. You'd execute both sides of the branch (dual issue) and blend / mask away the results you don't want.

What you're talking about is how x86_64 processors can optimize away some branches, which they do with the cmov instruction. This has nothing to do with SSE/AVX. It's common to confuse the two because Intel says the branches are executed in parallel (and they often are), just as much in parallel as the OoO pipeline allows, which is actually quite a few.

Both sides of the branch are pre-computed, then the branch condition is computed. But its output is sent to a cmov, which just re-assigns a register, instead of a jmp into a branch. This avoids pipeline flushes. cmov isn't perfect, still costing ~10 cycles, but compared to the ~100 of a pipeline flush it's still cheaper.

Provided the same operations are being done on both branches, then SSE/AVX can be used, as both branches are just values, and that is literally what vector processors are good at. The chain will end with a cmov.
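In scalar code, the pattern being described is just a branch-free select that the compiler may lower to cmov -- a minimal sketch; whether cmov is actually emitted depends on the compiler and flags:

    /* Both arms are cheap and side-effect free, so the compiler is free to
       compute both and select the result with cmov instead of branching. */
    int select_result(int a, int b) {
        int if_side   = a * 3 + 1;
        int else_side = b * 5 - 2;
        return (a > b) ? if_side : else_side;   /* candidate for cmovg */
    }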


It has absolutely nothing to do with CMOV. I'm talking about computing, say, 16 results in parallel in a SIMD register, for both sides of "if"-statement. Then masking unwanted results out. SSE/AVX can simulate "CMOV", but for 128/256/512 bit wide vectors.

To make it even more clear, there's not a single CMOV in my code, anywhere. The data doesn't usually even touch general purpose (scalar) registers, because that'd totally destroy the performance.

What you are talking about is how things were done until 1997-1999 or so. SSE in 1999 and especially SSE2 in 2001 radically changed the way you compute with x86 CPUs.

I'm talking about things like vpcmpw [1] (compare 8/16/32 of 16 bit integers and store mask), vpcompressd (compress floats according to a mask, for example for SIMD "sorting" if and else inputs separately), vpblendmd (blending type combining, this example is for int32), vmovdqu16 (for just selectively moving according to mask).

You can do most operations on 8, 16, 32, 64 unsigned and signed, and of course 32-bit and 64-bit floats. Some restrictions apply especially to 8 and 16 bit operands. When appropriate, it's kind of cool to process 64 bytes in one instruction. :)

[1]: https://software.intel.com/sites/landingpage/IntrinsicsGuide... SSE/AVX instruction and intrinsics guide.


GPUs have evolved with about the same pacing. Nvidia's Kepler architecture has a vector length of 192 (single prec.) per core and up to 15 of these cores on one chip.

The question really is, do you optimize the chip for heavily data parallel problems, saving overhead on schedulers and having a very wide memory bus, or do you optimize for single threaded performance of independent threads and give it some data parallelism (Xeon). As a programmer, when you're actually dealing with data parallel programs, doing so efficiently on a GPU is actually quite a bit easier since you have one less level of parallelism to deal with.


Um, no: 192 = 6 * 32. Each streaming multiprocessor operates on warps of size 32; the 6 is the number of different functional units.


I think we're mixing up terminologies here. One SMX operates on up to 192 values in parallel (Nvidia calls this 192 "threads" per SMX). "Functional units" AFAIK is only used in terms of "special functional units", which isn't relevant for this discussion. One SMX has 6 Warp schedulers, but I'm not sure how independently these can operate. My guess is that branch divergence will only NOP out one whole Warp, but I'm not sure whether the Warps can enter different routines or even kernels (my guess is yes for routines/no for kernels).


So the different functional units (this has a specific meaning in hardware design) are 32 wide, and indeed, if the instructions to be executed can utilize all 6 of them at the same time, the SMX will operate on 192 values. But that won't be the case if you only need to execute a large number of double-precision floating-point operations.


What defines "unusual"? The CPU/GPU split is a distinction without a difference there. NVIDIA and ATI have both been selling massively parallel architectures in their GPUs for most of a decade now, and NVIDIA has great traction in the supercomputing and machine learning space due to its HPC business development, excellent tooling and developer support. Intel is trying to do the same with Xeon Phi, and they're certainly throwing their weight behind it.

Both Intel and NVIDIA are addressing the memory bus bottleneck with chip packages that stack memory chips and shorten/widen the bus. The two are converging to similar designs and I foresee a big fight as they go head to head (remarkable, given NVIDIA's size, but not unprecedented, given how badly the ARM crowd has smoked Intel in mobile).


The issue is not so much memory BANDWIDTH as it is memory latency on heavily branching code.


With CUDA you get 32 banks, all of which can be accessed in parallel.


> When first formulated by Intel co-founder Gordon Moore 50 years ago, this suggested that chip power could double every 12 months.

Actually, when first formulated by Moore, it was that transistor density doubled every 12 months (he later revised it to 2 years.)

Per Wikipedia, it was David House who applied it to performance (and, when he did, it was with an 18 month timeframe.)


>Actually, when first formulated by Moore, it was that transistor density doubled every 12 months

No, it was the number of integrated components on a chip, not transistor density.

Anyway, I think the article's statement is fine. Doubling the number of components would suggest that performance doubles too.


Does a car with 12 cylinders drive twice as fast as one with 6? Or accelerate?

Performance is too complicated to just say "CPU X doubles perf".. because likely some perf cases stayed the same, some got better, but some got worse (think deep pipelines).


I hope Intel uses this pause to implement things like DJB's suggestions for the Intel instruction set:

http://blog.cr.yp.to/20140517-insns.html


They already have. The upcoming AVX-512 extensions [1] introduce the VPMADD52LUQ and VPMADD52HUQ instructions, which add the high or low 52 bits of a 52x52-bit multiplication to a 64-bit integer. Presumably this is done via the preexisting double precision floating-point multipliers.

[1] https://software.intel.com/sites/default/files/managed/0d/53...
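For reference, a minimal sketch of how that might be used from C, assuming the AVX-512IFMA intrinsic names _mm512_madd52lo_epu64 / _mm512_madd52hi_epu64 from the Intel intrinsics guide (untested; requires an IFMA-capable CPU and the appropriate compiler flags):

    #include <immintrin.h>

    /* Accumulate the low and high 52-bit halves of a*b into separate
       accumulators, one 64-bit lane at a time, where a and b hold 52-bit
       limbs of a big integer. (Assumed intrinsic names; AVX-512IFMA only.) */
    void madd52_step(__m512i a, __m512i b, __m512i *acc_lo, __m512i *acc_hi) {
        *acc_lo = _mm512_madd52lo_epu64(*acc_lo, a, b);
        *acc_hi = _mm512_madd52hi_epu64(*acc_hi, a, b);
    }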


IIRC, that's coming to Xeon Skylake processors, not mainstream ones.


Of course one can get a new 2-socket Xeon server* for less than the price of a MBP, so the mainstream distinction has shifted considerably over the last decade.

* (E5-2620V3 6C 2.4Ghz, 24GB RAM, 800GB SSD)


Where are you getting a dual E5 V3 server for about 2000 dollars or less?


That does sound low, but isn't out of the question. From Newegg:

E5-2620V3 quantity 2: 428.49

SUPERMICRO MBD-X10DRL-I 319.99

Kingston KVR21R15S4/8 x 3 90.99 (although 3 sticks doesn't make any sense)

$1450 total.

Or 6 sticks of KVR21R15S8/4 direct from Kingston at $62 each is $372, for 1549, which still leaves some room for a SSD and chassis.

ADDED: cheapest Supermicro 1U for this would seem to be the SC514-505, which for SSDs would seem to be quite fine if you can find it, found one price that looks correct based on the next higher grade with fancy hot swap disk trays of $233. So for the 6x4GB memory config, $1782, leaving $218 for the SSD. And you'd have to buy a couple of fanless heat sinks.

Hmmm, Intel has an enterprise Intel DC S3510 at 599.99. So we're close if you buy in any quantity, then again 24 GB is unbalanced for a 2 socket server. But this looks in the ballpark, these systems are getting to rather nice prices.


> although 3 sticks doesn't make any sense

Yeah, indeed it doesn't make much sense when you have 4 memory channels. Memory performance is going to be very bad. Four sticks would be just fine.

> Or 6 sticks

Neither does 6. Go for 4 or 8 sticks. Not sure if it applies here, but on some motherboards having 2x more than <number of memory channels> sticks means there's a bit more latency.


I configured the Colfax CX1205a-X6 to confirm -- the spec is suboptimal but compares favorably to the most expensive MBP (which I see as just over $3K).


When a company delays a product launch, it's usually because they have too much work to do, not not enough.


Intel didn't delay a product launch, they delayed a PROCESS launch, which is a much bigger deal given their previous cadence towards the bottom. Physics is kicking the silicon industry's ass.

Meanwhile, the guys doing HDL and verification at Intel don't even notice the process delay, they just go about what they're doing, leaving them plenty of time to implement whatever (namely GPU improvements, which is what most of the industry has been requiring for newer Windows/Mac OS Xes).


No, it means that some departments have too much work. Other teams and task forces can use this time to experiment with new stuff.


> When first formulated by Intel co-founder Gordon Moore 50 years ago, this suggested that chip power could double every 12 months.

What? I don't think it was ever 12 months, but 18-24 months.

> Now, improvements are expected approximately every 24 months.

No, they said it will take 2.5 years now instead of two.

http://seekingalpha.com/article/3329035-intels-intc-ceo-bria...

I wouldn't expect the 10nm chips at least until mid-2017, which should put Intel neck and neck with TSMC's and Samsung's 10nm processes, which at most might be launched a few months later.

It also looks like IBM and its partners (Samsung/Global Foundries) will actually beat Intel to 7nm (probably early/mid 2019 for Samsung/GF - late 2019/early 2020 for Intel).

While IBM is talking about EUV lithography, Intel still seems to be complaining about multi-pattern lithography, so it seems pretty clear that Intel is behind IBM for 7nm. They also haven't said anything yet about what materials they'll use for 7nm transistors, just that they will require something different than Silicon.

The best part about this is that now Intel can't hide the x86 baggage in mobile behind its one generation ahead process node, which made Atom more or less competitive (at least in performance and power consumption, but not in price).

With ARM chips on very similar process nodes and arriving at the same time on the market, there should be no contest for the ARM chips in mobile.


Not clear what IBM doing 7nm means, really. They no longer own a foundry, and don't have any chips in normal products that compete with Intel. IBM also demonstrated some really high clock speeds previously, but the Power8 is still a 4GHz part and is built on 22nm.

Now, if Global Foundries actually starts making full 14nm chips this year (which it doesn't sound like it will; it seems to be a 20nm back end), that might make things pretty interesting.

I previously worked for a company that used both TSMC and UMC, the jump to the next process level is always extremely hard, and the biggest deal is yield. Until the yields are high enough to actually get to a profit a lot of companies with specialized chips will stick to the older process (set-top box chips for instance) where it's good enough.

FD: I'm at Intel (not hardware), cannot comment on Intel itself.


Why does IBM need to own a foundry to design chips? Making chips is a commodity business. The skill is in the design and patenting.


> Making chips is a commodity business. The skill is in the design and patenting

There's no point in designing (& patenting) a fancy chip that can't be manufactured in yields high enough to turn a profit, is there? An approximate analogy would be wireframing a web application prototype (design) vs. engineering the site so that it scales to millions of users (making chips).

More on-topic, IBM probably announced a 7nm chip that can't be manufactured with acceptable yields yet. Hypothetically, Intel could design a 5nm chip with < 0.1% yields: but what would be the point?


Of course there is. Even if the chip is never released to the public, the IP and processes used to produce a 7nm chip have great patent value.


Without owning the foundry, it's not clear if this is an IBM tech demo that goes nowhere, or if it affects GF's roadmap. GF uses a Samsung process for 14nm (really a 14/20 hybrid), so it really looks like IBM is out of the foundry business as far as taking a process all the way to manufacturing.

GF might leapfrog down to 7nm, but if they aren't even fully 14nm at this point, does that really seem plausible?


Gordon Moore first proposed 12 months in his article, then dropped back to 24 months a few years later. That number has held since the 80's, until now.


Also it’s not ‘chip power’ but ‘number of transistors’. Common misunderstanding.


Yeah, even without the step to 10nm, the third generation 14nm chips will probably be a bit faster than the second generation. Density can't be improved but there is surely efficiency to be realized.


It doesn't really bother me as we've seen it coming for years. Anyone playing it smart is relying on better-architected systems rather than a process node jump. I think the fact that they bought Altera and might do on-board FPGA logic more than balances out here. I'll take an FPGA accelerator on their on-chip NOC over an incremental increase in performance/energy any day. Even big companies such as Microsoft are wising up to the fact that a proper split between CPU and FPGA logic has significant advantages.

I'm curious to see what AMD will do in response to both the delays and Altera acquisition. Wait, a quick Google shows AMD to be so bad off that a FPGA company (Xilinx) might buy them. Lol. Ok, well the market is about to get interesting again one way or the other.


I don't even want to think about the antivirus implications of running FPGA code.


It could get interesting. Of course, for secure CPU developers, FPGA's have been the solution for stopping malware rather than the problem. Too bad we only have prototypes to work with so far. Remember, though, that there's always anti-fuse (write-once) FPGA's available for a design you don't plan to change.


Oh I forgot to give you this to help you sleep better at night:

http://www.cis.upenn.edu/~jms/papers/fpgavirus.pdf


Eh, as long as it's on the other side of a hardware based IO-MMU, I'm not super concerned.


Those are so overhyped as to be buzzwords. All the IO-MMU really does is make the data come to and go from a certain point of memory. What it does when it hits the system software or applications is a whole different risk area. See OS process separation vs. all the ways it's bypassed with app- and kernel-level vulnerabilities.

So, you consider that the FPGA might sabotage data you send through it or what comes from it. Be monitoring or doing validation on both.


FPGAs eat a ton of power. You won't see them in anything consumer. They are already used in server/pro applications.


There are plenty of low-watt FPGAs, even milliwatt ones. The Cyclones, etc. get 4,000+ MIPS in under 2 watts. Microsemi already does FPGAs that run at milliwatts with enough slices to support 5-6 accelerators concurrently. As the example below shows, you can have two Stratixes, a CPU, and a whole board's worth of stuff using only 45 watts. Merely adding a few layers to an SOC at Intel's new nodes should use way less power than all that. I'm sure Altera's, with Intel's help, will be pretty sweet given they're mainly targeting datacenter use where performance-per-watt is the key criterion.

Still, worst case, I'll gladly add 45 watts to a gaming/workstation rig... which probably already has a several-hundred-watt power supply... in exchange for the Stratix IV's 500,000+ LEs, 20Mbit SRAM, 20+ 8.4Gbps transceivers, and a more direct CPU-to-FPGA link. You have no idea what I can do with that amount of logic and interface speed. A whole OpenSPARC T1 setup only takes 173,000 LEs. Most extensions to make highly secure processors take up a small fraction of that.

So many possibilities and so little time/money. :)

http://www.theplatform.net/2015/05/28/the-other-cray-launche...


I wonder how many people forget (or never learned) that unlike the X86, MIPS, ALPHA, and ARM architectures, SPARC is an open standard and anyone can make a 100% ISA compatible SPARC CPU...

SPARC deserves much more love from the open hardware community than I see it receiving.


Upvote for you, seeing as you're educated on the subject! :) I've been telling these secure-CPU teams to use it for a while. A few have but most didn't. SAFE went with Alpha, which is discontinued but still I.P. (rolls eyes). Shit, the Oracle SPARC T1 & T2 processors were the first real CPUs to go open source! Here was my recommendation for the open hardware crowd in another HN thread:

"The RISC-V activity is very interesting. I particularly love that they did a 1.4GHz core on 48nm SOI and are working on 28nm level. This knocks out some of the early, hard work of getting competitive hardware at an advanced process node. I'd like to see two things in this work: microprogrammed variant with an assembly or HLL to microcode compiler; tagged variant like SAFE/CHERI processors with IO/MMU that seemlessly adds & removes tags during DMA. That would be way better for security-critical applications than most of what's out there. Multi-core would help, too.

Meanwhile, Gaisler's SPARC cores are commercial, open-source, customizable, support up to 4 cores in Leon4, integrated with most necessary I.P., and can leverage the SPARC ecosystem. Anyone trying to do an open processor can get quite the head-start with that. A few academic and commercial works are already using it. Plus, the SPARC architecture is open as such that you only pay around $100 for the right to use its name.

So, Gaisler's SPARC cores with eASIC's Nextreme seems to be the best way to rapidly get something going. The long-term bet is RISC-V and they could do well copying Gaisler's easy customization strategy. Might be doing that already with their CPU generator, etc: I just read the Rocket paper so far. The solutions built with one can include a way to transition to the other over time."


You are way, way off on your calculations.

The Cyclone IV "MIPS" you are talking about comes from the built-in ARM cores so you're comparing ASIC to ASIC here.

An FPGA vs ASIC, FPGA will lose every time. It is a simple matter of architecture. You have more crap to make the FPGA an FPGA, whereas in ASIC you have optimized gates. So it doesn't make sense for Intel to ship FPGA + CPU for consumer/mobile. Period. You burn more power and no one needs it. FPGAs are also more expensive because they take a ton of silicon space for a comparable function.

FPGA SoCs like Xilinx Zynq or Altera's offerings are widely used but for specific applications, none of it really consumer/mobile/desktop.

Stratix IV FPGAs that you are talking about are many thousands of dollars each. You can get them today and hang them off your PCIe bus and do whatever you want with them. There is nothing stopping you. Putting one on the same die as a CPU won't buy you anything you don't get today -- except a lot more heat you have to get rid of.

I don't think electrical engineering works the way you think it does.

For reference: I've designed ASICs and have done a lot of FPGA work in the past 10 years.


What are the advantages of an FPGA in say, my PC? Because it sounds sexy as hell.


In lay terms, an FPGA is a chip that rewrites itself to be customized for a given use (or uses). It can rewrite itself as many times as needed, and very fast, too. Flexibility means a higher price, more watts, and less performance than a truly custom chip. Yet having custom circuits for the job can make an application SCREAM with performance. Here are some concepts for you.

1. Compression, encryption, etc that's many times faster (50x on some algorithms).

2. Do high-end streaming workloads (eg HD video, NIDS) on embedded hardware with hardly any watts.

3. Put specialized audio, AI, whatever engines in the FPGA for video games that take it to the next level with a whole CPU left for main game logic.

4. Implement a concurrent, hardware, garbage collector to write your whole OS in memory-safe language and not have freezes due to GC.

5. Use onboard I/O, often many lanes at Gbps, to get crazy throughput on any number of disk, networking, wireless, etc use-cases.

6. Use custom I/O for real-time applications.

7. Simulate other forms of hardware on the FPGA for personal learning or product development. Can deploy it in production on FPGA boards later.

8. Forget software emulators: build the hardware itself on the FPGA and have accurate simulation.

9. Cutting edge techniques do something similar with mockups of normal hardware which are modified to spot the exact time and place certain bugs happen. Then you can see everything from the input that caused it to the internal state of the processor. And fix it.

10. My use-case: processors modified to prevent code injection or data leaks to run applications that hackers can't hit. Altera's I'd use for prototyping then put them on anti-fuse FPGA's: write-once FPGA's that blow circuits intentionally to prevent attackers or glitches from modifying the system's logic. ROP that, bitches!

In short, you can use FPGA's for anything you can use a custom circuit for. They'll just be a bit weaker and usually depend on a host system to set them up. Common case is to have main CPU do most of the work with FPGA's accelerating it or handling interfaces (I/O) in a way they're better at. I'd say Google on FPGA hobby projects, use cases, "applications," etc to see the $3+ billion worth of uses for them. In case you're not drooling, here's a startup that's currently the top performer:

http://www.achronix.com/products/speedster22ihd.html

That joker has over a Gigahertz speed, 6 DDR3 lanes, up to 64 lanes of 12.75Gbps serial I/O, up to 16 of 28Gbps serial I/O, up to 400Gbps Ethernet, 400Gbps Interlaken (datacenter thing), and 2 PCI/Express controllers. Add 1.7 million LUTS for custom logic w/ 138 Mbits of registers or cache... noting that one of those mighty Oracle SPARC T1 processors only needs about 400,000 LUTS... to get a beast of a machine. Those are $10,000 unsurprisingly. Yet, even units that are several hundred dollars can do a hell of a lot and Altera on Intel's process node will do more.

I agree with another commenter that we'll see them in servers and datacenters first. The reason is that they'll have to charge more to recoup the initial cost of making them. Chips aint cheap: probably $5-10+ million per silicon test on Intel's node with mistakes requiring you to spend again. Tools to reduce that start at $1+ million a seat. Good thing FPGA's don't cost all that and have free/cheap synthesis tools. ;)


Sold. So what are the chances that we get PC architecture like this someday? Do you think it may sneak in the backdoor through graphics card manufacturers? Games seem like a huge application.


Previously, they went through the PCI bus (FPGA cards), custom memory bus (SGI's Altix), or PC memory bus (Pico Mini). The statements I've been reading on Intel's acquisition of Altera indicate they might integrate them at the SOC level. That knocks out most of what latency exists over the memory buses. The resulting performance for apps split between the CPU and FPGA should be much higher if they do this.

" through graphics card manufacturers? "

They actually compete with graphics card manufacturers with different tradeoffs. Most likely, your system will have a graphics card and FPGA logic.

"So what are the chances that we get PC architecture like this someday?"

http://picocomputing.com/products/picocube/picomini/

I have no idea how much it costs. You could probably buy a powerful server cheaper given it has four, good FPGA's (unit prices always high). Yet, that's a Core i7, up to 32GB of RAM, and 6 FPGA's worth of custom logic connected to both. Mainly aimed at FPGA developers and the niche that use them for acceleration. I'm sure hobbyists with cash might enjoy it, too. :)

I'll end with an illustration. A company once made a dedicated physics chip (Physx) to dramatically improve game physics while CPU did other things. NVIDIA acquired it & added it to GPU. Latest demo (below) on ever-difficult water rendering shows what a custom chip can do for an element of gaming. Now, just list off in your head all the other things that make a game work and imagine what it might be like if they had custom circuits too. And each game had its own custom circuits. Probably like PS3 vs PSX in difference. Not sure if it will happen, but we can keep dreaming, right?

https://www.youtube.com/watch?v=JcgkAMr9r5o


Isn't IBM announcing they've discovered how to build 7nm chips? Things are still progressing


They'll certainly progress. Far as IBM's announcement, they said they could do it in a lab but not mass manufacturing. Intel says they can't mass manufacture things yet. So, the two could be in a similar situation where they have lab-proven concepts but nothing further. Too little data for me to know at this point.


Kind of wondering what this means relative to competitors.

Intel has been at 14nm since 2014, and Samsung recently got to 14nm in the Galaxy S6? And Samsung and TSMC have similar roadmaps for 10nm? So conceivably others might be catching up to Intel in fab? Say it ain't so? (I'm asking because I don't follow these things closely but that was the picture I got.)

If Intel doesn't maintain their big edge in fab I'm not sure they can charge the kind of premium they have been. 64-bit ARM is going to start looking mighty attractive for a lot of use cases.


Samsung's 14nm isn't as dense as Intel's because it's sort of a hybrid of a 20nm process and a 14nm one. TSMC's 16nm is apparently just their 20nm process but with FinFETs so will be the least dense of the three. I suspect their 10nm will also end up less dense than what Intel puts out but we'll have to wait and see on that.

As far as timetables for production, recent rumors put TSMC's 10nm plans at 2017 as well, delayed from their original Q4 2016 goal. I haven't seen anything about delays on Samsung's side, so they might actually get there first.


I think the current techniques are becoming too complicated and Intel is now fully going to focus on EUV. The new EUV machines are reaching 70% availability on average, according to the producer ASML[1].

[1] http://www.asml.com/asml/show.do?lang=JA&ctx=5869&rid=52080


If Moores law ends with 14nm, I wonder what we'll be doing 10 years from now.


Using languages with low(er)-complexity parallelism, optimising existing code as much as possible, and (possibly) writing code for FPGAs (modern processors are pretty good at doing lots of things reasonably well, but workload-tuned FPGAs kill them, and 10 years of FPGA research plus a boatload of cash could be interesting).

Also if we are bottoming out on Silicon then investment into other technologies should pick up, Graphene and the like.

It should be exciting, Intel has had a lock on the desktop processor market for over 30 years (by volume if not technical excellence at some points).


There is also a lot of space between general-purpose CPUs and completely blank-slate FPGAs. GPUs are essentially very wide data-word processors (one instruction but a large data vector).

And, there are also configurable pipeline processors that consist of multiple ALUs (can be vector data) with reconfigurable connections between them. So rather than eat the overhead of generic bit LUTs in an FPGA, you reconfigure the interconnect in a fabric of ALUs. The fabric can contain specialized ALUs (or execution units), and varying densities depending on typical usage. This avoids the technology mapping and place & route of FPGA design. Translating a description of hardware (HDL) into actual lookup table data is a massive compute problem for large modern FPGAs. However, if we collapse the routing to just data buses between compute units, the problem can be solved in real-time. This way, a compute pipeline could be reorganized by an application at run-time without using a full-fledged hardware description language with all kinds of very low-level constructs. Higher-level language compilers already exist for this kind of architecture in academia.

EDIT: The sea of ALUs can also contain memories, FIFOs, and other block elements. However, they would all operate on the same word size to reduce the routing problem and allow maximum density implementation.

In fact, this is the essence of how micro-coded instructions in a modern CPU work anyway--instruction 'scheduling' is basically figuring out how to route data transfers in a sea of execution units.

The article "Fundamental Underpinnings of Reconfigurable Computing Architectures" in the March 2015 Proceedings of the IEEE contains a wonderful introduction to all these concepts.


A sea of ALUs with an instruction set to directly route them together, is effectively describing the Mill CPU.

http://millcomputing.com/


I guess we'll be removing cruft from the web.


Porting to HTML6.


HTML5 was an improvement. HTML6?


It is very obvious what we will be doing, and this has been discussed many times: add more metal layers.

Intel processors only have a dozen layers, which stack up to less than ~1 um. The number of transistors/layers could in theory be increased by 1000x if the layers were stacked ~1 mm high. How to do this with lithographic processes is an open question, but it is absolutely doable in theory. No physical limits prevent us from doing that.


> How to do this with lithographic processes is an open question

You don't, for economic reasons: the latency of chip manufacturing, i.e. time from initial wafer to finished, is already on the order of weeks. Producing more layers lithographically is going to multiply that latency. Not to mention that it would be a nightmare in terms of yield/manufacturing defects.

What is doable (and already done, I believe) is producing multiple chips in parallel and then stacking them on top of each other. Don't think cores spread across multiple layers; think alternating layers of cores and caches, or a layer of cores with layers of memory stacked on top. (This approach doesn't have the yield problem because you can test the chips before you stack them together. It's a technology that is going to see a lot of improvement still.)

Heat transfer is still a problem, though, simply because the number of transistors scales with the volume, i.e. cubically in the "radius", while the surface area available for heat transfer only scales quadratically.

P.S.: When chip people talk about "metal layers", they mean the layers of wiring (also called the BEOL, back-end-of-line). So increasing the number of metal layers does not actually increase the number of transistors. Also, when chip companies talk about "using N layers" in their current technology, that does not mean that N transistors are stacked on top of each other. It means that there is one layer of transistors, and N layers for connecting wires above.


I wonder when we'll start seeing CPUs with integrated heat pumps. Stacking a Peltier junction as a layer could start to make sense. Or even having (non-conductive) fluidic cooling. (Although you have to be careful designing the chip or else capacitive effects can cause problems)

Also, we currently have a heat ceiling on the order of 10s of watts. An integrated liquid-cooled heat sink could drastically up that, at least for server-like applications.


This is not obvious at all. Going 3D makes sense for memory, because these devices aren't heat-limited, but for CPUs it's not at all clear that stacking layers would be a win. CPUs are limited by heat, and heat transport away from the CPU is proportional to area, not volume.

Also, CPU costs are driven mostly by lithography, and if you're doubling the number of high resolution lithography steps, you aren't saving much money compared to just making larger area CPUs.


Heat would be a massive issue with more layers, right?


From a purely seat-of-my-pants-im-trying-to-remember-college point of view, I believe the heat is caused far more by pumping tons of electricity, rather than physical size/layout.

A big problem (and also probably why they're so thin) is that at a certain size latency becomes a huge issue. This is where clock speed increases come into play (electrons move faster) along with die decreases (electrons can travel shorter distances).

If you can run 100mph but you have to run 200 miles, vs. someone who can walk 1mph but only has to walk 10 feet... I wish I had some more concrete examples but I can't find anything off the top of my head.


Heat is caused by resistance when the electricity moves through the metal. The more metal the electricity is flowing through, the more material there is to incur resistance and generate heat, correct? If so, then more layers = more heat.


> Heat is caused by resistance when the electricity moves through the metal.

Sort of. But if you reduce resistance you increase heat since more current will flow.

On the other hand, if you increase resistance AND also keep the current the same then you'll get more heat. But! to keep that current constant you must increase the voltage.

So it's not so simple as "resistance = heat".

> The more metal the electricity is flowing through, the more material there is incur resistance

Depends on if the metal is in parallel or series. If in parallel, then the more metal the lower the resistance, if in series then higher resistance.

> If so, then more layers = more heat.

Right result, wrong method of getting there. More layers (which would be in parallel) would be less resistance. But less resistance means more current flow (since they'll keep the voltage the same), and more current flow means more heat.
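To put numbers on that: with the supply voltage held fixed, P = V^2/R, so adding layers in parallel lowers R and raises total power. A toy calculation, not tied to any real process figures:

    #include <stdio.h>

    int main(void) {
        double V  = 1.0;    /* supply voltage, held constant       */
        double R1 = 10.0;   /* effective resistance with one layer */
        for (int layers = 1; layers <= 4; layers++) {
            double R = R1 / layers;   /* parallel layers lower R               */
            double P = V * V / R;     /* lower R -> more current -> more power */
            printf("%d layer(s): R = %.2f ohm, P = %.2f W\n", layers, R, P);
        }
        return 0;
    }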


There could be a heat-conducting layer too (but in practice it's probably not that easy).


There's also the little matter of defect density. Let's say your 10-layer process yields 50%. Now you go to a 20-layer process, and you should expect your yield to drop to 25%.

(I know, it's not that simple, because there are some defects that are in the underlying substrate. I'm ignoring those for purposes of this discussion.)
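That scaling comes from treating the layers as independent: if each layer yields p, an N-layer die yields roughly p^N. A quick check of the 50%/25% figures under that simplistic model:

    #include <math.h>
    #include <stdio.h>

    int main(void) {
        /* If 10 independent layers give 50% yield, each layer yields 0.5^(1/10). */
        double per_layer = pow(0.5, 1.0 / 10.0);
        printf("per-layer yield: %.3f\n", per_layer);          /* ~0.933 */
        printf("20-layer yield:  %.3f\n", pow(per_layer, 20)); /* ~0.250 */
        return 0;
    }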


What? How does adding metal layers allow you to increase the number of transistors? Transistor density is a fixed function of transistor size, and die size is a variable function ruled mostly by defect density. Unless you are talking about die stacking?


He's referring to 3D density. Die stacking is one way to do that, and it's the only one that isn't just wild fantasy.

Another would be for the die to have multiple layers within a single die - which can't be done with current technology, but isn't physically impossible. Basically what you'd need to do is similar to PCB fabrication - you need to create "blind" and "buried" features within a series of stacked layers.

Obviously that capability doesn't exist right now, but you could look at using an ion beam to create a planar mask that targets only a single layer within a multilayer wafer. The particles travel through material without depositing energy until they reach a critical threshold [1], after which they rapidly deposit nearly all of their energy. This allows you to give the beam a specific "depth", in other words it can be targeted in 3D. Perhaps - because it's a particle beam, not an EM wave - it might also be less susceptible to some of the problems caused by the wave nature of light, even in a more conventional application.

I know it from proton radiotherapy (a good general rundown of the technology here [2]), and I wonder if it couldn't also be used for something like this. A lot of problems would have to be solved. You'd be talking about a substantially different process from standard wafer production, you'd have to tighten up the Z-resolution, etc. No idea if it'd work or not in the end, but I think it'd be interesting to look.

Or alternately maybe we figure out some additive/subtractive method that lets us create multiple layers on a single die another way. For the short terms it's going to be die stacking though.

[1] https://en.wikipedia.org/wiki/Bragg_peak

[2] http://www.aapm.org/meetings/05AM/pdf/18-4016-65735-22.pdf


Unfortunately, particles are also waves. Look at de Broglie waves.

Now, that being said, one real advantage is that the effective wavelength tends to be much smaller. For example, an electron at 0.9c has an effective wavelength of 1.2pm (picometers).
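The quoted figure follows from the relativistic de Broglie relation, lambda = h / (gamma * m * v). A quick check, assuming v = 0.9c:

    #include <math.h>
    #include <stdio.h>

    int main(void) {
        const double h = 6.62607015e-34;    /* Planck constant, J*s   */
        const double m = 9.1093837015e-31;  /* electron rest mass, kg */
        const double c = 2.99792458e8;      /* speed of light, m/s    */
        double v      = 0.9 * c;
        double gamma  = 1.0 / sqrt(1.0 - (v * v) / (c * c));
        double lambda = h / (gamma * m * v);           /* de Broglie wavelength */
        printf("lambda = %.2f pm\n", lambda * 1e12);   /* ~1.2 pm */
        return 0;
    }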


As others have noted, workload-specific FPGAs are very promising.

Another big bottleneck is memory latency. For each cache miss, a modern CPU spends hundreds or even thousands of clock cycles doing nothing. A majority of the transistors in your CPU are dedicated to mitigating this in one way or another - the L1/L2/L3 caches, transistors dedicated to branch prediction, etc.

I don't know what the outlook for improving the memory latency situation is. It's probably going to involve gobs of on-chip embedded RAM, which is expensive.
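One way to see how much misses dominate is the usual effective-CPI estimate. The numbers below are made up purely to show the shape of the problem:

    #include <stdio.h>

    int main(void) {
        double base_cpi     = 0.5;   /* ideal cycles per instruction      */
        double mem_per_inst = 0.3;   /* memory references per instruction */
        double miss_rate    = 0.02;  /* last-level cache miss rate        */
        double miss_penalty = 300.0; /* cycles to go out to DRAM          */
        double cpi = base_cpi + mem_per_inst * miss_rate * miss_penalty;
        printf("effective CPI: %.2f\n", cpi);   /* 2.30 -- mostly stalls */
        return 0;
    }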


What good is an FPGA if it isn't 'workload-specific'? That's what they do. That workload might be a processor, it might be an interface, ... you get it.

Workload-specific FPGAs are here, now. They're not 'promising' because they're doing exactly what they were made to do!

>I don't know what the outlook for improving the memory latency situation is. It's probably going to involve gobs of on-chip embedded RAM, which is expensive.

They already have a name for that and you already named it.

It's called cache ;)


    Workload-specific FPGAs are here, now. They're not 
    'promising' because they're doing exactly what they 
    were made to do!
Right. I was thinking in terms of them being more integrated into hardware and software toolchains. We already have software runtimes (various Javascript VMs, .NET, the JVM and its offshoots, etc) that can optimize code at runtime based on the hardware that's present, hotspots in the code, etc.

Now imagine if FPGAs were integrated into your typical laptop/desktop logic board. Now imagine those software runtimes we talked about above could optimize your hardware at runtime as well. Or maybe it wouldn't be transparent; maybe FPGA hardware would be targetable like GPU hardware is today. Anyway, there's a lot of things that could happen there...

    They already have a name for that and you already 
    named it. It's called cache ;)
I should have said "embedded DRAM" instead of "embedded RAM."

eDRAM is different from your typical CPU cache. Your CPU cache is almost always SRAM. But SRAM takes up 3x as much die space as an equivalent amount of eDRAM. Of course... some processors like IBM's POWER chips use eDRAM as L3 cache so the lines are blurred a little.

A lot of game consoles use embedded DRAM for stuff like graphics processing. On the XBox One I believe it's used as a framebuffer.


>We already have software runtimes (various Javascript VMs, .NET, the JVM and its offshoots, etc) that can optimize code at runtime based on the hardware that's present, hotspots in the code, etc.

Are you a hardware guy? :)

Thanks for elaborating. I had no idea how you went from "memory latency is a problem, that's why we have L_n caches" to "memory latency is a problem, we'll probably solve it ... (with cache)." This helps.

The biggest problem I see with integrating FPGAs into designs (HW) is educational. An on-die accessory FPGA doesn't amount to much if it doesn't get used. It's coming though. One way or another we're going to see more flexible HW. Taking advantage of it won't be something we're used to.


    An on-die accessory FPGA doesn't amount to much if 
    it doesn't get used
I know! It's the usual chicken-and-the-egg hardware+software problem, right?


There was an excellent presentation at HotChips 2013, given by Robert Colwell from DARPA on that topic.



That's an interesting presentation in that it outlines a lot of the directions--including rather novel ones--in which processor technology could potentially be taken. He also emphasized more system-level thinking.

At the same time, he also emphasizes just what an amazing technology CMOS has been and how that's made it such an important part of the computer revolution of the last 30 years. So, no, it's not like progress will stop but without the tailwind that CMOS scaling supplied, I expect the rate of advance in a lot of areas will slow down. (As, indeed, we're seeing on desktops and laptops today even if there are also other dynamics in play.)


Maybe (in 10-20 years) we'll build analog imitations of the brain, with similar density and high power efficiency.

Or maybe we'll build other types of compute machines using analog electronics.

Maybe we'll replace digital multiplies with pseudo-digital multipliers that use analog computation in the background, and with other forms of non-exact computation.

Or we'll use photonics to analogically compute FFTs to multiply matrices.

Or maybe we'll finally seriously and successfully invest in optical digital computing.

Or we'll crack quantum computing.


Analog and "non-exact" computation is not something that sounds very useful. The only reason we've been able to minimise so much is the tremendous noise rejection of binary systems.

Besides, modern computers are spending most of their time on actions of marginal or negative utility, mostly data-bureaucratic rather than computational.


Non-exact computation is useful for some stuff: machine vision, deep learning, and probably other types of machine learning.

Those have a wide variety of uses and they demand lots of compute.

Also in places where performance is critical and money is spent on optimization, isn't lots of this bureaucracy removed ?


> Analog and "non-exact" computation is not something that sounds very useful.

Floating point operations are non exact in general. Physics simulations work at various degrees of approximation. Non exact computations would be useful as long as we can quantify the "non-exactness".


>Maybe(in 10-20 years)

I've seen 20 years of progress. Yeah... not to discourage you; Keep on shooting for the stars.


Moore's law ending just means progress will slow, not cease.

But in addition to that there are other things we could optimize. For example a lot of transistor budget is spent on caches and very clever out of order execution. If memory bandwidth could be increased to the point where those deep out of order pipelines aren't needed anymore to mask memory latency then maybe we could utilize the chip area better.

SRAM used for caches also eats a tremendous amount of space and power due to its 6T, always-powered design.

Imagine MRAM (1T, non-volatile), manufactured in the same process node as the CPU itself, stacked right onto the CPU and/or connected with light-based transports. Since MRAM doesn't need to be powered when it's idle it would also allow for rapid power-switching, thus giving the chip designers more thermal headroom.

Another thing: remember how chips used to shrink and increase in clock speed? Well, the gigahertz race has been over for a while. But if they use the longer-lived process nodes to introduce new technologies (different semiconductors, carbon nanotube conductors, ...) then there might be new room to push a few extra GHz.

Progress will certainly slow and at some point it'll require radical re-designs, but there's still a lot of room for improvements.


> SRAM used for caches also eats a tremendous amount of space and power due to its 6T, always-powered design.

What gets in the way of using SRAM for main memory? I've always been told that it's the space multiplier (6T vs 1T+1C) but that never seemed terribly convincing because DRAM is cheap enough (and has been cheap enough for a while) that it would make sense to pay a 3x-5x premium to avoid page open/close latency even in the high-end consumer market. Heck, many consumer devices are already 2-4x overprovisioned with DRAM just in case someone doesn't like closing chrome tabs (or whatever).

By process of elimination I tend to suspect it's the power draw, but it seems odd that power from leakage current on inactive SRAM gates would dramatically exceed that of DRAM's active refreshes.

What gives?


Simply put: nobody cares for big microprocessors.

There is so little performance gain from switching from DRAM to SRAM that it just doesn't really matter. Caches are very good at hiding accesses to memory, and, if you need to hit a flash drive/hard disk, you are hosed anyway.

Where people care a lot is in embedded microprocessors. SRAM is very scarce, and it's very definitely about area and power as DRAM basically isn't used.

The problem is that you can put almost 10x more flash in the same area as SRAM (and I'm being generous--it's probably more like 20x now) and memory (flash and SRAM) is more than 50% of the die area on a modern microcontroller.

Basically, every modern microcontroller ships with the maximum amount of SRAM that it is economically feasible to make.

In addition, for "modern" technology nodes, SRAM cost is going up. Moore's Law broke back at the 28nm node.

http://electroiq.com/blog/2014/02/the-most-expensive-sram-in...


Writing CRUD apps, the same thing that we've always been doing overall as an industry :)


Add more chips / cores!


Look back and laugh at how long we were able to just shrink transistors within one plane, totally ignoring depth? Remember, Moore's law is about transistor count per square inch. I'm kind of shocked we don't use 3D chips, though I understand heat is an issue.


The manufacturing process is the main issue. Too many layers and it's impossible to keep the surface flat enough to build the next layer on. Chip-on-chip is, however, becoming widespread, keeping your DRAM nice and close.


Chips are 3D; Intel, for example, uses 13 layers. There are massive heat and cost issues when adding layers, but flatter chips have latency issues. Stacking chips is also common in cellphones and other low-power designs.


[deleted]


The 13 layers Retric was referring to are not layers of transistors, but layers of wiring. All chips currently manufactured have only a single layer of transistors. Above that single layer are many wiring layers, which alternate between running mostly in the X-direction and mostly in the Y-direction.


That's what I had thought (and reflected in my brief comment - "transistors within one plane") - it seemed they corrected me, to my surprise. Thanks.


AMD's new desktop GPUs use memory modules stacked four dies tall. Stacked chips are definitely a thing. When this makes its way to compute modules remains to be seen, though.


Intel's 7nm? Anyone?


I saw another article that referred to this as a sign of the slowing of Moore's law. However, Moore's law has its variances so I'm not really worried about that.

What I do wonder about is why we are not seeing more cores as we go down to smaller nodes. 2 and 4 cores are common now, and have been for seemingly a decade. Why aren't we seeing 8, 16, 32 and 64 cores? If you reduce the feature size and thus effectively double the transistor budget for the same area, why not increase the core count? It seems the new area is going into integrated GPUs (never as good as discrete GPUs, but certainly cheaper) and cache. Cache can improve performance, but not like more cores. And while Intel does a tick-tock design revision, each core seems to grow to fit the available space, adding only a few instructions, which provide only marginal utility. Not the same as doubling the number of cores, which would double theoretical performance.

Under Erlang/Elixir, we are able to get close to linear speedup from additional cores. I understand other languages struggle with multi-threaded programming, but should CPU designs be hobbled simply because the software industry is not where it should be?

Am I missing something?


> Why aren't we seeing 8, 16, 32 and 64 cores?

Intel has 8-core i7s, up to 18-core Xeon processors, and up to 61 cores in Xeon Phi coprocessor packages.

> If you reduce the feature size and effectively thus double the surface area available, why not increase the cores?

Because, outside of narrow domains, software that can effectively use lots of cores doesn't exist, so you don't provide good bang for the customer's buck by doing that. For most customers, there's no utility in that.

> Cache can improve performance but not like more cores.

And vice versa -- more cores can improve performance, but not like cache. Which is more useful depends on software and workloads.

> Under Erlang/Elixir, we are able to get close to linear speedup from additional cores.

On the right kind of workloads. And the people with those kinds of workloads have many-core processors available.


More cores is always a good choice for consumer chips - even if one program doesn't effectively use them, running multiple applications in parallel sure as hell does.

And these days, we're running everything under the virtual sun and expect it to work fast and smooth :-)


Any web server that serves at scale can readily use 1024 cores... Nginx, servers built in Go, Erlang and many more. It would reduce hosting costs and complexity substantially.


Let's say, for the sake of argument, that's true; what percentage of computers are "web servers that serve at scale"? What's the cost of setting up production lines for 1024-core chips for just that market -- and how many chips are you going to be able to amortize that cost over?

Even if it were possible, I think the costs would be exorbitant, even if the marginal cost per chip wasn't.


It's not as easy as you seem to think to e.g. (de-)multiplex a single TCP port to/from 1024 cores, and that's before you actually try to keep any application state in sync. Lots of stuff breaks at that level of concurrency.
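
For what it's worth, the usual Linux mechanism for the port side of this is SO_REUSEPORT: one listening socket per worker, with the kernel spreading incoming connections across them. A rough sketch (handle_conn() is just a placeholder of mine; the shared-application-state problem above is not touched at all):

    /* Sketch: N workers, each with its own listening socket bound to
       the same port via SO_REUSEPORT (Linux 3.9+). Error handling
       omitted for brevity. */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    static void handle_conn(int fd) {           /* placeholder handler */
        static const char resp[] =
            "HTTP/1.0 200 OK\r\nContent-Length: 2\r\n\r\nok";
        (void)write(fd, resp, sizeof resp - 1);
    }

    static int make_listener(unsigned short port) {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        int one = 1;
        setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof one);

        struct sockaddr_in addr;
        memset(&addr, 0, sizeof addr);
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(port);
        bind(fd, (struct sockaddr *)&addr, sizeof addr);
        listen(fd, 128);
        return fd;
    }

    int main(void) {
        for (int i = 0; i < 4; i++) {           /* one worker per core */
            if (fork() == 0) {
                int lfd = make_listener(8080);  /* per-worker socket */
                for (;;) {
                    int cfd = accept(lfd, NULL, NULL);
                    handle_conn(cfd);
                    close(cfd);
                }
            }
        }
        pause();                                /* parent just idles */
        return 0;
    }

That handles accept() fan-out reasonably well; everything the workers have to agree on afterwards is where it gets hard.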


First part makes sense, but the last part doesn't. Keeping application state in sync doesn't need to affect the throughput of events coming out of a multi-threaded application.


Those servers are also typically not CPU-bottlenecked, and while they would have 1024 threads (or even vastly more), they don't really benefit from 1024 cores.


Further, we're in an industry of VPSes where cores are virtualized and divided up. A 128-core CPU could produce 128 single-core VPSes with probably better performance.

Of course this is consistent with your point, but your comment is negative, as is my parent comment.

Substantive discussion is not appreciated on Hacker News.


Two things I can think of. First, for consumers, applications don't currently make the best use of multiple cores; there just isn't enough work to put on them. For gaming at least, DirectX 12 will remove the limitation that only one CPU core can submit work to the GPU at a time. That may push demand for more cores upwards over time.

Second is that cache coherence gets really ugly the more cores you have with a single shared memory. From what I understand, this is a limitation of assumptions in the x86 architecture. A completely new architecture might sidestep the problem, and indeed I've heard of 1024 core machines etc, which I doubt have just brute-forced cache coherence.


DX12 - catching up to Mantle and Vulkan, but ahead in marketing.


For desktop systems, a small number of fast cores is still better than many slower ones, since a lot of the software that would profit most isn't well optimized for using many cores. IMHO there isn't that much demand for better performance in bread-and-butter CPUs anyway; things like embedded graphics, power saving, etc. matter more.

The top desktop CPUs have 6 or 8 cores, and Intel Xeons go up to 18.

And there are the Intel Xeon-D and Atom Cxxxx, which go in the direction of having (at the moment up to 8) smaller cores and low energy usage, for tightly packed servers and networking devices. I guess once they get more established they might get higher-power counterparts with more cores.

And the Xeon Phi might get there, if Intel manages to do what they promise for future generations.

> each core seems to grow to fit the available space, with only adding a few instructions, which only add marginal utility. Not the same as doubling the number of cores, which would double theoretical performance.

The progress between generations isn't big enough to allow doubling the core count in the same envelope with the same per-core performance. So they only make the cores a bit better & slowly add a few cores (mostly to the high end).

High-end phones actually have quite a few cores these days, but there they are staggered for different usage scenarios: a few slow, efficient cores combined with faster, more energy-hungry ones for high-performance tasks. But for most x86 systems that would "waste" too much of the chip, I guess.

> but should CPU designs be hobbled simply because the software industry is not where it should be?

Designing CPUs for "the perfect software" is nice, but not what sells to the vast majority of the market. As listed above, there are quite a few niche lines that go in that direction, even if you look only at Intel's offerings.


There's no demand for better performance given current architectures and applications.

It's not hard to imagine more interesting architectures and applications - for example associative language processing, supercomputer-powered robotics, games and UIs that have true cinematic 3D photorealism, or gestural 3D interfaces with very large retina displays.

The problem is really the pace of OS and UI design. We're still using the Xerox PARC GUI model forty years after it was invented, the POSIX OS model some fifty years after it was invented, and the sequential von Neumann architecture sixty years after it was invented.

These were all breakthroughs at the time, but they should never have become the last word. Research of equivalent creativity seems to have stalled now. IMO it's not just because the low-hanging fruit has been picked, but because there are commercial, academic, and political pressures holding it back.

There may be a game changer generation ten or twenty years from now. But it looks like the pace will be slow until then.


18 cores seems to be the best that Intel does. Which is fairly decent. http://ark.intel.com/products/84683/Intel-Xeon-Processor-E7-...

In laptops, the trend has been toward thinner, lighter designs and longer battery life, so the easiest win is to lower power demand. If you increased the number of transistors (by adding more cores), power consumption wouldn't go down enough.


Xeon Phi currently has 61 cores[1] (62 on-die, 1 deactivated for yield reasons), and the next generation coming soon pushes that to 72 cores.

[1] http://www.intel.com/content/www/us/en/processors/xeon/xeon-...


We do have 8, 12 and 18 core CPUs made by Intel. They're just expensive and not made for the typical consumer. My PC at work has the 12 core Xeon. It's nice for compiling large projects, but at home I'm content with a 4 core i7.

http://ark.intel.com/m/products/75283/Intel-Xeon-Processor-E...

http://ark.intel.com/m/products/81061/Intel-Xeon-Processor-E...


Smaller die areas have complications of their own -- heat has less space to spread out. It's not as simple as "half the die area, double the cores" -- if you do that, you will end up overheating.

But I think the main issue, as you mentioned, is multi-threaded programming. Everyday users spend most of their time in a browser, running JavaScript, which is mostly single-threaded. There are web workers, etc., but the vast majority of sites still run single-threaded JS.


People may want to run multiple simultaneous threads though, which is what additional cores excel at.


No, you may want to run multiple threads; most people just want to use their computer. Most people are not limited by their CPU (or even memory these days); the limits are IOPS, which an SSD can help with, or the speed of their internet connection. Most computers can already run 2 or 4 threads at the same time. Just look at your average Windows computer: even when it's busy, it's probably because one core is in use and the program doesn't multithread in a decent manner, because it is very hard to program that way.


>No, you may want to run multiple threads, most people want to use their computer.

Whether most people know how their computer works or not, many want to be able to simultaneously: listen to music and/or watch a streaming video, browse multiple web pages, use text editors, specialized business applications, IDE's, and more.

You are going to have a bad time trying to do all of that on a single core, much less a single thread.


> You are going to have a bad time trying to do all of that on a single core, much less a single thread.

Even desktop CPUs have had the ability to run several threads concurrently on a single core for decades.


And having multiple cores just improves that ability. There's an upper limit to efficiency, but don't tell consumers that while you figure out your quantum processor :-)


Nowhere will you find me arguing that. In fact, that case was already covered in my comment. What you won't find is a single-core desktop CPU that runs several concurrent threads well; performance necessarily decreases drastically as you add more concurrent threads.

There is a reason dual-core has been standard for a while. The performance of a single core on a multi-tasking machine is deplorable compared to a machine with additional cores.


Almost any modern desktop you buy (heck, even a phone) has around 4 cores; it's been this way for years now. Beyond that point the average user doesn't see any particular gain from adding more cores. An 8-core machine really only helps power users. Most users are not running hundreds of busy threads; look at most of your applications -- they may have any number of threads running, but only one or two are ever loaded with work. This is the serial world we currently live in.


I'd guess 2 cores is still the majority of machines being sold, though 4 cores is probably a close second by now. 4 cores was certainly not standard on the majority of devices in the last few years.


I have something useful to say to this but fuck it. I get nothing but down votes here, so why bother investing time in "Hacker" news?

You shouldn't be downvoted like you are; this site is full of assholes, apparently. Have an upvote.


I think most people just read what they want to read, regardless of what you actually wrote.


> Under Erlang/Elixir, we are able to get close to linear speedup from additional cores.

The two main obstacles to parallelization are

1) dividing a problem into a large number of sufficiently parallel tasks, and

2) communication costs.

Erlang isn't a magic bullet that solves these problems. An inherently sequential algorithm doesn't become parallel just because you can cheaply spawn thousands of green threads, and inter-core communication doesn't become faster just because you have a convenient message-passing mechanism.
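
To make (1) concrete, here's a quick back-of-the-envelope with Amdahl's law (the numbers are made up and not Erlang-specific):

    #include <stdio.h>

    /* Amdahl's law: speedup = 1 / (s + (1 - s) / n), where s is the
       serial fraction of the work and n is the number of cores. */
    static double amdahl(double serial_fraction, int cores) {
        return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores);
    }

    int main(void) {
        printf("%.1f\n", amdahl(0.05, 16));    /* ~9.1x             */
        printf("%.1f\n", amdahl(0.05, 1024));  /* ~19.6x, not 1024x */
        return 0;
    }

Even a 5 per cent serial fraction caps 1024 cores at roughly a 20x speedup, no matter how cheap the threads are.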

Regarding the downvotes: My guess is that you're not being downvoted because of your dissenting opinion, but because you're presenting a misinformed opinion as fact.


You are missing that yes, CPU designs are hobbled by the software industry shortcomings.

Also, you are overlooking that on traditional architectures, memory access becomes slower the more cores you add.

Better integrated GPUs and better power consumption will win on almost all fronts on PCs, until the integrated GPU becomes good enough and power consumption gets near the theoretical minimum. Then we'll probably see alternative architectures (more alternative than a GPU).


Clearly we should boot the operating system straight onto the GPU.

(Joking, but also a great project if you have an army of graduate students to keep busy)


The PowerVR guys appear to be running a full RTOS w/ memory protection on their GPU FWIW.


I think you found the answer already. There are already server processors with a lot of cores (e.g. 20) - if you can utilise them all, then buy one.

However, for most client workloads, more cores does not translate to better performance because many of those cores will just idle. Single-threaded performance is still the king.


We do see 8 and 16 cores.

Newegg currently lists 10 different 8-core desktop processors and 120 different 8-core server processors. There are also 10-, 12-, 14-, and 16-core processors available to purchase right now.


Chris Mack has been vocal on this. And not only was he probably right about Moore's Law (defined as scaling that decreases the cost per transistor) being dead last year[1] (the cost per transistor curve has been flattening out), but now Intel can't even keep pace with process shrinks at any price the market will bear.

That doesn't mean progress will stop, however[2], but rather that progress (and a revised definition of Moore's Law) will be defined by chip redesigns and particularly specialized functionality that doesn't take many gates to implement but makes chips more valuable.

[1] https://www.youtube.com/watch?v=IBrEx-FINEI#t=1m13s

[2] http://spectrum.ieee.org/semiconductors/processors/the-multi...


I just read about how they were delaying AVX-512 until 'Knights Landing' (Xeon Phi), and with Haswell's mediocre improvements Intel really has not come up with anything substantial in years. Hopefully the recent Altera acquisition will help them innovate.


> with Haswells mediocre improvements Intel has really not come up with anything substantial in years

Not really. This statement is true regarding desktop CPUs, but we all know that for at least five years now Intel's focus has not been desktop CPU performance, but mobile. Hence most of the improvements we see are for mobile, like better embedded graphics (Sandy/Ivy Bridge onward) and improved power consumption (Haswell).


Intel really shouldn't get credit for the Sandy Bridge IGP, because that's just where they caught up to the NVidia IGPs that they'd banned. The first generation of Core series mobile chips represented a major step backwards from the user's perspective because Intel would no longer let NVidia make chipsets, so Intel's IGPs no longer had to compete against the GeForce 9400M and 320M. Ivy Bridge was the first time Intel actually moved the market forwards in terms of mobile IGP performance.


> with Haswells mediocre improvements Intel has really not come up with anything substantial in years

Unlocking Haswell's performance (up to 2x integer/FPU throughput) requires using AVX2 instructions. That means at least recompiling, and, to truly extract the performance, optimizing with AVX2 intrinsics or assembly.
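
For example, a hypothetical array-add kernel written against the AVX2 intrinsics looks roughly like this (function name is mine; compile with something like gcc -mavx2):

    #include <immintrin.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Adds two int32 arrays eight lanes at a time with AVX2.
       Assumes n is a multiple of 8 to keep the sketch short. */
    void add_i32_avx2(int32_t *dst, const int32_t *a,
                      const int32_t *b, size_t n) {
        for (size_t i = 0; i < n; i += 8) {
            __m256i va = _mm256_loadu_si256((const __m256i *)(a + i));
            __m256i vb = _mm256_loadu_si256((const __m256i *)(b + i));
            _mm256_storeu_si256((__m256i *)(dst + i),
                                _mm256_add_epi32(va, vb));
        }
    }

A plain scalar loop only gets there if the compiler auto-vectorizes it, which is exactly the recompile-or-rewrite trade-off above.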


But will it have non-disabled HTM?

EDIT: It's a real question. Working HTM would negate pretty much any bad news from Intel in my mind.


Did the BBC forget a comma in the title? I'm having a hard time parsing it.

Edit: or perhaps a dash? Really it's just a horrible title all around.


Yes, a dash; “chip-making changes” makes much more sense. Horrible title indeed.


I don't think so. They are delaying changes to their chip making process.


Moore's law plateaued years ago. Sure, they are adding more cores, but that's because they have made no progress on clock speed in nearly a decade. FPGAs are going to be the next go-to for performance until CPU clock speeds improve.


    Sure they are adding more cores but thats because 
    they have made no progress on the clock speed in 
    nearly a decade.
Per-core IPC (instructions executed per clock cycle) has soared since then. That's why a modern 4-core Core i7 is roughly 20-30x faster than a "NetBurst" Pentium 4 from 10 years ago.

Looked at another way, each individual core of a modern i7 is about 5x-7x faster than those old Pentium 4 CPUs.

http://ark.intel.com/products/27495/Intel-Pentium-4-Processo...

    FPGA's are going to be the next go to for performance until 
    cpu clock speeds improve
Maybe. What would also be cool is decreasing the latency of main memory. That is the single biggest bottleneck in processors today. A cache miss usually means that the CPU sits around doing nothing for hundreds of cycles while it waits for data to be fetched from main memory.
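
A crude way to see that cost yourself is a pointer-chasing loop over a buffer much bigger than the last-level cache (a rough sketch of mine; timer overhead, TLB misses, and the like are ignored):

    #define _POSIX_C_SOURCE 200809L
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (64u * 1024 * 1024 / sizeof(size_t))  /* ~64 MB of indices */

    int main(void) {
        size_t *next = malloc(N * sizeof *next);

        /* Sattolo's algorithm: build one random cycle through the
           buffer, so each load depends on the previous one and almost
           every access misses the caches. */
        for (size_t i = 0; i < N; i++) next[i] = i;
        for (size_t i = N - 1; i > 0; i--) {
            size_t j = (size_t)rand() % i;          /* j < i */
            size_t t = next[i]; next[i] = next[j]; next[j] = t;
        }

        struct timespec t0, t1;
        size_t p = 0, steps = 20 * 1000 * 1000;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t k = 0; k < steps; k++) p = next[p];  /* dependent loads */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("~%.0f ns per access (p=%zu)\n", ns / steps, p);
        free(next);
        return 0;
    }

On a typical desktop this prints something on the order of 70-100 ns per access, i.e. a few hundred cycles that the out-of-order machinery can only partially hide.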


Moore's Law is about transistor count, not clock speed. They used to go together, now they don't, but Moore's Law is still doing OK. This is an indication that it may be coming to an end, finally.


Richard from DotNetRocks does an awesome show on this. https://www.dotnetrocks.com/default.aspx?showNum=1130


I thought Moore's law had more to do with density than clock speed. I thought that increased density helped to increase clock speed but that was not really Moore's Law. Recent increases in density have not helped clock speed as much due to other constraints on clock speed, but the density has still been increasing, hasn't it?


Smaller transistors can flip state more quickly with less energy, so usually a die-shrink allows higher frequencies.

There's a limit to how much you can crank these things before power consumption becomes ridiculous and heat dissipation is impossible.

You can make a 10 GHz CPU if you're prepared to require a massive liquid-helium cooling system and accept a power consumption of 15 kW.
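
Roughly speaking, with the usual dynamic-power model and the assumption that voltage has to rise more or less in step with frequency:

    P_dyn ≈ C · V^2 · f,    V roughly ∝ f    =>    P_dyn roughly ∝ f^3

So going from ~4 GHz to 10 GHz is only a ~2.5x clock bump but on the order of 2.5^3 ≈ 15x the dynamic power, before leakage is even counted.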



