
I believe it's something like an accumulator that has higher precision than the data types being used. It helps reduce rounding errors by maintaining extra bits until you store a final result.


Specifically they are using a 1024 bit accumulator register to hold intermediate results for 64-bit posit operations. You can get more precision for IEEE float operations as well if you're willing to just add loads of extra bits!
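To make the idea concrete, here is a minimal sketch using ordinary IEEE types and NumPy: keep the running sum in a wider format (float64 here, standing in for the 1024-bit quire) and round to the narrow format only once at the end. This is only an analogy for the quire, not the posit mechanism itself.

```python
import math
import numpy as np

rng = np.random.default_rng(0)
data = rng.standard_normal(100_000).astype(np.float32)

# Naive: fold each value into a 32-bit accumulator, rounding at every step.
naive = np.float32(0.0)
for x in data:
    naive = np.float32(naive + x)

# Wide accumulator: sum in float64, round to float32 once at the end.
wide = np.float32(data.sum(dtype=np.float64))

# Correctly rounded double-precision reference.
ref = math.fsum(float(x) for x in data)
```

The wide-accumulator result is typically within half a float32 ulp of the true sum, while the naive version accumulates a rounding error at every one of the 100,000 additions.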

It's expensive though:

  Synthesis results of the 64-bit PAU in Big-PERCIVAL
  have shown that it requires 2.5× as many resources as
  the double-precision FPNew FPU. Moreover, we studied
  the impact of the corresponding 1024-bit quire accumulator
  register, which increased the total hardware cost to a third
  of the area of the core. Detailed area results illustrated how
  the hardware resources are distributed among the different
  operations. In particular, the most resource-hungry elements
  are the quire-related units and the posit division and square
  root units.
I don't think this is a particularly positive result for posits.


Yeah, I have had a tough time thinking posits are worth it in hardware - posit ops of length n seem to take almost as much hardware as floating point ops of length 2n.

Quad-precision float seems more general-purpose and honestly more promising for scientific computing, since the error analysis is easier.


I think Gustafson would argue that it doesn't matter, since the storage cost impacts power more than the FPU computation cost. (Not that I would agree with him).

But in general, it seems that the strongest features of posits are basically recognizing that being strategic with where you need extra precision is advantageous, and if you apply the same techniques to IEEE 754 floats, you lose most of the seeming advantage of posits.


IEEE 754 is just the codification of the Intel 8087 coprocessor design that John Palmer and Bruce Ravenel came up with. They brought in William Kahan as a consultant, and Kahan disagreed with almost every aspect of their design (he wanted decimal representation, not binary, and 128-bit extended precision instead of 80-bit, and bitwise reproducibility, not 'better answers on Intel') but he lost every argument. Kahan's clout helped Intel's design become the IEEE Std 754, and John Palmer chortled over the fact that they'd foisted that on the world. I used to work for him, so this is first-hand info. IEEE 754 is not a mathematical design, and the exception conditions are a complete mess, which is why it takes almost 90 pages to describe the Standard. The Posit Standard (2022) is only 12 pages long.

The #1 issue in computer performance is the Memory Wall: it is orders of magnitude more expensive to move data between external DRAM and the processor than it is to do operations within the processor. The solution is to increase information-per-bit so that real numbers can be represented in 32-bit precision with sufficient accuracy. That more than doubles the performance over 64-bit floats since it allows more data to fit in cache at every level of the memory hierarchy.
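The cache-footprint arithmetic behind that claim is simple to check (a trivial sketch, assuming only NumPy):

```python
import numpy as np

n = 1_000_000
a64 = np.ones(n, dtype=np.float64)
a32 = np.ones(n, dtype=np.float32)

# Halving the element width halves the bytes moved across the memory bus
# and doubles how many values fit in any given cache level.
print(a64.nbytes, a32.nbytes)  # 8 MB vs 4 MB for a million elements
```

Whether 32 bits carries "sufficient accuracy" for a given workload is, of course, the contested part.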


The 754 spec has somewhat moved past the 8087 at this point (3 revisions later). A lot of things have been fixed, including the whole language around exceptions (which used to define "traps" - a very processor-specific idea rather than an arithmetic-centered one). I am hoping we can be free of (required) exceptions in 2028.

As I understand it, your other complaints tend to center around overflow to infinity and precise summation of vectors. For applications that really care about that precision, there are ways to do it in floating point without a quire register - sorting before summing is the naive approach, but look into ReproBLAS for some better algorithms.
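For a concrete example that is much simpler than the ReproBLAS algorithms: compensated (Kahan–Babuška/Neumaier) summation recovers low-order bits that a naive float sum throws away, using only ordinary doubles and no wide accumulator. The function name here is mine, not from any library.

```python
import math

def neumaier_sum(xs):
    """Neumaier's variant of compensated summation: track the bits lost
    in each addition in a correction term and add it back at the end."""
    s = 0.0  # running sum
    c = 0.0  # running compensation for lost low-order bits
    for x in xs:
        t = s + x
        if abs(s) >= abs(x):
            c += (s - t) + x  # low-order bits of x were lost in s + x
        else:
            c += (x - t) + s  # low-order bits of s were lost in s + x
        s = t
    return s + c

data = [1e16, 1.0, -1e16]
# Naive left-to-right summation loses the 1.0 entirely: sum(data) == 0.0,
# while neumaier_sum(data) == 1.0, matching math.fsum(data).
```

This handles ill-conditioned sums that defeat plain Kahan summation, at the cost of one extra branch per element.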

Also, I can't help but wonder if the memory-wall idea here is centered only on synthetic benchmarks like gigantic dot products. A lot of code leans heavily on caches these days, which makes the energy cost of memory accesses much lower, and pretty much everything short of massive streaming workloads benefits from them. I imagine you would have to make a very nuanced argument for why a 1024-bit fixed-point accumulation is saving energy here. Even matmuls are pretty cache-efficient now.

Elsewhere in computing, we are actually generally moving away from tightly-packed structs in performance-sensitive code despite the memory retrieval cost, because they are just easier to deal with in both hardware and software, and locality picks up all the slack.



