stephencanon · 2026-03-26T19:37:46 1774553866

"I'd like to thank my mother, Ayn Rand, and God" is the usual example.

Yes, you can reorder the list to remove the ambiguity, but sometimes the order of the list matters. The serial comma should be used when necessary to remove ambiguity, and not used when it introduces ambiguity. Rewrite the sentence when necessary. Worth noting that this is the Oxford University Press's own style rule!

alistairSH · 2026-03-26T19:53:11 1774554791

I always heard this one...

We invited the strippers, JFK, and Stalin to the party. [three groups invited - strippers, a president, and a premier]

We invited the strippers, JFK and Stalin to the party. [the president and premier are strippers]

Very different visual conjured by those two sentences.

comprev · 2026-03-26T20:21:08 1774556468

"John helped his uncle, Jack off a horse"

"John helped his uncle Jack off a horse"

Two very different outcomes...

h4ch1 · 2026-03-26T21:05:48 1774559148

shouldn't there be another comma after Jack?

John helped his uncle, Jack, off a horse.

Because when I say it aloud, I pause only after "uncle" and then run "Jack off a horse" together. It feels like there should be another pause after "Jack"?

PaulDavisThe1st · 2026-03-26T20:19:51 1774556391

I'd prefer:

We invited the strippers, JFK and Stalin, to the party [two strippers, named JFK and Stalin]

if the goal is to minimize ambiguity.

al_borland · 2026-03-26T20:44:57 1774557897

I can see it being tiresome to read text where the author is continuously interjecting clarification with brackets.

PaulDavisThe1st · 2026-03-26T21:42:45 1774561365

The square-bracket clarifications here are meta-text designed to absolutely clarify the intended reading of the preceding text, so that the reader can contrast their understanding with the intended one.

There is no suggestion that one would do this in "regular" text.

robertoandred · 2026-03-26T20:43:53 1774557833

If JFK and Stalin were strippers, there’d be a comma after Stalin to denote the parenthetical clause.

Avshalom · 2026-03-26T20:53:31 1774558411

I mean, first off: no, the exact same image is conjured, because we are reading this knowing who JFK and Stalin are, we know they aren't strippers, and all language is contextual.

That said:

We invited the stripper, JFK, and Stalin to the party.

We invited the stripper, JFK and Stalin to the party.

The supposed ambiguity is back. Although, again, there is no ambiguity to the reader. The juxtaposition of the two versions wouldn't work as a joke if there were any ambiguity.

alistairSH · 2026-03-26T23:31:42 1774567902

Fine, I’ll be boring…

We invited the strippers, Krystal and Britney, to the party.

We invited the strippers, Krystal, and Britney, to the party.

And yes, if strippers is made singular, poor Krystal becomes Schrödinger’s Stripper.

yellowapple · 2026-03-27T05:42:51 1774590171

Removing the Oxford comma not only fails to resolve the ambiguity, but introduces yet more ambiguity by implying Ayn Rand to be God.

Joker_vD · 2026-03-26T20:30:42 1774557042

Just put the colon there if you need to introduce a list, it's one of its functions. "I'd like to thank: my mother, Ayn Rand and God". The same goes for that "two strippers" example: "We invited the strippers: JFK and Stalin, to the party".

mmooss · 2026-03-26T22:14:15 1774563255

I want you to know that I would only write this in a discussion nitpicking about grammar: :)

> "I'd like to thank: my mother, Ayn Rand and God".

A colon should not connect a verb and its objects; generally you need an independent clause before the colon (i.e., a clause that could be a complete sentence). One could properly say,

  I'd like to thank the following: My mother, Ayn Rand and God.

Also, these examples leave ambiguity. Your mom could be Ayn Rand, and if she was, then you might very well think she was God, or be making a joke about it.

> "We invited the strippers: JFK and Stalin, to the party"

Nope. A colon isn't a parenthetical in the middle of a sentence; that is, you can't continue the sentence after a colonic phrase (there's no such thing so I made up that term :D ). And again, the clause before that colon is not an independent clause. One can use parentheses (of course) or em dashes for parenthetical phrases:

  We invited strippers (JFK and Stalin) to the party.
  We invited strippers - JFK and Stalin - to the party.

A proper colon might be as follows:

  We invited strippers to the party: JFK and Stalin!

But I'd put an em dash there (and to heck with LLMs and their em dash overusage).

stephencanon · 2026-03-19T14:10:40 1773929440

Enlarging a branch predictor requires area and timing tradeoffs. CPU designers have to balance branch predictor improvements against other improvements they could make with the same area and timing resources. What this tells you is that either Intel is more constrained for one reason or another, or Intel's designers think that they net larger wins by deploying those resources elsewhere in the CPU (which might be because they have identified larger opportunities for improvement, or because they are basing their decision making on a different sample of software, or both).

pbsd · 2026-03-19T18:35:20 1773945320

I mean, he's comparing 2024's Zen 5 and M4 against 2022's Intel Raptor Lake, which is two generations behind. Lion Cove should be roughly on par with the M4 on this test.

stephencanon · 2026-03-19T21:22:58 1773955378

That would fall under "more constrained", due to process limits.

stephencanon · 2026-03-16T15:49:46 1773676186

Your students should be able to figure out if a computation is exact or not, because they should understand binary representation of numbers.
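For anyone who wants to check their reasoning mechanically, here is a minimal sketch in Python. It assumes IEEE 754 doubles (which is what Python's `float` is on common platforms); `is_exact_sum` is a hypothetical helper name, not something from a library. It relies on the fact that `fractions.Fraction` converts a float's binary value exactly, so the comparison is between the true real-number sum and the rounded floating-point sum.

```python
from fractions import Fraction


def is_exact_sum(a: float, b: float) -> bool:
    """Return True if the floating-point sum a + b equals the real sum.

    Hypothetical helper for illustration. Fraction(float) converts the
    stored binary value exactly, with no rounding, so the left-hand side
    is the true sum of the two stored values.
    """
    return Fraction(a) + Fraction(b) == Fraction(a + b)


# 0.5 and 0.25 are powers of two, and their sum fits in the significand.
print(is_exact_sum(0.5, 0.25))

# The stored values of 0.1 and 0.2 need more bits for their sum than
# a double has, so the addition rounds.
print(is_exact_sum(0.1, 0.2))

# Above 2**53 adjacent doubles are 2 apart, so adding 1.0 cannot be exact.
print(is_exact_sum(1e16, 1.0))
```

The same pattern works for products or differences by swapping the operator; the point is only that the exactness question has a definite answer once you look at the binary values.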

stephencanon · 2026-03-11T19:52:42 1773258762

For throughput-dominated contexts, evaluation via Horner's rule does very well because it minimizes register pressure and the number of operations required. But the latency can be relatively high, as you note.

There are a few good general options to extract more ILP for latency-dominated contexts, though all of them trade additional register pressure and usually some additional operation count; Estrin's scheme is the most commonly used. Factoring medium-order polynomials into quadratics is sometimes a good option (not all such factorizations are well behaved wrt numerical stability, but it also can give you the ability to synthesize selected extra-precise coefficients naturally without doing head-tail arithmetic). Quadratic factorizations are a favorite of mine because (when they work) they yield good performance in _both_ latency- and throughput-dominated contexts, which makes it easier to deliver identical results for scalar and vectorized functions.

There's no general "best" option for optimizing latency; when I wrote math library functions day-to-day, we just built a table of the optimal evaluation sequence for each polynomial order up to 8 or so and each microarchitecture, and grabbed the one we needed unless special constraints required a different choice.
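To make the Horner vs. Estrin contrast concrete, here is a rough sketch in Python. The function names and the low-to-high coefficient ordering are my own choices for illustration. Python won't show the latency difference, of course; the sketch only shows the shape of the dependency chains that matters on real hardware.

```python
def horner(coeffs, x):
    """Evaluate c[0] + c[1]*x + ... + c[n]*x**n as one serial chain.

    Fewest operations and only one live accumulator, but every
    multiply-add has to wait for the previous one.
    """
    acc = 0.0
    for c in reversed(coeffs):
        acc = acc * x + c
    return acc


def estrin(coeffs, x):
    """Evaluate the same polynomial by combining terms pairwise.

    Within each round the pair computations are independent of each
    other, so hardware can overlap them; the cost is more live values
    and the extra squarings of x.
    """
    terms = list(coeffs)
    if not terms:
        return 0.0
    power = x
    while len(terms) > 1:
        if len(terms) % 2:
            terms.append(0.0)  # pad so every term has a partner
        terms = [terms[i] + terms[i + 1] * power
                 for i in range(0, len(terms), 2)]
        power = power * power  # x, x**2, x**4, ...
    return terms[0]


coeffs = [1.0, 2.0, 3.0, 4.0, 5.0]  # 1 + 2x + 3x^2 + 4x^3 + 5x^4
print(horner(coeffs, 2.0), estrin(coeffs, 2.0))
```

For a degree-7 polynomial, Horner is a chain of seven dependent multiply-adds, while the pairwise form needs three rounds. In general the two can round differently, since they associate the arithmetic differently.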

Sesse__ · 2026-03-11T21:23:59 1773264239

Ah, I didn't know of either scheme. Still, am I right in that this mainly makes sense for degrees above five or six or so?

stephencanon · 2026-03-12T00:01:27 1773273687

You can often eke something out for order-four, depending on uArch details. But basically yeah.

stephencanon · 2026-03-11T16:12:33 1773245553

Ideally either one is just a library call to generate the coefficients. Remez can get into trouble near the endpoints of the interval for asin and require a little bit of manual intervention, however.

stephencanon · 2026-03-11T15:38:43 1773243523

These sorts of approximations (and more sophisticated methods) are fairly widely used in systems programming, as seen by the fact that Apple's asin is only a couple percent slower and sub-ulp accurate (https://members.loria.fr/PZimmermann/papers/accuracy.pdf). I would expect to get similar performance on non-Apple x86 using Intel's math library, which does not seem to have been measured, and significantly better performance while preserving accuracy using a vectorized library call.

The approximation reported here is slightly faster but only accurate to about 2.7e11 ulp. That's totally appropriate for the graphics use in question, but no one would ever use it for a system library; less than half the bits are good.

Also worth noting that it's possible to go faster without further loss of accuracy--the approximation uses a correctly rounded square root, which is much more accurate than the rest of the approximation deserves. An approximate square root will deliver the same overall accuracy and much better vectorized performance.
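As a quick sanity check on the "less than half the bits" remark, here is a small sketch of how an error in ulps translates to good bits. It assumes the 2.7e11 ulp figure refers to double precision (a 53-bit significand), which is my reading rather than something stated above; `ulp_error` and `good_bits` are hypothetical helper names. `math.ulp` needs Python 3.9 or later.

```python
import math


def ulp_error(approx: float, reference: float) -> float:
    """Error of approx, in units in the last place of reference."""
    return abs(approx - reference) / math.ulp(reference)


def good_bits(err_ulps: float, precision: int = 53) -> float:
    """Rough count of correct significand bits for an error >= 1 ulp.

    precision=53 assumes IEEE 754 double; that is an assumption here.
    """
    return precision - math.log2(err_ulps)


# An error of 2.7e11 ulp is about 2**38 ulp, which leaves roughly
# 15 of the 53 significand bits.
print(good_bits(2.7e11))

# One ulp away from 1.0 measures as exactly one ulp of error.
print(ulp_error(1.0 + math.ulp(1.0), 1.0))
```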

Pannoniae · 2026-03-11T15:41:48 1773243708

Yeah, the only big problem with approx. sqrt is that it's not consistent across systems, for example Intel and AMD implement RSQRT differently... Fine for graphics, but if you need consistency, that messes things up.

stephencanon · 2026-03-11T16:07:11 1773245231

Newer rsqrt approximations (ARM NEON and SVE, and the AVX512F approximations on x86) make the behavior architectural so this is somewhat less of a problem (it still varies between _architectures_, however).

def-pri-pub · 2026-03-11T17:41:01 1773250861

Wait, what? Do you have a resource I could read up on about that? That is moderately concerning if your math isn't portable across chips.

stephencanon · 2026-03-11T18:02:09 1773252129

When Intel specced the rsqrt[ps]s and rcp[ps]s instructions ~30 years ago, they didn't fully specify their behavior. They just said their relative error is "smaller than 1.5 * 2⁻¹²," which someone thought was very clever because it gave them leeway to use tables or piecewise linear approximations or digit-by-digit computation or whatever was best suited to future processors. Since these are not IEEE 754 correctly-rounded operations, and there was (by definition) no software that currently used them, this was "fine".

And mostly it has been OK, except for some cases like games or simulations that want to get bitwise identical results across HW, which (if they're lucky) just don't use these operations or (if they're unlucky) use them and have to handle mismatches somehow. Compilers never generate these operations implicitly unless you're compiling with some sort of fast-math flag, so you mostly only get to them by explicitly using an intrinsic, and in theory you know what you're signing up for if you do that.

However, this did make them unusable for some scenarios where you would otherwise like to use them, so a bunch of graphics and scientific computing and math library developers said "please fully specify these operations next time" and now NEON/SVE and AVX512 have fully-specified reciprocal estimates,¹ which solves the problem unless you have to interoperate between x86 and ARM.

¹ e.g. Intel "specifies" theirs here: https://www.intel.com/content/www/us/en/developer/articles/c...

ARM's is a little more readable: https://developer.arm.com/documentation/ddi0596/2021-03/Shar...
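To illustrate why an error bound alone doesn't pin down the bits, here is a toy sketch. Both estimators below are made up for illustration and do not model any real CPU; they quantize the exact reciprocal square root to 13 bits, one by truncating and one by rounding. Each stays inside the 1.5 * 2⁻¹² relative-error bound, yet they return different values for the same input.

```python
import math

BOUND = 1.5 * 2.0 ** -12  # relative-error bound quoted above


def quantize(y: float, bits: int, rounder) -> float:
    """Keep `bits` bits of y's significand using the given rounding rule."""
    m, e = math.frexp(y)  # y == m * 2**e with 0.5 <= m < 1
    scale = 2.0 ** bits
    return math.ldexp(rounder(m * scale) / scale, e)


def estimate_a(x: float) -> float:
    """Hypothetical implementation A: truncate."""
    return quantize(1.0 / math.sqrt(x), 13, math.floor)


def estimate_b(x: float) -> float:
    """Hypothetical implementation B: round to nearest."""
    return quantize(1.0 / math.sqrt(x), 13, round)


def rel_err(approx: float, x: float) -> float:
    exact = 1.0 / math.sqrt(x)
    return abs(approx - exact) / exact


x = 2.0
a, b = estimate_a(x), estimate_b(x)
print(a, b, a == b)
print(rel_err(a, x) < BOUND, rel_err(b, x) < BOUND)
```

Any code that feeds such an estimate into further arithmetic inherits the mismatch, which is the cross-hardware reproducibility problem described above.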

def-pri-pub · 2026-03-11T18:21:21 1773253281

Thanks!

arnoldjm · 2026-03-14T14:31:51 1773498711

Take a look at the "rsqrt_rcp" section of reference [6] in the accuracy report by Gladman et al. referenced above. I did that work 10 years ago because some people at CERN had reported getting different results from certain programs depending on whether the exact same executables were run on Intel or AMD CPUs. The result of the investigation was that the differing results were due to different implementations of the rsqrt instruction on the different CPUs.

patchnull · 2026-03-11T16:05:59 1773245159

[flagged]

stephencanon · 2026-03-11T16:19:02 1773245942

For the asinf libcall on macOS/x86, my former colleague Eric Postpischil invented the novel (at least at the time, I believe) technique of using a Remez-optimized refinement polynomial following rsqrtss instead of the standard Newton-Raphson iteration coefficients, which allowed him to squeeze out just enough extra precision to make the function achieve sub-ulp accuracy. One of my favorite tricks.

We didn't carry that algorithm forward to arm64, sadly, because Apple's architects made fsqrt fast enough that it wasn't worth it in scalar contexts.
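For readers who haven't seen the baseline being improved on, here is a sketch of the standard Newton-Raphson refinement of a reciprocal square root estimate. It shows only the textbook step. The constants 1.5 and 0.5 are the ones a Remez-tuned polynomial would replace, and the tuned values are not given in this thread, so none are reproduced here. `crude_rsqrt` is a made-up stand-in for a roughly 12-bit hardware estimate (it does not model rsqrtss), and everything runs in double precision for simplicity even though asinf is a single-precision function.

```python
import math
import struct


def crude_rsqrt(x: float) -> float:
    """Stand-in estimate: exact 1/sqrt(x) truncated to 12 fraction bits.

    Hypothetical; real hardware estimates behave differently.
    """
    y = 1.0 / math.sqrt(x)
    bits = struct.unpack("<Q", struct.pack("<d", y))[0]
    bits &= ~((1 << 40) - 1)  # clear the low 40 of 52 fraction bits
    return struct.unpack("<d", struct.pack("<Q", bits))[0]


def newton_step(x: float, y: float) -> float:
    """One Newton-Raphson step for y ~ 1/sqrt(x).

    If y has relative error e, the result has relative error of about
    1.5 * e**2, so the number of good bits roughly doubles.
    """
    return y * (1.5 - 0.5 * x * y * y)


x = 2.0
exact = 1.0 / math.sqrt(x)
y0 = crude_rsqrt(x)
y1 = newton_step(x, y0)
print(abs(y0 - exact) / exact)  # estimate: below 2**-12
print(abs(y1 - exact) / exact)  # after one step: roughly squared
```

Starting from about 12 bits, one step lands at roughly 23 bits, just short of single precision's 24. That gap is presumably where the extra precision from tuned coefficients matters.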

stephencanon · 2026-03-10T13:08:36 1773148116

Yann is definitely more well-known outside of academia. Inside academia, it's going to depend a lot on your specific background and how old you are.

stephencanon · 2026-03-05T13:37:03 1772717823

Mechanically, you're probably right, but the screen-centric controls of the newer generation are _awful_ compared to the F generation's physical buttons and dials (this isn't just BMW, though; it's the whole industry).

My wife and I both have F31s, which we will drive until we can no longer source replacement parts unless the industry comes to its senses first (unlikely). Any time we've ever looked at plausible replacements, the screen-based controls are an immediate hard no.

stephencanon · 2026-03-05T13:32:02 1772717522

Our (~2015) 3-series controls are just about perfect. Where they differ from Honda/Toyota's controls that I am also very familiar with, they're noticeably better now that I'm familiar with them. Everything is really well thought-out.

Of course, now they (and almost every other manufacturer) have followed Tesla off the cliff and made everything a screen, so the current generation cars have abysmal controls.

stephencanon · 2025-12-04T13:33:39 1764855219

Schubfach's table is quite large compared to some alternatives with similar performance characteristics. swiftDtoa's code and tables combined are smaller than just Schubfach's table in the linked implementation. Ryu and Dragonbox are larger than swiftDtoa, but also use smaller tables than Schubfach, IIRC.

If I$ is all you care about, then table size may not matter, but for constrained systems, other algorithms in general, and swiftDtoa in particular, may be better choices.