If you are from the ML/data science world, the analogy that finally unlocked FFT for me is dimensionality reduction using Principal Component Analysis. In both cases, you project the data onto a new, "better" coordinate system ("time to frequency domain"), filter out the basis vectors that carry little variance ("ignore high-frequency waves"), and project the data back to the original space from those truncated dimensions ("IFFT: inverse transform back to the time domain").
Of course, some differences exist (e.g. the basis vectors are fixed in FFT, unlike in PCA).
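To make the parallel concrete, here is a minimal sketch of that transform/truncate/inverse-transform pipeline (my own illustration, using a naive O(N^2) DFT instead of an actual FFT purely for clarity; the parameter keep plays the role of the number of retained components):

    #include <algorithm>
    #include <cmath>
    #include <complex>
    #include <cstddef>
    #include <vector>

    std::vector<double> lowpass(const std::vector<double>& x, std::size_t keep) {
        const std::size_t n = x.size();
        const double pi = std::acos(-1.0);
        // "Project to the frequency domain": forward DFT.
        std::vector<std::complex<double>> X(n);
        for (std::size_t k = 0; k < n; k++)
            for (std::size_t t = 0; t < n; t++)
                X[k] += x[t] * std::polar(1.0, -2.0 * pi * double(k) * double(t) / double(n));
        // "Filter out the low-variance basis vectors": zero every frequency
        // above keep (min(k, n-k) is the distance from DC, which accounts
        // for the mirrored upper half of the spectrum).
        for (std::size_t k = 0; k < n; k++)
            if (std::min(k, n - k) >= keep) X[k] = 0.0;
        // "Project back to real space": inverse DFT, keeping the real part.
        std::vector<double> y(n);
        for (std::size_t t = 0; t < n; t++) {
            std::complex<double> s = 0.0;
            for (std::size_t k = 0; k < n; k++)
                s += X[k] * std::polar(1.0, 2.0 * pi * double(k) * double(t) / double(n));
            y[t] = s.real() / double(n);
        }
        return y;
    }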
I'd love to see a breakdown of what exactly worked here, or better yet, a PR upstream to Abseil that implements those ideas.
AI is always good at going from 0 to 80%; it's the last 20% it struggles with. It'd be interesting to see Claude-written code make its way into a well-established library.
I'm curious what you did with the "active sorting range" after a push/pop event. Since it's a vector underneath, I don't see any option other than restoring order across the entire range after each event, which is O(N). Wouldn't that destroy performance?
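For reference, this is roughly the per-event cost I'd expect if it really is a plain sorted std::vector underneath (a hypothetical sketch, not the library's actual code; sorted_push is my name, not theirs):

    #include <algorithm>
    #include <vector>

    // Hypothetical sketch: keeping a std::vector sorted across push events
    // without a full re-sort. The position is found in O(log N), but the
    // insert still shifts the tail of the vector, so each event costs O(N).
    void sorted_push(std::vector<int>& v, int value) {
        auto pos = std::lower_bound(v.begin(), v.end(), value);
        v.insert(pos, value);
    }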
> To Koreans, they looked more like sauce bowls, leading them to conclude that the Japanese had starved themselves to stretch out the siege.
As a Bengali man, that's exactly how I felt when I came to the USA and first visited Japanese restaurants. Part of the reason we consume so much rice is that rice is the main dish (not a side): it literally sits at the center of your plate and takes up most of the space on it.
A typical Japanese person will devour their small rice bowl until not a single grain of rice is left over, since they're taught from a very young age not to waste food.
People in most other Asian nations will not eat their rice down to the last grain. Even with the most delicious biryani, there are always many grains of rice left on the plate. I think the small bowl makes it much easier to consume the rice completely than a big bowl or plate does.
The Japanese mostly eat sticky rice, which is very easy to eat and "clean up", even with chopsticks.
The Indian subcontinent eats long-grain Basmati or similar rice, which fluffs up into individual grains on the plate. It doesn't make sense to pick out single leftover grains one by one.
Nearly every culture has an idea like "Annapurna", the god of food, and wasting food is generally frowned upon and considered bad table manners. I was scolded plenty of times as a child in Nepal for not cleaning my plate.
I wouldn't attribute it to small bowls, at least. The Japanese instilling good virtues into their children almost institutionally perhaps plays some part, but some of it is also just physics.
Having had grandparents live through WWII (or any other war, to be fair) also helps instill this attitude. I can barely imagine what kind of famines they had to endure.
Sure... in detail, and abstracted slightly, the byte table problem:
Maybe you're remapping RGB values [0..255] with a tone curve in graphics, doing a lookup mapping IDs to indexes in a set, or applying a permutation table, or... well, there are a lot of use cases, right? This is essentially an arbitrary function lookup where both the domain and range are bytes.
It looks like this in scalar code:
    #include <stdint.h>

    typedef uint8_t byte;

    void transform_lut(byte* dest, const byte* src, int size, const byte* lut) {
        for (int i = 0; i < size; i++) {
            dest[i] = lut[src[i]];  // two loads (src byte, table entry), one store
        }
    }
The function above is basically load/store limited: it does negligible arithmetic, just loading a byte from the source, using that to index a load from the table, and then storing the result to the destination. So two loads and a store per element. Zen5 has 4 load pipes and 2 store pipes, so our CPU can do two elements per cycle in scalar code. (Zen4 has only 1 store pipe, so 1 element per cycle there.)
Here's a snippet of the AVX512 version.
You load the 256-byte lookup table into 4 registers outside the loop:
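Something like this (my sketch; I'm assuming the table pointer is named lut, and p0..p3 match the snippet below):

    __m512i p0 = _mm512_loadu_si512(lut);        // LUT bytes 0..63
    __m512i p1 = _mm512_loadu_si512(lut + 64);   // LUT bytes 64..127
    __m512i p2 = _mm512_loadu_si512(lut + 128);  // LUT bytes 128..191
    __m512i p3 = _mm512_loadu_si512(lut + 192);  // LUT bytes 192..255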
Then, for each SIMD vector of 64 elements, use each lane's value as an index into the lookup table, just like the scalar version. Since a single permute can only index across 128 bytes (two registers), we DO have to do it twice, once for the lower half of the table and again for the upper half, and use a mask to choose between the two results on a per-element basis.
    auto tLow = _mm512_permutex2var_epi8(p0, x, p1);   // lookup in LUT bytes 0..127
    auto tHigh = _mm512_permutex2var_epi8(p2, x, p3);  // lookup in LUT bytes 128..255
You can use _mm512_movepi8_mask to load the mask register. That instruction marks a lane as active if the high bit of its byte is set, which perfectly matches our table split (the high bit is set exactly when the index is >= 128). You could use the mask register directly on the second shuffle instruction or in a later blend instruction; it doesn't really matter.
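For example, with an explicit blend (a sketch continuing the snippet above; x is the current source vector, and dest + i stands in for wherever the loop stores its output):

    __mmask64 hi = _mm512_movepi8_mask(x);                // active where src byte >= 128
    __m512i r = _mm512_mask_blend_epi8(hi, tLow, tHigh);  // take tHigh in active lanes
    _mm512_storeu_si512(dest + i, r);                     // one 64-byte store per iteration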
For every 64 bytes, the AVX512 version does one load and one store, plus two permutes, which Zen5 can execute at 2 per cycle. So 64 elements per cycle.
So our theoretical speedup here is ~32x over the scalar code! You could pull tricks like this with SSE and pshufb, but there the lookup table (16 bytes) is too small to be really useful. Being able to do an arbitrary, super-fast byte-to-byte transform is incredibly useful.
One piece of feedback, from someone interested in using this, about the examples: I have looked at several, and they seem too high-level to give a sense of the actual API (i.e. the expected benefit of using this library versus the development complexity of adopting it).
For example, the cloth bending simulation is almost entirely: in __init__, call a function to add a cloth mesh to a model builder object, then pass the built model to the initializer of a solver class; and at each timestep, call a collide-model function followed by solver.step. That's really it.
"Earlier this year, World Liberty, the crypto firm run by the Trumps and Witkoffs, announced an agreement with an investment firm backed by the ruling family of the U.A.E. The Emirati firm would conduct a $2 billion transaction using World Liberty’s digital coins, a deal that would provide a windfall to the Trump and Witkoff families."
One of NYT's recent podcasts (The Daily) covered this: basically, the Biden administration was reluctant to give the UAE access to Nvidia chips because of the UAE's close dealings with China. Two weeks after this crypto investment, the White House agreed to give the UAE access to the chips.
I'm a big, big fan of the Acquired podcast; I've listened to every single one of their episodes.
But I remember their entire run of Microsoft episodes feeling like a lengthy defense of Steve Ballmer. There were too many instances of "here's why this bad decision of Steve's made sense given the circumstances" or "here's how people underestimate Steve's contribution to this good decision." They were all well-argued points, of course, but so numerous that I found myself wondering whether the hosts have a relationship with Steve.
The existence of this interview does not help with that suspicion.
I think the big thing is that Steve did make a lot of great decisions, some of the best the company could have made at the time in those respective fields, but he completely missed on everything that Apple did: portable media players, smartphones, and tablets. Those are the three huge misses, and that is really where it counted.
The old three-envelope joke:
You become CEO and there are three envelopes on your office desk, with a note that says "Every time there is an issue, open them in order and do what is inside." The first envelope says "Blame your predecessor." The second says "Blame yourself." The third says "Prepare three envelopes."
This is the problem with podcasts, and with modern media in general. You have to play softball or be ideologically aligned with your subjects to get access. Anything else is a net negative for a variety of reasons.
I haven't listened to all their MSFT coverage but it's possible they genuinely feel Ballmer's gotten a worse rap than he deserves and they're trying to contextualize some of the decisions and circumstances.
Yes, but as a sibling comment says, there's that thing about softball.
I think Ballmer was better than how he was perceived, so I did expect some justification in their MS episodes. But these points seemed, to use your word, numerous. I think they must have done this because, in preparation for those MS episodes, they talked to Ballmer and expected him to listen to the episodes. By comparison, their take on Bernard Arnault in episodes like LVMH and Hermès seemed somewhat balanced.
new_box.mins = _mm_min_ps(a.mins[3], b.mins[3]);