If you are from the ML/data science world, the analogy that finally unlocked FFT for me is dimensionality reduction using Principal Component Analysis. In both cases, you project the data onto a new, "better" coordinate system ("time to frequency domain"), filter out the basis vectors that carry little variance ("ignore high-frequency waves"), and project the data back to the original space from those truncated dimensions ("IFFT: inverse transform back to the time domain").
Of course, some differences exist (e.g. the basis vectors are fixed in FFT, unlike in PCA).
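To make the parallel concrete, here is a minimal sketch of that transform/truncate/inverse-transform pipeline (my own illustration, using a naive O(N^2) DFT instead of an actual FFT purely for clarity; the parameter keep plays the role of the number of retained components):

    #include <algorithm>
    #include <cmath>
    #include <complex>
    #include <cstddef>
    #include <vector>

    std::vector<double> lowpass(const std::vector<double>& x, std::size_t keep) {
        const std::size_t n = x.size();
        const double pi = std::acos(-1.0);
        // "Project to the frequency domain": forward DFT.
        std::vector<std::complex<double>> X(n);
        for (std::size_t k = 0; k < n; k++)
            for (std::size_t t = 0; t < n; t++)
                X[k] += x[t] * std::polar(1.0, -2.0 * pi * double(k) * double(t) / double(n));
        // "Filter out the low-variance basis vectors": zero every frequency
        // above keep (min(k, n-k) is the distance from DC, which accounts
        // for the mirrored upper half of the spectrum).
        for (std::size_t k = 0; k < n; k++)
            if (std::min(k, n - k) >= keep) X[k] = 0.0;
        // "Project back to real space": inverse DFT, keeping the real part.
        std::vector<double> y(n);
        for (std::size_t t = 0; t < n; t++) {
            std::complex<double> s = 0.0;
            for (std::size_t k = 0; k < n; k++)
                s += X[k] * std::polar(1.0, 2.0 * pi * double(k) * double(t) / double(n));
            y[t] = s.real() / double(n);
        }
        return y;
    }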
I'd love to see a breakdown of what exactly worked here, or better yet, a PR upstream to Abseil that implements those ideas.
AI is always good at going from 0 to 80%; it's the last 20% it struggles with. It'd be interesting to see Claude-written code make its way into a well-established library.
I'm curious what you did with the "active sorting range" after a push/pop event. Since it's a vector underneath, I don't see any option other than restoring order across the entire range after each event, which is O(N). Wouldn't that destroy performance?
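For reference, this is roughly the per-event cost I'd expect if it really is a plain sorted std::vector underneath (a hypothetical sketch, not the library's actual code; sorted_push is my name, not theirs):

    #include <algorithm>
    #include <vector>

    // Hypothetical sketch: keeping a std::vector sorted across push events
    // without a full re-sort. The position is found in O(log N), but the
    // insert still shifts the tail of the vector, so each event costs O(N).
    void sorted_push(std::vector<int>& v, int value) {
        auto pos = std::lower_bound(v.begin(), v.end(), value);
        v.insert(pos, value);
    }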
> To Koreans, they looked more like sauce bowls, leading them to conclude that the Japanese had starved themselves to stretch out the siege.
As a Bengali man, that's exactly how I felt when I came to the USA and first visited Japanese restaurants. Part of the reason we consume so much rice is that rice is the main dish (not a side): it literally sits at the center of your plate and takes up most of the space on it.
A typical Japanese person will devour their small rice bowl until not a single grain of rice is left over, since they're taught from a very young age not to waste food.
People in most other Asian nations will not eat their rice down to the last grain. Even with the most delicious biryani, there are always many grains of rice left on the plate. I think the small bowl makes it much easier to consume the rice completely than a big bowl or plate does.
The Japanese mostly eat sticky rice, which is very easy to eat and "clean up", even with chopsticks.
The Indian subcontinent eats long-grain Basmati or similar rice, which fluffs up into individual grains on the plate. It doesn't make sense to pick out single leftover grains one by one.
Nearly every culture has an idea like "Annapurna", the god of food, and wasting food is generally frowned upon and considered bad table manners. I was scolded plenty of times as a child in Nepal for not cleaning my plate.
I wouldn't attribute it to small bowls, at least. The Japanese instilling good virtues into their children almost institutionally perhaps plays some part, but some of it is also just physics.
Having had grandparents live through WWII (or any other war, to be fair) also helps instill this attitude. I can barely imagine what kind of famines they had to endure.
Sure... in detail, and abstracted slightly, the byte table problem:
Maybe you're remapping RGB values [0..255] with a tone curve in graphics, doing a lookup mapping IDs to indexes in a set, or applying a permutation table, or... well, there are a lot of use cases, right? This is essentially an arbitrary function lookup where both the domain and range are bytes.
It looks like this in scalar code:
    #include <stdint.h>

    typedef uint8_t byte;

    void transform_lut(byte* dest, const byte* src, int size, const byte* lut) {
        for (int i = 0; i < size; i++) {
            dest[i] = lut[src[i]];  // two loads (src byte, table entry), one store
        }
    }
The function above is basically load/store limited: it does negligible arithmetic, just loading a byte from the source, using that to index a load from the table, and then storing the result to the destination. So two loads and a store per element. Zen5 has 4 load pipes and 2 store pipes, so our CPU can do two elements per cycle in scalar code. (Zen4 has only 1 store pipe, so 1 element per cycle there.)
Here's a snippet of the AVX512 version.
You load the 256-byte lookup table into 4 registers outside the loop:
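Something like this (my sketch; I'm assuming the table pointer is named lut, and p0..p3 match the snippet below):

    __m512i p0 = _mm512_loadu_si512(lut);        // LUT bytes 0..63
    __m512i p1 = _mm512_loadu_si512(lut + 64);   // LUT bytes 64..127
    __m512i p2 = _mm512_loadu_si512(lut + 128);  // LUT bytes 128..191
    __m512i p3 = _mm512_loadu_si512(lut + 192);  // LUT bytes 192..255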
Then, for each SIMD vector of 64 elements, use each lane's value as an index into the lookup table, just like the scalar version. Since a single permute can only index across 128 bytes (two registers), we DO have to do it twice, once for the lower half of the table and again for the upper half, and use a mask to choose between the two results on a per-element basis.
    auto tLow = _mm512_permutex2var_epi8(p0, x, p1);   // lookup in LUT bytes 0..127
    auto tHigh = _mm512_permutex2var_epi8(p2, x, p3);  // lookup in LUT bytes 128..255
You can use _mm512_movepi8_mask to load the mask register. That instruction marks a lane as active if the high bit of its byte is set, which perfectly matches our table split (the high bit is set exactly when the index is >= 128). You could use the mask register directly on the second shuffle instruction or in a later blend instruction; it doesn't really matter.
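For example, with an explicit blend (a sketch continuing the snippet above; x is the current source vector, and dest + i stands in for wherever the loop stores its output):

    __mmask64 hi = _mm512_movepi8_mask(x);                // active where src byte >= 128
    __m512i r = _mm512_mask_blend_epi8(hi, tLow, tHigh);  // take tHigh in active lanes
    _mm512_storeu_si512(dest + i, r);                     // one 64-byte store per iteration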
For every 64 bytes, the AVX512 version does one load and one store, plus two permutes, which Zen5 can execute at 2 per cycle. So 64 elements per cycle.
So our theoretical speedup here is ~32x over the scalar code! You could pull tricks like this with SSE and pshufb, but there the lookup table (16 bytes) is too small to be really useful. Being able to do an arbitrary, super-fast byte-to-byte transform is incredibly useful.
One piece of feedback, from someone interested in using this, about the examples: I have looked at several, and they seem too high-level to give a sense of the actual API (i.e. the expected benefit of using this library versus the development complexity of adopting it).
For example, the cloth bending simulation is almost entirely: in __init__, call a function to add a cloth mesh to a model builder object, then pass the built model to the initializer of a solver class; and at each timestep, call a collide-model function followed by solver.step. That's really it.
"Earlier this year, World Liberty, the crypto firm run by the Trumps and Witkoffs, announced an agreement with an investment firm backed by the ruling family of the U.A.E. The Emirati firm would conduct a $2 billion transaction using World Liberty’s digital coins, a deal that would provide a windfall to the Trump and Witkoff families."
One of NYT's recent podcasts (The Daily) covered this: basically, the Biden administration was reluctant to give the UAE access to Nvidia chips because of the UAE's close dealings with China. Two weeks after this crypto investment, the White House agreed to give the UAE access to the chips.
I'm a big, big fan of the Acquired podcast; I've listened to every single one of their episodes.
But I remember their entire run of Microsoft episodes feeling like a lengthy defense of Steve Ballmer. There were too many instances of "here's why this bad decision of Steve's made sense given the circumstances" or "here's how people underestimate Steve's contribution to this good decision." They were all well-argued points, of course, but so numerous that I found myself wondering whether the hosts have a relationship with Steve.
The existence of this interview does not help with that suspicion.
I think the big thing is that Steve did make a lot of great decisions, some of the best the company could have made at the time in those respective fields, but he completely missed on everything that Apple did: portable media players, smartphones, and tablets. Those are the three huge misses, and that is really where it counted.
The old three-envelope joke:
You become CEO and there are three envelopes on your office desk, with a note that says "Every time there is an issue, open them in order and do what is inside." The first envelope says "Blame your predecessor." The second says "Blame yourself." The third says "Prepare three envelopes."
This is the problem with podcasts, and with modern media in general. You have to play softball or be ideologically aligned with your subjects to get access. Anything else is a net negative for a variety of reasons.
I haven't listened to all their MSFT coverage but it's possible they genuinely feel Ballmer's gotten a worse rap than he deserves and they're trying to contextualize some of the decisions and circumstances.
Yes, but as a sibling comment says, there's that thing about softball.
I think Ballmer was better than how he was perceived, so I did expect some justification in their MS episodes. But these points seemed, to use your word, numerous. I think they must have done this because, in preparation for those MS episodes, they talked to Ballmer and expected him to listen to the episodes. By comparison, their take on Bernard Arnault in episodes like LVMH and Hermès seemed somewhat balanced.
new_box.mins = _mm_min_ps(a.mins[3], b.mins[3]);