I knew a guy who worked at MS when they were developing the Barney doll. He signed up to beta/play-test the doll since he had a son in the target age range.
He left work on Friday with the new Barney doll.
When he came into work on the following Monday, he told his co-workers, "Looks like I'm going to HAVE to get a Barney doll for my son when they're released."
The power of Barney...
He also mentioned that when all the Actimates dolls and other consumer-related products were released, the internal-only Microsoft store looked like a techie-version of FAO Schwarz instead of a Microsoft-leaning Egghead software store.
The bean counters at MS killed a lot of product ideas when they set a high revenue bar for any potential new product - as if anyone could predict that stuff accurately.
What's funny is that in the internal Microsoft Store you could buy a boxed copy of SQL Server Enterprise Edition for next to nothing, but a Microsoft Teddy Bear cost a fortune. I used all my Microsoft Dollars on stupid toys from the store. Their teddy bears were the softest ever. I lost them to First Wife in the divorce.
The code in question has to process a string of variable length.
But the compiler/CPU can process bytes one at a time or much faster in groups. The code is trying to process as much as possible in groups of 128.
But since the caller can pass in a string which is not a multiple of 128 chars, the first for-loop (& 127) figures out how much of the string to process so that the remaining length is a multiple of 128.
The second for-loop divides the remaining length by 128 (>> 7) to find out how many groups of 128 there are to process. The inner for-loop processes 128 chars looking for 's' chars.
Now the for-loop within a for-loop doesn't look any faster than the plain single for-loop, but I'd assume that certain compilers' optimization heuristics can recognize that the fixed-size inner loop can be turned into code that operates on multiple chars at the same time (SIMD instructions), since the result of each iteration is independent of the others.
On a compiler that cannot generate SIMD code, this won't be much faster, if at all, than the naive straightforward version.
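A minimal sketch of that shape, assuming the goal is counting 's' bytes in a buffer (the name count_s and the exact loop bounds are mine, not the original code):

    #include <stddef.h>

    /* Count 's' bytes: peel off the (len & 127) remainder first, then
       process the rest in groups of 128 with an inner loop the compiler
       is free to vectorize. */
    size_t count_s(const char *p, size_t len)
    {
        size_t total = 0;

        /* First loop: handle the remainder so what's left is a multiple of 128. */
        size_t head = len & 127;
        for (size_t i = 0; i < head; i++)
            total += (p[i] == 's');
        p += head;

        /* Second loop: len >> 7 groups of 128 chars each. */
        size_t groups = (len - head) >> 7;    /* same as len >> 7 */
        for (size_t g = 0; g < groups; g++, p += 128) {
            size_t block = 0;
            for (size_t i = 0; i < 128; i++)  /* inner loop: each char is independent */
                block += (p[i] == 's');
            total += block;
        }
        return total;
    }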
I assume the M1's SIMD registers are wider/more numerous than just the couple of size_t registers used for the loading/masking/accumulating inner loop in your run_swtches().
You can speed up the code by unrolling your inner loop a few times (try 4x or 8x). It does mean your overflow-prevention limit is lowered (to a multiple of the unroll factor) and that step runs a few more times, but the speedup offsets the increased bookkeeping.
A version I played with ran faster when I saved the in-progress accumulations in an array and did the final accumulation after the main loop was done. But that may be due to the CPU arch/compiler I'm using.
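Very roughly what I mean, as a hedged sketch (4x unroll, plain byte comparisons instead of the real inner loop, illustrative names only):

    #include <stddef.h>

    size_t count_s_unrolled(const char *p, size_t len)
    {
        /* Independent partial accumulators; combined only after the main loop. */
        size_t acc[4] = {0, 0, 0, 0};
        size_t i = 0;

        /* Main loop unrolled 4x: the four chains don't depend on each other. */
        for (; i + 4 <= len; i += 4) {
            acc[0] += (p[i + 0] == 's');
            acc[1] += (p[i + 1] == 's');
            acc[2] += (p[i + 2] == 's');
            acc[3] += (p[i + 3] == 's');
        }

        /* Final accumulation, plus the leftover bytes. */
        size_t total = acc[0] + acc[1] + acc[2] + acc[3];
        for (; i < len; i++)
            total += (p[i] == 's');
        return total;
    }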
If this code only runs on one compiler version/CPU arch, then ASSUMING the compiler will do the RIGHT THING and auto-vectorize the code is okay.
But if your code will be cross-platform and run on different OSes/CPU archs, then a SWAR version may be more consistently performant - no need to guess whether the compiler's optimization heuristics decided to go with the general-purpose CPU registers or the faster SIMD registers.
The downside is that the devs are exposed to the gnarly optimized code.
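For reference, a rough 64-bit SWAR sketch of what that gnarliness looks like - counting 's' bytes eight at a time with ordinary integer arithmetic (illustrative code, not the code from the post):

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    #define ONES  0x0101010101010101ULL
    #define HIGHS 0x8080808080808080ULL

    size_t count_s_swar(const char *p, size_t len)
    {
        const uint64_t pattern = ONES * (unsigned char)'s';  /* 's' replicated into every byte */
        size_t total = 0, i = 0;

        for (; i + 8 <= len; i += 8) {
            uint64_t w;
            memcpy(&w, p + i, sizeof w);       /* alignment-safe word load */
            uint64_t x = w ^ pattern;          /* matching bytes become 0x00 */
            /* m gets 0x80 in exactly the byte positions where x is 0x00. */
            uint64_t m = ~(((x & ~HIGHS) + ~HIGHS) | x | ~HIGHS);
            total += ((m >> 7) * ONES) >> 56;  /* horizontal sum of the 0/1 byte flags */
        }

        for (; i < len; i++)                   /* leftover tail bytes */
            total += (p[i] == 's');
        return total;
    }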
Almost the same as my SWAR version - which is what you're doing.
But aren't you reading off the end of the buffer in your memcpy(&w...)? Say with an empty input string whose start address is aligned to sizeof(size_t) bytes?
I just passed in the string length since the caller had that info, otherwise you'd scan the whole string again looking for the zero terminator.
> But aren't you reading off the end of the buffer in your memcpy(&w...)?
If we go by the absolute strictest interpretation of the C standard, my above implementation is UB.
But in practice, if p is word-aligned and valid for at least 1 byte, then you will not page-fault when reading a whole word. In fact, this is how GCC/musl implement strlen itself.
> Say with an empty input string whose start address is aligned to sizeof(size_t) bytes?
Then the start address is valid (it must contain the null byte), and aligned to a word boundary, in which case I assume it is ok to also read a whole word there.
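Roughly the shape of what musl does, as a simplified sketch (the pointer cast and the word-sized over-read are still UB by the letter of the standard, but an aligned load never crosses a page boundary):

    #include <stddef.h>
    #include <stdint.h>

    #define ONES       ((size_t)-1 / 0xff)          /* 0x0101...01 */
    #define HIGHS      (ONES * 0x80)                /* 0x8080...80 */
    #define HASZERO(x) (((x) - ONES) & ~(x) & HIGHS)

    size_t my_strlen(const char *s)
    {
        const char *p = s;

        /* Byte-by-byte until p reaches a word boundary. */
        for (; (uintptr_t)p % sizeof(size_t); p++)
            if (!*p)
                return p - s;

        /* Aligned word loads: stop at the first word containing a zero byte. */
        const size_t *w = (const size_t *)p;
        while (!HASZERO(*w))
            w++;

        /* Locate the exact terminator within that word. */
        for (p = (const char *)w; *p; p++)
            ;
        return p - s;
    }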
My SWAR version does almost exactly what your vectorization description does - the SWAR code just looks rather gnarly because the compiler isn't auto-generating the vector code for you: it's hand-coded in C by me, and I'm limited to 64 bits at a time.
If I unroll the 64-bit SWAR version by 8x instead of 4x, the runtime is reduced by another 10% over the 4x-unrolled SWAR version. Diminishing returns...
I haven't heard anyone disagree. Praising intelligence instead of effort is incredibly pernicious and foolish. I can speak from my own life experience.
Looking at the nutritional info for Soylent and Vite Ramen shows that they also contain Vitamin D - at the same DRV percentages.
If you're getting 105% of your magnesium from these items, then you're also getting 105% of your Vitamin D before any additional Vitamin D supplements or sun exposure.
105% of the 800 IU daily reference value is about 840 IU, so adding the 5000 IU of Vitamin D3 would bring your intake up to 5840 IU (730% of DRV) - assuming you didn't ingest anything else fortified with Vitamin D.
If you are missing, then authorities usually like to have a recent photo of you to aid in the search.
I don't see an explicit reference to "recent image of me" in your "First step" list of data - photos in legal docs/credentials may not accurately represent your current physical appearance.