I can see this happening for things that run in entirely managed environments bu...

sitkack · on Jan 19, 2021

Very little user code generates binaries that can _tell_ it is running on non-x86 hardware. Rust is Arm Memory Model safe, existing C/C++ code that targets the x86 memory model is slowly getting ported over, but unless you are writing multithreaded C++ code that cuts corners it isn't an issue.

Running on the JVM, Ruby, Python, Go, Dlang, Swift, Julia or Rust and you won't notice a difference. It will be sooner than you think.

mhh__ · on Jan 19, 2021

It's not the memory model I'm thinking of but the cache design, ROB size etc.

Obviously this is fairly niche but the friction to making something fast is hugely easier locally.

sitkack · on Jan 19, 2021

The vast majority of developers never profile their code. I think this is much less of an issue than anyone on HN would rank it. Only when the platform itself provides traces do they take it into consideration. And even then, I think most perf optimization is in a category of don't do the obviously slow thing, or the accidentally n^2 thing.

I partially agree with you though, as the penetration of Arm goes deeper into the programmer ecosystem, any mental roadblocks about deploying to Arm will disappear. It is a mindset issue, not a technical one.

In the 80s and 90s there were lots of alternative architectures and it wasn't a big deal, granted the software stacks were much much smaller and more metal. Now they are huge, but more abstract and farther away from machine issues.

jerf · on Jan 19, 2021

"The vast majority of developers never profile their code."

Protip: New on the job and want to establish a reputation quickly? Find the most common path and fire a profiler at it as early as you can. The odds that there's some trivial win that will accelerate the code by huge amounts is fairly decent.

Another bit of evidence developers rarely profile their code is that I can tell my mental model of how expensive some server process will be to run and most other developer's mental models tend to differ by at least an order of magnitude. I've had multiple conversations about the services I provide and people asking me what my hardware is, expecting it to be run on some monster boxes or something when I tell them it's really just two t3.mediums, which mostly do nothing, and I only have two for redundancy. And it's not like I go profile crazy... I really just do some spot checks on hot-path code. By no means am I doing anything amazing. It's just... as you write more code, the odds that you accidentally write something that performs stupidly badly goes up steadily, even if you're trying not to.

nicoburns · on Jan 19, 2021

> Find the most common path and fire a profiler at it as early as you can. The odds that there's some trivial win that will accelerate the code by huge amounts is fairly decent.

I've found that a profiler isn't even needed to find significant wins in most codebases. Simple inspection of the code and removal of obviously slow or inefficient code paths can often lead to huge performance gains.

mattbee · on Jan 19, 2021

I mean I love finding those "obvious" improvements too but how do you know you've succeeded without profiling it? ;)

foobiekr · on Jan 20, 2021

Every piece of code I’ve looked at in my current job is filled with transformations back and forth between representations.

It’s so painful to behold.

Binary formats converted to JSON blobs, each bigger than my first hard drive (!), and then back again, often multiple times in the same process.

mhh__ · on Jan 19, 2021

This isn't really about you or me but the libraries that work behind the spaghetti people fling into the cloud.

vinay_ys · on Jan 19, 2021

Yes, and just like Intel & AMD spent a lot of effort/funding for building performance libraries and compilers, we should expect Amazon and Apple invest into similar efforts.

Apple will definitely give all the necessary tools as part of Xcode for iOS/MacOS software optimisation.

AWS is going to be more interesting – this is a great opportunity for them to provide distributed profiling/tracing tools (as a hosted service, obviously) for Linux that run across a fleet of Graviton instances and help you do fleet-wide profile guided optimizations.

We should also see a lot of private companies building high-performance services on AWS to contribute to highly optimized open-source libraries being ported to graviton.

fhrifjr · on Jan 19, 2021

So far I found getting started repo for Graviton with few pointers https://github.com/aws/aws-graviton-getting-started

vinay_ys · on Jan 19, 2021

What kind of pointers were you expecting?

I found it to have quite a lot of useful pointers. Specifically –https://static.docs.arm.com/swog309707/a/Arm_Neoverse_N1_Sof...

https://static.docs.arm.com/ddi0487/ea/DDI0487E_a_armv8_arm....

- these two docs gives lot of useful information.

And the repo itself contain a number of examples (like ffmpeg) that have been optimized based on these manuals.

lumost · on Jan 19, 2021

given a well designed chip which achieves competitive performance across most benchmarks, Most code will run sufficiently well for most use cases regardless of the nuance of specific cache design and sizes.

There is certainly an exception to this for chips with radically different designs and layouts, as well as folks writing very low-level performance sensitive code which can benefit from specific platform optimization ( graphics comes to mind ).

However even in the latter case, I'd imagine the platform specific and fallback platform agnostic code will be within 10-50% performance of each other. Meaning a particularly well designed chip could make the platform agnostic code cheaper on either a raw performance basis or cost/performance basis.

scythe · on Jan 19, 2021

If you use a VM language like Java, Ruby, etc, that work is largely abstracted.

tyingq · on Jan 19, 2021

True, though the work/fixes sometimes take a while to flow down. One example: https://bugs.openjdk.java.net/browse/JDK-8255351

foobiekr · on Jan 19, 2021

I honestly don’t know why you put Go or the JVM in this list. It isn’t that the language used properly has sane semantics in multithreaded code, it’s that generations of improper multithreaded code have appeared to work because the x86 memory semantics have covered up an unexpressed dependency that should have been considered incorrect.

Someone · on Jan 19, 2021

I would think the number of developers that have “that exact hardware” on their bench is extremely small (does AWS even tell you what cpu you get?)

What fraction of products deployed to the cloud even has its developers seen doing _any_ microbenchmarking?

ashtonkem · on Jan 19, 2021

Professional laptops don’t last that long, and a lot of developers are given MBPs for their work. I personally expect that I’ll get a M1 laptop from my employer within the next 2 years. At that point the pressure to migrate from x86 to ARM will start to increase.

mhh__ · on Jan 19, 2021

You miss my point - if I am seriously optimizing something I need to be on the same chip not the same ISA.

Graviton2 is a Neoverse core from Arm and it's totally separate from M1.

Besides, Apple don't let you play with PMCs easily and I'm assuming they won't be publishing any event tables any time soon so unless they get reverse engineered you'll have to do it through xcode.

ashtonkem · on Jan 19, 2021

Yes, the m1 isn’t a graviton 2. But then again the mobile i7 in my current MBP isn’t the same as the Xeon processors my code runs on in production. This isn’t about serious optimization, but rather the ability for a developer to reasonably estimate how well their code will work in prod (e.g. “will it deadlock”). The closer your laptop gets to prod, the narrower the error bars get, but they’ll never go to zero.

And keep in mind this is about reducing the incentive to switch to a chip that’s cheaper per compute unit in the cloud. If Graviton 2 was more expensive or just equal in price to x86, I doubt that M1 laptops alone would be enough to incentivize a switch.

mhh__ · on Jan 19, 2021

That's true but the Xeon cores are much easier to compare and correlate because of the aforementioned access to well defined and supported performance counters rather than Apple's holier than thou approach to developers outside the castle.

foobarian · on Jan 19, 2021

We have MBPs on our desks but our cloud are Centos Xeon machines. The problems I run into are not squeezing every last ms of performance, since it's vastly cheaper to just add more instances. The problems I care about is that some script I wrote suddenly doesn't work in production because of BSDisms, or Python incompatibilities, or old packages in brew, etc. Would be nice if Apple waved a magic wand and replaced its BSD subsystem with Centos* but I won't be holding my breath :)

* yes I know Centos is done, substitute as needed.

eloisant · on Jan 19, 2021

I just wish my employer would let me work on a Linux PC rather than a MBP, then I wouldn't have this mismatch between my machine and server...

singhrac · on Jan 19, 2021

I think this is a slightly different point from the other responses, but this not true: if I am seriously optimizing something I need ssh access to the same chip.

I don't run my production profiles on my laptop - why would I expect to compare how my i5 or i7 chip on a thermally limited MBP to how my 64 core server performs?

It's convenient for debugging to have the same instruction set (for some people, who run locally), but for profiling it doesn't matter at all.

benibela · on Jan 19, 2021

I profile in valgrind :/

lostapathy · on Jan 19, 2021

This is typical Hacker News. Yes, some people "seriously optimize" but the vast majority of software written is not heavily optimized nor is it written at companies with good engineering culture.

Most code is worked on until it'll pass QA then thrown over the wall. For that majority of people, an M1 is definitely close enough to a graviton.

mhh__ · on Jan 19, 2021

> typical hacker news

Let me have my fun!

saagarjha · on Jan 20, 2021

Instruments exposes a fair number of counters, though–what's wrong with using it?

astrange · on Jan 20, 2021

I actually recommend just using 'spindump' and reading the output in a text editor. If you just want to look through a callstack adding pretty much any UI just confuses things.

saagarjha · on Jan 20, 2021

I am currently working on a native UI to visualize spindumps :(

astrange · on Jan 20, 2021

Well try not to get the user lost in opening and closing all those call stack outline views, I'd rather just scroll in BBEdit ;)

saagarjha · on Jan 20, 2021

It’s outline views, but I’ll see if I can keep an option to scroll through text too. (Personally, a major reason why I made this was I didn’t want to scroll through text like Activity Monitor does…)

api · on Jan 19, 2021

I don't think it takes "exact" hardware. It takes ARM64, which M1 delivers. I already have a test M1 machine with Linux running in a Parallels (tech preview) VM and it works great.