This won't materialize into a legitimate threat to the NVIDIA/TPU landscape without enormous software investment; that software is why NVIDIA won in the first place. It requires executives who can see past the hardware and make riskier investments, and we'll see whether that actually happens under AWS management.
IMO the value of COTS software-stack compatibility is becoming overstated. Academics, small research groups, hobbyists, and some enterprises will rely on commodity software stacks working well out of the box, but large pure/"frontier"-AI inference-and-training companies are already hand-optimizing things anyway, and many less dedicated enterprise customers are happy to use managed engines (like Bedrock) and operate only at that higher level.
I do think AWS needs to improve its software to capture more downmarket traction, but my understanding is that even Trainium2, with virtually no public software support, was financially successful for Anthropic as well as for scaling AWS Bedrock workloads.
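To make the "operate at the higher level" point concrete, here's a minimal sketch of what a Bedrock-only integration looks like with boto3 (the model ID and prompt are illustrative): the customer never touches, or even knows about, the silicon underneath.

```python
import json
import boto3

# A Bedrock customer never sees the accelerator: the same call runs
# whether the model is served from Trainium, Inferentia, or GPUs.
client = boto3.client("bedrock-runtime", region_name="us-east-1")

response = client.invoke_model(
    modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",  # illustrative model ID
    body=json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 256,
        "messages": [{"role": "user", "content": "Summarize our Q3 report."}],
    }),
)

print(json.loads(response["body"].read())["content"][0]["text"])
```

For a customer at this layer, CUDA compatibility is simply not part of the purchasing decision.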
Ease of optimization at the architecture level is what matters at the bleeding edge; a pure-AI organization will have teams of optimization and compiler engineers mining for tricks to wring performance out of the hardware.
Hyperscalers do not need to achieve parity with Nvidia. There's (let's say) 50% of headroom in Nvidia's profit margins, and plenty of headroom in complexity: custom chip efforts don't need the generality of Nvidia's parts. If a simple architecture lets them run inference at 50% of the TCO with a fifth of the complexity, and cuts their Nvidia bill by 70%, that's a solid win. I'm being fast and loose with the numbers, and Trainium clearly has ambitions beyond inference, but given the hundreds of billions each cloud vendor is investing in the AI buildout, a couple of billion on IP you own afterwards is a no-brainer. Nvidia has good products and a solid head start, but they're not unassailable or anything.
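Running those admittedly fast-and-loose numbers as a back-of-envelope calculation (every figure here is the illustrative one from the paragraph above, not real pricing):

```python
# Back-of-envelope using the illustrative numbers above -- not real pricing.
nvidia_bill = 100e9        # hypothetical annual Nvidia spend, $
custom_tco_ratio = 0.50    # custom chip serves inference at 50% of Nvidia TCO
share_shifted = 0.70       # fraction of the Nvidia bill moved to custom silicon

shifted = nvidia_bill * share_shifted
savings = shifted * (1 - custom_tco_ratio)   # 0.70 * 0.50 = 35% of the old bill
program_cost = 2e9                           # "a couple billion on IP"

print(f"Annual savings: ${savings / 1e9:.0f}B "
      f"vs a roughly one-time ${program_cost / 1e9:.0f}B chip program")
# Annual savings: $35B vs a roughly one-time $2B chip program
```

Even if the real ratios are half as favorable, the program pays for itself within a year at hyperscaler spend levels.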
> In fact, they are conducting a massive, multi-phase shift in software strategy. Phase 1 is releasing and open sourcing a new native PyTorch backend. They will also be open sourcing the compiler for their kernel language called “NKI” (Neuron Kernel Interface) and their kernel and communication libraries for matmul and ML ops (analogous to NCCL, cuBLAS, cuDNN, Aten Ops). Phase 2 consists of open sourcing their XLA graph compiler and JAX software stack.
> By open sourcing most of their software stack, AWS will help broaden adoption and kick-start an open developer ecosystem. We believe the CUDA Moat isn’t constructed by the Nvidia engineers that built the castle, but by the millions of external developers that dig the moat around that castle by contributing to the CUDA ecosystem. AWS has internalized this and is pursuing the exact same strategy.
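For a flavor of what NKI code looks like, here is a minimal element-wise add kernel in the style of AWS's published NKI examples. The exact API surface (the `@nki.jit` decorator, the `nl` language module, tile shapes) varies across Neuron SDK releases, so treat this as a sketch rather than canonical code:

```python
import neuronxcc.nki as nki
import neuronxcc.nki.language as nl

@nki.jit
def tensor_add_kernel(a_input, b_input):
    # Allocate the output tensor in device HBM.
    c_output = nl.ndarray(a_input.shape, dtype=a_input.dtype, buffer=nl.shared_hbm)

    # Index a 128 x 512 tile; 128 matches the on-chip SBUF partition count.
    ix = nl.arange(128)[:, None]
    iy = nl.arange(512)[None, :]

    # Explicitly stage tiles from HBM into on-chip SBUF, compute, store back.
    a_tile = nl.load(a_input[ix, iy])
    b_tile = nl.load(b_input[ix, iy])
    nl.store(c_output[ix, iy], value=a_tile + b_tile)
    return c_output
```

This explicit control over tile movement between HBM and on-chip memory is exactly the layer where the "mining for tricks" mentioned upthread happens, and it's what AWS is now opening up.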
I wish AWS all the best, but I will say that their developer-facing software doesn't have the best track record. Munger-esque "show me the incentive and I'll show you the outcome" and all that, but I don't think they're well positioned to collect actionable insight from open GitHub repos.
In terms of their seriousness, word on the street is that they are moving from custom chips they could be getting from Marvell over to a company I'd never heard of. So they are making decisions that look serious in this direction:
> With Alchip, Amazon is working on "more economical design, foundry and backend support" for its upcoming chip programs, according to Acree.