> I only ask because I've been running local models (using Ollama) on my RX 7900 XTX for the last year and a half or so and haven't had a single problem that was ROCm specific that I can think of.
It's probably using the Vulkan backend, which is pretty stable and has good performance.
Small models aren't entirely useless, and the NPU can run LLMs up to around 8B parameters from what I've seen. So here's one way they could be useful: Qwen3's text-to-speech models are all under 2B parameters, and OpenAI's Whisper small speech-to-text model is under 1B parameters. You could therefore build an AI agent that you could talk to and that could talk back, where, in theory, all audio-to-text and text-to-audio processing is offloaded to the low-power NPU, leaving the GPU free to do all of the LLM processing.
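A minimal sketch of that pipeline, assuming the openai-whisper and ollama Python packages; the audio file name and model tag are placeholders, and actually pinning a model to an NPU would depend entirely on the vendor's runtime:

```python
import whisper   # pip install openai-whisper
import ollama    # pip install ollama

# Speech-to-text: Whisper "small" is under 1B parameters, so in principle
# it's the kind of model a low-power NPU could host. Here it simply loads
# on the default device.
stt = whisper.load_model("small")
user_text = stt.transcribe("query.wav")["text"]  # "query.wav" is a placeholder

# The LLM itself stays on the GPU, served by Ollama.
reply = ollama.chat(
    model="qwen3:8b",  # assumed tag; substitute whatever model you run
    messages=[{"role": "user", "content": user_text}],
)["message"]["content"]

print(reply)
# A sub-2B TTS model would then synthesize `reply` back to audio,
# again a candidate for the NPU rather than the GPU.
```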
That seems like a really niche use case, and probably not worth the surface area? The power savings would have to be truly astonishing to justify it, given what a small fraction of its compute time the average device spends processing voice input. I'd wager the 90th-percentile Siri/OK Google/whatever user issues fewer than 10 voice queries per day. How much power can those queries use running on normal hardware, and how much could it possibly matter?
It's just an example where it fits perfectly, and it's exactly what something like Alexa or Google Home needs for low-power machine learning: while sitting idle, it needs to consume as little power as possible while waiting for a trigger word.
Any context that needs some limited intelligence while consuming little power would benefit from this.
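A toy sketch of that idle loop; `read_audio_frame` and `npu_detect` are hypothetical stubs standing in for whatever the platform actually provides:

```python
import time

def read_audio_frame():
    """Stub for an always-on, near-zero-cost microphone buffer."""
    return b"\x00" * 320

def npu_detect(frame, phrase):
    """Stub for a tiny keyword-spotting model running on the NPU."""
    return False

# While idle, only the cheap path runs; the expensive hardware (GPU,
# network, big model) is woken only after the trigger fires.
while True:
    if npu_detect(read_audio_frame(), "hey device"):
        print("wake: hand off to the full pipeline")
        break
    time.sleep(0.02)  # ~50 frames per second of cheap polling
```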
You could always offload some layers to the NPU for lower power use and leave the rest to the GPU. If the latter is power-throttled (common during prefill, not during decode), that would be a performance improvement.
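There's no standard NPU knob that I know of, but the GPU analogue of that layer split is easy to show with llama-cpp-python (the model path is a placeholder):

```python
from llama_cpp import Llama  # pip install llama-cpp-python

# Keep the first 20 transformer layers on the accelerator and the rest on
# the CPU; an NPU backend, where one exists, would expose a similar split.
llm = Llama(model_path="model.gguf", n_gpu_layers=20)

out = llm("Why offload layers at all?", max_tokens=64)
print(out["choices"][0]["text"])
```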
You want routing to be as quick as possible, because the loads of MoE expert weights downstream of it (at least from CPU in most setups, potentially from storage) depend on its output. So it ultimately depends on what the bottleneck in that part of the model is: compute, memory throughput, or both. If it's throughput, the NPU might be a bad fit.
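To make the dependency concrete, here's a toy top-k MoE routing sketch in PyTorch; shapes and names are illustrative, not any particular model's implementation:

```python
import torch

def moe_layer(x, router, experts, k=2):
    # The router is a tiny matmul, but nothing downstream can start until
    # it finishes: which expert weights to fetch depends on its output.
    probs = torch.softmax(router(x), dim=-1)    # [tokens, n_experts]
    weight, idx = torch.topk(probs, k, dim=-1)  # gate for the dependent loads
    out = torch.zeros_like(x)
    for t in range(x.shape[0]):
        for j in range(k):
            e = idx[t, j].item()
            # In an offloaded setup, experts[e] may be paged in from CPU
            # RAM or storage right here, on the critical path.
            out[t] += weight[t, j] * experts[e](x[t])
    return out

# Example wiring: 8 experts of width 64, 4 tokens.
d, n = 64, 8
y = moe_layer(torch.randn(4, d),
              torch.nn.Linear(d, n),
              [torch.nn.Linear(d, d) for _ in range(n)])
```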
From what I understand, the ROCm 7.x series is a lot buggier and has performance regressions on a lot of GPUs. Vulkan performance for LLMs is apparently not far behind ROCm, and Vulkan is far more stable and predictable at this time.
> Thiel's book has influenced so many entrepreneurs into believing Monopolies are Good.
Haven't read his book, but the argument that monopolies are good isn't typically made in a vacuum; it's made relative to alternatives, most often "ham-fisted government intervention". It's easier to take down a badly behaving monopoly than to change a government, so believing monopolies are better than the alternatives seems like a decent heuristic.
>Haven't read his book, but the argument that monopolies are good isn't typically made in a vacuum; it's made relative to alternatives, most often "ham-fisted government intervention". It's easier to take down a badly behaving monopoly than to change a government, so believing monopolies are better than the alternatives seems like a decent heuristic.
What? How is the first alternative poor government rather than a multipolar field of competing companies? When was the last time a monopoly was actually broken up in the US? AT&T/Bell, roughly 40 years ago? lol
A monopoly implies an organization powerful enough to stop competition, so a solution that relies on competitors seems fatally flawed: if there are enough competitors to meaningfully compete, then there isn't a monopoly.
This seems like a classic straw man argument. Plutocratic oligarchs have been making the argument that private monopolies are better than representative democracy at basically any societal function for decades without any actual data.
> and they will spend infinite amounts of circular fake money to ensure hardware remains prohibitively expensive forever.
That's ridiculous; "infinite money" isn't a thing. They will spend as much as they can, not because they want to keep local solutions out, but because it enables them to provide cheaper services and capture more of the market. We all eventually benefit from that.
> That's ridiculous, "infinite money" isn't a thing.
My reading of GP is that he was being sarcastic - "infinite amounts of circular fake money" is probably a reference to these circular deals going on.
If A hands B a $100 investment, and B then hands A $100 to purchase hardware, then A's equity in B is, on paper, $100, plus A has $100 of revenue (from B), which gives A $200 of total assets.
Obviously it has to be shuffled more thoroughly, but that's the basic idea that I thought GP was referring to.
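The toy version of that accounting, with made-up numbers:

```python
investment = 100         # A invests $100 in B
hardware_purchase = 100  # B spends that same $100 buying hardware from A

a_equity_in_b = investment        # A books a $100 stake in B
a_revenue = hardware_purchase     # A books $100 of revenue from B
print(a_equity_in_b + a_revenue)  # 200, though only $100 ever existed
```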
> it's not clear to me based on the description how this could all be done efficiently.
Depends on how you define efficiency. The power use of this rig is a lot less than that of the large data centers that serve trillion-parameter models, and the page suggests that the final dollar cost per request is an order of magnitude lower than what the frontier models charge.
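One way to sanity-check a cost-per-request claim like that; every number below is a made-up placeholder, only the arithmetic matters:

```python
watts = 800              # hypothetical draw of the local rig under load
kwh_price = 0.15         # hypothetical electricity price in $/kWh
requests_per_hour = 60   # hypothetical request rate

energy_cost = (watts / 1000) * kwh_price / requests_per_hour
print(f"${energy_cost:.4f} per request")  # $0.0020 with these numbers
```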