Also, generally I think CoreML isn't the best option. The best solution for ORT would probably be to introduce a pure MPS provider (https://github.com/microsoft/onnxruntime/issues/21271), but given they've already bought into CoreML, the effort may not be worth the reward for the core team. Which is fair enough, as it's a pretty mammoth task.
However, one benefit of CoreML is that it's the only way for third parties to execute on the ANE (Apple Neural Engine, aka NPU). For some models the ANE can execute even faster than the GPU/MPS while consuming even less battery.
But I agree CoreML in ONNX Runtime is not perfect - most of the time when I tested models there was too much partitioning, and the whole graph ran slower compared to running the same model purely in CoreML format.
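If anyone wants to check how badly their graph is being split up, here's a minimal sketch (the model path is a placeholder, and log levels may print more or less detail depending on your ORT version):

    import onnxruntime as ort

    # Raise log verbosity so ORT reports which nodes land on the CoreML EP
    # and which fall back to CPU (the source of the partitioning overhead).
    so = ort.SessionOptions()
    so.log_severity_level = 1  # 0 = verbose, 1 = info

    sess = ort.InferenceSession(
        "model.onnx",  # placeholder path
        sess_options=so,
        providers=["CoreMLExecutionProvider", "CPUExecutionProvider"],
    )
    print(sess.get_providers())  # confirms the CoreML EP was actually registered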
To be honest it's a shame the whole thing is closed up. I guess that's to be expected from Apple, but I reckon CoreML would benefit a lot from at least exposing the internals / allowing users to define new ops.
Also, the ANE only allows some operators to be run on it, right? There's very little transparency or control over what can and cannot be offloaded to it, which makes using it difficult.
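Right - the most control you get through ORT is a compute-units hint to the CoreML EP, and CoreML still decides per-op whether the ANE is actually used. A rough sketch (the option name and values here are from recent ORT releases and may differ by version; older builds used bit-flags instead):

    import onnxruntime as ort

    # Ask the CoreML EP to prefer CPU + Neural Engine. This is only a hint:
    # ops the ANE can't handle still run elsewhere or fall back to the CPU EP.
    sess = ort.InferenceSession(
        "model.onnx",  # placeholder path
        providers=[
            ("CoreMLExecutionProvider", {"MLComputeUnits": "CPUAndNeuralEngine"}),
            "CPUExecutionProvider",
        ],
    )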
My 2023 MacBook Pro (M2 Max) is coming up on 3 years old, and I can run models locally that are arguably "better" than what was considered SOTA about 1.5 years ago. It's not an exact comparison, of course, but it's close enough to give some perspective.
I suspect Ollama is at least partly moving away from open source as they look to raise capital; when they released their replacement desktop app, they did so as closed source. You're absolutely right that people should be using llama.cpp - not only is it truly open source, it's also significantly faster, has better model support and many more features, is better maintained, and its development community is far more active.
The only issue I've found with llama.cpp is getting it working with my AMD GPU. Ollama almost works out of the box, both in Docker and directly on my Linux box.
I haven't tried agentic coding since I haven't set it up in a container yet, and I'm not going to YOLO it on my system (doing stuff via chat plus a utility to copy and paste directories and files got me pretty far over the last year and a half).
It helps that Codex is so much slower than Anthropic's models; a 4.5-hour Codex session might as well be a 2-hour Claude Code one. I use both extensively, FWIW.
It really depends. When building a lot of new features, the limit comes up quite fast. With some attention to context length I was often able to go for over an hour on the $20 Claude plan.
If you're doing mostly smaller changes, you can go all day on the $20 Claude plan without hitting the limits, especially if you need to thoroughly review the AI's changes for correctness instead of relying on automated tests.
I find that I use it on isolated changes where Claude doesn't really need to access a ton of files to figure out what to do, and I can easily do that without hitting limits. The only time I hit the 4-5 hour limit is when I'm going nuts on a prototype idea and vibe coding absolutely everything, and usually when I hit it I'm pretty mentally spent anyway, so I use it as a sign to go do something else. I suppose everyone has different styles and different codebases, but for me it's easy enough to stay under the limit that it's hard to justify $100 or $200 a month.
The DGX Spark is not good for inference, though - it's very memory-bandwidth limited, around the same as a lower-end MacBook Pro. You're much better off with Apple Silicon for performance and memory size at the moment, but I'd recommend holding off until the M5 Max comes out early in the new year, since the M5 has vastly superior performance to any other Apple Silicon chip thanks to its matmul instruction set.
Oof, I was already considering an upgrade from the M1 but was hoping I wouldn't be convinced to go for the top of the line. Is the performance jump from the M# -> M# Max chips that substantial?
The main jump is from anything to the M5 - not simply because it's the latest, but because it has matmul instructions similar to a CUDA GPU, which fixes the slow prompt processing on all previous-generation Apple Silicon chips.
Thanks! Quick overview: Paths are deterministic, not LLM-generated. I use OpenAI text-embedding-3-large to build a word graph with K-nearest neighbors, then BFS finds the shortest path. No sampling involved. The explanations shown in-game are generated afterward by GPT-5 to explain the semantic jumps. Planning to write up the full architecture in a blog post - will share here when it's ready.
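For anyone curious, a minimal sketch of that pipeline (not the actual game code - the vocabulary, K, and the cosine-similarity choice are my assumptions):

    import numpy as np
    from collections import deque
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    def embed(words):
        # One batched call to text-embedding-3-large for the whole vocabulary.
        resp = client.embeddings.create(model="text-embedding-3-large", input=words)
        return np.array([d.embedding for d in resp.data])

    def build_knn_graph(words, k=8):
        # Link each word to its k nearest neighbors by cosine similarity.
        vecs = embed(words)
        vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
        sims = vecs @ vecs.T
        return {
            w: [words[j] for j in np.argsort(-sims[i])[1 : k + 1]]  # skip self
            for i, w in enumerate(words)
        }

    def shortest_path(graph, start, goal):
        # Plain BFS over the word graph - deterministic, no sampling involved.
        queue, seen = deque([[start]]), {start}
        while queue:
            path = queue.popleft()
            if path[-1] == goal:
                return path
            for nxt in graph[path[-1]]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(path + [nxt])
        return None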
Oh, that makes a lot of sense, and I'm glad it works that way. The explanations afterwards left me wondering whether it was truly explaining the connections or just inferring what they would be (leading to a problem a bit like how "thinking" doesn't actually show the real steps taken to reach an answer) - glad it's not doing that. Neat game and learning opportunity. (Sorry for not wording that very well - long day!)
The API key powers Grov's features (Haiku for reasoning extraction + drift detection). It does work alongside Claude Max plans - for example, I use it with my Claude Code instances and I'm a Max user myself - but you still need an API key for Grov's core features.
If this is a deal-breaker for you, in the near future I'll let teams use our API key, so you can just install it and run it normally without having to set anything up other than connecting to your team. If you have any other questions, you can find my email in the repo.