Cerebras launches inference for Llama 3.1; benchmarked at 1846 tokens/s on 8B (twitter.com/artificialanlys)
95 points by _micah_h on Aug 27, 2024 | 42 comments


Here’s an AI voice assistant we built this weekend that uses it:

https://x.com/dsa/status/1828481132108873979?s=46&t=uB6padbn...


8b models won’t even need a server a year from now. Basically the only reason to go to the server a year or two from now will be to do what edge devices can’t do: general purpose chat, long context (multimodal especially), data augmented generation that relies on pre-existing data sources in the cloud, etc. And on the server it’s very expensive to run batch size 1. You want to maximize the batch size while also keeping an eye on time to first token and time per token. Basically 20-25 tok/sec generation throughput is a good number for most non-demo workloads. TTFT for median prompt size should ideally be well under 1 sec.

But I’m happy they got this far. It’s an ambitious vision, and it’s extra competition in a field where it’s severely lacking.
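
A quick back-of-envelope on those targets (all numbers illustrative, not measurements):

    # Latency budget for the ~20-25 tok/s + sub-second TTFT targets above.
    ttft = 0.8          # seconds to first token (target: well under 1 s)
    tok_per_s = 22      # steady-state generation throughput
    reply_len = 300     # tokens in a typical chat answer (assumed)

    total = ttft + reply_len / tok_per_s
    print(f"{total:.1f} s end to end")  # ~14.4 s, but streamed it reads like live typing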


Yep, it is fast. Now what exactly Llama 8B is useful for is another matter - what are some good use cases?

One scenario I can think of is roleplaying - but I would assume that the slow streaming speed is kind of a feature there.


For agentic use cases, where you might need several round-trips to the LLM to reflect on a query, improve a result, etc., getting fast inference means you can do more round-trips while still responding in reasonable time. So basically any LLM use-case is improved by having greater speed available IMO.
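
A minimal sketch of that round-trip pattern, assuming a hypothetical call_llm helper wrapping whatever fast inference endpoint you use:

    # Reflect-and-retry loop; call_llm is a hypothetical wrapper, not a specific API.
    def call_llm(prompt: str) -> str:
        raise NotImplementedError  # e.g. an OpenAI-compatible chat completion request

    def answer_with_reflection(question: str, rounds: int = 3) -> str:
        draft = call_llm(f"Answer the question:\n{question}")
        for _ in range(rounds):
            critique = call_llm(f"Question: {question}\nDraft: {draft}\nList errors or gaps, or say OK.")
            if critique.strip().upper().startswith("OK"):
                break
            draft = call_llm(f"Question: {question}\nDraft: {draft}\nCritique: {critique}\nRewrite the answer.")
        return draft

At ~1800 tok/s the extra round-trips add well under a second; at 25 tok/s they blow the interactive budget.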


The problem with this is that tok/sec does not tell you what the time to first token is. I've seen cases (with Groq) where TTFT is large for large prompts, nullifying the advantage of faster tok/sec.


Speed is useful for batch tasks or doing a bunch of serial tasks quickly. E.g. "take these 1000 pitch decks and give me 5 bullets on each", "run this prompt 100 times and then pick the best response", "detect which of these 100k comments mention the SF Giants".
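
A sketch of that last batch job, again with a hypothetical call_llm wrapper:

    # Fan out the "which comments mention the SF Giants" job across many requests.
    from concurrent.futures import ThreadPoolExecutor

    def call_llm(prompt: str) -> str:
        raise NotImplementedError  # hypothetical wrapper around the inference API

    def mentions_giants(comment: str) -> bool:
        verdict = call_llm("Answer YES or NO only. Does this comment mention the SF Giants?\n" + comment)
        return verdict.strip().upper().startswith("YES")

    def classify(comments: list[str]) -> list[bool]:
        # A fast endpoint makes wide fan-out worthwhile: hours of serial calls become minutes.
        with ThreadPoolExecutor(max_workers=32) as pool:
            return list(pool.map(mentions_giants, comments))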


8B is not exactly great for roleplaying, if we set the bar at all high. It is just not sophisticated enough: it has very limited "reasoning"-like capabilities and can normally make sensible conclusions only about very basic things (like if it's raining, maybe a character will get wet). It can and will hallucinate about stuff like inventories or rules - and it's not a context-length thing. If there are multiple NPCs, things get worse, as they all start to mix together.

70B does significantly better in this regard. Nowhere close to perfection, but the frequency of WTFs about the LLM's output is [subjectively] drastically lower.

Speed can be useful in RP if we'd run multiple LLM-based agents (like "plot", "goal checker", "inventory", "validation", "narrator") that function call each other to achieve some goal.


These wafers only have 44GB of RAM though. Very curious why the capacity is so low considering the chips are absolutely massive. It's SRAM, so it's very fast - comparable to cache in a modern CPU. But I assume being fast and holding the whole model there is the point.


Probably largely because DRAM is a lot denser than SRAM.


What kind of answer are you looking for? Just start asking it questions. The constant demand for a magic silver bullet use case applicable to every person in the country is wild. If you have to ask, you're not using it.

What exact use case did google.com enable that made it worthwhile for everyone to immediately start using it? Did it let you access nytimes.com? Access amazon.com? No, it let you ask off-the-wall, asinine, long-tail questions no one else asked.


Surveillance states and intelligence agencies.

Or maybe an MMO with a town of NPCs.


Why can't the MMO with a town of NPCs have an intelligence agency too?


The winner will be one of two approaches: 1) getting great performance out of regular DRAM (system memory), or 2) bringing the compute to the RAM chips - DRAM is accessed 64Kb per row (or more?), and at ~10ns per read you could use small/slow ALUs along the row to do MAC operations. Not sure how you'd program that, though.

Current "at home" inference tends to be limited by how much RAM your graphics card has, but system RAM scales better.
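
Rough numbers for approach 2, taking the figures in the comment at face value (64Kb per row activation, ~10ns per read - assumptions, not a datasheet):

    # Back-of-envelope for MAC units sitting on the DRAM row buffer.
    row_bits = 64 * 1024            # bits exposed per row activation (from the comment above)
    t_row = 10e-9                   # seconds per row read (assumed)
    row_bw = row_bits / 8 / t_row   # bytes/s streamed past the in-row ALUs
    print(f"{row_bw / 1e9:.0f} GB/s per bank")  # ~819 GB/s

    model_bytes = 8e9               # an 8B model at 8-bit weights (assumed)
    print(f"{row_bw / model_bytes:.0f} tok/s per bank")  # ~100 tok/s; parallel banks scale it out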


I'll probably get stoned for asking here, but... since you seem knowledgeable on the subject:

I just got llama3.1-8b (standard and instruct). However, I cannot do anything with it on my current hardware. Can you recommend the best AI model that I 1) can self-host, 2) can run on 16GB of RAM with no dedicated graphics card and an old Intel i5, and 3) can use on Debian without installing a bunch of exo-repo mystery code?

Any recommendation, direct or semi-related, would be appreciated - I'm doing my 'research' but haven't made much progress nor had any questions answered.


Running LLMs on that kind of hardware will be very slow (expect responses with only a few words per second, which is probably pretty annoying).

LM Studio [1] makes it very easy to run models locally and play with them. Llama 3.1 will only run in quantized form with 16GB RAM, and that cripples it quite badly, in my opinion.

You may try Phi-3 Mini, which has only 3.8B weights and can still do fun things.

[1] https://lmstudio.ai/


I don't find llama3.1 noticeably worse quantised to 8-bit integers than the original fp16, to be honest. It's also a lot faster.

Of course, even then you're not going to fit the whole 128k context window in 16GB, but if you don't need that it works great.


Much appreciated. Thanks for this!


Setting up Ollama via Docker was the easiest way for me to get up and running. Not 100% sure if it fits your constraints, but highly recommended.


Another option is to download and compile llama.cpp and you should be able to run quantized models at an acceptable speed.

https://github.com/ggerganov/llama.cpp
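
If you'd rather drive it from Python, the llama-cpp-python bindings wrap the same library; a minimal sketch (the GGUF filename, context size and thread count are placeholders for your setup):

    # pip install llama-cpp-python
    from llama_cpp import Llama

    llm = Llama(
        model_path="./Meta-Llama-3.1-8B-Instruct.Q4_K_M.gguf",  # placeholder quantized model file
        n_ctx=4096,    # keep the allocated context small on 16GB of system RAM
        n_threads=4,   # roughly match your physical cores
    )
    out = llm("Q: Name three uses for a fast local 8B model.\nA:", max_tokens=128)
    print(out["choices"][0]["text"])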

Also, if you can spend the $60 and buy another 32GB of RAM, this will allow you to run the 30GB models quite nicely.


Unfortunately the motherboard is capped at 16GB of RAM.


+1. For inference especially, compute is abundant and basically free in terms of energy. Almost all of the energy is spent on memory movement. The logical solution is to not move unaggregated data.


Completely eliminating the separation between RAM and compute is how FPGAs are so fast, they do most of the computing as a series of Look Up Tables (LUTs), and optimize for latency and utilization with fancy switching fabrics.

The downside of the switching fabrics is that optimizing a design to fit an FPGA can sometimes take days.


The winner, unfortunately, will be cloud inference.


[dupe]

More discussion on official post: https://news.ycombinator.com/item?id=41369705


Wow, one chip taking up a whole wafer. I bet their yields are low, though I assume they're not using the bleeding-edge process but a slightly older one that's totally worked out.

Still, the price of one of these would be nuts if they'd sell them. Upwards of $1 million?


Guessing it's set up in a way where they can just disable dead cores.


Process defects can be located and routed around statically on the chip, it's described e.g. here: https://youtu.be/8i1_Ru5siXc?t=810


Time to first token is just as important to know for many use cases, yet people rarely report it.



Very interested in playing with their hardware and cloud. Also, I wonder if it's possible to try the cloud without contacting their sales team.


Why is it so gosh darned slow? If you've got enough transistors to hold 44 gigabytes of RAM, you've got enough to store the whole model with no need for off-chip transfers.

I'd expect tokens out at 1 Ghz aggregate. Anything less than 1 Mhz is a joke.... ok, not a joke, but surprisingly slow.


Even if they could generate tokens at that speed on the chip (which maybe they can in theory?) you need to get user tokens onto the chip and the resulting model tokens off again and transport them to the user as well. This means at some point the I/O becomes the bottleneck, not the compute. I also suspect it will get faster still, from the announcement it didn't sound like it's "optimal" yet.


User tokens onto the chip and output tokens out are tiny.


Not if you're serving tens of thousands of users at the same time.


Still tiny at 100,000.
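
Rough numbers (assumed for illustration, not Cerebras figures):

    # User-facing token traffic at scale.
    users = 100_000
    tok_per_user = 30      # tokens/s streamed to each user
    bytes_per_tok = 4      # a token is a few characters of text
    egress = users * tok_per_user * bytes_per_tok
    print(f"{egress / 1e6:.0f} MB/s")  # ~12 MB/s -- trivial next to on-chip weight traffic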


On-die communication isn't free. A lot of things here are sequential, and within the matrix multiplies the cores have to transfer outputs and the memory loads have to be distributed. It's really fast, but not like one cycle.


You could add a series of latches, use the magic of graph coloring to eliminate any timing issues, and pipeline the thing sufficiently to get a GHz of throughput, even if it takes many cycles to make it all the way through the pipe.

Personally, I'd put all the parameters in NOR flash, then cycle through the row lines sequentially to load the parameters into the MAC. You could load all the inputs in parallel as fast as the dynamic power limits of the chip allow. If you use either DMA or a hardware ring buffer to push all the tokens through the layers, you could keep the throughput going with various sizes of models, etc.

Obviously with only one MAC you couldn't have a single stream at a GHz, but you could have 4000 separate streams at 250,000 tokens/second.


Their numbers are for a single input; I assume the aggregate throughput is much higher, given the prices they are quoting and the cost of a single CS-3.


It only needs to compute about a trillion floating-point operations per token, and each layer relies on the previous one.

I wonder why it doesn't output a billion tokens per second.


The coarse estimate of compute in transformers is about as many MACs as there are weights, or twice as many FLOPs (because the multiplication and addition are counted as separate operations). So for Llama 70B that's about 70B MACs per token, which is manageable. What's far less manageable is reading the entire model out of memory N times a second.
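
To put a number on that last point, assuming fp16 weights and the headline 1846 tok/s on the 8B model at batch size 1:

    # Weight traffic needed if the whole model is re-read for every generated token.
    model_bytes = 8e9 * 2    # Llama 3.1 8B at 16-bit weights, ~16 GB (assumed precision)
    tok_per_s = 1846         # headline benchmark figure
    print(f"{model_bytes * tok_per_s / 1e12:.1f} TB/s")  # ~29.5 TB/s -- SRAM territory, not DRAM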


This would only be the case if we ignore the multiplication between queries and keys, the resulting vector being multiplied with the values, and also the multiple heads.


No, that is always the case. Attention is only about one third of the ops, and QK is a fraction of that. Outside of truly massive sequence lengths it doesn't matter a whole lot, even though it's nominally quadratic. It's trivial to run the numbers on this - you only need to do it for one layer.
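
Running those numbers for the sequence-length-dependent part (QK^T and attention-times-V) against the weight MACs, with assumed Llama-70B-ish dims:

    # Ratio of attention score/value MACs to weight MACs per generated token.
    d_model, n_layers, n_params = 8192, 80, 70e9   # round numbers, assumed

    weight_macs = n_params                          # ~1 MAC per weight per token
    def attn_macs(seq_len):
        # QK^T scores plus the attention-weighted sum over V, summed across layers
        return n_layers * 2 * seq_len * d_model

    for seq_len in (4_096, 32_768, 131_072):
        print(seq_len, f"{attn_macs(seq_len) / weight_macs:.2f}x the weight MACs")
    # ~0.08x at 4k, ~0.6x at 32k, ~2.5x at 128k -- only very long contexts change the picture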



