Cerebras launches inference for Llama 3.1; benchmarked at 1846 tokens/s on 8B (twitter.com/artificialanlys)
95 points by _micah_h on Aug 27, 2024 | 42 comments


Here’s an AI voice assistant we built this weekend that uses it:

https://x.com/dsa/status/1828481132108873979?s=46&t=uB6padbn...


8b models won’t even need a server a year from now. Basically the only reason to go to the server a year or two from now will be to do what edge devices can’t do: general purpose chat, long context (multimodal especially), data augmented generation that relies on pre-existing data sources in the cloud, etc. And on the server it’s very expensive to run batch size 1. You want to maximize the batch size while also keeping an eye on time to first token and time per token. Basically 20-25 tok/sec generation throughput is a good number for most non-demo workloads. TTFT for median prompt size should ideally be well under 1 sec.

But I’m happy they got this far. It’s an ambitious vision, and it’s extra competition in a field where it’s severely lacking.
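
A quick back-of-envelope on those targets (all numbers illustrative, not measurements):

    # Latency budget for the ~20-25 tok/s + sub-second TTFT targets above.
    ttft = 0.8          # seconds to first token (target: well under 1 s)
    tok_per_s = 22      # steady-state generation throughput
    reply_len = 300     # tokens in a typical chat answer (assumed)

    total = ttft + reply_len / tok_per_s
    print(f"{total:.1f} s end to end")  # ~14.4 s, but streamed it reads like live typing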


Yep, it is fast. Now what exactly Llama 8B is useful for is another matter - what are some good use cases?

One scenario I can think of is roleplaying - but I would assume that the slow streaming speed is kind of a feature there.


For agentic use cases, where you might need several round-trips to the LLM to reflect on a query, improve a result, etc., getting fast inference means you can do more round-trips while still responding in reasonable time. So basically any LLM use-case is improved by having greater speed available IMO.
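
A minimal sketch of that round-trip pattern, assuming a hypothetical call_llm helper wrapping whatever fast inference endpoint you use:

    # Reflect-and-retry loop; call_llm is a hypothetical wrapper, not a specific API.
    def call_llm(prompt: str) -> str:
        raise NotImplementedError  # e.g. an OpenAI-compatible chat completion request

    def answer_with_reflection(question: str, rounds: int = 3) -> str:
        draft = call_llm(f"Answer the question:\n{question}")
        for _ in range(rounds):
            critique = call_llm(f"Question: {question}\nDraft: {draft}\nList errors or gaps, or say OK.")
            if critique.strip().upper().startswith("OK"):
                break
            draft = call_llm(f"Question: {question}\nDraft: {draft}\nCritique: {critique}\nRewrite the answer.")
        return draft

At ~1800 tok/s the extra round-trips add well under a second; at 25 tok/s they blow the interactive budget.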


The problem with this is that tok/sec does not tell you what the time to first token is. I've seen cases (with Groq) where TTFT is large for large prompts, nullifying the advantage of faster tok/sec.


Speed is useful for batch tasks or doing a bunch of serial tasks quickly. E.g. "take these 1000 pitch decks and give me 5 bullets on each", "run this prompt 100 times and then pick the best response", "detect which of these 100k comments mention the SF Giants".
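
A sketch of that last batch job, again with a hypothetical call_llm wrapper:

    # Fan out the "which comments mention the SF Giants" job across many requests.
    from concurrent.futures import ThreadPoolExecutor

    def call_llm(prompt: str) -> str:
        raise NotImplementedError  # hypothetical wrapper around the inference API

    def mentions_giants(comment: str) -> bool:
        verdict = call_llm("Answer YES or NO only. Does this comment mention the SF Giants?\n" + comment)
        return verdict.strip().upper().startswith("YES")

    def classify(comments: list[str]) -> list[bool]:
        # A fast endpoint makes wide fan-out worthwhile: hours of serial calls become minutes.
        with ThreadPoolExecutor(max_workers=32) as pool:
            return list(pool.map(mentions_giants, comments))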


8B is not exactly great for roleplaying, if we set the bar at all high. It is just not sophisticated enough: it has very limited "reasoning"-like capabilities and can normally make sensible conclusions only about very basic things (like if it's raining, maybe a character will get wet). It can and will hallucinate about stuff like inventories or rules - and it's not a context-length thing. If there are multiple NPCs, things get worse, as they all start to mix together.

70B does significantly better in this regard. Nowhere close to perfection, but the frequency of WTFs about the LLM's output is [subjectively] drastically lower.

Speed can be useful in RP if we'd run multiple LLM-based agents (like "plot", "goal checker", "inventory", "validation", "narrator") that function call each other to achieve some goal.


These wafers only have 44GB of RAM though. Very curious why the capacity is so low considering the chips are absolutely massive. It's SRAM, so it's very fast - comparable to cache in a modern CPU. But I assume being fast and holding the whole model there is the point.


Probably largely because DRAM is a lot denser than SRAM.


What kind of answer are you looking for? Just start asking it questions. The constant demand for a magic silver bullet use case applicable to every person in the country is wild. If you have to ask, you're not using it.

What exact use case did google.com enable that made it worthwhile for everyone to immediately start using it? Did it let you access nytimes.com? Access amazon.com? No, it let you ask off-the-wall, asinine, long-tail questions no one else asked.


Surveillance states and intelligence agencies.

Or maybe an MMO with a town of NPCs.


Why can't the MMO with a town of NPCs have an intelligence agency too?


The winner will be one of two approaches: 1) getting great performance out of regular DRAM (system memory), or 2) bringing the compute to the RAM chips - DRAM is accessed 64Kb per row (or more?), and at ~10ns per read you could use small/slow ALUs along the row to do MAC operations. Not sure how you'd program that, though.

Current "at home" inference tends to be limited by how much RAM your graphics card has, but system RAM scales better.
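
Rough numbers for approach 2, taking the figures in the comment at face value (64Kb per row activation, ~10ns per read - assumptions, not a datasheet):

    # Back-of-envelope for MAC units sitting on the DRAM row buffer.
    row_bits = 64 * 1024            # bits exposed per row activation (from the comment above)
    t_row = 10e-9                   # seconds per row read (assumed)
    row_bw = row_bits / 8 / t_row   # bytes/s streamed past the in-row ALUs
    print(f"{row_bw / 1e9:.0f} GB/s per bank")  # ~819 GB/s

    model_bytes = 8e9               # an 8B model at 8-bit weights (assumed)
    print(f"{row_bw / model_bytes:.0f} tok/s per bank")  # ~100 tok/s; parallel banks scale it out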


I'll probably get stoned for asking here, but... since you seem knowledgeable on the subject:

I just got llama3.1-8b (standard and instruct). However, I cannot do anything with it on my current hardware. Can you recommend the best AI model that I 1) can self-host, 2) can run on 16GB of RAM with no dedicated graphics card and an old Intel i5, and 3) can use on Debian without installing a bunch of exo-repo mystery code?

Any recommendation, direct or semi-related, would be appreciated - I'm doing my 'research' but haven't made much progress nor had any questions answered.


Running LLMs on that kind of hardware will be very slow (expect responses with only a few words per second, which is probably pretty annoying).

LM Studio [1] makes it very easy to run models locally and play with them. Llama 3.1 will only run in quantized form with 16GB RAM, and that cripples it quite badly, in my opinion.

You may try Phi-3 Mini, which has only 3.8B weights and can still do fun things.

[1] https://lmstudio.ai/


I don't find llama3.1 noticeably worse quantised to 8-bit integers than the original fp16, to be honest. It's also a lot faster.

Of course, even then you're not going to fit the whole 128k context window in 16GB, but if you don't need that it works great.


Much appreciated. Thanks for this!


Setting up Ollama via Docker was the easiest way for me to get up and running. Not 100% sure if it fits your constraints, but highly recommended.


Another option is to download and compile llama.cpp and you should be able to run quantized models at an acceptable speed.

https://github.com/ggerganov/llama.cpp
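
If you'd rather drive it from Python, the llama-cpp-python bindings wrap the same library; a minimal sketch (the GGUF filename, context size and thread count are placeholders for your setup):

    # pip install llama-cpp-python
    from llama_cpp import Llama

    llm = Llama(
        model_path="./Meta-Llama-3.1-8B-Instruct.Q4_K_M.gguf",  # placeholder quantized model file
        n_ctx=4096,    # keep the allocated context small on 16GB of system RAM
        n_threads=4,   # roughly match your physical cores
    )
    out = llm("Q: Name three uses for a fast local 8B model.\nA:", max_tokens=128)
    print(out["choices"][0]["text"])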

Also, if you can spend the $60 and buy another 32GB of RAM, this will allow you to run the 30GB models quite nicely.


Unfortunately the motherboard is capped at 16GB of RAM.


+1. For inference especially, compute is abundant and basically free in terms of energy. Almost all of the energy is spent on memory movement. The logical solution is to not move unaggregated data.


Completely eliminating the separation between RAM and compute is how FPGAs are so fast, they do most of the computing as a series of Look Up Tables (LUTs), and optimize for latency and utilization with fancy switching fabrics.

The downside of the switching fabrics is that optimizing a design to fit an FPGA can sometimes take days.


The winner, unfortunately, will be cloud inference.


[dupe]

More discussion on official post: https://news.ycombinator.com/item?id=41369705


Wow, one chip taking up a whole wafer. I bet their yields are low, though I assume they're not using the bleeding-edge process but a slightly older one that's totally worked out.

Still, the price of one of these would be nuts if they'd sell them. Upwards of $1 million?


Guessing it's set up in a way where they can just disable dead cores.


Process defects can be located and routed around statically on the chip, it's described e.g. here: https://youtu.be/8i1_Ru5siXc?t=810


Time to first token is just as important to know for many use cases, yet people rarely report it.



Very interested in playing with their hardware and cloud. Also, I wonder if it's possible to try the cloud without contacting their sales team.


Why is it so gosh darned slow? If you've got enough transistors to hold 44 gigabytes of RAM, you've got enough to store the whole model with no need for off-chip transfers.

I'd expect tokens out at 1 Ghz aggregate. Anything less than 1 Mhz is a joke.... ok, not a joke, but surprisingly slow.


Even if they could generate tokens at that speed on the chip (which maybe they can in theory?) you need to get user tokens onto the chip and the resulting model tokens off again and transport them to the user as well. This means at some point the I/O becomes the bottleneck, not the compute. I also suspect it will get faster still, from the announcement it didn't sound like it's "optimal" yet.


User tokens onto the chip and output tokens out are tiny.


Not if you're serving tens of thousands of users at the same time.


Still tiny at 100,000.
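
Rough numbers (assumed for illustration, not Cerebras figures):

    # User-facing token traffic at scale.
    users = 100_000
    tok_per_user = 30      # tokens/s streamed to each user
    bytes_per_tok = 4      # a token is a few characters of text
    egress = users * tok_per_user * bytes_per_tok
    print(f"{egress / 1e6:.0f} MB/s")  # ~12 MB/s -- trivial next to on-chip weight traffic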


On-die communication isn't free. A lot of things here are sequential, and within the matrix multiplies the cores have to transfer outputs and the memory loads have to be distributed. It's really fast, but not like one cycle.


You could add a series of latches, use the magic of graph coloring to eliminate any timing issues, and pipeline the thing sufficiently to get a GHz of throughput, even if it takes many cycles to make it all the way through the pipe.

Personally, I'd put all the parameters in NOR flash, then cycle through the row lines sequentially to load the parameters into the MAC. You could load all the inputs in parallel as fast as the dynamic power limits of the chip allow. If you use either DMA or a hardware ring buffer to push all the tokens through the layers, you could keep the throughput going with various sizes of models, etc.

Obviously with only one MAC you couldn't have a single stream at a GHz, but you could have 4000 separate streams at 250,000 tokens/second.


Their numbers are for a single input; I assume the aggregate throughput is much higher, given the prices they are quoting and the cost of a single CS-3.


It only needs to compute about a trillion floating-point operations per token, and each layer relies on the previous one.

I wonder why it doesn't output a billion tokens per second.


The coarse estimate of compute in transformers is about as many MACs as there are weights, or twice as many FLOPs (because the multiplication and addition are counted as separate operations). So for Llama 70B that's about 70B MACs per token, which is manageable. What's far less manageable is reading the entire model out of memory N times a second.
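
To put a number on that last point, assuming fp16 weights and the headline 1846 tok/s on the 8B model at batch size 1:

    # Weight traffic needed if the whole model is re-read for every generated token.
    model_bytes = 8e9 * 2    # Llama 3.1 8B at 16-bit weights, ~16 GB (assumed precision)
    tok_per_s = 1846         # headline benchmark figure
    print(f"{model_bytes * tok_per_s / 1e12:.1f} TB/s")  # ~29.5 TB/s -- SRAM territory, not DRAM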


This would only be the case if we ignore the multiplication between queries and keys, the resulting vector being multiplied with the values, and also the multiple heads.


No, that is always the case. Attention is only about one third of the ops, and QK is a fraction of that. Outside of truly massive sequence lengths it doesn't matter a whole lot, even though it's nominally quadratic. It's trivial to run the numbers on this - you only need to do it for one layer.
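
Running those numbers for the sequence-length-dependent part (QK^T and attention-times-V) against the weight MACs, with assumed Llama-70B-ish dims:

    # Ratio of attention score/value MACs to weight MACs per generated token.
    d_model, n_layers, n_params = 8192, 80, 70e9   # round numbers, assumed

    weight_macs = n_params                          # ~1 MAC per weight per token
    def attn_macs(seq_len):
        # QK^T scores plus the attention-weighted sum over V, summed across layers
        return n_layers * 2 * seq_len * d_model

    for seq_len in (4_096, 32_768, 131_072):
        print(seq_len, f"{attn_macs(seq_len) / weight_macs:.2f}x the weight MACs")
    # ~0.08x at 4k, ~0.6x at 32k, ~2.5x at 128k -- only very long contexts change the picture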



