Most people who use one of these will be doing so through an EC2 VM (or equivalent). Given that cloud platforms can spread load, keep these GPUs churning close to 24/7 and more easily predict/amortize costs, they’ll probably buy the amount that they know they need, and Nvidia probably has some approximately correct idea of what that number is.
As someone who's been trying for some weeks, it really seems to be out of stock literally everywhere. Demand seems to be a lot higher than supply at the moment, so much so that I'm considering buying one myself instead of renting servers with it.
Does it make sense that all the GPUs are bought out? They each provide a return for mining in the short-term. In the long term, they can be used to run A(G)I models, which will be very very useful
This still makes sense! TPUs are useful for AI, which itself will be very very useful. It’s almost like it’s the best investment. That’s why smart players buy them all. Maybe I’m going out-of-topic.
Yes, neither Lambda Labs nor Exxact Corporation had them available the last time I checked (last week). Both cited high demand as the reason for them being unavailable.
We (Lambda) have all of the different NVIDIA GPUs in stock. Can you send a message to sales@lambdalabs.com and check in again with your requirements? We're seeing a lot more stock these days as the supply chain crisis of 2021 comes to an end.
I talked with you (Lambda Labs) just a week ago about the A100 specifically and you said that the demand was higher than the supply, and that people should check once a day or something like that to see if it's available in your dashboard. If you clearly have it available now, please say so outright instead of trying to push some other offer on me in emails :)
Howdy, I run [Crusoe Cloud](https://crusoecloud.com/) and we just launched an alpha of an A100 and A40 Cloud offering--we've got capacity at a reasonable price!
If you're interested in giving us a shot, feel free to shoot me an email at mike at crusoecloud dot com.
Someone should make a game like "Pokemon or Big Data" [1] except you have to choose which of two GPU names is faster. Even the consumer naming is bonkers so there's plenty of material there!
Isn't this the norm? AMD only started the trend of numbering the uArch (Zen 4, RDNA 3) fairly recently. With Intel it's Haswell > Broadwell > ... > Whatever Lake.
Usually the architecture name isn't the only distinguishing feature of the product name; you don't need to remember Intel codenames because a Core 12700 is obviously newer than a Core 11700.
Nvidia's accelerators are just called <Architecture Letter>100 every time, so if you don't remember the order of the letters, it's not obvious which is newer.
They could have just named them P100, V200, A300 and H400 instead
Intel was using Core i[3,5,7] names for multiple generations. A Core i7 could be faster or slower than a Core i5 depending on which generation each existed in.
It is nice when products have a naming scheme where natural ordering of the name maps to performance.
And an AMD 5700U is older than a 5400U as well. A 3400G is older than a 3100X. A 3300X isn't really distinctive from a 3100X; both are quad-core configurations (but with different CCD/cache configurations, which of course the name doesn't really disclose to the consumer). It happens, naming is a complex topic and there are a lot of dimensions to a product.
In general, complaining about naming is peak bikeshedding for the tech-aware crowd. There are multiple naming schemes, all of them are reasonable, and everyone hates some of them for completely legitimate reasons (but different for every person). And the resulting bikeshedding is exactly as you'd expect with that.
The underlying problem is that products have multiple dimensions of interest - you've got architecture, big vs small core, core count, TDP, clockrate/binning, cache configuration/CCD configuration, graphics configuration, etc. If you sort them by generation, then an older but higher-spec can beat a newer but lower-spec. If you sort by date then refreshes break the scheme. If you split things out into series (m7 vs i7) to express TDP then some people don't like that there's a bunch of different series. If you put them into the same naming scheme then some people don't like that a 5700U is slower than a 5700X. If you try to express all the variables in a single name, you end up with a name like "i7 1185G7" where it's incomprehensible if you don't understand what each of the parts of the name mean.
(as a power user, I personally think the Ice Lake/Tiger Lake naming is the best of the bunch, it expresses everything you need to know: architecture, core count, power, binning, graphics. But then big.LITTLE had to go and mess everything up! And other people still hated it because it was more complex.)
There are certain ones like AMD's 5000 series or the Intel 10th-gen (Comet Lake 10xxxU) that are just really ghastly because they're deliberately trying to mix-and-match to confuse the consumer (to sell older stuff as new), but in general when people complain about "not understanding all those Lakes and Coves" it's usually just because they aren't interested in the brand/product and don't want to bother learning the names, yet they'll eagerly rattle off the list of painters or cities that AMD uses as its codenames.
Like, again, to reiterate here, I literally never have seen anyone raise AMD using painter names as being "opaque to the consumer" in the same way that people repeatedly get upset about lakes. And it's the exact same thing. It's people who know the AMD brand and don't know the Intel brand and think that's some kind of a problem with the branding, as opposed to a reflection of their own personal knowledge.
I fully expect that AMD will release 7000 series desktop processors this year or early next year, and exactly 0 people are going to think that a 7600 being newer than a 7702 is confusing in the way that we get all these aggrieved posts about Intel and NVIDIA. Yes, 7600 and 7702 are different product lines, and that's the exact same as your "but i7 3770 and N3060 are different!" example. It's simply not that confusing, it takes less time to learn than to make a single indignant post on social media about it.
Similarly, the NVIDIA practice of using inventors/compsci people is not particularly confusing either. Basically the same as AMD with the painters/cities.
It's just not that interesting, and it's not worth all the bikeshedding that gets devoted to it.
</soapbox>
Anyway, your example is all messed up though. J3710 and J3060 are both the same gen (Braswell), launched at the same time (Q1 2016), that example is entirely wrong. J4125 vs J4205 is an older but higher specced processor vs a newer but lower spec, it's a 8th gen Pentium vs a 9th gen Celeron, like a 3100X vs a 2700X (zomg 3100X is bigger number but actually slower!). And the J4125 and J4205 are refreshes of the same architecture with legitimately very similar performance classes. i3 and Atom or i7 and Atom are completely different product lines and the naming is not similar at all there, apart from both having 3s as their first number (not even first character, that is different too, just happen to share the first number somewhere in the name).
Again, like with the Tiger Lake 11xxGxx naming, the characters and positions in the name have meaning. You can come up with better examples than that even within the Intel lineup. Just literally picking 3770 and J3060 as being "similar" because they both have 3s in them.
The one I would legitimately agree on is that the Atom lineup is kind of a mess. Braswell, Apollo Lake, Gemini Lake, and Gemini Lake Refresh are all crammed into the "3000/4000" series space, and there is no "generational number" in that scheme either. Braswell is all 3000 series and Gemini Lake/Gemini Lake Refresh is all 4000 series, but you've got Apollo Lake sitting in the middle with both 3000 and 4000 series chips. And a J3455 (Apollo Lake, 1.5 GHz) is legitimately a better (or at least equal) processor to a J3710 (Braswell, 1.6 GHz). Like 5700U vs 5800U, there are some legitimate architectural differences hidden behind an opaque number there (and on the Intel side it's graphics - Gemini Lake/Gemini Lake Refresh have a much better video block).
(And that's the problem with "performance rating" approaches, even if a 3710 and a 3455 are similar in performance there's still other differences between them. Also, PR naming instantly turns into gamesmanship - what benchmark, what conditions, what TDP, what level of threading? Is an Intel 37000 the same as an AMD 37000?)
yes, it's a bit of a shitshow, as we've both evidenced. unless consumers brush up on such intricate details (most do not), they will inevitably fall into traps such as "i7 is better than i3" (e.g. an i7-2600 being outperformed by an i3-10100) and "quad core is better than dual core". marketing is becoming more focused on generations now, which is a prudent move: "10th Gen is better than 2nd Gen", but it will be at least a decade before the shitshow is swept up
I don't really mind the incomprehensible letters -- looking up the generation is pretty easy, and these are data-center focused products... getting the name right is somebody's job and the easiest possible thing.
Xeons have that problem too. I guess some companies just assume they only sell their professional equipment to professionals who read the spec sheet before spending 10k+
We need a canonical, chronologically monotonic, marketing-independent ID scheme. Marketing people always try to disrupt naming schemes, and that's the real problem.
Intel is using generation numbers in their marketing materials. In technically oriented slide decks you'd see things like "42nd generation, formerly named Bullshit Creek", but they are not supposed to use that for sales. And then there are the actual part names, like i9-42045K.
We keep using code names in discussions because the actual names are ass backwards and not very descriptive.
Sure, but the way Nvidia names generations is far from obvious. It seems to be “names of famous scientists, progressing in alphabetical order, we skip some letters if we can’t find a well known scientist with a matching last name and are excited about a scientist 2 letters from now, we wrap around to the beginning of the alphabet when we get to the end, and we just skipped from A to H, so expect another wraparound in the next 5-10 years.”
Also, the naming drives devs towards the architecture papers, which are important if you want to get within sight of theoretical perf. When NVidia changes the letter, it's like saying "hey, pay attention, at least skim the new whitepaper." Over the last decade, I feel like this convention has respected my time, so in turn it has earned my own respect. I'll read the Hopper whitepaper tonight, or whenever it pops up.
I think this is less of an issue since these GPUs are not meant for the everyman, so basically the handful of server integrators can figure this out by themselves.
And for your typical dev - they'll interact with the GPU through a cloud provider, where they can easily know that a G5 instance is newer than a G4 one.
Sounds like we need some new training methods. If training could take place locally and asynchronously instead of globally through backpropagation, the amount of energy could probably be significantly reduced.
Yeah, I strongly agree. While Nvidia is working on better hardware (and they're doing a great job at it!), we believe that better training methods should be a big source of efficiency. We've released a new PyTorch library for efficient training at http://github.com/mosaicml/composer.
Our combinations of methods can train models ~4x faster to the same accuracy on CV tasks, and ~2x faster to the same perplexity/GLUE score on NLP tasks!
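For the curious, here's a minimal sketch of the kind of "surgical" usage the repo's README describes. The function names (apply_blurpool, apply_squeeze_excite) are from memory and may differ across versions, so treat this as illustrative rather than authoritative:

```python
# Minimal sketch of Composer's functional API (names are assumptions; check
# the repo for the current interface). The idea: apply speed-up methods to an
# existing PyTorch model, then train with your usual loop.
import torch
import torchvision.models as models
import composer.functional as cf

model = models.resnet50(num_classes=10)
cf.apply_blurpool(model)        # anti-aliased downsampling
cf.apply_squeeze_excite(model)  # squeeze-and-excitation blocks

optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
# ... standard PyTorch training loop from here ...
```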
The principled way of doing this is via ensemble learning, combining the predictions of multiple separately-trained models. But perhaps there are ways of improving that by including "global" training as well, where the "separate" models are allowed to interact while limiting overall training costs.
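For concreteness, a minimal sketch of the prediction-averaging flavor of ensembling described above (plain PyTorch; the toy models and shapes are just placeholders):

```python
import torch
import torch.nn as nn

# Hypothetical setup: K small models trained separately (possibly
# asynchronously), combined only at prediction time by averaging their
# output distributions.
members = [nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
           for _ in range(4)]

@torch.no_grad()
def ensemble_predict(x: torch.Tensor) -> torch.Tensor:
    probs = torch.stack([torch.softmax(m(x), dim=-1) for m in members])
    return probs.mean(dim=0)  # average the members' predicted distributions

print(ensemble_predict(torch.randn(8, 32)).shape)  # torch.Size([8, 10])
```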
That's like a person driving a Model T in 1908 saying "trying to improve gas mileage is so silly".
Why are people so dumb when it comes to planning for the future? Does it require a 1973 oil crisis to make people concerned about potential issues? Why can't people be preventative instead of reactive? Isn't the entire point of an engineer to optimize what they're building for the good of humanity?
Seeing the increased bandwidth is super exciting for a lot of the business-analytics cases we get into for IT/security/fraud/finance teams: imagine correlating across lots of event data from transactions, logs, and so on. Every year, it just goes up!
The big welcome surprise for us is the secure virtualization. Outside of some limited 24/7 ML teams, we mostly see bursty multi-tenant scenarios for achieving cost-effective utilization. MIG-style static physical partitioning was interesting -- I can imagine cloud providers offering that -- but more dynamic and logical isolation, with more of a focus on namespace isolation, is more relevant to what we see. Once we get into federated learning, and further disintermediation around that, it gets even cooler. Imagine bursting onto 0.1-100 GPUs every 30s-20min. Amazing times!
I was recently researching how you'd host systems like this in a datacentre and was blown away to find out that you can cool 40kW in a single air cooled rack - this might be old news for many, but it was 2x or 3x what I expected! Glad I'm not paying the electricity bill :)
Possible, yes. Easy, no. Don't assume any ole air-cooled rack (or datacenter) can manage 40 kW per rack. It makes it really important to do the hot aisle/cold aisle separation well, manage the airflow carefully, block any empty rack units, etc. etc. etc.
Yes, wouldn’t want to try and design it myself! I think the datacentre providers I was looking at spend quite a while doing CFD simulations to get it right!
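For scale on that 40 kW figure, a rough back-of-the-envelope: the 700 W per GPU comes from the H100 announcement, while the ~10 kW total for a fully loaded 8-GPU server is my own assumption.

```python
gpu_w = 700                         # per-GPU power from the H100 announcement
gpus_per_server = 8
print(gpus_per_server * gpu_w)      # 5600 W for the GPUs alone

server_total_w = 10_000             # assumed total with CPUs, NICs, fans, etc.
rack_budget_w = 40_000
print(rack_budget_w // server_total_w)  # ~4 such servers per 40 kW rack
```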
I think the 1PFLOPS figure for TF32 is with sparsity, which should be called out in the name. Maybe ‘TFS32’? I mainly use dense FP16 so the 1PFLOPS for that looks pretty good.
I'm using older Turing GPUs; BF16 would require Ampere. The weights in my models tend to be normalized, so the fraction (mantissa) would matter more than the exponent, and I would probably still use FP16. I would need to test it though.
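A quick way to see the trade-off being weighed here (the finfo values are standard; that torch.cuda.is_bf16_supported() exists is an assumption about your PyTorch version):

```python
import torch

# FP16 keeps a 10-bit mantissa, so it resolves finer steps around 1.0, which
# is what matters for normalized weights; BF16 trades that for an FP32-sized
# exponent range with only a 7-bit mantissa.
print(torch.finfo(torch.float16).eps)   # ~9.77e-04
print(torch.finfo(torch.bfloat16).eps)  # ~7.81e-03

# BF16 tensor-core compute needs Ampere or newer; a Turing card reports False.
if torch.cuda.is_available():
    print(torch.cuda.is_bf16_supported())
```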
The Tensor cores will be great for machine learning and the FP32/FP64 fantastic for HPC, but I'd be surprised if there were a lot of applications using both of these features at once. I wonder if there's room for a competitor to come in and sell another huge accelerator but with only one of these two features either at a lower price or with more performance? Perhaps the power density would be too high if everything was in use at once?
Graphcore's IPU is a machine-learning variant of that. Power density seems to be OK. The CTO's talks used to cover dark silicon a lot (I'm out of date now).
I share your suspicion that fp64 and ML workloads are distinct but can see each running on the same cluster at different times.
> room for a competitor to come in and sell another huge accelerator but with only one of these two features either at a lower price or with more performance?
They'd need fab capacity first. I wouldn't count on it any time soon, and chip gens have short lives.
"Combined with the additional memory on H100 and the faster NVLink 4 I/O, and NVIDIA claims that a large cluster of GPUs can train a transformer up to 9x faster, which would bring down training times on today’s largest models down to a more reasonable period of time, and make even larger models more practical to tackle."
The 9x speedup is a bit inflated... it's measured at a reference point of ~8k GPUs, on a workload that the A100 cluster is particularly bad at.
When measured at smaller #s of GPUs which are more realistic, the speedup is somewhere between 3.5x - 6x. See the GTC Keynote video at 38:50: https://youtu.be/39ubNuxnrK8?t=2330
Based on hardware specs alone, I think that training transformers with FP8 on H100 systems vs. FP16 on A100 systems should only be 3-4x faster. Definitely looking forward to external benchmarks over the coming months...
Interesting - I did not know that. Don't we also need motherboard manufacturers to implement the required hardware more widely, though? It has been a while since I read about NVLink, to be fair.
At inference time it will be possible to do 4000 TFLOPS using sparse FP8 :)
But keep in mind the model won't fit on a single H100 (80GB) because it's 175B params, and ~90GB even with sparse FP8 model weights, plus more needed for live activation memory. So you'll still want at least 2 H100s to run inference, and more realistically you would rent an 8xH100 cloud instance.
But yeah the latency will be insanely fast given how massive these models are!
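Rough back-of-the-envelope behind those numbers, assuming 1 byte per weight in FP8 and that 2:4 structured sparsity roughly halves weight storage (ignoring index metadata, activations, and KV cache):

```python
params = 175e9                      # GPT-3-scale parameter count
dense_fp8_gb = params * 1 / 1e9     # 1 byte per weight -> 175 GB
sparse_fp8_gb = dense_fp8_gb / 2    # 2:4 sparsity keeps half -> ~88 GB
h100_hbm_gb = 80

print(dense_fp8_gb, sparse_fp8_gb)  # 175.0 87.5
print(sparse_fp8_gb > h100_hbm_gb)  # True: still doesn't fit on one 80 GB card
```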
1) NVIDIA will likely release a variant of H100 with 2x memory, so we may not even have to wait a generation. They did this for V100-16GB/32GB and A100-40GB/80GB.
2) In a generation or two, the SOTA model architecture will change, so it will be hard to predict the memory reqs... even today, for a fixed train+inference budget, it is much better to train Mixture-Of-Experts (MoE) models, and even NVIDIA advertises MoE models on their H100 page.
MoEs are more efficient in compute, but occupy a lot more memory at runtime. To run an MoE with GPT3-like quality, you probably need to occupy a full 8xH100 box, or even several boxes. So your min-inference-hardware has gone up, but your efficiency will be much better (much higher queries/sec than GPT3 on the same system).
Oh I totally expect the size of models to grow along with whatever hardware can provide.
I really do wonder how much more you could squeeze out of a full pod of gen2-H100s. Obviously the model size would be ludicrous, but how far are we into the realm of diminishing returns?
Your point about MoE architectures certainly sounds like the more _useful_ deployment, but the research seems to be pushing towards ludicrously large models.
You seem to know a fair amount about the field, is there anything you'd suggest if I wanted to read more into the subject?
I agree! The models will definitely keep getting bigger, and MoEs are a part of that trend, sorry if that wasn’t clear.
A pod of gen2-H100s might have 256 GPUs with 40 TB of total memory, and could easily run a 10T param model. So I think we are far from diminishing returns on the hardware side :) The model quality also continues to get better at scale.
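Back-of-the-envelope for those pod numbers, assuming the hypothetical doubled-memory (160 GB) variant mentioned above and 2 bytes per parameter for the weights alone:

```python
gpus = 256
hbm_per_gpu_gb = 160                # hypothetical doubled-memory H100 variant
print(gpus * hbm_per_gpu_gb / 1000) # ~41 TB of total pod memory

params_10t = 10e12
print(params_10t * 2 / 1e12)        # 20 TB of FP16 weights: fits, with room
                                    # left for activations / optimizer sharding
```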
Re. reading material, I would take a look at DeepSpeed’s blog posts (not affiliated btw). That team is super super good at hardware+software optimization for ML. See their post on MoE models here: https://www.microsoft.com/en-us/research/blog/deepspeed-adva...
I think it depends what downstream task you're trying to do... DeepMind tried distilling big language models into smaller ones (think 7B -> 1B) but it didn't work too well... it definitely lost a lot of quality (for general language modeling) relative to the original model.
If you're comparing it to the MI250 you're comparing it to 2 separate chips on a single card. This is fundamentally different and unless you have an ideal workload or have optimized appropriately, it's not going to hit anywhere near the peak FLOPS if you have data movement between chips.
> For now, DPX ISA details are available to early access partners. We anticipate broader info availability aligned with CUDA 12.0 release later this year.
This chip is capable of 2000 INT8 Tensor TOPS, or 1000 F16 Tensor TFLOPS. In other words, it is capable of performing over a quadrillion operations per second. Absolutely insane... I still have fond memories of installing my first NVidia gaming GPU, with just 512MB of RAM, probably capable of much less than a single teraflop of compute.
It's a crystal, so it's just one molecule for all the transistors. In terms of atoms, a transistor is something on the order of a 30 nm cube, and with each silicon atom being about 0.2 nm in diameter, that works out to something like 3 million atoms, give or take an order of magnitude or two.
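Working that out explicitly, using the same rough numbers:

```python
transistor_edge_nm = 30     # rough transistor-sized cube, per the comment
atom_diameter_nm = 0.2      # approximate silicon atom diameter
atoms_per_edge = transistor_edge_nm / atom_diameter_nm
print(atoms_per_edge ** 3)  # ~3.4 million atoms per transistor (order of magnitude)
```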
Those specs imply some pretty crazy architectural efficiency gains, massive theoretical compute performance per transistor compared to Ampere. It's all marketing numbers until the benchmarks are out, though.
The main tensor op is a matmul intrinsic which is useful for way more than just deep learning.
Edit: many of these speeds are low precision, which is less useful outside of deep learning, but the higher-precision matmul ops in the tensor cores are still very fast and very useful for a wide variety of tasks.
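As an aside for PyTorch users, routing FP32 matmuls through the tensor cores as TF32 is a one-flag toggle. These flags exist in current PyTorch, though their defaults have changed across releases, so set them explicitly if it matters:

```python
import torch

# True: FP32 matmuls/convolutions run on the tensor cores in TF32 (fast,
# ~10-bit mantissa). False: they stay in full FP32 (slower, full precision).
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

if torch.cuda.is_available():
    a = torch.randn(4096, 4096, device="cuda")
    b = torch.randn(4096, 4096, device="cuda")
    c = a @ b  # dispatched to tensor cores on Ampere+ when TF32 is allowed
```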
> but the higher-precision matmul ops in the tensor cores are still very fast and very useful for a wide variety of tasks.
The FP64 matrix multiplication is only 60 TFLOPS, nowhere near the advertised 1000 TFLOPS. TF32 matrix multiplication is a poorly named 16-bit operation.
You are indeed correct, I was (kinda) fooled by the marketing and I think that TF32 is deceptively named. I think the tensor cores are being used in this architecture for FP64 and 60 TFlops is still pretty decent.
I'm on Turing architecture so I've never used TF32. I've only used FP32 and FP16 but FP32 isn't supported by these tensor cores.
Well the addition is done in FP32, and it's a 32-bit storage format in memory, so calling it a 16-bit format isn't right either. It's really a hybrid format where everything is 32-bit except multiplication.
Given that it's 32-bit in memory (so all your data structures are 32-bit) and also that in my experience using it is very transparent (I haven't run into any numerical issues compared to full FP32), I think calling it a 32-bit format is a reasonable compromise.
I admit that I don't have the hardware to test your claims. But pretty much all the whitepapers I can find on TF32 explicitly state the 10-bit mantissa, suggesting that this is at best, a 19-bit format. 1-bit sign + 8-bit exponent + 10-bit mantissa.
Yes, the system will read/write the 32-bit value to RAM. But if there's only 10-bits of mantissa in the circuits, you're only going to get 10-bits of precision (best case). The 10-bit mantissa makes sense because these systems have FP16 circuits (1 + 5-bit exponent + 10-bit mantissa) and BFloat16 circuits (1 sign + 8-bit exponent + 7-bit mantissa). So the 8-bit exponent circuit + 10-bit mantissa circuit exists physically on those NVidia cores.
-------
But the 'Tensor Cores' do not support 32-bit (aka: 23-bit mantissa) or higher.
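To make the bit-layout discussion concrete, here's a tiny numpy illustration of what keeping only a 10-bit mantissa does (truncation is used for simplicity; the hardware's actual rounding mode may differ):

```python
import numpy as np

def tf32_truncate(x):
    """Keep 1 sign + 8 exponent + 10 mantissa bits of each float32 value by
    zeroing the low 13 mantissa bits (illustration only)."""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFFE000)).view(np.float32)

x = np.array([1.0 + 2.0 ** -12], dtype=np.float32)  # below 10-bit precision of 1.0
print(x, tf32_truncate(x))  # [1.0002441] [1.] -- the low mantissa bits vanish
```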
There's a paper on getting FP32 accuracy using TF32 tensor cores while losing 3x efficiency. Can't wait to try it with CUTLASS... once I figure out how to use CUTLASS, woof.
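My understanding of the general trick (not necessarily the paper's or CUTLASS's exact scheme): split each FP32 operand into a TF32 "big" part plus a TF32 residual, and spend three TF32 products instead of one, dropping only the tiny residual-times-residual term. A self-contained numpy sketch:

```python
import numpy as np

def tf32(x):
    # Truncate float32 to a 10-bit mantissa (TF32 precision), for illustration.
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFFE000)).view(np.float32)

def matmul_3x_tf32(a, b):
    # Three TF32-precision products approximate one FP32 product:
    #   a @ b ~= a_hi @ b_hi + a_hi @ b_lo + a_lo @ b_hi
    a_hi, b_hi = tf32(a), tf32(b)
    a_lo, b_lo = tf32(a - a_hi), tf32(b - b_hi)
    return a_hi @ b_hi + a_hi @ b_lo + a_lo @ b_hi

rng = np.random.default_rng(0)
a = rng.standard_normal((256, 256)).astype(np.float32)
b = rng.standard_normal((256, 256)).astype(np.float32)
exact = a.astype(np.float64) @ b.astype(np.float64)

print(np.abs(tf32(a) @ tf32(b) - exact).max())     # plain TF32: visibly lossy
print(np.abs(matmul_3x_tf32(a, b) - exact).max())  # much closer to FP32 accuracy
```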
The user does see 32-bits and all bits are used because all the additions (and other operations besides the multiply in matrix ops) are in FP32. So the bottom bits are populated with useful information.
700 watts so being NVidia it'll blow up in 6 months and you'll need to wait in a queue for 6 months to RMA it because all the miners had bought up the entire supply chain.
Those datacenter/HPC GPUs don't seem to get bought up so much by the mining community? I don't have problems sourcing some through the usual channels (HPE, Dell, ...), but you need somewhat deep pockets.
I for one suffer deeply when I try to install the nvidia drivers on Linux. The website binaries _always_ break my system
Only the ppas from graphics-drivers work properly
My experience on windows is much more automatic and it never breaks anything. But I'd rather pay the price (installing on Linux) to avoid windows at all costs
If you installed the drivers using the PPAs, you can't then update using the NVIDIA-provided binaries without doing a very thorough purge, including deleting all dependent installs (CUDNN, CUBLAS, etc.)
I highly recommend sticking with one technique or the other; never intermix them.
Yea, it's not ideal, but really no option is. Building it into Linux would be a problem too, given the rate of GPU driver development. Most Linux installs in the corporate world are stuck on the major version of the kernel and system packages they shipped with.
That's because Nvidia's Linux support for consumers is indeed trash, while their creator/business software (e.g. CUDA) is not trash, but you mostly hear consumers trashing Nvidia.
Off topic but I can't stand when corporations use actual people's names for their marketing who never gave them the permission to do so. For something like Shakespeare or Cicero I'm OK with it but Grace Hopper was alive in my lifetime, and even Tesla feels a little weird. What gives you the right to use that person's reputation to shill your product?
> What gives you the right to use that person's reputation to shill your product?
Practically speaking you have the right to do anything unless someone complains about it. A lot of popular figures, even those long dead, have estates and organizations that manage their likeness and other related copyright and IP. I don't know what the situation is in this case, but Nvidia may very well have paid for the name.
I don’t think my kids have any more right to use my name than a corporation, unless I specifically grant them that right (like Walt Disney did by naming it the Walt Disney company). Another sickening one is the Ed Lee Club in SF, who endorses political candidates under the name of a much-loved dead SF mayor.
No, because they don’t own my identity! If my name is valuable I can will it to them. I can will my money to them. I just don’t think they should be able to endorse political candidates with my name after I’m dead, unless I specifically gave them that right by contract.
The situation is that various Australian companies (think Kangaroo) and DISH network already have Hopper product lines and Nvidia didn't care about getting into a legal kerfuffle and used the name anyway. As to whether Hopper's estate was consulted I don't know.
I generally agree with you, but in this case I suspect Grace Hopper would be honored by it and also impressed with the engineering here. It's not like they slapped her name on a soda can or something.
When you take software support into account, probably very favorable.
I don’t know anything about the state of Dojo, but Tesla was very hand wavy about their software stack during their presentation. And running AI algorithms efficiently on a piece of hardware is one of those things that many HW vendors have a hard time getting right.
And maybe taking this opportunity to ask: what happened to Nvidia's leak? The hacker hasn't made any more news, and Nvidia hasn't provided an update either.
If you haven't had any issues with NVIDIA Linux drivers, you can count yourself extremely lucky. In the past, I had a 50/50 chance of boot failure after installing CUDA drivers over 12 different systems. Mainline Ubuntu drivers are somewhat stable, but installing a specific CUDA version from the official NVIDIA repos rarely works on the first try. Switching from Tensorflow to PyTorch has helped a lot though, as Tensorflow was much more picky about the installed CUDA version.
they may have edited their comment but they were commenting on the lack of quality of their Linux drivers (which I agree with but only on a consumer level, never used nvidia in a server)
> 80 billion transistors
> Hopper H100 .. generational leap
> 9x at-scale training performance over A100
> 30x LLM inference throughput
> Transformer Engine .. speed .. 6x without losing accuracy
So another monster chip - about the same size as the Apple M1 Max thingy...
I guess it comes down to pricing. The A100 is already ridiculously expensive at $10K. They could price this one at $50K and it would still sell out?