Nvidia Hopper GPU Architecture and H100 Accelerator (anandtech.com)
278 points by jsheard on March 22, 2022 | 171 comments


From the nvidia page,

> 80 billion transistors

> Hopper H100 .. generational leap

> 9x at-scale training performance over A100

> 30x LLM inference throughput

> Transformer Engine .. speed .. 6x without losing accuracy

So another monster chip - same size as the Apple M1 Max thingy ..

I guess it comes down to pricing. The A100 is already ridiculously expensive at $10K. They could price this one at $50K and it would still sell out?


Most people who use one of these will be doing so through an EC2 VM (or equivalent). Given that cloud platforms can spread load, keep these GPUs churning close to 24/7 and more easily predict/amortize costs, they’ll probably buy the amount that they know they need, and Nvidia probably has some approximately correct idea of what that number is.


You can buy A100s in a server today, a number of integrators will happily sell it to you.


As someone who's tried for some weeks, it really seems like it's out of stock literally everywhere. The demand seems to be a lot higher than the supply at the moment, so much so that I'm considering buying one myself instead of renting servers with it.


Does it make sense that all the GPUs are bought out? They each provide a return for mining in the short-term. In the long term, they can be used to run A(G)I models, which will be very very useful


This is the GPU the parent is talking about https://www.nvidia.com/en-us/data-center/a100/


This still makes sense! TPUs are useful for AI, which itself will be very very useful. It's almost like it's the best investment. That's why smart players buy them all. Maybe I'm going off-topic.


And even when vendors do say they have it, or can get it, it ended up taking us 4-6 months before systems were online.


Did you check Lambda or Exxact?


Yes; neither Lambda Labs nor Exxact Corporation had them available last time I checked (last week). Both cited high demand as the reason for the unavailability.


We (Lambda) have all of the different NVIDIA GPUs in stock -- can you send a message to sales@lambdalabs.com and check in again with your requirements? We're seeing a lot more stock these days as the supply chain crisis of 2021 comes to an end.


I talked with you (Lambda Labs) just a week ago about the A100 specifically and you said that the demand was higher than the supply, and that people should check once a day or something like that to see if it's available in your dashboard. If you clearly have it available now, please say so outright instead of trying to push some other offer on me in emails :)


Howdy, I run [Crusoe Cloud](https://crusoecloud.com/) and we just launched an alpha of an A100 and A40 Cloud offering--we've got capacity at a reasonable price!

If you're interested in giving us a shot, feel free to shoot me an email at mike at crusoecloud dot com.


So the product naming for Nvidia's server-GPUs by compute power now goes:

P100 -> V100 -> A100 -> H100

This is not confusing at all.



Yeah it is, but unless you've memorised the history of Nvidia architectures it doesn't tell you which is the newer one

Fermi -> Kepler -> Maxwell -> Pascal -> Volta (HPC only) -> Turing -> Ampere -> Hopper (HPC only?) -> Lovelace?


Someone should make a game like "Pokemon or Big Data" [1] except you have to choose which of two GPU names is faster. Even the consumer naming is bonkers so there's plenty of material there!

[1] http://pixelastic.github.io/pokemonorbigdata/


Isn't this the norm? Only AMD started the trend of naming the uArch with numbers, as in Zen 4 or RDNA 3, fairly recently. With Intel it's Haswell > Broadwell > ... > Whatever Lake.


Usually the architecture name isn't the only distinguishing feature of the product name; you don't need to remember Intel codenames because a Core 12700 is obviously newer than a Core 11700.

Nvidia's accelerators are just called <Architecture Letter>100 every time, so if you don't remember the order of the letters it's not obvious.

They could have just named them P100, V200, A300 and H400 instead


Intel was using Core i[3,5,7] names for multiple generations. A Core i7 could be faster or slower than a Core i5 depending on which generation each existed in.

It is nice when products have a naming scheme where natural ordering of the name maps to performance.


>you don't need to remember Intel codenames because a Core 12700 is obviously newer than a Core 11700

J3710 (7th Gen) vs. J3060 (8th Gen)

J4205 (8th Gen) vs. J4125 (9th Gen)

i3-5005U (5th Gen) vs. N5095 (10th Gen)

i7-3770 (3rd Gen) vs. 3865U (7th Gen) vs. N3060 (8th Gen)


And an AMD 5700U is older than a 5400U as well. A 3400G is older than a 3100X. The 3300X isn't really distinguishable from the 3100X by name; both are quad-core configurations (but different CCD/cache configurations, which of course the name doesn't really disclose to the consumer). It happens; naming is a complex topic and there are a lot of dimensions to a product.

In general, complaining about naming is peak bikeshedding for the tech-aware crowd. There are multiple naming schemes, all of them are reasonable, and everyone hates some of them for completely legitimate reasons (but different for every person). And the resulting bikeshedding is exactly as you'd expect with that.

The underlying problem is that products have multiple dimensions of interest - you've got architecture, big vs small core, core count, TDP, clockrate/binning, cache configuration/CCD configuration, graphics configuration, etc. If you sort them by generation, then an older but higher-spec can beat a newer but lower-spec. If you sort by date then refreshes break the scheme. If you split things out into series (m7 vs i7) to express TDP then some people don't like that there's a bunch of different series. If you put them into the same naming scheme then some people don't like that a 5700U is slower than a 5700X. If you try to express all the variables in a single name, you end up with a name like "i7 1185G7" where it's incomprehensible if you don't understand what each of the parts of the name mean.

(as a power user, I personally think the Ice Lake/Tiger Lake naming is the best of the bunch, it expresses everything you need to know: architecture, core count, power, binning, graphics. But then big.LITTLE had to go and mess everything up! And other people still hated it because it was more complex.)

There are certain ones like AMD's 5000 series or the Intel 10th-gen (Comet Lake 10xxxU) that are just really ghastly because they're deliberately trying to mix-and-match to confuse the consumer (to sell older stuff as being new), but in general when people complain about "not understanding all those Lakes and Coves" it's usually just because they aren't interested in the brand/product and don't want to bother learning the names, yet they will eagerly rattle off a list of painters or cities that AMD uses as its codenames.

Like, again, to reiterate here, I literally never have seen anyone raise AMD using painter names as being "opaque to the consumer" in the same way that people repeatedly get upset about lakes. And it's the exact same thing. It's people who know the AMD brand and don't know the Intel brand and think that's some kind of a problem with the branding, as opposed to a reflection of their own personal knowledge.

I fully expect that AMD will release 7000 series desktop processors this year or early next year, and exactly 0 people are going to think that a 7600 being newer than a 7702 is confusing in the way that we get all these aggrieved posts about Intel and NVIDIA. Yes, 7600 and 7702 are different product lines, and that's the exact same as your "but i7 3770 and N3060 are different!" example. It's simply not that confusing, it takes less time to learn than to make a single indignant post on social media about it.

Similarly, the NVIDIA practice of using inventors/compsci people is not particularly confusing either. Basically the same as AMD with the painters/cities.

It's just not that interesting, and it's not worth all the bikeshedding that gets devoted to it.

</soapbox>

Anyway, your example is all messed up though. J3710 and J3060 are both the same gen (Braswell), launched at the same time (Q1 2016), so that example is entirely wrong. J4125 vs J4205 is an older but higher-specced processor vs a newer but lower-spec one; it's an 8th-gen Pentium vs a 9th-gen Celeron, like a 3100X vs a 2700X (zomg, 3100X is the bigger number but actually slower!). And the J4125 and J4205 are refreshes of the same architecture with legitimately very similar performance classes. i3 and Atom or i7 and Atom are completely different product lines and the naming is not similar at all there, apart from both having 3s as their first number (not even first character, that is different too; they just happen to share the first number somewhere in the name).

Again, like with the Tiger Lake 11xxGxx naming, the characters and positions in the name have meaning. You can come up with better examples than that even within the Intel lineup. Just literally picking 3770 and J3060 as being "similar" because they both have 3s in them.

The one I would legitimately agree on is that the Atom lineup is kind of a mess. Braswell, Apollo Lake, Gemini Lake, and Gemini Lake Refresh are all crammed into the "3000/4000" series space, and there is no "generational number" in that scheme either. Braswell is all 3000 series and Gemini Lake/Gemini Lake Refresh is all 4000 series, but you've got Apollo Lake sitting in the middle with both 3000 and 4000 series chips. And a J3455 (Apollo Lake 1.5 GHz) is legitimately a better (or at least equal) processor to a J3710 (Braswell 1.6 GHz). Like 5700U vs 5800U, there are some legitimate architectural differences hidden behind an opaque number there (and on the Intel side it's graphics - Gemini Lake/Gemini Lake Refresh have a much better video block).

(And that's the problem with "performance rating" approaches, even if a 3710 and a 3455 are similar in performance there's still other differences between them. Also, PR naming instantly turns into gamesmanship - what benchmark, what conditions, what TDP, what level of threading? Is an Intel 37000 the same as an AMD 37000?)


yes, it's a bit of a shitshow, as mutually evidenced. unless consumers brush up on such intricate details (most do not), they will inevitably fall into traps such as "i7 is better than i3" (e.g. an i7-2600 being outperformed by an i3-10100) and "quad core is better than dual core". marketing is becoming more focused on generations now, which is a prudent move: "10th Gen is better than 2nd Gen". but it will be at least a decade before the shitshow is swept away


I don't really mind the incomprehensible letters -- looking up the generation is pretty easy, and these are data-center focused products... getting the name right is somebody's job and the easiest possible thing.

However, is the number superfluous at this point?


Xeons have that problem too. I guess some companies just assume they only sell their professional equipment to professionals who read the spec sheet before spending 10k+


We need a canonical, chronologically monotonic, marketing-independent ID scheme. Marketing people always try to disrupt naming schemes, and that's the real problem.


Intel is using generation numbers in their marketing materials. In the technical-oriented slide decks you'd see things like "42nd generation, formerly named Bullshit Creek", but they are not supposed to use that for sales. And then actual part names like i9-42045K.

We keep using code names in discussions because the actual names are ass backwards and not very descriptive.


You just blew my mind. That did not occur to me, but it is obvious in retrospect.


Sure, but the way Nvidia names generations is far from obvious. It seems to be “names of famous scientists, progressing in alphabetical order, we skip some letters if we can’t find a well known scientist with a matching last name and are excited about a scientist 2 letters from now, we wrap around to the beginning of the alphabet when we get to the end, and we just skipped from A to H, so expect another wraparound in the next 5-10 years.”


Volta (2017) -> Turing (2018)

Volta was HPC only while Turing was the consumer line; I guess that may have something to do with the out-of-order naming sequence.

Then we had Ampere which was dual market. And then split again with Lovelace on the consumer side and Hopper for data centers.

Just guessing here, obvi.


I mean, I'd already internalized P100 < V100 < A100 as a Colab user.

Schedule me on an H100 and I promise I won't mind the "confusing" naming.


Also, the naming drives devs towards the architecture papers, which are important if you want to get within sight of theoretical perf. When NVidia changes the letter, it's like saying "hey, pay attention, at least skim the new whitepaper." Over the last decade, I feel like this convention has respected my time, so in turn it has earned my own respect. I'll read the Hopper whitepaper tonight, or whenever it pops up.


I think this is less of an issue since these GPUs are not meant for the everyman, so basically the handful of server integrators can figure this out by themselves.

And for your typical dev - they'll interact with the GPU through a cloud provider, where they can easily know that a G5 instance is newer than a G4 one.


Sounds like we need some new training methods. If training could take place locally and asynchronously instead of globally through backpropagation, the amount of energy could probably be significantly reduced.


Disclosure: I work at MosaicML

Yeah, I strongly agree. While Nvidia is working on better hardware (and they're doing a great job at it!), we believe that better training methods should be a big source of efficiency. We've released a new PyTorch library for efficient training at http://github.com/mosaicml/composer.

Our combinations of methods can train models ~4x faster to the same accuracy on CV tasks, and ~2x faster to the same perplexity/GLUE score on NLP tasks!


I've been seeing a lot more about MosaicML on my Twitter feed. Just wanted to ask -- how are your priorities different than, say, Fastai?


The principled way of doing this is via ensemble learning, combining the predictions of multiple separately-trained models. But perhaps there are ways of improving that by including "global" training as well, where the "separate" models are allowed to interact while limiting overall training costs.
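For illustration, here's a minimal sketch of that ensembling step, assuming models that expose a scikit-learn-style predict_proba method (the interface is just an assumption for the example, not a specific framework's API):

    import numpy as np

    def ensemble_predict(models, x):
        # Average class probabilities from several independently trained models.
        # `models` is assumed to expose a scikit-learn-style predict_proba(x);
        # the names here are illustrative, not tied to any particular library.
        probs = np.stack([m.predict_proba(x) for m in models], axis=0)
        return probs.mean(axis=0)  # shape: (n_samples, n_classes)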


Trying to reduce energy consumption for ML like this is so silly.


Training costs are growing exponentially.

The degree to which energy and capital costs can be optimized will determine how large they can go.


That's like a person driving the Model T in 1908 saying "trying to improve gas efficiency is so silly".

Why are people so dumb when it comes to planning for the future? Does it require a 1973 oil crisis to make people concerned about potential issues? Why can't people be preventative instead of reactive? Isn't the entire point of an engineer to optimize what they're building for the good of humanity?


Reducing energy consumption for computation is not silly.

We're at a point where we're turning into a computation-driven society, and computation is becoming a globally relevant share of power consumption.

> global data centers likely consumed around 205 terawatt-hours (TWh) in 2018, or 1 percent of global electricity use

And that's just data centers, if you add all client devices you probably double that.

Plus that number will only continue to grow.


Why?


Seeing the increased bandwidth is super exciting for a lot of business analytics cases we get into for IT/security/fraud/finance teams: imagine correlating across lots of event data from transactions, logs, ... . Every year, it just goes up!

The big welcome surprise for us is the secure virtualization. Outside of some limited 24/7 ML teams, we mostly see bursty multi-tenant scenarios for achieving cost-effective utilization. MIG-style static physical partitioning was interesting -- I can imagine cloud providers offering that -- but more dynamic & logical isolation, with more of a focus on namespace isolation, is more relevant to what we see. Once we get into federated learning, and further disintermediations around that, it's even more cool. Imagine bursting onto 0.1-100 GPUs every 30s-20min. Amazing times!


This seems fast...

     TF32  ....... 1,000 TFLOPS  (tensor core)
     FP64/FP32 ...    60 TFLOPS
I am more interested in the 144-core Grace CPU Superchip. nVidia is getting into the CPU business...


50% sparsity and rated at 700W. The new DGX is 10kW!


I was recently researching how you'd host systems like this in a datacentre and was blown away to find out that you can cool 40kW in a single air cooled rack - this might be old news for many, but it was 2x or 3x what I expected! Glad I'm not paying the electricity bill :)
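A back-of-envelope airflow calculation shows why 40kW per air-cooled rack is plausible but demanding; the 15 K inlet-to-outlet temperature rise below is an assumed figure, not a spec:

    # Rough airflow needed to carry 40 kW out of one rack with air cooling.
    heat_w = 40_000      # W of heat to remove
    delta_t = 15.0       # K air temperature rise across the rack (assumed)
    cp_air = 1005.0      # J/(kg*K), specific heat of air
    rho_air = 1.2        # kg/m^3, air density near 20 C

    mass_flow = heat_w / (cp_air * delta_t)   # ~2.7 kg/s of air
    volume_flow = mass_flow / rho_air         # ~2.2 m^3/s
    cfm = volume_flow * 2118.88               # ~4700 CFM through a single rack
    print(f"{mass_flow:.2f} kg/s, {volume_flow:.2f} m^3/s, {cfm:.0f} CFM")

Pushing thousands of CFM through one rack without recirculation is exactly why the hot/cold aisle discipline mentioned below matters.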


Here's what a propane heater of similar output looks like: https://www.amazon.com/Dura-Heat-Propane-Forced-Heater/dp/B0...


Most of the propane heater is a fan in a tube; the flame is probably quite a bit smaller than a CPU package.


I've got an 8kW wood stove and that thing gets rather hot to touch - as in, you will get a blister... 40kW is a small city car's worth of power.


If you think about it, cars can manage cooling 200+kW with a radiator - and a lot of airflow.

That's still an amazing amount of power though. I can't help thinking about the kind of sear you could get on a steak with 40kW of power :D


Possible, yes. Easy, no. Don't assume any ole air-cooled rack (or datacenter) can manage 40kW per rack. It makes it really important to do the hot aisle/cold aisle layout well, manage the airflow carefully, block any empty rack units, etc. etc. etc.


Yes, wouldn’t want to try and design it myself! I think the datacentre providers I was looking at spend quite a while doing CFD simulations to get it right!


Supermicro will sell you 40 kW CPU TDP in a single air-cooled rack, not counting the rest of the servers.


I think the 1PFLOPS figure for TF32 is with sparsity, which should be called out in the name. Maybe ‘TFS32’? I mainly use dense FP16 so the 1PFLOPS for that looks pretty good.


Asked elsewhere, but why FP16 as opposed to BF16?


I'm using older Turing GPUs; BF16 would require Ampere. The weights in my models tend to be normalized, so the fraction would be more important than the exponent, so I would probably still use FP16. I would need to test it, though.


Same - plus it's SUPER


The Tensor cores will be great for machine learning and the FP32/FP64 fantastic for HPC, but I'd be surprised if there were a lot of applications using both of these features at once. I wonder if there's room for a competitor to come in and sell another huge accelerator but with only one of these two features either at a lower price or with more performance? Perhaps the power density would be too high if everything was in use at once?


Graphcore's IPU is a machine-learning variant on that. Power density seems to be OK. The CTO's talks used to (I'm out of date now) discuss dark silicon a lot.

I share your suspicion that fp64 and ML workloads are distinct but can see each running on the same cluster at different times.


Just look at the Cerebras Wafer-Scale Engine; it's basically a chip the size of a wafer, but it's not cheap...


> room for a competitor to come in and sell another huge accelerator but with only one of these two features either at a lower price or with more performance?

They'd need fab capacity first. I wouldn't count on it any time soon, and chip gens have short lives.


Can different tenant VMs access different parts of H100 in parallel?

If so, it may be a reasonable mix.


Right now no - the unit of subdivision for MIG is too coarse. Everyone gets the same proportion of each function


"Combined with the additional memory on H100 and the faster NVLink 4 I/O, and NVIDIA claims that a large cluster of GPUs can train a transformer up to 9x faster, which would bring down training times on today’s largest models down to a more reasonable period of time, and make even larger models more practical to tackle."

Looking good.


The 9x speedup is a bit inflated... it's measured at a reference point of ~8k GPUs, on a workload that the A100 cluster is particularly bad at.

When measured at smaller #s of GPUs which are more realistic, the speedup is somewhere between 3.5x - 6x. See the GTC Keynote video at 38:50: https://youtu.be/39ubNuxnrK8?t=2330

Based on hardware specs alone, I think that training transformers with FP8 on H100 systems vs. FP16 on A100 systems should only be 3-4x faster. Definitely looking forward to external benchmarks over the coming months...


We have needed wide use of NVLink or something like it for a long time now... here's to hoping mobo manufacturers actually widely implement it!


The open-standard counterpart to NVLink is CXL. It's available in the latest-gen CPUs.


Interesting - I did not know that. Don't we also need motherboard manufacturers, though, to more widely implement the hardware required? It has been a while since I have read about NVLink, to be fair.


1000 TFLOPS, so I can run my GPT-3 in under 100 ms locally :D

If 1000 TFLOPS is possible at inference time then I'm speechless


At inference time it will be possible to do 4000 TFLOPS using sparse FP8 :)

But keep in mind the model won't fit on a single H100 (80GB) because it's 175B params, and ~90GB even with sparse FP8 model weights, and then more is needed for live activation memory. So you'll still want at least 2 H100s to run inference, and more realistically you would rent an 8xH100 cloud instance.

But yeah the latency will be insanely fast given how massive these models are!
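The weight-memory arithmetic behind those numbers, as a quick sketch (treating 2:4 sparsity as roughly halving stored weights is a simplification):

    # Approximate GPT-3 weight storage at different precisions.
    params = 175e9                      # GPT-3 parameter count
    gb_fp16 = params * 2 / 1e9          # ~350 GB at 2 bytes/param (FP16)
    gb_fp8 = params * 1 / 1e9           # ~175 GB at 1 byte/param (FP8)
    gb_fp8_sparse = gb_fp8 / 2          # ~88 GB if sparsity roughly halves storage
    print(gb_fp16, gb_fp8, gb_fp8_sparse)
    # Even ~90 GB of weights (plus activation memory) exceeds a single 80 GB H100,
    # hence 2+ GPUs, or more realistically an 8x H100 box, for inference.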


So, we're about a 25-50% memory increase off of being able to run GPT3 on a single machine?

Sounds doable in a generation or two.


Couple points:

1) NVIDIA will likely release a variant of H100 with 2x memory, so we may not even have to wait a generation. They did this for V100-16GB/32GB and A100-40GB/80GB.

2) In a generation or two, the SOTA model architecture will change, so it will be hard to predict the memory reqs... even today, for a fixed train+inference budget, it is much better to train Mixture-Of-Experts (MoE) models, and even NVIDIA advertises MoE models on their H100 page.

MoEs are more efficient in compute, but occupy a lot more memory at runtime. To run an MoE with GPT3-like quality, you probably need to occupy a full 8xH100 box, or even several boxes. So your min-inference-hardware has gone up, but your efficiency will be much better (much higher queries/sec than GPT3 on the same system).

So it's complicated!


Oh I totally expect the size of models to grow along with whatever hardware can provide.

I really do wonder how much more you could squeeze out of a full pod of gen2 H100s. Obviously the model size would be ludicrous, but how far are we into the realm of diminishing returns?

Your point about MoE architectures certainly sounds like the more _useful_ deployment, but the research seems to be pushing towards ludicrously large models.

You seem to know a fair amount about the field, is there anything you'd suggest if I wanted to read more into the subject?


I agree! The models will definitely keep getting bigger, and MoEs are a part of that trend, sorry if that wasn’t clear.

A pod of gen2-H100s might have 256 GPUs with 40 TB of total memory, and could easily run a 10T param model. So I think we are far from diminishing returns on the hardware side :) The model quality also continues to get better at scale.

Re. reading material, I would take a look at DeepSpeed’s blog posts (not affiliated btw). That team is super super good at hardware+software optimization for ML. See their post on MoE models here: https://www.microsoft.com/en-us/research/blog/deepspeed-adva...


Is it difficult/desirable to squeeze/compress an open-sourced 200B parameter model to fit into 40GB?

Are these techniques for specific architectures or can they be made generic ?


I think it depends what downstream task you're trying to do... DeepMind tried distilling big language models into smaller ones (think 7B -> 1B) but it didn't work too well... it definitely lost a lot of quality (for general language modeling) relative to the original model.

See the paper here, Figure A28: https://kstatic.googleusercontent.com/files/b068c6c0e64d6f93...

But if your downstream task is simple, like sequence classification, then it may be possible to compress the model without losing much quality.
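For reference, the usual distillation setup minimizes a temperature-softened KL divergence between teacher and student outputs; a minimal PyTorch sketch of that loss (the temperature value here is arbitrary, not taken from the paper):

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, temperature=2.0):
        # KL divergence between temperature-softened teacher and student distributions.
        t = temperature
        soft_teacher = F.softmax(teacher_logits / t, dim=-1)
        log_student = F.log_softmax(student_logits / t, dim=-1)
        # 'batchmean' matches the mathematical definition of KL divergence;
        # the t*t factor keeps gradient magnitudes comparable across temperatures.
        return F.kl_div(log_student, soft_teacher, reduction="batchmean") * (t * t)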



GPT-3 can't fit in 80GB of RAM.


At what cost, I wonder?


Huge recurrent licensing costs are the killer with these.


I would assume about 30-40k usd but we'll see


NVidia and AMD datacenter GPUs continue to diverge between focusing on deep learning and traditional scientific computing respectively.


Computational scientists generally prefer Nvidia because of CUDA and the much better development experience of Nvidia's tools.


how so? what is this bad at for scientific computing?


FP16 performance is great, but the FP64 performance isn't terribly compelling. Scientific computing is generally FP64.


If you're comparing it to the MI250 you're comparing it to 2 separate chips on a single card. This is fundamentally different and unless you have an ideal workload or have optimized appropriately, it's not going to hit anywhere near the peak FLOPS if you have data movement between chips.


Anyone find details about the DPX instructions for dynamic programming?


There are some deep dive sessions at GTC that will probably go into them.


The NVIDIA statement about it:

> For now, DPX ISA details are available to early access partners. We anticipate broader info availability aligned with CUDA 12.0 release later this year.
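Until the ISA details land, the shape of the workload is well known: dynamic-programming recurrences such as Smith-Waterman or edit distance, whose inner loop is a min (or max) over a few additions. A plain-Python sketch of that pattern (not DPX code, just the recurrence such instructions are meant to accelerate):

    def edit_distance(a: str, b: str) -> int:
        # Classic Levenshtein DP: each cell is a min over three "add, then compare" terms,
        # which is the kind of fused min/add step DPX-style instructions target.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, start=1):
            cur = [i] + [0] * len(b)
            for j, cb in enumerate(b, start=1):
                cur[j] = min(prev[j] + 1,               # deletion
                             cur[j - 1] + 1,            # insertion
                             prev[j - 1] + (ca != cb))  # substitution
            prev = cur
        return prev[-1]

    print(edit_distance("hopper", "ampere"))  # -> 4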


This chip is capable of 2000 INT8 Tensor TOPS, or 1000 F16 Tensor TFLOPS. In other words, it is capable of performing over a quadrillion operations per second. Absolutely insane... I still have fond memories of installing my first NVidia gaming GPU, with just 512MB of RAM, probably capable of much less than a single teraflop of compute.


80 billion transistors boggles my mind. How many molecules are there per transistor?


It's a crystal, so just one molecule for all the transistors. In terms of atoms, a transistor is something on the order of a 30 nm cube, and with each silicon atom being ~0.2 nm in diameter that's something like 3 million, give or take an order of magnitude or two.


That makes sense. My mistake, I did mean atoms, not molecules. Wolfram alpha estimates 1.35 million Si atoms, so well within 1 order of magnitude.

https://www.wolframalpha.com/input?i=30%5E3+cubic+nanometers...
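Reproducing that estimate directly from silicon's atomic density (the 30 nm cube is the assumed transistor volume from above):

    # Silicon has roughly 5e22 atoms per cm^3; 1 cm^3 = 1e21 nm^3.
    atoms_per_nm3 = 5e22 / 1e21          # ~50 atoms per nm^3
    transistor_volume_nm3 = 30 ** 3      # assumed (30 nm)^3 cube = 27,000 nm^3
    atoms_per_transistor = atoms_per_nm3 * transistor_volume_nm3
    print(f"{atoms_per_transistor:.2e}") # ~1.35e6, matching the Wolfram Alpha figure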


wat


Those specs imply some pretty crazy architectural efficiency gains, massive theoretical compute performance per transistor compared to Ampere. It's all marketing numbers until the benchmarks are out, though.

Edit: big TDP, though.


1 petaflop on a chip?? What is the catch?


Tensor petaflops are useful in only a few circumstances - one of which is the highly lucrative deep learning market, though.


The main tensor op is a matmul intrinsic which is useful for way more than just deep learning.

Edit: many of these speeds are for low precision, which is less useful outside of deep learning, but the higher-precision matmul ops in the tensor cores are still very fast and very useful for a wide variety of tasks.


> but the higher precision matmul ops in the tensor cores are still very fast and very useful for wide variety of tasks.

The FP64 matrix-multiplication is only 60 TFlops, nowhere near the advertised 1000 TFlops. TF32 matrix-multiplication is a poorly named 16-bit operation.


You are indeed correct, I was (kinda) fooled by the marketing and I think that TF32 is deceptively named. I think the tensor cores are being used in this architecture for FP64 and 60 TFlops is still pretty decent.

I'm on Turing architecture so I've never used TF32. I've only used FP32 and FP16 but FP32 isn't supported by these tensor cores.


Well the addition is done in FP32, and it's a 32-bit storage format in memory, so calling it a 16-bit format isn't right either. It's really a hybrid format where everything is 32-bit except multiplication.

Given that it's 32-bit in memory (so all your data structures are 32-bit) and also that in my experience using it is very transparent (I haven't run into any numerical issues compared to full FP32), I think calling it a 32-bit format is a reasonable compromise.


> Well the addition is done in FP32

Addition is done with a 10-bit mantissa. So maybe TF19 might be the better name, since it's a 19-bit format (slightly more than 16-bit BFloat).

Really, it's a BFloat with a 10-bit mantissa instead of a 7-bit mantissa. The 10-bit mantissa matches FP16, while the 8-bit exponent matches FP32.

So TF19 probably would have been the best name, but NVidia likes marketing, so they call it TF32 instead.


It's a 32-bit format in memory and the additions are done with 32-bits.


I admit that I don't have the hardware to test your claims. But pretty much all the whitepapers I can find on TF32 explicitly state the 10-bit mantissa, suggesting that this is at best, a 19-bit format. 1-bit sign + 8-bit exponent + 10-bit mantissa.

Yes, the system will read/write the 32-bit value to RAM. But if there's only 10-bits of mantissa in the circuits, you're only going to get 10-bits of precision (best case). The 10-bit mantissa makes sense because these systems have FP16 circuits (1 + 5-bit exponent + 10-bit mantissa) and BFloat16 circuits (1 sign + 8-bit exponent + 7-bit mantissa). So the 8-bit exponent circuit + 10-bit mantissa circuit exists physically on those NVidia cores.

-------

But the 'Tensor Cores' do not support 32-bit (aka: 23-bit mantissa) or higher.


Yup, in a semi-related field, NVIDIA has 3xTF32 for cases needing higher precision: https://github.com/NVIDIA/cutlass/discussions/361


There's a paper on getting FP32 accuracy using TF32 tensor cores at a 3x efficiency cost. Can't wait to try it with CUTLASS... once I figure out how to use CUTLASS, woof.


The catch is it's only for TF32 computations (Nvidia proprietary 19 bit floating point number format)


I missed that; to me that makes the '32' in the name misleading.


TF32 = FP32 range + FP16 precision


Why not call it TF19, then?


Because your existing FP32 models should run fine when converted to TF32, so TF32 is equivalent to FP32 as far as DL practitioners are concerned.
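In practice "converted" can just mean flipping a backend flag; for example, in PyTorch (on an Ampere-or-newer GPU) FP32 matmuls can be allowed to run as TF32 on the tensor cores while the tensors stay float32 throughout. A minimal sketch:

    import torch

    # Allow FP32 matmuls and convolutions to execute as TF32 on tensor cores.
    torch.backends.cuda.matmul.allow_tf32 = True
    torch.backends.cudnn.allow_tf32 = True

    a = torch.randn(4096, 4096, device="cuda")
    b = torch.randn(4096, 4096, device="cuda")
    c = a @ b  # TF32 multiply inputs, FP32 accumulation; dtype is still torch.float32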


There is a lot of redundancy in DL that forgives all manner of sins; I think it's sneaky.


Because it's 32-bits wide in memory.

The effective mantissa is like FP16 but it's padded out to be the same size as FP32.

In other words, there's 1 sign bit, 8 exponent bits, 10 mantissa bits that are USED, and 13 mantissa bits that are IGNORED.

1 + 8 + 10 + 13 = 32

The 13 ignored mantissa bits are part of the memory image: they pad the number out to 32-bit alignment.
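One way to see the layout is to mask an FP32 value down to TF32 input precision; a small NumPy sketch (real hardware rounds rather than truncates, so this is only illustrative):

    import numpy as np

    def to_tf32_precision(x):
        # Zero the low 13 mantissa bits, keeping sign (1) + exponent (8) + top 10 mantissa bits.
        bits = np.ascontiguousarray(x, dtype=np.float32).view(np.uint32)
        return (bits & np.uint32(0xFFFFE000)).view(np.float32)

    x = np.array([1.0 + 2.0 ** -12], dtype=np.float32)  # representable in FP32
    print(x, to_tf32_precision(x))  # the 2^-12 bit falls below TF32 precision and is dropped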


But the user never sees that memory right? Doesn't it go in FP32 and come out FP32? I still think it's deceptive marketing.


The user does see 32-bits and all bits are used because all the additions (and other operations besides the multiply in matrix ops) are in FP32. So the bottom bits are populated with useful information.


And with 50% sparsity


DP Linpack flops are what counts in supercomputer rankings. Stuck at 0.44 exaflops in 2021.


700 watts, so being NVidia, it'll blow up in 6 months and you'll need to wait in a queue for 6 months to RMA it because all the miners have bought up the entire supply chain.


Those datacenter/HPC GPUs don't seem to get bought so much by the mining community? I don't have problems sourcing some through the usual channels (HPE, Dell, ...). But you need somewhat deep pockets.


Given that it's Nvidia, no Linux support. That's the catch.


That's a strange statement. The vast vast majority of these cards will be in systems running Linux.


I for one suffer deeply when I try to install the nvidia drivers on Linux. The website binaries _always_ break my system

Only the ppas from graphics-drivers work properly

My experience on windows is much more automatic and it never breaks anything. But I'd rather pay the price (installing on Linux) to avoid windows at all costs


If you installed the drivers using the PPAs, you can't then update using the NVIDIA-provided binaries without doing a very thorough purge, including deleting all dependent installs (CUDNN, CUBLAS, etc.)

I highly recommend sticking with one technique or the other; never intermix them.


Yea it's not ideal but really no option is. Built in to Linux would be a problem too given the rate of GPU driver development. Most Linux installs in the corporate world are stuck on the major version of the kernel and system packages they shipped with.


All the AI software running on these data-centre chips is almost exclusively running on Linux.

I wish people would stop talking rubbish about NVIDIA's Linux support.


That's because Nvidia's Linux support for consumers is indeed trash, while their creator/business software (e.g. CUDA) is not trash - but you mostly hear consumers trashing Nvidia.


Only FOSS zealots, actually; the rest of us are quite OK with their binary drivers.


They don't make (relevant) money from consumer hardware on Linux.


I thought that only applied to their consumer products.


Their consumer products have Linux support too, the catch is just that the drivers are proprietary binary blobs


Don't they provide Linux drivers for their gaming graphics cards too, just not open source?


Yes


No Linux support? Guess I'll have to keep using Solaris with my A4000!


Nvidia provides Linux drivers for their server chips.


Don't they provide them for their consumer cards too, just that it's a closed source binary blob?



CUDA and related software/libraries only work on Linux or Windows.


Some are even Linux-only, like NCCL (AFAIK required to fully use NVLink).


Good lord 700W TDP!


Off topic, but I can't stand it when corporations use the names of actual people who never gave them permission for their marketing. For something like Shakespeare or Cicero I'm OK with it, but Grace Hopper was alive in my lifetime, and even Tesla feels a little weird. What gives you the right to use that person's reputation to shill your product?


> What gives you the right to use that person's reputation to shill your product?

Practically speaking, you have the right to do anything unless someone complains about it. A lot of popular figures, even those long dead, have estates and organizations that manage their likeness and other related copyright and IP. IDK what the situation is in this case, but Nvidia may very well have paid for the name.


I don’t think my kids have any more right to use my name than a corporation, unless I specifically grant them that right (like Walt Disney did by naming it the Walt Disney company). Another sickening one is the Ed Lee Club in SF, who endorses political candidates under the name of a much-loved dead SF mayor.


Your kids have the right to everything you own (including your name) by default unless you take steps to change that, say using a will or estate.


Yes, I know, I'm saying that it should not be that way. Rights to your likeness should end at your death unless you specifically write down otherwise.


Do you have kids?


Yes.


And you think they shouldn't have that right because of social concerns like accumulation of wealth?


No, because they don’t own my identity! If my name is valuable I can will it to them. I can will my money to them. I just don’t think they should be able to endorse political candidates with my name after I’m dead, unless I specifically gave them that right by contract.


You'd be dead...


The situation is that various Australian companies (think Kangaroo) and DISH network already have Hopper product lines and Nvidia didn't care about getting into a legal kerfuffle and used the name anyway. As to whether Hopper's estate was consulted I don't know.


I generally agree with you, but in this case I suspect Grace Hopper would be honored by it and also impressed with the engineering here. It's not like they slapped her name on a soda can or something.


That’s not NVIDIA’s place to decide.


Theranos' "Edison" machine enters the chat...


what gives you the right to own a name ever? especially once you're dead?


How does a DGX Pod w/ the new 3.2Tbps per machine NVLINK switch compare to Tesla Dojo?


When you take software support into account, probably very favorable.

I don’t know anything about the state of Dojo, but Tesla was very hand wavy about their software stack during their presentation. And running AI algorithms efficiently on a piece of hardware is one of those things that many HW vendors have a hard time getting right.


Tesla Dojo Training tile (25x D1): 565 TF FP32 / 9 PF BF16/CFP8 / 11GB SRAM / 10kW

NVIDIA DGX H100 (8x H100): 480 TF FP32 / 8+ PF FP16 / 16 PF INT8 / 640GB HBM3 / 10kW

Dojo off-chip BW: 16 TB/s / 36TB/s off-tile

H100 off-chip BW: 3.9TB/s / 400GB/s off-DGX


The new block cluster shared memory and synchronisation stuff looks really really nice.


And maybe taking this opportunity to ask: what happened to Nvidia's leak? The hacker hasn't made any more news, and Nvidia hasn't provided an update either.


In the keynote, Jensen made a sly remark about how they themselves could benefit a lot from one of their cyberthreat AI solutions.


I wouldn't be surprised if they ended up "collaborating"


The simplest explanation is that Nvidia just paid up.


What was the leak?



The keynote is impressive (https://www.youtube.com/watch?v=39ubNuxnrK8).

They can make an AI factory that will fit many times the current internet throughput in a small room.

I'm super excited about what kinds of applications this could mean.


Looking forward to seeing Doom run on this.


This would be cool if they had decent drivers for Linux.


They do have good drivers on Linux for the things this chip is intended to be used for (research, ML).


If you haven't had any issues with NVIDIA Linux drivers, you can count yourself extremely lucky. In the past, I had a 50/50 chance of boot failure after installing CUDA drivers over 12 different systems. Mainline Ubuntu drivers are somewhat stable, but installing a specific CUDA version from the official NVIDIA repos rarely works on the first try. Switching from Tensorflow to PyTorch has helped a lot though, as Tensorflow was much more picky about the installed CUDA version.

Obligatory Linus Torvalds on NVIDIA: https://www.youtube.com/watch?v=_36yNWw_07g


I can assure you systems that take advantage of this chip for scientific/ML workloads aren't running Windows.


they may have edited their comment but they were commenting on the lack of quality of their Linux drivers (which I agree with but only on a consumer level, never used nvidia in a server)



