> Krishna also referenced the depreciation of the AI chips inside data centers as another factor: "You've got to use it all in five years because at that point, you've got to throw it away and refill it," he said.
This doesn't seem correct to me, or at least it's built on several shaky assumptions. You would have to 'refill' your hardware if:
- AI accelerator cards all start dying around the 5 year mark, which is possible given the heat density/cooling needs, but doesn't seem all that likely.
- Technology advances such that only the absolute newest cards can be used to run _any_ model profitably, which only seems likely if we see some pretty radical advances in efficiency. Otherwise, assuming your hardware is stable after 5 years of burn-in, it seems like you could continue to run older models on that hardware at only the cost of the floor space/power. Maybe you need new cards for new models for some reason (a new FP format that only new cards support? some magic amount of RAM? etc.), but it seems like there may be room for revenue from older/less capable models at a discounted rate.
Isn’t that what Michael Burry is complaining about? That five years is actually too generous when it comes to depreciation of these assets, and that companies are being too relaxed with that estimate. The real depreciation is more like 2-3 years for these GPUs that cost tens of thousands of dollars apiece.
That's exactly the thing. It's only about bookkeeping.
The big AI corps keep pushing GPU depreciation further into the future, no matter how long the hardware is actually useful. Some of them are now at 6 years. But GPUs are advancing fast, and new hardware brings more flops per watt, so there's a strong incentive to switch to the latest chips. They also run 24/7 at 100% capacity, so after only 1.5 years a fair share of the chips is already toast. How much hardware do they have on their books that's actually no longer useful? No one knows!
Slower depreciation means more profit right now (for those companies that actually make a profit, like MS or Meta), but it's just kicking the can down the road. Eventually all these investments have to come off the books, and that's where it will eat their profits. In 2024, the big AI corps invested about $1 trillion in AI hardware, and next year is expected to be $2 trillion. The interest payments alone on that are crazy. And all of this comes on top of the fact that none of these companies actually make any profit at all with AI (except Nvidia, of course). There's just no way this will pan out.
There are three distinct but related topics here, it's not "just about bookkeeping" (though Michael Burry may be specifically pointing to the bookkeeping being misquoted):
1. Financial depreciation - accounting principles typically follow the useful life of the capital asset (simply put, if an airplane typically gets used for 30 years, they'll split the cost of purchasing it equally across 30 years on their books; a rough sketch after this list shows the effect). Getting this right matters mostly for how future purchases get financed, because of how the bookkeepers show profitability, balance sheets, etc. Cashflow is ultimately what might create an insolvent company.
2. Useful life - per number 1 above - this is the estimated and actual life of the asset. So if the airplane actually is used over 35 years, not 30, its actual useful life is 35 years. This is to your point of "some of them are now at 6 years". Here is where this is going to get super tricky with GPUs. We (a) don't actually know what the useful life is or is going to be for these GPUs (hence Michael Burry's question), and (b) the cost of this is going to get complicated fast. Let's say (I'm making these up) GPU X2000 is 2x the performance of GPU X1000 and your whole data center is full of GPU X1000. Do you replace all of those GPUs to increase throughput?
3. Support & maintenance - this is what actually gets supported by the vendor. There doesn't seem to be any public info about the Nvidia GPUs, but typically these contracts are 3-5 years (usually tied to the useful life) and often can be extended. Again, this is going to get super complicated financially, because we don't know what future advancements in GPU performance might happen (which would necessitate replacing old ones and therefore creating renewed maintenance contracts).
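To make point 1 concrete, here's a minimal sketch of straight-line depreciation with made-up numbers (the fleet cost and revenue figures are purely hypothetical); it shows why stretching the assumed useful life flatters reported profit today:

```python
# Minimal sketch: straight-line depreciation with hypothetical numbers.
# Stretching the assumed useful life shrinks the annual depreciation expense,
# which inflates reported profit now and defers the cost recognition.

def annual_depreciation(purchase_cost, salvage_value, useful_life_years):
    """Straight-line: spread (cost - salvage) equally over the assumed life."""
    return (purchase_cost - salvage_value) / useful_life_years

gpu_fleet_cost = 1_000_000_000   # hypothetical $1B GPU purchase
revenue_per_year = 400_000_000   # hypothetical revenue attributed to that fleet

for life in (3, 5, 6):
    dep = annual_depreciation(gpu_fleet_cost, salvage_value=0, useful_life_years=life)
    print(f"{life}-year life: depreciation ${dep:,.0f}/yr, "
          f"reported operating profit ${revenue_per_year - dep:,.0f}/yr")
```

If the cards actually die or become uncompetitive in 2-3 years, the 5- or 6-year schedule just moves the pain into later accounting periods, which is Burry's complaint.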
Typical load management that’s existed for 70 years: when interactive workloads are off-peak, you do batch processing. For OpenAI that’s anything from LLM evaluation of the day’s conversations to user profile updates.
Flops per watt is relevant for a new data center build-out where you're bottlenecked on electricity, but I'm not sure it matters so much for existing data centers. Electricity is such a small percentage of total cost of ownership. The marginal cost of running a 5 year old GPU for 2 more years is small. The husk of a data center is cheap. It's the cooling, power delivery equipment, networking, GPUs etc. that cost money, and when you retrofit data centers for the latest and greatest GPUs you have to throw away a lot of good equipment. It makes more sense to build new data centers as long as inference demand doesn't level off.
How different is this from rental car companies changing over their fleets? I don't know, this is a genuine question. The cars cost 3-4x as much and last about 2x as long, as far as I know, and the secondary market is still alive.
> How different is this from rental car companies changing over their fleets?
New generations of GPUs leapfrog in efficiency (performance per watt) and vehicles don't? Cars don't get exponentially better every 2–3 years, meaning the second-hand market is alive and well. Some of us are quite happy driving older cars (two parked outside our home right now, both well over 100,000km driven).
If you have a datacentre with older hardware, and your competitor has the latest hardware, you face the same physical space constraints, same cooling and power bills as they do? Except they are "doing more" than you are...
The traditional framing would be cost per flop. At some point your total costs per flop over the next 5 years will be lower if you throw out the old hardware and replace it with newer more efficient models. With traditional servers that's typically after 3-5 years, with GPUs 2-3 years sounds about right
The major reason companies keep their old GPUs around much longer right now is the supply constraints.
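A rough sketch of that cost-per-flop framing, with entirely made-up placeholder numbers (hardware price, power draw, throughput, and electricity rate are not real product specs):

```python
# Rough sketch of the cost-per-flop framing: keep old GPUs vs. replace.
# All numbers are placeholders for illustration, not real product specs.

HOURS_PER_YEAR = 8760
POWER_PRICE = 0.08  # $/kWh, hypothetical industrial rate

def cost_per_pflop_hour(hw_cost, lifetime_years, kw_draw, pflops):
    """Total cost (amortized capex + power) per petaflop-hour delivered."""
    capex_per_hour = hw_cost / (lifetime_years * HOURS_PER_YEAR)
    power_per_hour = kw_draw * POWER_PRICE
    return (capex_per_hour + power_per_hour) / pflops

# Old card: capex is already sunk, so only power matters going forward.
old = cost_per_pflop_hour(hw_cost=0, lifetime_years=1, kw_draw=0.7, pflops=1.0)
# New card: must recover its purchase price over its assumed life, but delivers more.
new = cost_per_pflop_hour(hw_cost=30_000, lifetime_years=3, kw_draw=0.7, pflops=4.0)

print(f"old (sunk cost): ${old:.4f} per PFLOP-hour")
print(f"new            : ${new:.4f} per PFLOP-hour")
```

With these particular placeholders the already-paid-for card still wins on marginal cost (the TCO point made elsewhere in the thread); make the new card efficient enough, or add space/power constraints, and replacement wins - that crossover is the break-even being described.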
The used market is going to be absolutely flooded with millions of old cards. I imagine shipping being the most expensive cost for them. The supply side will be insane.
Think 100 cards but only 1 buyer as a ratio. Profit for ebay sellers will be on "handling", or inflated shipping costs.
I assume NVIDIA and co. already protect themselves in some way, either by the fact that these cards aren't very useful after resale, or by requiring them to go to the grinder after they expire.
In the late '90s, when CPUs were seeing the kind of advances GPUs are now seeing, there wasn't much of a market for two/three-year-old CPUs. (According to a graph I had Gemini create, the Pentium had 100 MFLOPS and the Pentium 4 had 3000 MFLOPS.) I bought motherboards that supported upgrading, but never bothered, because what's the point of going from 400 MHz to 450 MHz when the new ones are 600 or 800 MHz?
I don't think nVidia will have any problem there. If anything, hobbyists being able to use 2025 cards would increase their market by discovering new uses.
Cards don't "expire". There are alternate strategies to selling cards, but if they don't sell the cards, then there is no transfer of ownership, and therefore NVIDIA is entering some form of leasing model.
If NVIDIA is leasing, then you can't use those cards as collateral. You also can't write off depreciation. Part of what we're discussing is that terms of credit are being extended too generously, with depreciation in the mix.
They could require some form of contractual arrangement, perhaps volume discounts for cards, if buyers agree to destroy them at a fixed time. That's very weird though, and I've never heard of such a thing for datacenter gear.
They may protect themselves on the driver side, but someone could still write OSS.
I think it's a bit different because a rental car generates direct revenue that covers its cost. These GPU data centers are being used to train models (which themselves quickly become obsolete) and provide inference at a loss. Nothing in the current chain is profitable except selling the GPUs.
You say this like it's some sort of established fact. My understanding is the exact opposite and that inference is plenty profitable - the reason the companies are perpetually in the red is that they're always heavily investing in the next, larger generation.
I'm not Anthropic's CFO so I can't really prove who's right one way or the other, but I will note that your version relies on everyone involved being really, really stupid.
The current generation of today was the next generation of yesterday.
So, unless the services sold on inference can cover the cost of inference + training AND make money on top, they are still operating at a loss.
“like it's some sort of established fact” -> “My understanding”?! a.k.a pure speculation. Some of you AI fans really need to read your posts out loud before posting them.
The GPUs going into data centers aren't the kind that can just be reused by putting them into a consumer PC and playing some video games, most don't even have video output ports and put out FPS similar to cheap integrated GPUs.
And the big ones don't even have typical PCIe sockets, they are useless outside of behemoth rackmount servers requiring massive power and cooling capacity that even well-equipped homelabs would have trouble providing!
I would presume that some tier shaped market will arise where the new cards are used for the most expensive compute tasks like training new models, the slightly used for inference, older cards for inference of older models, or applied to other markets that have less compute demand (or spend less $ per flop, like someone else mentioned).
It would be surprising to me that all this capital investment just evaporates when a new data center gets built or refitted with new servers. The old gear works, so sell it and price it accordingly.
At that point it isn't a $10k card anymore, it's a $5k card. And possibly not a $5k card for very long in the scenario that the market has been flooded with them.
Ah, well, yes, to a degree that's possible, but at least at the moment you'd still be better off buying a $5k Mac Studio if it's just inference you're doing.
Why would you do that when you can pay someone else to run the model for you on newer more efficient and more profitable hardware? What makes it profitable for you and not for them?
I think it's illustrative to consider the previous computation cycle, a la crypto mining, which passed through a similar lifecycle with energy and GPU accelerators.
The need for cheap wattage forced operations to arbitrage location for the cheapest/most reliable existing supply - there was rarely new buildout, as the cost was to be reimbursed by the coins the mining pool recovered.
The chip situation caused the same appreciation in GPU cards, with periodic offloading of cards to the secondary market (after wear and tear) as newer/faster/more efficient cards came out, until custom ASICs took over the heavy lifting, causing the GPU card market to pivot.
Similarly, in the short to medium term, the uptick of custom ASICs like Google's TPU will definitely make a dent in both capex/opex and potentially also lead to a market of used GPUs as ASICs dominate.
So for GPUs I can certainly see the 5 year horizon making an impact on investment decisions as ASICs proliferate.
It’s far more extreme: old servers are still okay on I/O, and memory latency, etc. won’t change that dramatically so you can still find productive uses for them. AI workloads are hyper-focused on a single type of work and, unlike most regular servers, a limiting factor in direct competition with other companies.
I mean, you could use training GPUs for inference, right? That would be use case number 1 for an 8x A100 box in a couple of years. They can also be used for non-IO-limited things like folding proteins or other 'scientific' use cases. Push comes to shove, I'm sure an old A100 will run Crysis.
All those use cases would probably use up 1% of the current AI infrastructure, let alone what they're planning to build.
Yeah, just like gas, possible uses will expand if AI crashes out, but:
* will these uses cover, say, 60% of all this infra?
* will these uses scale up to use that 60% within the next 5-7 years, while that hardware is still relevant and fully functional?
Also, we still have railroad tracks from the 1800s rail mania that were never truly used to capacity and dot com boom dark fiber that's also never been used fully, even with the internet growing 100x since. And tracks and fiber don't degrade as quickly as server hardware and especially GPUs.
LambdaLabs is still making money off their Tesla V100s, A100s, and A6000s. The older ones are capable enough to run some models and very cheap, so if that's all you need, that's what you'll pick.
The V100 was released in 2017, A6000 in 2020, A100 in 2021.
Power consumption is only part of the equation. More efficient chips => less heat => lower cooling costs and/or higher compute density in the same space.
Even if the power is free you still need a grid connection to move it to where you need it, and, guess what, the US grid is bursting at the seams. This is not just due to data center demand; it was struggling to cope with the transition away from coal well before that point.
You also can’t buy a gas turbine for love nor money at the moment, and they’re not ever going to be free.
If you plonked massive amounts of solar panels and batteries in the Nevada desert, that’s becoming cheap but it ain’t free, particularly as you’ll still need gas backup for a string of cloudy days.
If you think SMRs are going to be cheap I have a bridge to sell you, you’re also not going to build them right next to your data centre because the NRC won’t let you.
So that leaves fusion or geothermal. Geothermal is not presently “very cheap” and fusion power has not been demonstrated to work at any price.
In-house hyperscaler stuff gets shredded, after every single piece of flash storage gets first drilled through and every hard drive gets bent by a hydraulic press. Then it goes into the usual e-waste recycling stream (ie. gets sent to poor countries where precious metals get extracted by people with a halved life expectancy).
Off-the-shelf enterprise gear has a chance to get a second life through remarketing channels, but much of it also gets shredded due to dumb corporate policies. There are stories of some companies refusing to offload a massive decom onto the second hand market as it would actually cause a crash. :)
Similar to corporate laptops where, due to stupid policies, for most BigCos you can't really buy or otherwise get a used laptop, even as the former corporate user of said laptop.
I use (relatively) ancient servers (5-10 years in age) because their performance is completely adequate; they just use slightly more power. As a plus it's easy to buy spare parts, and they run on DDR3, so I'm not paying the current "RAM tax". I generally get such a server, max out its RAM, max out its CPUs and put it to work.
Same, the bang for buck on a 5yo server is insane. I got an old Dell a year ago (to replace our 15yo one that finally died) and it was $1200 AUD for a maxed out recently-retired server with 72TB of hard drives and something like 292GB of RAM.
The idle wattage per module has shrunk from 2.5-3W down to 1-1.2W between DDR3 and DDR5. Assuming a 1.3W difference per stick (so 10.4W across 8 sticks, running 8760 hours a year), a DDR3 machine with 8 sticks would increase your yearly power consumption by almost 1% (assuming an average 10,500 kWh/yr household).
That's only a couple dollars in most cases but the gap is only larger in every other instance. When I upgraded from Zen 2 to Zen 3 it was able to complete the same workload just as fast with half as many cores while pulling over 100W less. Sustained 100% utilization barely even heats a room effectively anymore!
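Spelling out the parent's napkin math with the same assumed figures (1.3 W per stick, 8 sticks, 10,500 kWh/yr household):

```python
# Spelling out the parent's napkin math (same assumed figures).
sticks = 8
watt_diff_per_stick = 1.3          # DDR3 vs DDR5 idle difference, W
hours_per_year = 8760
household_kwh_per_year = 10_500

extra_kwh = sticks * watt_diff_per_stick * hours_per_year / 1000
print(f"extra energy: {extra_kwh:.0f} kWh/yr "
      f"({extra_kwh / household_kwh_per_year:.1%} of household usage)")
# -> roughly 91 kWh/yr, i.e. just under 1% of the assumed household total
```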
The one thing to be careful with Zen 2 onwards is that if your server is going to be idling most of the time then the majority of your power usage comes from the IO die. Quite a few times you'd be better off with the "less efficient" Intel chips because they save 10-20 Watts when doing nothing.
A similar one I just ran into: my Framework Desktop was idling @ 5W more than other reported numbers. Issue turned out to be the 10 year old ATX PSU I was using.
To be clear, this server is very lightly loaded, it's just running our internal network services (file server, VPN/DNS, various web apps, SVN etc.) so it's not like we're flogging a room full of GeForce 1080Ti cards instead of buying a new 4090Ti or whatever. Also it's at work so it doesn't impact the home power bill. :D
Maybe? The price difference on newer hardware can buy a lot of electricity, and if you aren't running stuff at 100% all the time the calculation changes again. Idle power draw on a brand new server isn't significantly different from one that's 5 years old.
Manipulating this for creative accounting seems to be the root of Michael Burry’s argument, although I’m not fluent enough in his figures to map it here. But it's interesting to see IBM argue a similar case (somewhat), and comments ITT hitting the same known facts, in light of Nvidia’s counterpoints to him.
with Michael Lewis, about 30 mins long. Highlights - he thinks we are near the top, his puts are for two years time. If you go long he suggests healthcare stocks. He's been long gold some years, thinks bitcoin is dumb. Thinks this is dotcom bubble #2 except instead of pro investors it's mostly index funds this time. Most recent headlines about him have been bad reporting.
> They still work fine but power costs make them uneconomical compared to latest tech.
That's not necessarily the driving financial decision; in fact I'd argue companies making data center hardware purchases barely look at this number. It's simpler than that - their support runs out and it's cheaper to buy a new piece of hardware (that IS more efficient) because the hardware vendors make extended support inordinately expensive.
Put yourselves in the shoes of a sales person at Dell selling enterprise server hardware and you'll see why this model makes sense.
Eh, not exactly. If you don't run the CPU at 70%+, the rest of the machine isn't that much more inefficient than a model generation or two behind.
It used to be that a new server could use half the power of the old one at idle, but vendors figured out a while ago that servers also need proper power management, and it is much better now.
The last few gens' increases could be summed up as "low % increase in efficiency, with TDP, memory channels and core count increases".
So for loads that aren't CPU-bound, the savings on a newer gen aren't nearly worth the replacement, and for bulk storage the CPU power usage is an even smaller part.
Definitely single thread performance and storage are the main reasons not to use an old server. A 6 year old server didn't have nvme drives, so SATA SSD at best. That's a major slow down if disk is important.
Aside from that there's no reason to not use a dual socket server from 5 years ago instead of a single socket server of today. Power and reliability maybe not as good.
NVMe is just a different form factor for what's essentially a PCIe connection, and adapters are widely available to bridge these formats. Surely old servers will still support PCIe?
I thought the same until I calculated that newer hardware consumes a few times less energy and for something running 24x7 that adds up quite a bit (I live in Europe, energy is quite expensive).
So my homelab equipment is just 5 years old and it will get replaced in 2-3 years with something even more power efficient.
Asking because I just did a quick comparison and it seems to depend. For comparison, I have a really old AMD Athlon "e" processor (literally September 2009 is when it came out according to some quick Google search, tho I probably bought it a few months later than that) that runs at ~45W TDP. In idle conditions, it typically consumes around 10 to 15 watts (internet wisdom, not kill-a-watt wisdom).
Some napkin math says it would take about 40 years of amortization for a replacement to pay for itself at my current power rates for this system. So why would I replace it? Even with some EU countries' power rates, we seem to be at 5-10 years amortization upon replacement. I've been running this motherboard, CPU + RAM combo for ~15 years now, replacing only the hard drives every ~3 years. And the tower it's in is about 25 years old.
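A back-of-envelope version of that amortization math; the replacement cost, power rate, and new system's idle draw here are assumptions, not the poster's actual figures:

```python
# Back-of-envelope version of the parent's amortization math.
# Replacement cost, power rate, and new-system idle draw are assumed; plug in your own.
idle_watts_old = 12      # old Athlon system at idle (the parent's 10-15 W estimate)
idle_watts_new = 8       # assumed idle draw of a modern low-power replacement
power_rate = 0.15        # $/kWh, assumed
replacement_cost = 300   # $, assumed cost of a new board + CPU + RAM

kwh_saved_per_year = (idle_watts_old - idle_watts_new) * 8760 / 1000
savings_per_year = kwh_saved_per_year * power_rate
print(f"~{kwh_saved_per_year:.0f} kWh/yr saved -> ${savings_per_year:.2f}/yr")
print(f"payback: {replacement_cost / savings_per_year:.0f} years")
```

With a mostly-idle box the savings are a few watts, so the payback stretches into decades, the same ballpark as the ~40 years above.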
Oh I forgot, I think I had to buy two new CR2032 batteries during those years (CMOS battery).
Now granted, this processor can basically do "nothing" in comparison to a current system I might buy. But I also don't need more for what it does.
That is definitely true and why I compared idle watts. That Athlon uses the same idle watts as modern mobile CPUs. So no reason to replace during the mostly idle times. Spot on. I can't have this system off during idle time as it wouldn't come up to fulfill its purpose fast enough when needed and it would be a pain to trigger that anyway (I mean, really, port knocking to start up that system type thing). Else I would. That I do do with the HTPC which has a more modern Intel core i3.
The "nothing" here was exactly meant more for the times when it does have to do something. But even then at 45W TDP, as long as it's able to do what it needs to, then the newer CPUs have no real edge. What they gain in performance due to multi core they loose in being essentially equivalent single core performance for what that machine does: HTPC file serving, email server etc.
Spinning rust and fans are the outliers when it comes to longevity in compute hardware. I’ve had to replace a disk or two in my rack at home, but at the end of the day the CPUs, RAM, NICs, etc. all continue to tick along just fine.
When it comes to enterprise deployments, the lifecycle always revolves around price/performance. Why pay for old gear that sucks up power and runs 30% slower than the new hotness, after all!
But, here we are, hitting limits of transistor density. There’s a reason I still can’t get 13th or 14th gen poweredge boxes for the price I paid for my 12th gen ones years ago.
There’s no marginal tax impact of discarding it or not after 5 years - if it was still net useful to keep it powered, they would keep it. Depreciation doesn’t demand you dispose of or sell the item to see the tax benefit.
No, but it tips the scales. If the new hardware is a little more efficient, but perhaps not so much so that you would necessarily replace it, the ability to depreciate the new stuff, but not the old stuff, might tip your decision.
But if your competitor is running newer chips that consume less power per operation, aren't you forced to upgrade as well and dispose of the old hardware?
Sure, assuming the power cost reduction or capability increase justifies the expenditure. It's not clear that that will be the case. That's one of the shaky assumptions I'm referring to. It may be that the 2030 nvidia accelerators will save you $2000 in electricity per month per rack, and you can upgrade the whole rack for the low, low price of $800,000! That may not be worth it at all. If it saves you $200k/per rack or unlocks some additional capability that a 2025 accelerator is incapable of and customers are willing to pay for, then that's a different story. There are a ton of assumptions in these scenarios, and his logic doesn't seem to justify the confidence level.
> Sure, assuming the power cost reduction or capability increase justifies the expenditure. It's not clear that that will be the case.
Share price is a bigger consideration than any +/- differences[1] between expenditure vs productivity delta. GAAP allows some flexibility in how servers are depreciated, so depending on what the company wants to signal to shareholders (investing in infra for future returns vs curtailing costs), it may make sense to shorten or lengthen depreciation time regardless of the actual TCO keep/refresh cost comparisons.
1. Hypothetical scenario: a hardware refresh costs $80B, the actual performance increase is only worth $8B, but the share price bump increases the value of the org's holding of its own shares by $150B. As a CEO/CFO, which action would you recommend, without even considering your own bonus that's implicitly or explicitly tied to share price performance?
Illustration numbers: AI demand premium = $150 hardware with $50 electricity. Normal demand = $50 hardware with $50 electricity. This is Nvidia margins @75% instead of 40%. CAPEX/OPEX is 70%/20% hardware/power instead of customary 50%/40%.
If the bubble crashes, i.e. the AI demand premium evaporates, we're back at $50 hardware and $50 electricity. Likely $50 hardware and $25 electricity if hardware improves. Nvidia back to 30-40% margins, operators on old hardware stuck with stranded assets.
The key thing to understand is that current racks are sold at grossly inflated premiums right now - scarcity pricing/tax. If the current AI economic model doesn't work, then fundamentally that premium goes away and subsequent build-outs are going to be cost-plus/commodity pricing = capex discounted by non-trivial amounts. Any breakthroughs in hardware, i.e. TPU compute efficiency, would stack opex (power) savings. Maybe by year 8, the first gen of data centers is still depreciated at $80 hardware + $50 power vs a new center @ $50 hardware + $25 power. That old data center is a massive write-down because it will generate less revenue than it costs to amortize.
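A toy version of that year-8 comparison, using the commenter's illustrative figures rather than market data:

```python
# Toy comparison of the year-8 scenario above: an old "AI premium" build still
# carrying book value vs. a new commodity-priced build. Figures are the
# commenter's illustrative numbers, not market data.
old_center = {"amortized_hardware": 80, "power": 50}   # bought at scarcity pricing
new_center = {"amortized_hardware": 50, "power": 25}   # cost-plus pricing + better hw

old_total = sum(old_center.values())
new_total = sum(new_center.values())
print(f"old build unit cost: {old_total}  |  new build unit cost: {new_total}")
print(f"old build must undercut by {old_total - new_total} per unit of compute "
      f"or take the difference as a write-down")
```

That gap per unit of compute is what ends up as either a price cut or a write-down on the old build.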
A typical data centre is $2,500 per year per kW load (including overhead, hvac and so on).
If it costs $800,000 to replace the whole rack, then that would pay off in a year if it reduces 320 kW of consumption. Back when we ran servers, we wouldn't assume 100% utilisation but AI workloads do do that; normal server loads would be 10kW per rack and AI is closer to 100. So yeah, it's not hard to imagine power savings of 3.2 racks being worth it.
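The same arithmetic as a quick check, using the parent's $2,500/kW-year and $800,000 figures:

```python
# Checking the parent's numbers: $2,500 per kW-year of data-centre cost,
# $800,000 to replace a rack. How much load reduction pays that back in a year?
cost_per_kw_year = 2_500
rack_replacement_cost = 800_000

kw_reduction_for_one_year_payback = rack_replacement_cost / cost_per_kw_year
print(f"{kw_reduction_for_one_year_payback:.0f} kW")
# -> 320 kW, i.e. roughly 3.2 racks' worth at ~100 kW per AI rack
```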
Thanks for the numbers! Isn't it more likely that the amount of power/heat generated per rack will stay constant over each upgrade cycle, and the upgrade simply unlocks a higher amount of service revenue per rack?
Not in the last few years. CPUs went from ~200W TDP to 500W.
And they went from zero to multiple GPUs per server. Tho we might hit "the chips can't be bigger and the cooling can't get much better" point there.
The usage would be similar if it was say a rack filled with servers full of bulk storage (hard drives generally keep the power usage similar while growing storage).
But CPU/GPU wise, it's just bigger chips/more chiplets, more power.
I'd imagine any flattening might be purely because "we have DC now, re-building cooling for next gen doesn't make sense so we will just build servers with similar power usage as previously", but given how fast AI pushed the development it might not happen for a while.
I've been in university research computing for 15 years, so large enough (~900 nodes) we need a dedicated DC, but not at the same scale as others around here.
Our racks are provisioned so that there are two independent rails, which each can support 7kW. Up until the last few years, this was more than enough power. As CPU TDPs increased, we started to need to do things like not connect some nodes to both redundant rails or mix disk servers into compute racks to keep under 7kW/rack.
A single HGX B300 box has 6x6kW power supplies. Even before we get to paying the (high) power bills, it's going to cost a small fortune to just update the racks, power distribution units, UPS, etc... to even be able to support more than a handful of those things
> Isn't it more likely that the amount of power/heat generated per rack will stay constant over each upgrade cycle,
Power density seems to grow each cycle. But eventually your DC hits power capacity limits, and you have to leave racks empty because there's no power budget.
Or they could charge the same as you and make more money per customer. If they already have as many customers as they can handle doing that may be better than buying hardware to support a larger number of customers.
It’s not about assumptions on the hardware. It’s about the current demands for computation and expected growth of business needs. Since we have a couple years to measure against it should be extremely straightforward to predict. As such I have no reason to doubt the stated projections.
Networking gear was famously overbought. Enterprise hardware is tricky as there isn’t much of a resale market for this gear once all is said and done.
The only valid use case for all of this compute which could reasonably replace AI is BTC mining. I’m uncertain if the increased mining capacity would harm the market or not.
BTC mining on GPUs hasn't been profitable for a long time; it's mostly ASICs now. GPUs can be used for some other altcoins, which makes the potential market for used previous-generation GPUs even smaller.
That assumes you can add compute in a vacuum. If your altcoin receives 10x compute then it becomes 10x more expensive to mine.
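A sketch of why that scaling holds for proof-of-work coins: the protocol emits coins at a roughly fixed rate, so your share (and hence your cost per coin) tracks total network hashrate. Numbers are arbitrary:

```python
# Why dumping more GPUs on a proof-of-work coin raises the cost per coin:
# the emission rate is fixed, so your share shrinks as total hashrate grows.
# Arbitrary illustrative numbers.
coins_emitted_per_day = 1_000
my_hashrate = 10.0
my_power_cost_per_day = 50.0

for network_hashrate in (100.0, 1_000.0):   # 10x more total compute
    my_coins = coins_emitted_per_day * my_hashrate / network_hashrate
    print(f"network hashrate {network_hashrate:>7.0f}: "
          f"I mine {my_coins:.0f} coins/day at ${my_power_cost_per_day / my_coins:.2f}/coin")
```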
That only scales if the coin goes up in value due to the extra "interest". Which isn't impossible but there's a limit, and it's more often to happen to smaller coins.
Failure rates also go up. For AI inference it’s probably not too bad in most cases, just take the node offline and re-schedule the jobs to other nodes.
There is the opportunity cost of using a whole datacenter to house ancient chips, even if they're still running. You're thinking like a personal use chip which you can run as long as it is non-defective. But for datacenters it doesn't make sense to use the same chips for more than a few years and I think 5 years is already stretching their real shelf life.
Do not forget that we're talking about supercomputers. Their interconnect makes machines not easily fungible, so even a low reduction in availability can have dramatic effects.
Also, after the end of the product life, replacement parts may no longer be available.
You need to get pretty creative with repair & refurbishment processes to counter these risks.
Historically, GPUs have improved in efficiency fast enough that people retired their hardware in way less than 5 years.
Also, historically the top of the line fabs were focused on CPUs, not GPUs. That has not been true for a generation, so it's not really clear if the depreciation speed will be maintained.
> Historically, GPUs have improved in efficiency fast enough that people retired their hardware in way less than 5 years.
This was a time when chip transistor cost was decreasing rapidly. A few years earlier even RAM cost was decreasing quickly. But these times are over now. For example, the PlayStation 5 (where the GPU is the main cost), which launched five years ago, even increased in price! This is historically unprecedented.
Most price/performance progress is nowadays made via better GPU architecture instead, but these architectures are already pretty mature, so the room for improvement is limited.
Given that the price per transistor (which TSMC & Co are charging) has decreased ever more slowly in recent years, I assume it will eventually come almost to a halt.
By the way, this is strictly speaking compatible with Moore's law, as it is only about transistors per chip area, not price. Of course the price per chip area was historically approximately constant, which meant exponentially increasing transistor density implied exponentially decreasing transistor price.
> This was a time when chip transistor cost was decreasing rapidly.
GPUs were actually mostly playing catch-up. They were progressively becoming more expensive parts that could afford being built on more advanced fabs.
And I'll have to point out, "advanced fabs" is a completely post-Moore's-law concept. Moore's law is literally about the number of transistors on the most economic package, not any bullshit about area density that marketing people invented in the last decade (you can go read the paper). With Moore's law, the cheapest fab improves quickly enough that it beats whatever more advanced fabs existed before you can even finish designing a product.
5 years is maybe referring to the accounting schedule for depreciation on computer hardware, not the actual useful lifetime of the hardware.
It's a little weird to phrase it like that though because you're right it doesn't mean you have to throw it out. Idk if this is some reflection of how IBM handles finance stuff or what. Certainly not all companies throw out hardware the minute they can't claim depreciation on it. But I don't know the numbers.
Anyway, 5 years is an inflection point in the numbers. Before 5 years you get depreciation to offset some of the cost of running it. After 5 years, you do not, so the math does change.
That is how the investments are costed, though, so it makes sense when we're talking return on investment: you can compare with alternatives under the same evaluation criteria.
General question to people who might actually know.
Is there anywhere that does anything like Backblaze's Hard Drive Failure Rates [1] for GPU Failure Rates in environments like data centers, high-performance computing, super-computers, mainframes?
The best that came back on a search was a semi-modern article from 2023 [2] that appears to be a one-off and mostly related to consumer-facing GPU purchases, rather than bulk data center, constant-usage conditions. It's just difficult to really believe some of these kinds of hardware depreciation numbers since there appears to be so little info other than guesstimates.
On continued checking, I found an arXiv paper from UIUC (Urbana, IL) about a 1,056-GPU A100 and H100 system. [3] However, the paper is primarily about memory issues and per-job downtime that causes task failures and work loss. GPU resilience is discussed, it's just mostly from the perspective of short-term robustness in the face of propagating memory corruption issues and error correction, rather than multi-year, 100%-usage GPU burnout rates.
Any info on the longer term burnout / failure rates for GPUs similar to Backblaze?
Edit: This article [4] claims it's 0.1-2% failure rate per year (0.8% (estimated)) with no real info about where the data came from (cites "industry reports and data center statistics"), and then claims they often last 3-5 years on average.
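For scale, applying that rough 0.8%/yr figure to a large fleet (the fleet size is arbitrary, and the constant-rate assumption is a simplification; real GPUs more likely fail on a bathtub curve):

```python
# Applying the article's rough 0.8%/yr failure-rate estimate to a big fleet.
# Fleet size is arbitrary and the constant-rate assumption is a simplification;
# real GPUs likely fail on a bathtub curve, not uniformly.
fleet = 100_000
annual_failure_rate = 0.008

surviving = fleet
for year in range(1, 6):
    failed_this_year = surviving * annual_failure_rate
    surviving -= failed_this_year
    print(f"year {year}: ~{failed_this_year:,.0f} failures, ~{surviving:,.0f} still running")
```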
When you operate big data centers it makes sense to refresh your hardware every 5 years or so because that’s the point at which the refreshed hardware is enough better to be worth the effort and expense.
You don’t HAVE to, but its more cost effective if you do.
(Source, used to operate big data centers)
It's worse than that in reality: AI chips are on a two-year cadence for backwards compatibility (NVIDIA can basically guarantee it, and you probably won't be able to pay real AI devs enough to stick around to build hardware workarounds). So their accounting is optimistic.
5 years is normal-ish depreciation time frame. I know they are gaming GPUs, but the RTX 3090 came out ~ 4.5 years before the RTX 5090. The 5090 has double the performance and 1/3 more memory. The 3090 is still a useful card even after 5 years.
Given power and price constraints, it's not that you cannot run them in 5 years time it's that you don't want to run them in 5 years time and neither will anyone else that doesn't have free power.