Poor Disk Performance

userbinator · on May 10, 2021

No HDD on the market has ever been variable-speed. What you may be perceiving as a change in RPM could just be the difference in harmonics of the spindle motor as you change the damping.

As for particles on the platters, the keywords to search for are "thermal asperity". An excess of them, or ones that are actually stuck to the platter, can cause damage, but as you have noticed, what usually happens is the head just knocks them out of the way and heats up slightly, causing a misread and subsequent retry (hence the slow speed). If you power on a working drive and then remove the lid, the air currents will keep any new particles from sticking to the platters.

By pushing down on the lid, however, (simulating screws) it sped up and down a few times before failing. The harder I pushed the less it vibrated and the more it worked, until I finally had it returning I/O, albeit slowly.

It's more likely that you were adjusting the actuator angle. See the bottom picture in this article for comparison (also a WD drive from around the same era):

https://hddguru.com/articles/2006.02.17-Changing-headstack-Q...

Teknoman117 · on May 10, 2021

I was about to call you out, but turns out the thought that WD green drives were variable RPM was just people misinterpreting crappy documentation from WD on how fast they operate. They didn't publish speeds because they reserved the right to sell any speed of drive under that label without telling you...

rsync · on May 10, 2021

"No HDD on the market has ever been variable-speed."

I'm not sure that's true ...

I remember from one of the Sun Performance Tuning manuals there is a chapter on disk performance and there was a throwaway line about "non commodity disks". Specifically, that there were, in the past, disk drives with heads that moved independently.

I don't know much more about "non commodity disks" - I think they must have been prevalent in the 70s (?) - but variable speed doesn't sound much weirder than independent head movement ...

userbinator · on May 10, 2021

Multiple heads moving independently have many advantages, varying the spindle speed doesn't; it takes a lot of energy to change the speed of the platters, and doesn't really provide any benefits. Zoned bit recording already takes advantage of the varying linear density, and doesn't require changing the platter speed.
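To put rough numbers on the zoned bit recording point, here is a minimal sketch. The radii, RPM, and linear bit density are made-up illustrative values, not figures from any real drive, and `track_data_rate` is a hypothetical helper. It only shows that at a constant spindle speed and a roughly constant linear density, the raw data rate scales with track radius, which is the effect ZBR exploits by putting more sectors on the outer zones.

```python
import math


def track_data_rate(radius_mm, rpm, bits_per_mm):
    """Raw bits per second passing under the head on a track at this radius.

    Assumes constant spindle speed (rpm) and a fixed linear bit density,
    both hypothetical inputs chosen for illustration.
    """
    circumference_mm = 2 * math.pi * radius_mm
    revs_per_second = rpm / 60
    return circumference_mm * bits_per_mm * revs_per_second


# Illustrative values only: inner track at 20 mm, outer track at 45 mm.
inner = track_data_rate(radius_mm=20, rpm=7200, bits_per_mm=40_000)
outer = track_data_rate(radius_mm=45, rpm=7200, bits_per_mm=40_000)

# The ratio depends only on the radii, since rpm and density cancel out.
print(f"outer/inner data rate ratio: {outer / inner:.2f}")
```

With the same density and RPM on both tracks, the ratio is just 45/20, so the outer track delivers a bit over twice the raw rate without the platters ever changing speed.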

In the 70s, disk drive motors would likely be mains powered synchronous induction motors, which are constant speed and where the traditional speeds of 3600, 5400, and 7200 RPM originated.

bradknowles · on May 10, 2021

Sony 3.5” floppy disks used to be variable speed. I knew about that at the time I interviewed for a position as an intern at Imprimis in 1989, just before they got bought by Seagate. The guy I was interviewing with smiled smugly and said “we’ve got a better solution — we vary the speed at which we read and write data!”

As a result of that job, I ended up writing what I believe to be the first FAQ for hard drives and stiction — in 1989. I haven’t been able to find a copy of it, however.

Dunno if there would be any useful information in that FAQ for the OP, but it would be interesting to check it out and see. That is, if anyone can find a copy.

rzzzt · on May 10, 2021

Dual head stacks were once a thing, and there was a recent thread about them as well: https://news.ycombinator.com/item?id=26502216

wmf · on May 10, 2021

I've only heard about one model of dynamic RPM (DRPM) hard disk and only prototypes were made.

gvb · on May 10, 2021

(When I was a sysadmin, I heard a story of how old VAX drives would stall, so holes had been drilled in them with tape over the holes. When stalled, the sysadmin would peel back the tape and use their finger to spin-start them. Those even older drives must have been more tolerant of dust!)

More than once I had a hard drive fail to start up after a power cycle (back then the drives only spun down when power was removed). First thing we tried was to remove the drive and give the whole drive a sharp spin on the axis of the platter. Due to inertia of the platter, this would tend to get the platter to move a bit and "unstick" it.

My recollection is that it worked every time I had to do this. Of course, we would back up that drive and replace it as soon as possible.

wazoox · on May 10, 2021

It still works on relatively modern drives. If the spindle or actuator arm is stuck, hitting the drive on its side (for instance by hitting a table with it) can free stuck movable parts. Worked well, well into the 500GB era.

toss1 · on May 10, 2021

Stiction. Good tip on how you were successfully eliminating the stiction -- likely useful to remember in many such situations!

[1] https://en.wikipedia.org/wiki/Stiction

anyfoo · on May 10, 2021

I went one step further: We had an old HP-UX machine with a failed hard drive that wouldn't spin up. The data on it wasn't supercritical but still nice to keep, so I was free to experiment. I removed the housing and pushed the platter by hand. It spun up, and while still open, I immediately took a full disk image with dd.

A story more similar to yours involves the 120MB hard disk in the PC I had when I was younger, which inconsistently exhibited similar symptoms. I had no money, so I always had to make do with what I had (many stories sprang out of that). The hard disk was in a removable caddy, and when it refused to spin up, I simply took it out of the PC and gently bounced it on my bed right next to the desk. Put back into the PC, it then worked every time as I recall.

kstrauser · on May 10, 2021

One time an acquaintance sold me a 50 MB SCSI drive filled with, umm, a selection of Amiga games (all PD, honest, officer!). When I got it home and installed, the drive howled like a banshee and a benchmark program said I was getting about 100KB/s reads from it. Figuring I had absolutely nothing to lose, I flipped it over and squirted a little 3-in-1 oil on the spindle bearing. The whine's pitch started increasing and quieting as the drive spun up to its full operating speed, and I watched the little graph slowly work its way up to a more reasonable 1MB/s. I made backups of the software on it then turned the system off, pulled the drive, and threw it away.

I have never before or since oiled a piece of computer hardware to improve its IO, but this one time it worked.

wiredfool · on May 10, 2021

I remember putting an Apple 20MB hard drive on a heater to warm up the lubricants so that it would spin up.

intc · on May 10, 2021

Perhaps 6 - 7 years ago we provided a 1U server for a client. Initially the server worked fine. Then we upgraded its disks (originally 80 gigs or so) to 1TB each (in RAID1 configuration).

After the new disks were installed the server started to have multitudes of issues with disk performance and random read / write errors.

I think it took us several days to understand that the new spinners were much more sensitive than the original ones, and that the large (very powerful) fan located between the HDDs emitted too much vibration for the disks to operate properly.

We ended up swapping the chassis to one with radial fans. Problem solved.

CraigJPerry · on May 10, 2021

>> What's good for one user may be bad for another

I bought 3 "identical" old Dell T3610 workstations off eBay for a home lab project. They have "identical" striped HDDs in them.

Kicking off a Fedora CoreOS install on all 3 simultaneously (which I've had to do a few times as I learned how Ignition works) results in the same ordering of the hosts finishing their rebuilds -

Machine 2 always finishes first by around 3 minutes of a 14 min wipe & reinstall process.

Machine 3 finishes ahead of machine 1 by around 45 seconds or so usually.

Almost 5 minutes in a <20 mins process, that's huge! I still don't actually know the root cause. Benchmarking disk I/O has them within a few percent of each other. It's not that they're contending to remotely load the OS installer - that gets cached in memory at the start. There's a few seconds difference in the UEFI bios startup timings but none of them are particularly consistent which is weird, i would have thought UEFI init time would be the same on a given host each boot, but there's a few seconds in it each time.

userbinator · on May 10, 2021

There's a few seconds difference in the UEFI bios startup timings

Could that be anything to do with the Intel ME or similar "management" spyware/etc. trying to phone home or do something? You may have to reflash the BIOS to "clean" that completely.

The other thing I can think of is that the CPU heatsinks aren't clogged with the same amount of dust, causing one machine to thermally throttle more than the others.

xen2xen1 · on May 10, 2021

Thermals are probably the right answer.

bentcorner · on May 10, 2021

That's interesting. If you're curious, swapping disks might tell you more. If it's thermals like other commenters have suggested then you won't see a difference. (unless you knock some loose!)

bityard · on May 10, 2021

I used to work at a web hosting company and one of my responsibilities was managing all the VPS backups. We had a few racks for "backup servers" which were all identical boxes. We would provision CentOS on them two or three at a time and I observed the exact same thing. Some servers were always just a little faster than others. Never figured out why.

teddyh · on May 10, 2021

> couldn't resist seeing if the disk was readable despite the dust, and finding out what was on it (I'd forgotten).

He proceeds to successfully read the disk, but doesn’t say what was on it.

lazide · on May 10, 2021

I’m guessing he forgot to tell us, hah.

flakiness · on May 10, 2021

I didn't expect Brendan Gregg to be talking about anything but Cloud anymore, but here he is! I appreciate his curiosity-chasing storytelling.

I wonder how to "read over 99.9999% of disk sectors successfully". Is there any handy script to do this without harm? Then I can try these tools locally on my Ubuntu laptop to see the numbers.

aidenn0 · on May 10, 2021

ddrescue is pretty good; it strides over the disk reading in big chunks, and writes out a map file of which chunks failed. Then it goes sector-by-sector over the failed chunks to fill-in the holes.
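A minimal sketch of that two-pass idea, for illustration only. This is not ddrescue's actual algorithm or code, just the strategy as described above. `two_pass_read` and `FlakySource` are hypothetical names; `FlakySource` is an in-memory stand-in for a failing disk so the sketch can run without touching real hardware. The chunk and sector sizes are assumptions.

```python
import io


class FlakySource:
    """Hypothetical stand-in for a failing disk.

    Any read that touches a sector listed in `bad_sectors` raises OSError.
    """

    def __init__(self, data, bad_sectors, sector=512):
        self._buf = io.BytesIO(data)
        self._bad = set(bad_sectors)  # sector indices, not byte offsets
        self._sector = sector

    def seek(self, offset):
        return self._buf.seek(offset)

    def read(self, length):
        start = self._buf.tell()
        first = start // self._sector
        last = (start + length - 1) // self._sector
        if any(s in self._bad for s in range(first, last + 1)):
            raise OSError("simulated read error")
        return self._buf.read(length)


def two_pass_read(src, size, chunk=64 * 1024, sector=512):
    """Copy `size` bytes from a seekable source, tolerating read errors.

    Returns (image, bad_offsets): the recovered bytes, with unreadable
    sectors left as zeros, and the byte offsets of those sectors.
    """
    image = bytearray(size)
    failed_chunks = []  # pass-1 "map": (offset, length) of chunks that failed
    bad_offsets = []    # sectors still unreadable after pass 2

    def read_at(offset, length):
        src.seek(offset)
        data = src.read(length)
        if len(data) != length:
            raise OSError("short read")
        return data

    # Pass 1: stride over the source in big chunks, noting which ones fail.
    for off in range(0, size, chunk):
        n = min(chunk, size - off)
        try:
            image[off:off + n] = read_at(off, n)
        except OSError:
            failed_chunks.append((off, n))

    # Pass 2: revisit only the failed chunks, one sector at a time.
    for off, n in failed_chunks:
        for s in range(off, off + n, sector):
            m = min(sector, off + n - s)
            try:
                image[s:s + m] = read_at(s, m)
            except OSError:
                bad_offsets.append(s)

    return bytes(image), bad_offsets


# Simulated 8-sector "disk" where sector 5 is unreadable.
data = bytes(range(256)) * 16
src = FlakySource(data, bad_sectors={5})
image, bad = two_pass_read(src, len(data), chunk=2048, sector=512)
good = len(data) // 512 - len(bad)
print(f"read {good} of {len(data) // 512} sectors; bad offsets: {bad}")
```

The point of the first pass is speed: healthy regions are copied in large reads, and the slow sector-by-sector work is confined to the chunks that actually failed. The same success count divided by the total is how a "percent of sectors read" figure could be derived.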

flakiness · on May 10, 2021

It seems a scary option to try casually :-/

> Never try to rescue a r/w mounted partition. The resulting copy may be useless. It is best that the device or partition to be rescued is not mounted at all, not even read-only.

https://www.gnu.org/software/ddrescue/manual/ddrescue_manual...

aidenn0 · on May 10, 2021

Yes, if you make a raw copy of a disk you are also concurrently writing to, you won't have a snapshot of the disk. Nothing surprising there, right?

[edit]

Also that section isn't talking about damaging the original disk, but rather ending up with a useless copy.

If you have a malfunctioning device then using it in any way may cause further malfunctions, but in general running something like ddrescue isn't going to destroy something that wasn't already about to self-destruct anyways.

londons_explore · on May 10, 2021

Perhaps the dust particle causes the head to jump up and back down again, but the disruption to the data stream being read isn't sufficient to prevent ECC returning good data?

bostonsre · on May 10, 2021

I think his book on systems performance is the best computer science book I've ever read.

john-tells-all · on May 10, 2021

thanks for the reminder! I'll go buy it now, Gregg is always an inspiration.

Here's the link => http://www.brendangregg.com/systems-performance-2nd-edition-...

nix23 · on May 10, 2021

Shout on it ;)

hs86 · on May 10, 2021

For the uninitiated: https://www.youtube.com/watch?v=tDacjrSCeq4

geerlingguy · on May 10, 2021

Incidentally, this post was by the same person as that video :)

kowlo · on May 10, 2021

Thank you - that is the highlight of my week

h2odragon · on May 10, 2021

somewhere i have a stack of platters from a 5in SCSI drive with a circle in the middle where the heads crashed and peeled the coating off the disk.

YourMeds · on May 10, 2021

Can we talk about how Microsoft made Windows 10 unusable on hard drives?

gotbeans · on May 10, 2021

He is packing up house and the disk is only 80GB.

My kneejerk, half-"/jk" reaction when I started reading was that Brendan was now all in on the chia fever.

Neil44 · on May 10, 2021

He found an old drive with the lid already removed, so the drive was potentially faulty beforehand, which may be why the lid was removed. To compare properly, you would want to performance test a known-good drive before removing the lid.