Real System Failures

bsder · on July 14, 2017

Quoting:

"For more than 30 years, our design lab has seen that no IC greater than 16 pins (except memory) has worked according to its documentation"

That matches the experience of every single embedded engineer I have ever known.

JoeAltmaier · on July 14, 2017

Oh yes, a thousand times Yes. The entire job of embedded engineers is to work around flaws in large SOC/SOM designs. The errata sheets are many pages of one-line descriptions. The tools barely work, any provided 'drivers' from the manufacturer are little more than proof-of-concept, and none of the advanced features work well/at all.

tluyben2 · on July 16, 2017

Yep, and that documentation is often bad. We create our own (currently we are working with some obscure mcu) but cannot share because NDAs etc with the chip manufacturer. And so the docs remain crap and the trial and error continues.

TeMPOraL · on July 14, 2017

This was an absolutely amazing read.

Few lessons I took from it:

- "There is no such thing as digital circuitry. There is only analog circuitry driven to extremes." Digital is a pretty leaky abstraction. You can't safely ignore the physical world. In particular, be wary of your digital parts changing into other parts (or new "parts" appearing out of the blue) thanks to physics.

- There's so much that can go wrong. I'm in awe of people working on life-critical systems, and of challenges they deal with.

- What the fuck is going on with IC durability? The presentation quotes a text from 2013, which says "Commercial semiconductor road maps show component reliability timescales are being reduced to 5–7 years, more closely aligning with commercial product life cycles of 2–3 years." I.e. if your device has modern electronics on-board, it already won't last long because semiconductor devices themselves are expected to naturally fail after few years. This makes me really sad about the state of our technological civilization.

- Don't ignore specs you don't fully, 100% honest-to-god understand! Slide 38 is a damning enough description by itself. I'd add that this also applies to bureaucracies and laws - just because you think some rule is stupid, doesn't mean it is. "Move fast and break things" approach has no place where lives (or livelihoods) can be affected.

- Even adding a node to a linked list isn't a trivial thing, and has many places in which you can screw it up. This highlights just how much acummulated complexity we're dealing with here.

- Life always finds a way... to grow in your electronics and break it.

nickpsecurity · on July 14, 2017

Far as IC's, Ive been reading on their failures for some time. As non-HW person, it looks like the physics of IC reliability get worse every time they shrink to a lower node. Variability of what each component does increases. The analog may drift more since I saw one group using digital cells to keep it accurate at 90nm. I keep seeing attempts to do magic at the gate or synthesis levels to counter a bunch of this. They already do to a large degree which is why they last 2-3 years in first place instead of breaking immediately.

So, it looks like they'd spend a lot extra with effects on performance/watts/cost to make them last for NASA lengths of time. Those same companies are incentivized to sell new chips regularly in a price-competitive market. So, no reason for them to do aforementioned work since it's just throwing away money.

Again, just overview of a non-HW guy that reads a lot of HW industry's publications. HW people correct anything I missed please.

lfowles · on July 14, 2017

To extremely simplify it, we're effectively making our lightbulbs smaller and expecting the same brightness out of them. We could keep a bulb going for a century, but it's not going to light as well as we expect.

deathanatos · on July 14, 2017

On slide 10, faulty hardware introduces a standing wave onto a bus. Two CPUs are nodes, two at antinodes, causing a 2-2 disagreement in the state of the system.

Yet, the slide goes on to argue this is a software problem? It was my impression that Byzantine tolerant systems required agreement among ⅔ of the nodes; if the system is split 50/50, how can even a tolerant system not fail? (Or rather, is it the difference between failing gracefully and failing spectacularly, and the slide fails to elaborate on exactly how the system failed? But I don't see how we can expect this to succeed.)

gizmo686 · on July 14, 2017

I am not sure if it is refering to the NASA incident you are refering to, but a situation consistent with said failure is described here [0] (section 5.3), although it seems like it still assumes some background I do not have to fully understand it.

In the situtation described in my link, there were 4 data sources, and 4 processors. A byzantine fault occurred with the link from one of the data sources to the processors. This cascaded into a failure of the entire 4 processor system when it fell into a 2:2 split. In concept, the processors could have communicated with each other and detected a disagreement in one of the four data sources.

The software bug is that a fault with one of the data sources cascaded into a fault with the entire processing unit. In a correctly functioning system, the fault should have been contained to just the one data source. That is, the problem should have been no worse then simply losing the bus entirely.

Assuming there was sufficient redundancy across the buses, the system could have continued functioning properly despite the fault. However, because the software did not correctly handle the fault, any other redundancy became useless.

[0] https://www.cs.indiana.edu/classes/p545-sjoh/post/lec/fault-...

jfoutz · on July 14, 2017

I kinda think thats the point. Not exactly a software problem, but an information theory problem. If you need to tolerate 2 failures, you need 6 machines.

Normal_gaussian · on July 14, 2017

5 machines. Even numbers of machines introduce a higher failure likelihood for no greater tolerance.

jfoutz · on July 14, 2017

You're thinking majority not Byzantine fault tolerance. 2/3 have to agree.

nickpsecurity · on July 14, 2017

Many of us in high-assurance systems have witnessed triple, modular redundancy fail. I started saying 3 out of 5 for that reason. May be same for other commenter. I also want where possible for the computers to be in different locations, using different hardware, and with different developers working against same API.

vitus · on July 14, 2017

Yes, but those aren't Byzantine failures. Byzantine failures present with incorrect values, not simply the absence of a value (as seen by a total hardware failure).

See the abstract in Lamport's original paper introducing the Byzantine generals problem [0].

We also see a similar issue in error correction -- an introductory undergrad course might teach this via Lagrange interpolation [1], where you need only n+k of the coefficients in the presence of erasure errors, but n+2k in the general case (where n is the size of the actual message, and k is the maximum number of errors to correct).

[0] https://www.microsoft.com/en-us/research/wp-content/uploads/...

[1] https://inst.eecs.berkeley.edu/~cs70/fa14/notes/n8.pdf

nickpsecurity · on July 14, 2017

Partial hardware failures exist. The wrong values start going through the system. A bit flip is easiest example. NonStop has been countering both partial and total HW failures in their design a long time now.

mcguire · on July 14, 2017

This is a fascinating presentation, including scenarios of real, honest-to-god Byzantine failures.

Also, it's a great example of the Edward-Tufte-hating, horribly, hilariously bad PowerPoint NASA presentation style.

TeMPOraL · on July 14, 2017

Interestingly, those "horribly, hilariously bad PowerPoint" presentations tend to have better content-to-noise ratio than the pretty ones.

This applies to the modern web as well.

mcguire · on July 15, 2017

My personal favorite example of a "NASA-style, bad PowerPoint presentation" is a SCRUM meeting involving a deck of slides with ~20 lines of text per slide. The SCRUM master involved was an employee of Accenture, IIRC. [Edit: And yes, the deck by itself had an impressively high information content.]

NASA and its contractors (which, in my experience, do most of the work attributed to NASA) have a weird, self-reinforcing cycle of decision-making by PowerPoint such that slide decks are important, necessary, and ubiquitous and therefore almost universally bad. Tufte's example of of a presentation burying the lede that the Shuttle will get blowed up is just one consequence.

dfox · on July 14, 2017

What this "NASA-style powerpoint" refers to is the fact that the content would be better presented as normal flowing text instead of set of slides that are overpacked with text.

On the other hand this presentation is not that bad, at least it does not have bullet lists nested five levels deep where text size does not match the nesting levels and such things.

mturmon · on July 14, 2017

Presentation appears to be due to Honeywell, not NASA. It's hosted on a NASA site due to the subject matter.

mcguire · on July 15, 2017

(In case you don't see it, check out my other comment, https://news.ycombinator.com/item?id=14777946, for further background data.)

The NASA environment, which includes Honeywell in this case, is full of these sorts of things. This one is actually quite good.

idlewords · on July 14, 2017

It's almost a mark of authenticity, the way Nobel laureates always have the ugliest websites.

idlewords · on July 14, 2017

It's stuff like this that makes me wonder how big the gap is between disaster planning in large scale computer infrastructure (like AWS), and what will happen when there is an actual major disaster like a large earthquake.

The amount of confidence people have in their ability to plan for contingencies seems to go down in proportion to their exposure to hardware. Complex systems are endlessly inventive when it comes to finding ways to fail.

contingencies · on July 14, 2017

All of these problems could have been found by formal analysis.

"If only we'd had the human, time, money and organizational support resources to plan ahead more accurately, we wouldn't have made this particular mistake!" That's called the benefit of hindsight, and it's the project manager's classic "told you so". To management it sounds like "give me more budget and a slacker timeline", and to engineering it sounds like "someone wants to use a different one-true-solves-all-problems-solution".

Experienced system designers know that the real art is knowing that out in the real world, things will fail no matter how careful you are, so anticipating and detecting both known and unknown failure modes and recovering from them is really the critical need.

For an accessible, real world study of how this can be achieved with arbitrarily complex software systems, I can highly recommend reading about Erlang, or alternatively deploying a nontrivial pacemaker/corosync cluster. Most engineers never build a system this resilient in their lifetime, but once you have, you can never look back.

mcguire · on July 15, 2017

Note that detecting and recovering from an unknown failure mode[1] can itself be a failure mode.

[1] https://mars.jpl.nasa.gov/mer/newsroom/pressreleases/2004012...

Gracana · on July 14, 2017

How is your magical-problem-avoiding-technology recommendation any different from theirs?

contingencies · on July 14, 2017

Well, for starters there are two well engineered solutions pre-built and battle tested that can be applied to arbitrary problems, by HN readers, now.

Further, instead of the "build the perfect system" philosophy put forward in the presentation (ie. formal analysis), both solutions use the alternative "tolerate and control for failure". This is a significant philosophical and practical distinction.

0xCMP · on July 14, 2017

Interesting thing from this is that he says they should have used more formal analysis to build failure and fault tolerant systems.

But how do you formally verify/analyze a system for fault and failure tolerance if the methods of detecting failure and other faults are themselves not enough?

e.g. The slide about COM/MON, which I admit I didn't fully understand, seems to be that the solution picked wasn't the very best possible one due to constraints and that failures were not detected that the point they were expected to.

I guess you at least would know those are failure/fault points which can not be tolerated or handled somehow and should be watched.

ricksharp · on July 14, 2017

Ok, I'm not a systems developer (I'm a full stack / cloud developer), so I don't usually work with systems that introduce analog faults (operations in software tend to either succeed or fail with an exception).

The only place I have encountered something like this was on an Arduino board where the use of a buzzer was causing a voltage drop that affected the logic of the code. (It appeared that a delay function returned immediately instead of taking 250ms, which sped up the loop.)

Question:

How do you actually implement Byzantine Fault Tolerance?

I found this in Wikipedia:

Byzantine fault tolerance mechanisms use components that repeat an incoming message (or just its signature) to other recipients of that incoming message. All these mechanisms make the assumption that the act of repeating a message blocks the propagation of Byzantine symptoms.

Is verifying the interpreted input value the primary way to design for Byzantine Fault Tolerance?

akshayn · on July 14, 2017