I'm shocked they didn't have a killswitch or automated stop-loss of some kind. A script that says "We just lost $5M in a few minutes; maybe there's a problem." Or, a guy paid minimum wage to watch the balance, with a button on his desk. $172,222 is a lot of minimum-wage years.
I work for a small automated trading firm (in foreign exchange), and marking positions to market is one of the difficulties in designing an effective kill switch, because these marks can easily make the difference between a large gain and a large loss. In fast-moving markets (which is when a kill switch is most useful), it's very hard to determine the true mid-market rate. Our system of course always has such a notion, but if we used it to shut down our system every time it looked like we had lost money, then a market data glitch (which is not at all uncommon) would impose a large opportunity cost as a human intervened during an active market.
Instead, we designed our system so that there's a very low threshold for it to stop trading if it appears to have lost money, but to only do so temporarily. If our marked-to-market position recovers shortly thereafter while the trading system is idle, then the apparent loss was probably due to a market data glitch. On the other hand, if our position does not recover, then the temporary stoppage becomes permanent, and a human intervenes. (Obviously, there are more details here, but this is the general idea, and it's worked very well for us.)
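The pause-then-confirm idea can be sketched in a few lines. This is a minimal illustration, not their actual system; the loss threshold, the recovery window, and the class interface are all invented for the example:

```python
# A sketch of a "soft" kill switch: pause on an apparent loss, resume if the
# mark recovers (likely a data glitch), halt permanently if it doesn't.
# All thresholds here are made up for illustration.

PAUSE_LOSS_THRESHOLD = -10_000   # pause trading on an apparent loss this large
RECOVERY_WINDOW_SECS = 30        # how long we wait for the mark to recover

class SoftKillSwitch:
    def __init__(self):
        self.paused_at = None    # timestamp of the temporary stoppage
        self.halted = False      # permanent stop; requires human intervention

    def on_mark_to_market(self, pnl, now):
        """Return True if trading is allowed given the latest marked P&L."""
        if self.halted:
            return False
        if self.paused_at is None:
            if pnl <= PAUSE_LOSS_THRESHOLD:
                self.paused_at = now     # apparent loss: pause, don't kill
            return self.paused_at is None
        # We are paused: did the mark recover while the system sat idle?
        if pnl > PAUSE_LOSS_THRESHOLD:
            self.paused_at = None        # likely a market data glitch; resume
            return True
        if now - self.paused_at >= RECOVERY_WINDOW_SECS:
            self.halted = True           # loss persisted: make the stop permanent
        return False
```

The key design point is that a data glitch only costs you a brief pause, while a real loss still gets escalated to a human.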
> a market data glitch (which is not at all uncommon)
This is a point that bears amplifying. People who do not work in the financial industry may not appreciate just how bad market data feeds are. Radical jumps with no basis in reality, prices dropping to zero, regular ticks going missing, services going offline altogether with no warning, etc.
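Feeds this unreliable force you to sanity-check every tick before acting on it. A minimal sketch of that kind of filter, with made-up thresholds (the 50% jump cutoff and 5-second staleness window are purely illustrative):

```python
# Reject the failure modes described above: zero prices, radical jumps,
# and feeds that silently go quiet. Thresholds are invented for illustration.

MAX_JUMP = 0.50        # reject a tick that moves >50% from the last good price
MAX_STALENESS = 5.0    # seconds without a good tick before the feed looks dead

class TickFilter:
    def __init__(self):
        self.last_price = None
        self.last_time = None

    def accept(self, price, now):
        """Return True only for ticks that pass basic sanity checks."""
        if price <= 0:                       # prices dropping to zero
            return False
        if self.last_price is not None:
            jump = abs(price - self.last_price) / self.last_price
            if jump > MAX_JUMP:              # radical jump with no basis in reality
                return False
        self.last_price, self.last_time = price, now
        return True

    def is_stale(self, now):
        """True if the feed has gone quiet (service offline, ticks missing)."""
        return self.last_time is not None and now - self.last_time > MAX_STALENESS
```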
What kind of technology stack are you guys using? Also, is your system constantly being improved to detect these things, or was it a one-time setup?
This particular subsystem was a replacement for a previous version, which was a kill switch and had led to opportunity costs. We spent a lot of time designing, implementing, and testing it, but haven't felt the need to touch it since then. It's sufficiently general that it doesn't need to be adapted as our strategies change, and it doesn't need to adapt to changing market conditions (as, e.g., a trading strategy does).
They unintentionally built up a huge position well beyond their internal capital thresholds. You don't have to worry about mark-to-market to detect that. If you think your FX strategy could end up with a max notional of, let's say, £50MM, you can put something in place to stop trading if you exceed that. If you are a smaller shop, your prime broker is probably already doing something like that for you.
Those types of risk controls (i.e. capital thresholds) are bog standard in our industry and it still kind of blows my mind that Knight managed to get around them so easily.
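The capital-threshold check described above is cheap precisely because it doesn't need a market price mark, just the notional of what you've traded. A sketch, where the £50MM figure comes from the comment and the interface is invented:

```python
# A pre-trade gross notional check: refuse new orders once cumulative gross
# notional exceeds a hard limit. Interface and numbers invented for illustration.

GROSS_NOTIONAL_LIMIT = 50_000_000  # e.g. £50MM max gross notional

class PreTradeRiskCheck:
    def __init__(self, limit=GROSS_NOTIONAL_LIMIT):
        self.limit = limit
        self.gross_notional = 0.0

    def allow(self, quantity, price):
        """Reject any order that would push gross notional past the limit."""
        notional = abs(quantity) * price   # gross: shorts don't net against longs
        if self.gross_notional + notional > self.limit:
            return False                   # stop trading; a human must intervene
        self.gross_notional += notional
        return True
```

Note that it uses gross rather than net notional, so a runaway strategy buying and selling in equal measure (as Knight's did) still trips the limit.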
How many millions in orders do they normally process per minute?
Since there were no procedures in place, would you like to be the guy who pulled the plug on the (let's guess) $100 million/minute processing system? Do you think you could get another job after that? What would the costs be for violating contracts? You could single handedly sink the company (which, in the end, this issue basically did).
I don't blame the guy for not killing all operations. He never should have been put in that situation. Proper QA, regression testing, monitoring, running a shadow copy and verifying its output: there are tons of things that could have prevented/mitigated this.
In this case the system was deploying new functionality, so your shadow couldn't just run last week's version or you'd get false alarms all the time. And obviously deploying the same broken version to the primary and the shadow wouldn't detect anything.
So you'd need two codebases and two developer teams, coordinated enough that their code produced exactly the same output yet independent enough that they didn't make the same mistakes. With the challenges of coordination this would more than double your costs.
Of course, with the benefit of hindsight, the costs might have been worthwhile...
Stop loss orders are not a panacea here. You could lock in a loss for no other reason than the market becoming very volatile due to some news item or some other reason like the flash crash. And you can't even guarantee to limit your losses at the stop loss price. It's just the price that triggers an attempt to unwind the position.
Also, the market may not go against you immediately. What if the glitch in the system means you're opening positions in stocks and you drive up the price by doing so? The losses are not immediately apparent. There's no screen where you could watch your losses run up in real time. The losses only become apparent once you try to unwind those positions and that's the case in many kinds of scenarios.
I believe it took J.P. Morgan months to unwind the London Whale positions and really know what losses were incurred.
I think there's a better chance of catching a glitch at the point where the positions are opened.
It looks like they saw positions accumulating in one of their accounts, but couldn't identify the source. And maybe they saw the positions too late, because of lag. Here's a description from the SEC report of what went wrong with their monitoring system:
"Moreover, because the 33 Account held positions from multiple sources, Knight personnel could not quickly determine the nature or source of the positions accumulating in the 33 Account on the morning of August 1. Knight’s primary risk monitoring tool, known as “PMON,” is a post-execution position monitoring system. At the opening of the market, senior Knight personnel observed a large volume of positions accruing in the 33 Account. However, Knight did not link this tool to its entry of orders so that the entry of orders in the market would automatically stop when Knight exceeded pre-set capital thresholds or its gross position limits. PMON relied entirely on human monitoring and did not generate automated alerts regarding the firm’s financial exposure. PMON also did not display the limits for the accounts or trading groups; the person viewing PMON had to know the applicable limits to recognize that a limit had been exceeded. PMON experienced delays during high volume events, such as the one experienced on August 1, resulting in reports that were inaccurate."
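Two of PMON's failures from that passage, no automated alerts and limits not displayed next to positions, are striking because fixing them takes very little code. A hypothetical sketch (the account names and limits are invented; only the two design points come from the SEC description):

```python
# A monitor that does what PMON reportedly did not: generate alerts
# automatically, and show the limit alongside the breach so the viewer
# doesn't have to know it from memory. Accounts and limits are invented.

ACCOUNT_LIMITS = {"33": 250_000_000, "equities": 100_000_000}

def check_exposure(positions, limits=ACCOUNT_LIMITS):
    """Return an alert string for every account over its limit.

    `positions` maps account name -> current gross exposure.
    """
    alerts = []
    for account, exposure in positions.items():
        limit = limits.get(account)
        if limit is not None and exposure > limit:
            alerts.append(f"{account}: exposure {exposure:,} exceeds limit {limit:,}")
    return alerts
```

The harder failure to fix, of course, is the missing link between the monitor and order entry; alerts still depend on someone acting on them.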
"Knight did not retest the Power Peg code after moving the cumulative quantity function to determine whether Power Peg would still function correctly if called."
From then on, they purely and simply deserved everything that happened to them.
They didn't retest some code that hadn't been used since 2003. Why would they? Sure, removing that code would've been a good idea, but that's a forgivable mistake.
Reusing a flag that did something different in a currently-deployed version, without having a "transition" version that ignores that flag? Dodgy, but makes sense if you're in a rush.
Needing to manually deploy code to 8 different servers? Just stupid.
Because untested code can have unintended side effects at some low level of probability, and we all know this - mostly we just don't think it's going to happen to us. But as the stakes riding on that low-probability event go up, so do the demands for securing against the risk.
Computers are very powerful when placed in certain configurations. The more powerful the system you're dealing with the more cautious you should be. If they were dealing with an app then, sure, I'd have a lot more pity for them not taking precautions - such precautions would not be reasonable to expect of them. But if you're not being excessively paranoid about such a powerful system as was deployed here, then you're doing it wrong.
I do feel some pity for them based on the fact that there's not a tradition of caution in programming. And I do agree that there were multiple points of failure in there. But testing all the code that's going to be on a system like this is a base level of caution that should be used - whether or not you intend to use that code. If you think it's too much bother to test, then it shouldn't be there - but if it's gonna be there then for god's sake test.
If I understand correctly, they were planning to remove the Power Peg code. Do you re-test the left-overs of your refactoring: the parts you delete?
The real problem here is that they were using incremental deployment and did not have a good process for ensuring the same changes were successfully made to all servers.
The parts you delete aren't left over. They've been deleted. Do you test the parts of your code left behind after refactoring? Yes, absolutely.
Code is either working correctly, and verified to be such by automatic testing, or it's not there. You don't leave unused code lying around to be removed later!
Right, and they were deleted. The code was not there after the refactoring. Except that because they were using an incremental deployment it was not actually removed from one of the servers. It was not really a case of untested software being run, but rather one of the servers running an older version of the software due to a deployment failure. The newer version repurposed the Power Peg flag, but since that one server was still running an older version of the code it behaved differently. It carried out the old behavior, which was not suitable for the current environment.
This is the biggest argument in my opinion against incremental deployment: it is hard to know exactly what is on any given box. Each time you push an incremental piece to a server you have effectively created a completely unique and custom version of the software on that server. Much better to package the entire solution and be able to say with certainty, "Server A has build number 123."
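The check this argues for is simple to state: after a deployment, every server must report the expected build number before trading is enabled. A sketch, where `get_build_number` is a stand-in for however you'd actually query a host:

```python
# Post-deployment fleet verification: refuse to go live unless every server
# reports the expected build. `get_build_number` is a hypothetical callable
# standing in for an SSH command, HTTP health endpoint, etc.

def verify_fleet(servers, get_build_number, expected):
    """Return the list of servers NOT running the expected build."""
    return [s for s in servers if get_build_number(s) != expected]

def safe_to_trade(servers, get_build_number, expected):
    """Only enable trading on a fleet that is uniformly on the expected build."""
    return not verify_fleet(servers, get_build_number, expected)
```

Had Knight run something like this, the one server still carrying the old Power Peg code would have shown up as a mismatch before the market opened.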