Attack or otherwise, this is ultimately a hardware reliability problem. Any access pattern that can cause bit errors is indicative of faulty memory. If I remember correctly, the original Rowhammer paper shows that RAM from ~2009 and before was completely unaffected. Yet in the continuing quest for higher densities and lower costs (is RAM not cheap enough already?), manufacturers are sacrificing reliability and correctness. IMHO that is not acceptable, and neither is their insistence that this is not a problem; it seems they were powerful enough to convince one well-known memory-testing application to make the RH test optional(!) and to spread FUD that failing that test wasn't really a concern, because a lot of RAM would fail it. NO access pattern should ever cause errors on correctly functioning hardware.
Sometimes it's worth dealing with complex abstractions in a higher layer rather than making the sacrifices necessary to implement a neat abstraction natively.
The problem of corruption at the physical layer when certain types of bit pattern occur is also encountered when transmitting data over a wire; constraining the physical parameters to remain suitable for the naive representation of binary works well for getting a signal across the PCB, but would be extremely limiting at intranet scales. The usual approach is to modulate the data in a way that avoids encoding the problematic bit patterns.
Are the tradeoffs necessary to maintain the simple abstraction worth it in this case? I don't know, but considering how much of a bottleneck RAM has become for modern hardware, I think it's worth considering the alternatives.
I think the problem here is that nothing at a higher layer is mitigating the effect of the attack. It's not so much a choice to put the complexity where it is cheapest as it is a total ball-drop on overall composite system correctness (and thus security).
From a different angle: I think your point is fair but I also think that for it to apply to this situation, the memory vendors would have needed to loudly and openly say that they were invoking that tradeoff so the OS vendors could adjust. Presumably that would also result in a lot of benchmarking being done to see if the net effect of a physical-layer vulnerability and a software-layer mitigation was actually a net positive.
There is nothing wrong with tradeoffs. There is everything wrong with violating an agreed-upon abstraction while pretending that you are not.
There is nothing wrong with selling RAM where certain access patterns corrupt the content in predictable ways. There is everything wrong though with selling that RAM for use in systems that are known to expect RAM to return exactly the bits written to it with a certain (high) degree of reliability. And it is wrong precisely because it is not a tradeoff. If you are honest about the properties of the RAM you are selling, then that is the basis for the system designer to make a decision whether using your RAM with an appropriate interface is a better choice than using more reliable RAM with a "traditional interface". Pretending that your RAM is suitable for the "traditional interface" is what prevents the tradeoff from happening and is essentially fraudulent.
Sure, but the abstraction of "RAM" basically prescribes something uniform and lossless, fundamental to the operation of most any software.
If we can make volatile memory chips significantly more dense by letting them be lossy, then let's either add another layer to the memory hierarchy, or rename L3 cache to RAM and move the L3<->L4 mechanics into real software.
At any rate, manufacturers shouldn't just be silently eroding the abstraction so they can compete on density harder.
Because markets summarize information in a single variable, price, and thus make it difficult for consumers to observe the erosion of quality that goes along with lower price. This is especially true if there is a concerted effort to pretend that the loss of quality is minimal or unimportant. Thus we end up with cheap crap everywhere; within a decade people have forgotten that they ever had a higher-quality product available and accept the new, lower standard as their baseline.
Because in a competitive environment, any value that can be sacrificed by one party to briefly get ahead of their competitors, will be sacrificed by everyone. That includes "quality", "correctness", and "not lying about it".
Properly done Target Row Refresh has a circuit-size cost around 0.1-0.2% and a performance cost of 0% or <0.1% depending on whether an attack is happening.
The serious impacts show up when you can't rely on it being done properly, and have to use expensive workarounds.
Surprisingly, Rowhammer-like memory problems go back to the early 1950s. Early computers (such as Manchester Baby and the IBM 701) used electrostatic Williams tubes as their main memory, storing data as dots and dashes on CRT tubes. One problem with Williams tubes was that if you accessed a location on the screen multiple times, the charge on a neighboring spot could be affected, flipping the bit. (Of course back then this was a correctness issue, not a security issue.) The quality of the tube was measured by the read-around ratio, the number of times you could read a bit without corrupting the neighbors. A good tube might have a read-around ratio of 50. Nobody missed Williams tubes when they were replaced by core memory.
> Of course back then this was a correctness issue, not a security issue
It's still a correctness issue today, too. I don't understand why manufacturers (and their customers) consider it OK to ship broken DRAM chips that do not conform to their stated specifications.
Rowhammer isn't (just) a security issue to be worked around, it's a hardware bug that needs to be fixed. As far as I can tell, it hasn't been.
> I don't understand why manufacturers (and their customers) consider it OK to ship broken DRAM chips that do not conform to their stated specifications.
Because they can, and sucks to be you. This is how things are everywhere. For competitive markets, the only real quality pressure is regulatory and contractual (and maybe reputational, sometimes). There needs to be a direct feedback loop between the value end-customers care about and the profit of producers/sellers for that value to matter.
As a random and interesting example of this phenomenon (really seen everywhere), here's something I learned yesterday: according to Derek Lowe[0], there's no graphene supplier anywhere that actually supplies you graphene, and they all tend to lie about it. Apparently this is one of the big things that holds graphene research back (and probably invalidates a bunch of papers).
The problem at the core is a tradeoff triangle: you can pick only two of correctness, size/speed, and cost. Almost everyone picks size and cost.
Pretty much all tech is this way. Layer 1 of most copper, fiber, and RF networks, as well as long buses, requires scrambling[1] of the data to prevent issues caused by clumps of 1s and 0s. Modern x64 CPUs scramble[2] data before it's written to RAM. SSDs scramble[3] data before writing it to the physical flash chips.
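For anyone unfamiliar with scrambling, here's a minimal sketch of the idea: XOR the payload with a pseudo-random keystream so pathological bit runs in the data never reach the physical medium. The LFSR polynomial and seed below are generic illustrations, not taken from any of the standards referenced above.

    # Toy additive scrambler: XOR the payload with a keystream from a
    # 16-bit Fibonacci LFSR so long runs of identical bits in the data
    # don't appear on the wire. Polynomial and seed are illustrative only.
    def lfsr_stream(seed, nbytes):
        """Yield nbytes of pseudo-random bytes from a 16-bit LFSR."""
        state = seed & 0xFFFF
        for _ in range(nbytes):
            byte = 0
            for _ in range(8):
                # taps 16, 14, 13, 11 (a maximal-length polynomial)
                bit = ((state >> 15) ^ (state >> 13) ^ (state >> 12) ^ (state >> 10)) & 1
                state = ((state << 1) | bit) & 0xFFFF
                byte = (byte << 1) | bit
            yield byte

    def scramble(data, seed=0xACE1):
        """Scrambling and descrambling are the same XOR operation."""
        return bytes(d ^ k for d, k in zip(data, lfsr_stream(seed, len(data))))

    raw = bytes(16)                   # worst case: a long run of zero bits
    on_wire = scramble(raw)           # looks pseudo-random on the medium
    assert scramble(on_wire) == raw   # receiver recovers the original data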
Couldn't the error rate of the ECC system be monitored, to detect an attack in progress and raise an alarm?
Even if the attacker was able to get the flipping completely reliable, there would presumably be a learning/probing phase with a period of elevated ECC corrections. Either this probing could be detected, or the attacker would be forced to remain below the threshold of detectability, slowing the attack down enough to make it impractical?
It would be detected and diagnosed as faulty hardware, which it is. If it keeps occurring after the RAM is replaced, then perhaps it could be treated as an attack.
The problem with characterising it as an "attack" is that it leads to the notion that certain access patterns are "bad", and that's not a slippery slope we should be heading down...
Can a software defense mechanism be implemented, say a check bit per 7 bits that emulates ECC?
Sure, that would reduce the total usable RAM by 1/8 ... But that would be a design choice to implement. Is ECC RAM only 12.5% more expensive than non-ECC? If it's higher, it may indeed be more advantageous to use non-ECC -if- a software compensation can be implemented.
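To make the proposal concrete, here's a minimal sketch of that kind of software check, assuming one even-parity bit stored per 7 data bits (the 1/8 overhead mentioned above). Note that plain parity only detects a single flipped bit per group; real SECDED ECC can also correct it, so this is a weaker guarantee than hardware ECC.

    # Toy software "ECC": one even-parity bit per 7-bit group (1/8 overhead).
    # Detects a single bit flip per group, but cannot correct it like SECDED.
    def parity7(bits):
        return bin(bits & 0x7F).count("1") & 1

    def protect(groups):
        """Attach a parity bit to each 7-bit data group."""
        return [(g & 0x7F, parity7(g)) for g in groups]

    def check(protected):
        """Return indices of groups whose stored parity no longer matches."""
        return [i for i, (g, p) in enumerate(protected) if parity7(g) != p]

    mem = protect([0x55, 0x2A, 0x7F])
    g, p = mem[1]
    mem[1] = (g ^ 0x04, p)   # simulate a Rowhammer-style single bit flip
    print(check(mem))        # -> [1]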
A general software defense mechanism would probably have to either intercept all memory accesses or insert code after each memory access.
In either case, extra memory accesses would be needed since checksums need to be loaded from memory. This would also make cache misses more frequent, since checksum data would evict non-checksum data from cache constantly. This would have a huge performance impact - most software contains a LOT of memory accesses.
However, it might be feasible to mitigate this in specific cases by having custom code in software that needs to be secure.
Databases often do this already (I'm more familiar with databases but I suspect filesystems probably do too). The original motivation was to provide some defense against bug reports along the lines of "your database ate my data", that turned out to be due to 3rd party code inside the same process crapping on memory, hardware errors etc.
These checksums are typically done on blocks of payload data of course, not all memory content.
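For illustration, here's a minimal sketch of that kind of block-level checksumming, assuming a CRC32 over a fixed-size page. The page size and the idea of returning the CRC alongside the page are invented for the example, not taken from any particular database.

    # Toy page checksum in the spirit of what databases do: compute a CRC
    # when a page is written and verify it when the page is read back.
    import zlib

    PAGE_SIZE = 8192  # bytes per page (illustrative)

    def write_page(payload):
        assert len(payload) == PAGE_SIZE
        return payload, zlib.crc32(payload) & 0xFFFFFFFF

    def read_page(page, stored_crc):
        if (zlib.crc32(page) & 0xFFFFFFFF) != stored_crc:
            raise IOError("page checksum mismatch: memory or storage corruption")
        return page

    page, crc = write_page(bytes(PAGE_SIZE))
    damaged = bytearray(page)
    damaged[100] ^= 0x01                    # simulate a single flipped bit
    try:
        read_page(bytes(damaged), crc)
    except IOError as e:
        print(e)                            # corruption is caught on read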
ECC single-bit errors are reported to the host OS (or can at least be queried). Simply counting those and alerting if they reach a high enough rate would be a pretty decent mitigation.
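As a rough sketch of what that could look like on Linux (assuming the EDAC subsystem is loaded and exposes per-controller corrected-error counters under /sys/devices/system/edac/mc/; the alert threshold here is an arbitrary placeholder):

    # Poll EDAC corrected-error counters and alert when the rate climbs.
    # Assumes Linux with EDAC loaded; threshold is arbitrary.
    import glob, time

    THRESHOLD = 10   # corrected errors per interval before alerting (arbitrary)

    def total_ce_count():
        total = 0
        for path in glob.glob("/sys/devices/system/edac/mc/mc*/ce_count"):
            with open(path) as f:
                total += int(f.read().strip())
        return total

    def monitor(interval=60):
        last = total_ce_count()
        while True:
            time.sleep(interval)
            now = total_ce_count()
            if now - last >= THRESHOLD:
                print("ALERT: %d corrected ECC errors in %ds -- "
                      "possible Rowhammer activity or a failing DIMM" %
                      (now - last, interval))
            last = now

    monitor()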
What measures can you take upon receiving such an alert? Shut down the server? The reporting doesn't include the task/user causing the error, so you don't get anything actionable.
You can place more invasive checks to figure out which process is responsible if you know you are being attacked. Or you can switch your system from a performance mode to a higher-assurance mode (e.g., refresh memory more often).
I wonder if having a separate stick of RAM exclusively dedicated to kernel space would provide any mitigation against privilege escalation via Rowhammer. Are we considering a future where every "ring" is literally a separate set of CPU, RAM, etc. in order to stymie side channels, or is that just too crazy?
If kernel space were relatively small this might be practical as a motherboard feature, possibly soldered in place. Though I doubt it'll become standard unless there are no other alternatives since it seems like a very specialized solution.
The kernel's memory usage is typically pretty small, unless you're considering the page cache to be part of it.
Although: I once investigated a soft freeze on a realtime-patched Linux system that turned out to be caused by a vendor's software somehow managing to indefinitely stall an RCU grace period, eventually consuming all available memory on the system. The kernel core dump being over 4GB in size was a bit of a give-away.
Haven't read the full article, but if I remember correctly, in order for ECCploit to work you do need to reverse-engineer the ECC function of the memory controller first.
Also, for people who just want the link to the academic paper (including abstract):
I haven't read the paper, so I don't know how reliably they can do it in a real world setting where they are not the only people interacting with the server, but they demonstrate that it's possible.
But isn't a key-value server perilously close to a database prompt? And this exploit depends on having authenticated access, right? Otherwise something like fail2ban would prevent hammering, I'd think.
They mentioned the attack can work with roughly a week's worth of unprivileged runtime, as long as the ECC mode of the RAM chips in the targeted system has previously been sufficiently reverse-engineered.
Is that too alarmist? To me, it sounds like something perhaps too cumbersome for casual drive by attacks, but it seems right down the alley of so called "persistent threats", or whatever it is we call those guys nowadays.
Every actual system compromise began as a previously “theoretical” attack. _Wired_’s article isn’t overly alarmist given the install base of devices with ECC. With a possible attack surface this large, I’d rather someone cry wolf than for those in the tech industry to be caught flat-footed.