Attack or otherwise, this is ultimately a hardware reliability problem. Any access pattern that can cause bit errors is indicative of faulty memory. If I remember correctly, the original Rowhammer paper shows that RAM from ~2009 and before was completely unaffected. Yet in the continuing quest for higher densities and lower costs (is RAM not cheap enough already?), manufacturers are sacrificing reliability and correctness. IMHO that is not acceptable, and neither is their insistence that this is not a problem; it seems they were powerful enough to convince one well-known memory-testing application to make the RH test optional(!) and to spread FUD that failing that test wasn't really a concern, because a lot of RAM would fail it. NO access pattern should ever cause errors on correctly functioning hardware.
Sometimes it's worth dealing with complex abstractions in a higher layer rather than making the sacrifices necessary to implement a neat abstraction natively.
The problem of corruption at the physical layer when certain types of bit pattern occur is also encountered when transmitting data over a wire; constraining the physical parameters to remain suitable for the naive representation of binary works well for getting a signal across the PCB, but would be extremely limiting at intranet scales. The usual approach is to modulate the data in a way that avoids encoding the problematic bit patterns.
Are the tradeoffs necessary to maintain the simple abstraction worth it in this case? I don't know, but considering how much of a bottleneck RAM has become for modern hardware, I think it's worth considering the alternatives.
I think the problem here is that nothing at a higher layer is mitigating the effect of the attack. It's not so much a choice to put the complexity where it is cheapest as it is a total ball-drop on overall composite system correctness (and thus security).
From a different angle: I think your point is fair but I also think that for it to apply to this situation, the memory vendors would have needed to loudly and openly say that they were invoking that tradeoff so the OS vendors could adjust. Presumably that would also result in a lot of benchmarking being done to see if the net effect of a physical-layer vulnerability and a software-layer mitigation was actually a net positive.
There is nothing wrong with tradeoffs. There is everything wrong with violating an agreed-upon abstraction while pretending that you are not.
There is nothing wrong with selling RAM where certain access patterns corrupt the content in predictable ways. There is everything wrong though with selling that RAM for use in systems that are known to expect RAM to return exactly the bits written to it with a certain (high) degree of reliability. And it is wrong precisely because it is not a tradeoff. If you are honest about the properties of the RAM you are selling, then that is the basis for the system designer to make a decision whether using your RAM with an appropriate interface is a better choice than using more reliable RAM with a "traditional interface". Pretending that your RAM is suitable for the "traditional interface" is what prevents the tradeoff from happening and is essentially fraudulent.
Sure, but the abstraction of "RAM" basically prescribes something uniform and lossless, fundamental to the operation of most any software.
If we can make volatile memory chips significantly more dense by letting them be lossy, then let's either add another layer to the memory hierarchy, or rename L3 cache to RAM and move the L3<->L4 mechanics into real software.
At any rate, manufacturers shouldn't just be silently eroding the abstraction so they can compete on density harder.
Because markets summarize information in a single variable, price, and thus make it difficult for consumers to observe the erosion of quality that goes along with lower price. This is especially true if there is a concerted effort to pretend that the loss of quality is minimal or unimportant. Thus we end up with cheap crap everywhere; within a decade people have forgotten that they ever had a higher-quality product available and accept the new, lower standard as their baseline.
Because in a competitive environment, any value that can be sacrificed by one party to briefly get ahead of their competitors, will be sacrificed by everyone. That includes "quality", "correctness", and "not lying about it".
Properly done Target Row Refresh has a circuit-size cost around 0.1-0.2% and a performance cost of 0% or <0.1% depending on whether an attack is happening.
The serious impacts show up when you can't rely on it being done properly, and have to use expensive workarounds.
Surprisingly, Rowhammer-like memory problems go back to the early 1950s. Early computers (such as Manchester Baby and the IBM 701) used electrostatic Williams tubes as their main memory, storing data as dots and dashes on CRT tubes. One problem with Williams tubes was that if you accessed a location on the screen multiple times, the charge on a neighboring spot could be affected, flipping the bit. (Of course back then this was a correctness issue, not a security issue.) The quality of the tube was measured by the read-around ratio, the number of times you could read a bit without corrupting the neighbors. A good tube might have a read-around ratio of 50. Nobody missed Williams tubes when they were replaced by core memory.
> Of course back then this was a correctness issue, not a security issue
It's still a correctness issue today, too. I don't understand why manufacturers (and their customers) consider it OK to ship broken DRAM chips that do not conform to their stated specifications.
Rowhammer isn't (just) a security issue to be worked around, it's a hardware bug that needs to be fixed. As far as I can tell, it hasn't been.
> I don't understand why manufacturers (and their customers) consider it OK to ship broken DRAM chips that do not conform to their stated specifications.
Because they can, and sucks to be you. This is how things are everywhere. For competitive markets, the only real quality pressure is regulatory and contractual (and maybe reputational, sometimes). There needs to be a direct feedback loop between the value end-customers care about and the profit of producers/sellers for that value to matter.
As a random and interesting example of this phenomenon (really seen everywhere), here's something I learned yesterday: according to Derek Lowe[0], there's no graphene supplier anywhere that actually supplies you graphene, and they all tend to lie about it. Apparently this is one of the big things that holds graphene research back (and probably invalidates a bunch of papers).
The problem at the core is a tradeoff triangle: you can pick only two of correctness, size/speed, and cost. Almost everyone picks size and cost.
Pretty much all tech is this way. Layer 1 of most copper, fiber, and RF networks, as well as long buses, requires scrambling[1] of the data to prevent issues caused by clumps of 1s and 0s. Modern x64 CPUs scramble[2] data before it's written to RAM. SSDs scramble[3] data before writing it to the physical flash chips.
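For anyone unfamiliar with scrambling, here's a minimal sketch of the idea: XOR the payload with a pseudo-random keystream so pathological bit runs in the data never reach the physical medium. The LFSR polynomial and seed below are generic illustrations, not taken from any of the standards referenced above.

    # Toy additive scrambler: XOR the payload with a keystream from a
    # 16-bit Fibonacci LFSR so long runs of identical bits in the data
    # don't appear on the wire. Polynomial and seed are illustrative only.
    def lfsr_stream(seed, nbytes):
        """Yield nbytes of pseudo-random bytes from a 16-bit LFSR."""
        state = seed & 0xFFFF
        for _ in range(nbytes):
            byte = 0
            for _ in range(8):
                # taps 16, 14, 13, 11 (a maximal-length polynomial)
                bit = ((state >> 15) ^ (state >> 13) ^ (state >> 12) ^ (state >> 10)) & 1
                state = ((state << 1) | bit) & 0xFFFF
                byte = (byte << 1) | bit
            yield byte

    def scramble(data, seed=0xACE1):
        """Scrambling and descrambling are the same XOR operation."""
        return bytes(d ^ k for d, k in zip(data, lfsr_stream(seed, len(data))))

    raw = bytes(16)                   # worst case: a long run of zero bits
    on_wire = scramble(raw)           # looks pseudo-random on the medium
    assert scramble(on_wire) == raw   # receiver recovers the original data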
Couldn't the error rate of the ECC system be monitored, to detect an attack in progress and raise an alarm?
Even if the attacker was able to get the flipping completely reliable, there would presumably be a learning/probing phase with a period of elevated ECC corrections. Either this probing could be detected, or the attacker would be forced to remain below the threshold of detectability, slowing the attack down enough to make it impractical?
It would be detected and diagnosed as faulty hardware, which it is. If it keeps occurring after the RAM is replaced, then perhaps it could be treated as an attack.
The problem with characterising it as an "attack" is that it leads to the notion that certain access patterns are "bad", and that's not a slippery slope we should be heading down...
Can a software defense mechanism be implemented, say a check bit per 7 bits that emulates ECC?
Sure, that would reduce the total usable RAM by 1/8 ... But that would be a design choice to implement. Is ECC RAM only 12.5% more expensive than non-ECC? If it's higher, it may indeed be more advantageous to use non-ECC -if- a software compensation can be implemented.
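To make the proposal concrete, here's a minimal sketch of that kind of software check, assuming one even-parity bit stored per 7 data bits (the 1/8 overhead mentioned above). Note that plain parity only detects a single flipped bit per group; real SECDED ECC can also correct it, so this is a weaker guarantee than hardware ECC.

    # Toy software "ECC": one even-parity bit per 7-bit group (1/8 overhead).
    # Detects a single bit flip per group, but cannot correct it like SECDED.
    def parity7(bits):
        return bin(bits & 0x7F).count("1") & 1

    def protect(groups):
        """Attach a parity bit to each 7-bit data group."""
        return [(g & 0x7F, parity7(g)) for g in groups]

    def check(protected):
        """Return indices of groups whose stored parity no longer matches."""
        return [i for i, (g, p) in enumerate(protected) if parity7(g) != p]

    mem = protect([0x55, 0x2A, 0x7F])
    g, p = mem[1]
    mem[1] = (g ^ 0x04, p)   # simulate a Rowhammer-style single bit flip
    print(check(mem))        # -> [1]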
A general software defense mechanism would probably have to either intercept all memory accesses or insert code after each memory access.
In either case, extra memory accesses would be needed since checksums need to be loaded from memory. This would also make cache misses more frequent, since checksum data would evict non-checksum data from cache constantly. This would have a huge performance impact - most software contains a LOT of memory accesses.
However, it might be feasible to mitigate this in specific cases by having custom code in software that needs to be secure.
Databases often do this already (I'm more familiar with databases but I suspect filesystems probably do too). The original motivation was to provide some defense against bug reports along the lines of "your database ate my data", that turned out to be due to 3rd party code inside the same process crapping on memory, hardware errors etc.
These checksums are typically done on blocks of payload data of course, not all memory content.
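For illustration, here's a minimal sketch of that kind of block-level checksumming, assuming a CRC32 over a fixed-size page. The page size and the idea of returning the CRC alongside the page are invented for the example, not taken from any particular database.

    # Toy page checksum in the spirit of what databases do: compute a CRC
    # when a page is written and verify it when the page is read back.
    import zlib

    PAGE_SIZE = 8192  # bytes per page (illustrative)

    def write_page(payload):
        assert len(payload) == PAGE_SIZE
        return payload, zlib.crc32(payload) & 0xFFFFFFFF

    def read_page(page, stored_crc):
        if (zlib.crc32(page) & 0xFFFFFFFF) != stored_crc:
            raise IOError("page checksum mismatch: memory or storage corruption")
        return page

    page, crc = write_page(bytes(PAGE_SIZE))
    damaged = bytearray(page)
    damaged[100] ^= 0x01                    # simulate a single flipped bit
    try:
        read_page(bytes(damaged), crc)
    except IOError as e:
        print(e)                            # corruption is caught on read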
ECC single-bit errors are reported to the host OS (or can at least be queried). Simply counting those and alerting if they reach a high enough rate would be a pretty decent mitigation.
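As a rough sketch of what that could look like on Linux (assuming the EDAC subsystem is loaded and exposes per-controller corrected-error counters under /sys/devices/system/edac/mc/; the alert threshold here is an arbitrary placeholder):

    # Poll EDAC corrected-error counters and alert when the rate climbs.
    # Assumes Linux with EDAC loaded; threshold is arbitrary.
    import glob, time

    THRESHOLD = 10   # corrected errors per interval before alerting (arbitrary)

    def total_ce_count():
        total = 0
        for path in glob.glob("/sys/devices/system/edac/mc/mc*/ce_count"):
            with open(path) as f:
                total += int(f.read().strip())
        return total

    def monitor(interval=60):
        last = total_ce_count()
        while True:
            time.sleep(interval)
            now = total_ce_count()
            if now - last >= THRESHOLD:
                print("ALERT: %d corrected ECC errors in %ds -- "
                      "possible Rowhammer activity or a failing DIMM" %
                      (now - last, interval))
            last = now

    monitor()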
What measures can you take upon receiving such an alert? Shut down the server? The reporting doesn't include the task/user causing the error, so you don't get anything actionable.
You can place more invasive checks to figure out which process is responsible if you know you are being attacked. Or you can switch your system from a performance mode to a higher-assurance mode (e.g., refresh memory more often).
I wonder if having a separate stick of RAM exclusively dedicated to kernel space would provide any mitigation against privilege escalation via Rowhammer. Are we considering a future where every "ring" is literally a separate set of CPU, RAM, etc. in order to stymie side channels, or is that just too crazy?
If kernel space were relatively small this might be practical as a motherboard feature, possibly soldered in place. Though I doubt it'll become standard unless there are no other alternatives since it seems like a very specialized solution.
The kernel's memory usage is typically pretty small, unless you're considering the page cache to be part of it.
Although: I once investigated a soft freeze on a realtime-patched Linux system that turned out to be caused by a vendor's software somehow managing to indefinitely stall an RCU grace period, eventually consuming all available memory on the system. The kernel core dump being over 4GB in size was a bit of a give-away.
Haven't read the full article, but if I remember correctly, in order for ECCploit to work you do need to reverse-engineer the ECC function of the memory controller first.
Also, for people who just want the link to the academic paper (including abstract):
I haven't read the paper, so I don't know how reliably they can do it in a real world setting where they are not the only people interacting with the server, but they demonstrate that it's possible.
But isn't a key-value server perilously close to a database prompt? And this exploit depends on having authenticated access, right? Otherwise something like fail2ban would prevent hammering, I'd think.
They mentioned the attack can work with roughly a week's worth of unprivileged runtime, as long as the ECC mode of the RAM chips in the targeted system has previously been sufficiently reverse-engineered.
Is that too alarmist? To me, it sounds like something perhaps too cumbersome for casual drive by attacks, but it seems right down the alley of so called "persistent threats", or whatever it is we call those guys nowadays.
Every actual system compromise began as a previously “theoretical” attack. _Wired_’s article isn’t overly alarmist given the install base of devices with ECC. With a possible attack surface this large, I’d rather someone cry wolf than for those in the tech industry to be caught flat-footed.