Distributed systems in an alternative universe (github.com/lukego)
199 points by jsnell on March 9, 2016 | 32 comments


Everything is a network of computers. Some networks are very fast (QPI between sockets, PCIe), others only somewhat fast (SATA/SAS, USB, Ethernet). At both ends of the network there are discrete computers (hard disk controller, network card, keyboard, monitor, ...) talking to each other.

Essentially, the difference between reading data from local memory and reading it from, say, Amazon S3 is just the speed/latency of the various networks involved, and how many computers had to store, transform, and shuffle the data from one network onto another.

It only gets more amazing when you think about all the analog stuff that happens at the ends of those networks: the button presses of people on their keyboards, light hitting a sensor in a camera, spinning particles of magnetized rust, stored electrons in a flash chip, light travelling through fibers of glass, crystals moving in the pixels of your monitor, etc. In the end those analog actions are what make it "real" and what the whole network of networks is built for.


It seems to me like a big difference between multi-core CPUs and conventional networked systems is the expectation of reliability. For example, when a CPU core randomly fails, you don't expect the CPU as a whole to continue operating, do you?


Most of the time, presumably not. But then there's System z...


I've been thinking about a hypothetical CPU that has explicit cache and RAM access. It's pretty close to OP's alternative universe.

What would happen is people would write wrappers for such a system so that you could think about your RAM in a linear way. And you're now doing what a Haswell CPU does.

Of course, this approach has advantages for specific tasks. Let's say we want to implement a constant-time crypto primitive and keep all of the required data in cache. We can now do it.

This would also help with implementing an efficient VM, OS primitives, or many other interesting things. But it would not change the way people write C/Rust/JS/Python.

Of course, it's a useful model for thinking about performance: you can think about cache misses as an "internet download", synchronization primitives as a "LAN share", etc, etc.

And, of course:

http://blog.memsql.com/cache-is-the-new-ram/


Look up scratchpad memory. It has better performance and lower cost, is more predictable, and can help with covert channels. The thing is that most programmers didn't want to manage memory. Compiler vendors also didn't want to manage many different scratchpads. Cache was useful and productive. So, outside of a few products here and there, we all have caches.


It's not hypothetical; it's Cell. IMO it was a nightmare, but you can probably find an old PS3 with Linux and try it out.


DSPs (digital signal processors) let you manage the cache yourself. And believe me, it isn't fun - it's something you'd want to do only for algorithms that merit tedious hand-tuning.

As for linear, I believe the last high-performance CPU with a direct-mapped cache (i.e. not an X-way set-associative one) was the DEC Alpha?


As far as I know, the only way to have something close to Wireshark for CPUs is to do a simulation. Then you can measure network latencies, cache hits and so on, but the simulation will be extremely slow. There's GEM5 (http://www.gem5.org) - a simulator with a couple of DSLs designed to do just that - but I'd bet that Intel has their own thing. I did my thesis on simulating MESIF in GEM5; here's a link: https://dip.felk.cvut.cz/browse/pdfcache/kadlej16_2013dipl.p... - that could get people started if they are interested (it also might be a good introduction to the subject of cache coherency protocols). I have the code lying around somewhere; I've been wanting to make it public, but it needs cleanup first.


I think the closest thing to a Wireshark for CPUs is Intel VTune (https://software.intel.com/en-us/intel-vtune-amplifier-xe)


Actually, you can buy a JTAG debugger [0] for Intel CPUs. It's going to cost you though.

[0] - https://software.intel.com/en-us/articles/intel-system-studi...


Wouldn't the equivalent of that be a JTAG debugger for the NIC?


Although it isn't equivalent to Wireshark, Linux's perf command (no doubt OSX/Win/BSD have their own) is close enough and very easy to use.

It exposes a very large number of CPU counters, including cache misses etc.
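
For example, something like this (generic event names from Linux perf; "./yourprogram" is just a placeholder):

    perf stat -e cycles,instructions,cache-references,cache-misses ./yourprogram

prints totals for the whole run, and perf record / perf report can attribute the same events to individual functions.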

Not quite as thorough as simulating. But probably more useful in practice.


There was one for PowerPCs (PSIM?) that IBM put out during the Mac/POWER days.

But generally you run benchmarks to figure out what the costs are and performance counters to see how often your code is paying those costs.


Finally, a post that validates the usefulness of my BS in electrical and computer engineering (despite being a JS guy now)! But seriously, I believe that hardware understanding is extremely useful for all sorts of tangential tasks related to modern programming. I've found that having intimate knowledge of how a computer works, especially around memory and disk access, helps me to understand how to solve problems from the standpoint of a rote CPU rather than my high-level brain. It's also really useful for deciphering kernel-level errors.


The key difference I see is not just likelihood of failure (an L3 cache is pretty much always going to be available), but characteristics of failure. A distributed system experiences partitions which it might recover from. That uncertainty around recovery actually makes it a whole lot more difficult to reason about.


It applies at the socket level as well: modern systems with many sockets (physical processor chips) are likely to be NUMA (https://en.wikipedia.org/wiki/Non-uniform_memory_access), or non-uniform memory access, systems. That means there is a performance benefit if you access memory on your local NUMA node.
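
For instance, here is a minimal libnuma sketch (my own illustration, assuming Linux, at least two NUMA nodes, and linking with -lnuma); touching memory that lives on a remote node has to cross the socket interconnect:

    /* Minimal libnuma sketch (Linux); build with: cc -O2 numa_sketch.c -lnuma */
    #include <numa.h>
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        if (numa_available() < 0) {
            puts("no NUMA support on this system");
            return 1;
        }
        size_t len = 1 << 20;                             /* 1 MB */
        char *local_buf  = numa_alloc_onnode(len, 0);     /* placed on node 0's memory */
        char *remote_buf = numa_alloc_onnode(len, 1);     /* placed on node 1, if it exists */
        if (local_buf) {                                  /* cheap to touch from a node-0 core */
            memset(local_buf, 1, len);
            numa_free(local_buf, len);
        }
        if (remote_buf) {                                 /* touching this from node 0 crosses the interconnect */
            memset(remote_buf, 1, len);
            numa_free(remote_buf, len);
        }
        return 0;
    }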

And irrespective of NUMA, it also applies to data sharing between threads: if you have a system with 100s of threads and a multithreaded application written like this:

    while (globalFlag) {
        doSomeLightWork();
    }
Your application will not scale well. The cost of 100s of threads all accessing globalFlag will dominate performance, as the cache line for that variable bounces around the entire system.
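
A minimal sketch of that kind of bouncing (my own illustration, not the code above; assumes pthreads and C11 atomics, and the effect is clearest when the threads also write the shared line):

    /* Same arithmetic, different cache-line traffic.
       Build with something like: cc -O2 -pthread bounce.c */
    #include <pthread.h>
    #include <stdatomic.h>

    #define NTHREADS 8
    #define ITERS    10000000L

    static atomic_long shared_counter;               /* one cache line fought over by every core */

    struct padded { _Alignas(64) atomic_long n; };   /* one full cache line per slot */
    static struct padded per_thread[NTHREADS];

    static void *bounce(void *arg) {                 /* every add drags the line to this core */
        (void)arg;
        for (long i = 0; i < ITERS; i++)
            atomic_fetch_add(&shared_counter, 1);
        return NULL;
    }

    static void *stay_local(void *arg) {             /* each core keeps its own line */
        atomic_long *mine = &per_thread[(long)arg].n;
        for (long i = 0; i < ITERS; i++)
            atomic_fetch_add(mine, 1);
        return NULL;
    }

    int main(void) {
        pthread_t t[NTHREADS];
        for (long i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, stay_local, (void *)i);  /* swap in bounce to compare */
        for (long i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        return 0;
    }

On a many-core box the bounce version is dominated by moving that single line between cores rather than by the adds themselves, while the padded version lets each core keep its line to itself.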

All systems are becoming "distributed systems". Even though a single machine will get more and more beefy, we will still have to apply distributed systems thinking in order to take full advantage of that system.


A chip is a distributed system of transistors too. It's all a matter of scale. The CAP theorem applies there too. Discrete systems choose CA over P, which is why they crash (loss of A) if any single part of them is partitioned in any way.


As I understand it, they're officially considered to be a type of NUMA, cache coherent NUMA or ccNUMA: https://en.wikipedia.org/wiki/Non-uniform_memory_access#Cach...


Wouldn't globalFlag just be shared in a duplicated cache line between all cores until one of them modifies it?


If the code was:

    while (globalFlag) {
        ++someValue;
    }
Then you'd probably be okay. But calling a function which does some lightweight but non-trivial work will interfere with your cache. When you come back from the stack frame for doSomeLightWork(), you're more likely to need to load globalFlag from memory.

In a multithreaded application where you want to scale to dozens or even hundreds of threads, you have to look at even global memory accesses as communication.


That still doesn't sound like the cache line "bouncing around" any more than it would if a single-core app did the same work. There's no contention on it the way there would be on a shared mutable value, at least.


There is still contention, but not as much. With, say, 8 threads it probably doesn't matter much. But with 128, it starts to matter.


A helpful Googler once told me to always think as if you had just one server - with a hierarchy of computation, access and storage costs and different characteristics of failure.


Nice. Someone actually saying out loud that modern x86 is a chaotic system that actually behaves chaotically in important cases.

Modern computers are multi-modal for a lot of resource accesses: IO/CPU/locks/context switches/IRQs/memory/L1/L2/L3...

What does a mode change mean? Frequency changes. Basically the system has three main modes: fast, intermittent, and slow access.

Example: when the cache is full, pointer_p = x is roughly 400 cycles; otherwise it can go up to 15,000 cycles.

The problem is that these mode changes between resource accesses come with sharp, unpredictable transitions, and that the resources are tightly coupled together.

Memory overuse overflows L1/L2/L3 and then into swap, which triggers another mode in the IO access pattern. And slow memory accesses impact access to the data structures needed to process the hardware data stream...

Under congestion (when computers are heavily loaded), tightly coupled interactions coming from code written in an uncontrolled way tend to snowball, creating an avalanche of bad behaviours that makes important resource access patterns flip between slow and fast access in a way that parasitizes other resources. Chaotically.

Why is the code uncontrolled? Almost all optimizations are made by complex logic that Intel puts into the CPU, in silicon. So your code is uncontrolled whether you like it or not. And that is what this guy complains about: not the complexity of Intel CPUs, but the complication.

Our unit tests are, at best, one-dimensional stress tests, while the real world has between n and n * (n - 1) potential coupling behaviours, where n is the number of resources that can couple (with n = 10 resources that is already up to 90 pairwise interactions). And we have quite a lot of resources: HPET1/2/3, IRQs, context switches...

The number of mathematical combinations of behaviour that Netflix/Google/Amazon/Intel pretend to control is just NOT coverable.

What does that mean? It means no one should put their trust in modern computers for heavy-load critical tasks.

Unlike old computers, we cannot use modern ones at full capacity, and we don't know what the safe load is, so we add even more coupling, complexity, and chaos to the architecture.

The problem is that this means rapidly diminishing operations-per-watt efficiency.

The throughput of energy that humanity can collect per unit of time is bounded.

Old devs who did hardware know modern IT is heading in the wrong direction. But businessmen don't care, and journalists interview the successful businessmen, not the coders who are doing the actual job and know how the machinery behaves.


Good writeup. Except this part:

"Old devs that did Hardware know modern IT are heading the wrong direction. But, business men don't care, journalists interview the successful business men, not the coders that are doing the actual job and know how the machinery behaves."

The majority was going in that direction. However, there's been steady uptake of very different models that fight some of the problems you mention. Here are a few of them:

1. Simple SIMD/MIMD accelerators like DSPs, vector processors, and so on. GPGPU started dominating this market, but there are still simple accelerators out there.

2. RISC multicore CPUs plus accelerators. My favorite of this bunch is Cavium's Octeon II/III, just due to the combo of straightforward cores, good I/O, and accelerators for stuff we use all over the place. Cranking out mass-market, affordable versions of that is a great start at countering Intel/AMD's mess. That IBM, Oracle, and recently Google have maintained RISC components will help a transition.

3. "Semi-custom." AMD led the way w/ Intel following on modifications to their CPU's for customers willing to pay. All we know is there's a ton of money going into this. Who knows what improvements have been made. Potential here is to roll back some of the chaos, keep ISA compatibility with apps, and add in accelerators.

4. FPGAs w/ HLS. Aside from usable HLS, I always pushed for FPGAs to be integrated as deeply with the CPU as possible for low latency and high performance. SGI led by putting it into their NUMAlink system as RASC. My idea was putting it on the NoC, on the CPU. Intel just bought Altera. Now I'm simply waiting to see my long-term vision happen. :)

5. Language-oriented machines with an understandable, hardware-accelerated model for SW. The Burroughs B5000 (an ALGOL machine) and the LISP machines got this started. The only one still in the server space is Azul Systems' Vega3 chips: many-core Java CPUs with pauseless GC for enterprise Java apps. There's a steady stream of CompSci prototypes for this sort of thing, with lots of work on prerequisite infrastructure like LLVM. Embedded keeps selling Forth, Java, and BASIC CPUs. Sandia's Score processor was a high-assurance Java CPU that did first-pass in silicon. There's plenty of potential here to implement a clean model and unikernel-style OS while reusing silicon building blocks like ALUs from RISC or x86 systems for high performance. Intel won't do it because of what happened to the iAPX 432 and i960. They're done with that risk, but would probably sell such mods to a paying customer of the semi-custom business.

6. Last but not least: all the DARPA etc. work on exascale that basically removes all kinds of bottlenecks and bad craziness while adding good craziness. Rex Computing is one whose founder posts here often. There are others. Lots of focus on CPU vs memory accesses vs energy used.

So, these six counter the main trends of the biggest CPUs with a variety of success in the business sector. Language-oriented stuff barely exists, while RISC + accelerators, semi-custom, and SIMD/MIMD via GPUs say business is a-boomin'. The old guard's voices weren't entirely drowned out. There's still hope. Hell, there are even products up for sale, and Octeon II cards are going for $200-300 on eBay w/ Gbps accelerators on them. After I knock out some bills, I plan to try another model to get away from the horrors of Intel. :)


I think the Azul Vega line has been beaten by commodity x86-64 hardware and is now a legacy system; this bit I just noticed at the end of the company's page on the Vega3 reinforces that (https://www.azul.com/products/vega/):

"Choose Vega 3 for high-capacity JDK 1.4 and 1.5-based workloads. For applications built in Java SE 8, 7 or 6, check out Zing, a highly cost-effective 100% software-based solution containing Azul’s C4 and ReadyNow! technology, optimized for commodity Linux servers."

Much the same fate as custom Lisp processors.


I interviewed Gil Tene at QConLondon this week (founder of Azul and designer of the Vega). One of the Vega's advantages was that hardware transactional memory support was built in, which meant it could perform optimistic locking when entering synchronised methods, leading to reduced contention on heavily multi-core/multi-threaded workloads. It worked by tracking which memory areas were loaded into the cache and only permitting write-back once the transaction was committed, while using the chip's existing cache coherency protocols along the way.

When they pivoted to commodity x86-64 chips, the HTM code couldn't be ported over due to lack of support in the chips. However, Moore's law meant that the Intel processors were faster than the Vega, in the same way that an iPhone is more powerful than the Cray-1 was. So it's still a win.

He was particularly excited about Intel's latest generation of server CPUs, which now have this (in the form of the TSX extensions, if I remember correctly). He predicted that as these became available, the HTM support might be re-integrated into Zing - though of course this was speculation rather than a promise.
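
For illustration, here is a rough sketch of that kind of lock elision using Intel's RTM intrinsics (my own, not Azul's or Zing's code; assumes a TSX-capable CPU and compiling with -mrtm):

    /* Rough lock-elision sketch, not production code.
       Build with something like: cc -O2 -mrtm -pthread elide.c */
    #include <immintrin.h>     /* _xbegin, _xend, _xabort, _XBEGIN_STARTED */
    #include <pthread.h>

    static pthread_mutex_t fallback = PTHREAD_MUTEX_INITIALIZER;
    static volatile int lock_held = 0;   /* read inside the transaction to detect the fallback path */
    static long balance = 0;

    void deposit(long amount) {
        unsigned status = _xbegin();
        if (status == _XBEGIN_STARTED) {
            if (lock_held)               /* someone holds the real lock: abort the transaction */
                _xabort(0xff);
            balance += amount;           /* speculative write, held in this core's cache */
            _xend();                     /* commit: the write becomes visible atomically */
        } else {
            /* aborted (conflict, capacity, interrupt, ...): take the real lock instead */
            pthread_mutex_lock(&fallback);
            lock_held = 1;
            balance += amount;
            lock_held = 0;
            pthread_mutex_unlock(&fallback);
        }
    }

The Vega did the equivalent in hardware for Java's synchronised blocks: the cache tracks the transaction's read/write set, and the writes only become visible at commit.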

His talk on hardware transactional memory is here, and the slides/video will be available later on InfoQ:

https://qconlondon.com/presentation/understanding-hardware-t...

Obviously the talk/interview isn't there yet (it was only done Mon/Tue this week), but this is where it will be in the coming months:

http://www.infoq.com/author/Gil-Tene

Disclaimer: I believe Azul was a sponsor of QCon, but I wrote this and interviewed Gil because I'm a technology nerd, and I write for InfoQ because I want to share. Apologies if this comes across as an advert, which is not the intent.


Shit... doesn't surprise me. Except that's not a total loss. We don't know why it lost in the market. I'm guessing it's because a niche player tried to differentiate against Intel's silicon on performance - the thing Intel optimizes for. That's a fail for sure.

However, differentiating on acceptable performance with better reliability, analysis, security, and so on might work. The embedded stuff does this with success in terms of Java CPUs and high-security VMs. An old trick that might help is to implement the CPU in microcode on top of ultra-optimized stuff like Intel's while hiding the ugliness. Basically, the semi-custom stuff again. A lot of potential there.


And I think this is insane. We've gone through a series of "obvious, pragmatic" next steps to arrive at a model that seems sane and normal on the surface, but which results in...wat!?

http://mechanical-sympathy.blogspot.com/2011/07/false-sharin...

Writing highly concurrent and parallel stuff in Erlang to run on top of multi core systems seems sane to me. In most other languages? Maybe half-sane at best.


The biggest difference is event and time management.


Not fault detection and probability of failure?


How does this compare to IncrediBuild or Electric Accelerator?




