
Yet another academic work in the OS space that should have been tested against the state of the art in the (commercially backed) networking space.

From the abstract:

> However, none of these systems offer a general-purpose datapath OS replacement that meet the needs of µs-scale systems

One of the key points that the authors use to motivate their work is that using kernel bypass technologies is hard because it requires software engineers to rearchitect their code.

Call me sceptical of that claim. I can think of three (open source) systems that provide drop-in replacement network interfaces for existing software while offering nanosecond (not microsecond!) scale performance.

- Solarflare/Xilinx Open Onload - https://github.com/Xilinx-CNS/onload

- Exablaze/Cisco - ExaSOCK - https://github.com/cisco/exanic-software

- Mellanox/NVIDIA VMA - https://github.com/Mellanox/libvma

Each of the above systems is designed to offer a low-latency network stack replacement for existing applications, without code modification or even recompilation. Exactly the authors’ goals.

But there is not a single reference to any of these technologies in the paper. And crucially no comparison to any of them in the evaluation.

The authors seem blissfully unaware of what work has actually been done to solve these problems. Which makes the contributions of the paper … somewhat dubious.


One crucial difference I was able to spot is that Demikernel seems to have an approach that can work for OSes other than Linux. To wit, examples of the performance of the Demikernel approach are shown on a Windows system running in Azure.

Of the three examples you listed, none of them seem to support their kernel bypass capabilities in anything other than Linux.


Fair enough. My guess is that the commercial solutions are simply going where the money is. If there was a commercially pressing need to port to other systems, I’m sure it can be done.

With that said, it's one thing to have a legitimate reason for preferring an alternative approach. But the paper doesn't even mention these solutions, which means that the authors are either ignorant of what's already been done (bad), or deliberately avoiding comparisons to alternative approaches (worse).


I would actually put the reviewers on the spot: even if the authors weren't aware of such solutions, the reviewers should have been.


This is the trouble with submitting to OS conferences. Their networking knowledge is typically limited. I’ve seen the same thing happen in the past.


Note that Solarflare requires a per-NIC license to get that bypass. And, like Mellanox, those NICs are not cheap. Kernel bypass on COTS hardware, without recourse to high-end NICs with or without FPGAs and the like, is needed. I am benchmarking eRPC, built on top of DPDK, on AWS to that end [1]. And not having to play sysadmin and figure out configs, or at least simplifying that work, would help. DPDK isn't necessarily a straightforward library.

[1] https://www.usenix.org/system/files/nsdi19-kalia.pdf


FWIW - I don't think it would be that hard to port VMA or ExaSOCK to run on top of DPDK, certainly less effort than making a whole new category of operating system!

But ... selling the hardware is where those companies make their money, so there's not much of a business case for making their free and open-source software run on other people's (cheap) NICs.


Exactly this happened to me. We bought the TV and for years things were great. Really loved the functionality of the smart TV. Being able to stream YouTube at a moment's notice was great.

But then the adverts started. We paid $3000 for this TV, not a low-end “ad-supported TV”. The only way to “opt out” is to disable the “smart” TV functions, which I have done. Any smarts now come from my Apple TV, which does not thrust adverts in my face. Needless to say, that was my last Samsung purchase ever.


Basically the entire internet is built on TCP. But TCP wasn't always very good. There was a time when congestion nearly killed the early internet. Congestion Avoidance and Control describes how they fixed it. It's light-hearted, accessible, and one of my all-time favourite reads:

https://dl.acm.org/doi/10.1145/52324.52356


This is a standard second year CS course at UNSW: https://cgi.cse.unsw.edu.au/~cs2041/19T2/index.html

Topics included: shell, Perl, Git, JavaScript, web servers, Make, performance tuning.


Thinking in math first has been the catch-cry of functional programmers (and their formal logic/verification friends) for decades. And there's nothing wrong with it, unless the problem you are trying to solve actually requires performance. Then you have to think in "system" first.

For example: write a program that captures network packets and stores them to disk as fast as possible. There's no maths to think of here. The complexity is all in the "fast" part, and to solve that, a deep understanding of the architecture of the system is necessary. Fancy algorithms (maths) will not help you here. e.g. Will compression help? Depends on the properties of the system. Will batching help? Depends on the system. Will threading help? Depends on the system.


What you're describing is a very limited view of math, resembling the general public view of math as being algebra, trigonometry, geometry, and calculus; that is, all the math people are exposed to in secondary school. Look further and you'll see disciplines such as mathematical logic, combinatorics, and graph theory, without which you wouldn't have networking or binary or computers at all, really.

I don't know what you mean by "fancy algorithms", but all algorithms that run on your computer have a basis in math.

As for making things go as fast as possible, well that's a special case of the field of optimization, another mathematical discipline.


What I'm discussing here is the general concept of modelling your program formally before writing it (as per the article). What I'm arguing is that this type of approach is only possible for a certain set of applications which take the form of y = f(x), where f(x) is some type of data transformation/computation operation (e.g. calculate the GCD of these ints, find the shortest path through a given graph, sort this set, etc).

There's another set of applications which are I/O bound, are very important, and yet the "computation" that they perform ranges from limited to none. These applications are rooted in, and bounded by, system parameters: understanding how disks work, how network cards work, how CPUs work, etc. This is an optimisation problem, but not one that can be modelled mathematically (in any reasonable way) because of the vast complexities of the system. Building a TLA+ "proof" of such a system reduces trivially to x = x, and yet the system is still important, and difficult to write well.


> this type of approach is only possible for a certain set of applications which take the form of y = f(x), where f(x) is some type of data transformation /computation operation (e.g. calculate the GCD of these ints, find the shortest path through a given graph, sort this set etc)

These days I'm trying to be mostly an embedded guy, and I 100% understand what you're talking about re: problems that don't lend themselves well to mathematical modelling. Figuring out that your SPI bus is going slow because you've got the wrong multiplier in a clock domain isn't a math problem :)

What I'd like to add to your y = f(x) examples though is that many Business Problems can (and probably should!) be modelled as y=f(x) type problems. I've seen a ton of business logic over the years that modifies objects in a pretty ad-hoc manner and is incredibly hard to reason about, especially in the big picture. The vast majority of the time, those problems can be modelled roughly as:

  new_state = f(old_state, event)
When you start modelling the business problems more formally like that, you can start using things like TLA+ to do model checking and find gaps in your formulation of the problem. Maybe you've got a state/event pairing that you haven't thought of. Maybe there's a way to get a model into a state that it can't escape from. TLA+ is useful for a lot more than verifying "calculate the GCD of these ints, find the shortest path through a given graph, sort this set", and I want to make sure people reading this don't write it off as a mathematical curiosity.
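
To make the shape concrete, here's a minimal C sketch of that new_state = f(old_state, event) pattern; the states, events and names are all invented for illustration, but the point is that every state/event pairing is an explicit case, which is exactly what a model checker can enumerate:

    /* Pure transition function: new_state = f(old_state, event). */
    typedef enum { ORDER_NEW, ORDER_PAID, ORDER_SHIPPED, ORDER_CANCELLED } order_state;
    typedef enum { EVT_PAYMENT, EVT_SHIPMENT, EVT_CANCEL } order_event;

    order_state step(order_state s, order_event e)
    {
        switch (s) {
        case ORDER_NEW:
            if (e == EVT_PAYMENT) return ORDER_PAID;
            if (e == EVT_CANCEL)  return ORDER_CANCELLED;
            break;
        case ORDER_PAID:
            if (e == EVT_SHIPMENT) return ORDER_SHIPPED;
            if (e == EVT_CANCEL)   return ORDER_CANCELLED;
            break;
        default:
            break;
        }
        return s; /* any unlisted pairing leaves the state unchanged */
    }

Because the transition function is pure, a TLA+ model of it and the code line up almost one-to-one.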

I've done a few embedded implementations that had pretty complicated state machines under the hood (off the top of my head, a LoRaWAN implementation). I modelled the states in TLA+, and it was a wonderful platform for discovering the flaws in the model I'd put together. It took a couple iterations before the model checker was happy, and from there the implementation was mostly mechanically translating my TLA+ model into code. There was some housekeeping stuff to keep track of (the TLA+ model was an abstraction), but it pretty much worked first try.


> that's a special case of the field of optimization, another mathematical discipline.

I'm not sure how you can claim that the entire field of optimization is a "mathematical discipline". Algorithm analysis is, I suppose, but most other practical optimization work has little if anything to do with math.

When I've spent time doing optimization work, it has often involved things like:

* Discovering some API I'm using is particularly slow and finding an alternate one that's faster.

* Adding caching.

* Reorganizing code to take advantage of vector instructions. (Well, I haven't personally done this, but I know it's a thing many others do.)

* Reorganizing data to improve CPU cache usage.

* Evaluating some things lazily when they aren't always needed.

* Making objects smaller to put less pressure on the GC.

* Inlining functions or switching some functions to macros to avoid call overhead.

* Tweaking code to get some values into registers.


I'm not sure why you're being downvoted, but I agree with all of your points. Those aren't things that really lend themselves well to mathematical modelling. But... there is a huge field of math that does apply to this: statistics.

The first two cases are somewhat special:

- It may be immediately obvious that an API is terrible, and that the replacement is not. If API 1 takes 1 sec to call, and API 2 takes 100ms to call, easy choice without stats.

- Caching can be dangerous. While not really a stats problem, you do need to have a really solid model of what is getting cached, and how to know when to invalidate those cache entries.

For the rest of the examples you provided, you're making changes that may make the problem better, may have no effect, or may make the problem worse. You absolutely need to use statistics to determine whether or not changes like those are actually having an effect. Performance analysis is part math and part art, and without the math background, you're likely going to be spinning your wheels a bunch. Beyond stats, fields like queuing theory are going to make a huge impact when you're doing performance optimization in distributed systems.
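
To make that concrete, here's a minimal C sketch (numbers and names invented) of the most basic version of that statistics step: summarise before/after timing samples with a mean and a rough 95% confidence interval, and only believe the change if the intervals don't overlap.

    #include <math.h>
    #include <stdio.h>

    /* Mean and approximate 95% confidence interval of timing samples.
       For small sample counts you'd use a t value instead of 1.96. */
    static void summarize(const char *label, const double *ms, int n)
    {
        double sum = 0.0, sq = 0.0;
        for (int i = 0; i < n; i++) {
            sum += ms[i];
            sq  += ms[i] * ms[i];
        }
        double mean = sum / n;
        double var  = (sq - n * mean * mean) / (n - 1);   /* sample variance */
        printf("%s: %.3f ms +/- %.3f ms\n", label, mean, 1.96 * sqrt(var / n));
    }

    int main(void)
    {
        double before[] = { 12.1, 11.9, 12.4, 12.0, 12.2 };
        double after[]  = { 11.2, 11.5, 11.1, 11.4, 11.3 };
        summarize("before", before, 5);
        summarize("after",  after,  5);
        return 0;
    }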


First, what you're describing is not optimization-the-math-field, and second, _some_ of the examples do have a basis in mathematics.

> Discovering some API I'm using is particularly slow and finding an alternate one that's faster.

On its own this has nothing to do with math, but writing code as components/services/abstract layers with well-defined boundaries/interfaces/types means it's easier to reason about the code and avoid bugs. Implicitly I'm saying here that we should use a language with strong type support.

> Adding caching.

This is memoization, and without the premise that functions in code should behave like their mathematical counterparts, it is hard to reason about (a minimal sketch follows at the end of this comment).

> Reorganizing code to take advantage of vector instructions

These are mapping operations that are well defined in functional languages. The vectorized interface NumPy provides is an abstraction over maps.

> Making objects smaller to put less pressure on the GC

This is orthogonal to the actual mathematical basis of the code. For example, using enums instead of strings is a localized change.
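
To illustrate the memoization point, a minimal C sketch (slow_f is a made-up stand-in for any expensive, side-effect-free computation); the caching is only this easy to reason about because the result depends solely on the argument, like a mathematical function:

    #include <stdint.h>

    /* Stand-in for an expensive, pure computation. */
    static uint64_t slow_f(uint32_t n)
    {
        uint64_t acc = 0;
        for (uint32_t i = 0; i <= n; i++)
            acc += (uint64_t)i * i;          /* pretend this is costly */
        return acc;
    }

    #define CACHE_SIZE 1024

    uint64_t memo_f(uint32_t n)
    {
        static uint64_t cache[CACHE_SIZE];
        static int      have[CACHE_SIZE];

        if (n < CACHE_SIZE && have[n])       /* hit: return the cached result */
            return cache[n];

        uint64_t r = slow_f(n);
        if (n < CACHE_SIZE) {                /* miss: compute once, remember it */
            cache[n] = r;
            have[n]  = 1;
        }
        return r;
    }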


You don't need fancy math because someone else did it for you, but if the industry is to compete and advance, it's going to need all sorts of people. The fact that you don't have to juggle every mathematical complexity underlying your domain is a success of human cooperation and compartmentalization. It's certainly not free or inherent.


How so? Can you give an example?


For example, if you do end up needing compression, are you going to write your own compression library or use an existing one somebody else already wrote?


I think you're missing the point. I'm talking about the concept of mathematically modelling the program first, before writing it (as per the article). This process only applies to certain types of applications, ones that are rooted in algorithmic transformations (like compression), rather than in system and I/O operations, like copying a packet from a network RX buffer into a disk block, in the most efficient way.


I agree about thinking "in system", but your "as fast as possible" comment is interesting, because I think this actually highlights the kind of problem that mathematical thinking is good at fixing.

Why "as fast as possible"? Usually when programmers say this it's because they think "the faster, the better!". But obviously there is some speed beyond which there's no discernible improvement. At that point, pursuing the as-fast-as-possible mandate at the expense of other concerns is the wrong thing to do.

Therefore, there's an assumption built into this statement that the system will never be that fast, and therefore "as fast as possible" is a good target. Needless to say this isn't a safe assumption.

The thing is that we're built to be really bad at knowing when we're making assumptions. Thinking mathematically is to some extent a way of trying to overcome that limitation.


This article is not about formal verification.

Why would mathematical thinking involve rejecting physical realities? If there is a performance constraint you are trying to optimize, account for it.

> Write a program that captures network packets and stores them on disk as fast as possible

How are you going to do that without formulas involving:

- network bandwidth

- disk bandwidth

- SATA bandwidth

- packet ordering

- compression time and ratios

- Amdahl's law

This sounds ripe for mathematical reasoning! Absolutely the way you model the problem is informed by your knowledge of the hardware.
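
As a concrete example of the kind of formula involved, here's a back-of-envelope sketch in C with entirely assumed numbers (40 Gbit/s of capture against a disk array sustaining 2 GB/s): it tells you immediately whether compression has to help, and how fast the compressor itself must run.

    #include <stdio.h>

    /* Back-of-envelope sketch; all the numbers below are assumptions. */
    int main(void)
    {
        double nic_GBps  = 40.0 / 8.0;   /* 40 Gbit/s capture rate = 5 GB/s */
        double disk_GBps = 2.0;          /* assumed sustained write rate    */

        /* Minimum compression ratio for the disk not to be the bottleneck,
           and the rate the compressor itself must sustain (the slowest
           stage sets the ceiling). */
        printf("need >= %.1f:1 compression, compressor must sustain %.1f GB/s\n",
               nic_GBps / disk_GBps, nic_GBps);
        return 0;
    }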


+100 to this. I had him for H Comp 1A, literally changed my life.


Yes! This and only this! The posted "bit hacks" are little more than bit-level manipulations, exactly as intended. The Stanford hacks are truly excellent. My favourite is: https://graphics.stanford.edu/~seander/bithacks.html#RoundUp...


That's really clever, but extremely inefficient. Something like 1 << (sizeof x - __builtin_clz(x)) should be way faster.

Edit: you obviously have to multiply by 8 to convert bytes to bits
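
For reference, a sketch of both variants for 32-bit values. The clz form folds in the edit above (multiply sizeof by 8) and subtracts 1 first so that exact powers of two round to themselves, matching the Stanford version; __builtin_clz is a GCC/Clang builtin and is undefined for 0.

    #include <stdint.h>

    /* Classic bithacks round-up-to-the-next-power-of-2 (portable). */
    static uint32_t next_pow2_shift(uint32_t v)
    {
        v--;
        v |= v >> 1;
        v |= v >> 2;
        v |= v >> 4;
        v |= v >> 8;
        v |= v >> 16;
        return v + 1;                /* note: 0 maps to 0 */
    }

    /* clz variant: sizeof gives bytes, hence the * 8 to get bits. */
    static uint32_t next_pow2_clz(uint32_t v)
    {
        return v <= 1 ? 1 : 1u << (sizeof v * 8 - __builtin_clz(v - 1));
    }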


By the unwritten conventions of bit hacks folks, '__builtin_clz' is not a primitive (and elsewhere in that same document you will find a brace of different ways of doing clz/ctz). Many of these documents are old, and the techniques in them older still (dating back to HAKMEM).

You are definitely correct on most modern architectures, which have single-instruction count-leading/trailing-zeros operations that are typically no more expensive than an integer multiply. These have been latency = 3, throughput = 1 on most recent Intel architectures.

I think a redo assuming modern architectures is long overdue; i.e. let's work off the basis that count leading/trailing zeros, PDEP/PEXT, etc are fast and cheap and start from there.

Bonus round: if you had a lot of these operations to do, then you still might be able to go faster on x86 on SSE2 or AVX2 by using this trick, as the penalty to pass a SIMD register full of integers to a floating point op or vice versa is just one cycle. This trick becomes obsolete yet again with AVX512, which introduces parallel count leading zeros.


And neither of the bit-twiddling tricks is useful for ARM NEON, as bit operations in vector form are very limited... (plus there are pipeline stalls).

Also, multiply-adds can be fused, so if you're doing that it can be faster to just multiply by a different number instead of bit twiddling.


I'm not sure this is true. Many of the bit twiddling hacks can be used in NEON and they have a few unusual instructions I'm dying to play with.

I'm not sure which bit hack you're talking about that's done with a multiply or multiply-add. There's a nice use involving De Bruijn sequences for doing lg2 of a single bit that's very instructive - is that what you meant?
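
For anyone who hasn't seen it, this is (from memory, so check it against the Stanford page) the table-lookup version: multiplying a power of two by a De Bruijn constant puts a unique 5-bit pattern in the top bits, which indexes a small lookup table.

    #include <stdint.h>

    /* log2 of a 32-bit value known to have exactly one bit set. */
    static const int debruijn_pos[32] =
    {
         0,  1, 28,  2, 29, 14, 24,  3, 30, 22, 20, 15, 25, 17,  4,  8,
        31, 27, 13, 23, 21, 19, 16,  7, 26, 12, 18,  6, 11,  5, 10,  9
    };

    static int log2_of_pow2(uint32_t v)
    {
        return debruijn_pos[(v * 0x077CB531u) >> 27];
    }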


Of course, casting that int * to a float * is undefined behavior…


Of course.

When you first learn about 'cool hacks' like these, you'll want to use them everywhere.

Then you'll realize that they don't apply to certain data types in certain environments.

Then you'll come to realize that the rules governing which hacks to use are too complex to remember.

Then you'll want to write a program that figures it out for you.

Then you'll realize that those programs already exist, and they are called compilers.

So now you are back where you started. You have made no progress, but you have gained an important insight.


And from that, at some point you realize that the corollary to your #2 says that such hacks do apply to certain data types in certain environments. And then you've gained a very powerful tool to be used in specific circumstances.

There's a reason a lot of these potentially non-portable tricks show up in high-performance computing (particularly in videogames) and in the embedded space.


That reason being that sometimes spending 10+ hours on a tweak that gives a +0.1% performance improvement is worth it.

Otherwise, remember the first rule of code optimization for beginners: "Don't."

EDIT: And by "they don't apply", I mean they can crash your program or even silently corrupt your data.


Sometimes tweaks like these might only save 10-20 cycles, which in a vacuum doesn't sound like much, until you consider that before the tweak it started at 26 cycles and is now 6 cycles, and it's called 50 million times every second, which is a saving of a billion cycles per second.

For games, this can mean a higher frame rate. For data processing, it can mean the data gets processed significantly faster. It's especially important for things like video encoders, where even a 1% savings in CPU time can translate directly to 1% more potential revenue for the same operating cost at a company that relies heavily on video encoding.

Yeah, saving those cycles doesn't really mean anything to a beginner, but they can be a huge savings for high performance computation applications.


Yep the correct way is:

    union { int i; float f; } u;
    u.i = *i;

Now u.f contains the same bit pattern as the integer pointed to by "i" but interpreted as a float.

Compilers know how to generate code efficiently out of this.

IIRC this is implementation-defined behavior, e.g. things like endianness, alignment or padding, but it's not undefined behavior, whose evil effects can bubble up in unexpected parts of the program that apparently don't even touch the bad cast.


No, that's also incorrect. The correct way is:

    float f;
    memcpy(&f, i, sizeof f);
Don't worry about the implied overhead of the call to memcpy; the compiler will replace it with an intrinsic and emit the expected movss instruction.


Accessing through a union is legal in C, no? I thought that this was only disallowed in C++.


Maybe? There is wording in the C standard (since C99 TC3, I think) that allows reading from non-active members of a union, but the wording is unclear and implementations have taken a very narrow interpretation of it (i.e. the access path needs to be explicitly through a union type and the actual object needs to be a union). See the GCC documentation on unions and aliasing, for example.


It's explicitly legal in both C99 and C11, but C99 forgot to remove it from its master (non-normative) list of undefined behavior. It's up for debate whether or not it's legal in C++ (ISTR a C++0x draft that copied the C99 changes to unions, but then they added unrestricted unions and completely changed the verbiage to have almost no relation to the C99 text).


It mostly isn't legal, though gcc explicitly promises to support it and Microsoft's compiler doesn't even do the optimizations that would cause trouble.


Wow, never realized the pointer cast version is undefined behaviour in C. Is the C++ equivalent via reinterpret_cast undefined behaviour too?

Back in my C++ days, I would much prefer a pointer cast to a union struct, because the latter was subject to the alignment/padding issues you mention, while for the former, I know of no desktop architecture that could mess up an int* -> float* cast.

I'm also doubly surprised because being able to do that stuff is what I'd consider the selling point of C/C++ over higher-level languages.


Pretty much all pointer casts where you would use reinterpret_cast in C++ are undefined behavior, in both C and C++. The underlying rule for strict aliasing is essentially the same in both languages (you can only access an lvalue via a pointer of its type, with possibly differing signedness, const, volatile, or via a char pointer, or via a field of a struct that has the same initial layout if you're in that same initial layout portion).

Yeah, C and C++ aren't the language most people think they are, and the strict aliasing rules are the most clear manifestation of that.
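
A tiny illustrative sketch of what that rule permits and forbids (the function and names are made up; memcpy is the sanctioned escape hatch):

    #include <stdint.h>
    #include <string.h>

    uint32_t float_bits(float f)
    {
        /* uint32_t *p = (uint32_t *)&f;                                     */
        /* uint32_t bad = *p;    <-- undefined: a float accessed as uint32_t */

        unsigned char first = *(unsigned char *)&f;  /* OK: char may alias anything */
        (void)first;

        uint32_t bits;
        memcpy(&bits, &f, sizeof bits);              /* OK: copies the representation */
        return bits;
    }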


Trouble is, C was that language. Lots of people came to depend on this. The world needs a language with this ability.

The replacement is often "gcc with inline assembly" or "gcc with lots of __attribute__ extensions such as __may_alias__" or "gcc with optimizations disabled".

The need doesn't go away, despite the wishes of a language committee now dominated by compiler suppliers. From day 1, long before the language was standardized and even before the original (not ANSI) book was published, C was being used with aliasing. It was, after all, developed to support the creation of Unics (before that became UNIX).


reinterpret_cast in C++ has essentially the same rules, with exceptions made for const qualifiers and such, as well as void *, char *, and std::byte *. Use a memcpy instead.


Nope, it's undefined:

> Accessing an object by means of any other lvalue expression (other than unsigned char) is undefined behavior

https://wiki.sei.cmu.edu/confluence/display/c/EXP39-C.+Do+no...


I don't understand your comment. The link you pasted explicitly mentions the union example as a "compliant solution".

That said, memcpy plus a compiler that understands the memcpy intrinsic is probably safer; though I have a vague memory of hitting an embedded compiler where memcpy wasn't an intrinsic.


This is technically undefined in C, but most compilers define it to work as expected as an extension to the language.


While that is a clever hack, if you're looking for performance (which is where I've used these before) the float-to-int conversion normally hoses your pipeline (depending on the platform, obviously), and the solution below is much faster on any system I've worked with.


Congestion Avoidance and Control, by Van Jacobson, looks at how TCP, the fundamental protocol behind the internet, works (or didn't work, to begin with). It's a pretty easy paper to digest without too many dependencies, especially if you skip the maths (which is explained intuitively anyway). And it's a fascinating read. I highly recommend it!

https://nsl.cs.sfu.ca/papers/Jaco88.pdf


FPGAs are not a new thing. But FPGAs that are big enough and fast enough to outperform ASICs (in the networking context) are a new thing.

Re the title. It's perhaps a bit poetic, but it's not misleading. What else could the title possibly mean? It's either going to be about 3D printing or FPGAs. You can't seriously expect to actually download a physical piece of hardware.


A bit of ongoing work from me. Trying to make writing C programs fast and painless.

https://github.com/mgrosvenor/libchaste

