Rust's integer intrinsics are impressive

mrkgnao · on June 10, 2017

> conspiracy theories around popcount and the NSA

Whoa. Some links (with sometimes not-very-well-thought-out allegations):

https://groups.google.com/forum/#!topic/comp.arch/UXEi7G6WHu...

https://www.schneier.com/blog/archives/2014/05/the_nsa_is_no...

morecoffee · on June 10, 2017

I like that it compiles to a single instruction, but lots of languages already have this; it's pretty common.

Even boring old Java has had Integer.bitCount for many years.

joewalker · on June 10, 2017

I don't know if http://grepcode.com/file/repository.grepcode.com/java/root/j... is representative of all the implementations of bitCount in Java, but it's probably obvious that it's not going to use the popcnt opcode instruction.

I think the point is that Rust makes it much easier to use that opcode instruction. It's possible but hard with GCC using __builtin_popcount(), but, I'd guess totally impossible in Java due to lack of a JVM instruction for the same.

the8472 · on June 10, 2017

That's just the placeholder implementation that will work on any platform and even with the interpreter.

If you look at the openjdk9 sources you will notice that it is annotated as intrinsic candidate[0]. But earlier versions also have intrinsics for that[1], it's just not annotated as such.

[0] http://hg.openjdk.java.net/jdk9/jdk9/jdk/file/23721aa1d87f/s... [1] https://gist.github.com/apangin/7a9b7062a4bd0cd41fcc#file-ho...

chrisseaton · on June 10, 2017

> obvious that it's not going to use the popcnt opcode instruction ... I'd guess totally impossible in Java

I don't understand why you would guess at what Java can do when you can find out for certain?

    public class Test {
      
      private static int bitCount(int a) {
        return Integer.bitCount(a);
      }
      
      public static void main(String[] args) {
        while (true) {
          bitCount(14);
        }
      }
      
    }

Compiles bitCount to

    0x000000011e637980: sub    rsp,0x18
    0x000000011e637987: mov    QWORD PTR [rsp+0x10],rbp  ;*synchronization entry
                                                  ; - Test::bitCount@-1 (line 4)

    0x000000011e63798c: popcnt eax,esi            ;*invokestatic bitCount
                                                  ; - Test::bitCount@1 (line 4)

    0x000000011e637990: add    rsp,0x10
    0x000000011e637994: pop    rbp
    0x000000011e637995: test   DWORD PTR [rip+0xfffffffff10ed665],eax        # 0x000000010f725000
                                                  ;   {poll_return}
    0x000000011e63799b: ret

There's your popcnt instruction. No need to guess.

jmaloney10 · on June 10, 2017

There are methods in Java which have special vm implementations to take advantage of instructions like this. They methods have an ordinary Java code version for the interpreter and on jit compilation are replaced with the other version. For a complete list of these types of functions see http://hg.openjdk.java.net/jdk8/jdk8/hotspot/file/87ee5ee275...

vbezhenar · on June 10, 2017

JIT-compiler might recognize this call and replace it with popcnt instruction. I'm not sure if it does that, though.

throwaway91111 · on June 10, 2017

Yea, this is really nothing special—I'd imagine anything that compiles to native code would need all the

bobbyi_settv · on June 10, 2017

I did some experimentation with popcnt in C++ six years ago and was impressed to find that the intrinsic in gcc was faster than the best inline assembly I could come up with using the popcnt instruction:

https://github.com/bobbyi/Fast-Bit-Counting

Animats · on June 10, 2017

Now that most CPUs have population count, it may as well be exposed at the language level.

Doing it by table lookup results in questions such as "am I wasting too much cache space on this?" and "is a 64K table causing cache misses".

tjalfi · on June 10, 2017

Fortran 2008 added a number of bit twiddling intrinsics[0].

Here are a few of the scalar intrinsics for bit counting.

  popcnt() - population count
  leadz() - leading zero
  trailz() - trailing zero
  poppar() - parity

[0] ftp://ftp.nag.co.uk/sc22wg5/n1701-n1750/n1729.pdf

Edited to fix formatting.

mpweiher · on June 10, 2017

> be exposed at the library level.

TFTFY :-)

Flow · on June 10, 2017

Exactly what purpose does the surrounding instructions serve in this and similar simple cases? Is it compiler dogma or a missed uncommon optimization?

        push    rbp
        mov     rbp, rsp
        popcnt  eax, edi
        pop     rbp
        ret

pgrad3 · on June 10, 2017

It's a function invocation following x86 C calling convention in assembly. push ebp pushes the base stack pointer of the caller of count() onto the stack, and mov ebp, esp sets the base stack pointer for the callee count(). pop ebp pops the base stack pointer and ret pops the return address into the program counter. count_ones() is replaced with its assembly counterpart, which stores the result in return register eax. This is as optimized as count() as a user-created function can be in assembly given the calling convention. https://en.m.wikipedia.org/wiki/X86_calling_conventions

pgrad3 · on June 10, 2017

Correction: rsp and rbp are 64-bit registers storing 64-bit stack memory addresses, while popcount eax, edi is taking the 32-bit parameter as input and returning the 32-bit return value. The calling convention is x64 C.

thegeomaster · on June 10, 2017

GCC, for example, allows a "-fomit-frame-pointer" [1] optimization option which would get rid of this. I'm not sure why this isn't done by default in optimized builds. Maybe it has something to do with stack unwinding: if the functions panics for some reason, or triggers a CPU fault, there's no way to get the correct backtrace if you don't have the frame pointers on the stack.

[1]: https://gcc.gnu.org/onlinedocs/gcc-3.4.4/gcc/Optimize-Option...

jannic · on June 10, 2017

rustc will omit the frame pointer if you add "-C debuginfo=0".

  example::count:
        popcnt  eax, edi
        ret

c3833174 · on June 10, 2017

IIRC it's needed on x86 but not on x86_64

elgenie · on June 10, 2017

A function is being called, so it gets a stack frame at the start of the call and that stack frame goes away when it returns.

rbp points to the base of the current stack frame (register base pointer) and rsp points to the top of the current stack frame (register stack pointer).

What the function actually does is just the single popcnt instruction.

Flow · on June 10, 2017

Yes, I get the reserve space for local variables and so on.

But this is a leaf-function and one that has no register trashing or temporary variables. Could this be inlined later on via link-time optimizations? Will the linker then remove that function prologue and epilogue?

elgenie · on June 10, 2017

Yeah, it can certainly be inlined or optimized away with various compiler options. However, as you might surmise, debugging without stack frames becomes rather difficult as functions corresponding to source code disappear into a goo of assembly.

The example with slightly different options (as mentioned in another comment) gives what you're looking for: https://godbolt.org/g/GlkEQK

pja · on June 10, 2017

gcc has the -fomit-leaf-frame-pointer option at least.

FreeFull · on June 10, 2017

Those are the function prologue and function epilogue. https://en.wikipedia.org/wiki/Function_prologue

dnautics · on June 10, 2017

It's exposed as an intrinsic in llvm. C(99? I think) surfaces these in math.h, although the standard is silent on some edge cases (all zeros), and numbers are autopromoted to 32 bit integer. Julia surfaces these as well.

In case you're wondering what this could be useful for besides super secret NSA stuff, and Bitcoin mining, here are a few suggestions:

1) hyperloglog. (Similar to Bitcoin). Keep an estimated count of items streaming by by hashing them and store the highest lzcount of the hashes for each category you're tracking. This will be ~ log2(category count)

2) converting from fixed point to floating point. The number of zeros in front of your value represents the exponent of your value (or ones in the case of a 2's complement negative fixed point), which is critical to deducting the float representation.

Along those lines, one of the things I've done is implemented floating point-like datatypes, which extensively uses lzcount and locount for tracking values and also will use tzcount to measure if the values are exact or not.

https://github.com/Etaphase/FastSigmoids.jl/blob/master/READ...

pklausler · on June 10, 2017

I've been using bit population count to traverse packed sparse arrays since the CDC 6600, so it's handy for more things than Hamming distance. Always nice to have in hardware, but pretty cheap to synthesize when it isn't.

Someone · on June 10, 2017

Reading http://0x80.pl/articles/sse-popcount.html, I would think the hardware instruction is slower than the best software implementation. Or has that changed on newer hardware?

jblow · on June 10, 2017

This would not apply to the Rust method shown in the OP, since that one is operating on only 32 bits at a time.

It's unclear how well the benchmarks in this linked article generalize to other applications. If you are just popcounting in a tight loop, probably pretty well, but who does that? In reality you have other things going on, so if this method is occupying too many execution units or polluting your cache, you would see the effect of that on the rest of the program. But it's program-dependent, thus unclear.

the8472 · on June 10, 2017

> since that one is operating on only 32 bits at a time.

https://doc.rust-lang.org/std/primitive.u64.html#method.coun...

https://doc.rust-lang.org/std/primitive.u128.html#method.cou...

Of course to benefit from the SSE optimizations you would still have to call it in a loop and the optimizer would have to recognize that and replace it with a vectorized approach.

jackmott · on June 10, 2017

you can do the simd intrinsics explicitly as well

dang · on June 11, 2017

This submission originally linked to a blog post whose author (not the HN submitter) asked us to delete it. We don't want to kill the thread, so we removed the URL above.

__s · on June 10, 2017

> Whereas GCC’s __builtin_popcount() only works on unsigned ints Rust’s count_ones() works on signed, too!

C's weak typing should handle signed

jblow · on June 10, 2017

I was once asked to come up with a table lookup method for popcount on the spot and could not come up with a solution.

Oh, Hacker News.

If someone can't solve a problem like this off the top of their head, does it not act as a strong signal that they are a beginner and you should probably look elsewhere for quality information?

aisofteng · on June 10, 2017

Generally, yes. In this case, though, the post is about a feature of a tool that exists regardless of who notices that it does.

on June 10, 2017

[dead]

searealist · on June 10, 2017

Being asked to solve it using lookup tables is not a tricky problem. It's a straight-forward application of bitwise operations common in the field of software engineering.

jblow · on June 11, 2017

Indeed, it's a trivial lookup table problem, it's hard to come up with an easier one, so I would guess the confusion comes from lack of familiarity with bitwise operations, in which case why is the author presuming to write an article judging a certain implementation of bitwise operations impressive?

The intent of my comment is not to be mean, it's just ... there is a lot of noise on places like Hacker News and this article is part of that noise because, look, just about all compilers have had intrinsics for decades now at least, popcount is a very common one, so it's not surprising to see it turn up. It's not impressive as the title suggests, it's extremely common. And it's nothing specific about Rust because most production-quality languages do it. So both major elements of the article title are pretty much incorrect.

And it's fine not to know that when you're a beginner, I am not knocking that at all. But there's something about writing articles that then get broadcast, that give the wrong impression to other new people who are trying to learn. It's useful information that there is a popcount intrinsic in the Rust compiler, but this would be much more educational coming from someone who understands the context of all this stuff and can explain the real situation. Which may be the author of this article someday, maybe even someday very soon -- I don't wish to be inappropriately negative -- but it's not today.

I never liked going to school, and I think higher education is going to go through an existential crisis pretty soon, if it's not happening already. But one good thing about the old system is that at least there was this idea that you should work hard, and really learn the material, before you go presuming to teach people. And I think that's a very good idea. If you're inexperienced and there's a shortage of teachers and teaching needs to happen, then go for it -- but otherwise I think it is very important to keep in mind what one does and does not understand, and who understands it better, and to not presume to teach until one is in a good position to do so.

I know this goes a little bit against the current philosophy of "programming is great! Anyone can do it! Rah rah," but actually I think on closer inspection it doesn't. There's nothing wrong with participation, and community, and everyone contributing, etc. But it's important to keep an understanding of the difference between beginner contributions and advanced contributions, otherwise it seems possible to suffer a severe degradation of skill in the field over time, because how do people know what to shoot for if people of all expertise levels are teaching them and they can't tell the difference because they themselves are beginners?

searealist · on June 12, 2017

FWIW, I think the story hit the front page simply because it was a positive story about Rust and a lot of people in this community reflexively upvote that sort of thing. I took your comment as well deserved criticism against the community.