Faster Integer Parsing (kholdstare.github.io)
168 points by fanf2 on June 25, 2020 | 42 comments


> It’s 16 characters long, and this could also apply to credit card numbers.

Credit card numbers can be up to 19 characters and as short as 12 characters. (I think they can theoretically be shorter but I'm not aware of any issuer that actually does that).

Also, they are strings over the alphabet of decimal digits. They are not intended to be interpreted as representing integers, and doing so is a bad idea. Leading 0s matter in credit card numbers. The first few digits identify the issuer. You could in theory have two card numbers where one is N digits without any leading zeros and the other is N+k digits with k leading zeros and the remaining N digits the same as the first card's. These would be different cards from different issuers, but if you just parsed them into numbers they would come out the same.


Credit card numbers are a good example of the test 'do you need to do math or compare them?' If not, keep the numbers as strings.


Are the strings checksummed, though? In this case, you’d need to do math, but still keep them as strings...


The checksum calculation does not involve interpreting the whole multidigit number as an integer, so you don't need to parse the number; you're doing math on specific characters/digits only.


Wouldn't it in fact be a lot slower using an integer? Lots of divisions by 10, instead of just adding every byte.


https://en.wikipedia.org/wiki/Luhn_algorithm

There's exactly one division required.

(Implementing it using vector instructions is left as an exercise for the reader. A very elementary exercise, so I encourage you to try it if you're curious about SSE and such.)
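
For reference, a minimal scalar sketch of that check (assuming the card number is already a plain ASCII digit string; the only division is the final modulo):

    #include <cstddef>
    #include <string_view>

    // Scalar Luhn check over an ASCII digit string: double every second
    // digit from the right, sum everything, and the only division-like
    // operation is the final modulo.
    bool luhn_valid(std::string_view digits) {
        int sum = 0;
        bool dbl = false;
        for (std::size_t i = digits.size(); i-- > 0; ) {
            int d = digits[i] - '0';
            if (dbl) {
                d *= 2;
                if (d > 9) d -= 9;   // same as summing the two digits of 2*d
            }
            sum += d;
            dbl = !dbl;
        }
        return sum % 10 == 0;
    }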


Luhn's algorithm works on a digit string, not a single integer.


Checking divisibility by a known constant can be done in a better way than doing a division.

https://lemire.me/blog/2019/02/08/faster-remainders-when-the...
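
For the curious, the test from that post boils down to a single multiply and compare; a sketch for divisor 10, following the blog's is_divisible idea (32-bit input assumed):

    #include <cstdint>

    // Precompute c = ceil(2^64 / 10). n is divisible by 10 exactly when the
    // low 64 bits of n * c are at most c - 1 (wrap-around 64-bit multiply).
    constexpr uint64_t c10 = UINT64_C(0xFFFFFFFFFFFFFFFF) / 10 + 1;

    bool divisible_by_10(uint32_t n) {
        return n * c10 <= c10 - 1;
    }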


Certainly a nice read.

On the "culture of 'optimization is the root of all evil'" remark in the conclusions: I find this to be a nice example for the full Knuth quote.

If you face an arbitrary task that includes parsing 64-bit integers, starting by developing/using the technique from the article (as a _premature_ optimization) is probably a bad idea, since it costs time to implement (and even more time to debug and to understand the code a few months later), while in most cases it is probably not what dominates the running time of your code. If, however, you have built a solution that does the job but is just not fast enough, and profiling shows that you spend considerable time parsing integers, this kind of optimization is the way to go.


Or even better: if you can change the problem so that you don't need to parse at all, then do that instead.

Recently I made a comment on the overhead of textual formats from the validation perspective: https://news.ycombinator.com/item?id=23582056

I think that a textual format is overall a horrible idea unless the use case is presenting the vast majority of the content to humans.

In an ideal world, communication between machines would be all in binary protocols, and developers would know how to read/write them with tools like hex editors as naturally as a second language.

Instead, countless amounts of time and space are wasted by machines converting their native integers into strings, wrapping it in JSON, base64'ing that, wrapping it into XML, then sending it over the network (whose lower layers are thankfully binary) to another machine where the reverse process happens, but with additional checks during parsing. (I am not exaggerating. I have seen systems like this.) 99.99999...% of this data will never be seen by a human. What a disgusting waste of computing power.

"The fastest way to do something is to not do it at all."


Author of the article here - I completely agree. This text parsing problem came up because I was lamenting how many cycles were wasted in text processing. At my current job a text-based protocol from a 3rd party means around 80% of CPU time for the entire application is spent parsing and rendering text just for the protocol. JSON and HTTP come to mind too.

The human readability argument doesn't really hold any water, because if you have a structured description of a protocol (e.g. a C struct), you can always write simple tools to inspect the protocol and make it just as human-readable as JSON is. This is even easier if a language has reflection to generate all this code.
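
As a toy illustration (the TradeMsg record below is made up for this comment, not from any real protocol), a fixed binary layout plus a few lines of dump code gives you the same on-demand readability:

    #include <cstdint>
    #include <cstdio>

    #pragma pack(push, 1)
    struct TradeMsg {            // hypothetical wire record
        uint64_t timestamp_ns;
        uint32_t instrument_id;
        int64_t  price_ticks;
        uint32_t quantity;
    };
    #pragma pack(pop)

    // Renders the binary record for humans only when someone actually looks.
    void dump(const TradeMsg& m) {
        std::printf("ts=%llu id=%u px=%lld qty=%u\n",
                    (unsigned long long)m.timestamp_ns, m.instrument_id,
                    (long long)m.price_ticks, m.quantity);
    }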


What prevents you from using CBOR? I do understand that a fixed schema is more of a shackle and can't be used in every situation but CBOR doesn't require an external schema.


This is a nicely written article about getting C++ to parse a 64 bit integer represented as an ASCII string as fast as possible.

It starts with the library version, moves onto a custom version, then unrolls, then does clever tricks then finally onto SIMD.

Worth a read if you enjoy optimization stories!


But it's an apples-to-oranges comparison, between functions that skip whitespace, validate the input, and report error conditions, and a function that assumes that it has exactly N digits and does not need to check. Code that doesn't have to solve the most expensive part of the real problem can run much faster.

It would have been more interesting to try to optimize something closer to the problem the library routines are solving: skip leading whitespace, error if the first character isn't a digit, accumulate digits until a non-digit is reached. Can the article writer beat the standard functions? By how much?


Optimization is often about specializing, and therefore avoiding work you don't happen to need in your particular case. The article acknowledges that it's giving up some flexibility. It could maybe have been louder about it.

This would be different if we were benchmarking to compare libraries or hardware or languages or whatnot, or even calling the new implementation "better" without qualifiers.


Does all that stuff really wind up costing meaningful performance?

If you say:

   for (size_t i = 0; i < len; i++)
     if ( notWhitespace(chars[i]) && isDigit(chars[i]) )
        doWork(chars[i]);
speculative execution will give you near full performance when the input is clean, as long as you didn't write it in a way that invites mispredicts. You'll only suffer the cost of checking when your input is not clean.


That’s more elaborate than BM_Naive from the article, which takes over 10 times the time of their last version.

Reason? The next-to-last version of the code grabs 8 characters into an unsigned long, parses them, grabs 8 more characters, parses them, multiplies the first number by 100,000,000, and adds it to the second.

So, it does 2 iterations per 16-digit number, not 16.

The last result further improves on that by grabbing all 16 digits in one go.
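
For reference, that 8-digits-at-a-time step is the usual SWAR multiply-and-shift trick; a sketch of the idea (assumes little-endian byte order and exactly 16 ASCII digits; not copied verbatim from the article):

    #include <cstdint>
    #include <cstring>

    // Combine pairs of digits, then pairs of pairs, with masked multiplies
    // instead of per-digit work. Assumes little-endian byte order.
    uint32_t parse_eight_digits(const char* chars) {
        uint64_t val;
        std::memcpy(&val, chars, sizeof(val));
        val = (val & 0x0F0F0F0F0F0F0F0F) * 2561 >> 8;             // 1 + 10*2^8
        val = (val & 0x00FF00FF00FF00FF) * 6553601 >> 16;         // 1 + 100*2^16
        val = (val & 0x0000FFFF0000FFFF) * 42949672960001 >> 32;  // 1 + 10000*2^32
        return (uint32_t)val;
    }

    uint64_t parse_sixteen_digits(const char* chars) {
        return parse_eight_digits(chars) * 100000000ULL
             + parse_eight_digits(chars + 8);
    }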


Absolutely. Even with no other considerations, you're spending more ALU operations per character in that validation than the naive solution does for parsing. That probably won't drastically affect latency for a single parse if you don't have branch mispredictions and whatnot, but I'd be shocked if it didn't reduce throughput by 2x even for validation as simple as that.


Yes, so an effort to write an optimized version and benchmark it should include such checks (plus, of course, checking for reaching the end of the input).


Yeah. If you have that much control over your input and parsing is your bottleneck, you might as well use a binary format, mmap the file, cast the returned pointer to your data structure, and boom: zero time spent on parsing.
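
A minimal POSIX sketch of that approach (the Record layout is hypothetical; real code has to pin down endianness and padding):

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>
    #include <cstddef>
    #include <cstdint>

    struct Record {              // hypothetical fixed on-disk layout
        uint64_t timestamp;
        uint64_t value;
    };

    // Map the file and view it as an array of records: no parsing step at all.
    const Record* map_records(const char* path, std::size_t& count) {
        int fd = open(path, O_RDONLY);
        if (fd < 0) return nullptr;
        struct stat st;
        if (fstat(fd, &st) != 0) { close(fd); return nullptr; }
        void* p = mmap(nullptr, (std::size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        close(fd);               // the mapping stays valid after close
        if (p == MAP_FAILED) return nullptr;
        count = (std::size_t)st.st_size / sizeof(Record);
        return static_cast<const Record*>(p);
    }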

Still, very cool work.


That depends on what else you might want to do with the format.


I find it interesting how slow some standard library functions can be if your code is really pushing them in untypical ways. It's not actually that surprising in most cases once you remember that the library functions often have to handle expensive edge cases and invalid input.

I did encounter a similar issue a while ago in Javascript, parsing files that could be hundreds of megabytes with mostly floating point numbers as text in them. parseFloat turned out to be the bottleneck in this case after optimizing everything else, and I used a very naive (and entirely wrong) function to parse the floats that worked much like the earlier optimization here. Simply reading each digit, multiplying it by the corresponding power of 10, and adding it all up. Of course this creates the wrong result because of rounding, but it was close enough for my use case. And it was actually something like 2-3 times faster for the overall runtime (not just microbenchmarking the float parsing). What helped here was that the number of digits was generally low, so you typically had ~3 digits after the dot, not the maximum floats could represent.

I'm pretty sure someone that knows what they're doing could have written an even faster and more accurate version there given the same constraints, but I was surprised just by how much I could beat the library function for my specific, restricted use case with a very naive implementation.


What a lovely article, and some of the comments show that its context isn’t understood (despite the author’s efforts to make it clear!)

I’m going to start by ignoring this in my code. Just make it work correctly, right? But as we get closer to deployment, performance probably matters (it does to us). Then we start to notice lots of important coupling. For example, input validation typically* isn’t required inside the system: data structures are generated and used by our own code. Validation applies at the “surface”, the external inputs of the system. Log records generated by one part of the system may need to be dredged through rapidly; forcing the representation to be fixed means the code that parses it can make assumptions the library code cannot. Metering will tell you where the optimization opportunities lie.

* super long-lived applications sometimes need to do consistency checks even on internal data to avoid subtle corruption issues that could have hardware or software origin and could be very expensive to encounter. Things like phone switches and spacecraft that need to go years or even decades without a reboot.


The article also doesn't explore another optimization: if you are parsing a log file with a bunch of times in the same vicinity, maybe you can leverage the last time stamp you saw? Only do math on the unique part of the new stamp, then add that to the old stamp to get a new time.
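
A toy sketch of that idea, assuming fixed-width 16-digit stamps (the scalar suffix loop is only for illustration; in practice you'd combine it with the chunked parsing discussed above):

    #include <cstddef>
    #include <cstdint>
    #include <cstring>

    // Keep the previous stamp and re-parse only the digits that differ
    // from it. Assumes stamps of exactly 16 ASCII digits.
    struct StampCache {
        char     last_text[16] = {};
        uint64_t last_value    = 0;

        uint64_t parse(const char* text) {   // text: exactly 16 digits
            std::size_t common = 0;
            while (common < 16 && text[common] == last_text[common]) ++common;

            uint64_t pow = 1, suffix = 0;
            for (std::size_t i = common; i < 16; ++i) {
                pow    *= 10;
                suffix  = suffix * 10 + uint64_t(text[i] - '0');
            }
            last_value = (last_value / pow) * pow + suffix;  // keep shared prefix
            std::memcpy(last_text, text, 16);
            return last_value;
        }
    };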


Bit twiddling is always fun, especially where ASCII is concerned. This branchless hex parser[1] of mine was a fun half day to figure out, but was only possible because ASCII has some really useful properties.

[1]: https://gist.github.com/jcdickinson/9a4205287ae107e9f4e5f676...
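
One classic example of those properties (not necessarily what the gist does; assumes the input byte is already a valid hex digit):

    #include <cstdint>

    // For a valid ASCII hex digit c: bit 6 is set only for letters, and the
    // low nibble of 'a'..'f' / 'A'..'F' is 1..6, so adding 9 for letters
    // maps them to 10..15 while digits pass through unchanged.
    inline uint8_t hex_digit_value(uint8_t c) {
        return (c & 0x0F) + 9 * (c >> 6);
    }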


Json.org has a very nice diagram on how JSON numbers are interpreted. It's fairly comprehensive (accepts scientific notation), but...

That diagram treats all numbers as decimals (doubles), since Javascript treats all numbers that way. If you need to parse integers only, or even only positive integers, bit shifting is the only way to fly (bonus points for SIMD).

One detail the author didn't go into was cache latency and branching. Getting sub-nanosecond results is impressive. I can't find the exact numbers right now, but off the top of my head: L1 is ~1ns, L2 ~4ns, L3 ~8ns (the L3 cache on the Ryzen 3900X & 3950X is 64MB! in 16MB clusters shared by 3-4 cores), and main memory is around 80ns (let's not get into memory ranking). So to get down to those speeds you cannot afford ANY branching, and have to do operations in parallel.

And I didn't mention localisation with some Europeans using commas instead of periods and periods instead of commas - madness!


What's missing is a `thread_local static std::istream` inside the function so that the expensive stream object isn't recreated every time the function is called.


Not discussed here - there are literally BCD instructions on x86 that take an ASCII string and treat it as a number. I don't know about their performance, though, since these instructions are definitely archaic.


The x86 BCD instructions aren't supported in long mode.


The 8087 floating point coprocessor has the FBLD instruction to load an 18-digit BCD number and FIST to store an integer. So once you convert from ASCII to BCD, the remaining conversion is two instructions. This is probably the most efficient way to do the parsing, if you're in 1980.


BCD is Binary Coded Decimal. Not ASCII. 0x12345678 BCD is 12345678 in decimal.

https://en.wikipedia.org/wiki/Binary-coded_decimal


On Intel unpacked BCD was specifically for working with numbers written in ASCII. https://en.wikipedia.org/wiki/Intel_BCD_opcode


Damn, and I thought I knew all about x86 assembler... never noticed unpacked BCD before.


Always great finding out about new SIMD instructions I've never heard of before. Hope I get the chance to use them in the future and make stuff insanely fast :p


Are SSSE3 instructions faster than their AVX counterparts? It is my impression that AVX should be faster?


There are no AVX counterparts. AVX just adds wider registers and instructions (256-bit, and 512-bit for AVX-512).

Since the problem in question is concerned with 16-digit integers (i.e. 128 bits of ASCII input), SSSE3 is a perfect match.

AVX instructions could be used to convert two numbers at a time, though.


Avoiding integer timestamp parsing overhead (or worse, strptime-based parsing) is in the rationale for djb’s tai64: https://cr.yp.to/libtai/tai64.html


VERY cool work and results :)

I'm always curious about these... can microbenchmarks skew things because of CODE cache locality? E.g. startup time, ability to inline, etc. Any thoughts on how to do a more "real world" test?


The benchmark graphs didn't show up for me on Firefox mobile. Using the desktop version fixed it.


I suspect he could make it faster using the newer neural network instructions like https://iq.opengenus.org/avx512-vnni/ which pack four multiply-and-add ops in parallel instead of the two that he used.


The problem with that instruction is that to do 4 at a time you would have to multiply by 1, 10, 100 and 1000 respectively. The last multiplier does not fit in a byte.


Use the expand operation to put the operands in larger bitfields - you have 512 bits to play with, the input took 128, so you can expand each byte to a 32 bit value and use 32 bit multipliers.

Then you can use larger than byte operations and larger than byte multipliers if you need them.

Alternatively,

In the first pass you do all the 1, 10, 100 cases at once, using multipliers 1, 10, 100, 0 (throwing out the last byte).

Then mask (or use the compression instructions as needed), switch to the wider versions, and do all the 1000-multiplier cases in a second pass.

Then you merge them all in one pass.

It merges 4 at a time for the large pass and 3 at a time for the small pass instead of 2 at a time.

I'll write it up later today if I get time.



