On bananas and string matching algorithms (wabbo.org)
260 points by cjbprime on Aug 23, 2014 | 62 comments


Really interesting post and good debugging work. A couple of take-aways:

1. This is one reason it's a good idea to use signed ints for lengths even though they can never be negative. Signed 64-bit ints have plenty of range for any array you're actually going to encounter. It may also be evidence that it's a good idea for mixed signed/unsigned arithmetic to produce signed results rather than unsigned: signed tends to be value-correct for "small" values (less than 2^63), including negative results; unsigned sacrifices value-correctness on all negative values to be correct for very large values, which is a less common case – here it will never happen since there just aren't strings that large. (There's a small sketch after point 2 that makes this concrete.)

2. If you're going to use a fancy algorithm like two-way search, you really ought to have a lot of test cases, especially ones that exercise corner cases of the algorithm. 100% coverage of all non-error code paths would be ideal.
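
To make the arithmetic in point 1 concrete, here's a minimal C++ sketch (the code under discussion is Rust, but the arithmetic is the same; the lengths are made up for illustration):

    #include <cstddef>
    #include <cstdint>
    #include <cstdio>

    int main() {
        std::size_t haystack_len = 10, needle_len = 30;   // made-up lengths

        // Unsigned subtraction wraps: 10 - 30 becomes an enormous positive value.
        std::size_t diff_unsigned = haystack_len - needle_len;

        // Doing the arithmetic in signed 64-bit gives the value-correct answer, -20.
        std::int64_t diff_signed = static_cast<std::int64_t>(haystack_len)
                                 - static_cast<std::int64_t>(needle_len);

        std::printf("unsigned: %zu  signed: %lld\n",
                    diff_unsigned, static_cast<long long>(diff_signed));
        return 0;
    }

With the unsigned subtraction the "difference" comes out as an enormous positive number instead of -20, which is the same class of wraparound discussed throughout this thread.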


I'm not sure I agree with #1. While it's true that using signed integers for lengths would have solved this problem, a negative length doesn't generally make sense or have any defined meaning. So if you have any logic that is doing case analysis on lengths, you basically have no reasonable action in the case of a negative length: all you can do is complain that the length got corrupted at some point.

For example, imagine that instead of testing whether the strings differed by more than 20, you were just testing whether the needle was >= 20 in length:

    impl Searcher {
        fn new(haystack: &[u8], needle: &[u8]) -> Searcher {
            match needle.len() {
                0..19 => Naive(NaiveSearcher::new()),
                20..uint::MAX => TwoWay(TwoWaySearcher::new(needle)),
                _ => fail!("Negative length, shouldn't happen!")
            }
        }
    }
Part of the feature of an unsigned type IMO is that it cannot have an out-of-domain value by definition.


The problem is that while unsigned types cannot have an out-of-domain value, they wrap.

Also, negative lengths often make sense as meaning "backwards". In C++ I often hit this in algorithm code that looks like some variant of:

    a.size()-1 >= capacity
But when a.size() == 0, a.size() - 1 is MAXINT. Of course I could re-write this as a.size() >= capacity + 1, but that requires remembering / noticing.
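
A minimal self-contained illustration of that trap (the names are hypothetical):

    #include <cstdio>
    #include <vector>

    int main() {
        std::vector<int> a;          // empty, so a.size() == 0
        std::size_t capacity = 4;

        // a.size() - 1 wraps around to SIZE_MAX, so this test is true for an
        // empty vector, which is almost certainly not what was meant.
        if (a.size() - 1 >= capacity)
            std::printf("wrapped: the branch is taken for an empty vector\n");

        // The rewritten comparison avoids the subtraction and behaves as
        // intended (it is false here), but you have to remember to write it this way.
        if (a.size() >= capacity + 1)
            std::printf("not reached for an empty vector\n");

        return 0;
    }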

Personally I find this so annoying that when I recently had to write a set of my own C++ containers (as I needed 1-indexed rather than 0-indexed containers), I decided to make the return value of size() signed.


> But when a.size() == 0, a.size() - 1 is MAXINT.

This is exactly the same class of bug as the one mentioned in the original article.

But my argument is that while signed lengths fix some problems, they cause others.

Maybe I'd put it this way: I dislike using signed integers for lengths, because it creates an entirely new class of undefined (or arbitrarily-defined) behavior.

Imagine that memcpy() took a signed length instead of unsigned. What should it do if it receives a negative length? There is no great answer to that question.

It's true that if you have a bug, you might accidentally pass a very large unsigned integer to memcpy(). But at least you can predict what memcpy() will do now: it will attempt to copy a very large number of bytes and probably crash. Whereas if you pass a negative integer, it might reasonably do any of these things:

   - nothing
   - crash
   - attempt to copy bytes *before* the supplied ptr
   - assert()-fail and/or call abort()
What's worse, it creates a sense of anxiety for anyone implementing a function like memcpy(), because there is no right answer for what they should do with a negative length. Callers of memcpy() might reasonably want any of the above. But then you get your callers angry with you because "obviously" a negative length means you should do nothing, or "obviously" a negative length means you should assert()-fail (a programmer's opinion about what is the "obvious" right behavior is often unduly influenced by their specific circumstances).


There's only one reasonable behavior for a negative length to memcpy: raise an error. That would catch exactly this kind of error. I feel like it proves my point rather than contradicting it – if the argument to memcpy is unsigned, you have no way of detecting this programming error; your only option is to take that very large value literally – which leads to a different and far less helpful error (in the best case).
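
As a sketch of what "raise an error" could look like (a hypothetical wrapper, not the real memcpy interface):

    #include <cassert>
    #include <cstddef>
    #include <cstring>

    // Hypothetical checked copy taking a *signed* length: a negative count is
    // treated as a programming error and rejected loudly, instead of being
    // reinterpreted as a huge unsigned byte count.
    void checked_copy(void* dst, const void* src, long long len) {
        assert(len >= 0 && "negative length passed to checked_copy");
        std::memcpy(dst, src, static_cast<std::size_t>(len));
    }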


It would be nice to have C/C++ operators/types that failed on overflow/underflow. The sad part is that the capability is available on all processors via the status register; it's just that no one has brought it up to the higher-level-language level.

If you could declare

  hard unsigned int x=0;
then things like

  a.size() - 1
could abort() loudly instead of silently doing something that's probably wrong and masking errors later on.
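
There's no such qualifier in C or C++, but a rough library-level sketch of the idea, using the GCC/Clang checked-arithmetic builtins (the type name here is made up):

    #include <cstdlib>

    // Rough sketch of a "hard" unsigned int: arithmetic that would wrap
    // aborts loudly instead of silently producing a huge value.
    // Relies on the GCC/Clang checked-arithmetic builtins.
    struct HardUint {
        unsigned int v;

        HardUint operator+(HardUint o) const {
            unsigned int r;
            if (__builtin_add_overflow(v, o.v, &r)) std::abort();
            return HardUint{r};
        }
        HardUint operator-(HardUint o) const {
            unsigned int r;
            if (__builtin_sub_overflow(v, o.v, &r)) std::abort();
            return HardUint{r};
        }
    };

    // HardUint{0} - HardUint{1} aborts instead of yielding UINT_MAX.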


Rust does have those types, but they aren't free: overflow checking has a significant performance cost even when fully optimized. Because of the performance cost, a lot of code doesn't use the overflow-checking types.

I think what we really need is for the hardware manufacturers to help us out here.


Agreed – hardware assistance would be a huge boon.


The BOUND instruction has existed on the x86 architecture since time immemorial, but it's not widely used and may be slow in modern microcode (like AAM/AAD and other relics of the handwritten-assembly era that compilers never emit).


> …overflow checking has a significant performance cost even when fully optimized.

That surprises me—I would have thought a quick bit-test on the status register and conditional jump wouldn't be all that bad.

> I think what we really need is for the hardware manufacturers to help us out here.

Are you thinking something like a software interrupt on integer overflow?


There's also the secondary effect of missed optimisations: if every operation is checked for overflow, there's much less scope for reordering/collapsing/vectorising operations (this is especially important in tight loops, where the core loop can easily be fewer than 20 instructions, so 2 extra is significant).


> if every operation is checked for overflow, there's much less scope for reordering/collapsing/vectorising operations

Let's pretend that instead of "put a conditional jump after every operation", we try to model things as "pretend all integers are infinite precision, blow up if overflow happens at some point". So, for example, we would be OK with rewriting

    int x = INT_MAX;
    int y = (x + 1) - 1; // overflow - BOOM!
with

    int x = INT_MAX;
    int y = x; // no overflow
Yes, this means that the program does different things depending on the level of optimization, but are there any reordering/vectorizing possibilities that get blocked now?


> That surprises me—I would have thought a quick bit-test on the status register and conditional jump wouldn't be all that bad.

20%-30% performance loss at least, IIRC.

> Are you thinking something like a software interrupt on integer overflow?

Right.


> This is one reason it's a good idea to use signed ints for lengths

I disagree. Types are a form of documentation, and a constraint on proper program behavior. Length types can't be negative; that doesn't make any sense. If you end up with a negative length, you are doing something wrong. One should just write correct code and use the proper length type. In C that's `size_t`, and in Rust it's `uint`.


The problem is that C (and many other languages) only have the notion of "signed" vs. "unsigned" values, and nothing else. For example, if 'len' is a variable that holds the length of something, then its range is defined by (len >= 0). So far so good. But what is the range of the expression (len - 1)?

In a perfect world, its range is {-1, 0, 1, ...} (up to some maximum value), but C(++) has no such notion. If we used uint, -1 would silently change to MAX_UINT. Oops.


>But what is the range of the expression (len - 1)?

Ill-defined, which is as it should be. Naturals are not closed under subtraction or negation, and therefore you should never negate or subtract a length.


You subtract lengths from other lengths all the time. How far have you gone recently? How much taller are you than your wife? Subtracting a length from another length isn't a problem at all; a problem would be if you wanted to subtract a length from a weight.

This

> Naturals are not closed under subtraction or negation, and therefore you should never negate or subtract a length.

assumes that a paramount goal in having types is to make function signatures like the following impossible:

    public static String format(String format, Object... args);
Infinite-dimensional vectors (in which the first element is a String) sure aren't closed under format -- it maps them to scalars! But somehow string formatting is wildly popular. Or think of

    public Color(int r, int g, int b);
    public Color(int rgb);
Are you comfortable having a mapping from ℤ to Color, and from ℤ^3 to Color? If so, why does it bother you less than a mapping from ℤ ∩ [0, ∞) to ℤ ∩ [-1, ∞)?

Try a simple rewording of your argument: Hash tables aren't closed under access. Therefore, you should never retrieve values from a hash table.


>You subtract lengths from other lengths all the time.

Indeed. Notice that all your examples are physical lengths. This is of a different type than the naturals (digital lengths). We generally consider physical lengths to be in the domain of Reals, usually defined along a particular path or dimension. This is closed under subtraction, while naturals are not.

> a problem would be if you wanted to subtract a length from a weight.

That is also a problem, not the only problem.

>assumes that a paramount goal in having types is to make function signatures like the following impossible: [vomit-inducing variadic signature]

I would sure like it if function signatures like that were impossible :)

> Infinite-dimensional vectors (in which the first element is a String) sure aren't closed under format

It's very generous to call a set of arguments to a variadic function a "vector" at all, and they certainly aren't infinite-dimensional. They are practically bounded by machine memory.

>it maps them to scalars

How?

>But somehow string formatting is wildly popular.

Lots of mathematically ill-defined things are wildly popular.

>Hash tables aren't closed under access.

Not sure what you mean by this.


> We generally consider physical lengths to be in the domain of Reals, usually defined along a particular path or dimension. This is closed under subtraction

This is wrong. We consider physical lengths to be nonnegative real numbers, which are not closed under subtraction. Trying to use negative numbers for physical lengths will get you very funny looks.

> It's very generous to call a set of arguments to variadic function a "vector" at all

It's not an element of a mathematical vector space, but it's a vector in the common sense of an ordered collection of values. The other examples I gave you, like (int, int, int), are vectors even in the algebraic sense.

> I would sure like it if function signatures like that were impossible :)

That was the signature for Java's String.format, the equivalent of C's sprintf (which, like String.format, maps an infinite-dimensional vector to a single string). If you really didn't understand this, you can tell that the return type from String.format is a scalar by looking at the return type declaration, "String". To understand the precise mapping, I can only recommend you read the javadoc.

>> But somehow string formatting is wildly popular.

> Lots of mathematically ill-defined things are wildly popular.

I'm willing to grant this, but it's certainly not relevant to string formatting.

>> Hash tables aren't closed under access.

> Not sure what you mean by this.

Accessing a hash table isn't guaranteed to get you another hash table. Just like subtracting two positive numbers won't necessarily get you a positive number.


Subtracting two lengths doesn't give you a length, it gives you a difference of lengths, which can be any real number.

So by analogy subtracting two unsigned ints should give you a signed int by default (unless you specifically ask for the unsafe unsigned int, and then it's on you to make sure the preconditions are met).
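
A minimal sketch of that default, assuming lengths fit in 63 bits (the helper name is made up):

    #include <cstddef>
    #include <cstdint>

    // Sketch: subtracting two unsigned lengths yields a *signed* difference.
    // Any two lengths below 2^63 subtract correctly, including negative results.
    std::int64_t length_diff(std::size_t a, std::size_t b) {
        return static_cast<std::int64_t>(a) - static_cast<std::int64_t>(b);
    }

    // length_diff(0, 1) == -1, rather than wrapping to SIZE_MAX.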


All of this is type-safe, if the operations check for out-of-range values and signal appropriately.


But the argument wasn't "be type-safe", it was "don't do things where the input and output are of different types".


I agree with you that negative lengths are not a good solution, but...

"One should just write correct code"

No one writes correct code, so our languages should be designed to prevent mistakes from being made or from going unnoticed. I'm wondering if the compiler in this case could have done more to inform the programmer of a possible bug. A 'this expression is likely to underflow' warning would have made a big difference, I suspect.


Regarding 2., I'm going to assume they have at least some unit tests.

It's not clear that even 100% coverage would have picked up this bug, since it doesn't automatically appear as a result of taking a particular code path.

I think that comparing to a naive algorithm on a large number of strings, or using a fuzzing tool, might be helpful, but this might also slow down the test suite excessively.
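
As a sketch of that kind of differential test (written here against C++'s std::string::find purely for illustration, with a naive reference implementation):

    #include <cassert>
    #include <cstddef>
    #include <cstdlib>
    #include <string>

    // Naive O(n*m) reference implementation of substring containment.
    bool naive_contains(const std::string& haystack, const std::string& needle) {
        if (needle.empty()) return true;
        for (std::size_t i = 0; i + needle.size() <= haystack.size(); ++i)
            if (haystack.compare(i, needle.size(), needle) == 0)
                return true;
        return false;
    }

    int main() {
        // Differential test: compare the library search against the naive
        // reference on many short random strings over a tiny alphabet, so
        // periodic needles (like "nana") come up often.
        for (int iter = 0; iter < 100000; ++iter) {
            std::string haystack, needle;
            for (int i = std::rand() % 30; i > 0; --i) haystack += "ab"[std::rand() % 2];
            for (int i = std::rand() % 6; i > 0; --i) needle += "ab"[std::rand() % 2];

            bool fancy = haystack.find(needle) != std::string::npos;
            assert(fancy == naive_contains(haystack, needle));
        }
        return 0;
    }

Restricting the alphabet to two letters keeps the strings short while still hitting highly periodic needles.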

I find unit tests are a bit like good form at the gym. Whenever someone gets injured, they think back to the last exercise they did and find some imperfection in their form. They then attribute their injury to not using perfect form. Similarly, whenever a bug appears, there is a tendency to find a unit test that would have caught the bug, and therefore attribute it to insufficient testing.

The real value of testing (and good form) can only be demonstrated by having a criterion for good testing and seeing whether following it reduces the number of bugs.


No unit tests for the string library: https://github.com/rust-lang/rust/blob/6843d8ccd5/src/libcor... (That's the hash for tip at the time of this comment.)


There's a lot of tests: https://github.com/rust-lang/rust/blob/f66fd2eed10de2acacd8a...

Testing is something Rust is very big on, e.g. we run the test suite on many architectures/platforms each pull request and merge only if it passes on all: master is always green.


Ah, I see the ones for contains: https://github.com/rust-lang/rust/blob/f66fd2eed1/src/libcol...

Thanks for pointing them out.


There are more tests that test the string library in src/test/run-pass/. This reflects the fact that std::str predates the current unit testing infrastructure, basically.


They did have some tests, obviously not enough though.


I don't think it is the number but the depth. Something like QuickCheck would help. Or, having simpler code.


This is a really good point, and it looks like QuickCheck is supported on Rust -- here's one of several implementations:

https://github.com/BurntSushi/quickcheck

Would be pretty cool to see someone hook it up to the standard library tests.


On #1: Signed underflow and overflow in C and C++ are undefined behaviour and can do weird things, especially in combination with compiler optimizations: the compiler can simply assume they never happen. Using unsigned types means that when someone passes you a negative int, at least the result is well-defined and can be debugged when your program explodes.

Use unsigned types for unsigned values, and ensure all (new at least) code compiles cleanly with -Wconversion and -Wsign-conversion enabled, which should really be the default warning level these days.
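
For example, a function like this (hypothetical name) draws a warning under those flags, which is exactly the class of mistake at issue:

    #include <string>

    // Hypothetical helper: how much longer is the haystack than the needle?
    int length_gap(const std::string& haystack, const std::string& needle) {
        // g++ -Wconversion -Wsign-conversion warns here: an unsigned
        // difference (which may already have wrapped) is implicitly
        // converted to a signed int.
        return haystack.size() - needle.size();
    }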


I think the correct solution is for the type system to be able to say "Hey, this unsigned int is being subtracted from and was not checked to ensure it is greater than the value being subtracted." And the typical solution would be to either a) invert the subtraction to an addition, b) do a check, or c) annotate the type.


Or add a proof.


There are many comments discussing whether length should be unsigned or signed, and there are arguments for and against. This is a prototypical bump in the carpet: you squash it in one place and it raises its head somewhere else.

I see it not as a question of whether lengths should be signed or unsigned, but of whether subtraction, assignment, etc. should be polymorphic w.r.t. signed and unsigned. I think the issue here is that the polymorphic variants of these binary operators are inherently risky.

Casting gets a little tedious, but languages that do not have operator overloading should disallow subtraction from unsigned and subtraction that returns unsigned. You either cast it up or, if possible, reorder the expression/comparison so that the problem goes away. Even assignment can be a problem. OCaml can get irritating because it takes this view, but I think it is safer this way. It is very hard to be always vigilant about the signed/unsigned issue, but hopefully a compiler error will mitigate the risks; not completely, but it is better than nothing.

That leaves languages that allow operator overloading; in those cases, if you are overloading an operator you'd better know what you are doing, and watch out for cases where the operator is not closed.


What is the signed value of 0-UINT_MAX?

There is no way to handle this issue on machine types, unless part of the range is reserved for overflow and checked. But performance nuts intentionally discard safety for performance.
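
Concretely: for 32-bit operands the question does have an answer if you widen first; it's the full-width case that has no wider machine type to escape into, which is the broader point.

    #include <climits>
    #include <cstdint>
    #include <cstdio>

    int main() {
        // The mathematically correct answer, -4294967295, doesn't fit in a
        // 32-bit int, but it fits comfortably in a 64-bit signed type
        // (assuming the usual 32-bit unsigned int).
        std::int64_t d = 0LL - static_cast<std::int64_t>(UINT_MAX);
        std::printf("%lld\n", static_cast<long long>(d));
        return 0;
    }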


As I suggested, there should not be a '-' that returns unsigned. If you want to cast the result back to unsigned you would have to do it explicitly, and ideally with a checked cast, though I am not averse to a more verbose unchecked cast.

I am not averse to binary operators with wrap-around/modulo arithmetic, but they should look different from the usual '+', '-', etc. This only applies to languages that do not allow overloading operators.


I agree. It seems to me that saying unsigned - unsigned is unsigned is just asking for trouble.


I've seen some very strange things in my career. I find posts like this delightful, because I can point to them and say "see, here is proof that even code written with the best of intentions can still have bugs."

Programmers tend to fall into (at least) two camps: the skeptics and the pragmatists.

Sometimes when I report a finding, programmers accuse me in one way or another of messing something up because “that can’t possibly be failing.” Those are the skeptics, using incredulousness almost like a shield to protect their worldview. They tend to have an up-close/what’s right in front of them approach to programming, are prolific and usually take a positive stance on programming.

At other times, reporting a finding is met with resignation, almost like "please work around it because we just don't need this right now". Those are the pragmatists, taking the long view/forest-for-the-trees approach, knowing that programming is more than the sum of its parts but also that it's a miracle it even works at all. They are the daydreamers and sometimes are perceived as negative or defeatist.

I was a pragmatist for as long as I could remember, but had a change of heart working with others in professional settings. I saw that certain things like databases or unix filesystems could largely be relied upon to work in a deterministic manner, like they created a scaffolding that helps one stay grounded in reality. They help one command a very mathematical/deliberate demeanor, and overcome setbacks by treating bugs as something to be expected but still tractable.

But here is one of those bugs, where the floor seemed to fall out from under our feet. One day I mentioned that “SSL isn’t working” and about half the office flipped out on me and the other half rolled their eyes and had me track it down:

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=694667

The gist of it is that OpenSSL was failing when the MTU was 1496 instead of 1500, because of path MTU discovery failing and SSL thinking it was a MITM attack and closing (at least, that is how I remember it, I am probably futzing some details).

That was odd behavior to me, because I see SSL as something that should be a layer above TCP and not bother with the implementation details of underlying layers. It should operate under the assumption that there is always a man in the middle. If you can open a stream to a server, you should be able to send secure data over it.

Anyway, we fixed the router setting and got back to work. All told I probably lost a day or two of productivity, because the office network had been running just fine for years and so I discounted it for a long time until I ruled out every other possibility. I’ve hit crazy bugs like this at least once a year, and regret not documenting them I suppose. Usually by the time they are fixed, people have forgotten the initial shock, but they still remember that you are the one weirdo who always seems to break everything.


> One day I mentioned that “SSL isn’t working”

The distinction is who they think you are. If a junior programmer comes up with "I think gcc is broken", well, they could be right, but my first instinct is to say "no, probably not".

Then Linus says:

---

For chrissake, that compiler shouldn't have been allowed to graduate from kindergarten. We're talking "sloth that was dropped on the head as a baby" level retardation levels here

---

I'll probably believe him.

These are two extremes. The pragmatists-vs-sceptics split is just as much about pragmatism and scepticism towards the person reporting the problem.


For context, that is an actual Torvalds quote, talking about GCC 4.9.0: https://lkml.org/lkml/2014/7/24/584


Yes, there are certain tools that I trust almost completely, like TeX [1], and that helps me focus on finding my own bugs, because I'm not tempted to spend time wondering whether it's the tool (build tools, library, compiler, etc.) that has the bug and not my own code.

[1] http://texdoc.net/texmf-dist/doc/generic/knuth/errata/errorl...


Yes, TeX is almost bug free.

But once you start to use LaTeX you are no longer so safe. If you use two LaTeX packages, you have to pray that they are compatible and that they are loaded in the right order. If you have to use babel (because you don't write in English), you are probably out of luck.


You are so right! (I blame the macro expansion based semantics of TeX packages.)

I end up using a small number of very large packages that do almost everything I need instead of trying to get numerous more focused packages to work together. (Memoir or KOMA, and TikZ handle most of my tasks.)


Yeah, I like the phrase "suspecting a compiler bug is the principle of last resort". (Maybe there should be an "unless you're using Rust" thrown in there.)

It's not that compilers can't have bugs. It's that you're more fallible than the compiler, and should use that likelihood when deciding where to investigate to fix the bug you're seeing.


I remember a long day tracking down a bug in some low level networking code that turned out to be bad RAM on a test machine. I remember thinking "This is probably what going crazy feels like." But, I still trusted the compiler more than the HW.


To be fair, this was a bug in the standard library, not the compiler. You should trust your stdlib too, though, unless there's a demonstrable problem like the one in the article.


Why is this code so damn fancy? Shouldn't the fanciness be offset by proofs or extended testing? Open loop!


Two-way searching is the norm for substring matching. Even your libc is probably using it for strstr(). For example, here's musl's implementation: http://git.musl-libc.org/cgit/musl/tree/src/string/strstr.c


Agreed, this seems like a case of excessive complexity. I hadn't even heard of "two-way" before, but I don't think an algorithm the paper claims to be "remarkably simple" should require twenty-five pages to describe.

"two-way" isn't all that fast either, so I don't see any good reason for its use; in the benchmarks I could find, the simpler BMH easily beats it (and optimised BMH variants go even faster): http://volnitsky.com/project/str_search/benchmark-4.5.1.png http://blog.phusion.nl/2010/12/06/efficient-substring-search...


It would be interesting to put a trace on `strstr` for a couple of days and see what the distribution of arguments looks like.

ToDo: Profile libc for various workloads, record arguments and code coverage


My guess is that the naive searcher is O(nm) and two-way is O(n) for haystack length n and needle length m. (Two-way also needs only constant extra space, which is part of why libc implementations favour it.)


I try to never look at GPL/LGPL code if I'm going to implement something at work or under another license.

Sorry for suggesting a good practice to avoid legal liability.


> Sorry for suggesting a good practice to avoid legal liability.

Why is this good practice? Has anyone ever been exposed to a legal liability that could be creditably attributed to their merely looking at (as opposed to using) such code? (Although my suspicion is that the answer is 'no', this is an honest question.)


I think the idea is that it's going to be difficult, having looked at the code, to implement a version that is not similar enough to the original to trigger a copyright "derivative work".

For example, I believe it used to be the case (just a few years ago) that Microsoft employees were contractually prohibited from looking at GPL'd code.

But if alayne is planning on bringing that up every time someone shares a post that might contain GPL code, alayne is quickly going to become the most tedious person at the party, and that's why the downvotes are happening.

If your situation prevents you from studying and learning from GPL'd code, then just don't study it, and don't brag about not studying it.


To be clear, I didn't mean "why might there be fear about this?" (which I can understand, though it sounds a bit hysterical to me), but rather "has this ever actually happened (outside of a lawyer's dreams / nightmares)?", as a weaker version of the question "is avoiding this really good practice?".

P.S. Perhaps a good defence against a claim of copyright infringement would be "you can tell that my version isn't a copy, because it correctly finds 'nana' as a substring of 'banana'"? ;-)


I see. I'm not aware of any case law either.

I don't think it's good practice, and I agree that it's hysterical; I think it stems from unthinkingly believing Microsoft's anti-OSS claims during their decade-long freakout about open source and the GPL.

I think that attempting to convince people not to participate in sharing and studying code with the rest of the world is an evil thing that they did, especially since the GPL was the most popular OSS license at the time.


I'm absolutely not anti-OSS. I personally think it is against the spirit of the GPL to copy code that is GPLed, though.

I'm just saying be careful about how you do your engineering. This seemed like an inexperienced developer, and he mentioned glibc vocally along with the comment "I'll admit that originally I copied this logic from glibc's implementation without fully understanding it, but I've now taken a closer look at the Two-Way paper and understand what it does." It's certainly not good software engineering to copy an implementation like that. It is good that he read the paper.

Now I regret even bringing it up. I don't feel that I deserve your ad hominem comments or characterizations of my motivations.


> I personally think it is against the spirit of the GPL to copy code that is GPLed, though.

Why? I assume that you don't mean literatim copying, which is obviously the whole point of OSS, but rather copying with modification; but, even then, a quick read suggests that http://www.gnu.org/copyleft/gpl.html#section5 explicitly permits this activity, as long as you meet certain conditions. Maybe I should have understood an implicit insertion "to copy, without acknowledgement, code that is GPLed"?


Okay, I see where you're coming from now; I apologize for jumping to conclusions.


Yes, it can come up. See the recent Google / Oracle lawsuit, where it was specifically presented that Dalvik was a 'clean room' implementation.


> Yes, it can come up. See the recent Google / Oracle lawsuit, where it was specifically presented that Dalvik was a 'clean room' implementation.

I have no doubt that it can come up—all sorts of ludicrousness can; but my question was rather whether there have actually been any legal consequences, not just legal challenges.

I haven't been following the lawsuit, and can't find at http://en.wikipedia.org/wiki/Oracle_v._Google any clear indication of whether any of the decisions specifically find Google at fault specifically for copying GPL'd code. This article (http://www.zdnet.com/blog/bott/the-real-history-of-java-and-...), which may or may not be reliable, implies that there are subtle concerns at play:

If this were a simple matter of some code being open and some not, this wouldn't be a multi-billion-dollar lawsuit. But the precise details of how Sun chose to write its licensing terms are extremely important.



