Breaking the MintEye image CAPTCHA in 23 lines of Python (jwandrews.co.uk)
129 points by cidquick on Jan 19, 2013 | hide | past | favorite | 56 comments


For those interested, the minteye captcha has been broken by other methods as well.

Speech recognition: https://gist.github.com/4520930

Laplace: https://gist.github.com/4564489

Fourier transform: http://nbviewer.ipython.org/urls/raw.github.com/rjw57/mintey...


Could someone explain the general idea behind using FFT in these situations (excusing my ignorance)? I recall it from my university days but that was completely out of context. I know you can use it to decompose audio signals into frequency components, I don't understand how it applies to images though.


The basic idea is that just as an audio signal can be decomposed into frequency components, an image can be decomposed into frequency components, but in two dimensions. (Imagine stripes of varying frequency and direction instead of sine waves of various frequencies.)

Sharp edges in an image have high-frequency components, just like a sharp transition in an audio signal. Filtering out the high frequencies will blur the image, while filtering out the low frequencies will enhance edges.

This is very oversimplified, but will hopefully give you a bit of an idea. I encourage people to learn more about Fourier theory, because it explains a lot.

One introductory page that you may find useful is: http://cns-alumni.bu.edu/~slehar/fourier/fourier.html
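To make the frequency-domain intuition above concrete, here is a small numpy sketch (dimensions and the cutoff radius are just for illustration): it builds a sharp-edged test image, zeroes out its high frequencies, and gets back a blurred version.

```python
import numpy as np

# Hypothetical 64x64 test image: a sharp white square on black.
img = np.zeros((64, 64))
img[24:40, 24:40] = 1.0

# 2D FFT, shifted so low frequencies sit at the centre of the spectrum.
F = np.fft.fftshift(np.fft.fft2(img))

# Low-pass mask: keep only frequencies within radius 8 of the centre.
yy, xx = np.mgrid[-32:32, -32:32]
mask = (xx**2 + yy**2) <= 8**2

# Discard the high frequencies and transform back.
blurred = np.real(np.fft.ifft2(np.fft.ifftshift(F * mask)))

# The square's hard 0-to-1 jumps are now smeared over several pixels:
# removing high frequencies blurred the sharp edges.
```

Inverting the mask (keeping only the high frequencies) leaves mostly the edges, which is the flip side described above.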


That makes perfect sense. Thanks for the link - that's given me a much better intuitive understanding. I'll have a dig around some of the other transformations mentioned now.


Hahahaha, a comment on the blog:

"Please post the Visual Basic codes for this. The language you post in the article is not Visual Basic.

Thank you."

xD God !


As someone who wrote a captcha decoding article (http://wausita.com/captcha/) which ranks pretty highly, I can honestly say requests like this are real. Heck, I even get abusive emails from spammers when they ask for help and I point out the relevant portion they should read rather than just supplying "tha codez".

If it is sarcasm this guy nailed it. Looks like half my inbox.


It just works on so many levels.

It really makes me wish for some kind of federated comments system or unique username so I can go out and find more of his work!


It's totally sarcasm


Wasn't there something about how 'high' sarcasm is indistinguishable from the real thing?

Unfortunately I can't make the distinction, and I think it's most likely the commenter was serious.


Sounds like Poe's Law to me. http://en.wikipedia.org/wiki/Poes_law


Not sure it's sarcasm, more like some blackhat SEO who wants to exploit it.


I agree. With a name like Zhou, it sounds like a Chinese guy wants to try to use the code in the wild. I guess one plus for the MintEye captcha is that it's hard to defeat with only Visual Basic.


This is nitpicking. Well, maybe not.

This code does not take a swirled image and solve it. It takes a set of images with various swirl levels and finds the one with the shortest sum of edge lengths.

The code does not do any un-swirling of the image. It also uses external libraries to convert to grayscale, apply the Sobel filter and compute the sum of edges.

In other words, it is far from breaking the swirled CAPTCHA in 23 lines of Python. If you had to write the un-swirling, gray-scaling, Sobel filtering and summation code in Python you'd be looking at a much larger pile-o-code.

The demo and the intent is good. I just wish the title didn't say "23 lines of". I don't think the article would have suffered at all if the title was "Breaking the MintEye image CAPTCHA with Python".

Again, the intent and what is being demonstrated are excellent. The title, in my not-so-humble opinion, is hugely misleading.
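For readers following along, the pipeline being debated here (grayscale, Sobel filter, sum of edge magnitudes, pick the candidate with the smallest sum) can be sketched in plain numpy. This is an illustrative reimplementation, not the article's OpenCV code; all names below are made up.

```python
import numpy as np

def edge_sum(img):
    """Sum of Sobel gradient magnitudes over a 2D grayscale array."""
    a = np.asarray(img, dtype=float)
    # 3x3 Sobel responses via array slicing, evaluated on interior pixels.
    gx = (a[:-2, 2:] + 2*a[1:-1, 2:] + a[2:, 2:]
          - a[:-2, :-2] - 2*a[1:-1, :-2] - a[2:, :-2])
    gy = (a[2:, :-2] + 2*a[2:, 1:-1] + a[2:, 2:]
          - a[:-2, :-2] - 2*a[:-2, 1:-1] - a[:-2, 2:])
    return np.hypot(gx, gy).sum()

def least_swirled(candidates):
    """Index of the candidate with the smallest total edge length,
    i.e. the one presumed least swirled."""
    return min(range(len(candidates)), key=lambda i: edge_sum(candidates[i]))
```

Swirling scrambles smooth regions into extra edges, so among renderings of the same picture at different swirl levels, the unswirled one tends to minimise this sum.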


>it is far from breaking the swirled CAPTCHA in 23 lines of Python

Except it does break the captcha in 23 lines. I don't get your point.

Why would he need to do any 'un-swirling' in the first place? That's not the point of the captcha. The captcha provides you with multiple images and you have to pick the least swirled one - which is exactly what his code does.

And you argue that because he used external libraries, it's not 23 lines of Python. Of course it is. He had to write only the 23 lines that make up the logic to break the captcha. If you'd argue that any previous work (or lines) done by other people has to be added to the line count of your own project, I could also argue that 'print("Hello")' is not one line of Python, because it uses the standard Python library, which in turn calls native code, which makes a few syscalls, at some point resulting in hundreds of lines of compiled kernel C code being executed.

So the line count of his 23-line Python script would actually be: 23 lines of Python + various libraries + the entire Linux kernel = roughly 20 million lines of code. Good job.

And yes, that was nitpicking. And I'm angry for some reason.


No need to be angry. None of this is going to remove food from your table or affect your life in any way whatsoever. Take it easy. I am just opening the topic for conversation. We can discuss things without pulling out semi-automatic weapons, right?

When someone publishes code and says something like "solved <problem> in <n lines> of <pick your language>" there generally is an implied "my language is better than yours" subliminal message that, for some strange reason, is based on line count.

Let's not debate the merit of line count as a measure of how good a language may or may not be. If we are going to go there I'll write the same solution in APL and we'll start counting characters instead. Again, line or character count, in my opinion, is not a measure of the "superiority" of a language.

The 23-line solution is not pure Python. How do we define pure Python? Let's just say that "solved <problem> in <n lines> of Python" to me means that you download and install Python:

http://www.python.org/download/releases/2.7.3/

Install nothing else whatsoever and write code that solves <problem>.

OpenCV, which is what the author used for a couple of aspects of this code, is an extensive (and really cool) C++ library that is being accessed from Python. Here's the source:

https://github.com/Itseez/opencv

I'll leave it up to the reader to find the source for the Sobel method.

A much more honest title could have included something like "solving <problem> with <n> lines of Python using OpenCV".

As I said before, the intent of the article and what is being demonstrated are good, no, great. I just wish the title was a little more reflective of reality. That's all. No need to get worked-up. This is not that serious of an issue. Just a comment.


Someone has disagreed with you -- which you anticipated would happen. Therefore he must be angry?

I agree with him. If it's wrapped up in a library and it's distributed by a package manager then it doesn't count towards the total SLOC of your project.


Read the last line of his post please. He said he was angry, not me.

>If it's wrapped up in a library and it's distributed by a package manager then it doesn't count towards the total SLOC of your project.

Think about what you are saying. With that logic I can write a library in C++ that evaluates all images in the current working directory for least edge length and returns the name or index of the winning file. My Python program, then, might look something like this:

    import magic
    print magic.evaluate()
And then I claim that I have written a program that solves a swirly CAPTCHA in two lines of Python.

C'mon.

If you want to count true Python lines, download the language from python.org and write a solver without the use of any add-on libraries. Then we can talk about Python lines.


http://www.jwandrews.co.uk/2013/01/breaking-the-minteye-imag...

OK, I didn't write the JPEG decoder, but how far do you want me to go?!


That's great.

Please note that I did not have any issues with the method you used. My issue was only with the title of your post not being accurate. You used Python + OpenCV in addition to pre-un-swirled images.

To me at least, solving the problem with n lines of Python means that I email you a single swirled image and you, using Python as downloaded from python.org and nothing more, write a solver. The image doesn't even have to be JPG. It can be an easy to read non-compressed format. That'd be fine. But you'd have to un-swirl and do everything else, which is a lot more code.

Don't lose any sleep over this. It isn't important. Your original post was fantastic and very informative. My comments were only about the title and how, in every language camp, there's sometimes a tendency to look down upon other languages by quoting such nonsense as line counts.


Well put. The title is indeed completely factual! Including '23 lines' was just to let people know at first glance that the solution is simple.

> If you had to write the un-swirling, gray-scaling, Sobel filtering and summation code in Python you'd be looking at a much larger pile-o-code

I would also like to point out that these are all very simple operations, which would only take a few lines of Python/C/whatever to do. Python just happens to have pre-written libraries to save reinventing the wheel.


> I would also like to point out these are also all very simple operations, which would only take a few lines of Python/C/whatever to do.

Yikes! They are not. That's a lot of code. I've done a ton of image processing work in both hardware (FPGA) and software (various languages). Swirling, gray-scaling, Sobel filtering and summation collectively are far from "a few lines of Python/C/whatever to do".

> Python just happens to have a pre-written libraries to save reinventing the wheel

That's not true. OpenCV is NOT a Python library, it's nearly 100% C++ as far as I can tell:

https://github.com/Itseez/opencv

That was, in many ways, my point. This is not a solution in 23 lines of Python. Go install Python from Python.org --and NOTHING ELSE-- and solve the problem. Let's see how many lines of Python it takes.

There's nothing wrong with the intent of the author in terms of showing how one can break these CAPTCHAs using edge-length evaluation. I never put any thought into this myself so, yes, I learned something of value from his post. I just wish the title was more genuine, that's all.


My turn to nitpick!

RGB to grayscale: V = (r + g + b)/3 (one line).

The Sobel operation is at most 4 nested loops (really only 3) for a total of ~7 lines of code, depending on how you like your white space.

Fold the summation into your Sobel function without requiring another line.

The most complicated thing used from the OpenCV library (which I'm well aware is C++!) is JPEG decoding. OK, I'm avoiding swirling, but then it isn't needed to solve the CAPTCHA anyway.


Show me working code and then we can talk. :)

Oh, BTW, "V = (r + g + b)/3" is incorrect. This is not how you convert a color image to a grayscale image.


OK, I will. I'll assume that we already have the image in a 1-dimensional array, each element containing an array of the RGB values. Then we can convert it to grayscale with one line of Python.

    gray = map(lambda p:sum(p)/3,IMG)
If you want to test it, you can use this 3x3 sample image (or just load one):

    IMG = [[5, 5, 5], [6, 8, 9], [94, 123, 4], [54, 5, 32], [44, 3, 3], [34, 234, 33], [5, 5, 5], [6, 8, 9], [94, 123, 4]]
The result will look like this:

    [5, 7, 73, 30, 16, 100, 5, 7, 73]
If your image happens to be loaded into a 2-dimensional array, use this:

    gray = map(lambda row:map(lambda p:sum(p)/3,row), IMG)
Sobel is slightly more complicated, but can be written like this (you'll need a larger sample image though):

    from math import sqrt  # needed for the gradient magnitude below
    width, height  # width and height of our image (assumed already set)
    IMG            # 2D array of our grayscale pixel values
    sobel          # result image (same dimensions, assumed allocated)
    # The actual filter starts here:
    for x in range(1, width-1):
        for y in range(1, height-1):
            sx = IMG[x-1][y-1]+IMG[x][y-1]*2+IMG[x+1][y-1]-IMG[x-1][y+1]-IMG[x][y+1]*2-IMG[x+1][y+1]
            sy = IMG[x+1][y-1]+IMG[x+1][y]*2+IMG[x+1][y+1]-IMG[x-1][y-1]-IMG[x-1][y]*2-IMG[x-1][y+1]
            sobel[x][y] = sqrt(sx*sx+sy*sy)
A total of 5 lines. And I have to thank you, because I finally just learned what convolutions are while doing this.
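One caveat for anyone trying these snippets today: they are Python 2 (`map` returns a list, and `/` truncates integers). A Python 3 equivalent of the one-line grayscale average, using the same sample image, would be:

```python
# Same hypothetical 3x3 sample image as above (9 RGB triplets).
IMG = [[5, 5, 5], [6, 8, 9], [94, 123, 4], [54, 5, 32],
       [44, 3, 3], [34, 234, 33], [5, 5, 5], [6, 8, 9], [94, 123, 4]]

# Floor division (//) reproduces Python 2's integer-truncating result.
gray = [sum(p) // 3 for p in IMG]
# gray == [5, 7, 73, 30, 16, 100, 5, 7, 73]
```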


Excellent! Now we are starting down the right path. I don't have time to fully look at your code right now. I will later.

Just one quick observation. You can't convert from RGB to grayscale by simply averaging the three values. Each color channel influences the perceived luminance (grayscale) differently, with green being, by far, the largest component and blue the least significant. Rather than give you the answer I'd suggest you research "converting RGB to grayscale" or "converting RGB to luminance" as this is an important subject to understand if you are dealing with images.

I'll take a look at the rest of it later and comment.


If you had done any serious dealing with colors you would know that there are multiple ways of converting to grayscale. I personally know of three: average, lightness and luminosity. All three are ways of converting to grayscale and every good image manipulation software (GIMP, Photoshop...) will offer you all three. I picked the first one because it's the one that is easiest to understand and the one that almost everyone will be able to come up with.

> You can't convert from RGB to grayscale by simply averaging the three values

Of course I can

Edit: I just googled "RGB to grayscale" and this was the first result: http://www.johndcook.com/blog/2009/08/24/algorithms-convert-...

I recommend reading it


> If you had done any serious dealing with colors you would know

And that's the end of the conversation, isn't it?

You could have simply modified your program based on my friendly input and done it correctly. How was jabbing me in the eye a better choice? Particularly when you don't even know me. That's unfortunate.

Perhaps you might consider emailing me privately? I'll provide you with references. See my HN profile for the address.

I have only devoted somewhere between 20 to 25 years of my life to, among other things, deal with accurate color and image processing in both hardware and software. So, yeah, I know a thing or two about the subject.

There's doing it right and there's doing it wrong. Averaging RGB values to derive grayscale is --and I am trying hard not to say what I really want to say-- not the right way to do it.

Part of the context here is to consider the source of the images you might be processing. The very design of every single camera in the market is based on a relationship between these color primaries that is to be maintained across the processing pipeline.

No device I know of will uniformly average the RGB channels as this is simply the wrong way to process and deal with color accurately. You can get away with this kind of thing for very specific applications (if you are computationally limited AND know exactly what you are doing).

Even then, you can, as I have done in hardware a few times, massage the coefficients to better reflect reality. One such example is the implementation of a "cheap" motion detection facility in hardware (FPGA). In this case floating point math is not an option (and it wouldn't make sense) so you can either futz with the coefficients or use a set of pre-computed lookup tables to do it accurately.

In some cases you can even ignore red and blue and just use green as reference. Again, just like before, knowing the application and fully understanding what you are doing is critical when making such choices.

In this case you are trying to detect edges in an image that is, more than likely, not artificial. In other words, it might be a photograph. It, more than likely, is or came through a JPEG image. This means that the image, regardless of source, was converted to YCbCr color space and then handed back to you as RGB. If you want to accurately work with actual image data, and not some distorted, contrast-reduced or contrast-altered grayscale image, the only way to do it is to recover the Y component from the RGB source data by using the correct mathematical approach.

Really, it ain't that hard:

Y = (0.299 * R) + (0.587 * G) + (0.114 * B)

This corresponds to CCIR601. Things can get a little confusing as the primaries were modified slightly for REC709 (another imaging standard). JPEG is defined around CCIR601 primaries, so the above noted coefficients are correct for that application.
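The difference between the plain average and the CCIR 601 weighting above is easy to check on saturated pixels (a quick sketch; values in 0-255):

```python
def luma(r, g, b):
    # CCIR 601 luminance weights, as given above.
    return 0.299 * r + 0.587 * g + 0.114 * b

def average(r, g, b):
    return (r + g + b) / 3.0

# Pure green: the average calls it a mid-gray (85.0), while the luma
# formula rates it far brighter (~149.7), matching perception.
print(average(0, 255, 0), luma(0, 255, 0))

# Pure blue goes the other way: 85.0 by averaging, only ~29.1 by luma.
print(average(0, 0, 255), luma(0, 0, 255))

# The weights sum to 1, so white stays white: luma(255, 255, 255) == 255.0.
```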

To anyone dealing with color professionally these shortcut "solutions" reveal nothing but utter ignorance of the underlying science. I do not intend this as an insult; it's just a fact. Barring a very specific and valid reason for taking such shortcuts, these "solutions" are always a bad idea.

I happen to have a pretty good handle on --among other things-- color science. I am, however, clueless about building rockets. That said, if I wanted to build a rocket you can bet I'd spend a non-trivial amount of time learning as much about the subject as possible before using uninformed shortcut solutions.

Real Color Scientists cringe at this sort of stuff because, in darker times, it made it into all kinds of programs written by color-science-ignorant programmers. These programs caused untold havoc with image processing. Thankfully things are far better now as those doing serious work with images have taken the time to understand and learn about color science.

If you really want to learn to process images properly and accurately forget that the idea of (R+G+B)/3 ever existed, remove it from your vocabulary and replace it with the above.

Also, go browse around the Rochester Institute of Technology website. I spent a bit of time there. Color Science is one of their focal points. Lots of good info there, even a number of interesting courses.


Well. My post was intended to provoke an emotional (err angry) response, mainly because I have a personal problem with people being overly nitpicky and sounding extremely arrogant. I don't know if you're aware of it, but that's how you come across. I already knew about your background because you mentioned it in another post.

That aside, I really appreciate your effort of teaching me about converting RGB values to gray scale.

On the other hand, I don't think that converting the image to something that is accurate to the human eye is something we want to do here (we're not going to show the result to one anyway). Using your formula we would end up overly favoring green levels for our edge detection, even though we want to treat all colors equally. That is, unless the next thing you're going to argue about is the significance of different levels of blue to edge detection, and that green in general is the better color for detecting edges.

Call me mean or just stupid, but for my part I think we both should reflect on how much amateurishness we can tolerate and when to just not reply to something.

Sometimes trying to show someone how ridiculous he comes across by mimicking his behavior leads to people wasting countless hours on a pointless internet argument that, at some point, is over something not even really related to the main points the authors were trying to prove.

I would have usually left it like that, but in this case I felt bad because you actually made a great effort explaining all that stuff and yourself. Thought it might be fair to tell you how I see this.

Still some interesting stuff, but a lot of information. You should get a blog and write some lengthy (in a good way) posts about this. You seem to have quite a lot to tell people about this, and also a desire to.

I'll continue feeling a little bad about this while I'm sleeping.

Good night

chmod

Edit/PS: Yes, I thought about Blade Runner while writing the first sentence.


Ah, Blade Runner. You are one of the good guys.

No, I didn't get angry (and no, I am not a Replicant). It actually saddened me that what seemed like a useful conversation stopped dead-cold with such a personal comment. As I said, I have devoted a lot of my life to image and color processing. I guess it's part of the problem of being somewhat anonymous, something I am moving away from slowly.

Look, it's easy to come off as arrogant over email, newsgroups or similar means of communications. Part of it is that sometimes people take it the wrong way when someone comes out in an authoritative manner. I do. However, I only do that when (a) I really, really know what the hell I am talking about and (b) I don't have the time to write twice as much text to cover all corner cases and be sure that everyone sees me as "nice". I've heard talks by Linus where I've thought he came off as an arrogant asshole. Then I slowed down and realized where he was coming from. Once you understand that it all makes far more sense and, yes, it stops feeling arrogant.

I'm not 16 any more, so I don't really care about seeming "nice" online because, well, it's hard and it takes time. This, for me, isn't a popularity contest. I'm simply, honestly, trying to share something and learn as well. For example, I don't use Python that much at all. Inspired by this thread I sat down and played with Python quite a bit. That's a good outcome, at least for me anyway.

With regards to the idea of favoring green more than red and blue. This isn't the intent of the equations. This is actually what happens in the real world. If you look at the spectral power distribution of a captured image you will see that, generally speaking, there's a lot more energy around the green portion of the spectrum. I am over-simplifying and cutting corners here, but that's one way to think of it.

In other words, in normal images with normal lighting there's far more green stuff than red or blue. And so, in converting an image to a grayscale representation you have to account for the fact that green contributes to the image twice as much as red and six times as much as the blue component. If you don't apply these weights to the image you are going to be evaluating such things as noise and attributing far more value to image structures in the other channels.

Another generalization is that image noise is generally found in the blue channel far more prominently than in the other channels. If you simply average all three channels you are effectively amplifying the blue channel. Blue should have had a weight of about 10% and you are giving it 33%. You have just tripled its importance and, if there's any noise there, you've just multiplied it by three. When it comes to green, you are halving its contribution from about 60% to 33%. Here's the component that generally contributes the most information to an image and, by averaging it with the other colors, its contribution is now cut in half. Finally, red is the component that suffers the least (almost not at all) from averaging. Red contributes about 30% to an image; averaging amplifies it to 33%.

With regards to a blog. Actually, I've been thinking about it. Maybe later this year. A blog feels far more "serious" than posting in places like HN.

Don't feel bad either. Life is too short to get worked up about stuff that, in the grand scheme of things, matters not at all.


The problem is that you've entirely missed the point. The code solves the problem. We don't care about recreating grayscale to match human perception, or whatever, we care about solving the answer placed in front of us.


Sorry, it was late!


If you aren't going to have a cut off point for what is considered solving the problem and what is considered ancillary, then you can define new languages where a zero byte input stands for whatever solution you want, simultaneously one-upping everyone and contributing nothing to the conversation.


Given that the test of "human-ness" is the ability to drag a slider (or however it is you interact with the captcha), picking the least swirled of a set of images is solving the captcha, because that is the test of human-ness. This seems fairly cut and dried to me.


And if you had to write the OS it runs on you'd have a really huge "pile-o-code". What's your point?


> What's your point?

Simple. To go along the lines of your example:

Headline: "Microsoft Word written in 50 lines of Python".

Reality:

The fifty lines of Python do nothing more than call a set of Microsoft libraries written in C and C++ that, well, result in MS Word.

Did the author really write MS Word in 50 lines of Python?

Of course not. It would be absolutely insane to even suggest the idea that this could even approach a valid metric.

Nope, the author simply made use of external non-Python libraries that are the results of probably hundreds of thousands of lines of code and many man-years of work. He does not, for even a microsecond, get to claim that he wrote MS Word in 50 lines of Python.

That's my point. Is it a little bit clearer now?


I would be interested to know of libraries that allow a person to write MS Word with only 50 additional lines of Python.

Again, unless you write the entire stack from the bottom up, you're always depending on vast amounts of code you didn't write. So what's the difference?


Of course that was an exaggeration, a tool to simply drive the point home. It would be impossible to actually do what I suggested.

The point, which for some reason you are not willing to concede, is that if you are going to claim that you solved a problem with x lines of programming language z the line count needs to be real. You can't use something like a library with 10,000 lines of C++ code to then claim that you solved a problem with two lines of Python (or whatever):

    import huge_code_base
    print huge_code_base.solve()
It really is that simple. If you don't see it this way, well, that's fine. Let's leave it be.


That's fine if you want to count it that way. I'm just trying to figure out why a few million lines of OS code gets a pass.


That's a really clever way of detecting the swirl effect. As for simply measuring overall sharpness, that could be thwarted by normalizing the FFT after swirling. That wouldn't help against the sum-of-edges technique though (in fact, it would make it worse).


Reminded me of how Interpol wanted to keep the reverse-twirling capability confidential: http://thelede.blogs.nytimes.com/2007/10/08/interpol-untwirl...


It's actually a super fun exercise to do. I did that years ago with a browser game that showed images on login. Also with Python. It was a fun learning experience because it's so much more visual than your typical "todo list tutorial" or "hello world".


Since the algorithm can only detect the relative swirl amounts across a set of versions of the same image, a solution could be to use a set of different images, each swirled by a different amount (or given some other transformation), and have the user select the unmodified image.


I'd love to see MintEye respond.


"Like all CAPTCHAs, sliding CAPTCHA can be cracked. Still, it's more usable and mobile friendly"

https://twitter.com/EyeOnTheMark/status/291837470486704129


Very interesting article, and an insightful realization. I had no idea Sobel could do that.


What? Edge detection is pretty much the only use for the Sobel operator.

http://en.wikipedia.org/wiki/Sobel_operator


Breaking captcha is bad, mkay :)


If you're using a CAPTCHA, you're doing it wrong already.


Yeah, all these CAPTCHA stories are definitely making me think there's no refuge left of simple tasks humans can do that machines can't.

What's the replacement? Tying every account to a mobile phone or credit card?


The thing is, CAPTCHAs are used to validate that it's a human operating the interface, not a robot.

That's a stupid idea, based on some weird assumption that it's somehow safe to give access to your system for a user, but not for a robot. If the only thing stopping people from messing with your system is automation, you should rethink it from the get-go.


> weird assumption that it's somehow safe to give access to your system for a user, but not for a robot.

Nice point.

> If the only thing stopping people from messing with your system is automation, you should rethink it from the get-go.

Are there systems that can't be abused through automation, or where that's much less of a concern? (Serious question, not intended as snark.)

Maybe throttling activity is a better approach... it's volume that is harmful to the system, not interactions with programs per se.

Don't know how you throttle spammers distributing activity across a number of accounts and IP addresses though...


> Are there systems that can't be abused through automation, or where that's much less of a concern?

Let's analyze. One place CAPTCHA has been used a lot is in comment forms. It increases the opportunity cost of posting a comment via automation by introducing a complicated challenge.

Fair enough. But that doesn't solve the underlying problem (abusive comments). You'll still have abuse and spam, it will just come from humans instead of robots. If I'm a blackhat SEO with a budget, I pay people to solve CAPTCHAs all day and spam the internet (that's what happens).

If you design the comment system to fight abuse from the ground up (e.g., markov chain spam filters, user flagging), you don't need to care anymore whether it's a human or a robot behind the POST, because all the spam actually improves your training set. You defeat spammers with their own data.
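As an illustration of that last point, here is a toy word-frequency filter in the naive-Bayes spirit (every name is hypothetical, and a real system would need proper tokenisation, persistence, and far more training data). Each message flagged as spam strengthens the model used against the next spammer:

```python
from collections import Counter
import math

class TinySpamFilter:
    """Toy word-frequency spam filter: flagged spam strengthens the
    model, so spammers keep improving the training set used against them."""

    def __init__(self):
        self.spam_words = Counter()
        self.ham_words = Counter()
        self.n_spam = 0
        self.n_ham = 0

    def train(self, text, is_spam):
        words = text.lower().split()
        if is_spam:
            self.spam_words.update(words)
            self.n_spam += 1
        else:
            self.ham_words.update(words)
            self.n_ham += 1

    def spam_score(self, text):
        # Log-odds with add-one smoothing; positive means "looks like spam".
        score = math.log((self.n_spam + 1) / (self.n_ham + 1))
        for w in text.lower().split():
            score += math.log((self.spam_words[w] + 1) / (self.ham_words[w] + 1))
        return score
```

After training on one flagged comment and one legitimate one, messages reusing spam vocabulary score positive and ordinary comments score negative, regardless of whether a human or a robot posted them.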


No more so than finding security vulnerabilities. Captchas that are trivial to break are going to be broken no matter what, so it's better for someone non-malicious to break it and make it well known, so that people don't use the captcha, than for it to go unnoticed.


Love it. +1.

But wouldn't an obvious answer to this be to use a background full of swirl, then add people onto it, then swirl again (so the background swirl is swirled twice)?

Then when you "deswirl" with your code, wouldn't you always find something swirled?

Maybe even use a background that's "swirled right" and then "swirl left" the "background + people" (or nyan cat or whatever)?


> Then when you "deswirl" with your code, wouldn't you always find something swirled?

It wouldn't change what he is detecting, which is the sum of the edges.


Did you read the article?



