*NEVER* sanitize your inputs (hackensplat.com)
37 points by billpg on April 14, 2014 | 73 comments


I guess the "never sanitize" headline is clickbait, but the point is valid. "Sanitizing" input is really hard, and can provide a false sense of security. That string has been sanitized, so it's safe! Wait, is it safe for SQL? What about HTML? What about inside a <script> tag? What about a different database engine, or Mongo, or Azure Tables? You are much better off giving up on the illusion of "safe input" that sanitization gives you, and instead always treating user input as data rather than mixing it up with your code.
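To make the "data, not code" point concrete for the SQL case, here's a minimal sketch using Python's sqlite3 module (purely illustrative; any driver with bind parameters works the same way):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE students (name TEXT)")

    name = "Robert'); DROP TABLE students;--"    # hostile input
    # The placeholder keeps the input as data; it is never parsed as SQL.
    conn.execute("INSERT INTO students (name) VALUES (?)", (name,))

    print(conn.execute("SELECT name FROM students").fetchall())
    # The hostile string is stored verbatim and the table still exists.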

My major complaint is that after correctly identifying the solution for SQL, he ends up with nothing to say about HTML. The right approach for rendering user input into HTML is with the Javascript createTextNode() function. That's how you tell the browser that it absolutely shouldn't interpret that content as HTML.


Thanks for that. I'll add a note mentioning createTextNode once I've had a chance to read up on it.


"But that's what we mean by "sanitize"! Then you should stop calling it that."

Ugh, eyeroll. Seriously, let's waste time arguing over what to call security vulnerabilities & ways to address them - instead of using consistent terminology that security-minded developers instantly recognize.

To quote the hilarious Mean Girls - "stop trying to make fetch happen".


Although I understand how you feel, I think OP's point was a bit more meaningful: Calling it "sanitizing" leads some programmers to try to "clean up" the input -- but instead they should contain it.

And when they try to "clean it up", they enter the realm of Falsehoods Programmers Believe About X.

e.g. http://www.kalzumeus.com/2010/06/17/falsehoods-programmers-b...


Okay, let's keep advising people to "sanitize" inputs. Even though it's confusing and there's another word that isn't confusing. Because reasons.


There is no other word that isn't confusing. If you refuse to actually do some cursory research into something before implementing it, you're going to get the stupid delete-is-sanitization stuff the author describes.

"Doctor, it hurts when I do this." Don't do that!


Except it looks like everyone's confused, and there's already a lot of misinformation and a low signal-to-noise ratio.

My Google search for "input sanitization" yielded these as the first two results:

http://en.wikipedia.org/wiki/Secure_input_and_output_handlin...

On the second page (or further down on a smaller screen), under "Other solutions", this is the only line about parameterization: "In particular, to prevent SQL injection, parameterized queries (also known as prepared statements and bind variables) are excellent for improving security while also improving code clarity and performance." Everything else is about filtering, blacklisting, whitelisting, escaping.

http://www.esecurityplanet.com/browser-security/prevent-web-...

Discusses filtering as a solution to HTML injection. It discusses SQL injection last, first recommending mysql_real_escape_string(), then in the second paragraph linking to another article about parameterization.

To an inexperienced developer (this is the web, remember?), it's not a clear-cut best practice you arrive at from just "cursory research". It's a popular tech joke with obvious but suboptimal solutions.


https://www.google.com/search?q=input+containerization

What does the inexperienced developer learn from the new search terms?

I don't know why magically using different, non-standard words would prevent a developer from being inexperienced.


Why do you think that? The idea isn't to tell new developers to search for content that doesn't exist. The idea is to teach better solutions, a small part of which is using correct descriptive language when naming things.

"I don't know why magically using different, non-standard words would prevent a developer from being inexperienced." It's really hard to give any response to this sort of flawless logic...


Well, I mean, after thinking hard about this for a few hours, I do get your point and I see where you're going. I'm still not sure I agree, but I see your point.


This is terribly confusing advice: "NEVER sanitize your inputs!". He means: "just don't call it sanitizing".


Click honeypot


I get the 'click' part, but why is this being upvoted as well?


The argument makes sense in the SQL injection example (don't escape, use prepared statements!) but falls apart when you get to the XSS example. Now we're just trying to redefine words.

"HTML injection" does sound cool though. Since XSS nowadays is not necessarily about sending cookies to another site, perhaps we could adopt "HTML injection" as a more generic term.

Now of course, the problem we're trying to fix is someone who does:

    $content = htmlspecialchars(mysql_real_escape_string(addslashes($content)));
before $content ever hits the database, without any understanding of what those functions really do. It's a surprisingly common cargo cult among newbie web devs. Just throw all the security-related functions together and you'll be safe!


HTML injection already is a term: https://www.owasp.org/index.php/HTML_Injection

Same principle, but different method of exploitation. If we supply plain HTML tags in a vulnerable parameter, it's HTML injection. If we use JavaScript (via a script tag or whatnot), it's XSS.


To me, "sanitize" implies a blacklist approach, which is inherently insecure. For the HTML example, it means you're going through and blocking <script> and such, while allowing the rest through. What you should be doing is keeping a small whitelist of allowed tags and blocking everything else, if you must support user-provided HTML in the first place. That to me isn't sanitizing but rather defining an HTML subset and then translating from it to full HTML.


This whole post is ridiculous. The problem he poorly tries to describe was solved by mathematicians a few millennia ago. In a single word: CONTEXT.

A word is nothing if not bound by a context. Developers have already developed part of this context. Design pattern names are an example of words defined within that context. Sanitizing input is just another.


Put yourself in the shoes of an inexperienced programmer building their first website. You've been advised to sanitize your inputs with the example of Bobby Tables.

You know the plain English meaning of "Sanitize". Clearly, you need to remove those single quote characters as they are unsanitary?


Put yourself in the shoes of an inexperienced mathematician proving their first theorem. You've been advised to take care with infinity, with the example of two parallel lines crossing at infinity.

Problem: your theorem deals with discrete, countable infinity...

As a side note, the English meaning of "sanitize" is "make clean and hygienic", nothing more. It says nothing about "removing". Other definitions are extensions based on CONTEXT, once again.


... which is exactly what you should not do. Here, let me post this:

"If you want to create a horizontal line in HTML, you write <hr>"

See that? There is nothing "unclean" about it, hence you should not "clean" it. You just have to encode it if you output it embedded in HTML. That's why calling it "sanitizing" is misleading.
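In Python terms it's a plain conversion, nothing more (a sketch; html.escape standing in for whatever your stack provides):

    import html

    post = "If you want to create a horizontal line in HTML, you write <hr>"
    print(html.escape(post))
    # If you want to create a horizontal line in HTML, you write &lt;hr&gt;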


Again, wrong.

Encoding without proper context means "convert into a coded form". Hmm, that's not exactly what we want. So, let's add the computing context; now we have, as an example, the ability to encode a WAVE file into an MP3. But wait, we lost information here! Bummer...

Sanitization in the context of computing does not specifically mean that you have to "encode", or better, "transcode". It means that you have to take appropriate measures so that your input DATA cannot be interpreted as CODE by the receiver. Bonus points if the measure you choose is lossless in terms of the information carried by your data.


Well, yeah, "transcode" might be better, but then again there isn't really any hard difference between "encode" and "transcode", or possibly "encode" is just useless because it can not ever happen without an associated decoding of the information source?

But no, in a way, you are getting it all backwards, or at least a bit confusing.

This is how you should construct a system that processes user input:

First, the input format should be defined such that it can only describe things that make sense within the given context; in particular, it should usually not be possible to represent instructions for programming language interpreters in it.

Second, whenever you have to represent user input in some context, you have to encode (well, transcode) it into the format of that context. This transcoding generally should only change representation and not change the meaning of the converted information.

This automatically implies that you cannot "inject code". There isn't really anything magic about "code", and that, I think, is a large part of the confusion around "sanitizing input". The input cannot represent code and the conversion does not change the meaning, so the transcoding obviously cannot cause code to appear either. Thus you are safe - and not only are you safe, but your system also works as it should otherwise, which it potentially does not if you start "removing dangerous characters".

That is why you should not "sanitize", but only validate and encode/transcode/convert. Which you need to do anyway for your system to work properly. Lack of injection vulnerabilities will result automatically.


If I am an inexperienced programmer building their first website and I refuse to even google what sanitizing inputs is, the last thing I need is some headline telling me "NEVER sanitize [my] inputs." Presumably, I won't read that either.


English is not my primary language and I don't know the exact etymology of the word 'sanitize', but to me it sounds more like you have to make the input 'sane' or acceptable. It doesn't imply removing anything, but rather escaping problematic characters, in this case the quotes.


Which is exactly where the confusion is. The input is perfectly sane; it just isn't SQL or HTML. It is perfectly sane plain text, which can be converted into perfectly sane HTML or perfectly sane SQL, but none of those is in any way "more sane" - each is just the right format for a given use. If you were to put the plain text into a plain-text email body, for example, you would not have to do any conversion at all.


This article almost gets it right, then screws it up with the HTML example.

Both SQLi and XSS have the same cause: concatenating strings when you are working with active code of some sort.

They both have the same solution: you need to know the escaping rules for the active code you are assembling.

You shouldn't be solving XSS by stripping tags (that's a great way to build a discussion forum where no-one can talk about how to use HTML) - you should be escaping user input before assembling it in to HTML.

To protect against dumb mistakes (because it's really easy to screw up just once and have a huge security hole), you should use abstractions that do this for you. If you're working with Django, the ORM will do this for SQLi and auto-escaping in the template language will do this for XSS (watch out for variables you are outputting in a script-tag context, though).

Escaping, not sanitizing, should be the message.
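For illustration only, here's a Jinja2 sketch of what template auto-escaping buys you (Django's template language behaves similarly):

    from jinja2 import Environment

    env = Environment(autoescape=True)
    template = env.from_string("<p>{{ comment }}</p>")

    print(template.render(comment='<script>alert(1)</script>'))
    # <p>&lt;script&gt;alert(1)&lt;/script&gt;</p>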


Hate to disagree with you, but you can have plenty of flexibility with element and attribute whitelists without abandoning sanitizing. Sanitize as much of your inputs as you are comfortable with and escape the outputs.


Escaping is a form of sanitizing.


I've always seen "sanitization" as more of an output-encoding problem.

People love to consider sanitizing the inputs, but how you do so doesn't depend on the inputs; it depends on how each one is used - more or less the output of your program.

Rather than trying to think of all the ways the inputs to your program could be abused, I find that it is safer to start where the output occurs - database calls, system calls, etc. The most commonly used of these (database calls, shell commands, etc.) tend to have a variety of encoding capabilities to ensure that when you want to stick a string in a particular place, it does exactly that regardless of whether the string came from user input or elsewhere. For example, bind parameters for databases, or proper escaping functions.

If you think about it as sanitizing input, you tend to misplace your attention to detail and only consider the entry point of your application. A single input is often used to do multiple things throughout a program, so you cannot properly handle sanitization at input time.

The real push should be for proper output encoding, not input sanitization.
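A rough sketch of encoding the same value for different sinks, using only the Python standard library (illustrative; SQL should go through bind parameters rather than string building):

    import html
    import shlex

    value = "Robert's <b>signature</b>"    # one input, several output contexts

    for_html = html.escape(value)          # Robert&#x27;s &lt;b&gt;signature&lt;/b&gt;
    for_shell = shlex.quote(value)         # quoted so it is safe to splice into a shell command
    # For SQL: pass `value` as a bind parameter; there is no "encode once for everything".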


The purpose of sanitizing input is not to prevent security vulnerabilities. It is to make sure the values taken by your program are valid. If you accept a number range, and the user inputs a word, it's invalid input for your parameter and your program will crash. Input sanitizing validates that the input is correct for your use. It indirectly improves security, but is not itself a practice of making an app more secure.


The term "sanitizing" is not used to reference this, as commented on, what you are describing is "validating" the user input. That should, of course, happen. Many validations will result in only accepting input that happens to be safe for many uses - i.e., if it's a valid number between 1-100 you could of course send it to an integer field in a database without doing any special encoding, but I wouldn't rely on my input validation doing this in my model layer.

Encoding a "safe" value doesn't make things any less safe. Failure to encode it, however, leaves potential holes in your application. Something may bypass input validation and be given to the database as an unsafe, unvalidated value. Usage of the value may change (new functionality using it differently, changed storage in database, etc) and in the new usage the value may not be safe.

Input validation is obviously something you want to do, but it should never be relied upon for protecting from injection attacks.


You actually said it, which is funny, but the right word for this is "validating".

Here's the chain:

1. Get raw input.

2. Validate it (number, not number, in range, not in range?)

3. Optionally format it to a canonical form (e.g. trim whitespace, etc.)

... later....

4. Encode it for where you want to use it (SQL, HTML etc.).

Sometimes steps 2 and 3 are done in the opposite order, or as an atomic single operation, but point is, we have perfectly reasonable words for all that: validating, formatting, encoding.
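A minimal sketch of steps 1-3 for a single numeric field (names and ranges are made up; step 4, encoding, happens later at each output):

    def parse_age(raw: str) -> int:
        """Validate raw input and bring it into canonical form."""
        cleaned = raw.strip()                        # format: trim whitespace
        if not cleaned.isdigit():                    # validate: must be a number
            raise ValueError("age must be a whole number")
        age = int(cleaned)
        if not 0 < age < 130:                        # validate: must be in range
            raise ValueError("age out of range")
        return age

    parse_age(" 42 ")    # -> 42
    # parse_age("42; DROP TABLE users") raises ValueError; the caller reports it to the user.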


This is why I just call it encoding and decoding. Proper words, and assume context (encoding for what... decoding from what).


> Perhaps this is why some Irish people prefer to spell their name using the letter Ó. After years of having their name mangled by naive software developers, they made a new letter.

Stopped reading here as I assumed the rest of the article was satirical


This is stupid and I don't see anyone quite hitting the nail on the head as to why.

People normally lump web vulnerabilities together, XSS and SQLi especially. To prevent XSS you have to sanitize. To prevent SQLi you use parameterized queries.

To prevent stored XSS you sanitize what you put in the database. So really... you still need to sanitize.

I've also seen people make arguments about inexperienced web programmers and how this advice can cause them to write bad code. I think the argument is bad because so many resources exist to help them. There is real code on Stack Overflow, W3Schools, OWASP, and other blogs that can be copied and pasted into their projects.


No, you don't sanitize what you put in your database; you validate what you put in your database, and convert into the output format when using data from the database. Sanitizing is always(!) wrong.


You mean that you put it unsanitized (for HTML) into the database, and sanitize only when converting to HTML... Well, I completely agree, but how can you then claim that sanitizing is always wrong?


I mean "sanitize" as in "clean up" (as in "remove 'special characters'"). If you use "sanitize" to mean "encode as" (as in "replace '&' with '&amp;'"), then there is nothing wrong with that, I would just suggest that you don't call that "sanitize", because that is highly confusing, if you look in the dictionary what that word normally means.

Assume a user uploads a TIFF file to your web application. Browsers don't understand TIFF. So, in order to display it on a web page, you convert it into a PNG. You wouldn't call that "sanitizing it for PNG" either, would you? For the same reason, you shouldn't call it "sanitizing" when you convert plain text to HTML.


If you never sanitize, how do you prevent XSS...

For the average web dev my approach is plenty good enough. It's funny because your approach still requires sanitizing.


Using validation and encoding. You check input for conformance to your data model and reject anything that fails the validation (you tell the user about the error and ask them to correct their mistake), and then you convert from your data model to the output format that you are generating.

So, for example, you could have a data model of "plain text field", in which case you check that the input is a valid character string (so no undefined codepoints present and, for example, no syntax errors in the UTF-8 encoding if that is what you are using). Thus you can be sure that you have only character strings in that column of your database. Then, if you want to output one of those strings to be displayed within an HTML page, you convert it from plain text to HTML (replacing "<" with "&lt;", "&" with "&amp;", and so on). That way there is no XSS possible, and also, any input the user makes is displayed back exactly as they entered it.


Depends on what you need. Depends on the input field. Another example of why this is stupid.


by properly escaping the output, not sanitizing the input


And... how is escaping different from sanitizing? I would expect escaping to inherit from sanitizing.

The idea is the same. You take user input and put it in a safe format. The programmer's needs may be different.


The problem is in the confusion that leads people to think in terms of a "safe format". There is no such thing. "&amp;" is not a "safe form" of "&", but rather the HTML (among others) _encoding_ of what in plain text is represented by "&". If you need to generate output that causes an "&" to be displayed, you have to encode it according to the rules of the target format, not in some general magic "safe format". If you are generating a plain-text mail, you have to encode it as "&"; encoding it as "&amp;" is just wrong, because it leads to the user seeing "&amp;" instead of "&". Only if you are generating HTML do you have to encode it as "&amp;" in order for an "&" to be displayed. It's all about encoding things so that after decoding you get back the original input, not about "making things safe" - it's just a side effect that encoding everything so that it displays as a dumb series of characters tends not to cause any security problems.


A somewhat related term, I really like "mogrify": http://initd.org/psycopg/docs/cursor.html#cursor.mogrify


I've been wondering where this word comes from. The only other occurrence I know of is the ImageMagick command of the same name. It doesn't seem to be a real English word. What does it evoke to a native English speaker? (ESL here)


It's short for transmogrify: http://www.merriam-webster.com/dictionary/transmogrify

Calvin & Hobbes may have played a part in popularizing the term? http://calvinandhobbes.wikia.com/wiki/Transmogrifier


From transmogrify, presumably. And despite probably being most commonly associated with the Calvin and Hobbes transmogrifier these days, it _is_ a real English word with a few hundred years' worth of use.


> "Perhaps this is why some Irish people prefer to spell their name using the letter Ó. After years of having their name mangled by naive software developers, they made a new letter."

I hope this is satire. The Irish didn't "make up" the letter Ó; it was the standard historical form, but it was converted into O' when the names were anglicized.

Frankly, his advice about sanitizers seems equally suspect. I've processed a lot of complex scientific abstracts using html5lib and Bleach without any mangling like he describes. He must be using very naive sanitizers.


The overall point is very true indeed, though I think it's not made particularly clear what the actual problem with sanitizing input is.

The problem is that you are silently changing information, and that's an absolute no-go for reliable data processing. The cause is that people think of, say, HTML as "some kind of text/strings".

HTML is a serialization of a tree; similarly, SQL is a serialization of a syntax tree - and if you want to add plain-text user input to such a serialized tree, you have to _convert_ it from, say, "plain text" to "HTML character data". You have to think of them as two different data types, and so when you want to use a value of one type as the other type, you don't "sanitize" it (even calling it "escaping" is confusing) - you _convert_ it. And if it happens that some input cannot be represented in the target type, then you have to _validate_ the input and _reject_ broken input.


Not sure why this is being downvoted. This "some kind of text/strings" approach is why we have "escaping" functions which turn strings into strings, instead of conversion functions from, say, "plain text" to "HTML character data".

This prevents our computers from helping us, even if we're using an ivory tower type system with a whizz-bang IDE, since everything's just "String" so the compiler says OK.

Here's an example of the alternative http://blog.moertel.com/posts/2006-10-18-a-type-based-soluti... (remember that most of the code there is building the libraries; using them is simple and terse).
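A toy Python rendering of the same idea (hypothetical types, not taken from the linked post):

    from typing import NewType
    import html

    PlainText = NewType("PlainText", str)
    Html = NewType("Html", str)

    def plain_to_html(text: PlainText) -> Html:
        return Html(html.escape(text))    # the one sanctioned conversion

    def render_comment(body: Html) -> Html:
        return Html('<div class="comment">' + body + "</div>")

    render_comment(plain_to_html(PlainText("I <3 HN")))    # fine
    # render_comment("I <3 HN") - a type checker flags this: str is not Html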


"Convert" is ambiguous. When you read the text "<head>" from a template file, you probably want to represent it as the HTML "<head>", but when you read it from the database, you may, or may not want to represent it as "&lt;head&gt;".

People started using the word "sanitize" exactly because it conveys the information that "you want to treat it differently, depending on where it comes from". We also use the words "dirty" (sometimes "tainted") and "clean", preserving their usual relationship to "sanitize".

Now somebody wants to throw away a very concise and expressive jargon just because some people are giving bad advice on the Internet?


This has nothing to do with whether you read it from "a (template) file" or "the database"; it's only about what _format_ it is in. If the template contains HTML, then the conversion to HTML (for the HTML part) is of course the identity function, and the same goes if the database contains HTML - whereas if you want to use the same thing in a plain-text email, you will have to convert to plain text. If, on the other hand, the template file or the database contains plain text, the reverse applies: conversion to plain text is the identity function, and conversion to HTML is the usual replacement with entity references.

That you are using "dirty/tainted" and "clean" only shows how deep the confusion is. There is some justification to use those terms when talking about before and after validation, but other than that it's probably an indication of confusion (which also seems to be the common usage).

Take, for example, a general plain text field for optional free-form text. There is essentially nothing that could be validated (other than maybe that it's a valid UTF-8 string). Now, you want to generate a plain text email using the user input - how would you "sanitize" it?

There is nothing "better"/"cleaner" about any particular encoding, be it plain text, HTML, SQL, or any other, they are simply different encodings, and you have to always use the correct one, not the "best one"/"cleanest one", and you have to always know what format the data that you are processing is in so that you can convert correctly.

This jargon is not at all concise, actually (some people mean "remove 'strange' stuff/clean it", others mean "escape it", ...), and it makes you think in ways that obscure the actual problem that you are solving: Conversion between data types/data representations.


Surely false: as another poster commented, you want to sanitise inputs - for example, user signatures - to remove JavaScript and other nasties. Sanitising inputs isn't just about protecting against SQL injections.

What the author actually means is that removing apostrophes to prevent SQL injection can affect your data integrity, so parametrise your queries.

Alternatively, replacing single apostrophes with doubled apostrophes in your queries also works, but parametrising queries is a much better practice to get into.


No, TFA is right. If the user wants to post <script>alert("I am a hacker.")</script>, so be it. Display it literally. You do have to take care to escape it when you are rendering your HTML. But guess what? You have to anyway, since <script> is not the only evil tag out there. XSS can be performed in a number of ways and you are not going to catch them all by removing stuff from user input.


No. What you are describing is a way of doing things that inherently leads to security vulnerabilities, because it depends on someone else remembering to include something in their code, and people will always forget to do the right thing at least some of the time. Developers should never allow input into the system that doesn't match their expectations about what is valid for that value/field/etc. If you have a user text input that only requires alphanumeric characters, space, period, and comma, then strip out any character which is not one of those things. That field is now no longer a possible source of XSS.


The problem is that there are plenty of fields where your attempts at filtering will break user expectations horribly if you filter the data even remotely strictly enough to ensure security.

Such as, say, comment fields. It'd be terribly restrictive for your users if they couldn't write about <script> tags on a technical forum without munging them.

And you're still not safe. All the characters needed for an SQL injection attack, for example, commonly occur in normal English usage. All the characters needed for XSS commonly occur too, so you'd need more restrictive filtering.

And have fun when a bug that causes your filter to be more restrictive than it should now means data is unretrievable because you've just stored the sanitised output of your buggy filter.

Once you've dealt with that, you're still facing the issue of changing filtering requirements: What is safe for HTML may not be safe for your CSV export. What is safe for your PDF generation may not be safe for your HTML generation, and vice versa. Suddenly you're asked to pass data via an API, with different expectations of what a "safe" value contains. Boom.

In other words, if you believe that what is in your database is safe from causing security problems, you've lost. You need to treat every piece of data that may possibly contain user input as a potential cause of problems whenever you output it or pass it on anywhere, whether or not you've (attempted to) validate and restrict the input.

A typical example I used to have to deal with: Mail systems. HTML that is entirely safe when downloaded and rendered by a mail client that contains the HTML in a document that is just for that one e-mail, can leak data all over the place and compromise the users account if left unfiltered when rendered on the web server. You can't insert it pre-filtered into the database without inserting the raw content too because the user may want to download it.

And because the only reasonably safe filtering method is white-listing tags and CSS due to evolving standards, you will regularly have to revise the filters and add functionality and people will be very annoyed if their e-mails still don't render correctly after you've fixed the bugs (and if you have to tighten the filters again, you don't want to have to re-filter all the data).


Because we can talk about the <script> tag on HN, the string "<script>" surely is a valid and expected input.

> If you have a user text input that only requires alphanumeric characters, space, period, and comma, then strip out any character which is not one of those things. That field is now no longer a possible source of XSS.

Except when somebody puts such a "safe" string in an unquoted HTML attribute... Seriously, thinking of data as "safe" (safe to be carelessly mishandled...) is a fragile approach.

> it depends on someone else remembering to include something in their code

Get a template engine that escapes everything everywhere by default, so you won't need to remember to escape (or "sanitize"!) each thing.


But instead you are randomly changing information, which is a terrible idea.

If you get invalid input, you have to reject it, not just silently make it fit by changing the input (the only exception is when you can be sure that changing the data will not under any circumstances change its meaning).


1. Paste <script>alert("I am a hacker.")</script> in TFA's comment box

2. Select Anonymous from the auth options.

3. Click Submit.

4. Laugh your heart out.


Your comment highlights exactly why the article makes a compelling point. I mean, you make several suggestions before you get to one thing you need to do on input:

> but parametrising queries is a much better practice to get into.


Several suggestions?


Read on, there's a section titled "Isn’t sanitization still needed with HTML?".


Well in HTML you can use a sandboxed iframe (or <webview> in technologies that have it), but it's not cheap.


I just remembered visiting a website that used iframes, with script tags disabled, for users' signatures. It was a pretty interesting approach.


I just hope they used a proper lib like Purifier to do it, or someone's going to have fun with `onmouseover`.


I think that when it's sandboxed with the proper attributes, you can't do anything apart from trashing the content of the frame.


I had no clue that was so widely supported. No IE8/9 but virtually everything else. Neato! http://caniuse.com/#feat=iframe-sandbox


I can't post a comment containing <script> in this comment form, because it's "disallowed", instead of just escaping it as plain text.

The thick, thick irony of a guy who can't even follow his own advice.


He is perfectly following his own advice. Apparently, his form field takes HTML syntax with a subset of HTML tags, and your input does not conform to that. So, instead of silently altering what you wrote, it tells you about the validation failure and asks you to correct your input. The input field takes HTML, so you have to write "&lt;script>" (I suppose, haven't tested it) in order to display "<script>" - that is perfectly consistent.


No form should accept just "HTML" if you don't want just any "HTML" in your form.

I was actively trying to talk about his script example and instead I had to second-guess his parser to get past the validator (I eventually resigned and replaced < and > with [ and ]).

If you want to support some tags, have your parser accept an HTML-like DSL with those tags supported. Don't disallow perfectly good input.


I don't really understand what point you are trying to make, but in any case, he has a particular input format for that form field, your input did not conform to it, so it was rejected, nothing particularly surprising or wrong there.

Now, I haven't tried it, but I suppose his form field expects HTML syntax? Have you tried entering your text in HTML syntax? Was that rejected?


I know. I'm awful.

:)



