If you haven’t already upgraded your data sanitization suite, it’s definitely time to get started, before it’s too late. Dirty, unclean, unsanitary data is creeping into your application layer, leaving unwanted residue behind. Users with unbathed data are everywhere, and they’re going to stain your clean databases and persistence storage layers!
If you haven’t already realized I’m being facetious, and truly believe there is such a suite, you should definitely keep reading. If you have, you should keep reading anyway. You might actually learn a few things.
There’s no doubt about it. Users can supply your application with data that can break it, whether or not there is any malicious intent. So we certainly don’t want to blindly insert data from the user into our application layer and allow it to inadvertently seep through our code, as in the case of Bobby Tables. However, the problem isn’t that the user’s data is dirty, but that your code has been engineered in such a way that it has developed a data germophobia. This elusive side effect only exacerbates your problems in dealing with user-supplied data.
Data Belongs To The User
First, you have to consider that this data isn’t yours to begin with. It’s the user’s data and the user should have control over their own data. However, the code and the application are yours and you, likewise, should retain full control over your code and your application layers. The heart of the problem lies in the places where those two things meet, where the line between the end of your code and the start of the user’s data becomes quite blurry. In fact, the two may well be indistinguishable at times. So it’s as easy for you to mangle the user’s data and break it as it is for the user to break your application, intentionally or unintentionally. In both cases, the intent of the user and of the software engineer is unimportant in solving this problem. What really matters is that the solution allows both parties to retain their respective rights without intruding on the other’s property.
Code Belongs To The Programmer
As a programmer, and especially as a web developer, you’re always taught that user-supplied data is never to be trusted. Sometimes this loosely translates into “the user is an idiot”; however, that’s just a misconception. The collective user base of a web-based application or Software as a Service represents more valuable knowledge than any two programmers have, combined. This is usually because the engineer works to solve a problem and the user is just looking to get things done without a big hassle. The two have different objectives, but one also has a broader picture than the other.
The notion tends to be that if you soak the user’s data in enough bleach, hose it down with enough Lysol, and dust and vacuum around it regularly, it shouldn’t be a problem. Speaking purely by analogy, of course. However, not all data is created equal, and not all sanitary products are the same. Have you ever accidentally thrown a colored shirt in the wash along with your whites and added bleach? Just as that’s going to ruin all of your whites, the same mistake will likely corrupt much of your users’ data (and may well break your application) as well.
So the real issue here is not “this data is a problem”, but rather that “this code and this data don’t seem to get along”. The most obvious solution is then to separate the two and create a clear boundary between them so that one does not interfere with the integrity of the other.
Examples Of Poor Data Sanitization Practices
One common beginner mistake is to think that stripping things from the user’s data will solve the problem that the data poses for your code. That’s a problem though, because it ultimately means you have removed things the user (for all we know) had every intention of keeping. To me this just means you’ve introduced a new problem (breaking the integrity of the user’s data).
An example of this in PHP is where you want to output user data in your HTML, but you don’t want to allow the user to inject HTML into your output (breaking your application). It is a one-sided view to think that stripping every less-than (<) and greater-than (>) character from the user’s data preserves the integrity of your code. What about the integrity of the data presented by the user? Why would you assume that the user intended to inject HTML or perform some malicious XSS attack just because their data contained characters that your code cannot accept? Instead, we should find a way to retain the integrity of both our code and the user’s data without introducing a vulnerability or crippling the application.
Code
$_POST["input"] = "X < Y && Y > Z";
$output = str_replace(array("<", ">"), "", $_POST["input"]);
echo "<p>The user said: $output</p>";
Output
<p>The user said: X  Y && Y  Z</p>
The above example in PHP is a horrible idea. This is not something you ever want to do. The user’s data has now lost all its integrity: in this example, the user simply wanted to present data that states “X < Y && Y > Z”, but your application has rendered their data useless. Think about what the impact of your code would be on your user base if this were meant to be a post on a public math forum.
Let’s consider another example…
$_POST["input"] = "
<html>
...
</html>
";
$output = strip_tags($_POST["input"]);
echo "
<h1>User Data</h1>
<div>$output</div>
";
The output becomes tainted data…
<h1>User Data</h1>
<div>
...</div>
Imagine this were a public forum for web developers and someone were attempting to present some HTML code as an example in answer to a question. Certainly the intent here is to prevent HTML/XSS injection, but that shouldn’t result in breaking the user’s data either. So let’s present a real solution to the problem that doesn’t pose yet another problem.
Escaping Vs. Stripping
Code
$_POST["input"] = "X < Y && Y > Z";
$output = htmlspecialchars($_POST["input"]);
echo "<p>The user said: $output</p>";
Output
<p>The user said: X &lt; Y &amp;&amp; Y &gt; Z</p>
This allows us to encode the user’s data to HTML entities that the browser will not confuse for markup. It means the user’s data appears in the browser just as they typed it and there will be no unwanted intrusion of that data into your HTML. Great, now your code doesn’t break and the user’s data retains its integrity. Imagine that, no vulnerability to your application layer and no data corruption! It’s a WIN-WIN situation. The point remains that we do not intrude onto the user’s property and the user does not intrude upon ours. Here both the application layer and the data layer can co-exist in harmony.
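The same approach also covers the earlier web-developer forum example. What follows is only a minimal sketch, assuming UTF-8 input; the helper name `e()` is hypothetical, and the explicit flags are just one reasonable choice rather than a requirement.

```php
<?php
// Hypothetical helper: escape a string for use in HTML text or attribute values.
function e(string $value): string
{
    // ENT_QUOTES escapes both quote styles; ENT_SUBSTITUTE replaces invalid
    // byte sequences instead of returning an empty string.
    return htmlspecialchars($value, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
}

$input = "<html>\n...\n</html>";   // stands in for $_POST["input"]
echo "<div>" . e($input) . "</div>";
```

The browser then displays the markup as text, so the reader sees the same HTML snippet the user typed.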
What you don’t want to do is escape the user’s data and then store it in its escaped form. For example, don’t use htmlspecialchars or htmlentities before storing user data in your database. These are meant as transport mechanisms for the document character set, not to be confused with fortifying your SQL against malicious injection. Storing escaped data just means what you have in your database isn’t what the user supplied you with. You would need to take additional measures to unescape the data back to its original form whenever you need to perform any actual work on it, such as searching or transferring.
Instead, all you really need to do in order to clearly separate the user data from your application code is make sure you escape it properly (in this case for HTML) at the point where it gets mixed in with your code. The most common approach to avoiding the blurry line we mentioned earlier is to use a templating system. This is a place where your HTML can live cozily and accept values to be inserted into the template at will, with the rendering of the view handled through an abstraction. The data you hand it from your model is then transported in the proper encoding while the data itself remains unaffected. It’s also not just user data you’re worried about here, but any of your own application data that might break that HTML as well. So this abstraction is a fitting solution for a common problem.
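To make the templating idea concrete, here is a deliberately tiny sketch. The `render()` function and the `{{name}}` placeholder syntax are made up for illustration and are not taken from any particular templating library; a real one would do far more.

```php
<?php
// Hypothetical mini-template: {{name}} placeholders are filled with escaped values.
function render(string $template, array $values): string
{
    return preg_replace_callback(
        '/\{\{(\w+)\}\}/',
        function (array $m) use ($values): string {
            // Escape at output time only; the stored value is left untouched.
            $raw = (string) ($values[$m[1]] ?? '');
            return htmlspecialchars($raw, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
        },
        $template
    );
}

$stored = 'X < Y && Y > Z';   // exactly what the user submitted
$html = render('<p>The user said: {{comment}}</p>', ['comment' => $stored]);
echo $html;
```

The HTML lives in the template, the data stays in its original form, and the escaping happens in exactly one place where the two meet.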
The SQL Injection Problem
Of course this brings us to the infamous SQL injection attacks that a user can pose on your application by presenting you with data that might break your SQL code if you were to combine that data with your SQL code in the same way you tried to combine the user’s data with your HTML code. The obvious solution has always been to escape that data before allowing your code and the user’s data to mix and mingle together.
// Legacy ext/mysql API (removed in PHP 7.0); shown only to illustrate manual escaping.
$input = mysql_real_escape_string($_POST["userdata"]);
$sql = "INSERT INTO `table` VALUES('$input')";
mysql_query($sql);
It does the job, but the problem of not being able to clearly distinguish your code from the user’s data remains. Escaping user data for SQL isn’t as easily solved with a templating system as it is for HTML. Templating SQL code that stores, retrieves, and operates on the very thing your application code depends on is quite challenging. Manual escaping was prone to user-space error for many years, until DBMS developers came up with better options for abstracting the process in much the same way you attempt to with an HTML templating system.
Prepared Statements With Parameters
Using parameters in a prepared statement means much the same to your DBMS as an HTML templating system means to your application’s business logic. The purpose of the template is to serve as an abstract idea of what you want rendered in the view. Your application code might want to do various things with the data before it is presented to the user for output. So separating what ends up on the screen from what’s going on behind the scenes in your code is important. Equally important is the prepared statement, which makes it possible to bind parameters to values that can never be confused with code.
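$pdo = new PDO("sqlite::memory:");
// Example schema, assumed here so the statement below can actually be prepared.
$pdo->exec("CREATE TABLE userlist (userid INTEGER PRIMARY KEY, username TEXT, userage INTEGER)");
$sql = "SELECT `username`,`userage` FROM userlist WHERE `userid` = ?";
$stmt = $pdo->prepare($sql);
// bindParam takes the position, the variable and the type as separate arguments.
$stmt->bindParam(1, $_POST["uid"], PDO::PARAM_INT);
$stmt->execute();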
Here the separation between your code and the user’s supplied data is quite clear to your database. The statement is prepared separately from the data, the parameter binds a value into the statement, and the SQL code and the data are then sent along separate paths. We can’t confuse the user’s input for the SQL code or vice versa. It’s the same goal you hope to achieve when you want that data placed in your HTML: you don’t want it to affect the HTML code, and you don’t want to change what the user handed you.
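As a rough illustration of what that separation buys you, the sketch below stores a Bobby Tables style value through a bound parameter. It assumes the pdo_sqlite extension is available; the `students` table and its columns are invented for the example.

```php
<?php
// Assumes the pdo_sqlite extension; the table and column names are made up.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT)');

$name = "Robert'); DROP TABLE students;--";   // stands in for $_POST["name"]

// The statement is fixed; the value travels separately as a bound parameter.
$insert = $pdo->prepare('INSERT INTO students (name) VALUES (:name)');
$insert->execute([':name' => $name]);

// The table still exists and holds the text exactly as submitted.
$stored = $pdo->query('SELECT name FROM students')->fetchColumn();
var_dump($stored === $name);
```

Nothing was stripped or escaped by hand, the table is intact, and the database holds precisely what the user supplied.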
Validation vs. Sanitization
This is another prevalent assumption: that if I only take from the user what I need, I can keep my application unaffected and working smoothly. The information a user supplies on the web can be vast. We input everything from our names, addresses, zip codes, credit card numbers, and phone numbers to entire documents. There’s a great deal of potential for someone with malicious intent to try to sneak bad data past our applications to break them. However, it’s also important to keep in mind that your application is all about the user. If all your code does to make things safer is make the user more and more annoyed with the process of supplying their input or uploading their information, you are only hurting the very people you wrote the code for in the first place.
/* If you're doing this you are causing your users a lot of pain! */
$name = preg_replace("/[^a-z ]+/i", "", $_POST['name']);
Here are some reasons why this is wrong.
- Why can’t my name be Robert Jr. or O’Reilly?
- Or how about a hyphenated name like Lee-ann?
- Why would you assume my name can only be represented by the letters A-Z?
- Have you never met a Jérôme or an Aimé or a Noël before?
- If you know Afrikaans you should know some vowel sounds in Afrikaans are represented by an apostrophe.
- You’re mangling people’s names! They’re not yours to mangle…
- Mostly, YOU’RE PISSING PEOPLE OFF! STOP!
If the data the user has presented you with is unacceptable for your application, then what you should be doing instead is simply validating that the data is acceptable to you and then using that validation result to determine that either (A) the data is acceptable and we can proceed, or (B) the data is unacceptable and we must reject it entirely and notify the user to retry according to our requirements. But under no circumstance should you just change the user’s data without notifying them about it and continue on as if what you have is what they gave you.
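The accept-or-reject flow described above might look something like the following sketch. The function name `validateZip()` and the US-style ZIP rule are assumptions chosen purely for illustration; your own requirements will differ.

```php
<?php
// Hypothetical rule: a US-style ZIP code, 5 digits with an optional +4 suffix.
function validateZip(string $zip): ?string
{
    if (preg_match('/^\d{5}(-\d{4})?\z/', $zip) === 1) {
        return null;   // (A) acceptable: no error
    }
    // (B) unacceptable: explain the requirement, change nothing
    return 'Please enter a ZIP code such as 12345 or 12345-6789.';
}

$input = '1234A';   // stands in for $_POST["zip"]
$error = validateZip($input);

if ($error === null) {
    echo "Accepted: $input";
} else {
    echo "Rejected: $error";
}
```

Note that the input is never modified: it is either accepted as given or handed back to the user with an explanation.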
However, not everything the user submits necessarily requires validation. As developers, we sometimes like to assume that we know everything and that everything must be validated by us before it should be allowed. This is simply not true. We don’t know everything and we certainly shouldn’t have to be the overseeing party of what is or is not allowed as someone’s first and last name. Or characters the user is allowed to publish to a public forum. This stuff does not break our application code.
That said, there are places where validation is a requirement for the application to function properly. For example, we may need to verify that a user has supplied a valid zip code or postal address on an order form where they are placing an order for us to ship. If the user enters invalid information there, the order cannot ship, and that presents a problem for the intended users of the application. We also don’t want users entering one or two letters for their password. This makes the account insecure and opens our application up to attacks. So we may want to validate that the user has supplied a password meeting a specific requirement, say between 10 and 20 characters (or perhaps just a lower bound to prevent people from easily guessing a password). We might also want to ensure the user includes at least one uppercase character or one special character to increase password strength and make brute-force attacks harder. However, if the user’s data does not meet these requirements, we should be rejecting the data entirely and informing the user of the problem so that they may retry. You wouldn’t just change the user’s password to have it meet your needs and simply carry on. That wouldn’t make any sense! How is the user going to know you changed their password? So you also shouldn’t change the user’s data anywhere else unless you’ve made the user completely aware of what you’re doing to their data, so that they have an option to decline or, at the very least, may choose not to supply their data under such conditions.
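// Anchored so the whole password must be 10 to 20 letters or digits.
$regex1 = '/^[a-z0-9]{10,20}\z/i';
// At least one uppercase letter somewhere in the password.
$regex2 = '/[A-Z]/';
if (preg_match($regex1, $_POST['password']) && preg_match($regex2, $_POST['password']))
{
    /* Password is acceptable */
} else {
    /*
        Password is unacceptable
        Reject and inform the user
    */
}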
Also, don’t go out of your way to make it difficult for the user to meet your requirements. For example, have you ever seen an online order form where you’re asked to enter information such as your phone number or credit card number and told not to use dashes or spaces? Why should this be a burden placed on the user? Here you aren’t really validating anything but the user’s ability to follow instructions. Have you ever heard of a regular expression? Have you ever heard of client-side code that improves the user experience? For example, if you want the data sent to your application in a specific format, why not make that a part of your front end? You can use separate input fields to make it clearer to the user. You can also validate the formatting with regular expressions on the back end. But don’t constrain the user experience when you have other options that can still ensure your data is validated and keep the user happy at the same time.
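One way to lift that burden on the back end is to accept the separators people naturally type and derive the digits-only form yourself. This is only a sketch: `parseCardNumber()` is a hypothetical helper, and the 13 to 19 digit range is an assumed rule, not a complete card check.

```php
<?php
// Hypothetical helper: accept digits grouped with spaces or dashes, reject anything else.
// Returns the digits-only form for processing, or null if the input is not acceptable.
function parseCardNumber(string $input): ?string
{
    $trimmed = trim($input);
    // 13 to 19 digits, each optionally preceded by one space or dash (assumed rule).
    if (preg_match('/^\d(?:[ -]?\d){12,18}\z/', $trimmed) !== 1) {
        return null;
    }
    return preg_replace('/[ -]/', '', $trimmed);
}

var_dump(parseCardNumber('4111 1111 1111 1111'));
var_dump(parseCardNumber('4111-1111-1111-1111'));
var_dump(parseCardNumber('not a number'));
```

The user types the number however it appears on their card, and anything that is not a plausible number is still rejected with a chance to retry rather than silently altered.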




