Sherif's Tech Blog

Just another guy on the Internet with a keyboard…

Data Sanitization Suite 2.0

If you haven’t already upgraded your data sanitization suite, it’s definitely time to get started, before it’s too late. Dirty, unclean, unsanitary data is creeping into your application layer, leaving unwanted residue behind. Users with unbathed data are everywhere and they’re going to stain your clean databases and persistence storage layers!

If you haven’t already realized I’m being facetious, and truly believe there is such a suite, you should definitely keep reading. If you have, you should keep reading anyway. You might actually learn a few things.

There’s no doubt about it. Users can supply your application with data that can break it, whether or not there is any malicious intent. So we certainly don’t want to blindly insert data from the user into our application layer and allow it to inadvertently seep through our code, as in the case of Bobby Tables. However, the problem isn’t that the user’s data is dirty, but that your code has been engineered in such a way that it has developed a data germophobia. This elusive side effect only exacerbates your problems in dealing with user-supplied data.

Data Belongs To The User

First, you have to consider that this data isn’t yours to begin with. It’s the user’s data and the user should have control over their own data. However, the code and the application are yours and you, likewise, should retain full control over your code and your application layers. The heart of the problem really lies in the places where those two things tend to meet and the line between where your code ends and the user’s data begins becomes quite blurry. In fact, they may, very well, be indistinguishable at times. So it’s easy to mangle the user’s data and break it just as it can be easy for the user to intentionally — or unintentionally — break your application. In both cases the intent of both the user and the software engineer is unimportant in solving this problem. What really matters is that the solution allows both parties to retain their respective rights in not intruding on the other’s property.

Code Belongs To The Programmer

As a programmer, and especially as a web developer, you’re always taught that user-supplied data is never to be trusted. Sometimes this loosely translates into “the user is an idiot”; however, that’s just a misconception. The collective user base of a web-based application or Software as a Service holds more valuable knowledge than any two programmers have, combined. This is usually because the engineer works to solve a problem and the user is just looking to get things done without a big hassle. The two have different objectives, but one also has a broader picture than the other.

The notion tends to be that if you soak the user’s data in enough bleach, hose it down with enough Lysol, and dust/vacuum around it regularly, it shouldn’t be a problem. Speaking purely by analogy, of course. However, not all data is created equal, and not all sanitary products are the same. Have you ever accidentally thrown a colored shirt in the wash along with your whites and added bleach? Just as that’s going to ruin all of your whites, the same mistake will likely corrupt the bulk of your users’ data (and might well break your application).

So the real end-goal here is not “this data is a problem“, but more so that “this code and this data don’t seem to get along“. The most obvious solution is to then separate the two and create a clear layer of segregation between them so that one does not interfere with the integrity of the other.

Examples Of Poor Data Sanitization Practices

One common beginner mistake is to think that stripping things from the user’s data might solve the problem that that data imposes on your code. That’s a problem though, because it ultimately means you have removed things the user (for all we know) had every intention of keeping. To me this just means you’ve introduced a new problem (breaking the integrity of the user’s data).

An example of this in PHP is where you want to output user data in your HTML, but you don’t want to allow the user to inject HTML into your output (breaking your application). Thinking that if you just strip all of the (less-than <), (greater than >) characters, from the user’s supplied data, this will keep your code integrity, is a biased view. What about the integrity of the data presented by the user? Why would you assume that the user had the intent of injecting HTML or performing some malicious XSS injection just because their data contained invalid characters that your code cannot accept? Instead, we should find a way to retain the integrity of both our code as well as the user’s data without posing any vulnerability or crippling of the application.

Code

$_POST["input"] = "X < Y && Y > Z";
$output = str_replace(array("<", ">"), "", $_POST["input"]);
echo "<p>The user said: $output</p>";

Output

<p>The user said: X  Y && Y  Z</p>

The above example in PHP is a horrible idea. This is not something you ever want to do. Now the user’s data has lost all its integrity since, in this example, the user simply wanted to present data that states “X < Y && Y > Z”, but your application has rendered their data useless. Imagine the impact of this code on your user base if this were a post on a public math forum.

Let’s consider another example…

$_POST["input"] = "
<html>
    ...
</html>
";
$output = strip_tags($_POST["input"]);
echo "
    <h1>User Data</h1>

    <div>$output</div>
";

The output becomes tainted data…

<h1>User Data</h1>

<div>
    ...</div>

Imagine if this were a public forum for web developers and someone were attempting to present some HTML code as an example in answer to some question. Certainly the intent here is to prevent HTML/XSS injection, but that shouldn’t result in breaking the user’s data either. So let’s present a real solution to the problem that doesn’t pose yet another problem.

Escaping Vs. Stripping

Code

$_POST["input"] = "X < Y && Y > Z";
$output = htmlspecialchars($_POST["input"]);
echo "<p>The user said: $output</p>";

Output

<p>The user said: X &lt; Y &amp;&amp; Y &gt; Z</p>

This allows us to encode the user’s data to HTML entities that the browser will not confuse for markup. It means the user’s data appears in the browser just as they typed it and there will be no unwanted intrusion of that data into your HTML. Great, now your code doesn’t break and the user’s data retains its integrity. Imagine that, no vulnerability to your application layer and no data corruption! It’s a WIN-WIN situation. The point remains that we do not intrude onto the user’s property and the user does not intrude upon ours. Here both the application layer and the data layer can co-exist in harmony.
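As a slightly more defensive sketch of the same idea: passing ENT_QUOTES and an explicit character set to htmlspecialchars also escapes single quotes, which matters if the value ever lands inside an HTML attribute. The helper name e() is made up for this illustration, and the literal string stands in for $_POST["input"].

```php
<?php
// Hypothetical helper (not from the original post): escape a value for HTML
// at the moment it is output, leaving the stored value untouched.
function e(string $value): string
{
    // ENT_QUOTES escapes both double and single quotes.
    // ENT_SUBSTITUTE replaces invalid byte sequences instead of returning "".
    return htmlspecialchars($value, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
}

$input = 'X < Y && Y > Z'; // stands in for $_POST["input"]

echo "<p>The user said: " . e($input) . "</p>\n";
```

The original string in $input is never modified; only the copy that goes into the page is encoded.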

What you don’t want to do is escape the user’s data and then store it in its escaped form. For example, don’t use htmlspecialchars or htmlentities before storing user data in your database. These are meant as transport mechanisms for the document character set, not to be confused with fortifying your SQL against malicious injection. Storing escaped data just means that what you have in your database isn’t what the user supplied you with. You would need to take additional measures to unescape the data back to its original form whenever you need to perform any actual work on the data, such as searching, transferring, etc…

Instead, all you really need to do in order to clearly separate the user data from your application code is make sure you escape it properly (in this case for HTML) before it gets mixed in with your code. The most common approach to avoiding the blurry line we mentioned earlier is to use a templating system. This would be a place where your HTML can live cozily and accept values to be inserted in the template at will, handling the rendering of this view through an abstract method. The data you hand it from your model would then be transported in the proper encoding and the data itself remains unaffected. It’s also not just user data you’re worried about here, but any of your own application data that might break that HTML as well. So this abstraction is a fitting solution for a common problem.
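To make the templating idea concrete, here is a deliberately tiny sketch, not a real templating system. The render() function and the {{name}} placeholder syntax are assumptions invented for this example; the point is only that escaping happens in one place, at the boundary where data enters the HTML.

```php
<?php
// Hypothetical mini-renderer: replaces {{name}} placeholders with escaped values.
function render(string $template, array $vars): string
{
    return preg_replace_callback(
        '/\{\{(\w+)\}\}/',
        function (array $m) use ($vars): string {
            // Unknown placeholders render as an empty string in this sketch.
            $value = (string) ($vars[$m[1]] ?? '');
            return htmlspecialchars($value, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
        },
        $template
    );
}

$template = "<h1>User Data</h1>\n<div>{{comment}}</div>\n";

// The user's HTML sample survives intact and is shown as text, not markup.
echo render($template, ['comment' => '<html>...</html>']);
```

The template holds the code, the array holds the data, and neither has to know how the other is written.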

The SQL Injection Problem

Of course this brings us to the infamous SQL injection attacks that a user can pose on your application by presenting you with data that might break your SQL code if you were to combine that data with your SQL code in the same way you tried to combine the user’s data with your HTML code. The obvious solution has always been to escape that data before allowing your code and the user’s data to mix and mingle together.

$input = mysql_real_escape_string($_POST["userdata"]);

$sql = "INSERT INTO `table` VALUES('$input')";

mysql_query($sql);

It does the job, but the problem of not being able to clearly distinguish your code from the user’s data remains. Here, escaping user data for SQL isn’t as easily solved with a templating system as it is in HTML. Templating SQL code that stores, retrieves, and operates on the very thing your application code depends on is quite challenging. The method of escaping has been prone to user-space error for many years, until DBMS developers came up with better options for abstracting the process, in much the same way you attempt to do with an HTML templating system.

Prepared Statements With Parameters

Using parameters in a prepared statement means much the same to your DBMS as an HTML templating system means to your application’s business logic. The purpose of the template is to serve as an abstract idea of what you want rendered in the view. Your application code might want to do various things with the data before it is presented to the user for output. So separating what ends up on the screen, from what’s going on behind the scenes in your code, is important. Equally as important, is the prepared statement that makes it possible to bind parameters to values that can never be confused as code.

$pdo = new PDO("sqlite::memory:");

$sql = "SELECT `username`,`userage` FROM userlist WHERE `userid` = ?";
$stmt = $pdo->prepare($sql);
/* bindParam() takes the position, the variable, and the type as separate arguments */
$stmt->bindParam(1, $_POST["uid"], PDO::PARAM_INT);

$stmt->execute();

Here the separation between your code and the user’s supplied data is quite clear to your database. The statement is prepared separately from the data and the parameter is used to bind some value into the statement; then both the SQL code and the data are sent along separate paths. We can’t confuse the user’s input for the SQL code or vice versa. It is the same goal you hope to achieve when you want that data placed in your HTML, but don’t want it to affect the HTML code, and without changing what the user handed you.
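A self-contained sketch of that separation, using an in-memory SQLite database so it can be run as-is. The table and column names follow the example above, but the schema, the row, and the Bobby Tables style input are assumptions made up for this illustration.

```php
<?php
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

// Assumed schema for the demo.
$pdo->exec('CREATE TABLE userlist (userid INTEGER PRIMARY KEY, username TEXT, userage INTEGER)');

// Input that would wreck a query built by string concatenation.
$name = "Robert'); DROP TABLE userlist;--";

// The SQL text contains only placeholders; the values travel separately.
$insert = $pdo->prepare('INSERT INTO userlist (username, userage) VALUES (?, ?)');
$insert->execute([$name, 9]);

$select = $pdo->prepare('SELECT username, userage FROM userlist WHERE userid = ?');
$select->bindValue(1, 1, PDO::PARAM_INT);
$select->execute();
$row = $select->fetch(PDO::FETCH_ASSOC);

// The stored name is byte-for-byte what the user supplied.
var_dump($row['username'] === $name);
```

The table is still there afterwards, and the name comes back exactly as it went in, with no escaping or unescaping step in your own code.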

Validation vs. Sanitization

There is another prevalent assumption: that if I only take from the user what I need, I can keep my application unaffected and working smoothly. The information a user supplies on the web can be vast. We input everything from our names, addresses, zip codes, credit card numbers, and phone numbers to entire documents on the web. There’s a great deal of potential for malicious users to try to sneak bad data past our applications to break them. However, it’s also important to keep in mind that your application is all about the user. If all your code is doing to make things safer is making the user more and more annoyed with the process of supplying their input or uploading their information, you are only alienating the very people you wrote the code for in the first place.

/* If you're doing this you are causing your users a lot of pain! */

$name = preg_replace("/[^a-z ]+/i", "", $_POST['name']);

Here are some reasons why this is wrong.

  • Why can’t my name be Robert Jr. or O’reilly?
  • Or how about a hyphenated name like Lee-ann?
  • Why would you assume my name can only be presented by the letters A-Z?
  • Have you never met a Jérôme or an Aimé or a Noël before?
  • If you know Afrikaans you should know some vowel sounds in Afrikaans are represented by an apostrophe.
  • You’re mangling people’s names! They’re not yours to mangle…
  • Mostly, YOU’RE PISSING PEOPLE OFF! STOP!

If some of the data a user might present would be unacceptable for your application, then what you should be doing instead is simply validating that data and using the validation result to determine either (A) the data is acceptable and we can proceed, or (B) the data is unacceptable and we must reject it entirely and notify the user to retry according to our requirements. But under no circumstance should you just change the user’s data without notifying them about it and continue on as if what you have is what they gave you.
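A minimal sketch of the accept-or-reject approach for a name field. The function name validateName() and the specific rules (non-empty, valid UTF-8, at most 255 bytes) are assumptions chosen for illustration; the important property is that the function reports problems and never returns an altered value.

```php
<?php
// Hypothetical validator: returns a list of problems; an empty list means "accept".
function validateName(string $name): array
{
    $errors = [];
    if (trim($name) === '') {
        $errors[] = 'Please enter a name.';
    }
    // preg_match with the /u flag returns 1 only for well-formed UTF-8.
    if (preg_match('//u', $name) !== 1) {
        $errors[] = 'The name is not valid UTF-8 text.';
    }
    if (strlen($name) > 255) { // assumed storage limit, in bytes
        $errors[] = 'The name is too long.';
    }
    return $errors;
}

foreach (["Jérôme", "O'Reilly", "Lee-ann", ""] as $candidate) {
    $problems = validateName($candidate);
    echo $problems === []
        ? "accepted: $candidate\n"
        : "rejected: " . implode(' ', $problems) . "\n";
}
```

Every name in the list above except the empty one is accepted exactly as typed.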

However, not everything the user submits necessarily requires validation. As developers, we sometimes like to assume that we know everything and that everything must be validated by us before it should be allowed. This is simply not true. We don’t know everything and we certainly shouldn’t have to be the overseeing party of what is or is not allowed as someone’s first and last name. Or characters the user is allowed to publish to a public forum. This stuff does not break our application code.

That said, there are places where validation is a requirement for the application to function properly. For example, we may need to verify that a user has supplied a valid zip code or postal address on an order form where they are placing an order for us to ship. If the user enters invalid information there, the order cannot ship, and that presents a problem for the intended users of the application software. We also don’t want users entering one or two letters for their password. This makes the account insecure and opens our application up to attacks. So we may want to validate that the user has supplied a password meeting a specific requirement, say between 10 and 20 characters (or perhaps just a lower bound to prevent people from easily guessing a password). We might want to ensure the user includes at least one upper case character or one special character as well to increase password strength and make brute force attacks harder. However, if the user’s data does not meet these requirements we should be rejecting the data entirely and informing the user of the problem so that they may retry. You wouldn’t just change the user’s password to have it meet your needs and simply carry on. That wouldn’t make any sense! How is the user going to know you changed their password? So you also shouldn’t change the user’s data anywhere else unless you’ve made the user completely aware of what you’re doing to their data so that they have an option to decline or at the very least may choose not to supply their data under such conditions.

$regex1 = "/\A[a-z0-9]{10,20}\z/i"; /* whole password: 10 to 20 letters or digits */
$regex2 = "/[A-Z]/";                /* at least one upper case letter */

if (preg_match($regex1, $_POST['password']) && preg_match($regex2, $_POST['password']))
{
    /* Password is acceptable */
} else {
    /*
        Password is unacceptable
        Reject and inform the user
    */
}

Also, don’t go out of your way to make it very difficult for the user to meet your requirements. For example, have you ever seen an online order form where you’re asked to enter such information as your phone number or credit card number and prompted not to use dashes or spaces? Why should this be a burden the user is faced with? Here you aren’t really validating anything but the user’s ability to follow instructions. Have you ever heard of a regular expression? Have you ever heard of client-side code that improves the user experience? For example, if you want the data sent to your application in a specific format, why not make that a part of your front-end? You can use separate input fields to make it clearer to the user. You can also validate the formatting according to regular expressions on the back-end. But don’t constrain the user experience when you have other options that can still ensure your data is validated and keep the user happy at the same time.
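One way this might look on the back-end, as a sketch: accept the card number however the user typed it, as long as it contains only digits, spaces, and dashes, and derive the digits-only form yourself. The function name parseCardNumber() and the 13 to 19 digit range are assumptions for illustration, not a complete card validation routine.

```php
<?php
// Hypothetical helper: returns the digits-only form for processing,
// or null when the input should be rejected and the user asked to retry.
function parseCardNumber(string $input): ?string
{
    // Anything other than digits, spaces, and dashes is rejected, not silently stripped.
    if (!preg_match('/\A[0-9 -]+\z/', $input)) {
        return null;
    }
    // Separators carry no meaning, so drop them for processing.
    $digits = str_replace([' ', '-'], '', $input);

    $length = strlen($digits);
    return ($length >= 13 && $length <= 19) ? $digits : null; // assumed range
}

var_dump(parseCardNumber('4111 1111 1111 1111'));
var_dump(parseCardNumber('4111-1111-1111-1111'));
var_dump(parseCardNumber('not a card number'));
```

The user types the number the way it is printed on the card, and the application does the tedious part.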

Do Computer Science Geeks Need Glasses?


You’re a CompSci Major And You Don’t Know It

In my provocative attempt to get the attention of a computer science student I have had to go to great lengths to make a subtle point. Unfortunately for me this took up a bit of my time. Fortunately for the person involved in the discussion, I probably got through to them. Although, you can never be too sure. Some people simply refuse to accept that they may be wrong about something out of delusions of grandeur, perhaps self-fixation, or even for reasons of pride. Whatever the case may be, it stands to reason that no matter how well you think you’ve learned something you may still be wrong. Medicine is one field where this notion holds true on a constant basis. Physicians and researchers in the medical field find out that what they thought they knew actually isn’t true only many years after their work has been published. The same can be said about the field of computer science.

Allow me to explain how this discussion began. I was speaking with a programmer (let’s call him Foo) on an online forum for PHP. Now, Foo is new to PHP. He’s trying to optimize his code and has posted some PHP code on the forum looking for advice from more experienced programmers in PHP. So far Foo seems like the typical case: he understands the language syntax and just needs assistance making the working code better. Now, another programmer steps in on the forum (let’s call him Baz) and contributes at length with suggestions that seem to me like pathetic micro-optimizations that aren’t really going to help Foo. His suggestions begin with things like: use shorter variable names, don’t define keys for arrays, use include and not include_once, and use switch and not if constructs to make your code faster. I start to see Baz is clearly not as experienced at PHP as he’s trying to make himself out to be.

Let’s examine why I think so and why Baz might be very wrong.

Will The Real PHP Please Stand Up

First, Baz is making some non-optimal suggestions that don’t necessarily improve the performance of your PHP code, but might actually break it. For example, suggesting include over include_once for performance reasons is not sound advice. Since the two do different things, they have different uses. You should not be basing your use of either construct on performance reasons alone. You definitely want to use include_once where your intention is never to parse the same file twice and where the order of the calling and called scripts is not clear.

Second, it cannot be easily determined whether using include or include_once will hinder performance or improve it. The include_once call itself is merely a wrapper around include that checks the list of already included files. Yes, you’ll be hitting the hashtable, but if you don’t already know it, PHP hits that hashtable for virtually everything you do in PHP. The hashtable was micro-optimized to be fast for this very reason. Additionally, such a performance hit isn’t even worth considering when the need for include_once is apparent. This is like saying that to improve my car’s gas mileage I’m going to take out my back seat. Sure, removing any weight from the car means less energy required to move said car, but when you understand the mechanics of the modern motor vehicle and apply just a little bit of common sense you soon realize this is a pretty silly move when you have passengers on board and they are uncomfortable or even injured on the ride because you don’t have a back seat, all in order to spare $0.01 of gasoline costs on a 15 minute trip. If you’re so concerned with cost you really need to seek out an entirely different solution. Like moving away from the horrid design of the combustion engine altogether!
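The behavioral difference is easy to see in a small, self-contained sketch. The helper makeCounterFile() and the temporary files are invented for this demonstration: each file simply bumps a counter every time it is actually executed.

```php
<?php
// Hypothetical helper: writes a tiny PHP file that increments $GLOBALS[$key] when run.
function makeCounterFile(string $key): string
{
    $path = tempnam(sys_get_temp_dir(), 'inc');
    file_put_contents(
        $path,
        '<?php $GLOBALS["' . $key . '"] = ($GLOBALS["' . $key . '"] ?? 0) + 1;'
    );
    return $path;
}

$plain = makeCounterFile('plain');
include $plain;
include $plain;      // runs the file a second time

$once = makeCounterFile('once');
include_once $once;
include_once $once;  // skipped: the file is already in the included-files list

echo "include ran the file {$GLOBALS['plain']} times\n";
echo "include_once ran the file {$GLOBALS['once']} times\n";

unlink($plain);
unlink($once);
```

If the included file declared a function or class instead of bumping a counter, the second plain include would be a fatal redeclaration error, which is why the choice between the two is about correctness, not speed.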

Third, suggestions like a switch is faster than the if construct are completely unfounded and very much baseless over-generalizations. The semantics of the if and switch construct are pretty similar in most high-level languages including PHP. They may differ slightly, but most of it comes down to simple evaluation and jump operations.

When Baz began explaining to me that I’m an idiot who doesn’t know what he’s talking about, I merely shrugged it off as someone ranting on a public forum that there was an obvious performance difference between switch and if in PHP. So I moved on. Then, when Baz realized I wasn’t responding to his rants, he began daring me to disprove him. This is a common burden of proof fallacy: since he is the one making the argument, it really is his burden to prove himself right, not mine to do the work of proving him wrong. But, in good nature, I laughed and proceeded to prove the poor compsci student wrong. To do so I showed him the branch analysis of an if versus a switch construct using the same 4 expressions to simply check for a true or false and print a single line.

[Figure: PHP IF Branch Analysis]

[Figure: PHP Switch Branch Analysis]

Now, here’s the kicker. If you want to be pedantic about it… The switch will actually cause PHP to generate more opcodes than using the if/elseif construct in this case. This still really says little about performance though. Since they ultimately just lead to a simple branch analysis in the end. All PHP is doing here is evaluating the expression given, to a boolean value of either true or false. If it is true then it executes the opcodes for that branch, if not it moves on to the next branch.

Here’s the code for both scripts.

IF – test1.php

$baz = false; $bar = 0; $foo = ""; $var = true;
if ($baz) {
        echo "We have baz!\n";
}
elseif ($bar) {
        echo "We have bar!\n";
}
elseif ($foo) {
        echo "We have foo!\n";
}
elseif ($var) {
        echo "We have var!\n";
}

Switch – test2.php

$baz = false; $bar = 0; $foo = ""; $var = true;
switch (true) {
        case $baz:
                echo "We have baz!\n";
                break;
        case $bar:
                echo "We have bar!\n";
                break;
        case $foo:
                echo "We have foo!\n";
                break;
        case $var:
                echo "We have var!\n";
                break;
}

So all of a sudden, Baz doesn’t seem to have a leg to stand on and wants to start arguing branch analysis theory with me, vigorously explaining what little he seemed to remember from his professor in his last ASM class. I could tell he was taking out the book at this point, but rather than be drawn into a pointless debate about things that weren’t helping anyone, I tried reminding him to get back on topic and help out Foo, instead of trying to convince me he was a smart compsci student. Baz seemed to be infuriated with this and resisted getting back on topic. To the point where Baz had now been banned from the discussion and continued to pursue me in private, sending me quote after quote from his books.

To indulge the poor fellow I offered some pointers on what it really comes down to when you step back and look at the bigger picture. PHP is not a very efficient way of doing things to begin with. PHP is built on a share-nothing architecture. It’s an interpreted language, which means that, without an opcode cache, you’re basically recompiling your program from scratch every single time you want to run it. It doesn’t break down to simple x86 ISA in the end and you can’t justify low-level micro-optimizations in PHP, since they tend to end up breaking your code in the process of trying to make it faster. The smart PHP developers, the ones who really know PHP inside and out, will always tell you not to try to outsmart the interpreter, because the interpreter will likely keep outsmarting you.

The Interpreter Is Outsmarting You!

Take people who try to force references into their code as a good example. They think that by using references everywhere they will save on memory and will ultimately make their code faster in ways the interpreter couldn’t. Let us examine the following code snippet to see why this can be a horrible idea and isn’t really offering any benefits worth pursuing.

$array = array(1,2,3,4,5);

foreach ($array as &$value) {
    /* Do some stuff here... */
}

Alright, so this looks great. We didn’t have to use more memory and we can modify the array values in our loop, right? Wonderful! Now we need to just finish this code with one last loop where we’ll print the array elements to the page in a table.

echo "<table><tr>";

foreach ($array as $value) {
    echo "<td>$value</td>";
}

echo "</tr></table>";

The result of our code is now:

1	2	3	4	4

If it isn’t already obvious to you what happened here… You basically just got outsmarted by the interpreter. In trying to outsmart the PHP interpreter you ultimately broke your code and now have an undesired side effect. This is completely expected behavior, by the way. The reason is that $value is still a reference to the last element of the array. The reference never went away, right? You made it a reference, so its use anywhere else in your code still keeps it a reference. And since you decided to use it in the next foreach loop, PHP assigns each element to $value on every iteration, which writes that element into the last slot of the array. By the time the loop reaches the last element, that slot already holds a copy of the fourth one. Yes, you brokeded it!
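If you really do need the by-reference loop, the usual remedy is to break the reference as soon as the loop ends. A short sketch using the same array as above (the $seen variable is only there to collect the output for this demo):

```php
<?php
$array = array(1, 2, 3, 4, 5);

foreach ($array as &$value) {
    /* Do some stuff here... */
}
unset($value); // $value no longer points at the last element of $array

$seen = array();
foreach ($array as $value) {
    $seen[] = $value; // plain copies; nothing is written back into $array
}

echo implode("\t", $seen), "\n";
```

With the unset() in place the second loop prints all five original values, because $value is an ordinary variable again.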

Not to worry though. When you’re iterating over an array with foreach, PHP is already using no more memory than it would if you had assigned the variables by reference anyway. Why? Because PHP uses copy-on-write, which means it doesn’t use the extra memory unless a write operation has changed the value of one of the variables. Otherwise the variable is nothing more than an extra refcount on the ZVAL.

Goodbye Baz

Now, Baz eventually gave up and agreed that what he was arguing was not only baseless egocentric banter, but also not very helpful to Foo. I explained to Baz that he should never think that information from a single source amounts to intelligence. It’s just information. Intelligence needs to be gathered from multiple independent sources and corroborated through peer-review. Otherwise, it’s just pointless to say that it’s intelligible when you can’t even get your peers to agree on it.

So the next time you get a compsci student trying to tell you off like he’s the boss, ask him or her an intelligible, objective, question. If they can respond with an intelligent and completely objective answer they might be worth listening to. Otherwise, tell them to go back to school. Computer Science students that have the attitude “I learned it this way and it can never possibly be right any other way” are always going to find it difficult to get work once they step outside of their school and into the real world. Word to the wise: CompSci Geeks DO need glasses!

Why You Need a Database

There are a lot of developers who start off building their applications with the notion that a database is only necessary if they have a lot of data to work with, or that the data they have will be easier to manage if they can avoid the complexities of building and maintaining a database or dealing with a DBMS (Database Management System). In the area of web-based development, this is rarely the case. The reason for this is that web-based applications tend to grow very rapidly. This happens easily, because there are billions of people with access to the Internet and virtually anyone with access to the Internet gains such access from a web-enabled device. Having access to the Internet has become synonymous with having access to the World Wide Web. Since the number of potential users is so huge, the potential for data is equally huge. Not only that, but beyond the sheer amount of data that may be collected from users of the application software and stored for use by the system, there is the factor of maintainability. Databases make organizing and maintaining long-term data easier. This comes in several forms. Without a database solution you have to worry about concurrency issues and replication yourself. You would also have to consider race conditions, access time, permissions, and scalability, among others.

Databases Are Overkill

Those who start off building small web-based applications, or even trying to put together a tiny CMS (Content Management System), sometimes fall victim to the illusion that having a very small amount of data means that building a database for this data would be overkill. This is simply not true anymore. Today databases are easier than ever to build, grow, and manage. With lightweight solutions like SQLite you can actually improve performance with small amounts of data and make it easier to manage. SQLite is a small-footprint library written in C that implements an embedded DBMS. It’s only a few hundred KB in size and implements most of the SQL standard. You can use it to store databases in memory or on disk and still get the full benefits that relational databases offer with a minimal footprint and without compromising on performance for small data sets. It’s supported in PHP, Python, Perl, Ruby and even JavaScript, as well as many other languages. So there really is no excuse to avoid using a database when the solution is widely available on so many popular platforms, and especially in web development.

Databases Are Slow

This could not be farther from the truth. A relational database can maintain indexing for records across different tables. This means rather than looking through the entirety of the data set and then trying to expose some underlying structure in order to find a particular set of data the relational database takes advantage of composing structures as you build your data sets. These structures make things like fetching a record with a primary key much much faster than you would get by using a flat-file solution.

Let’s examine the alternatives. Even if you had a very small amount of data, say just a few hundred lines of text. Even if the data structure was overly simplistic (we’ll assume each line represents what would be a single row in a database table). Even if the data will only ever be maintained by a single developer: you. You are still overlooking so many problems that are not easily solved by using a flat file to maintain this data. First, let’s consider the race condition. You have a script that opens a specific file on the server and appends a new line each time a record is added. The script can also open the file for reading and retrieve the entire contents of the file into memory. The script can then do any necessary sorting and filtering to return the required data sets to the user. The most apparent problem with this approach is the race condition. It is entirely plausible that two requests could be made simultaneously to the same script: one to open the file for writing and append a record and one to open the file for reading and retrieve the data. If the data is read into memory before the line is appended, the result is stale and potentially corrupt. If the new data is appended before the read, no problem. However, what happens when you want to delete a record? Now the problem is threefold. If three individual requests all come in at the same time (one to read, one to write, and one to delete a record) it is now likely that your entire data structure has been corrupted. Remember that HTTP is built on a request-response model and no request is treated as if it were tied to any previous request. So there’s no central point of control over your script’s ability to manage which process can access the data and to what extent.

In a DBMS, on the other hand, the control is transferred away from the script and to the central management system of the database. The DBMS then gets to decide how requests will be served and the order in which the data is handled. This creates more dependable data that has a far smaller chance of corruption. Now, it’s entirely possible that you may not be concerned with the integrity of your data for a small application, but then you might as well not waste your time building it.
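As a rough illustration of handing that control to the DBMS, here is a sketch of a transaction using PDO with an in-memory SQLite database. The records table and its rows are made up for the example; the second statement fails on purpose to show that the delete before it is undone as well.

```php
<?php
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

// Assumed schema and data for the demo.
$pdo->exec('CREATE TABLE records (id INTEGER PRIMARY KEY, body TEXT NOT NULL)');
$pdo->exec("INSERT INTO records (body) VALUES ('first'), ('second')");

try {
    $pdo->beginTransaction();
    $pdo->prepare('DELETE FROM records WHERE id = ?')->execute([1]);
    // Deliberately invalid: NULL violates the NOT NULL constraint and throws.
    $pdo->prepare('INSERT INTO records (body) VALUES (?)')->execute([null]);
    $pdo->commit();
} catch (PDOException $e) {
    // Undo everything done since beginTransaction(), including the delete.
    $pdo->rollBack();
}

$count = (int) $pdo->query('SELECT COUNT(*) FROM records')->fetchColumn();
echo "rows after failed transaction: $count\n";
```

Getting the same all-or-nothing behavior out of a flat file means writing your own locking and recovery code, which is exactly the work the DBMS already does for you.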

I’ll Use A File Now And Learn To Use A Database Later

If you’ve said this phrase it’s already too late. It doesn’t take a lot of time to get started with a database in the first place. If you’re using a language like PHP, Python, Perl, or Ruby you probably already have the necessary libraries installed on your system to work with a database. These libraries and drivers are usually packaged with these software stacks as standard. It’s actually uncommon not to have some DBMS solution already available in most of these environments. So why would you go out of your way to reinvent the wheel when the solution is already at your fingertips? Not only that, but it takes very little time to set these DBMS solutions up and get them running smoothly on virtually any platform. You will probably spend more time trying to write a script that stores, retrieves, sorts, filters, locks, and validates data using a flat file than you would installing the DBMS and getting a simple schema started.

If you’re using PHP, interfacing with a database has become easier than ever. It only takes a couple of lines of code to open a connection to virtually any database for which you have a PDO driver installed and loaded. So whether you’re using SQLite, MySQL, PostgreSQL, etc., you shouldn’t need to spend a lot of time learning how to interface with each of these databases if you simply stick with the PDO extension. You use the same classes and methods regardless of the database, as opposed to having to learn the individual database-specific extensions in PHP. Not to mention that PDO supports popular database features such as prepared statements and is a lot easier to learn and use than extensions like MySQLi.

PHP and Databases

Being a PHP developer, I also notice that many PHP developers have the misconception that when they start using a database (usually their first database is MySQL) they should start by learning the old mysql extension in PHP. This is simply not true. The misconception is widespread mainly because the old mysql extension has been around for a long time, and it’s fairly common to see PHP code demonstrating database access with it. It’s also familiar to a lot of veteran PHP developers and is bound to be present in their older applications. However, the use of the old mysql extension is highly discouraged for new development. It’s an old extension that’s no longer well-maintained and has been planned for deprecation for years. There’s no guarantee that if a new bug crops up someone will go back and fix it. This leaves your application vulnerable and exposed. If the code base gets large enough, this might leave developers scrambling for a migration path. Additionally, the extension does not support prepared statements or parameterized queries, which makes properly escaping user data to avoid SQL injection prone to error. The extension is lacking in many areas, and that is not conducive to future development. Learning the old mysql extension before you learn the newer, improved MySQLi extension or PDO will gain you nothing. In fact, you will have to unlearn some of the very poor design of the old extension and its implementation details in order to become accustomed to the newer extensions.
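
To show why parameterized queries matter, here is a small sketch using Python’s built-in sqlite3 module as a stand-in (PDO offers the same idea through its prepare and execute methods). The table and the hostile input are made up for the example. The point is that the value travels separately from the SQL text, so there is nothing to escape by hand.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")

# Input in the spirit of Bobby Tables.
hostile_name = "Robert'); DROP TABLE students;--"

# The ? placeholder is bound by the driver, so the input is treated
# strictly as data and never parsed as part of the statement.
conn.execute("INSERT INTO students (name) VALUES (?)", (hostile_name,))
conn.commit()

stored = conn.execute("SELECT name FROM students").fetchone()[0]
row_count = conn.execute("SELECT COUNT(*) FROM students").fetchone()[0]
print(stored, row_count)
conn.close()
```

The table is still there and the odd-looking name is stored exactly as the user typed it, which is what you want: the user’s data stays intact and so does your schema.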

Some developers also complain that PDO seems too complicated or more difficult to use than the old mysql extension. This might come from a lack of understanding of what PDO even is or how it’s used. Since PDO can only be used through PHP’s object-oriented interface (you have to use objects and methods instead of procedural-style functions), it can seem unapproachable or even scary to developers who aren’t used to OOP in PHP. There is also the perception that PDO has a lot more to it because it is vendor-agnostic, and that it requires further configuration such as installing and loading the individual drivers needed for your specific database (where the drivers aren’t already packaged or loaded). I can understand the intimidation, but most of this has been alleviated by new versions of PHP coming pre-packaged and loaded with most of the popular drivers, and by documentation whose examples are now easier to follow. Most of the intimidation actually comes from having to unlearn the habits that older extensions like the old mysql extension once taught.

Once you get past the initial intimidation phase and actually get started with PDO and with a database, you’ll find that it doesn’t take nearly as much time as you’d think to get up and running. The reservations people have are 90% of what’s holding them back; the investment to get started is not actually that significant. Beyond that, you’ll find that building on normalized data not only makes development easier, but makes your users happier. When you can organize and maintain data that’s more clearly structured and accessible, you can serve your users more effectively and efficiently. That will keep users coming back and eventually help you grow your application!

Load Balancing Software as a Service

I’m sure many of you have seen this statue before, perhaps not the very same one in the picture, but possibly similar statues around the world. This one is located in New York City.

Statue of Atlas in NYC

This particular statue is of the Titan Atlas (a figure from ancient Greek mythology), who was burdened with carrying the weight of the heavens on his shoulders as a punishment from Zeus. Popular depictions often show him holding up the world instead, and that is the image most people remember. It’s nothing more than a myth, but the lesson history teaches us is that it constantly likes to repeat itself. Clearly, no one can bear the entire weight of the world on their shoulders, and no one computer can either. If you are running SaaS (Software as a Service), your service is expected to be online 24/7. The problem is there are over two billion users online (or with Internet access) today. What happens when too many of those users all start using your service at once?


What Is Load Balancing


The idea behind load balancing is that a single machine can only handle so much work at one time, and you can only scale vertically so far. Notice that even in large cities you can only build so high before you have to start building out. Since virtually anyone on the Internet can be using your server at any time, you run the risk of overloading without warning. If too many users send requests to your server too quickly, the load will exceed its capacity and the server will slow to a crawl or crash.

This weakness of the typical client-server relationship is what a DoS (Denial of Service) attack exploits, or a DDoS (Distributed Denial of Service) attack when the traffic comes from many machines at once. Basically, a number of clients (sometimes a botnet controlled by one or more users) send a lot of requests to a server or group of servers very quickly in order to overload it and prevent its intended users from accessing the service. Sometimes this is done just to destabilize the service, and sometimes for other malicious ends. There are ways to mitigate these attacks with firewall software or hardware, or through other means depending on the service, but not every flood of traffic is malicious or even intentional.

Google, for example, experienced what at first glance looked like a DDoS attack on its search service on the afternoon of June 25th, 2009. It wasn’t a malicious user or users at all. It was the world receiving the tragic breaking news of the death of Michael Jackson. Literally millions of users from all around the world flooded Google Search at once with the same search phrase, “Michael Jackson”. Google had never seen such a tremendous amount of traffic come in all at once on a single search query before, so their first thought was “ohnoes, we’re getting DDoSed!”

Scaling Out - SaaS


Why Do I Need It


The fact remains that the number of requests coming in to your servers can surge at any given time, and whether the cause is malicious or not is unimportant. What is important is that you are prepared to handle such situations so that your service suffers as little downtime and degradation as possible. Load balancing allows you to distribute the load on a particular service or services over a larger array of resources. It basically makes your service, as a whole, more tolerant of failure by making efficient use of all available resources.

If you are running any kind of high-availability service over the Internet you need load balancing. Even small applications with just a few thousand users can benefit greatly from it. The only potential downside is that you may need more than one node to do it. Even that isn’t always necessary, as load balancing comes in many shapes and sizes. For example, you might balance load across multiple guest nodes running on the same host machine. The major services you probably use on a regular basis, like your email, search engines, or popular social networking apps, all make use of load balancing because it keeps things running a lot more smoothly as the number of users grows. If you’re not on board with this yet, you should probably get on board quickly.


How Do I Use It


There are a few broad categories you can place load balancing techniques in. The easiest form of load balancing relies on a system that most services on the Internet (or large networks in general) are already built on top of, and that’s DNS. DNS is a distributed system, so it relies on multiple components in the network each doing their job to keep things efficient. Its layers of caching reduce the bottlenecks that would otherwise come from answering an enormous number of lookups from across the planet in fractions of a second. Like most complex systems it started off small and simple and grew both horizontally and vertically, but at the core the protocol is fundamentally very simple.

DNS Load Balancing simply relies on the DNS system to take care of the most basic problems for you. The way this works is you point the DNS record for a particular domain name at multiple IP addresses (usually one for each server) with a low TTL (Time to Live). A name server answers lookups for a domain name with the IP address to connect to, and it can hand out different addresses depending on the geographical origin of the request, reducing network latency by letting packets travel shorter distances. Once a lookup has been answered, the result is cached at multiple levels so that future requests go to the same place. This can be at the local level, the ISP level, and at other resolvers along the way. The name server therefore doesn’t become a bottleneck, since not every single request has to go all the way back to it. The TTL lets the caching servers know when the cached answer has become stale and it’s time to refresh.

DNS itself does not check whether a server is up, but when a record lists several IPs many clients will try another address if the first one stops responding. So if one server becomes unresponsive (potentially having gone down), the load can end up on a different server.

The inherent problem with this approach is that it doesn’t make very efficient use of all of your resources. It doesn’t take into account which servers are currently busy, and if a cached record points at a server that is now down you are potentially stuck with an unresponsive server until the cache is refreshed. Additionally, you are exposing your infrastructure to the outside world by revealing the public IPs of your servers, with no way to control the flow of traffic into an internal network. It’s very easy to end up with an unstable system this way. Most services that use this approach are just creating what are known as mirrors (servers that back each other up so that if one goes down a backup can still be reached).
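
Here is a toy model of the DNS approach, written in Python for illustration only. It is not a real resolver: it hands out one address per lookup in strict rotation, which is a simplification of how name servers rotate a record set. The addresses are placeholders from the documentation range and the function names are made up.

```python
from collections import Counter
from itertools import cycle

# Hypothetical A records for one hostname.
A_RECORDS = ["192.0.2.10", "192.0.2.11", "192.0.2.12"]


def assign_clients(records, lookups, down=frozenset()):
    """Rotate through the records and count where the lookups land.

    Returns the hits per address and how many lookups were handed
    an address that is currently down.
    """
    answers = cycle(records)
    hits = Counter(next(answers) for _ in range(lookups))
    stranded = sum(count for ip, count in hits.items() if ip in down)
    return hits, stranded


hits, stranded = assign_clients(A_RECORDS, lookups=9, down={"192.0.2.11"})
print(dict(hits), stranded)
```

The rotation spreads lookups evenly, but it has no idea that one server is down or that another is swamped, so a third of the clients in this run are pointed at a dead address until their cached answer expires.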

Software Load Balancing is another approach, one that addresses some of the shortcomings of the DNS techniques described earlier. A software load balancer keeps track of the available resources, and when a request comes in it determines how best to allocate those resources to service that request. One benefit is that you don’t have to reveal your network setup to the outside world. Everything can be done within your internal network configuration (whether that’s a local area network or otherwise); in other words, you won’t expose your internal communication channels directly. You also have a tighter hand on security and distribution since you can more easily control the flow of traffic over the network. Some examples of common open-source software used for this are Pound, Varnish, mod_proxy for Apache’s httpd, and Gearman. There are all sorts of nifty ways to balance the load across your network. You can have the load balancers poll the servers and check on resources like CPU usage, available memory, storage space, network traffic, open TCP connections, etc. The load balancer can then use this information to figure out how best to direct the incoming requests and serve up the responses as quickly and efficiently as possible.

There are still a few problems inherent to this technique depending on how you use it. If you’re relying on a single machine you have a single point of failure: if the host node goes down, the load balancer and all of your resources go with it. If you’ve got only one load balancer in front of multiple servers you still have a single point of failure. Additionally, the load balancer itself can be DoSed by an attack of sufficient magnitude. Not only that, but you have to worry about things like session storage consistency across multiple servers, file-system access, database synchronization between different database servers, and network bottlenecks that load balancing alone can’t always resolve, to name a few.
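
As a rough sketch of the polling idea, the Python snippet below picks a server from metrics the balancer has supposedly already collected. The Backend fields, the server names, and the "fewest open connections, then lowest CPU" rule are assumptions chosen for the example, not how Pound, Varnish, or any other particular product decides.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    name: str
    healthy: bool            # result of the last health check
    open_connections: int    # polled from the server
    cpu_percent: float       # polled from the server


def pick_backend(backends):
    """Return the healthy backend with the fewest open connections.

    Ties are broken by the lower CPU usage.
    """
    candidates = [b for b in backends if b.healthy]
    if not candidates:
        raise RuntimeError("no healthy backends available")
    return min(candidates, key=lambda b: (b.open_connections, b.cpu_percent))


pool = [
    Backend("web1", healthy=True, open_connections=42, cpu_percent=71.0),
    Backend("web2", healthy=False, open_connections=0, cpu_percent=0.0),
    Backend("web3", healthy=True, open_connections=17, cpu_percent=35.5),
]
chosen = pick_backend(pool)
print(chosen.name)
```

Notice that web2 looks idle but is skipped because it failed its health check, which is exactly the kind of decision plain DNS rotation cannot make.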

Hardware Load Balancing is also an option. You can buy very expensive firewalls and routers that take care of many of these things for you. Most people just set up a dedicated node or two with software load balancers that do pretty much the same thing. Hardware appliances like Cisco’s ASA might do a better job of handling security and high-bandwidth loads, but they do come with a heavy price tag.


Some Load Balancing Tips


There are some pretty common approaches to the problems inherent to distributing a service over multiple servers. Take session storage as the most obvious one. If you’re using PHP you are probably using the built-in session handler, which stores sessions in files on the local server. If the load balancer directs a user to different servers, that user ends up with multiple sessions across those servers (which can be problematic for your application and annoying to the user). Some people avoid this by creating what’s called a sticky session. Once the session is generated, the user is sent a cookie that tells the load balancer to direct subsequent requests to that same server. There are a few minor problems with that, but nothing you couldn’t work out with a well-planned architecture. Another approach is a centralized session storage server where all the requests look for the session. Depending on your infrastructure this may or may not be a good idea, and keep in mind it also creates a single point of failure. For example, if your servers are built as stacks (several software-based servers running on the same node, like a web server, database server, application server, etc.) it takes some tinkering to configure each stack to work from centralized session storage. You can use something like Redis, with master/slave replication across all stacks. This takes a little less configuration and moves the session handling into the software stack layer, thereby removing it from the load-balancing layer.
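
The sticky session idea can be sketched in a few lines of Python. This is an illustration under assumptions, not any real load balancer’s implementation: the cookie name lb_backend, the server names, and the round-robin choice for first-time visitors are all made up.

```python
from itertools import cycle


class StickyBalancer:
    COOKIE = "lb_backend"  # hypothetical cookie name

    def __init__(self, servers):
        self.servers = list(servers)
        self._rotation = cycle(self.servers)

    def route(self, cookies):
        """Return (server, cookies_to_set) for one incoming request."""
        pinned = cookies.get(self.COOKIE)
        if pinned in self.servers:
            # Returning visitor: honour the pin, nothing new to set.
            return pinned, {}
        # New visitor, or pinned to a server that no longer exists.
        server = next(self._rotation)
        return server, {self.COOKIE: server}


lb = StickyBalancer(["app1", "app2", "app3"])
server_a, set_cookie = lb.route({})      # first visit, no cookie yet
server_b, _ = lb.route(set_cookie)       # browser sends the cookie back
server_c, _ = lb.route({})               # a different visitor
print(server_a, server_b, server_c)
```

One of the minor problems shows up right in the fallback branch: if the pinned server is taken out of the pool, the visitor is quietly moved elsewhere and their file-based session does not move with them.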

The other obvious problem is file system storage. If you allow your users to upload files to your server, or you store large amounts of files that your application relies on heavily, your application layer needs some way to reach those files no matter which server the load balancer sends a request to. Again there is a centralized approach, as with session storage, but even a replicated approach (to avoid the single point of failure) can create the problem of excessive redundancy. If your servers are set up in stacks, keeping a copy of every file on each of four or five stacks (or more, depending on how many servers you have) is a bit of a waste, especially if you’re already using RAID arrays for redundancy. Even if you have a centralized set of servers for storage you still face the problem of network overload. For example, if your backbone bandwidth capacity is 100 Mbps but your central network bandwidth capacity is 10 x 100 Mbps, you eventually create a bottleneck as usage increases, since your backbone can only serve up to 100 megabits per second of traffic at any given time.
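
The bandwidth example is easy to work through with a little arithmetic, shown here in Python. I’m reading "10 x 100 Mbps" as ten servers with a 100 Mbps link each, which is an assumption about the setup, and the even split between servers is a simplification.

```python
BACKBONE_MBPS = 100       # backbone capacity from the example
SERVER_LINK_MBPS = 100    # assumed per-server link speed
SERVER_COUNT = 10         # assumed number of servers

# What the servers could push if nothing stood in their way.
aggregate_mbps = SERVER_COUNT * SERVER_LINK_MBPS

# What can actually leave the network at any given moment.
deliverable_mbps = min(BACKBONE_MBPS, aggregate_mbps)

# Each server's fair share if all of them are busy at once.
per_server_share_mbps = deliverable_mbps / SERVER_COUNT

# Fraction of the internal capacity that is usable end to end.
usable_fraction = deliverable_mbps / aggregate_mbps

print(aggregate_mbps, deliverable_mbps, per_server_share_mbps, usable_fraction)
```

Under those assumptions the servers can collectively offer ten times what the backbone can carry, so adding an eleventh server would not move a single extra bit out the door.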

Using a CDN (Content Delivery Network) is one solution often used when large amounts of files need to be shared across a network, but this can also be a bit costly depending on your needs. In its simplest form a CDN is really just a group of servers that store files or data objects for you and replicate them across multiple nodes, giving many other servers on the network access to that data with better caching and high bandwidth to reduce latency. The servers in the CDN clusters are usually strategically located at the edges of the core network to minimize the bottlenecks of the centralized network loop. So you are redirecting file storage traffic away from the central network and out to the edge servers, gaining bandwidth and reducing bottleneck traffic. This both addresses the single point of failure problem and takes that complexity out of the server stack, which can ultimately help reduce loads and make load balancing more efficient. Most services that utilize CDNs are ones that need to offer consistent high-bandwidth access to a large user base: for example, a service that offers high-definition video streaming, a large photo sharing web site, or other media services with high availability needs. You don’t always have to build this infrastructure yourself. You can rely on services like Amazon CloudFront, a pay-as-you-go CDN service offered by Amazon. There are many competitors, of course, that can offer cheap CDN solutions. Depending on the sensitivity of your data this may or may not be an option for your particular SaaS needs. Still, it’s something to consider.

Besides file storage, you probably have a lot of database concerns in a system that scales horizontally as well. If you’re just using a single LAMP stack with little more than PHP, MySQL and Apache running your back-end, it might seem easy to scale out at first. The problem you’re likely to run into head-on is data replication across your MySQL servers. The database is almost always the biggest bottleneck in SaaS. It usually contains tons of data that virtually every one of your users will access with each hit. There’s only so much traffic a single database server can handle, but setting up two or more database servers can show significant improvement. Your load balancer can also play a role in this. Data object caching can be put in place to ease some of the load from the most frequent queries. There can also be network latency issues to deal with once you have several database servers all replicating (especially if these servers are spread out geographically across different data centers, cities, countries or even continents). Chunking is definitely not something I’d advise. It throws way too many variables into the equation and presents more problems than solutions for most applications.
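
To give a feel for what data object caching buys you, here is a minimal in-process cache with a time-to-live, sketched in Python. In practice you would more likely reach for a shared cache server; the class name, the 30 second TTL, and the fake query function are assumptions for the example. The clock is injectable so the expiry can be shown without actually waiting.

```python
import time


class QueryCache:
    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._entries = {}  # key -> (stored_at, value)

    def get(self, key, load):
        """Return the cached value, calling load() only on a miss or expiry."""
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self.ttl:
            return entry[1]
        value = load()
        self._entries[key] = (now, value)
        return value


db_hits = []


def expensive_query():
    """Stand-in for a query that really goes to the database."""
    db_hits.append(1)
    return ["row1", "row2"]


fake_now = [0.0]
cache = QueryCache(ttl_seconds=30, clock=lambda: fake_now[0])

cache.get("top_posts", expensive_query)   # miss: goes to the database
cache.get("top_posts", expensive_query)   # hit: served from memory
fake_now[0] = 31.0
cache.get("top_posts", expensive_query)   # expired: goes to the database again
print(len(db_hits))
```

Three requests turn into two database hits here. The trade-off is the usual one: for up to 30 seconds a user can be shown data that has already changed in the database.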