Sherif's Tech Blog

Just another guy on the Internet with a keyboard…

Using PDO with MySQL in PHP

The PHP community has been making an effort to steer people away from the old mysql_* functions (an extension that is no longer well maintained by core PHP developers) and encouraging the use of newer and more feature packed interfaces like MySQLi and PDO. MySQLi offers both Object Oriented and Procedural style coding; PDO does not and as far as I know there are no plans for that to change. So if you aren’t familiar with the OOP paradigm this is a good time to start in case you plan on using PDO in future development. This helps encourage more modular development as well. This is important where sometimes your implementation details can be hidden. The old mysql extension has been discouraged for developing new applications for a while now and since MySQL is a very popular Database Management System a lot of developers used to the old interface have been giving some push-back about switching. Personally, I must admit that I have been using mysql_* functions for years with MySQL databases and so I can understand some of the hesitation in switching, but I’d like to address some of the benefits and trade-offs involved. At least the ones I’ve found to stand out most during my encounters with other developers this year.

PDO Is Database Agnostic

This is probably the driving point of most developers – at least the one I’ve seen come up frequently since last year. The PDO extension in PHP is nothing more than a means to interface with various database-specific PDO drivers. In PHP the PDO extension supports a vast array of databases like Cubrid, FreeTDS / Microsoft SQL Server / Sybase, Firebird/Interbase 6, IBM DB2, IBM Informix Dynamic Server, MySQL 3.x/4.x/5.x, Oracle Call Interface, ODBC v3 (IBM DB2, unixODBC and win32 ODBC), PostgreSQL, SQLite 3 and SQLite 2, Microsoft SQL Server / SQL Azure, and 4D. This means you can use the same PDO functions to issue queries and fetch data from any of these databases. Some people were a little confused about this at first and seemed to think this implied you would not have to rewrite your SQL. This is certainly not the case. The SQL is still dependent on your database. However, you will not have to rewrite your PHP code should you chose to switch databases or use different databases in the same project or just reuse a PHP abstract database class in another project with a different database. The PHP code works the same way whether I’m using MySQL, PgSQL, or any other database. Obviously my database’s SQL syntax may vary here and there, but I’m not required to do large refactoring of my PHP code.

Now, in all fairness, the likely-hood of having to use different databases in the same project or switching databases in any project are fairly slim. So some developers don’t see this as such a great benefit. However, should that be the case you suddenly find yourself in a world of hurt if your application relies very heavily on the database and you have thousands of lines of code to rewrite. Which is why it makes sense to just learn an extension like PDO and use it everywhere so that you never have to relearn another extension should you start a project that uses a different database than what your existing extension was built for.

What PDO Is Not

Some developers also seem to think PDO is a full-blown database abstraction layer or even an ORM. Let me assure you that it’s not. You can’t actually perform any database functions using the PDO extension by itself. You have to install the database-specific PDO driver to access the database you want. These drivers basically implement the PDO interface and thus you get to use those database specific features as regular functions of the PDO extension. PDO is not a magical solution or replacement for you DBMS. It does not provide a tool for userland it simply provides an access layer in userland. So if you had any of these misconceptions before, now is a good time to get rid of them.

Developers Migrating to PDO for the MySQL Database

If you’re constantly developing with MySQL databases like me you probably either still use mysql_* functions in PHP or have tried or even switched to MySQLi. If you’re one of the few that have taken the leap to PDO great! If you still have a few reservations consider that whoever told you PDO is too complicated or harder to learn or use is lying to you. They may not even know they’re lying, but you ARE being lied to. First of all PDO is no more difficult to use or learn than MySQLi or mysql_* or any of the other database extensions in PHP. The fact remains that mysql_* functions have been in long-standing plans for deprecation and they will deprecate it. Eventually when mysql_* functions are gone you will be in a much more difficult position were you to continue developing new applications with it than if you were faced with legacy applications that still had a great deal of code relying on the old extension.

The truth is the old extension for MySQL is no longer well maintained. You’d be lucky if any of the core developers went back and made any major improvements or fixed any real bugs (should they happen to arise). For the most part the new MySQL Improved extension MySQLi is still being worked on and can expect further development, but with the old extension it’s just not likely to happen. There isn’t anything fundamentally wrong with the old mysql extension. The purpose of this blog post isn’t to knock on mysql_* functions, but rather to encourage developers to explore the pros and trade-offs that an extension like PDO has to offer. This mainly because I feel a lot of the developers I’ve come into contact with that have reservations about switching to PDO seem to have been completely mislead or just never bothered to learn about what PDO really does.

If you’d like a good tutorial to follow on migrating I recommend taking a look at the PDO Tutorial for MySQL Developers from the ops at hash php (##PHP on freenode on IRC). If you aren’t sure about just how easy it is to use PDO with MySQL I drafted up a small working example of using PDO with MySQL here as well. It’s using the MySQL world sample database and both the code and a working example are provided.

One of the first gatchyas you want to look out for is specifying the character encoding for your connection. In my example above I’m using MYSQL_ATTR_INIT_COMMAND to SET NAMES and SET CHARACTER SET telling MySQL to stick to UTF-8. This is so I don’t break the encoding during transport. Of course you still have to remember to specify the encoding to the client upon output, but this just demonstrates how you can go about making sure the connection to MySQL server uses the proper encoding. The escaping rules should be updated accordingly. Remember you had to do this with mysql_set_charset() in the old MySQL extension.

The first thing you notice when you start using an extension like PDO or MySQLi is that when you use prepared statements where you are binding parameters you don’t have to worry about escaping rules that are otherwise more pertinent in the old string concatenation style used to build SQL queries in mysql_* functions. PDO is actually separating the SQL from the parameters bound to that SQL. This not only prevents SQL injection (which mysql_real_escape_string() could do just as effectively), but it makes it easier for the developer to work the user data into their SQL. This an underlying benefit I’ve come to admire very much with PDO.

Dealing With Various Time Variables in PHP

The Problem

When you’re dealing with things like date/time stamps there can be numerous challenges in making such information useful. The most obvious would be translating between different date formats (such as 31-12-1999 and 12/31/1999 or 1999-12-31 or even Fri Dec 31, 1999). The easiest way to deal with this in PHP is just to rely on native date functions such as date(), strtotime() or strftime() and even PHP’s 5.3+ DateTime class. Then comes the problem of being able to store these dates in your database and being able to sort/search through the data based on these date/time stamps. Of course MySQL offers a native date type just as most RDBMS‘s. The much less obvious of these challenges is being able to deal with various timezones or daylight savings time offset changes. Since you’re dealing with the web you can have users from potentially any part of the globe and obviously 2 AM on Friday in New York isn’t 2 AM on Friday in Los Angeles, London or Tokyo, for example.

The Solution

Well, if you don’t want to have to deal with customizing date format or timezones for date/time stamps to each user based on custom settings (e.g. each user specifying their timezone or location) you can always use ordinal time formats, which makes it a lot more universal and a lot easier to handle. While 2 AM on Friday in New York is actually 11 PM on Thursday in Los Angeles, it can also vary from New York to Phoenix, because Phoenix does not observe DST (Daylight Savings Time), while Denver (which is in the same timezone as Phoenix) does. So the easiest solution, would be to just simplify things more rather than complicate them further. No matter what time it is anywhere in the world, something that gets a timestamp on your server (according to whatever timezone you’re in) right now will have also happened right now in any other part of the world. So an email that was received 2 hours ago can simply be displayed to any user as “Received 2 hours ago“  and everyone can easily interpret that.

Of course the DateTime::diff in PHP makes these increasingly easy to do this regardless of timezones and formats. So I’ve written a short class to make use of such functionality that will make it very easy to build output for such diffs in various formats. It doesn’t currently extend the DateTime class, but it works with the extension quite well. Simply by feeding a DateTime object to the constructor of my class it can render the result for you with very little effort on your part. It accounts for proper language such as plural/singular, etc… hasn’t been thoroughly tested or documented so use the code at your own risk.

This class also makes it relatively easy to add up different intervals of time as string. For example, TimeInterval::__construct(’1 day, 6 weeks, 941 seconds’); does all the overloading for you and you can also repeat the same periods in the same constructor call. Because with TimeInterval you can specify exactly how you’d like the output back you can also overload virtually any period – unlike PHP’s DateInterval (though this class does not make use of ‘months’ unless a DateTime object is specified).

/**
* @class TimeInterval
* @package TimeInterval
* @copyright Sherif Ramadan
* @method changeTime() Change the time input to the object (accepts string and object).
* @method changeResponse() Change the response format of the object (output which intervals).
* @method changeType() Change the return type of the output ('text' [default], 'clock', 'hybrid')
*/
class TimeInterval {
        const intv_symbols = 'second,minute,hour,day,week,year';
        const intv_names = 'second,minute,hour,day,week,year';
        const intv_def = '1,60,3600,86400,604800,31536000';
        protected static $intv_name, $intv;
        protected static $sign = array('-','+');
        protected static $suffix = array('ago','since');
        public static $intv_sep = ', ';
        public static $intv_pad = 1;

        public function __construct($x = false, $y = null, $z = 'text') {
                $this->responseSeconds = false;
                $this->requestArray = array();
                $this->IntervalArray = array();
                $this->_defineIntervals();
                $this->_setIntvs((array) $y);
                $this->regex_Cons = array();
                $this->invert = true;
                $this->use_invert = true;
                $this->type = $z;
                foreach (static::$intv as $key => $val) $this->regex_Cons[] = $key;
                $this->regex_Cons = implode('|',array_unique($this->regex_Cons));
                if (is_array($x) || is_object($x)) $this->_translateDateTime($x); else $this->_translateTime($x);
        }

        private function _translateStringtoTime($x) {
		if (!preg_match_all('#((?P<intv>\d+)\s*(?P<prd>' . $this->regex_Cons . '))#i',$x,$match)) return false;
		foreach ($match['prd'] as $key => $period) {
			if (isset($arr[$period])) $arr[$period] += $match['intv'][$key];
			else $arr[$period] = $match['intv'][$key];
		}
                $this->requestArray = $arr;
                $this->responseSeconds = $this->_translateTimetoSeconds($this->requestArray);
	}

	private function _translateTimetoSeconds($x) {
		if (!is_array($x)) return false;
		$n = 0; $t = 0;
		foreach ($x as $key => $val) {
		        $key = strtolower($key);
	                $n += (static::$intv[$key] * $val);
		}
		return $n;
	}

	private function _translateSecondstoTime($x) {
	        $arr = array();
		foreach ($this->intvs as $key => $val) {
		        $arr[$key] = sprintf('%00' . self::$intv_pad . 's',floor($x / $val));
       			$x -= $arr[$key] * $val;
			if (!$arr[$key]) unset($arr[$key]);
			else {
			        $this->IntervalArray[$key] = $arr[$key];
			        $arr[$key] .= (isset(static::$intv_name[$key])) ? (' ' . static::$intv_name[$key] . (($arr[$key] != 1) ? 's' : '')) : '';
			}
		}
		$out = implode(self::$intv_sep,$arr);
		return $out;
	}

        private function _defineIntervals() {
                $symb = explode(',',self::intv_symbols);
                $name = explode(',',self::intv_names);
                $def = explode(',',self::intv_def);
                static::$intv_name = array_combine($symb,$name);
                static::$intv = array_combine($symb,$def);
        }

        private function _setIntvs($y) {
		if (empty($y)) $this->intvs = array_reverse(static::$intv, true);
                elseif (is_array($y)) {
                        $m = array();
                        $this->intvs = array_reverse(static::$intv, true);
                        foreach ($y as $n) if (isset($this->intvs[$n])) $m[] = $n;
                        foreach ($this->intvs as $key => $val) if (!in_array($key,$m)) unset($this->intvs[$key]);
                }
        }

        private function _translateTime($x) {
                 $this->use_invert = false;
                $this->_translateStringtoTime($x);
                $this->updateResponse();
        }

        private function _translateDateTime($x) {
                $str = NULL;
                foreach ($x as $y => $z) {
                        if ($y == 's') $y = 'second';
                        elseif ($y == 'i') $y = 'minute';
                        elseif ($y == 'h') $y = 'hour';
                        elseif ($y == 'days') $y = 'day';
                        $str .= $z . $y;
                }
                if (!empty($x->invert)) $this->invert = false;
                $this->_translateStringtoTime($str);
                $this->updateResponse($x);
        }

        private function updateResponse() {
                switch ($this->type) {
                        case 'clock':
                                static::$intv_sep = ':';
                                static::$intv_pad = 2;
                                static::$intv_name = array();
                                break;
                        case 'hybrid':
                                static::$intv_sep = ', ';
                                static::$intv_pad = 1;
                                $this->_defineIntervals();
                                if (in_array('second',array_flip($this->intvs))) $this->_setIntvs(array('day','hour','minute','second')); else $this->_setIntvs(array('day','hour','minute',));
                                $this->_translateSecondstoTime($this->responseSeconds);
                                if (!empty($this->IntervalArray['day'])) $append = $this->IntervalArray['day'] . ' day' . (($this->IntervalArray['day'] != 1) ? 's' : NULL) . ', ';
                                $h = (!empty($this->IntervalArray['hour'])) ? $this->IntervalArray['hour'] : 0;
                                $m = (!empty($this->IntervalArray['minute'])) ? $this->IntervalArray['minute'] : 0;
                                $s = (isset($this->IntervalArray['second'])) ? (':' . sprintf('%002d',$this->IntervalArray['second'])) : NULL;
                                $clock = sprintf('%002d',$h) . ':' . sprintf('%002d',$m) . $s;
                                if (!empty($append)) $append .= $clock; else $append = $clock;
                                break;
                        static::$intv_sep = ', ';
                        static::$intv_pad = 1;
                        $this->_defineIntervals();
                }
                $response = $this->_translateSecondstoTime($this->responseSeconds);
                if ($this->use_invert) {
                        $bl = (int) $this->invert;
                        if ($this->type == 'text') $response .= ' ' . static::$suffix[$bl];
                        else $response = static::$sign[$bl] . $response;
                }
                if (!empty($append)) $this->time = $append; else $this->time = $response;
        }

        public function changeTime($x) {
                if (is_array($x) || is_object($x)) $this->_translateDateTime($x); else $this->_translateTime($x);
        }

        public function changeResponse($y = false) {
                $this->_setIntvs($y);
                $this->_translateTime($this->time);
        }

        public function changeType($x = 'text') {
                $this->type = $x;
                $this->updateResponse();
        }
}

Example 1

$start = new DateTime('now');
$end = new DateTime('2/19/2011 4:43 PM');
$diff = $start->diff($end);

$time = new TimeInterval($diff);
echo "{$time->time}\r\n";

/*
######## OUTPUT ########

5 weeks, 2 days, 9 hours, 49 minutes, 4 seconds ago

*/

Example 2

$start = new DateTime('yesterday');
$end = new DateTime('9/7/2009 9:22 AM');
$diff = $start->diff($end);

$time->changeTime($diff);
$time->changeResponse(array('year','day','hour')); // You can change the response by using the changeResponse method
echo "{$time->time}\r\n";
/*
######## OUTPUT ########

1 year, 201 days, 14 hours

*/

Example 3

$newtime = new TimeInterval('312984second'); // We can overload the seconds as much as we want and still get the output in whatever format we'd like.
echo "{$newtime->time}\r\n";
/*
######## OUTPUT ########

3 days, 14 hours, 56 minutes, 24 seconds

*/

Example 4

$uptime = sprintf('%d',array_shift(explode(' ',shell_exec('cat /proc/uptime')))) . 'seconds';
$time = new TimeInterval($uptime,array('day','hour','minute','second'),'hybrid');
echo $time->time;

/*
######## OUTPUT ########

22 days, 14:05:12

An example of how to get your system uptime with TimeInterval Class...

*/

The out put here would be something similar to this format, which is the same format top would use, without the seconds. You can actually chose to specify the seconds or not, in the constructor array or changeResponse, and get just the days, hours and minutes. The hybrid format will override any other formatting to overload at days and only use hours & minutes (with the option to specify seconds). Otherwise you can use the ‘clock’ or ‘text’ (default) types for formating with whatever response format you’d like back.

Developing Software as a Service: SaaS

Today we expect to use our computers anywhere we go and relatively we expect to use our software anywhere there is a computer. This introduces a few key problems for the developer which we aim to solve with SaaS or Software as a Service! This model indicates that the software can be introduced to virtually any environment (Operating System, Application Programs, Devices, etc…) from anywhere in the world and can still run effectively and efficiently. Much different from the SWS or Shrink Wrap Software we’ve known just a decade ago, this makes things a lot more convenient for the end-user, but a lot more daunting and challenging for the developer.

We begin by understanding what the end-user needs and wants. We’ll take e-mail as a common example. Almost everyone today has an email, sends emails, and receives emails on a daily basis. The problem that email introduces is that it requires a certain piece of software called the Email Client in order for us to be able to send and receive emails from our computer. Downloading and installing an email client is simple enough and we’d all agree there’s no shortage of them these days. All the major operating systems come bundled with some default client. The problem takes place when you happen to be at a computer that doesn’t have an email client or has one that isn’t configured to access your particular email account. Rather than having to configure every email client you come in contact with when trying to access your email from a different computer we turn to a simple solutions like Web-Mail.

It’s not entirely uncommon that you may have to access your email from a number of different computers or devices in a given day. I may read my email from my desktop at home when I wake up, later on my desktop from work, later on perhaps I might check my email from my laptop during lunch at the local hot spot from the restaurant across the street from my office. Perhaps, later I may chose to check my email from my mobile phone on my way home. I might need to check my email from a friends house on their computer, my public library, a public computer at a hotel lobby where I’m staying for a short duration or even from a neighbors house. These are all fairly common scenarios and pretty realistic. So to spare the user the trouble of having to carry around physical media containing a copy of their email client software or having them download it to every computer that doesn’t provide an email client we turn to SaaS solutions such as Gmail, Y!Mail or Hotmail, right? Additionally, because the software is readily accessible from any computer with an Internet connection the user can retain all their configurations, personal settings, and other various bells and whistles no matter which computer they use to access the service.

So how does a developer build their software as a service? The concept started when the web began to take its web 2.0 form and beyond. The problem with HTTP (the protocol commonly used for the web) is that it’s stateless. One HTTP connection does not necessarily effect another and thus web pages don’t really care to retain a state beyond each visit. This may be fine for the person surfing Wikipedia to read an article on Economics, but it presents a problem for someone interested in writing or editing that article on Wikipedia or someone who is shopping on eBay or amazon. Things like URL queries were originally introduced early on during the development of the web to resolve the stateless problem of HTTP over the web. This became ugly and insecure and so cookies were later introduced to retain a client-server mechanism for retaining state or creating sessions.

When you visit an online store like amazon it knows who you are because your browser stores a cookie given to you by amazon that helps identify you upon each visit. Every time you visit a page on amazon.com your browser sends amazon’s web server something called an HTTP Header Request. The web server then processes this header, which includes your cookie and some other useful information about your visit, and returns something called an HTTP Header Response. The response will normally include the same cookie or a slightly modified cookie that you originally sent in your request along with some other useful information about the page that amazon is serving up to your browser. The header is then followed by the web page which may be made up of several components such as scripts, cascading style sheets, XHTML, etc… All of this takes place behind the scenes but the result is that you now see a page customized just for you. You can see the items in your shopping cart or on your wish list because you retain a cookie that helps identify you to amazon’s servers.

The real magic that takes place in SaaS is all in the environment where your servers hold the key components to your software. Naturally there is a web server that constantly listens for traffic on a specific port (usually port 80 for HTTP traffic) and responds to those requests in various ways. The most basic outline of a SaaS setup is depicted in the diagram below.

Three Tier Architecture

Three Tier Architecture

We have Peter sitting at home at his computer and logging in to amazon or his favorite e-store of choice. His request is routed across the network and arrives at our web server that’s readily listening for traffic and accepting incoming requests for further processing and routing. The web server receives Peter’s request header which is requesting a URI, say the home page of amazon for example, and expecting a response back. The header tells our web server what client software the request came from (this could be used later to produce output in a specific format based on different client software necessities), what the time stamp is, what the request method is (whether Peter just wants to GET the contents of a page, POST information to a page, etc…), what cookies are being presented (if any), and some other basic information to help us process the request.

The web server then sends the request to your application layer. This layer could be made up of a number of things like parsers, frameworks, libraries, etc… The application or Web App will do some processing, which is where all the magic takes place. At this point we are basically constructing web pages on the fly. For example, the web app might take the contents presented by Peters cookie in the request header and attempt to identify or validate this cookie. The cookie normally contains a session identifier (in the event the user has authenticated themselves to the web app in some prior request), which the web app can use to identify which session file on its server contains information specific to this user. Once the session file is identified it can obtain data about the user such as their name or username and use this data in producing the output for the page. Because sessions are normally short-lived, however, the data in them is rather meta in nature. It probably only holds some pointer or reference to a more immutable object. This is where our database server comes in or what developers usually refer to as the Persistence Tier or Storage Tier.

It’s at the persistence tier that our long-term data is retained for the web app to make full use of things like our shopping-cart, our profile information, etc… The web app now has enough information to go on that it can look up Peter’s profile in the database and determine things like new items in stock that he may be interested in buying based on his past shopping habits, billing information that may need updated like expired credit cards or profile information like his name. The application layer (in blue – also referred to as an App Server) communicates to with our database server (or persistence layer) to retrieve this information and assembles the data in a format that is readable to the client. This means putting together things like HTML and CSS that will produce a neatly formated page for the client to view in their browser. It sends this back to the webserver (normally the application layer is just an abstraction layer for the developer to work with that is integrated as a web server module) which may need to make additional requests for things like images, vidoes, or other files stored on the file server and further assembles the request to send back to the client over the network. When this two-way pipe is cycled completely Peter is able to see amazons homepage with a custom greeting that includes his name and shopping cart contents (assuming he had anything in his shopping cart from his last visit) and any additional messages amazon may want peter to see (like some on sale items he may be interested in).

So, because Peter isn’t going to have much of a problem using this online store from his desktop PC or his laptop or even his iPhone, the SaaS model gives him the flexibility he expects as a user to utilize the service from anywhere and at any time without a need for reconfiguration or extensive local software dependency requirements.

As a developer you may be using a web server, file server, and database server all on the same machine or you may have all or part of them on separate machines. Your application may even require clusters of these servers together to function on high availability needs. For those of us whom are small-time developers working on minuscule projects we may have questions like what production or development environment is most suitable for building and testing my Software as a Service? What hosting companies should I turn to or what should I look for in a hosting company for my SaaS development needs? These are some pretty good questions and I’ve run into some similar problems myself so I’ll attempt to share my advice based on past experiences.

The Development Environment

When it comes to development you need to accommodate for you and every other developer that will be working with you on the project. Whether you’re going to use Ruby, Python or PHP you’ll need a basic web server setup with some plugin module for your language of choice or the executables necessary for your servlets or what not. Because languages like Java work in a sandboxed fashion where they run on top of your hardware layer these requirements may vary from one purpose to the next. It’s a good idea to have some revision control system in place if you’re working in a team or just for your own purposes. If you setup a development environment locally, whether that’s in addition to some larger development environment for your team or because you’re the only developer, just keep in mind that others may join you later so make it easy to reproduce your environment. You may also need to move this environment later and this means knowing what servers were setup with what configurations. Try to produce a dev environment similar to the production environment you expect this application to be launched on so that you can minimize any potential conflicts between development/staging and production. It’s OK to just run a local WAMP/LAMP stack and keep everything sandboxed (I do it all the time).

The Production Environment

Most development teams are actually quite small (usually in the order of just three to six people per team). If we divide the production environment responsibilities and roles to individual teams we can have teams look after their designated application layers and their respective servers without much fuss. When you’re only a one-man-show you’re probably going to be doing everything from one server yourself.

In this sense learn to keep everything tidy. If you’re using PHP look for mod_php setups (if you’re using apache) rather than CGI setups. Spawning different instances of PHP for every single request is inevitably going to slow down your server upon high loads. Look for web hosts that offer easy admin tools like Workbench or phpMyAdmin for MySQL. If you’re comfortable just doing everything from the command line that’s great. Remember that your SaaS application may grow to have many many abstraction layers and keeping track of all the portability may be a daunting task so don’t put together systems that are doomed to fail as you build up and scale out. For example, a database layer with a single point of failure is going to be very bad as it will likely take down your application entirely. In any case having some graceful way to fail – albeit losing your persistence tier is still going to render your application useless – is something you need to plan. Have many backup plans in effect. All too often I see developers struggling to restore catastrophic data meltdowns due to failed hard-disks that didn’t have RAID backups or complete services going down due to running all servers on the same machine. That machine may need to go down from time to time for maintenance or patching. Separating the database servers from the web servers and even the file servers and keeping the OS on a separate disk can help resolve many of these problems and eliminate single-point-of-failures on most common systems. Off-site backup is also not entirely out of the question even on a small system and most web hosting companies will offer this at an additional cost.

If you’re going to choose a web host that doesn’t provide 24/7 support in the event of critical service outages or just or in any event know that while you may not see that short window outages as a problem your SaaS app can be in request at any given moment of the day by any number of users from anywhere in the world and this only amplifies your problem. When a user was in the middle of a checkout process or saving an email draft to send out and is unable to complete that request for even ten or twenty minutes that can seem like an awfully long time in user space. Remember that desktop software means I can use my software any time without interruption and like-wise software as a service should not lack in that area.

Developing Your own Search Engine

I recently decided to explore the possibility of developing my own search engine. This involved understanding a few key concepts of how a search engine fundamentally works as well as some independent research on what popular languages would be suitable for such a project. It seems that while using a basic LAMP stack might prove too faulty on a large scale search engine it is not impossible. Building a basic web crawler with PHP/MySQL may not be as horrible of an idea as one might think!
It’s only mid-December and already I can feel the on-slew of cold as winter approaches. With a few inches of snow piling up outside my window, I warm myself up with a fresh cup of coffee and plug back in to my laptop; Yet, again I am exploring the wonderful world of search.
We’ve come a long way from the Google Search of 1999 and the Yahoo Search of the past. Bing has recently stepped onto the web scene (as a mass-marketed re-branded web solution). Google, still tops the lot in sheer volume of search traffic, but aren’t we forgetting about all of the other search engines on our massive list of web search?
The reason it sparked my interest to find out more about some of the requirements and specifications necessary for developing your own search engine was due to a recent project that involved seeking out and identifying broken links on a web site. A part of the project involved writing a small light-weight bot that would crawl only the internal links that pointed to your web site and detect any broken links in order to notify the web master. This is actually one fundamental part of what a search engine does. In my efforts I needed to find the most effective and efficient methods to crawl a site internally in order to find all of its internal links without using any site-map or file-system structures. It needed to be done strictly over HTTP and this is what led me to some key search engine and web crawler articles.
The most difficult part of writing this bot was having to rely solely on the links that the web pages of the site provided. Since a dynamic or database-driven web site can potentially have thousands of pages it proved more difficult than I had originally anticipated to map out this link structure directly over an HTTP crawl.

HTML Parsing
The first challenge was trying to obtain all the links provided by the <a> tags in the page. At first I thought to use a regex search pattern, but later on it proved more ostentatious than necessary. So, instead I turned to a common solution using the PHP libxml DOM object. With DOM you can easily obtain all of the HREF properties of the <a> HTML tags on the page through the getElementsByTagName method. Once we have the specific HREF attribute of that tag we can iterate through the object and produce an array for all other tags.
Some of the problems further presented by parsing the HTML could involve javascript, improper markup, or other external requirements. In the spirit of keeping things simple I chose to ignore these underlying factors.

The Domain Name Space Problem
The second challenge was identifying which of the links on the page were pointing to another page on the same website and which were pointing to external websites. Domain name-space is so vast that it has become increasingly difficult to identify the validity of a domain without having to rely on DNS. Since TLDs have grown so tall and stretched so wide that they now encompass UTF-8 characters and even CTLD catenation, many of my test methods have failed. Again, for purposes of simplification I chose to ignore the one-off problems and deal with the most common scenarios and their proven solutions.
I solved this problem by writing a very simplistic function to help me break down the individual components of the URL into some very distinct properties and return them in an array. The code below demonstrates the basic functionality of this method.

function breakURI($URI) {
  // This function identifies four key components of any URI that's the protocol, the domain, the port and the path.
  // The domain can be further broken down in to tld and sub domains, and the path can be further divded in to path + query string
  $pattern = "/^(?<protocol>[a-z]+:\/\/)?(?<domain>[a-z0-9\.-]*\.[a-z]{2,})(?<port>:\d+)?(?<path>[\/|\?].*)?$/i";
  if (!preg_match($pattern, $URI, $match)) return false;
  // Get the query string from the path (if any) and move it to it's own element in the array
  if (isset($match['path']) && strstr($match['path'],"?")) {
    $match['querystring'] = substr(strstr($match['path'],"?"),1);
    $match['path'] = str_replace('?'.$match['querystring'], null, $match['path']);
  }
  if (isset($match['port'])) $match['port'] = substr($match['port'], 1);
  // Verify that the domain is indeed a valid FQDN and classify any of its sub-parts
  $match['subdomains'] = explode(".",$match['domain']);
  $match['fqdn'] = $match['domain'];
  $match['tld'] = array_pop($match['subdomains']);
  $match['domain'] = array_pop($match['subdomains']);
  foreach($match['subdomains'] as $domain) if (strlen($domain) < 1 || $domain == '.' || $domain === null) return false;
  if (substr($match['domain'], 0, 1) == '-' || substr($match['domain'], -1) == '-') return false;
  $keys = array_keys($match);
  foreach ($keys as $var) if (is_numeric($var)) unset($match[$var]);
  if (!$match['protocol']) $match['protocol'] = 'http://'; // Protocol is always assumed if not supplied
  if (!isset($match['path']) || strlen($match['path']) < 1) $match['path'] = '/'; // Path is always assumed to be root if not path is supplied
  $add_port = (isset($match['port']) && $match['port'] && $match['port'] != "80") ? ':' .$match['port'] : null;
  $match['URI'] = (isset($match['querystring'])) ? $match['protocol'] . $match['fqdn'] . $add_port . $match['path'] . '?' . $match['querystring'] : $match['protocol'] . $match['fqdn'] . $add_port . $match['path'];
  return $match;
}

Since all web crawlers need to start somewhere it was easy enough to work from a web sites home page and then identify relative and absolute links from there. It’s assumed that the homepage should contain links to all of the key areas of the web site and that all other pages should point back to the home page. This also led me to a deeper understanding of a primary component in SEO. Search engines like web sites that make it easy to discover their site-map through proper linking.

Link Resolution
I had to make a few basic assumptions in sifting through the links of each page with my crawler. Whenever a link leads to a ‘javascript:’ or ‘mailto:’ it is safe to ignore it for the purposes of crawling the site since it is not likely to point to anything in which the crawler will be interested. Whenever the link does not contain a domain name it is not likely to be an internal link. Since some websites will use a relative-path link for some or all of their pages, I needed to ensure that those paths were properly used in forming the complete URL. Anything with a relative-path is obviously considered an internal link. For those links that contained an actual domain name or full URI we needed to determine if the domain name of the web site was a part of this URL to ensure that it was still pointing internally to the same web site. In my test-runs, sub-domains and IP addresses proved somewhat problematic to my methods. I worked around this by simply resolving all IP addresses to their rDNS (where available) or just ignoring it all-together. Allowing the crawler to blindly follow external links was, no-doubt, a bad idea since this would lead to an overload of my modest development server. This is why I had to be very careful in making sure the crawler was only following internal links.
During my early test runs the bot had actually followed a twitter link, which caused it to crawl thousands of pages on twitter and many other sites that just flooded my db with hundreds of thousands of external links. Needless to say I learned that such a bot would need constant monitoring when working in a small development environment that might not be able to handle that much traffic.

Indexing: The Most Cumbersome Search Engine Task
An actual crawler-search-engine is made up of more than one part. It usually has the crawler, which is the automated bot that follows the links on each page to explore a significant portion of the web. It also has an index, which stores the information about the web pages the bot has crawled and sorts them in terms of components that make up its indexing algorithms. The simple bot I described earlier was just one basic part of a search engine, but it was the most relevant part to my particular project.
Later, I decided to take my interests a little further by starting a personal project on the side that involved developing a simple indexing algorithm. This is where everything got really interesting! It appears that different search engines have different ways of indexing the web.
When we use a search engine like Google to search for things that interest us we aren’t really searching the web, but a sort-of copy of the web that is stored in Google’s index. This index doesn’t actually contain every web page on the world-wide-web, but a rather significant portion of it (billions and billions of web pages). It is significant enough so that we can safely assume that when we search Google we are in fact searching the web. This index is also updated constantly, but it isn’t always 100% accurate. Sometimes search engines will fail in finding the most relevant information for your search. The reason for this is that there is so much information on the web and so little information is provided by your search query in order to accurately determine the most relevant results in a reasonable amount of time and without causing the user to supply additional information about their search.
Search engines aim to solve this problem by developing various indexing algorithms that ask additional questions – not to the user, but to the pages in its index. These questions focus mainly on where your search terms fall on the web pages found to contain those terms and how frequent they are on those pages. That’s the most fundamental aspect of most indexing algorithms, but many search engines today take it a lot further. Google, for example, will attempt to exclude your pages from its index if you use terms or phrases too often (possible spaming); Use them too many times and the search is not likely relevant; Use them too few times and the search is also not likely relevant; Use them in your title, but not your URL and the search is likely ranked lower by the indexing algorithm.
At this point what I had was a very simple bot that constantly crawled through all the pages on a particular web site and stored the pages in a MySQL database. There was another script that would sort this index using only a few factors (namely: key-word location, frequency, and external link relevancy). This could be either achieved through writing your own daemon or just simply going about it the easy way, like I did, and using a custom script run by a cron job. I noticed how quickly the index could grow if I let the bot crawl external links. My database went from just a few Megabytes to several Gigabytes within just a couple of weeks.

The Search
Now, all I needed was an actual search component. A script that would return results for a particular search phrase by gathering things like Titles, Web Snippets, Links, and Cached Pages. This is all very similar to what Google does albeit on a much-much smaller scale. I didn’t intend to write a fully functional search engine so some of my techniques are actually quite error-prone and most will probably fail the scalability test. Though I think for a personal project I’m quite psyched about taking up building a search engine further. I may consider one day releasing the code and open-sourcing the project through some distributed source revision control system like git or mercurial. Who knows? For now I’m just happy to share some of my minuscule findings.

Other Uses
So what possible reason would there be to develop your own search engine? It’s true that trying to do what Google has already done falls in the cliché of “re-inventing the wheel”, but that shouldn’t stop us learning from, experimenting with, or redesigning the wheel. There could be a number of other practical applications for these techniques. For example, one might need a search engine for a private Intra-web. You may have a need to index the web in a different manner than those proprietary methods depicted by Google, Yahoo, or MSN. Smaller search engines today that do not get as much traffic as the big search engines can still be found very useful in some niches.