Sherif's Tech Blog

Just another guy on the Internet with a keyboard…

Viral Videos and the Web

There’s a lot of power in viral videos on the web today. You can express an opinion or a thought and deliver it to millions of people around the world with just a few minutes, or even seconds, of video. YouTube didn’t become popular because it did something revolutionary with video or because it developed any significant technology that made video better on the web during its early days. In fact, it grew too large too quickly to keep up with the demand of its users, and so it took the deal with Google to secure funding and the backing of a web company that had the infrastructure necessary to expand the service. YouTube did, however, put the power of speech (or in this case video) back into the hands of the average person.

Television networks had been broadcasting what they saw fit for decades before the web and the Internet ever came along. They do some market research, try to figure out what people want to watch and which forms of entertainment are most in demand, and then try to figure out a way to produce that content and broadcast it at a profit. There’s a key difference between that and what YouTube did. Sure, there are plenty of users on YouTube who will still try to submit some copyright-infringing video clip that some big-time production studio will try to get taken down, but there are also plenty of videos on YouTube that are completely user-generated content. Take “Charlie Bit My Finger”: as of today that video has gotten more than 377 million hits in just around four and a half years. That’s an average of 2 to 3 hits per second for four and a half years. So why are people clicking on Charlie Bit My Finger two or three times per second for years to watch a child and an infant in a home video? For the same reasons millions of people watched America’s Funniest Home Videos on broadcast network television for decades. Except that it isn’t America’s Funniest Home Videos anymore, it isn’t owned by any network, and there isn’t a TV Guide listing for when the show will air. There also isn’t a studio stepping in to edit your video. Individual users choose to share their own videos and the whole world decides for itself whether to watch.
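
The "2 to 3 hits per second" figure is easy to sanity-check. Here is a quick sketch in Python; the function name is made up for illustration, and the only inputs are the two numbers quoted above.

```python
# Rough check of the average view rate quoted above.
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60


def hits_per_second(total_hits, years):
    """Average hits per second over the given number of years."""
    return total_hits / (years * SECONDS_PER_YEAR)


rate = hits_per_second(377_000_000, 4.5)
print(f"{rate:.2f} hits per second")
```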

Now, I’m going to assume that since hundreds of thousands of people have voted this video up over the years (or “liked” the video more than they “disliked” it) that the general audience finds this entertaining. But we didn’t have to pay an executive a six-figure salary and hire a marketing team to spend tens of thousands of man hours to figure this out in order to get there. It just works… Because people let it work.

So viral videos do have a significant impact on the web today. We apparently like watching other people, or at least watching some creative expression of what they have to say, whether that’s serious, comedic, or for any other purpose. Sites like GoAnimate.com are pretty popular today, allowing you to easily make and distribute your own animated videos and take your blog viral. I just wanted to include a brief demonstration of how easy it is to put your own videos on the web these days. It took me exactly 3 minutes to sign up for a free account at GoAnimate.com, produce this video, and post it on my blog. But shooting your own home videos can be just as easy.



GoAnimate.com: Facebook Changes Everything by GoogleGuy


How To Build A Photo Sharing Application On The Web The Right Way

Personally, as a web developer, I’ve come across a number of clients that seem interested in doing some sort of web-based photo sharing application like flickr, imgur, or photobucket. These are all pretty popular services on the web that allow you to share your photos with friends, family, colleagues, etc. There’s no doubt people love sharing their photos online. Just take a look at facebook, possibly the world’s largest online photo sharing application, which claims to get over 100 million uploads per day from its now more than 800 million users. With cameras found standard in such personal devices as phones, notebook computers, desktop computers, various other hand-held devices, and even (and don’t ask me why) TVs, it’s no wonder we find it easy to store lots of digital photos and inevitably share them with others.

So I had to think long and hard about how I’d build an application or service like this so that it made good use of photos and made them easier to share and more accessible. The first thing that came to mind was checking out all the features these other services already had to offer and how they put them to good use. flickr allows you to do stuff like geotagging, where you can tell people where the photo was taken. I remember either facebook or some app I may have tried a long time ago being able to do this as well. Since I log into my facebook account maybe two or three times a year I couldn’t say for sure, but what I do know is that facebook certainly got one thing right, and that was not taking for granted their users’ demand to share photos with friends. When you’re able to point out who is in your photo, that makes the information just that much more valuable and makes the application better at keeping photos useful and accessible. There’s also a search value in tagging.

But it can be pretty boring for a user to have to sit there and manually enter all the information about each photo, so we can rely on things like Exif, where information embedded in the photo can be extracted by computers. You can get such information as the GPS coordinates of where the photo was taken, a time stamp of when it was taken, the camera make and model, whether or not the flash went off, and even various other things such as focal length, exposure time, shutter speed, etc. You can learn a little more about the Exif specifications here. However, keep in mind that not all cameras provide this format and not all of them are equipped to fill in every part of the Exif header. Newer phones like the iPhone 4 are GPS capable and can embed GPS data into your photos if the GPS is turned on. There are also many digital cameras that either come with a GPS device built in or can use one purchased separately.
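
One detail worth knowing if you plan to put Exif GPS data on a map: Exif stores latitude and longitude as three rational numbers (degrees, minutes, seconds) plus a reference letter (N/S or E/W), not as a single decimal value. Here is a minimal sketch of that conversion in Python. The function name and the sample coordinates are made up for illustration, and actually reading the tags out of a file (with whatever Exif library you prefer) is left out.

```python
from fractions import Fraction


def dms_to_decimal(dms, ref):
    """Convert Exif-style ((deg_num, deg_den), (min_num, min_den), (sec_num, sec_den))
    rationals plus a reference letter into signed decimal degrees."""
    degrees, minutes, seconds = (Fraction(num, den) for num, den in dms)
    value = degrees + minutes / 60 + seconds / 3600
    # South latitudes and west longitudes are negative in decimal notation.
    if ref in ("S", "W"):
        value = -value
    return float(value)


# Hypothetical values, shaped like the GPSLatitude / GPSLongitude tags.
lat = dms_to_decimal(((40, 1), (26, 1), (4600, 100)), "N")
lon = dms_to_decimal(((79, 1), (58, 1), (5600, 100)), "W")
print(round(lat, 6), round(lon, 6))
```

The decimal pair is what most mapping APIs expect, so this is usually the first step after extracting the tags.
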
Another feature that can make it easier for the user to enter information about each photo is facial recognition. This doesn’t have to be so sophisticated that it can automatically match faces found in other pictures and tell you who’s who, but it can be helpful to let the software detect whether a face exists in the photo and highlight it so that the user can simply type in who each person is for tagging purposes.

There are also some potential uses that I’ve found for OCR (or Optical Character Recognition) in photo sharing. If you can manage to extract enough significant text from the image, you might be able to make photos easier to locate through search. This is probably not going to be easy, given that what little time I did spend fooling around with various OCR software showed that the tools I tried have many limitations and still feel like they are in the early stages of development. Mostly OCR has trouble detecting text if the font size changes within the image, if the text is heavily skewed, or if the text is rotated so that its orientation is not top-to-bottom and left-to-right. It’s also difficult to detect hand-written text, or text surrounded by other images, logos, or too much depth or noise. Text written in languages that don’t use the Latin alphabet is quite a challenge as well. If it’s not a scanned image coming straight from the page of a book using a single evenly-spaced sans-serif font with a fixed font size of around 12-20pt, it will prove rather difficult to get any decent results from the OCR software.

So to give you an idea of what some good features of a photo sharing application might look like I put together a small working demo. My example makes use of most of the features I’ve discussed here, but leaves much to be desired, of course.

You can see the working Photo Application Demo here.

Here is a sample photo with embedded Exif information including GPS to demonstrate.

Here is a sample photo that demonstrates the application’s ability to utilize some OCR techniques.

A Note About User Experience

OK, so I wanted to make sure this demo illustrates some of the basic functionality a user might expect in a photo sharing app. The first thing you’ll notice is that uploading large photos, and especially a lot of them, can be a boring thing to do over HTTP. That’s because HTTP is built on a request/response model: you send a request first and then you eventually get a response back. Now if your request happens to be a 10 MB photo, or even worse a 250 MB video (or worse still, you don’t have high-speed Internet or your ISP offers lousy upload speeds), well… that’s a long wait with no indication at all to the user as to what’s happening on the other end. And if the user is uploading a file they aren’t aware is too large for your server to accept, that can be a lot of waiting around for nothing. So I combined some javascript with a script I found online at phpfileuploader.com and, with a few modifications for security and improved user experience, hacked up a somewhat better interface for uploading multiple photos.

One thing you don’t want is to prohibit the user from being able to do anything else while they’re uploading their photos. Pop-up windows are just annoying, in my opinion at least, and forcing the user to wait until the entire upload is done before they can do anything else in the window is even worse. So if you try out my demo you can see that it doesn’t prevent you from browsing other photos on your computer to select for upload even while an upload is in progress. You can even cancel an upload mid-way or cancel all pending uploads at once. It will also alert you ahead of time if you’ve selected too many files at once or if your images are too large or of the wrong file type. However, none of these client-side checks can be relied on for security, since the user can easily bypass them. They are there to improve the user experience. All the real security work is actually done on the server side. Even if you were to rename foobar.exe on your computer to foobar.gif and managed to upload it, the server would still detect that it has an incorrect MIME type and reject the upload. This can be done with PHP’s Fileinfo extension, which checks the file’s MIME type against the magic database supplied on the server.
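
To show why renaming foobar.exe to foobar.gif doesn’t fool the server, here is a much simplified stand-in for that kind of check, sketched in Python rather than PHP. It is not what my demo runs (the demo relies on the Fileinfo extension). The function names and the three-format whitelist are assumptions for illustration. The idea is the same though: look at the first bytes of the file, never at its name.

```python
# Leading "magic" bytes of the image formats this sketch accepts.
SIGNATURES = {
    b"\xff\xd8\xff": "image/jpeg",
    b"\x89PNG\r\n\x1a\n": "image/png",
    b"GIF87a": "image/gif",
    b"GIF89a": "image/gif",
}


def sniff_image_type(data):
    """Return a MIME type based on the file's leading bytes, or None."""
    for magic, mime in SIGNATURES.items():
        if data.startswith(magic):
            return mime
    return None


def accept_upload(filename, data):
    """Accept only files whose contents look like an allowed image.
    The filename (and its extension) is deliberately ignored."""
    return sniff_image_type(data) is not None


# A Windows executable starts with b"MZ", whatever it has been renamed to.
print(accept_upload("foobar.gif", b"MZ\x90\x00\x03\x00"))
print(accept_upload("photo.gif", b"GIF89a\x01\x00\x01\x00"))
```

A real magic database knows far more formats and looks deeper into the file, so treat this as a picture of the principle and not a replacement for Fileinfo.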

Now, keep in mind that tracking the progress of the upload is just half of the battle. There’s also the part about viewing your photos after you’re done uploading them. In my demo I used a javascript library called easybox, which is based on the lightbox framework but plays nice with jQuery and seems to work a lot more smoothly, in my opinion. You can download easybox from Google Code and try it yourself. On the top right-hand side of my demo, where you see your gallery, you can click on any of the thumbnails and it will use easybox to let you view all of the photos in your gallery in a slide-show fashion without ever having to leave the page. You can also, of course, easily get a permanent link to each of your photos to share with others from the list of recently uploaded files below that. Your session will expire after 30 minutes of inactivity, however, and you will no longer be able to delete those photos. If you have cookies disabled you won’t be able to see what you’ve uploaded, but your photos will remain on the server indefinitely.

A Note About Using Javascript or Flash

One thing I wanted to point out is that your applications should definitely work better with javascript or flash or even Java Applets or whatever client-side components you may want to use to improve functionality and user experience. However, if your application can’t work at all without them (even with a degraded user experience or a limited set of features), then you’re not degrading your web application in a graceful manner. In my demo, for example, even though I’m using a lot of javascript and even some flash to make the upload process a lot more user-friendly, I made sure you can still use the application without javascript or without flash installed. You can test this yourself. I even tested it on the Links web browser (which is a completely text-based browser with no javascript capabilities) and I could still upload my photos and download them just fine with zero problems. This isn’t always possible, or easy, for every application, but it’s definitely a good idea to at least put in the effort to gracefully degrade your applications so that they remain somewhat usable in browsers of lesser capabilities. I even managed to get it to work on my phone (a Samsung), which is a pretty horrible phone with an even more horrible browser, but hey, having a bad user experience is still better than having no experience at all.

A Note About Links

So I wanted to say a little about how the link structure of an app like this should work. For one thing, you certainly want to provide permanent links to the uploaded photos so your users can share them with others. The link should be as short as possible. If your links are 500 characters long it’s probably not going to look that great when you paste them in an email or an IM window for a friend to take a quick look. However, facebook and flickr don’t seem to mind too much about how long the link is. I do have to point out that they probably store billions of photos and my system would certainly not scale for them.

With that said, in my demo you’ll notice every photo gets a random five character alpha-numeric (case sensitive) link that points directly from the application’s web root. This is the same system imgur uses for their gallery links as well, except that I noticed a few deficiencies in their method. For one thing, they don’t seem to care much about the extension you use to directly view the image. For example, if I upload an image to imgur I might get a link that looks something like http://imgur.com/abcde which would give me a web page with my photo and some information about it (much like you see in my demo), and the direct link to my photo would probably look something like http://imgur.com/abcde.jpg or whatever the file extension was. However, if I were to visit http://imgur.com/abcde.gif or even http://imgur.com/abcde.pngabc I would still be able to see my image. I can’t go to http://imgur.com/abcde.exe, though, because that gives me an image stating the requested image was not found or has been deleted. Upon some investigation I noticed that their servers return the Content-Type header based on whatever extension you supplied, as long as it starts with an extension they accept such as jpg/gif/png, and it doesn’t matter if it’s followed by anything else.

This is actually pretty bad, because the file itself still comes back in the exact format I uploaded it in, no matter what the header claims. So clearly they aren’t providing the same image in various formats, just conforming to some loosely thought-out rewrite rules. I have similar features in my demo, where the webserver (in my case Apache) uses rewrite rules and conditions to verify the requested URL and route the request to the proper PHP script, and with a little magic you have access to all of your photos from the webroot even though the image files themselves aren’t even stored in the same physical directory as the webroot on my server.

So far this demo has only been up for a couple of weeks as of this blog post, and at around 1,000 uploads and 30,000 views it seems to be responsive enough to suggest it could scale with a little work. I’m using GOCR for the Optical Character Recognition stuff, which is an open source tool released under the GNU General Public License, and you can visit their website here to download it or to get more information if you’d like.
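
To make the link scheme above a bit more concrete, here is a rough sketch in Python (my demo does this with Apache rewrite rules and PHP, so this is not the demo’s code). The names `new_photo_id` and `parse_photo_path` and the `stored_ext` parameter are hypothetical. It generates a random five character case-sensitive ID and then accepts a URL path only if it is exactly the ID, optionally followed by the one extension the photo was actually stored with.

```python
import re
import secrets
import string

ALPHABET = string.ascii_letters + string.digits  # 62 case-sensitive symbols
ID_LENGTH = 5


def new_photo_id():
    """Random five character ID; 62**5 is about 916 million possibilities."""
    return "".join(secrets.choice(ALPHABET) for _ in range(ID_LENGTH))


# The whole path must match: an ID, optionally followed by one known extension.
ROUTE = re.compile(r"/([A-Za-z0-9]{5})(?:\.(jpg|gif|png))?")


def parse_photo_path(path, stored_ext):
    """Return (photo_id, ext) for a valid path, or None.
    stored_ext is the extension the photo was saved with (looked up elsewhere)."""
    match = ROUTE.fullmatch(path)
    if match is None:
        return None  # rejects trailing junk such as /abcde.pngabc
    photo_id, ext = match.groups()
    if ext is not None and ext != stored_ext:
        return None  # rejects /abcde.gif when the file is really a jpg
    return photo_id, ext


print(new_photo_id())
print(parse_photo_path("/abcde.jpg", "jpg"))
print(parse_photo_path("/abcde.pngabc", "png"))
```

With only about 916 million possible IDs, a real service would also have to check each new ID against the ones already taken, which is one of the reasons this scheme wouldn’t scale to billions of photos.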

A Final Note About Photos

Well, that about covers what I wanted to say about sharing photos on the web and building applications that can do this nicely. Just keep in mind there is a lot to be done with photos that we have yet to uncover, so be prepared to make some good use of these features in building your own applications. I hope my demo gave you some ideas to work with. They are all feasible and not incredibly difficult to implement; as you can see, this brief demo only took me a few hours of work to put together and works rather well for its purposes. Sorry for the horrible interface, though. That’s one thing I didn’t have time to actually work on. Do let me know what your thoughts are on this subject and if you have any photo applications you’ve built or ideas you’d like to share.

Using PDO with MySQL in PHP

The PHP community has been making an effort to steer people away from the old mysql_* functions (an extension that is no longer well maintained by core PHP developers) and encouraging the use of newer and more feature-packed interfaces like MySQLi and PDO. MySQLi offers both object-oriented and procedural interfaces; PDO is object-oriented only, and as far as I know there are no plans for that to change. So if you aren’t familiar with the OOP paradigm this is a good time to start, in case you plan on using PDO in future development. It also encourages more modular development, where your implementation details can be hidden behind an interface. The old mysql extension has been discouraged for new applications for a while now, and since MySQL is a very popular database management system, a lot of developers used to the old interface have been giving some push-back about switching. Personally, I must admit that I have been using mysql_* functions for years with MySQL databases, so I can understand some of the hesitation in switching, but I’d like to address some of the benefits and trade-offs involved. At least the ones I’ve found to stand out most during my encounters with other developers this year.

PDO Is Database Agnostic

This is probably the main selling point for most developers, or at least the one I’ve seen come up most frequently since last year. The PDO extension in PHP is nothing more than a means to interface with various database-specific PDO drivers. In PHP the PDO extension supports a vast array of databases like Cubrid, FreeTDS / Microsoft SQL Server / Sybase, Firebird/Interbase 6, IBM DB2, IBM Informix Dynamic Server, MySQL 3.x/4.x/5.x, Oracle Call Interface, ODBC v3 (IBM DB2, unixODBC and win32 ODBC), PostgreSQL, SQLite 3 and SQLite 2, Microsoft SQL Server / SQL Azure, and 4D. This means you can use the same PDO functions to issue queries and fetch data from any of these databases. Some people were a little confused about this at first and seemed to think this implied you would not have to rewrite your SQL. This is certainly not the case. The SQL is still dependent on your database. However, you will not have to rewrite your PHP code should you choose to switch databases, use different databases in the same project, or just reuse a PHP abstract database class in another project with a different database. The PHP code works the same way whether I’m using MySQL, PgSQL, or any other database. Obviously my database’s SQL syntax may vary here and there, but I’m not required to do large refactoring of my PHP code.

Now, in all fairness, the likelihood of having to use different databases in the same project or switching databases in any project is fairly slim. So some developers don’t see this as such a great benefit. However, should it happen, you suddenly find yourself in a world of hurt if your application relies very heavily on the database and you have thousands of lines of code to rewrite. That is why it makes sense to just learn an extension like PDO and use it everywhere, so that you never have to learn another extension should you start a project that uses a different database than the one your existing extension was built for.

What PDO Is Not

Some developers also seem to think PDO is a full-blown database abstraction layer or even an ORM. Let me assure you that it’s not. You can’t actually perform any database functions using the PDO extension by itself. You have to install the database-specific PDO driver to access the database you want. These drivers implement the PDO interface, and thus you get to use those database-specific features as regular functions of the PDO extension. PDO is not a magical solution or a replacement for your DBMS. It doesn’t give you a tool in userland; it simply provides a data-access layer that userland code can build on. So if you had any of these misconceptions before, now is a good time to get rid of them.

Developers Migrating to PDO for the MySQL Database

If you’re constantly developing with MySQL databases like me, you probably either still use mysql_* functions in PHP or have tried or even switched to MySQLi. If you’re one of the few that have taken the leap to PDO, great! If you still have a few reservations, consider that whoever told you PDO is too complicated or harder to learn or use is lying to you. They may not even know they’re lying, but you ARE being lied to. First of all, PDO is no more difficult to use or learn than MySQLi or mysql_* or any of the other database extensions in PHP. The fact remains that the mysql_* functions have long been slated for deprecation, and they will be deprecated. Once the mysql_* functions are gone, you will be in a much more difficult position if you kept building new applications on them than if you only had legacy applications with a great deal of code relying on the old extension.

The truth is the old extension for MySQL is no longer well maintained. You’d be lucky if any of the core developers went back and made any major improvements or fixed any real bugs (should they happen to arise). The new MySQL Improved extension, MySQLi, is still being worked on and can expect further development, but with the old extension it’s just not likely to happen. There isn’t anything fundamentally wrong with the old mysql extension. The purpose of this blog post isn’t to knock mysql_* functions, but rather to encourage developers to explore the pros and trade-offs that an extension like PDO has to offer. This is mainly because I feel a lot of the developers I’ve come into contact with that have reservations about switching to PDO seem to have been completely misled or just never bothered to learn what PDO really does.

If you’d like a good tutorial to follow on migrating I recommend taking a look at the PDO Tutorial for MySQL Developers from the ops at hash php (##PHP on freenode on IRC). If you aren’t sure about just how easy it is to use PDO with MySQL I drafted up a small working example of using PDO with MySQL here as well. It’s using the MySQL world sample database and both the code and a working example are provided.

One of the first gotchas you want to look out for is specifying the character encoding for your connection. In my example above I’m using MYSQL_ATTR_INIT_COMMAND to SET NAMES and SET CHARACTER SET, telling MySQL to stick to UTF-8. This is so I don’t break the encoding during transport. Of course you still have to remember to specify the encoding to the client upon output, but this demonstrates how you can go about making sure the connection to the MySQL server uses the proper encoding, so that the escaping rules are updated accordingly. Remember, you had to do this with mysql_set_charset() in the old MySQL extension.

The first thing you notice when you start using an extension like PDO or MySQLi is that when you use prepared statements with bound parameters, you don’t have to worry about the escaping rules that matter so much in the old string-concatenation style of building SQL queries with mysql_* functions. PDO is actually separating the SQL from the parameters bound to that SQL. This not only prevents SQL injection (which mysql_real_escape_string() could do just as effectively), but it also makes it easier for the developer to work user data into their SQL. This is an underlying benefit I’ve come to admire very much with PDO.
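
The same separation of SQL and parameters exists outside PHP, and it’s easy to see in a self-contained script. The sketch below uses Python’s built-in sqlite3 module purely as a stand-in. It is not PDO and not my MySQL example, and the tiny `city` table is made up for illustration. The point is the same though: the query text contains only a placeholder, and the user’s value is handed over separately.

```python
import sqlite3

# In-memory database with a tiny, made-up table for the demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE city (name TEXT, country_code TEXT)")
conn.executemany(
    "INSERT INTO city (name, country_code) VALUES (?, ?)",
    [("Cairo", "EGY"), ("Boston", "USA")],
)


def cities_in(conn, country_code):
    """Look up city names; the value is bound, never pasted into the SQL."""
    rows = conn.execute(
        "SELECT name FROM city WHERE country_code = ? ORDER BY name",
        (country_code,),
    )
    return [name for (name,) in rows]


hostile = "USA' OR '1'='1"
print(cities_in(conn, "USA"))    # ['Boston']
print(cities_in(conn, hostile))  # []
```

The hostile string is treated as a (non-matching) country code and nothing more. With string concatenation, that same input would have rewritten the WHERE clause.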

How Much Storage Does the World Really Need?

The question of exactly how much storage space humanity needs is a rather tough one to answer, because both our needs and our technologies change on a rather erratic basis. However, I believe that a single Yottabyte of storage space is sufficient to store all of the world’s data. In order to thoroughly explain how I arrived at this answer I’ll have to show you where technology – around storage space – began, where it has arrived today, and further explore possible avenues it may head down in the future.

A few observations show how our needs and technologies change and shift our sense of how much digital storage we need. As an example, compression was not as prevalent in daily computer usage just a few decades ago as it is today. Now, thanks to advances in compression algorithms, we can store images with millions of pixels and video at incredibly high resolutions in just a fraction of the space it took only 15 years ago. Not only that, but we also rely on compression to transmit data faster over networks and between devices. In the 1990s not everyone cared to store or watch video on their computers, and people consumed nowhere near as much video as they do today. DVDs replaced VHS cassettes in just a few years and CD players have been largely displaced by hand-held devices like the iPod. So let’s review some history of storage technology and then jump into modern day storage.

Punch-cards to Magnetics to Flash

Punch cards (or punched cards) actually trace back to the 19th century, long before the invention of the modern PC, and were used for computer storage well into the 20th century. They are rather cumbersome to both produce (write) and consume (read). They would not be capable of efficiently storing nearly the amount of data the average person stores today. Later, as computers became more and more widespread in everyday use, we moved to magnetic tapes. The problem with this type of storage is that it’s slow and prone to failure. We then came up with a more resilient form of storage, also based on magnets, called HDDs (or hard-disk drives), which are made up of magnetic platters and heads that read and write to those platters. They were a lot more durable and lasted far longer than magnetic tapes, but more recently, in the past decade or so, flash drive technology has arrived. Its name says a lot about how it works, and it was derived from the same technology that made flash memory chips possible (like the one holding your BIOS). Data is written by flashing, which moves electrons through parts of the device so that certain gates hold or release a charge, and that charge is what represents the stored bits. This type of memory is fast and has no moving parts, but it still costs more per gigabyte than a hard disk, and each cell can only be rewritten a limited number of times.

Because flash drives wear down over time due to the nature of how they are built, we normally don’t rely on them for long-term or mission-critical storage. Let’s put it this way: you won’t find banks relying on flash drives to store their financial data any time soon. However, because storage is cheap and easily replaceable, we can expect that the average person will replace their hard drive once every three to four years, if not more often. This is about how long it can take for a hard drive to fail or start showing signs of failure.

The future may be in flash, but unless it can deliver the endurance and reliability that’s presently demanded of modern storage, it may yet be replaced by some other future technology. It’s hard to say where we will end up, but at present SSDs and SATA II hard drives offer the speeds and reliability most of us require.

What is a Yottabyte

First things first: if you’re not very computer savvy, a yottabyte is 2^80 bytes, or to put this in terms some of us may be more familiar with, it’s equivalent to about 1 trillion terabytes. To give you a perspective of just how much storage space that is, consider that the average notebook or netbook comes factory-standard with around 200-500 GB (gigabytes) of storage space and the average consumer desktop PC usually comes with a standard 500 GB to 1 TB of storage. Since it’s not unusual for many of us to own both a laptop (or notebook computer) and a desktop PC, we can assume that the average person normally has around a terabyte of storage space at their personal disposal from their combined personal computers. Consider that many of us also own smart phones these days, given the growth of the smart-phone market, and these devices can also come factory-ready with several dozen GB of storage. This means that it’s not unrealistic for the average person to consume more than a TB of storage just for personal use.

According to census estimates from the US and other governments there are about seven billion people in the world today. Now, certainly not all of them have access to or are capable of using computers. However, if we pretend that every man, woman and child on earth has access to a computer and is capable of using one, we can allocate a specific portion of storage space for their personal consumption based on some modern usage statistics we’ve come to know today. If we divide the yottabyte of storage space I estimated for humanity equally amongst every person on earth, we can allocate around 157 terabytes (counted in binary units, so strictly 157 TiB) of storage space for everyone’s personal consumption. This is about two orders of magnitude greater than what most people have access to in storage today.
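
Here is the division spelled out as a quick Python sketch, using the same round figure of seven billion people. The constant names are mine.

```python
YOTTABYTE = 2 ** 80          # bytes, using the binary definition from above
TEBIBYTE = 2 ** 40           # bytes in one binary terabyte (TiB)
WORLD_POPULATION = 7_000_000_000

bytes_per_person = YOTTABYTE / WORLD_POPULATION
per_person_tib = bytes_per_person / TEBIBYTE
per_person_tb = bytes_per_person / 10 ** 12  # decimal terabytes, for comparison

print(f"{per_person_tib:.1f} TiB, or about {per_person_tb:.0f} TB, per person")
```

The figure of around 157 comes from counting in binary terabytes; in the decimal terabytes printed on a drive’s box, the same share is a bit over 170 TB.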

What Does Everyone Store?

Everyone’s needs are different, but basically we can break down storage based on the types of media consumed on a regular basis.

  • Photos
  • Video
  • Audio
  • Software programs such as applications, games, etc…
  • Documents or raw data

This pretty much covers the broad categories we can fit data into when we consider personal usage. Everybody loves taking pictures today. That’s evident from the 800 million facebook members that upload photos to share online at a rate of over 100 million photos per day. People also love watching video on their computers. That’s also evident from the hundreds of millions of videos being uploaded to YouTube every month. There’s no doubt we love games and gaming consoles too, based on the millions of popular game consoles sold over the last few years. Another thing we seem to love almost universally is documents like email. Whether you store word documents on your computer for school, work, or other miscellaneous uses, you are likely to transmit them from one device to another at some point. We share text messages, IMs, and video conference on a regular basis. Many of us store tabular data at work, like spreadsheet documents. Of course everyone is in need of utility software like word processors, spreadsheet programs like Excel, presentation software, operating system software, video or photo editing software, and much more in order to make use of all this data being stored. So, if we evenly break up the allocated storage space per person into these 5 broad categories, we can say that roughly 32 TiB of physical storage should be plenty to retain each of these types of data.
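
For the curious, here is the per-category split as a short Python sketch, starting again from a yottabyte of 2^80 bytes and seven billion people. The variable names are mine.

```python
YOTTABYTE = 2 ** 80
WORLD_POPULATION = 7_000_000_000
CATEGORIES = 5  # photos, video, audio, software, documents

bytes_per_category = YOTTABYTE / WORLD_POPULATION / CATEGORIES
per_category_tib = bytes_per_category / 2 ** 40   # binary terabytes
per_category_tb = bytes_per_category / 10 ** 12   # decimal terabytes

print(f"{per_category_tib:.1f} TiB, or {per_category_tb:.1f} TB, per category")
```

That lands a little over 31 TiB, or a little over 34 TB in decimal units, so 32 is a reasonable round number to work with.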

Putting it into Perspective

To give you a more comprehensible picture of what this amount of storage can hold let’s look at it in relative terms we use today. Let’s see what we can store on 32 Terabytes of space:

  • Photos
    • We can store more than 4 million 20 Megapixel high-resolution photos
      • Consider that an iPhone or most smartphone cameras are only capable of around 2 to 5 megapixels and that a 2 megapixel image can fill a screen resolution of 1920×1080 pixels.
  • Video
    • We can store more than 7,000 hours of 1080p Blu-ray digital video.
    • Similarly that’s around 16,000 hours of DVD-quality video (720×480, or 480p) or about 8,000 DVD movies.
  • Audio
    • We can store roughly 50,000 hours of CD-quality Audio
    • With MP3 or AAC (MP4 audio) compression that can translate to around 150,000 hours. So basically you can store a music library of several hundred thousand titles with this amount of space. That’s likely more than most studios and radio stations keep on hand at a single location.
  • Software
    • We can also store thousands of computer programs and games – if not tens of thousands (more than the average person usually stores today).
    • Similarly, this is like being able to store Microsoft Windows 7 about 2,000 times over.
  • Documents
    • You can potentially store several million, or even billion, documents with this amount of space. Documents vary greatly in size, depending of course on how much data you store in each one, but we can still safely assume that if the average person stores no more than a few megabytes of data per document, this gives us plenty of room to store an archive of documents for an entire lifetime.
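Two of the figures in the list above can be reproduced with a rough sketch like this one. The CD audio rate is the standard 44,100 samples per second at 2 bytes per sample in stereo. The 8 MB photo size is purely my assumption for an average 20 megapixel JPEG, and I’m treating 32 terabytes as the decimal unit here.

```php
<?php
// Assumption: 32 TB taken as decimal terabytes (32 * 10^12 bytes).
$capacityBytes = 32 * 10 ** 12;

// CD audio: 44,100 samples/s * 2 bytes per sample * 2 channels, per hour.
$cdBytesPerHour = 44100 * 2 * 2 * 3600;

// Hypothetical average size of one 20 megapixel JPEG; real files vary a lot.
$photoBytes = 8 * 10 ** 6;

$audioHours = (int) floor($capacityBytes / $cdBytesPerHour);
$photoCount = (int) floor($capacityBytes / $photoBytes);

printf("CD-quality audio: %s hours\n", number_format($audioHours)); // 50,390
printf("20 MP photos:     %s\n", number_format($photoCount));       // 4,000,000
```

Change the assumed photo size and the photo count moves with it, so treat that number as an order-of-magnitude estimate.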

Now, I don’t know about you, but I actually don’t even have the time to take 4 million photos, let alone the time to spend looking through them. At an average rate of just 1 second per photo, viewing a photo album of that size alone would require nearly 7 weeks of looking through photos without time to eat, sleep, or even go to the bathroom. Extrapolate that to the amount of time it would require to snap that many pictures, store them, upload them, edit and organize them, and we’re probably looking at decades’ worth of photo archives for a single person. That’s plenty by my estimates. If you consider compression and storing these photos in even lower resolutions, we can push that number to nearly 100 million photos without sacrificing too much in quality.

Looking Into the Future

The actual problem with storage isn’t that we don’t have enough of it or that we can’t afford to get more, because storage is actually very cheap. To give you an idea, I bought a 2TB external USB hard drive a few weeks ago for just under $100, and you can get even faster storage media in the terabytes at incredibly low prices today. In fact, HDDs cost pennies on the dollar compared to what they cost just 8 years ago. The problem of storage today is that we don’t have the infrastructure in place to make it accessible fast enough. Consider that if you have a camera phone you may also have a digital camera and even a camcorder. You likely also have a desktop or laptop computer to which you transfer your photos and videos. It’s not enough to store this data locally; we upload and share our photos and videos online as well. On top of that, hard drives are prone to failure. It’s no longer a question of if a disk will fail, but when. So we create backups in multiple locations in order to preserve our data. Now we have the problem of data synchronization. How do you know which device or location stores the most up-to-date version of the file you’ve transmitted from one location to another? This is where some say storing your data in the cloud makes sense, because data centers are built with redundant storage capabilities like RAID arrays that can sync quickly and consistently. They also have disaster recovery plans in place, with offsite or tape backup facilities where necessary.

It’s not just about the hardware capability either; software capability plays an equally important role in keeping data in sync. When I store my word documents on Google Docs I have the freedom to modify and restore them from virtually any location or device as long as I have access to the Internet. I can even share them and modify them with others in real time and retain a revision history. This is great and it’s fast, but word documents are cheap. They don’t take up much space. We can’t say the same for other media like audio or video, which tend to take up far more space. If you don’t have a high-bandwidth connection you can’t exactly stream DVD-quality video in real time. What’s even more relevant is that you won’t find it feasible to transfer several hundred DVDs even with high-speed Internet like Verizon FiOS and Comcast’s Xfinity (capable of up to 50Mbps and 10Mbps download and upload speeds respectively). Even with these speeds it’d take you several days to back up a few dozen of your favorite DVD movies to an online backup facility.

Google introduced an idea to make high-speed Internet in the home possible. I’m not just talking about 50 or 100 Megabit speeds like the cable companies and ISPs are offering today, but up to 1 Gigabit per second of bandwidth for the average person. Let me remind you that even some small businesses don’t have this type of bandwidth capability today.

However, with that kind of bandwidth capacity you’d be able to stream Blu-ray quality video in 1080p and simultaneously transfer several terabytes of data over the Internet every day with ease. In fact, those speeds are so fast that if they were any faster your hard drive’s write speeds might not be able to keep up with your network transfer rates.
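To put rough numbers on the last few paragraphs, here is a small sketch of the transfer-time math. It ignores protocol overhead and assumes the link is fully saturated. The transferSeconds() helper, the 36 movies standing in for “a few dozen”, and the 4.7 GB single-layer DVD size are my own illustrative choices.

```php
<?php
/**
 * Seconds needed to move $bytes over a link of $megabitsPerSecond,
 * ignoring protocol overhead and congestion.
 */
function transferSeconds(float $bytes, float $megabitsPerSecond): float
{
    return ($bytes * 8) / ($megabitsPerSecond * 1e6);
}

$dvdBytes = 4.7e9;          // single-layer DVD capacity
$library  = 36 * $dvdBytes; // "a few dozen" movies; 36 is an assumption

$days = transferSeconds($library, 10) / 86400;
printf("36 DVDs at 10 Mbps upload: %.1f days\n", $days); // 1.6

// Data moved in one day over a saturated 1 Gbps link, in decimal terabytes.
$terabytesPerDay = (1000 * 1e6 / 8) * 86400 / 1e12;
printf("1 Gbps for 24 hours: %.1f TB\n", $terabytesPerDay); // 10.8
```

Dual-layer discs or a connection that isn’t running flat out would stretch that upload well past two days.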

Being able to share data has its advantages. Similarly, having to choose between local and online storage involves trade-offs that we are only beginning to tackle. Whether we will make a permanent move to the cloud or not remains to be seen. What we do know so far is that no single Data Center in the world has exceeded the Zettabyte threshold for storage capacity; not even Google! A Zettabyte is 1,000 Exabytes or 1 million Petabytes, and the Petabyte is 1,000 Terabytes. So in order to reach a Yottabyte of world storage space we’d have to have 1,000 or more Data Centers each holding about a Zettabyte of storage. That’s not accounting for personal storage devices like your laptops, desktops, smartphones, etc… There are no concrete figures for me to prove whether or not the world has indeed reached or exceeded one Yottabyte in storage space with all these devices and data centers combined, simply because it’s hard to say who’s actually using what amount of space on their personal storage devices at any given time… However, it’s been estimated that the world passed the Zettabyte mark last year, in 2010, based on data center and Internet usage metrics and considering the number of personal computing devices sold over that time period. It’s safe to assume we are now one step (a factor of 1,000) away from the next threshold.

With that said, I don’t think we will continue to consume greater amounts of storage space on a global level once we’ve reached the Yottabyte threshold. I think at that threshold technology will have developed ways to consume that data more easily without the need for growing storage. This is based on things like memory expansion and the growing trends of cloud storage, Internet bandwidth increases, and even compression ratios. I wrote an article about how cheap memory has become just a few months ago. It’s due to this dramatic drop in the cost of memory over the years that computers are now able to rely on slower storage like flash, since we can do a lot more caching for files that are read more often. When you read from disk less, you lower I/O rates and disk seeks and ultimately increase performance.

For example, because it’s actually faster and cheaper for me to search for the exact information I need on Google at the time I need it, rather than store millions of documents covering a broad assortment of information and then seek that information when it’s needed, I don’t bother consuming that storage myself. Instead I let a company that has entire Data Centers filled with machines take on the cost of bandwidth, storage space, and compute power necessary to collect and sort that information for me. Google relies on building massive networks of low-voltage computers that can cache the indexes that make it possible for them to produce such instantaneous results for your searches at any given time. This spares me a few petabytes of data and produces even more accurate and speedy results than I would have attained on my own. Ultimately it’s saving me storage space, and this is one of the reasons I believe we will one day reach a point when storage is no longer the concern, but how quickly we can transmit the data being stored is.

I may want to save a few PDF documents or web pages here and there from Wikipedia on a particular topic I find interesting for my research, but ultimately I will never need to store the entirety of Wikipedia on my local machine, mostly because I will never have the time to read through it all. Besides, it’s faster for me to search and read only the bits and pieces I need at that particular time over the web than it would be for me to do it locally, even if I did store all that information (remember Wikipedia is one such organization with Data Centers powerful enough to out-perform the read/write/compute capabilities of my tiny little machine).

Why Is the Global State Bad?

It’s very easy to add to the global state and very difficult to test code that depends on it. This pretty much sums up why the global state is a bad idea in almost any language. However, in the spirit of being focused on PHP we should make a few things clear that are specific to PHP. In PHP, functions do not see the global scope by default. There is a very good reason for this, and that’s to avoid adding to the global state by accident. Unlike the JavaScript world where, by default, a variable assigned without a declaration is attached to the global object (the window the script is running in), PHP keeps a function’s variables separate from the global scope. In order to use a variable from the global scope within a function in PHP, you have to specifically declare it using the global keyword. In JavaScript, in order to keep a variable inside a function out of the global state, you have to declare it using the var keyword, which is the exact opposite of the behavior in PHP. However, I’m not going to try drawing comparisons between different languages. Instead let’s just focus on why this is bad in PHP, and then you can draw your own conclusions about why it’s bad in virtually every other language you may use.
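A minimal sketch of that scoping rule, with made-up function names, might look like this:

```php
<?php
$counter = 10; // defined in the global scope

function readWithoutGlobal()
{
    // $counter is a separate, undefined local here; isset() avoids a warning.
    return isset($counter) ? $counter : 'not visible';
}

function readWithGlobal()
{
    global $counter; // explicitly import the global variable
    return $counter;
}

echo readWithoutGlobal(), "\n"; // not visible
echo readWithGlobal(), "\n";    // 10
```

The first function has no idea that $counter exists, which is exactly the protection PHP gives you by default.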

In PHP we have something called superglobals which are still made available within any scope even though they may not have been declared as global in user-land. The superglobals are things like $_POST, $_GET, $_FILES, $_COOKIE, $_REQUEST, $_SERVER, $_SESSION, etc…

Now, why are these overriding the rules of separating function scope from the global scope, you might ask? The reason is simply that they make using the web-specific features of PHP easier. The variables themselves are not normally populated in user-land, but are populated by PHP using information obtained from the SAPI/webserver, in this case. It’s true that you can certainly write to these variables. In the case of using sessions it makes sense that I would want function foo() { $_SESSION['var'] = 'bar'; } to be available in any scope. After all, I don’t expect the garbage collector to clean up my session if I move through different scopes unless I specifically write-close the entire session file. You just don’t ever expect these variables to be different in any scope unless you specifically change them, and so they’ve taken on the superglobal concept that we have in PHP. So then does this mean they are bad? Not necessarily. Now, keep in mind PHP also provides a feature called register_globals, which has been deprecated as of PHP 5.3.0 and is highly discouraged. Please see the manual on register globals for more details. Basically, this directive makes PHP code a nightmare as it injects all sorts of variables into your script (and you would have no way of knowing what the variable names will be). It’s much safer to rely on superglobals, and this behavior is now deprecated so it would never be on by default in the latest releases (it was removed entirely in PHP 5.4.0).
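Here is a small sketch of a superglobal being read inside a function without any global declaration. Since I’m running it from the command line, I fill $_GET by hand to stand in for a real request. The nameFromQuery() function and the URL in the comment are hypothetical.

```php
<?php
// On the command line $_GET starts out empty, so we fill it by hand to
// stand in for a request like /greet.php?name=GoogleGuy (hypothetical URL).
$_GET['name'] = 'GoogleGuy';

function nameFromQuery()
{
    // No "global" declaration needed: superglobals are visible in every scope.
    return isset($_GET['name']) ? $_GET['name'] : 'stranger';
}

echo 'hello ', nameFromQuery(), "\n"; // hello GoogleGuy
```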

Let’s start with the simplest concept…

Hello World Scripts are Boring!
This is true, but since almost any idiot’s guide to programming (at one point) started by demonstrating how to write your first program in the form of print "hello world!", we’re going to stoop down to this level for a moment to understand something. For those of us who are a little more advanced, just bear with me, but for those of us who are a little newer to programming or to PHP this might be easier to relate to.

Let’s say I just installed PHP and will write my very first PHP script ever! I want to write a hello world script. So I open a file called helloworld.php and put the following code in the file.

	<?php
	echo 'hello world'; // Yay we can haz PHP!

I then run the file from my command line to test it out and make sure it works… Guess what? It works! But it’s also very boring. So I want to be able to make it say hello <my name> instead.

Now I modify the same script, and by passing my name in as a command line argument when calling the script with 'php helloworld.php GoogleGuy' I get to see my terminal greeting me… yay!

	$name = $argv[1]; // We're just going to take the first command line argument passed to the script and use it as a name
	echo "hello $name"; // Even more PHP awesomeness!
OK, but now I get a little more advanced and start writing longer scripts. At one point I find I am simply repeating the echo "hello $name" code in a lot of places. After learning about functions I decide to write a function so that I can reuse this code.

	$name = $argv[1];
	function greeting() {
		return "hello $name";
	}
	echo greeting(); // Outputs 'hello ' DUH! Uhhh, wait... what happened to $name?

Now we begin to see our first problem. The variable I’m using was defined in the global scope, and since it’s easy to add to the global scope I had no problem repeating my code everywhere. Inside greeting(), however, $name is a separate, undefined local variable, so the function returns just 'hello '. Quickly I realize the fix is to pass the variable along to the function so that it can reference it.

	function greeting($name) {
		return "hello $name";
	}
	echo greeting($argv[1]); // Ahhh now we're getting somewhere!

Well, why didn’t we just declare $name as global inside the function? The reason is that I’m actually putting this function in a different file called greetings.php and including it in my actual script foo.php, where I will run lots of other code. Now I want to test that it works before I actually include it, simply by making calls directly to the function from the global scope. To do this I have to pass something to the function. If I just test it with echo greeting('GoogleGuy'); it will still work. If I use echo greeting($name) and $name is defined as $argv[1] then that works too! Now, what if I decide to use this script in a web SAPI instead of the CLI? All I really have to do is change the variable used in the function call to $_POST['name'], for example. Or I can define $name as $_POST['name'] and call greeting($name) each time. The point is I don’t have to worry about which variable from the global scope my function depends on. Had I declared it as global, I would have to go back and look at the function definition any time I made changes to the code using this function. What’s even worse, in order to test it I would specifically have to make sure that this variable is never used anywhere else in the global scope. The order in which I write my code would then have an impact on the behavior of my function, and as I keep adding to the global state it wouldn’t be clear how my function is being used unless I went back and reviewed the function definition.
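As a quick sketch of that point, the same greeting() function can be exercised with a name from three different sources. The $cliArgs and $post arrays are just stand-ins I made up for $argv and $_POST so the snippet runs anywhere.

```php
<?php
function greeting($name)
{
    return "hello $name";
}

// The same function works no matter where the name comes from.
$fromLiteral = greeting('GoogleGuy');

$cliArgs = ['helloworld.php', 'GoogleGuy']; // stand-in for $argv
$fromCli = greeting($cliArgs[1]);

$post = ['name' => 'GoogleGuy'];            // stand-in for $_POST
$fromWeb = greeting($post['name']);

var_dump($fromLiteral === $fromCli && $fromCli === $fromWeb); // bool(true)
```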

So there are two things to take away from this:

  • Easy to add to the global state, which can cause unexpected behavior of my functions
  • Difficult to test, because calling the same function with the same arguments doesn’t necessarily produce the same result (when we’re relying on the global scope)
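Both takeaways show up in a tiny sketch like the one below. The globalGreeting() function is a hypothetical variant of greeting() that pulls $name from the global scope instead of taking it as an argument.

```php
<?php
// Hypothetical variant of greeting() that reads $name from the global scope.
function globalGreeting()
{
    global $name;
    return "hello $name";
}

$name = 'GoogleGuy';
$first = globalGreeting();  // hello GoogleGuy

// Some unrelated code later reuses the variable name...
$name = 'admin';
$second = globalGreeting(); // hello admin

// Same call, same (empty) argument list, different results.
var_dump($first === $second); // bool(false)
```

Nothing about the call itself tells you why the output changed. You would have to go read the function body to find out.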

Now, this is all very apparent and makes one think “well duh, of course I wouldn’t do that in my code,” but in reality I meet programmers all the time who understand the problems of relying on the global state yet still insist on using singletons in their code. This is the not-so-apparent part. Singletons aren’t always bad, just like superglobals aren’t bad in PHP, but what you have to recognize is that they do rely on the global state and that could lead to testability problems, unreadable code, and lengthy debugging. These things are pretty interconnected, so having one of these problems exacerbates the others.

Remember that the garbage collector doesn’t come around to clean up the global scope until after the entire script exits. This is where all your resources, variables, objects, streams, etc. are closed and removed from memory by the garbage collector. If you still work with mysql_* functions it’s common to see people simply declaring the mysql connection resource as global within their functions rather than passing the resource to their functions/methods. This isn’t a good habit to get into. When you get into OOP you learn more about dependency injection and how that makes instantiating your objects easier and more testable, because when you do unit testing of your individual methods you don’t want to spend a lot of time figuring out what your objects rely on from the global scope. In fact, if your objects rely on anything in the global scope their methods will be much more difficult to unit-test and the objects will be painfully difficult to instantiate. If you’re using singletons to work around this problem and avoid having to figure out scope/path resolution, it’s only going to come back and haunt you during testing.
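To make the dependency injection idea a bit more concrete, here is a minimal sketch. Everything in it is hypothetical: the Connection interface, the FakeConnection class, and the Greeter class are names I invented for illustration and not part of any real library. The point is only that the object receives what it needs through its constructor instead of reaching into the global scope.

```php
<?php
// Hypothetical interface standing in for a real database connection.
interface Connection
{
    public function fetchName($userId);
}

// In-memory fake used only for testing; no database required.
class FakeConnection implements Connection
{
    private $names;

    public function __construct(array $names)
    {
        $this->names = $names;
    }

    public function fetchName($userId)
    {
        return isset($this->names[$userId]) ? $this->names[$userId] : null;
    }
}

class Greeter
{
    private $connection;

    // The dependency is passed in rather than pulled from the global scope.
    public function __construct(Connection $connection)
    {
        $this->connection = $connection;
    }

    public function greet($userId)
    {
        $name = $this->connection->fetchName($userId);
        return $name === null ? 'hello stranger' : "hello $name";
    }
}

$greeter = new Greeter(new FakeConnection([1 => 'GoogleGuy']));
echo $greeter->greet(1), "\n"; // hello GoogleGuy
echo $greeter->greet(2), "\n"; // hello stranger
```

Because Greeter only knows about the interface, a test can hand it a fake like this one, while production code can pass in whatever real connection wrapper you use.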

One of the blogs I enjoy reading the most on and around the topic of clean code and avoiding mistakes like relying on the global state is the blog of Misko Hevery, who is a coach at Google and has worked for some pretty big players. He does a lot of talks about clean code and you can find a lot of his videos on YouTube and on his blog.