Sherif's Tech Blog

Just another guy on the Internet with a keyboard…

Why You Need a Database

A lot of developers start off building their applications with the notion that a database is only necessary if they have a lot of data to work with, or that their data will be easier to manage if they can avoid the complexities of building and maintaining a database or dealing with a DBMS (Database Management System). In web-based development, this is rarely the case. The reason is that web-based applications tend to grow very rapidly. That growth comes easily, because there are billions of people with access to the Internet, and virtually all of them get that access from a web-enabled device. Having access to the Internet has become synonymous with having access to the World Wide Web. Since the number of potential users is so huge, the potential for data is equally huge. Beyond the sheer amount of data that may be collected from users of the application and stored for use by the system, there is also the factor of maintainability. Databases make organizing and maintaining long-term data easier, and this comes in several forms. Without a database solution you have to handle concurrency and replication issues yourself. You would also have to consider race conditions, access time, permissions, and scalability, among others.

Databases Are Overkill

Those who start off building small web-based applications, or even a tiny CMS (Content Management System), sometimes fall victim to the illusion that having a very small amount of data means building a database for it would be overkill. This is simply not true anymore. Today databases are easier than ever to build, grow, and manage. With lightweight solutions like SQLite you can actually improve performance with small amounts of data and make that data easier to manage. SQLite is a small-footprint library written in C that implements an embedded DBMS. It’s only a few hundred KB in size and implements most of the SQL standard. You can use it to store databases in memory or on disk and still get the benefits that relational databases offer, with a minimal footprint and without compromising on performance for small data sets. It’s supported by PHP, Python, Perl, Ruby, and even JavaScript, as well as many other languages. So there really is no excuse to avoid using a database when the solution is widely available on so many popular platforms, especially in web development.
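To give a feel for how little ceremony an embedded database needs, here is a minimal sketch using the sqlite3 module that ships with Python. The table name and sample rows are made up for illustration; the same idea applies in any language with SQLite bindings.

```python
import sqlite3

# ":memory:" keeps the whole database in RAM; pass a file path to store it on disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO posts (title) VALUES (?)",
    [("Hello world",), ("Why you need a database",)],
)
conn.commit()

# Read the rows back in insertion order.
titles = [row[0] for row in conn.execute("SELECT title FROM posts ORDER BY id")]
print(titles)
```

There is no server to install or configure here; the whole DBMS lives inside the process.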

Databases Are Slow

This could not be further from the truth. A relational database can maintain indexes for records across different tables. This means that rather than looking through the entirety of the data set and then trying to expose some underlying structure in order to find a particular set of data, the relational database builds those structures as you build your data sets. These structures make things like fetching a record by its primary key much faster than you would get with a flat-file solution.
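One way to see the difference is to ask the database how it intends to run a query. The sketch below uses Python’s sqlite3 module and SQLite’s EXPLAIN QUERY PLAN statement; the table and its contents are hypothetical, and the exact wording of the plan text varies between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany(
    "INSERT INTO records (body) VALUES (?)",
    [("line %d" % n,) for n in range(1000)],
)

def plan(sql, params):
    """Return SQLite's description of how it would run the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " ".join(row[-1] for row in rows)  # the last column holds the detail text

by_key = plan("SELECT body FROM records WHERE id = ?", (500,))
by_text = plan("SELECT id FROM records WHERE body = ?", ("line 500",))
print(by_key)   # expected to describe a SEARCH using the primary key
print(by_text)  # expected to describe a SCAN, since body has no index
```

The lookup by primary key goes straight to the row, while the lookup on the unindexed column has to walk the table, which is roughly what a flat-file script does for every request.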

Let’s examine the alternatives. Suppose you have a very small amount of data, say just a few hundred lines of text. Suppose the data structure is overly simplistic, so that each line represents what would be a single row in a database table. Suppose the data will only ever be maintained by a single developer: you. You are still overlooking many problems that are not easily solved by using a flat file to maintain this data. First, let’s consider the race condition. You have a script that opens a specific file on the server and appends a new line each time a record is added. The script can also open the file for reading and retrieve the entire contents of the file into memory. The script can then do any necessary sorting and filtering to return the required data sets to the user. It is entirely plausible that two requests could be made simultaneously to the same script: one to open the file for writing and append a record, and one to open the file for reading and retrieve the data. If the data is read into memory before the line is appended, the result is stale and potentially corrupt. If the new data is appended before the read, no problem. However, what happens when you want to delete a record? Now the problem is three-fold. If three individual requests all come in at the same time (one to read, one to write, and one to delete a record), it is quite possible that your entire data structure ends up corrupted. Remember that HTTP is built on a request-response model, and no request is treated as if it were tied to any previous request. So there’s no central point of control in your script to manage which process can access the data and to what extent.
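The delete case is easy to reproduce. The following Python sketch replays one unlucky interleaving of two requests by hand, so it is deterministic rather than a real concurrent test; the file name and the records are invented for the example.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "records.txt")
with open(path, "w") as f:
    f.write("alice\nbob\n")

# Request A (delete "bob"), step 1: read the whole file into memory.
with open(path) as f:
    snapshot = f.read().splitlines()

# Request B (add "carol") runs in between and appends a line.
with open(path, "a") as f:
    f.write("carol\n")

# Request A, step 2: write back its filtered snapshot, unaware of B's append.
with open(path, "w") as f:
    f.writelines(line + "\n" for line in snapshot if line != "bob")

with open(path) as f:
    final = f.read().splitlines()
print(final)  # carol's record is gone even though her request succeeded
```

Neither request did anything wrong on its own. The data was lost purely because nothing coordinated the two.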

In a DBMS, on the other hand, control is transferred away from the script to the central management system of the database. The DBMS then decides how requests will be served and the order in which the data is handled. This creates more dependable data that has a far smaller chance of being corrupted. Now, it’s entirely possible that you may not be concerned with the integrity of your data for a small application, but then you might as well not waste your time building it.
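Transactions are one concrete form this control takes. In the sketch below (Python’s sqlite3 module again, with a made-up table), a delete and an insert are grouped together; when the insert fails, the database undoes the delete as well, so the data never ends up half-changed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (name TEXT PRIMARY KEY)")
conn.executemany("INSERT INTO records VALUES (?)", [("alice",), ("bob",)])
conn.commit()

try:
    with conn:  # commits on success, rolls back if an exception escapes the block
        conn.execute("DELETE FROM records WHERE name = ?", ("bob",))
        # This violates the PRIMARY KEY constraint because "alice" already exists.
        conn.execute("INSERT INTO records VALUES (?)", ("alice",))
except sqlite3.IntegrityError:
    pass  # the whole transaction, including the DELETE, has been rolled back

names = [row[0] for row in conn.execute("SELECT name FROM records ORDER BY name")]
print(names)
```

Getting the same all-or-nothing behaviour out of a flat file means writing your own locking and recovery code.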

I’ll Use A File Now And Learn To Use A Database Later

If you’ve said this phrase, it’s already too late. It doesn’t take a lot of time to get started with a database in the first place. If you’re using languages like PHP, Python, Perl, or Ruby, you probably already have the necessary libraries installed on your system to work with a database. These libraries and drivers are usually packaged with these software stacks as standard. It’s actually uncommon not to have some DBMS solution already available in most of these environments. So why would you go out of your way to reinvent the wheel when the solution is already at your fingertips? Not only that, but it takes very little time to set these DBMS solutions up and get them running smoothly on virtually any platform. You will probably spend more time trying to write a script that stores, retrieves, sorts, filters, locks, and validates data using a flat file than you would installing the DBMS and getting a simple schema started.

If you’re using PHP, interfacing with a database has become easier than ever. It only requires a couple of lines of code to open a connection to virtually any database for which you have a PDO driver installed and loaded. So whether you’re using SQLite, MySQL, PostgreSQL, etc., you shouldn’t need to spend a lot of time learning how to interface with each of these databases if you simply stick with the PDO extension. You use the same functions regardless of the database, as opposed to having to learn the individual database-specific extensions in PHP. Not to mention that PDO supports many popular database features such as prepared statements and is a lot easier to learn and use than extensions like MySQLi.

PHP and Databases

Being a PHP developer, I also notice that many PHP developers have the misconception that when they start using a database (usually their first database is MySQL) they should start by learning the old mysql extension in PHP. This is simply not true. The misconception is widespread mainly because the old mysql extension has been around for quite a long time, and it’s fairly common to see PHP code demonstrating the use of a database with this extension. It’s also familiar to a lot of veteran PHP developers and is bound to be present in their older applications. However, the use of the old mysql extension is highly discouraged for new development. It’s an old extension that’s no longer well maintained and has been slated for deprecation for years. There’s no guarantee that if a new bug crops up someone will go back and fix it. This leaves your application vulnerable and exposed. If the code base gets large enough, this might leave developers scrambling for a migration path. Additionally, the extension does not support prepared statements or parameterized queries, so you have to escape user data by hand to avoid SQL injection, which is error-prone. The extension falls short in many areas that matter for future development. Learning the old mysql extension before you learn the newer, improved MySQLi extension or PDO will gain you nothing. In fact, you will have to unlearn some of the very poor design of the old extension and its implementation details in order to become accustomed to the newer extensions.
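To show why parameterized queries matter, here is a small sketch of the injection problem. It uses Python’s sqlite3 module as a stand-in rather than PHP’s PDO, and the table and the malicious input are invented, but the principle of keeping values out of the SQL text is the same one prepared statements give you.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, is_admin INTEGER)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [("alice", 1), ("bob", 0)])

user_input = "nobody' OR '1'='1"  # hostile input pretending to be a user name

# Unsafe: building the SQL string by hand lets the input rewrite the query.
unsafe_sql = "SELECT name FROM users WHERE name = '" + user_input + "'"
unsafe_rows = conn.execute(unsafe_sql).fetchall()

# Safe: the ? placeholder sends the value separately from the SQL text.
safe_rows = conn.execute(
    "SELECT name FROM users WHERE name = ?", (user_input,)
).fetchall()

print(len(unsafe_rows), len(safe_rows))  # the unsafe query matches every user
```

With the placeholder there is nothing to escape and nothing to forget, which is exactly what the old extension cannot offer.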

Some developers also complain that PDO seems too complicated or more difficult to use than the old mysql extension. This might come from a lack of understanding of what PDO even is or how it’s used. Since PDO can only be used with the OOP features of PHP (you have to use objects and methods instead of procedural-style functions), it can seem unapproachable or even scary to developers who aren’t used to OOP in PHP. There is also the idea that PDO has a lot more to it because of its vendor-agnosticism, and the fact that it may require further configuration, such as installing and loading the individual drivers needed for your specific database (where the drivers aren’t already packaged or loaded). I can understand the intimidation, but most of this has been alleviated, with new versions of PHP coming pre-packaged and loaded with most of the popular drivers and the documentation offering examples that are now easier to follow. Most of the intimidation actually comes from having to unlearn the habits that older extensions like the old mysql extension once taught.

Once you get past the initial intimidation phase and actually get started with PDO and with a database, you’ll find that it doesn’t take nearly as much time as you’d think to get up and running. The reservations people have are 90% of what’s holding them back, not the investment to get started, which isn’t actually that significant. Beyond that, you’ll find that building on normalized data not only makes development easier, but makes your users happier. When you can organize and maintain data that’s more clearly structured and accessible, you can serve your users more effectively and efficiently. That will keep users coming back and eventually help you grow your application!

Load Balancing Software as a Service

I’m sure many of you have seen this statue before, perhaps not the very same one in the picture, but possibly similar statues around the world. This one is located in New York City.

Statue of Atlas in NYC

This particular statue is of Atlas, a Titan from ancient Greek mythology, who was supposedly burdened with carrying the weight of the world (or the weight of the heavens) on his shoulders as a punishment from Zeus. Whether it was the weight of the world or something else is unclear, but most people seem to go with the same interpretation. It’s nothing more than a myth, but the lesson history teaches us is that it constantly likes to repeat itself. Clearly, no one can bear the entire weight of the world on their shoulders, and no single computer can either. If you are running SaaS (Software as a Service), you are online 24/7 and so is your service. The problem is there are over two billion users online (or with Internet access) today. What happens when too many of those users all start using your service at once?


What Is Load Balancing


The idea behind load balancing is that a single machine can only handle so much work at one time, and you can only scale vertically so far. Notice that even in large cities you can only build so high before you have to start building out. Since on the Internet virtually anyone can be using your server at any time, you run the risk of overloading without warning. If too many users send requests to your server too quickly, the server will reach a point where the load is higher than its capacity and eventually crash. This particular vulnerability of typical client-server relationships on a network is exploited by what is commonly referred to as a DDoS attack, or Distributed Denial of Service attack. Basically, a number of clients (sometimes a botnet controlled by one or more users) will attempt to send a lot of requests to a server or number of servers very fast in order to overload them and prevent the intended users from being able to access the service. Sometimes this is done just to destabilize the service running on the server, or for other malicious purposes. There are ways to mitigate DoS attacks with firewall software/hardware or through other means depending on the service, but not every flood of traffic that looks like a DoS attack is malicious or even intentional. Google, for example, experienced what was at first glance considered a DDoS attack on its search service on the afternoon of June 25th, 2009. This actually wasn’t a malicious user or users at all. It was the world receiving the tragic breaking news of the death of Michael Jackson. Literally millions and millions of users from all around the world flooded Google Search all at once with the same search phrase, “Michael Jackson”. Google had never seen such a tremendous amount of traffic coming in all at once on a single search query before, so their first thought was “ohnoes, we’re getting DDoSed!”

Scaling Out - SaaS


Why Do I Need It


The fact remains that any number of users can cause a sudden surge in the number of requests coming in to your servers at any given time, and whether that is malicious or not is unimportant. What is important is that you are prepared to handle such situations so that your service suffers as little downtime and degradation as possible. Load balancing allows you to distribute the load on a particular service or services over a larger array of resources. It basically makes your service, as a whole, more tolerant of failure by making efficient use of all available resources.

If you are running any kind of high-availability service over the Internet, you need load balancing. Even small applications with just a few thousand users can benefit greatly from load balancing as well. The only potential downside is that you may need more than one node to do it. This isn’t always necessary, as load balancing comes in many shapes and sizes. For example, you might do load balancing on a single host node using multiple guest nodes on the same machine. All of the major services you probably use on a regular basis, like your email, search engines, or popular social networking apps, make use of load balancing because it keeps things running a lot more smoothly as the number of users grows. If you’re not on board with this yet, you probably should get on board quickly.


How Do I Use It


There are a few broad categories you can place load balancing techniques in. The easiest form of load balancing relies on a system that most services on the Internet (or large networks in general) already depend on, and that’s DNS. DNS is a distributed system, so it relies on multiple components in the network to do their job in order to make things more efficient. It reduces bottlenecks like those created by sending enormous numbers of packets across the planet in fractions of a second. Like most complex systems, everything starts off small and simple and grows both horizontally and vertically, but at the core the protocols are fundamentally very simple.

DNS Load Balancing simply relies on the DNS system to take care of the most basic problems for you. The way this works is you set the DNS record for a particular domain name to multiple IP addresses (usually one for each server) using a low TTL (Time to Live). Since DNS is cached at various levels, this makes things like geographical load distribution efficient. A name server tells the resolver which IP address to use for a particular domain name, and it can hand out different addresses depending on the geographical origin of the request, thus reducing network latency and allowing packets to travel shorter distances. Once the name has been resolved, the answer is cached at multiple levels so that future requests go to the same place. It can be cached at the local level, the ISP level, and other levels along the way. The name server then doesn’t become a bottleneck, since not every single request has to rely on that name server. The TTL lets the caching servers know when the cached answer has become stale and it’s time to refresh. Also, when requests to a particular server are no longer getting through, the client can try a different IP from the record. So if you have different servers with different IPs in the DNS record, then when one server becomes unresponsive (potentially having gone down) the load can be directed to a different server. The inherent problem with this approach is that it doesn’t make very efficient use of all of your resources. It doesn’t take into account which servers are currently busy, and if the DNS record has already been cached for a server that is now down, you end up potentially being stuck with a poorly responsive server until the cache is refreshed. Additionally, you are exposing your infrastructure to the outside world by revealing the public IPs of your servers, with no way to control the flow of traffic to an internal network. It’s very easy to have an unstable system this way. Most services that use this approach are usually just creating what are known as mirrors (servers that back each other up so that in case one goes down a backup can still be reached).
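As a rough illustration of the round-robin idea, the Python sketch below rotates through a list of addresses and skips one that is pretending to be down. The IP addresses come from a documentation-only range and the "down" set is a stand-in for a failed connection attempt; no real DNS lookup happens here.

```python
from itertools import cycle

records = ["192.0.2.10", "192.0.2.11", "192.0.2.12"]  # hypothetical A records
down = {"192.0.2.11"}  # pretend this server has stopped responding

rotation = cycle(records)

def resolve():
    """Return the next address in rotation, as a round-robin answer would."""
    return next(rotation)

def connect():
    """Try addresses in turn until one responds; give up after one full pass."""
    for _ in range(len(records)):
        ip = resolve()
        if ip not in down:
            return ip
    raise ConnectionError("no server reachable")

served = [connect() for _ in range(4)]
print(served)
```

Notice that the rotation never looks at how busy each server is, which is the inefficiency described above.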

Software Load Balancing is another approach, which addresses some of the shortcomings of the DNS technique described earlier. A software load balancer attempts to keep track of the available resources, and when an incoming request is received it determines how best to allocate those resources in order to service that request. One benefit of this technique is that you don’t have to reveal your network setup to the outside world. Everything can be done on the internal network (whether that’s a local area network or otherwise); in other words, you won’t expose your communication channels directly. You also have a tighter hand on security and distribution, since you can more easily control the flow of traffic over the network. Some examples of common open-source software used for load balancing are Pound, Varnish, mod_proxy for Apache’s httpd, and Gearman. There are all sorts of nifty ways to balance the load across your network. You can have the load balancers poll the servers and check on resources like CPU usage, available memory, storage space, network traffic, open TCP connections, etc. The load balancer can then use this information to figure out how best to direct the incoming requests and serve up the responses as quickly and efficiently as possible. There are still a few problems inherent to this technique, depending on how you use it. If you’re relying on a single machine, you have a single point of failure: if the host node goes down, the load balancer and all of your resources go with it. If you’ve got one load balancer and multiple servers, the load balancer is still a single point of failure. Additionally, the load balancer itself can be DoSed given an attack of enough magnitude and sophistication. Not only that, but you have to worry about things like session storage consistency across multiple servers, file-system access, database synchronization between different database servers, and some network bottlenecks that might not always be easy to resolve with load balancing, to name a few.
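One simple policy a software load balancer can apply is to send each request to the backend with the fewest open connections. The sketch below is only an illustration of that policy in Python, with invented server names and counts; it is not how any of the tools named above are implemented.

```python
# Hypothetical number of open connections currently held by each backend.
servers = {"web1": 2, "web2": 1, "web3": 2}

def pick_backend(connections):
    """Choose the backend with the fewest open connections (first one wins ties)."""
    return min(connections, key=connections.get)

def route(connections):
    """Assign one request and record the new connection on the chosen backend."""
    name = pick_backend(connections)
    connections[name] += 1
    return name

assigned = [route(servers) for _ in range(5)]
print(assigned)
```

A real balancer would also decrement the count when a connection closes and would fold in the other metrics mentioned above, such as CPU and memory.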

Hardware Load Balancing is also an option. You can buy very expensive firewalls/routers that take care of many of these things for you. Most people just set up a dedicated node or two with software load balancers that do pretty much the same thing. Hardware appliances like Cisco’s ASA might do a better job of handling security and high-bandwidth loads, but they do come with a heavy price tag.


Some Load Balancing Tips


There are some pretty common approaches to some of the problems inherent in distributing a service over multiple servers. Take session storage as the most obvious problem. If you’re using PHP, you are probably using the built-in session handler, which makes use of file-based sessions. If users are directed to different servers by the load balancer, a user ends up with multiple sessions across those servers (which might be problematic for your application and annoying to the user). Some people try to avoid this by creating what’s called a sticky session. Once the session is generated for a user, they’re sent a cookie that lets the load balancer know, upon subsequent requests, to direct the user to that particular server. There are a few minor problems with that, but nothing you couldn’t work out through a well-planned architecture. Another approach is to create a centralized session storage server where all the requests will look for the session. Depending on your infrastructure this may or may not be a good idea, and keep in mind it also creates a single point of failure. For example, if your servers are built on stacks (you have several software-based servers running on the same node, like a web server, database server, application server, etc.), it takes some tinkering to configure each stack to work from centralized session storage. You can use something like Redis, where you can have master/slave replication across all stacks. This takes a little less configuration and moves the session handling into the software stack layer, thereby removing it from the load-balancing layer.
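The sticky session idea can be sketched in a few lines. In this Python example the cookie is just a dictionary and the backend names are made up; a first-time visitor is assigned the next backend in rotation, and anyone who already carries the cookie is sent back to the same place.

```python
from itertools import cycle

backends = cycle(["web1", "web2", "web3"])  # hypothetical backend names

def route(cookies):
    """Return (backend, cookies), reusing the backend named in the cookie if present."""
    if "backend" not in cookies:
        # First visit: pick the next backend and pin the user to it.
        cookies = {**cookies, "backend": next(backends)}
    return cookies["backend"], cookies

first, jar = route({})      # new user, no cookie yet
second, jar = route(jar)    # same user returns with the cookie
other, _ = route({})        # a different new user
print(first, second, other)
```

One of the minor problems shows up quickly: if the pinned backend goes down, the cookie points at a server that no longer holds the session.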

The other obvious problem is file system storage. If you allow your users to upload files to your server, or you store large amounts of files that your application relies on heavily, there needs to be some system whereby your application layer can access those files, considering that the load balancer may send requests to different servers. Again there is a centralized approach, as with session storage, but even with a replication approach (to avoid the single point of failure) you might create the problem of excessive redundancy. If your servers are set up in stacks, having four or five copies of each file (or more, depending on how many servers you have), one on each server stack, is a bit of a waste, especially if you’re already using RAID arrays for redundancy. Even if you have a centralized set of servers for storage, you still face the problem of network overload. For example, if your backbone bandwidth capacity is 100 Mbps but your central network bandwidth capacity is 10 × 100 Mbps, you eventually create a bottleneck with increased usage, as your backbone can only serve up to 100 megabits per second of traffic at any given time.
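The arithmetic behind that bottleneck is worth spelling out. This small Python sketch uses the figures from the example (a 100 Mbps backbone in front of ten 100 Mbps links); the variable names are just for illustration.

```python
backbone_mbps = 100              # capacity of the shared backbone link
server_links_mbps = [100] * 10   # ten storage servers at 100 Mbps each

aggregate_mbps = sum(server_links_mbps)               # what the servers could push
effective_mbps = min(backbone_mbps, aggregate_mbps)   # the narrowest link wins
usable_fraction = effective_mbps / aggregate_mbps     # share of capacity you can use

print(aggregate_mbps, effective_mbps, usable_fraction)
```

In other words, at full load nine tenths of the storage tier’s capacity sits idle behind the backbone.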

Using a CDN (Content Delivery Network) is one solution often used when large amounts of files need to be shared across a network, but this can also be a bit costly depending on your needs. In its simplest form, a CDN is really just a group of servers that store files or data objects for you and replicate them across multiple nodes, allowing many other servers on the network to access that data, with caching and high bandwidth to reduce latency. The servers in the CDN clusters are usually strategically located at the edges of the core network to minimize the bottlenecks involved in the centralized network loop. So you are redirecting file-storage traffic away from the central network and off to the edge servers, expanding bandwidth and minimizing bottlenecks. This solves the single point of failure problem and also takes that complexity away from the server stack, which can ultimately help reduce loads and make load balancing more efficient. Services that utilize CDNs are usually ones that need to offer consistent high-bandwidth access to a large user base: for example, a service that offers high-definition video streaming, a large photo sharing web site, or other media services with high availability needs. You don’t always have to build this infrastructure yourself. You can rely on services like Amazon CloudFront, a pay-as-you-go CDN service offered by Amazon. There are many competitors, of course, that offer cheap CDN solutions. Depending on the sensitivity of your data this may or may not be an option for your particular SaaS needs. Still, it’s something to consider.

Besides file storage, you probably have a lot of database concerns in a system that scales horizontally as well. If you’re just using a single LAMP stack with little more than PHP, MySQL, and Apache running your back end, it might seem easy to scale out at first. The problem you’re likely to run into head-on is data replication across your MySQL servers. The database is almost always the biggest bottleneck in SaaS. It usually contains tons of data that virtually every one of your users will access with each hit. There’s only so much traffic a single database server can handle, but setting up two or more database servers can show some significant improvement. Your load balancer can also play a role in this. There can be data object caching mechanisms in place to ease some of the load for the most frequent queries. There can also be network latency issues to deal with once you have several database servers all replicating (especially if these servers are geographically spread out across different data centers, cities, countries, or even continents). Chunking is definitely not something I’d advise. It throws way too many variables into the equation and presents more problems than solutions for most applications.
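A data object cache for frequent queries can be as simple as remembering recent results for a short time. The Python sketch below is one minimal way to do it, assuming an in-process dictionary, a made-up table, and an arbitrary 30-second freshness window; production setups usually put this in a shared cache server instead.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE plans (name TEXT, price INTEGER)")
conn.executemany("INSERT INTO plans VALUES (?, ?)", [("basic", 5), ("pro", 20)])

TTL_SECONDS = 30   # assumed freshness window; reads may be stale within it
cache = {}         # (sql, params) -> (time stored, rows)
db_calls = []      # records each query that actually reached the database

def cached_query(sql, params=()):
    """Return cached rows when they are fresh enough, otherwise ask the database."""
    key = (sql, params)
    entry = cache.get(key)
    if entry is not None and time.monotonic() - entry[0] < TTL_SECONDS:
        return entry[1]  # served from memory, no database work
    rows = conn.execute(sql, params).fetchall()
    db_calls.append(sql)
    cache[key] = (time.monotonic(), rows)
    return rows

first = cached_query("SELECT name, price FROM plans ORDER BY price")
second = cached_query("SELECT name, price FROM plans ORDER BY price")
print(len(db_calls))  # only the first call touched the database
```

The trade-off is staleness: a change made in the database will not be visible through the cache until the entry expires.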

What Programming Language Should I Learn

So you want to start learning a programming language? The first question you might have is which language you should start with. Unlike when we’re born, when we don’t get to pick the first language we’ll learn to speak, read, and write in, in the computer science world you have a choice. However, as a programmer you have a vast array of languages to choose from, and sometimes people ask me “which is the easiest?”

The truth is whoever you ask will tell you their language is the easiest, best, most powerful, or whatever reason they can think of for you to learn and use that language! As a joke I posted this video on YouTube along with others about individual languages, but this one seemed to get the most hits. Is it that people are very inclined to find the one ultimate programming language, or just to laugh about the rest? Who really knows, but it’s funny…

This is kind of like asking a multilingual person which language they think I’ll find the easiest to learn. While there may be some valid answers, most of them will probably be subjective, and they won’t answer the true question one should be asking: what am I going to be using the language for? Just as you might decide to learn French because you’re moving to France or would like to communicate with someone who speaks French, you pick a programming language because you would like to communicate with a computer in a way that meets your objectives. What you intend to do with the language, and what it can do for you, however, may not always be so apparent at first.

I only recently realized how truly complicated it may be for someone who does not come from a programming or technical background to choose their first programming language. Having recently examined a comprehensive list of programming languages on Wikipedia, I found that there are currently over 600 programming languages to choose from, and that doesn’t even include the more than 300 dialects of BASIC and various esoteric languages. This also probably doesn’t account for some of the lesser-known dialects or derivatives of some of these languages. Since not all programming languages have an official specification, they may well be implemented in dozens of different ways in smaller niches.

If we look at programming languages broken up into categorical or even chronological lists, the information still doesn’t make the choice any easier, or the lists any more useful. However, if you simply attribute languages to their strongest generational origins, you can narrow the list down to just a few dozen languages. If you take away some of the older generations and put emphasis on the languages highest in popularity and active use/development, you come up with just a few languages and their respective dialects. However, this still isn’t informative enough to help someone decide on their first language, so I split up languages based on their strongest usage, and in the world of computer science this comes down to two broad categories: systems programming and utility or application programming. The most notable distinction between these two types of programming is that systems programming aims to provide software that communicates with hardware, while application programming usually aims to provide software for the user who is sitting at the computer. So while software like a text editor or word processor is considered application software, software like a disk formatting/partitioning utility is considered systems software. Your operating system has to deal with the hardware in your computer directly in order to provide application programs with a means to do things like write to your computer’s memory, while the operating system’s page table and memory manager control how this hardware is used by the various application programs.

C

C is, by far, still one of the most popular programming languages around. It is still used by many to develop application software, and it hasn’t lost its popularity or its power.

C is not a language you usually pick to write everyday utility applications. If you choose to start learning C, be prepared to learn a lot of other systems programming concepts and read technical hardware documentation as well. Most Computer Science majors take C as one of their first programming language courses in college. This is important, because there is a huge amount of software that’s written in C. For example, most operating system software is written in C. There may be some C++ in there, but for the most part you’ll find that Linux distributions are made up of a huge amount of C code and much smaller portions written in C++ or some other similar language. You may hear about assembly language as well when learning or working with C. Essentially, when a C program is compiled into a native binary to run as an executable program, it is translated, typically by way of assembly, into machine code. You take a high-level language like C and, eventually, to get it to run on the machine, it has to become low-level machine code that the machine architecture can execute. C is still a high-level programming language, so it doesn’t give you quite the fine-grained control over the machine that assembly, a low-level programming language, does. Don’t let this confuse you, however. C is a powerful language, and in fact many of the popular languages you will likely hear about or discover in this article were themselves written in C. For example, the standard implementations of PHP and Python, and many of their extensions, are written in C, and parts of the Java runtime are written in C and C++.

However, C can be tough. Writing non-buggy C code is costly. It can take a lot of time, because you either have to find the libraries you need and integrate them or write them yourself. C is an imperative, procedural language. It teaches you to program with side-effects and shared mutable state, which is very different from languages like Scheme, where you’re encouraged to program without side-effects. A C program can feel like one big global scope where everything can affect everything else, so you have to be very careful about managing your memory. You have to worry about pointers and data types everywhere in your code. You have the basic constructs like IFs, loops, and functions, but ultimately you have to learn to do a lot of things other programming languages make a lot easier, because they already ship with extensions that wrap a lot of these popular C libraries right in the language.
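
To make the side-effect distinction a little more concrete, here is a minimal sketch. It is written in Python rather than C or Scheme purely to keep it short and runnable, and the names `total`, `add_to_total`, and `add` are made up for illustration.

```python
# Shared, mutable state: any function in the program can change it.
total = 0

def add_to_total(n):
    """Side-effecting style: changes the global `total` and returns nothing."""
    global total
    total += n

def add(a, b):
    """Side-effect-free style: the result depends only on the arguments."""
    return a + b

add_to_total(5)
add_to_total(5)          # same call, but the program state differs after each one
pure_result = add(5, 5)  # same call, same result every time, nothing else touched
```

In the first style you have to know what every other part of the program has done to `total` before you can predict the outcome; in the second you only need to look at the arguments.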

So, unless you plan on designing an API for a larger program or building some system utility, C may not be the right language for you to learn. If you’re a compsci major you’re probably going to learn it as your first language whether you want to or not, but let’s face it, you chose the degree…

BASIC

BASIC has been around for quite a while as well and it has hundreds of dialects. It was popularized by many hobbyists during the 80s and grew further in popularity on Windows during the 90s with Microsoft’s Visual Basic suite, which attempted to keep the language as simple but as powerful as possible. BASIC is not very difficult to learn. Depending on the dialect it may be interpreted or, like C, compiled, and it has declined in popularity over the last decade. It might not be the best language to work with, but it is still high on the hobbyist’s list. Much like other languages that were once popular to learn just as a hobby and were fun to play with (like LOGO, which was a dialect of LISP), not many people take it seriously.

BASIC has the essential control structures you’d find in almost any language like IFs, loops, and GOTOs, but it was fundamentally designed around sequential programming, where the entire program is one long sequence of instructions. There are subroutines (like functions) and some dialects implement a lot of other modern features, but for the most part it’s great for when you want to learn programming for fun. If you’re serious about building cross-platform or enterprise-level applications, BASIC is far from a first choice.

Java

Java and other JVM-based languages stand out for their compile-once, run-anywhere trait, as opposed to many other compiled languages where you write the code once and then have to compile it for each platform you choose to run it on. With Java, if you choose to compile your code to Java bytecode to run in the JVM you only need to compile it once. The JVM (Java Virtual Machine) can run on pretty much any platform (Windows, Linux, Mac OS, etc.) and deals with the underlying operating system and hardware on your program’s behalf. This lets programmers compile their Java code just once on any machine and run it on virtually any other platform without recompiling for that specific platform. The JVM ships as part of the JRE (Java Runtime Environment) along with the standard class libraries, and it can interpret bytecode as well as compile it to native code while the program runs, so in that sense Java works somewhat like an interpreted language as well. Java’s popularity hasn’t declined much over the years, and it gained quite a reputation after later being released as open source.

Java is also popularly taught in compsci courses in colleges, institutes, and universities around the world. It’s similar to C in that it is a statically typed language and has functions (called methods in Java), basic loops, and other familiar constructs. However, Java is an object-oriented language, while C is pretty much procedural in paradigm. You can build structs and things in C, but Java makes abstraction a whole lot easier with its OOP features. You can often get a whole lot more done in development in a fraction of the time it might take you to do the same in C. So developing day-to-day applications in Java is a lot more common than with C. It’s just that a lot of the folks that have learned C and know it well have stuck to it over the decades and continue using it. Java is a much newer language. It appeared in the mid-90s, but it has proven itself in the last 16 years or so. C has been around since the early 70s and hasn’t changed much. The most current standard of C is C11; its predecessor was C99. Java is at Standard Edition 7.

Java is also considered a fast and secure language for a number of reasons. It is debatable whether or not all of these reasons hold true, but for the most part they’re built on some solid grounds. First, Java code runs in the JVM, or the Java Virtual Machine, which means the VM can check the compiled bytecode of the program and make sure it is valid Java bytecode before executing it. Second, Java code is cross-platform: the same bytecode runs on every platform that has a JVM, without much concern over which native libraries happen to be installed there. Java is also expected to perform well because of its JVM. The JVM itself is an ordinary program running on top of the operating system, but most JVMs include a just-in-time (JIT) compiler that turns frequently executed bytecode into native machine code while the program runs, so hot code paths end up running directly on the hardware rather than being interpreted one instruction at a time. Between Java and scripting languages like Perl, Python, or PHP this might be an advantage, but between C and Java it can go either way. In most cases C would easily out-perform Java, but in a few cases it might go the other way around.

PHP

PHP is probably the most popular server-side language on the web. It has many followers and a huge open source community. It’s an interpreted language that was originally developed for producing dynamic web pages. However, today it is seen as a general purpose language. What makes PHP so great is that it works very well with web servers. You can install it as a web server module or run it on the command line. It has many useful built-in features that make web development easier right out of the box. PHP is also built on a share-nothing architecture, where each request starts with a clean slate, so it scales very easily and doesn’t require much configuration. It offers automatic memory management and it’s somewhat loosely typed, so its data types may not be very suitable for edge cases, but that can be debated. For most general purposes PHP works great, but like BASIC it attracts a lot of hobbyists given that it lowers the bar of entry.

Unlike with C, in PHP you do not have to worry about managing your own memory. You can easily build data structures, connect to external resources like databases or other libraries directly through the PHP extensions, and generate output to standard streams without a lot of fuss. It’s easy to take a general idea and implement it in PHP very quickly. Most people do this with Python and Perl as well to get a working prototype up and running. However, if you build a lot of prototypes, you know that they end up getting tossed out when you start building the real thing. Regardless, PHP is a great language for getting code working quickly, and it is very similar in syntax to languages like C and Perl. The down side is that PHP and Perl are also considered very ugly by some, and have many extensions with poor implementations, awkward interfaces, or memory leaks. Not everything about PHP or Perl is great, but it works. At the end of the day it takes a fraction of the time to write PHP or Perl code that does the same thing as code in a language like C, and with less possibility of bugs, since these languages are usually very forgiving and try to account for user error where possible.

PHP is written in C, can be extended in C, and is built around the Zend Engine, which is the PHP virtual machine. PHP has different SAPIs, or Server APIs, for different web servers and platforms. Among the most popular are probably the Apache httpd module, which is known as mod_php, and the cgi/fastcgi SAPIs. The difference between the two is basically like running PHP inside your webserver as a part of the webserver program (mod_php) versus running another program alongside your webserver that talks to it through CGI (Common Gateway Interface) or its successor FastCGI, which is what the cgi/fastcgi SAPIs are built around. There are lots of different implementations, but the module running as a part of the webserver often trumps the others in performance. PHP also has a CLI SAPI, which allows you to run PHP directly from the command line. You could use this to build command-line scripts much like you would with Bash, the popular shell scripting language on *nix systems. However, most people don’t use PHP to build command-line programs. It’s not the most performant programming language, but it works well for things like the web where you want to build dynamic websites or applications: tiny programs that execute for a very short period of time and run independently of one another. When you look into building things like long-running daemons, you usually turn away from PHP and head for languages like C or even Java.

Other General Purpose Languages

There are many languages considered for both web development and as general purpose languages that are also dynamically or loosely typed and offer automatic memory management and even web server modules just like PHP. Languages like Python, Perl, and Ruby are also exceedingly popular and quite similar to PHP in many ways, though they are not all based on the same generational languages. Of course shell scripting is also going to fit under general purpose in most cases, and so Bash, sed, AWK, etc. are also great languages to know.

To some people’s surprise, JavaScript is now becoming somewhat of a general purpose language itself. Recent runtimes like Node.js, which is built on Google’s V8 engine, make JavaScript faster and a little more powerful than it was in earlier implementations. One of the best things about JavaScript is its non-blocking nature and event-driven capabilities. It’s a great language for automating event-driven tasks by setting up listeners and such. It’s got a lot of uses on the web and offers multiple paradigms as well.
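
The listener idea is simple enough to sketch in a few lines. The example below is in Python only to keep it short and runnable; the `EventEmitter` class is a hypothetical toy loosely modeled on the `on`/`emit` style found in Node.js, and it shows only how listeners are registered and fired, not any non-blocking I/O.

```python
from collections import defaultdict

class EventEmitter:
    """Toy event dispatcher: register callbacks, then fire them by event name."""

    def __init__(self):
        self._listeners = defaultdict(list)

    def on(self, event, callback):
        # Remember the callback; nothing runs until the event is emitted.
        self._listeners[event].append(callback)

    def emit(self, event, *args):
        # Call every listener registered for this event, in registration order.
        for callback in list(self._listeners.get(event, [])):
            callback(*args)

received = []
emitter = EventEmitter()
emitter.on("message", lambda text: received.append(text.upper()))
emitter.on("message", lambda text: received.append(len(text)))
emitter.emit("message", "hello")   # both listeners run
emitter.emit("unknown", "ignored")  # no listeners, so nothing happens
```

The code that emits an event doesn’t need to know who is listening, which is what makes this style a good fit for things like incoming connections or user input.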

Beyond

Beyond just looking at what all of these programming languages can do for you, it’s important to realize one language isn’t always enough to do what you need. If you’re going to start learning a programming language it’s easier to pick one that won’t require a lot of time to set up and configure. Something like Python or PHP or even JavaScript is easy to just install and start writing code with, and the best part is you can run that code instantly without having to compile anything and see the result right away. These languages aren’t very hard to learn because they have a lot of free online resources and documentation, and a lot of people already use them, so you shouldn’t have too much trouble finding quick tutorials or examples of code that show you how to write short and useful programs. But of course your mileage may vary!

Over time, when you have learned your first programming language very well, you may find the need to do some things that aren’t always very easy or even possible with that language (or you may never experience this, depending on the language and what you’re doing). This may lead you to start using another language in place of or alongside that language for a similar project or a different project. If you’re a hobbyist doing this for fun you might not be so inclined to learn more languages, but if you’re a professional you will probably need to learn many languages over the years. It doesn’t hurt to have a long list of programming languages on your resume for a job, and it certainly won’t hurt to already have some experience with a language you’ll be using on a new project at work. However, most programmers will be quite proficient in just two or three languages and have some overall understanding of others. This is usually all you need in the majority of cases.

WebSockets – Making The Web More Useful

As of the date of this document (October, 2011) the WebSocket protocol has about 18 versions of the IETF hybi draft. The hixie draft actually has around 71 Internet drafts since the first one in January of 2009. So in less than 3 years there must have been nearly a hundred drafts of this protocol, with virtually every mainstream browser providing some different implementation of the spec. Of the earliest browsers to support the WebSocket protocol (Chrome, Safari, Firefox, and Opera), Firefox 4 and Opera 11 actually shipped with it disabled by default, and Internet Explorer only has support through HTML5 Labs, which is just a prototype. Microsoft plans to offer better WebSocket support in future versions of IE with the newly revised hybi-10 draft; IE10 is the planned release. If that weren’t enough, Gecko-based web browsers in versions 6-7 implement the WebSocket API objects differently, requiring developers to write extra code when integrating with existing WebSocket code. Still, it doesn’t even stop there… You now have to worry about the mobile Safari browser on iOS (for iPhone) and the BlackBerry Browser in OS7 and how they support WebSockets as well. If it seems like this is all very messy it’s only because it is!
The web started as a place to make information more easily accessible by being able to do simple, but very logical, things like hyperlinking. Every document on the web should be able to link to any other document on the web regardless of domain or origin. That’s useful, because it makes information more accessible and efficient. However, today we spend the majority of our time trying to get automated scripts and tools to do most of the information processing for us so that we can ingest the processed information in bite-sized pieces that are easier to swallow. Much like how many of us have become accustomed to the processed meat we buy at the grocery store, this has become rather second-nature. The only problem with that is the web was never designed to be this powerful. It doesn’t supply us with the right tools to do many of the jobs we expect it to do today. We simply never envisioned that one day the world would be running Software as a Service as we do today, and probably wouldn’t have guessed that the World Wide Web was going to be the platform for all of this. The browser, once seen as a window into a strange and unfamiliar part of the Internet, has now become the cornerstone of the PC. We spend more time than ever before inside of our browsers doing many things we were once used to doing outside of the browser, like playing games, chatting with friends, writing word documents, reading books, watching videos, editing and browsing photos, and much more. The world is now a much different place with the browser being at the center of every computer screen. Disconnect someone from the Internet and all of a sudden their computer feels inert or even incomplete.
How the web works today, and how it’s basically worked for the past 18 or 19 years, is very simple. Your computer opens something called a TCP socket that connects to a remote server, and on the other end the server is listening on a specific port for incoming requests. Once a request is received the server processes it according to a defined spec (called HTTP) and sends back a reply, and in the simplest case the connection is then terminated on both ends. Nothing persists between requests. Beyond setting up the TCP connection itself, there are no negotiations to be made or handshakes to be exchanged between the client and the server at the HTTP level. It’s all very simple and the protocol name says it all: it’s in plain text because it’s the Hypertext Transfer Protocol. The request may come from any client to a single server or from one client to multiple servers. It doesn’t matter. Each request is independent of every other request and every request must elicit a response. So there’s virtually no complexity in how this protocol can scale. We send off a request, wait for a response, and both the client and server part ways forever. That is, of course, until the next request is made.
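
To show just how plain that exchange is, here is a small Python sketch that builds the text of a request and picks apart the text of a response. The helper names `build_request` and `parse_response`, the host `example.com`, and the canned response bytes are all made up for illustration; no network connection is opened.

```python
def build_request(host, path="/"):
    """Return the raw bytes of a minimal HTTP/1.1 GET request."""
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host}",
        "Connection: close",  # ask the server to hang up after replying
        "",                   # blank line marks the end of the headers
        "",
    ]
    return "\r\n".join(lines).encode("ascii")

def parse_response(raw):
    """Split raw response bytes into (status code, reason, headers, body)."""
    head, _, body = raw.partition(b"\r\n\r\n")
    status_line, *header_lines = head.decode("iso-8859-1").split("\r\n")
    _version, code, reason = status_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return int(code), reason, headers, body

request = build_request("example.com", "/index.html")

# A canned reply standing in for what a server might send back.
canned = (b"HTTP/1.1 200 OK\r\n"
          b"Content-Type: text/plain\r\n"
          b"Content-Length: 5\r\n"
          b"\r\n"
          b"hello")
status, reason, headers, body = parse_response(canned)
```

To try it against a real server you could open a connection with `socket.create_connection((host, 80))`, send the request bytes with `sendall`, and read with `recv` until the server closes the connection.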
To use WebSockets you need to write both a server and a client. The client can easily be written in JavaScript using the WebSocket API (for WebSocket protocol capable browsers). The client can also be written using Flash sockets as a fallback, or even through long-polling if neither WebSockets nor Flash are available on the client machine. The server can pretty much be written in whatever server-side language you’d like. You can take your pick from any of the popular server-side languages like Python, PHP, Perl, Java, etc… Now you can even write one in JavaScript with Node.js. I chose to write one using PHP and only implemented WebSockets (so there is no fall-back support for my example, but it works with most mainstream browsers like Chrome, Firefox, and Safari).
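
Whatever language the server is written in, the first thing it has to do is answer the client’s opening handshake. As a rough sketch (in Python, not the PHP used for my server), this is how the `Sec-WebSocket-Accept` value is computed in the later hybi drafts and in RFC 6455. The older hixie drafts used a different scheme, so this does not apply to them. The function names are made up for illustration, and a real server would also have to parse and validate the client’s request headers first.

```python
import base64
import hashlib

# Fixed GUID defined by the WebSocket protocol spec (hybi drafts / RFC 6455).
WEBSOCKET_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"

def accept_key(client_key):
    """Derive Sec-WebSocket-Accept from the client's Sec-WebSocket-Key header."""
    digest = hashlib.sha1((client_key + WEBSOCKET_GUID).encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii")

def handshake_response(client_key):
    """Build the server's reply that switches the connection to WebSocket."""
    return (
        "HTTP/1.1 101 Switching Protocols\r\n"
        "Upgrade: websocket\r\n"
        "Connection: Upgrade\r\n"
        f"Sec-WebSocket-Accept: {accept_key(client_key)}\r\n"
        "\r\n"
    )

# Sample key taken from the example in the spec.
sample = accept_key("dGhlIHNhbXBsZSBub25jZQ==")
```

Once the client has checked that value, both sides stop speaking HTTP and start exchanging WebSocket frames over the same TCP connection.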

You can see a working WebSocket example here…

For some sample code on how to write a socket daemon in PHP you can see my PHP chat server example here as well.

This example basically uses WebSockets to implement a completely web-based chat in the browser. Each user that connects can send a message to the chat room and the message will be relayed to all other users currently in the chat — in real time. There is no database required on the server-side and no nasty long-polling done on the client. The TCP socket is a two-way pipe in full duplex that allows either party to send and receive messages at any time. The client doesn’t have to wait for a response to send the next request and the server doesn’t have to wait for a request to send the next response. It was easy enough to write the daemon in PHP and get it up and running quickly as an example, but you will need to review the specifications of the WebSocket protocol carefully and find a way to implement it that will work for your specific needs.
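
The relay logic at the heart of a chat like this is small. The sketch below is a simplified Python illustration, not the code from my PHP daemon: the `ChatRoom` class is hypothetical, and plain lists stand in for the open sockets so the example can run without any networking.

```python
class ChatRoom:
    """Toy chat room: each connected user has an outbox of pending messages."""

    def __init__(self):
        self.clients = {}  # user name -> list standing in for that user's socket

    def join(self, name):
        self.clients[name] = []

    def leave(self, name):
        self.clients.pop(name, None)

    def broadcast(self, sender, text):
        # Relay the message to everyone currently in the room except the sender.
        for name, outbox in self.clients.items():
            if name != sender:
                outbox.append(f"{sender}: {text}")

room = ChatRoom()
for user in ("alice", "bob", "carol"):
    room.join(user)

room.broadcast("alice", "hi all")      # delivered to bob and carol
carol_inbox = room.clients["carol"]
room.leave("carol")
room.broadcast("bob", "hello alice")   # delivered to alice only
```

In a real daemon the `append` call would be a write to that client’s socket, and the server would call `broadcast` whenever any connected socket has a complete message ready to read.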

A useful full-blown framework you can use to help you implement most of this is Socket.IO, but keep in mind that if you’re going to be writing the server in a different language (not using Node.js) you’ll want to understand the implementation well, since it varies wildly and requires fall-back methods in case the client doesn’t support WebSockets.