Sherif's Tech Blog

Just another guy on the Internet with a keyboard…

Load Balancing Software as a Service

I’m sure many of you have seen this statue before, perhaps not the very same one in the picture, but possibly similar statues around the world. This one is located in New York City.

Statue of Atlas in NYC

This particular statue is the Titan Atlas (a figure from ancient Greek mythology) who was supposedly burdened with carrying the weight of the world, or the weight of the heavens, on his shoulders as a punishment from Zeus. Whether it was the world or the heavens is unclear, but most people seem to go with the world. It’s nothing more than a myth, but the lesson history teaches us is that it constantly likes to repeat itself. Clearly, no one can bear the entire weight of the world on their shoulders, and no single computer can either. If you are running SaaS (Software as a Service) you are online 24/7 and so is your service. The problem is there are over two billion users online (or with Internet access) today. What happens when too many of those users all start using your service at once?


What Is Load Balancing


The idea behind load balancing is that a single machine can only handle so much work at one time, and you can only scale vertically so far. Notice that even in large cities you can only build so high before you have to start building out. Since on the Internet virtually anyone can be using your server at any time, you run the risk of being overloaded without warning. If too many users send requests to your server too quickly, the load will exceed its capacity and the server will eventually crash. This particular vulnerability of typical client-server relationships on a network is exploited by what is commonly referred to as a DDoS, or Distributed Denial of Service, attack. Basically, a number of clients (sometimes a botnet controlled by one or more users) send a lot of requests to a server or group of servers very fast in order to overload it and prevent its intended users from accessing the service. Sometimes this is done just to destabilize the service running on the server, and sometimes with other malicious intent. There are ways to mitigate DoS attacks with firewall software/hardware or through other means depending on the service, but not all floods of traffic are malicious or even intentional in nature.

Google, for example, experienced what at first glance looked like a DDoS attack on its search service on the afternoon of June 25th, 2009. It actually wasn’t a malicious user or users at all. It was the world receiving the tragic breaking news of the death of Michael Jackson. Millions and millions of users from all around the world flooded Google Search at once with the same search phrase, “Michael Jackson”. Google had never seen such a tremendous amount of traffic come in all at once on a single search query before, so their first thought was “ohnoes, we’re getting DDoSed!”

Scaling Out - SaaS


Why Do I Need It


The fact remains that any number of users can suddenly cause a surge in the requests coming in to your servers at any given time, and whether that is malicious or not is unimportant. What is important is that you are prepared to handle such situations so that your service suffers as little downtime and degradation as possible. Load balancing allows you to distribute the load on a particular service or services over a larger array of resources. It basically makes your service, as a whole, more tolerant of failure by making efficient use of all available resources.

If you are running any kind of high availability service over the Internet you need load balancing. Even small applications with just a few thousand users can benefit greatly from load balancing, though. The only potential downside is that you may need more than one node to do it. This isn’t always necessary, as load balancing can come in many shapes and sizes. For example, you might be doing load balancing on a single host node using multiple guest nodes on the same machine. All of the major services you probably use on a regular basis, like your email, search engines, or popular social networking apps, make use of load balancing because it keeps things running a lot more smoothly as the number of users grows. If you’re not on board with this yet, you probably should get on board quick.


How Do I Use It


There are a few broad categories you can place load balancing techniques in. The easiest form of load balancing relies on an existing system that most of the Internet (and large networks in general) is already built on, and that’s DNS. DNS is a distributed system, so it relies on multiple components in the network doing their job in order to make things more efficient. It reduces bottlenecks like those created by routing enormous amounts of packets across the planet in fractions of a second. Like most complex systems, everything starts off small and simple and grows both horizontally and vertically, but at the core the protocols are fundamentally very simple.

DNS Load Balancing simply relies on the DNS system to take care of the most basic problems for you. The way this works is you point the DNS record for a particular domain name at multiple IP addresses (usually one for each server) using a low TTL (Time to Live). A name server can also hand out different answers depending on the geographical origin of the request, which reduces network latency by letting packets travel shorter distances. Once a name has been resolved, the answer is cached at multiple levels so that future requests go to the same place. It can be cached at the local level, the ISP level, and by other resolvers along the way. The name server then doesn’t become a bottleneck, since not every single request has to go to that name server. The TTL lets the caching servers know when a cached answer has become stale and it’s time to refresh. Also, when a record lists several IPs and requests to one of them are no longer getting through, many clients will try another IP from the list. So if you have different servers with different IPs in the DNS record, and one server becomes unresponsive (potentially having gone down), the load can end up being directed to a different server.

The inherent problem with this approach is that it doesn’t make very efficient use of all of your resources. It doesn’t take into account which servers are currently busy, and if a cached DNS answer points to a server that is now down, you can end up stuck with an unresponsive server until the cache is refreshed. Additionally, you are exposing your infrastructure to the outside world by revealing the public IPs of your servers, with no way to control the flow of traffic to an internal network. It’s very easy to have an unstable system this way. Most services that use this approach are usually just creating what are known as mirrors (servers that back each other up so that if one goes down a backup can still be reached).
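To make the rotation and caching behaviour a bit more concrete, here is a minimal Python sketch. It is only a toy simulation and not how any real name server is implemented: the class names (RoundRobinZone, CachingResolver), the 192.0.2.x addresses and the 60 second TTL are all assumptions made up for illustration, and no real DNS traffic is involved.

```python
class RoundRobinZone:
    """Toy authoritative server: rotates its A records on every answer."""

    def __init__(self, records, ttl):
        self.records = list(records)
        self.ttl = ttl
        self._offset = 0

    def query(self):
        # Rotate the list so each answer starts with a different address.
        n = len(self.records)
        answer = [self.records[(self._offset + i) % n] for i in range(n)]
        self._offset = (self._offset + 1) % n
        return answer, self.ttl


class CachingResolver:
    """Toy caching resolver: reuses an answer until its TTL expires."""

    def __init__(self, zone):
        self.zone = zone
        self._cached = None
        self._expires_at = 0.0

    def resolve(self, now):
        # Only go back to the zone when there is no answer or it is stale.
        if self._cached is None or now >= self._expires_at:
            self._cached, ttl = self.zone.query()
            self._expires_at = now + ttl
        # This sketch assumes the client simply uses the first address.
        return self._cached[0]


zone = RoundRobinZone(["192.0.2.10", "192.0.2.11", "192.0.2.12"], ttl=60)
resolver = CachingResolver(zone)

# Simulated clock in seconds.
picks = [resolver.resolve(now=t) for t in (0, 30, 60, 90, 120)]
print(picks)
```

Notice that every client behind this resolver is pinned to one server for the full TTL, regardless of how busy that server is. That is exactly the inefficiency described above.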

Software Load Balancing is another approach that addresses some of the shortcomings of the DNS techniques described earlier. A software load balancer keeps track of the available resources, and when an incoming request is received it determines how best to allocate those resources in order to service that request. One benefit of this technique is that you don’t have to reveal your network setup to the outside world. Everything can be done within the internal network configuration (whether that’s a local area network or otherwise), or in other words, you won’t expose your communication channels directly. You also have a tighter grip on security and distribution, since you can more easily control the flow of traffic over the network. Some examples of common open-source load balancing software are Pound, Varnish, mod_proxy for Apache’s httpd, and Gearman. There are all sorts of nifty ways to balance the load across your network. You can have the load balancers poll the servers and check on resources like CPU usage, available memory, storage space, network traffic, or open TCP connections. The load balancer can then use this information to figure out how best to direct the incoming requests and serve up the responses as quickly and efficiently as possible.

There are still a few problems inherent to this technique depending on how you use it. If you’re relying on a single machine you have a single point of failure: if the host node goes down, the load balancer and all of your resources go with it. If you’ve got only one load balancer in front of multiple servers you still have a single point of failure. Additionally, the load balancer itself can be DoSed by an attack of enough magnitude and sophistication. Not only that, but you have to worry about things like session storage consistency across multiple servers, file-system access, database synchronization between different database servers, and network bottlenecks that might not always be easy to resolve with load balancing, to name a few.
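As a rough illustration of how a balancer might use the information it polls, here is a small Python sketch of one common strategy, least connections, combined with a health flag. This is my own simplified example and not how Pound, Varnish or mod_proxy work internally; the Backend and LeastConnectionsBalancer names and the web1..web3 hosts are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    name: str
    healthy: bool = True          # would be updated by a health check poll
    active_connections: int = 0   # open connections the balancer knows about


class LeastConnectionsBalancer:
    def __init__(self, backends):
        self.backends = list(backends)

    def pick(self):
        # Skip servers that failed their last health check.
        candidates = [b for b in self.backends if b.healthy]
        if not candidates:
            raise RuntimeError("no healthy backends available")
        # Send the request to the least busy of the remaining servers.
        chosen = min(candidates, key=lambda b: b.active_connections)
        chosen.active_connections += 1
        return chosen

    def release(self, backend):
        # Call this when a request finishes.
        backend.active_connections = max(0, backend.active_connections - 1)


pool = [Backend("web1"), Backend("web2"), Backend("web3")]
balancer = LeastConnectionsBalancer(pool)

first = [balancer.pick().name for _ in range(3)]
pool[1].healthy = False  # pretend web2 stopped answering health checks
after_failure = [balancer.pick().name for _ in range(2)]
print(first, after_failure)
```

Unlike the DNS approach, a dead server drops out of rotation as soon as the balancer notices, without waiting for any cache to expire. The balancer object itself is still the single point of failure mentioned above.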

Hardware Load Balancing is also an option. You can actually buy very expensive firewalls/routers that take care of many of these things for you. Most people usually just set up a dedicated node or two with software load balancers that pretty much do the same thing. Hardware appliances like Cisco’s ASA might do a better job of handling security and high-bandwidth loads, but they do come with a heavy price tag.


Some Load Balancing Tips


There are some pretty common approaches to some of the problems inherent to distributing a service over multiple servers. Take session storage as the most obvious problem. If you’re using PHP you are probably using the built-in session handler, which makes use of file-based sessions. If users are being directed to different servers by the load balancer, a user ends up with multiple sessions across those servers (that might be a little problematic for your application and annoying to the user). Some people try to avoid this by creating what’s called a sticky session. Once the session is generated for a user, they’re sent a cookie that lets the load balancer know to direct their subsequent requests to that particular server. There are a few minor problems with that, but nothing you couldn’t work out through a well-planned architectural approach. Another way to approach this is to create a centralized session storage server where all the requests will look for the session. Depending on your infrastructure this may or may not be a good idea, and keep in mind it also creates a single point of failure. For example, if your servers are built on stacks (you have several software-based servers running on the same node, like a web server, database server, application server, etc.) it takes some tinkering to configure each stack to work from a centralized session storage. You can use something like Redis, where you can have master/slave replication across all stacks. This takes a little less configuration and moves the session handling into the software stack layer, thereby removing it from the load-balancing layer.
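Here is a minimal Python sketch of the sticky session idea, just to show the moving parts. The cookie name (lb_backend), the backend names and the route function are assumptions I made up for illustration. Real load balancers each have their own configuration for this.

```python
import itertools

BACKENDS = ["app1", "app2", "app3"]  # hypothetical application servers
COOKIE_NAME = "lb_backend"           # hypothetical cookie set by the balancer

_round_robin = itertools.cycle(BACKENDS)


def route(cookies):
    """Return (backend, cookies_to_set) for one incoming request."""
    pinned = cookies.get(COOKIE_NAME)
    if pinned in BACKENDS:
        # Returning visitor: keep them on the server holding their session.
        return pinned, {}
    # New visitor (or their server was removed): assign one and pin it.
    backend = next(_round_robin)
    return backend, {COOKIE_NAME: backend}


# First request from a new browser has no cookie yet.
backend, set_cookies = route({})
browser_cookies = dict(set_cookies)

# Every later request from that browser carries the cookie.
later = [route(browser_cookies)[0] for _ in range(3)]

# A different browser gets the next server in the rotation.
other_backend, _ = route({})
print(backend, later, other_backend)
```

One of the "minor problems" shows up right in the fallback branch: if the pinned server disappears, the user is reassigned and their file-based session is gone, which is why a shared session store can be attractive.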

The other obvious problem is file system storage. If you allow your users to upload files to your server, or you store large amounts of files that your application relies on heavily, there needs to be some system whereby your application layer can access those files, considering the load balancer may send requests to different servers. Again there is a centralized approach like with session storage, but even with a replication approach (to avoid the single point of failure downside) you might create the problem of over-redundancy. If your servers are set up in stacks, having four or five copies of each file (or more, depending on how many servers you have), one on each server stack, is a bit of a waste, especially if you’re already using RAID arrays for redundancy. Even if you have a centralized set of servers for storage you still face the problem of network overload. For example, if your backbone bandwidth capacity is 100Mbps but your central network bandwidth capacity is 10x100Mbps (1,000Mbps in total), you eventually create a bottleneck as usage increases, since your backbone can only serve up to 100 megabits per second of traffic at any given time.

Using a CDN (Content Delivery Network) is one solution often used when large amounts of files need to be shared across a network, but this can also be a bit costly depending on your needs. In its simplest form a CDN is really just a group of servers that store files or data objects for you and replicate them across multiple nodes, allowing many other servers on the network to access that data with better caching and high bandwidth to reduce latency. The servers in the CDN clusters are usually strategically located on the edges of the core network to minimize the bottlenecks involved in the centralized network loop. So you are redirecting file-storage traffic away from the central network and off to the edge servers, expanding bandwidth and minimizing bottleneck traffic. This both solves the single point of failure problem and takes the complexity away from the server stack, which can ultimately help reduce loads and create more efficient load balancing. Most services that utilize CDNs are ones that need to offer consistent high-bandwidth access to a large user base. For example, a service that offers high-definition video streaming, large photo sharing web sites, or other media services with high availability needs. You don’t always have to build this infrastructure yourself. You can rely on services like Amazon CloudFront, which is a pay-as-you-go CDN service offered by Amazon. There are many other competitors, of course, that can offer cheap CDN solutions. Depending on the sensitivity of your data this may or may not be an option for your particular SaaS needs. Still, it’s something to consider.

Besides file storage you probably have a lot of database concerns in a system that scales horizontally as well. If you’re just using a single LAMP stack with little more than PHP, MySQL and Apache running your back-end, it might seem easy to scale wide at first. The problem you’re likely to run into head-on is data replication across your MySQL servers. The database is almost always the biggest bottleneck in SaaS. It usually contains tons of data that virtually every one of your users will access with each hit. There’s only so much traffic a single database server can handle, but setting up two or more database servers can show some significant improvement. Your load balancer can also play a role in this. There can be data object caching mechanisms in place to ease some of the load from the most frequent queries. There can also be network latency issues to deal with once you have several database servers all replicating (especially if these servers are geographically spread out across different data centers, cities, countries or even continents). Chunking is definitely not something I’d advise. It throws way too many variables into the equation and presents more problems than solutions for most applications.
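To give a feel for the data object caching idea, here is a small Python sketch of a query cache that sits in front of the database. Everything in it is a stand-in I invented for illustration: QueryCache, fake_database and the 30 second lifetime are assumptions, and in practice you would more likely reach for something like memcached or Redis than an in-process dictionary.

```python
import time


class QueryCache:
    """Keeps query results for a short time so repeat hits skip the database."""

    def __init__(self, run_query, ttl_seconds, clock=time.monotonic):
        self._run_query = run_query
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # sql text -> (expires_at, rows)

    def get(self, sql):
        now = self._clock()
        entry = self._entries.get(sql)
        if entry is not None and entry[0] > now:
            return entry[1]  # still fresh, no database round trip
        rows = self._run_query(sql)
        self._entries[sql] = (now + self._ttl, rows)
        return rows


db_hits = []


def fake_database(sql):
    # Stand-in for a real MySQL call; just records that it was asked.
    db_hits.append(sql)
    return [("result for", sql)]


fake_now = [0.0]  # controllable clock so the example runs instantly
cache = QueryCache(fake_database, ttl_seconds=30, clock=lambda: fake_now[0])

cache.get("SELECT * FROM plans")
cache.get("SELECT * FROM plans")   # served from the cache
fake_now[0] = 31.0
cache.get("SELECT * FROM plans")   # entry expired, hits the database again
print(len(db_hits))
```

The trade-off is staleness: for up to 30 seconds users may see data that has already changed in the database, so this only suits queries where that is acceptable.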

Understanding DNS in Order to Host Your Web Sites

Introduction

This tutorial attempts to explain some of the inner workings of DNS, specifically with regard to web hosting. If you are thinking of getting into web hosting, or have been hosting web sites for a while, you will likely benefit from this DNS tutorial. There are entire books written about DNS, so I will not attempt to fully explain, in exhaustive detail, the complex system that makes up DNS. It is not the purpose of this tutorial to serve as a complete reference for DNS. Rather, it is meant to act as a guide to help you understand which parts of DNS to focus on for web hosting and maintaining a fully functioning web site.
Before we begin, let us first familiarize ourselves with some common terminology associated with DNS. First off, DNS is an acronym that stands for Domain Name System (you will also see it called Domain Name Service). DNS relies on some protocols that have been refined over time, and some that have refined DNS, as the Internet has grown and expanded into what it is today. Protocols such as HTTP, FTP, TCP, IP, and UDP should be basic general knowledge for anyone with a solid understanding of the Domain Name System. These protocols fall under the Internet Protocol Suite, more commonly known as TCP/IP. What we need to know about this suite, for the purpose of this tutorial, focuses mostly on those aforementioned protocols.

Starting With The Domain Name

If you want to host a web site you will need a domain name. You can register a domain name with any of the registrars available online, through your ISP, or through a web hosting company. The process begins by selecting a Top Level Domain, or TLD. This is a domain like com, net, or org, for example. The TLD sits at the top of the domain hierarchy, just below the DNS root. As humans we read a domain name from left to right. For example, www.domain.com is read starting with the www and ending with com. To a computer or DNS server, the domain name is read from right to left, starting with the TLD and ending with the lowest point in the domain name space relative to the tree of the domain hierarchy. So www.domain.com is effectively processed as com.domain.www by DNS, because it needs to start at the top of the domain name space and work its way down in order to resolve the name. You may ask, why use this seemingly nonsensical routine that contradicts the way you read and understand domain names? To us it seems contradictory based on how we have become accustomed to reading a domain name, but to computers it makes sense based on how DNS is designed. What you need to remember is that DNS is a hierarchical system that finds authoritative name servers by starting at the top of the hierarchy and working its way down until it finds the name servers responsible for the domain. DNS is also delegated into zones and subzones to make it more efficient.
A Fully Qualified Domain Name, or FQDN, is distinguished by its uniqueness in the domain name space and is sometimes referred to as an absolute domain name. For example, if we have a server on a network with the host name server1, then server1 is relative to its parent domain. If that parent domain, including the TLD, happens to be domain.com, then server1.domain.com is an FQDN, as there can only be one server1.domain.com even though there may be many server1’s in other parts of the domain name space (e.g. server1.mydomain.com, server1.yourdomain.com, server1.ourdomain.com, etc). The host is thus unique in the domain name space, and this is an important part of how DNS was designed to remain hierarchical. The machine can now be uniquely identified on the Internet by server1.domain.com regardless of its physical location.
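The right-to-left reading and the way a relative host name becomes an FQDN can be sketched in a few lines of Python. The two helper functions (resolution_order and qualify) are names I made up for this illustration, and the sketch only manipulates strings. It performs no actual DNS lookups.

```python
def resolution_order(name):
    """List the domains DNS walks through, from the TLD down to the host."""
    labels = name.rstrip(".").split(".")
    # Build progressively longer names starting from the rightmost label.
    return [".".join(labels[i:]) for i in range(len(labels) - 1, -1, -1)]


def qualify(host, parent_domain):
    """Turn a relative host name into an FQDN under its parent domain."""
    return f"{host}.{parent_domain}"


print(resolution_order("www.domain.com"))
print(qualify("server1", "domain.com"))
print(qualify("server1", "mydomain.com"))
```

The first call shows the top-down order (com, then domain.com, then www.domain.com), and the last two show how the same relative name server1 ends up as two distinct, unique FQDNs.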

Name Servers

ISPs set up DNS servers and DNS resolvers that communicate with each other to locate information about domain names. The name servers tell us where to look for information about a particular domain name. Registrars hold the name server information for a particular domain name. This is called an NS record. It shows up in dig (a Linux DNS lookup tool) as domain.com. 86400 IN NS ns1.domain.com., for example. You can set multiple name servers, and the general practice is to have at least two name servers for each domain. This way if one fails then hopefully the secondary name server will still resolve. You can use the same name servers for multiple domains. Name servers can be both public and private. When the name servers live under the domain itself (as ns1.domain.com does here), their IP (Internet Protocol) addresses must also be set in A records for that domain name.
If you are using a shared hosting account you will most likely use your web host’s name servers set on that server. Some web hosts will allow you to use private name servers. In this case you will set ns1.yourdomain.com and ns2.yourdomain.com as the name servers for your domain with your domain registrar and specify the IPs that your web host provides you for those name servers.
If you are running a dedicated server you will need to have your own DNS software installed. The standard is usually BIND. BIND stands for Berkeley Internet Name Domain, and its server daemon is called named. Low-end VPS machines or some home networks may use smaller-footprint DNS server software such as Dnsmasq. While BIND has had some security vulnerabilities, BIND9 is still certainly popular. Your focus should be on DNS caching and recursive DNS queries rather than authoritative DNS, unless you really know what you’re doing.
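If you ever need to pick apart record lines like the NS example above in a script, the layout is simply name, TTL, class, type and data separated by whitespace. Here is a hedged Python sketch of that. ResourceRecord and parse_record are hypothetical names of my own, the ns2 line is added purely for illustration, and the sample text is hard-coded rather than taken from a live dig run.

```python
from typing import NamedTuple


class ResourceRecord(NamedTuple):
    name: str    # the domain the record belongs to
    ttl: int     # seconds the answer may be cached
    rclass: str  # almost always IN (Internet)
    rtype: str   # NS, A, CNAME, MX, PTR, ...
    rdata: str   # the value, e.g. a name server host name


def parse_record(line):
    # Split into at most five fields so data containing spaces stays intact.
    name, ttl, rclass, rtype, rdata = line.split(None, 4)
    return ResourceRecord(name, int(ttl), rclass, rtype, rdata)


# Sample answer lines in the same layout as the example in the text.
answer = [
    "domain.com. 86400 IN NS ns1.domain.com.",
    "domain.com. 86400 IN NS ns2.domain.com.",
]

records = [parse_record(line) for line in answer]
name_servers = [r.rdata for r in records if r.rtype == "NS"]
print(name_servers)
```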

A Records, CNAMEs, PTR, and MX Records

People will commonly use www.domain.com in their browsers rather than domain.com when visiting a website. If you are hosting your web sites thinking these two are one and the same, you are mistaken. They are in fact viewed as two different web sites. Common search engine bots like Google’s Googlebot will crawl these URLs as two different web sites. To avoid any confusion there is a good way to work around this both on the DNS side and through your web server software (for example, configuring your .htaccess file to handle this with a 301 permanent redirect).
First, you should consider that CNAME (Canonical Name) records can create a second lookup when retrieving the records. This can be avoided by simply defining the alias as an A record of its own. The A record can then look something like this:

www.example.com. IN A 192.0.32.10

With www.example.com. set to A 192.0.32.10, the alias still points to the same IP without creating a CNAME lookup. With a CNAME, the resolver would first find www.example.com IN CNAME example.com and then proceed to look up example.com to find it IN A 192.0.32.10.
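The difference between the two setups is easy to see if you count the steps. Below is a toy Python sketch that follows records in a dictionary standing in for a zone. The resolve function and the dictionary layout are my own invention for illustration. Real servers often hand back the CNAME target’s A record in the same response, so treat the count as the conceptual number of steps rather than a measure of network round trips.

```python
def resolve(name, zone, max_hops=8):
    """Follow records for name; return (ip, number_of_lookups)."""
    lookups = 0
    for _ in range(max_hops):
        rtype, value = zone[name]
        lookups += 1
        if rtype == "A":
            return value, lookups
        # A CNAME only tells us another name to look up.
        name = value
    raise RuntimeError("too many CNAME hops")


zone_with_cname = {
    "example.com.": ("A", "192.0.32.10"),
    "www.example.com.": ("CNAME", "example.com."),
}

zone_with_a = {
    "example.com.": ("A", "192.0.32.10"),
    "www.example.com.": ("A", "192.0.32.10"),
}

print(resolve("www.example.com.", zone_with_cname))
print(resolve("www.example.com.", zone_with_a))
```

The catch with the A record version is that the IP now lives in two places, so both records have to be updated if the server ever moves.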

If you’re not familiar with the numbers showing up in the dig output (such as the 86400 in the NS example earlier), these are the TTL, or Time To Live, in seconds. They represent how long a record may be cached before it has to be fetched again. The settings can vary based on record type. Again, because DNS is a hierarchical distributed system, the caching name servers keep the records they received from the authoritative name servers until the TTL has expired. This can cause different load issues depending on how low or high the TTL is set. It is generally not a good idea to lower your TTL too much unless you are constantly updating your DNS records, which is rarely the case.
Your MX (Mail Exchange) records are prioritized. The lower the number set in the MX record, the higher its priority, so MX 0 will always be the first point of mail delivery.
It’s also important to consider rDNS, or Reverse DNS, for your domain, especially if you plan on sending mail from that domain. Some mail servers will attempt to verify fcrDNS, or Forward Confirmed Reverse DNS, for the IPs they receive mail from in order to prevent domain name spoofing (this is where a malicious server attempts to use someone else’s domain in its email header). Reverse DNS is basically the opposite of what we discussed earlier. Instead of the resolver using the domain name to find an IP address, it uses an IP address to find a domain name. This is done through a PTR record, or Pointer Record. IPv4 uses the in-addr.arpa domain and IPv6 uses the ip6.arpa domain. The pointer record helps us resolve the reverse DNS lookup in these domains. Forward Confirmed Reverse DNS is a full loop. The mail server will usually take the IP that the mail was received from and attempt to resolve it to a domain using the reverse DNS lookup. That domain is then resolved back to an IP address, and if the IPs match then the fcrDNS test is a pass (i.e. we can be pretty sure that this IP is the one that correctly delivered the mail). Failing the fcrDNS test does not necessarily prevent your mail from being delivered, but most large email servers, especially ones that handle millions of emails per day like Yahoo, MSN, or Gmail, will delay or possibly reject the mail delivery. It might also get filtered out as spam, so make sure you have correctly configured your rDNS. This process will require a dedicated IP address in order to complete the loop.
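The fcrDNS loop can be sketched in Python as follows. To keep the example runnable without touching the network, the PTR and forward lookups are plain dictionaries. Both tables, the mail.example.com host name and the passes_fcrdns function are assumptions for illustration only. The one real library call is ipaddress.ip_address(...).reverse_pointer from the standard library, which builds the in-addr.arpa (or ip6.arpa) name for an address.

```python
import ipaddress


def reverse_name(ip):
    """Build the in-addr.arpa / ip6.arpa name used for the PTR lookup."""
    return ipaddress.ip_address(ip).reverse_pointer


def passes_fcrdns(ip, ptr_lookup, forward_lookup):
    # Step 1: reverse lookup, IP address -> host name (PTR record).
    hostname = ptr_lookup(reverse_name(ip))
    if hostname is None:
        return False
    # Step 2: forward lookup, host name -> IP addresses (A/AAAA records).
    forward_ips = {ipaddress.ip_address(a) for a in forward_lookup(hostname)}
    # Step 3: the loop closes only if the original IP is among them.
    return ipaddress.ip_address(ip) in forward_ips


# Hypothetical records standing in for real DNS answers.
PTR_RECORDS = {"10.32.0.192.in-addr.arpa": "mail.example.com"}
A_RECORDS = {"mail.example.com": ["192.0.32.10"]}


def forward(hostname):
    return A_RECORDS.get(hostname, [])


print(reverse_name("192.0.32.10"))
good = passes_fcrdns("192.0.32.10", PTR_RECORDS.get, forward)
bad = passes_fcrdns("192.0.32.99", PTR_RECORDS.get, forward)
print(good, bad)
```

A real check would swap the two dictionaries for actual PTR and A/AAAA queries. The second call fails here simply because no PTR record exists for that address, which is the usual situation on shared IPs you do not control.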

There are many other areas to be looked at in setting up and configuring DNS so I will try to update this tutorial where possible. Feel free to let me know if you have any questions or suggestions for this tutorial or to correct/append anything I may have overlooked.