Web Analytics with PHP and Google Visualization: Made Simple

web-analytics

web-analytics

The web is a place where everyone is in your backyard. This can be both a blessing and a curse. How do you know where your website traffic is coming from and how much of it there is? How do you know the difference between traffic that comes from a bot (or automated software) behind the computer and traffic that comes as a result of a person behind the computer? How was traffic led to your website in the first place and why? How long do people stick around and what do they like the most about your site? These are all some of the many questions you can answer with web analytic software, but the real question is why would you want to answer these questions at all and what’s the best way to look at the answer?

Seeing Is Believing

If you’re a visual person then you probably agree with cliches like “a picture is worth a thousand words”, or “seeing is believing”, but even if you aren’t someone that generally understands things visually more than audibly or kinesthetically you are still likely to agree that it’s easier to spot the obvious with your eyes than it is with most other senses. That is, if your eyes are looking at something that stands out.

If we look at a heat map it’s easier to spot where the majority of our visitors and traffic are coming from, geographically. It also helps us understand that people visiting a virtual space can, and will, be physically located in different parts of the world. It just makes sense to look at geographic visitor information on a map. Even if these people are behind machines they must occupy a physical space. The location of the computer that made a request to your website may be different than the location of the actual person at that computer that received it (i.e. proxies), but for the most part you want a general idea (a view from 40,000 feet as they say) of where people are coming from to visit your website. Looking at this same data in a tabular fashion, really isn’t the same. Obviously we can get the same information, but it doesn’t compute in our heads the same way.

GeoIP Data Made Simple

Gathering this geographic data about your website’s visitors is actually a simpler task than it may appear to be at first. The easiest way to go about collecting this information is from the very source that directs all the traffic to your website in the first place (i.e. your web server). Your web server access logs should already gather all the necessary information you need to start analyzing geographic information about your visitors. This is the remote IP address stored in your access logs for each request made to your web server.

All you need is a GeoIP database (I find that MaxMind’s GeoLite City is one of the best out there and it’s free), MySQL (or your database server of choice), and a simple PHP script to get started.

First, download the GeoLite City CSV zip file from MaxMind’s website. Then unzip it into your MySQL directory and startup MySQL command line interface. We’re going to need to create a new database first. So let’s do that…

mysql> CREATE DATABASE `geoip`;
Query OK, 1 row affected (0.10 sec)

mysql> USE `geoip`;
Database changed

Now that we have a database we can create our tables. We’ll need two tables: one for the location data and one for the IP block data. I’ll name them   iplocs   and   ipblocks  , respectively.

-- The iplocs table
CREATE TABLE `iplocs` (
  `locid` int(11) NOT NULL,
  `country` varchar(2) NOT NULL,
  `region` char(2) NULL,
  `city` varchar(255) NULL,
  `postalcode` varchar(6) NULL,
  `latitude` decimal(3,2) NULL,
  `longitude` decimal(3,2) NULL,
  `metrocode` int(11) unsigned NULL,
  `areacode` char(3) NULL,
  INDEX locid(locid)
);
-- The ipblocks table
CREATE TABLE `ipblocks` (
  `startip` int(11) unsigned NOT NULL,
  `endip` int(11) unsigned NOT NULL,
  `locid` int(11) NOT NULL,
  INDEX locid(locid)
);

Now we just need to import all of the data from the csv files we downloaded into our tables. This is a lot of data so we’re going to do with it from the MySQL command prompt using   LOAD DATA INFILE   since it will be much faster loading all of these into the tables by reading directly from the files.

mysql> LOAD DATA
          INFILE 'GeoLiteCity-Location.csv'
          INTO TABLE `iplocs`
          FIELDS TERMINATED BY ','
          ENCLOSED BY '"'
          LINES TERMINATED BY '\n'
          IGNORE 2 LINES;

This will import all of the ip location information from our GeoLiteCity-Location.csv file, which you should have unpacked from the zip file you downloaded, into the iplocs table that will store all of the location data about each IP block. Now we just need to do the same for our ipblocks table.

mysql> LOAD DATA
          INFILE 'GeoLiteCity-Blocks.csv'
          INTO TABLE `ipblocks`
          FIELDS TERMINATED BY ','
          ENCLOSED BY '"'
          LINES TERMINATED BY '\n'
          IGNORE 2 LINES;

Once you’ve done this we can now write a simple script in PHP to extract all the IPs from our web server access log and query the database for GeoIP location information that can be used to build a heat-map using Google Visualization APIs.

Fire Up PHP

For the purposes of this article I’m keeping things very simple and very generalized to the broadest use cases. Since the majority of people that are using PHP and MySQL tend to run Apache httpd as their web server I will assume that the reader’s access log is handled via Apache httpd and uses the default   LogFormat   directive specified by mod_log_config. That’s usually   ”%h %l %u %t \”%r\” %>s %b”  , meaning that the very first part of each line in the access log file is the   remote host   IP address. If you don’t know what format your Apache log is stored in be sure to check the Apache httpd documentation and verify your vhost config file to see how it’s being stored. Since each line in the log file represents an individual request made to your web server we can iterate over the entire file with the simple PHP code demonstrated below.

if (!$fp = fopen("/path/to/access_log", "r")) {
  die("Could not open access log!");
}

$visitors = array();

while (!feof($fp)) {
  $logLine = fgets($fp);
  list($ip) = explode(" ", $logLine, 2);
  if (!$ip) {
    continue;
  }
  if (isset($visitors[$ip])) {
    $visitors[$ip]++;
  } else {
    $visitors[$ip] = 1;
  }
}

fclose($fp);
echo "Found a total of " . count($visitors) .
     " unique visitors and " .
     array_sum($visitors) . " requests in the log.\n";

This simple script will tell you exactly how many unique IP addresses it was able to extract from the log file and how many requests came from each of them. This is stored in our   $visitors   array where each key represents an IP address and its value is the total number of requests that IP address made.

Now all we need to do is query the database for the geoip information and build a simple JSON object with that data. For this we’ll need to connect to the database from PHP first. So let’s do that after we open our log file.

try {
  $db = new PDO("mysql:dbname=geoip;host=localhost", "dbuser", "dbpass");
  /* Put PDO into Exception Mode */
  $db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
  /* Turn off emulated prepares */
  $db->setAttribute(PDO::ATTR_EMULATE_PREPARES, 0);
} catch(PDOException $e) {
  /* Handle Exceptions here... */
  echo "Ohnoes! Something went wrong... " . $e->getMessage();
  exit;
}

/* The SQL we'll need for the query later on... */
$sql =
     "SELECT iplocs.country FROM ipblocks " .
     "JOIN(iplocs) ON (ipblocks.locid = iplocs.locid) " .
     "WHERE :ip BETWEEN ipblocks.startip AND ipblocks.endip";

$stmt = $db->prepare($sql); // prepare our statement for later

Great now that we have a connection to our database and our prepared statement is ready for execution we can start querying it to get the geoip data. Note that we’re not loading the entire file into memory since some people may have huge web server logs so this will be a bit more efficient. After we’ve read from the file and gathered all of the IP information we need, we can then collect the geo-location data information.

You’ll want to add this bit after the file loop…

/* Initialize a $geodata array */
$geodata = array();

/* Traverse the $visitors array */
foreach ($visitors as $ip => $hits) {
  $stmt->execute(array(":ip" => $ip));
  $result = $stmt->fetch(PDO::FETCH_ASSOC);
  if (empty($result)) {
    continue;
  }
  if (isset($geodata[$result['country']])) {
    $geodata[$result['country']] += $hits;
  } else {
    $geodata[$result['country']] = $hits;
  }
}

Now we have everything we need to use the Google Visualization API to get a heat map of our visitors. Here’s what you’ll need to do to get the JSON we’ll be using for the heat map first. Add this part to the end of your script, after the above loop.

$json = array(array("Country", "Hits"));

foreach ($geodata as $country => $hits) {
  $json[] = array($country, $hits);
}

echo json_encode($json); // Save this

Using the Google Visualization APIs

Once we have this JSON data we can then use the following template to load it using Google’s Visualization API. Put this part in the   head   section of your HTML.

<script type='text/javascript' src='https://www.google.com/jsapi'></script>
<script type='text/javascript'>
  google.load('visualization', '1', {packages:['intensitymap']});
  google.setOnLoadCallback(drawChart);
  function drawChart() {
    var data = google.visualization.arrayToDataTable(/* JSON HERE */);
    var chart = new google.visualization.IntensityMap( document.getElementById('chart_div') );
    chart.draw(data);             
  }
</script>

Just make sure to insert your JSON on line six where the comment is. Next just create a   div   in your HTML using the same name you initialized the chart object with (in our example that’s   chart_div  ), and load the HTML in your browser and that’s it! You now have a heat map of your visitors.

There are all sorts of useful ways you can look at your data to analyze the things that might matter to you most.

For example, looking at the last 6 months of my web server access log for my personal blog I can see that my visitors have increased substantially during that time. Traffic to my blog has gone up an order of magnitude. But by visualizing the data and looking at it in a logarithmic scale I can see that there were points where traffic clearly dipped and peaked during certain months.

Just seeing this data by itself is not enough. It’s also important for me to know where my traffic is coming from. That way I can focus on building up new sources of traffic that aren’t currently attracting high traffic volumes.

By seeing this information visually I can definitely make better decisions about what’s going in a way that tabular data doesn’t easily portray. For example, you can see here by this pie chart that the largest source of direct referral traffic to my blog is in fact other blogs that link to mine. However, if I simply looked at the tabular data of hosts that provided direct referrals to my blog I would actually think that Google was my biggest source of traffic, when in fact that isn’t true.

However, when I take a much deeper look into the data I realize that even though blogs and social media sites like reddit, twitter, youtube, and facebook do ultimately provide me with a lot of traffic over time, Google actually brings in consistent volumes of traffic every day. These are precisely the kinds of pictures you want to be painting with web analytic visualization software.

Google Web Master Tools

You should also consider using Google Web Master Tools as a part of your web analytic suite. There are loads of useful features provided there by Google, like seeing how many searches you show up in every day, what your CTR (Click Through Rate) looks like, how high up on the search result pages your site ranks, what search phrases people are using to find you, and so much more. It’s all free and what’s more is that if you already have a Google Drive account you can easily import this data into your Google Drive and create awesome charts with it in Google Docs!

3 Responses to“Web Analytics with PHP and Google Visualization: Made Simple”

  1. John
    December 19, 2012 at 5:18 am #

    I hate relying on third party sites for stats, and I’m not sure I trust big G to no be using analytics data for their own nefarious purposes, so this is perfect thanks.

  2. December 19, 2012 at 10:05 am #

    Thanks for a detailed article. This line seems wrong:

    $json = array(“Country”, “Hits”);

    For me it only worked as:

    $json = array(array(“Country”, “Hits”));

    • GoogleGuy
      December 21, 2012 at 9:08 am #

      You’re right. That was an oversight and it’s been fixed. Thanks for pointing that out :)

Leave a Reply

Your email address will not be published. Required fields are marked *

(Required)

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>