April 12, 2009

How to Build a Scraper Using PHP & cURL

Filed under: PHP at 3:12 am — Comments (0)

Use the following knowledge at your own risk; scraping content is a pretty gray area. For this guide you will need Firefox and the Firebug extension. You’ll also need PHP installed with the cURL module.

For this example we’ll be grabbing the counters for “bloggers, new posts and words today” from WordPress.com. Let’s start off by creating a function to fetch the html contents of a webpage using cURL.

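function getDocument($url)
{
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_USERAGENT, 'Googlebot/2.1 (http://www.googlebot.com/bot.html)');
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_FAILONERROR, true);    // treat HTTP status codes >= 400 as failures
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
        curl_setopt($ch, CURLOPT_AUTOREFERER, true);    // set the Referer header when following redirects
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the response instead of printing it
        curl_setopt($ch, CURLOPT_TIMEOUT, 10);          // give up after 10 seconds
        $html = curl_exec($ch);                         // false if the request failed

        return $html;
}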

We’ll be disguising ourselves as Googlebot. You can find a list of user agents to use at http://www.user-agents.org/. Next we’ll create a DOM object out of the raw html code.

$html = getDocument('http://wordpress.com');
$dom = new DOMDocument();
@$dom->loadHTML($html);

Note the @: it suppresses the warnings that might (and most likely will) be output because the website doesn’t follow HTML coding standards. Next, we will use Firebug to find the XPath of what you want to scrape. Right click on what you want to scrape and click “Inspect Element”. Firebug will pop up with the code highlighted. At the time of this blog post this is what pops up:

<h6>
<span>204,045</span>
bloggers,
<span>133,640</span>
new posts,
<span>34,374,943</span>
words today.
</h6>

h6 is the container of the content we want to scrape, so the XPath of the h6 tag is what we want. Right click on the h6 tag inside Firebug and click “Copy XPath”. This saves something like “/html/body/div/div[2]/div[2]/h6” to our clipboard. So we’ll add the following to our code:

$xpath = new DOMXPath($dom);
$content = $xpath->query("/html/body/div/div[2]/div[2]/h6");

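Note: If your XPath contains any ‘tbody’, remove them from the path. Browsers often insert tbody into tables themselves, so it usually isn’t in the HTML that PHP receives.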

The span tags with our desired content are now accessible through $content->item(0)->childNodes. You can choose to iterate through this list, or for simple uses like our example you can call the child items directly like so:

echo 'Bloggers: ' . $content->item(0)->childNodes->item(0)->nodeValue . '<br />';
echo 'New Posts: ' . $content->item(0)->childNodes->item(2)->nodeValue . '<br />';
echo 'Words Today: ' . $content->item(0)->childNodes->item(4)->nodeValue;
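If you would rather iterate than rely on fixed child positions (which shift if the markup contains whitespace between tags), one option is to query the span elements directly. The following is only a sketch under a few assumptions: it parses a hard-coded copy of the markup shown earlier instead of fetching the live page, the //h6/span path and the $labels array are made up for this illustration, and libxml_use_internal_errors() stands in for the @ operator.

```php
<?php
// Sample markup copied from the Firebug output above (not fetched live).
$html = '<html><body><h6>
<span>204,045</span>
bloggers,
<span>133,640</span>
new posts,
<span>34,374,943</span>
words today.
</h6></body></html>';

// Collect parser warnings internally instead of printing them (an alternative to @).
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
// Assumed path for this sketch: every span that is a direct child of an h6.
$spans = $xpath->query('//h6/span');

// Hypothetical labels, in the order the counters appear in the markup.
$labels = array('Bloggers', 'New Posts', 'Words Today');
$counts = array();

for ($i = 0; $i < $spans->length && $i < count($labels); $i++) {
    // Strip the thousands separators so the value can be used as a number.
    $text = trim($spans->item($i)->nodeValue);
    $counts[$labels[$i]] = (int) str_replace(',', '', $text);
}

foreach ($counts as $label => $count) {
    echo $label . ': ' . number_format($count) . "<br />\n";
}
```

Because the query selects only span elements, text nodes such as “bloggers,” are skipped without any index arithmetic.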

To get the most out of this technique you should brush up your knowledge of the Document Object Model.

Best Dedicated Web Host? Meet SoftLayer

Filed under: Hosting at 2:02 am — Comments (0)

Webmasters who create sites that grow beyond the capabilities of a shared host are often faced with the daunting task of choosing a dedicated hosting provider. I’ve been all across town, hosting with ev1servers (now ThePlanet), iWeb Technologies, Mediatemple, and others I dare not name, as their levels of service do not deserve the attention. The biggest problem I’ve run into with these companies is support. I remember an instance with iWeb where my server went down at 8pm, and since their high level technicians were off for the day I had to wait until 8am to get service. Not every situation with the aforementioned dedicated companies was that extreme, but they have their issues.

Back in December of ’07 one of my largest websites suffered a DDoS attack. My datacenter (iWeb) said the only solution would be to shut down my servers, null route the IPs and wait it out. After a few sleepless nights I came across SoftLayer and read about their Cisco Guard DDoS protection system. I went into a chat with sales and explained my situation; they went over my logs with me and outlined what hardware I should order and how they intended to thwart the attack.

They put me on the Cisco Guard, monitored the attack and noticed a pattern: the attacking requests all used a rare user agent and came from Asia. SoftLayer developed a solution for me that would block all traffic from Asia with that user agent, and the attack came to an immediate halt.

Moving to SoftLayer ended up being a great decision. I’ve been with them ever since and have ordered 8 servers there. Their support is scary-good: every time I’ve called them, day or night, they have answered within 1 ring. Every ticket I have posted has been acknowledged within 10 minutes and acted on shortly after.

The most interesting service they offer is $3 administrative tickets. They do standard support tickets, hardware failures, server outages, etc. for free, but they’ll also do practically ANY administrative work on your server for just $3 a pop. Need Apache/PHP installed? Done. Memcached? Done. JPG Delegates for ImageMagick? Done and done. This makes running a dedicated server easy as pie, whether you’re an amateur or an experienced professional who just doesn’t have time.

Their management portal is very slick, giving you so much control over your server that it’s hardly ever necessary to make a ticket asking for support. From rebooting your server to reloading the whole OS, it can all be done automatically with the push of a button.

I could gush all day about SoftLayer and how they’re on a whole different level than the competition - but I’d feel silly if I said any more here. I highly recommend them for a website (or group of websites) of any shape & size.

http://www.softlayer.com

Rising from the ashes

Filed under: Miscellaneous at 1:03 am — Comments (0)

I can’t believe it’s been well over a year since I last updated Meta Titan. I’ve been extremely busy with personal & work projects and just haven’t had time to write new and interesting things for the site.

That’s all about to change, as I’ve dusted off the ol’ WordPress, cleaned out the comment spam and I’m currently preparing some articles & tools for the site. Stay tuned!

January 3, 2008

Classes vs IDs, CSS Practice

Filed under: CSS at 4:22 pm — Comments (0)

A good use of classes and IDs can save you a lot of time. You’ll end up with a site that’s easy to maintain, and frankly, your code will look a lot cleaner. There are certain rules and practices when using classes and IDs; the following guide looks over them.

First off, what defines an ID and a class? Simply put, an ID is a unique identifier and should only be used once in your document. It’s good practice to use IDs on structural blocks of your site such as a wrapper, header, footer, navigation bar, etc. A class can be used more broadly to define objects that can appear multiple times in your document, such as link styling, tables, etc.

Example usage of an ID:

In your html code:
<div id="mainWrapper">content</div>

In your stylesheet:
div#mainWrapper { margin: 10px 30px; }

Example usage of a Class:

In your html code:
<span class="test">Hello, World</span>

In your stylesheet:
span.test { color: #003366; font-weight: 900; }

When naming your classes and IDs, try to use generic, easy-to-identify names. For example, instead of calling something “yellowBar” try “topSidebar”. Who knows what color that bar will be 6 months from now! Also, pick a naming style that you’re comfortable with and stick to it, either lowercase (#helloworld) or camel case (#helloWorld). You should never use spaces in names.

December 13, 2007

How to Calculate PHP Load Times

Filed under: PHP at 8:42 am — Comments (6)

Here’s a popular request amongst those who are learning PHP. When developing PHP applications, it’s good practice to benchmark your pages to see if you need to further optimize your code. The following snippet will show you how much time it took your server to process your PHP document.

Insert this at or near the top of your PHP file.

$m_time = explode(" ",microtime());
$m_time = $m_time[0] + $m_time[1];
$loadstart = $m_time;

Now place this snippet at or near the bottom of your file for the best results.

$m_time = explode(" ",microtime());
$m_time = $m_time[0] + $m_time[1];
$loadend = $m_time;
$loadtotal = ($loadend - $loadstart);
echo "<small><em>Generated page in ". round($loadtotal,3) ." seconds</em></small>";

That’s it! I suggest adding this while you develop any PHP application, and include it even after the launch, so that you can see how well your scripts scale with the traffic you receive.
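On PHP 5 and later, microtime(true) returns the timestamp as a float, so the explode step can be dropped. Here is a minimal sketch of the same measurement in that style; the usleep() call is only a stand-in for your page’s real work, not part of the technique.

```php
<?php
// At or near the top of the file: start time as a float
// (seconds since the Unix epoch, with microsecond precision).
$loadstart = microtime(true);

// ... the rest of your page would run here; usleep() is a placeholder
// for that work in this sketch (50,000 microseconds = 0.05 seconds).
usleep(50000);

// At or near the bottom of the file: elapsed time in seconds.
$loadtotal = microtime(true) - $loadstart;
echo "<small><em>Generated page in " . round($loadtotal, 3) . " seconds</em></small>";
```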

December 10, 2007

Backing up a MySQL Database Using Cron

Filed under: MySQL at 11:44 am — Comments (2)

It sure has been a while since I’ve updated; things have been pretty busy and I just haven’t had time. Anyway, the most important and often overlooked part of running a dynamic MySQL website is backing up your data often. Losing your file system often doesn’t hurt as much as losing all of your content, especially when running a script that’s easily replaceable like vBulletin, WordPress, etc. Backing up your data can be a chore, so this is the simple method I use for an automated backup of my databases.

1) SSH into your box

2) Open up your crontab, to do this type:
crontab -e

3) Add the job to your crontab, this is what I use:
30 0 * * * date=`date -I` ; mysqldump -a -uuser -ppassword dbname > /path/to/dump_$date.sql

I’ll break down what’s going on in the above line and what you need to edit:

  • 30 0 * * * - This specifies the schedule on which it will back up your data. The fields are minutes, hours, days of the month, months, and days of the week, respectively. In my case, I’m going to be running this every day at 12:30 AM. Put an asterisk in any field that you do not need to limit.
  • user - Enter your mysql username here
  • password - Enter your mysql password here
  • It’s important to note that -uuser is not a typo, you need to prefix -u on your username, so if it’s jsmith, you will enter -ujsmith. Same goes for your password.
  • dbname - Enter the name of the mysql database which you want to back up
  • /path/to/dump_$date.sql - Enter the directory you wish to back up your data to, include $date if you want a datestamp on your backup names. Don’t back this up to a web accessible directory as anyone would be able to access your database information and view potentially sensitive data.

Once your cron job is up and running you can then use a 3rd party backup service to automatically pull those backups across onto secure networks at set intervals (e.g. every day at 12:40 AM). Talk to your hosting provider, as many already provide backup services like this. You can also choose to manually download them onto your hard drive if you prefer a more cost-effective approach. Just remember to go in weekly or monthly to delete older backups if necessary. Those with large databases may eventually max out their hard drive space if the backups are left unattended.
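The cleanup of older backups can be scripted too. The following is a rough sketch in PHP rather than a finished tool: the pruneOldBackups() function is a hypothetical helper, the 30 day limit is an arbitrary choice, and it assumes your dumps follow the dump_*.sql naming used in the cron line above.

```php
<?php
// Hypothetical helper: delete dump_*.sql files in $dir that were last
// modified more than $maxAgeDays days ago. Returns the removed file paths.
function pruneOldBackups($dir, $maxAgeDays)
{
    $cutoff  = time() - ($maxAgeDays * 86400); // 86400 seconds in a day
    $removed = array();

    // glob() can return false on error, so fall back to an empty list.
    $files = glob(rtrim($dir, '/') . '/dump_*.sql');
    if ($files === false) {
        $files = array();
    }

    foreach ($files as $file) {
        // filemtime() gives the last modification time as a Unix timestamp.
        if (is_file($file) && filemtime($file) < $cutoff) {
            if (unlink($file)) {
                $removed[] = $file;
            }
        }
    }

    return $removed;
}

// Example call (assumed path and age; adjust to your own setup):
// pruneOldBackups('/path/to', 30);
```

Saved as its own script with the example call uncommented, this could be run from a second cron entry, for instance once a week.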

October 30, 2007

New Look

Filed under: Miscellaneous at 10:13 am — Comments (0)

It felt unfitting for a blog that teaches and discusses web development tips & tricks to use a generic widely used WordPress theme. I’ve launched a new custom look for Meta Titan and I’m just working out the kinks and making adjustments.

I have a couple entries planned for this week so keep an eye out for them!

October 28, 2007

Show/Hide Content With CSS & Javascript

Filed under: CSS at 1:51 am — Comments (7)

My apologies for the recent lack of updates and the briefness of this post; things have been (and still are) really busy on my end. Anyway, when building websites for my clients a popular request is to have content that can be toggled by the user. Today I’ll show you how to achieve this effect really quickly. Although this method does not support persistence (saving cookies to the user’s browser to remember what they have hidden/shown), I’m sure there are some who will find it useful.

Place this code in your <head> tags.

<script type="text/javascript">
function shToggle(content) {
  if (document.getElementById(content).style.display == "none")
    document.getElementById(content).style.display = "block"
  else
    document.getElementById(content).style.display = "none"
}
</script>

Now you can show/hide content by placing id="elementname" style="display:none;" inside the tag of the element you wish to be toggle-able, and onclick="shToggle('elementname'); return false;" inside the link code of the image or text the user clicks to toggle it. You can see a live example of it on this page, or simply look at the example code snippet below.

<strong>What’s the name of Calgary’s NHL Team?</strong>
<a href="javascript:void(0);" onclick="shToggle('calgary'); return false;">show/hide answer</a>

<div id="calgary" style="display:none;">The Calgary Flames</div>
