April 12, 2009

How to Build a Scraper Using PHP & cURL

Filed under: PHP at 3:12 am — Comments (0)

Use the following knowledge at your own risk; scraping content is a pretty gray area. For this guide you will need Firefox and the Firebug extension. You’ll also need PHP installed with the cURL module.

For this example we’ll be grabbing the counters for “bloggers, new posts and words today” from WordPress.com. Let’s start off by creating a function to fetch the HTML contents of a webpage using cURL.

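function getDocument($url)
{
        $ch = curl_init();
        // Identify ourselves as Googlebot
        curl_setopt($ch, CURLOPT_USERAGENT, 'Googlebot/2.1 (http://www.googlebot.com/bot.html)');
        curl_setopt($ch, CURLOPT_URL, $url);
        // Treat HTTP status codes of 400 and above as failures
        curl_setopt($ch, CURLOPT_FAILONERROR, true);
        // Follow redirects and set the Referer header while doing so
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_AUTOREFERER, true);
        // Return the page as a string instead of printing it
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        // Give up after 10 seconds
        curl_setopt($ch, CURLOPT_TIMEOUT, 10);
        // curl_exec() returns false if the request fails
        $html = curl_exec($ch);

        return $html;
}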
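
Since curl_exec() returns false when a request fails, getDocument() can hand back false instead of HTML. Here is a minimal sketch of a variant that reports what went wrong. The name fetchHtml() and the array it returns are my own assumptions for illustration; the cURL options are the same ones used above.

```php
<?php
// Sketch only: fetchHtml() is a hypothetical variant of getDocument()
// that reports failures instead of silently returning false.
function fetchHtml($url, $userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)')
{
    $ch = curl_init();
    curl_setopt_array($ch, array(
        CURLOPT_URL            => $url,
        CURLOPT_USERAGENT      => $userAgent,
        CURLOPT_FAILONERROR    => true,  // HTTP status >= 400 counts as a failure
        CURLOPT_FOLLOWLOCATION => true,  // follow redirects
        CURLOPT_AUTOREFERER    => true,
        CURLOPT_RETURNTRANSFER => true,  // return the body instead of printing it
        CURLOPT_TIMEOUT        => 10,    // seconds
    ));
    $html = curl_exec($ch);

    if ($html === false) {
        // curl_error() holds a description of the last error on this handle
        return array('html' => null, 'error' => curl_error($ch));
    }

    return array('html' => $html, 'error' => null);
}
```

Call it as $result = fetchHtml('http://wordpress.com'); and check $result['error'] before passing $result['html'] on to the parser.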

We’ll be disguising ourselves as Googlebot. You can find a list of user agents to use at http://www.user-agents.org/. Next we’ll create a DOM object out of the raw html code.

$html = getDocument('http://wordpress.com');
$dom = new DOMDocument();
@$dom->loadHTML($html);

Note the @: it suppresses the warnings that would most likely be output because the website does not follow HTML coding standards. Next, we will use Firebug to find the XPath of what we want to scrape. Right-click on what you want to scrape and click “Inspect Element”. Firebug will pop up with the code highlighted. At the time of this blog post this is what pops up:

<h6>
<span>204,045</span>
bloggers,
<span>133,640</span>
new posts,
<span>34,374,943</span>
words today.
</h6>
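
As an alternative to the @ operator, libxml can buffer the parser messages so you can inspect them if you want to. Here is a minimal sketch using markup like the snippet above as input. The helper name loadQuietly() is made up for this example, and the sample string assumes the real page source has no extra markup inside the h6.

```php
<?php
// Sketch only: loadQuietly() is a hypothetical helper. It parses HTML and
// collects libxml's messages instead of hiding them with @.
function loadQuietly($html, &$messages = array())
{
    $previous = libxml_use_internal_errors(true); // buffer parser messages
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    $messages = libxml_get_errors();              // array of LibXMLError objects
    libxml_clear_errors();
    libxml_use_internal_errors($previous);        // restore the earlier setting
    return $dom;
}

// Assumed sample input, modeled on the Firebug output
$sample = '<h6><span>204,045</span> bloggers, <span>133,640</span> new posts, '
        . '<span>34,374,943</span> words today.</h6>';

$messages = array();
$dom = loadQuietly($sample, $messages);

echo $dom->getElementsByTagName('h6')->length . ' h6 element(s), '
   . count($messages) . " parser message(s)\n";
```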

h6 is the container of the content we want to scrape, so the XPath of the h6 tag is what we want. Right-click on the h6 tag inside Firebug and click “Copy XPath”. This saves something like “/html/body/div/div[2]/div[2]/h6” to our clipboard. So we’ll add the following to our code:

$xpath = new DOMXPath($dom);
$content = $xpath->query("/html/body/div/div[2]/div[2]/h6");

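Note: If your XPath contains any ‘tbody’ steps, remove them from the path. Browsers add tbody elements to tables when they build the page, but they are usually not in the HTML source that PHP parses, so a path that includes them will match nothing.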
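
Removing the tbody steps can be done by hand or with a small helper. Here is a rough sketch: stripTbody() is a name I made up, and the path and table markup are invented purely for illustration.

```php
<?php
// Sketch only: stripTbody() is a hypothetical helper that drops the tbody
// steps from an XPath copied out of the browser.
function stripTbody($path)
{
    // Matches "/tbody" or "/tbody[1]" only when it is a complete path step
    return preg_replace('#/tbody(\[\d+\])?(?=/|$)#', '', $path);
}

$copied = '/html/body/table/tbody/tr/td[2]'; // invented example path
$usable = stripTbody($copied);

// Invented sample markup: the source has no tbody, as is common
$dom = new DOMDocument();
@$dom->loadHTML('<table><tr><td>a</td><td>b</td></tr></table>');
$xpath = new DOMXPath($dom);

echo $copied . ' matches ' . $xpath->query($copied)->length . " node(s)\n";
echo $usable . ' matches ' . $xpath->query($usable)->length . " node(s)\n";
```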

The span tags with our desired content are now accessible through $content->item(0)->childNodes. You can choose to iterate through this list, or for simple uses like our example you can call the child items directly like so:

echo 'Bloggers: ' . $content->item(0)->childNodes->item(0)->nodeValue . '<br />';
echo 'New Posts: ' . $content->item(0)->childNodes->item(2)->nodeValue . '<br />';
echo 'Words Today: ' . $content->item(0)->childNodes->item(4)->nodeValue;
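
The 0, 2, 4 indexes work because the text between the spans counts as child nodes too, which also means a stray whitespace node would shift them. One way around that is to iterate over just the span elements. This is a sketch under a few assumptions: the sample markup stands in for the live page, '//h6' stands in for the copied Firebug path, and the $labels array is my own addition.

```php
<?php
// Sketch only: select the <span> children directly so that text nodes
// (including whitespace) do not affect the indexes.
$html = '<div><h6> <span>204,045</span> bloggers, <span>133,640</span> new posts, '
      . '<span>34,374,943</span> words today. </h6></div>';

$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// '//h6' is a stand-in for the path copied from Firebug
$h6 = $xpath->query('//h6')->item(0);

$labels = array('Bloggers', 'New Posts', 'Words Today');
$counters = array();

if ($h6 !== null) {
    $i = 0;
    // Passing $h6 as the second argument makes the query relative to that node
    foreach ($xpath->query('./span', $h6) as $span) {
        if (isset($labels[$i])) {
            // Strip the thousands separators and cast to an integer
            $counters[$labels[$i]] = (int) str_replace(',', '', $span->nodeValue);
        }
        $i++;
    }
}

foreach ($counters as $label => $value) {
    echo $label . ': ' . number_format($value) . "\n";
}
```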

To get the most out of this technique you should brush up on your knowledge of the Document Object Model.

Best Dedicated Web Host? Meet SoftLayer

Filed under: Hosting at 2:02 am — Comments (0)

Webmasters who create sites that grow beyond the capabilities of a shared host are often faced with the daunting task of choosing a dedicated hosting provider. I’ve been all across town, hosting with ev1servers (now ThePlanet), iWeb Technologies, Mediatemple, and others I dare not name, as their levels of service do not deserve the attention. The biggest problem I’ve run into with these companies is support. I remember an instance with iWeb where my server went down at 8pm, and since their high-level technicians were off for the day I had to wait until 8am to get service. Not every situation with the aforementioned dedicated companies was that extreme, but they have their issues.

Back in December of ’07 one of my largest websites suffered a DDoS attack. My datacenter (iWeb) said the only solution would be to shut down my servers, null route the IPs and wait it out. After a few sleepless nights I came across SoftLayer and read about their Cisco Guard DDoS protection system. I went into a chat with sales and explained my situation; they went over my logs with me and outlined what hardware I should order and how they intended to thwart the attack.

They put me on the Cisco Guard, monitored the attack and noticed a pattern: the requests all used a rare user agent and came from Asia. SoftLayer developed a solution for me that would block all traffic from Asia with that user agent, and the attack came to an immediate halt.

Moving to SoftLayer ended up being a great decision. I’ve been with them ever since and have ordered 8 servers there. Their support is scary-good: every time I’ve called them, day or night, they have answered within 1 ring. Every ticket I have posted has been acknowledged within 10 minutes and acted on shortly after.

The most interesting service they offer is $3 administrative tickets. They do standard support tickets, hardware failures, server outages, etc. for free, but they’ll also do practically ANY administrative work on your server for just $3 a pop. Need Apache/PHP installed? Done. Memcached? Done. JPG Delegates for ImageMagick? Done and done. This makes running a dedicated server easy as pie, whether you’re an amateur or an experienced professional who just doesn’t have time.

Their management portal is very slick, giving you so much control over your server that it’s hardly ever necessary to make a ticket asking for support. From rebooting your server to reloading the whole OS, it can all be done automatically with the push of a button.

I could gush all day about SoftLayer and how they’re on a whole different level than the competition - but I’d feel silly if I said any more here. I highly recommend them for a website (or group of websites) of any shape & size.

http://www.softlayer.com

Rising from the ashes

Filed under: Miscellaneous at 1:03 am — Comments (0)

I can’t believe it’s been well over a year since I last updated Meta Titan. I’ve been extremely busy with personal & work projects and just haven’t had time to write new and interesting things for the site.

That’s all about to change, as I’ve dusted off the ol’ WordPress, cleaned out the comment spam and I’m currently preparing some articles & tools for the site. Stay tuned!