April 12, 2009

How to Build a Scraper Using PHP & cURL

Filed under: PHP at 3:12 am —

Use the following knowledge at your own risk; scraping content is a pretty gray area. For this guide you will need Firefox and the Firebug extension. You’ll also need PHP installed with the cURL module.

For this example we’ll be grabbing the counters for “bloggers, new posts and words today” from WordPress.com. Let’s start off by creating a function that fetches the HTML contents of a webpage using cURL.

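function getDocument($url)
{
        $ch = curl_init();
        // Identify ourselves with a Googlebot user agent string
        curl_setopt($ch, CURLOPT_USERAGENT, 'Googlebot/2.1 (http://www.googlebot.com/bot.html)');
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_FAILONERROR, true);    // fail on HTTP status >= 400
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
        curl_setopt($ch, CURLOPT_AUTOREFERER, true);    // set the Referer header when redirected
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
        curl_setopt($ch, CURLOPT_TIMEOUT, 10);          // give up after 10 seconds
        $html = curl_exec($ch);                         // false on failure

        return $html;
}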
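getDocument() returns false when the request fails, without saying why. Here is a minimal sketch of a variant that logs the cURL error and returns null instead. The name fetchDocument() and its $timeout parameter are made up for this illustration, and the user agent option is left out for brevity.

```php
<?php
// Hypothetical variant of getDocument(): similar options, plus error reporting.
function fetchDocument(string $url, int $timeout = 10): ?string
{
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_URL            => $url,
        CURLOPT_FAILONERROR    => true,     // treat HTTP status >= 400 as a failure
        CURLOPT_FOLLOWLOCATION => true,     // follow redirects
        CURLOPT_RETURNTRANSFER => true,     // return the body instead of printing it
        CURLOPT_CONNECTTIMEOUT => $timeout, // seconds allowed for connecting
        CURLOPT_TIMEOUT        => $timeout, // seconds allowed for the whole request
    ]);

    $html = curl_exec($ch);
    if ($html === false) {
        // curl_errno()/curl_error() describe the last failure on this handle
        error_log('cURL error ' . curl_errno($ch) . ': ' . curl_error($ch));
        return null;
    }

    return $html;
}
```

A caller can then bail out early with a check such as `if ($html === null) { ... }` before handing anything to the DOM parser.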

We’ll be disguising ourselves as Googlebot. You can find a list of user agents to use at http://www.user-agents.org/. Next we’ll create a DOM object out of the raw HTML.

// getDocument() is the plain function defined above, so it is called without $this->
$html = getDocument('http://wordpress.com');
$dom = new DOMDocument();
@$dom->loadHTML($html);

Note the @: it suppresses the warnings that would otherwise (and most likely will) be output because the website does not follow HTML coding standards. Next, we will use Firebug to find the XPath of what we want to scrape. Right click on the content you want and click “Inspect Element”. Firebug will pop up with the code highlighted. At the time of this blog post this is what pops up:

<h6>
<span>204,045</span>
bloggers,
<span>133,640</span>
new posts,
<span>34,374,943</span>
words today.
</h6>

The h6 tag is the container of the content we want to scrape, so its XPath is what we need. Right click on the h6 tag inside Firebug and click “Copy XPath”. This saves something like “/html/body/div/div[2]/div[2]/h6” to our clipboard. So we’ll add the following to our code:

$xpath = new DOMXPath($dom);
$content = $xpath->query("/html/body/div/div[2]/div[2]/h6");

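Note: If your XPath contains any ‘tbody’ steps, remove them from the path unless the page source really contains them. Firefox adds tbody elements to tables when it builds its own DOM, so they usually do not exist in the document that PHP parses.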
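An absolute path copied from Firebug stops matching as soon as the page layout shifts. Here is a minimal sketch of a looser query, assuming the counters are the only span tags sitting directly inside an h6. The inline markup is a stand-in for the fetched page, and libxml_use_internal_errors() is used here in place of @ to keep parse warnings out of the output.

```php
<?php
// Stand-in markup mirroring the snippet shown in Firebug (not fetched live).
$html = '<html><body><div><h6><span>204,045</span> bloggers, '
      . '<span>133,640</span> new posts, '
      . '<span>34,374,943</span> words today.</h6></div></body></html>';

libxml_use_internal_errors(true);   // collect parse warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();              // discard whatever was collected

$xpath = new DOMXPath($dom);
// '//h6/span' matches every span that is a direct child of any h6,
// wherever that h6 sits in the document.
$spans = $xpath->query('//h6/span');

$counters = [];
foreach ($spans as $span) {
    $counters[] = $span->nodeValue;
}

print_r($counters);
```

The trade-off is that a looser path can match more than you intended if the page has other h6 blocks, so check the number of matches before relying on it.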

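The h6 node matched by the query is $content->item(0), and the span tags with our desired content are accessible through $content->item(0)->childNodes. Text nodes count as children too, which is why the spans sit at indexes 0, 2 and 4 rather than 0, 1 and 2. You can choose to iterate through this list, or for simple uses like our example you can call each child item directly like so: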

echo 'Bloggers: ' . $content->item(0)->childNodes->item(0)->nodeValue . '<br />';
echo 'New Posts: ' . $content->item(0)->childNodes->item(2)->nodeValue . '<br />';
echo 'Words Today: ' . $content->item(0)->childNodes->item(4)->nodeValue;
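Hard-coded indexes break if the page gains or loses a text node between the spans. Here is a rough sketch of the iterating approach, keeping only the span elements. The inline markup stands in for the h6 block above, and the $labels array is an assumption taken from the echo lines, not something read from the page.

```php
<?php
// Stand-in for the h6 block; in the article this node is $content->item(0).
$dom = new DOMDocument();
$dom->loadHTML('<h6><span>204,045</span> bloggers, <span>133,640</span> new posts, '
             . '<span>34,374,943</span> words today.</h6>');
$h6 = $dom->getElementsByTagName('h6')->item(0);

$labels = ['Bloggers', 'New Posts', 'Words Today'];   // assumed display labels
$values = [];
foreach ($h6->childNodes as $node) {
    // childNodes also holds the text between the spans; keep span elements only
    if ($node->nodeType === XML_ELEMENT_NODE && $node->nodeName === 'span') {
        $values[] = trim($node->nodeValue);
    }
}

// array_combine() needs one value per label, so this assumes three spans were found
$stats = array_combine($labels, $values);
foreach ($stats as $label => $value) {
    echo $label . ': ' . $value . "\n";
}
```

Filtering on the node type means the loop gives the same result whether or not the source has whitespace between the tags.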

To get the most out of this technique you should brush up on your knowledge of the Document Object Model.
