
Writing a webcrawler in PHP

Jul 3, 2011
One way to measure a language's applicability for a given task is to determine how much code is required to perform it, compared to other languages. If you need a lot of code (or many external libraries), chances are there are better alternatives. There are of course also other concerns to take into account, like speed, security, or maintainability to name but a few. Still, it is always good to make sure you are not using a proverbial screwdriver to hammer in a nail.

For many web-related tasks, a web-oriented programming language like PHP is a good choice. To demonstrate this, I decided to write a fairly functional webcrawler in just 30 lines of code, using nothing more than PHP's internal functions. Of course, extensive comments increase the total a bit, resulting in the Crawler class you see defined here:

Code (php):
/**
 *  PHP based webcrawler.
 *
 *  This is an example class which demonstrates how you
 *  can build a webcrawler in 30 lines (excluding lines
 *  for comments) of PHP code. Pages crawled are listed
 *  but not actually analysed, the parseDocument method
 *  should be used for that.
 *
 *  Note that the list of to-be-crawled pages is stored
 *  in memory. This makes the crawler a lot faster than
 *  when a SQL queue is used, but it also means it will
 *  start over once the script is halted.
 *
 *  @author Matthijs Dorst
**/

class Crawler {

  private $targets = array(''); // Add dummy entry so the first next() call won't fail.


  /**
   *  Crawler constructor. It requires one URL to start
   *  crawling and will keep on going as long as unique
   *  pages are left uncrawled in its index.
   *
   *  @param  start   String    URL to start crawling.
  **/
  public function __construct ($start = 'http://www.wikipedia.org') {
    array_push($this -> targets, $start);
    while (($target = next($this -> targets)) !== false)
      $this -> parseTarget($target);
  }


  /**
   *  Parse an URL. Load the target URL's source into a
   *  DOMDocument (return on failure), parse the page's
   *  content and handle all links for inclusion in the
   *  crawler-queue.
   *
   *  @param  target  String    The URL to crawl.
  **/
  private function parseTarget ($target) {
    if (!$document = @DOMDocument::loadHTMLFile($target))
      return;

    $this -> parseDocument($document, $target);
    foreach ($document -> getElementsByTagName('a') as $link)
      @$this -> handleLink($link -> attributes -> getNamedItem('href') -> nodeValue, $target);
  }


  /**
   *  Handle a link found on a webpage. This also takes
   *  into account the base URL of the origin page, for
   *  handling relative URL's correctly.
   *
   *  If the URL is new, it is appended to the queue at
   *  the last position.
   *
   *  @param  url     String    URL to handle.
   *  @param  base    String    Base-url from the page.
  **/
  private function handleLink ($url, $base) {
    $host = parse_url($url, PHP_URL_HOST);
    if (empty($host))
      $url = $base . $url;
    $url = substr($url, 0, 50);
    if (!in_array($url, $this -> targets))
      array_push($this -> targets, $url);
  }


  /**
   *  Parse the document just crawled. Typical handling
   *  will see storage of the text or extraction of any
   *  keywords / titles / etc in this location.
   *
   *  @param  document DOMDocument  The document as DOM
   *                                object.
   *  @param  source   String       URL of the source.
  **/
  private function parseDocument ($document, $source) {
    echo sizeof($this -> targets) . ' left, at ' . $source . PHP_EOL;
  }
}


It can download an HTML page, parse its contents, analyse any links found on the page, add them to a list of to-be-crawled pages and continue with the next link in the list. It is very basic, and I have taken a few shortcuts here (most notably the static call to loadHTMLFile will throw an E_NOTICE, as does the nodeValue call in case an 'a' tag does not also have an 'href' attribute), but depending on the application these matter very little; otherwise, they are easy enough to correct with a few more lines.
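
To show how little that takes: below is a minimal sketch of a stricter parseTarget, assuming the same class as above. It creates a DOMDocument instance instead of using the static call and only hands links to handleLink when the anchor actually carries an href attribute, so the static-call notice disappears and the suppression in front of handleLink is no longer needed (the one on loadHTMLFile still silences warnings about malformed HTML).

Code (php):
  /**
   *  Stricter variant of parseTarget. Uses an instance of
   *  DOMDocument rather than the static call, and skips
   *  anchors without an href attribute.
  **/
  private function parseTarget ($target) {
    $document = new DOMDocument();
    if (!@$document -> loadHTMLFile($target))
      return;

    $this -> parseDocument($document, $target);
    foreach ($document -> getElementsByTagName('a') as $link)
      if ($link -> hasAttribute('href'))
        $this -> handleLink($link -> getAttribute('href'), $target);
  }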

All this crawler needs is a start URL, and it will continue to run until it has exhausted all URLs found. Since that would cover a large part of the internet as we know it, the chance of it ever finishing is fairly slim. A more advanced version (one I wrote for Fai years ago, in fact) stores the URLs in a database, keeping track of how often they are referenced, when they were last crawled, how their contents changed over time, and so on. Alternatively you can limit the crawler to a single host (the handleLink method can be extended to do this with one or two lines of code; see the sketch below) to, for example, determine whether all the pages on your site can be reached through links. Another use would be to automatically generate a sitemap, although without additional information it will hardly be better than what search engines create themselves.
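
As a rough sketch of that single-host restriction, the variant of handleLink below simply drops any link whose host differs from an allowed one. Note that $allowedHost is a hypothetical extra property, not part of the class above; it could for instance be set in the constructor with parse_url($start, PHP_URL_HOST).

Code (php):
  // Sketch: only queue URLs that stay on one host.
  // $allowedHost is an assumed extra property, e.g. 'www.wikipedia.org'.
  private function handleLink ($url, $base) {
    $host = parse_url($url, PHP_URL_HOST);
    if (empty($host))
      $url = $base . $url;             // Relative link: stays on the same host.
    elseif ($host !== $this -> allowedHost)
      return;                          // Different host: skip this link.
    $url = substr($url, 0, 50);
    if (!in_array($url, $this -> targets))
      array_push($this -> targets, $url);
  }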

Then again, add some smart text parsers and you can use this to create an index of your content and let users search your website by keyword. Run the crawler once a day and it will always stay up-to-date. The possibilities are endless, really! :)
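
As a starting point for such an index, parseDocument could be replaced with something along these lines; a sketch only, reusing the class above, with the actual storage (database, search index or otherwise) left out.

Code (php):
  // Sketch: extract the page title and a crude word list
  // that could feed a keyword index. Storage is left out.
  private function parseDocument ($document, $source) {
    $titles = $document -> getElementsByTagName('title');
    $title  = $titles -> length ? trim($titles -> item(0) -> textContent) : $source;
    $words  = str_word_count(strtolower($document -> documentElement -> textContent), 1);
    echo $source . ': "' . $title . '" (' . count(array_unique($words)) . ' unique words)' . PHP_EOL;
  }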

FragFrog, out!
