Basic WordPress crawler in Python

I’m developing a new personal website where I will probably include some links to the entries on this blog, so I thought it would be a good idea to have an automatic way to extract them from this site. As I was a bit bored last Monday and wanted to refresh my Python skills, I thought “Let’s write a really simple WP crawler to do that work for me!”.

The Python crawler below is a really simple script. I know it could have been coded with a nice menu of options, methods with default parameters and all that, but I just wanted to code it quickly. Every tricky part is commented, but I will explain it in a few lines.

  1. WordPress blogs show only the latest entries, and at the bottom there’s a link to older entries (pointing to something like …/page/number).
  2. The crawler starts by looking for every link in the web page given as the source.
  3. For every link found, if it appears on a line containing an ‘h2’ tag, the crawler treats it as a blog entry and stores it in the list wp_entries.
  4. If it’s an internal link, the crawler follows it as long as the URL contains the word ‘page’.
  5. There’s a depth limit, so it doesn’t have to crawl through every page if you don’t want it to.
  6. When it has finished crawling, it shows every entry found.
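
Before the full script, steps 3 and 4 can be illustrated on a single line of HTML. This is only a minimal sketch, written for Python 3 with nothing but the re module; the helper name classify_line and the sample markup are made up for the example and are not part of the crawler.

```python
import re

def classify_line(line):
    """Return (entries, pages) for the links found in one line of HTML.

    Hypothetical helper: a link counts as a blog entry when its line
    contains 'h2', and as a page to crawl when its URL contains 'page'.
    """
    links = re.findall(r'a href="(.*?)"', line)
    entries = [link for link in links if 'h2' in line]
    pages = [link for link in links if 'page' in link]
    return entries, pages

# Sample markup, invented for this example
title = '<h2><a href="http://example.com/2012/01/my-post/">My post</a></h2>'
older = '<a href="http://example.com/page/2/">Older entries</a>'

print(classify_line(title))
print(classify_line(older))
```

The first line yields one entry and nothing to crawl; the second yields no entry and one page to follow.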

Here is the code:

import urllib2
import urlparse
import re

class Crawler:
    _source = ''
    _depth = 0
    _links = []
    _wp_entries = []
    _debug = False

    def __init__(self, source, depth):
        self._source = source
        self._depth = depth
        self._links = [] 
        self._wp_entries = []

    def get_childs(self, url, level):
        if (level <= self._depth):
            if self._debug: print 'Crawling ', url
            try:
                page = urllib2.urlopen(url)

                # Discard non html files
                page_info = page.info()['content-type']
                if not(re.search('text/html', page_info)):
                    if self._debug: print 'Found a non-html file ', url, ' :', page_info
                else:
                    for line in page:
                        # Find every link in source code
                        if (re.search('a href=', line) != None):
                            # If several links in a line --> iterate through all
                            links_in_line = re.findall('a href="(.*?)"', line)
                            for link in links_in_line:
                                # For each link found
                                #   0 - Check the link is internal (starts with source or is a page in same dir)
                                #   1 - Create the new url (only if the link doesn't include source)
                                #   2 - Check it hasn't been crawled (check self._links)
                                #   3 - Add to the list (self._links)
                                #   4 - Crawl the new one according to the rules (match 'page' in this case)
                                new_link = ''
                                crawl_link = False
                                if (re.match(re.escape(self._source), link)):
                                    # Subpage sharing source (source/sthing) -> use that link
                                    if self._debug: print 'Found internal link ', link
                                    new_link = link
                                    crawl_link = True

                                elif (re.match('https?://', link)):
                                    # External link -> do nothing
                                    if self._debug: print 'Found external link ', link

                                elif (re.match(r'[\w.]', link)):
                                    # Internal link -> construct full url
                                    if self._debug: print 'Found internal link', link
                                    new_link = urlparse.urljoin(url, link)
                                    crawl_link = True

                                else:
                                    # Weird link
                                    if self._debug: print 'Found weird link ', link

                                if (crawl_link and self._links.count(new_link) == 0):
                                    # Found a new link, store & crawl it
                                    self._links.append(new_link)

                                    # WP specific actions
                                    # Store only h2 links (meaning entries)
                                    # Just crawl whenever link contains 'page'
                                    if (re.search('h2', line)):
                                        if self._debug: print 'WP entry ', new_link
                                        self._wp_entries.append(new_link)

                                    # Go one level deeper for the linked page
                                    if (re.search('page', new_link)):
                                        self.get_childs(new_link, level + 1)

            except urllib2.HTTPError as e:
                if self._debug: print 'Found error while crawling ', url, e

            except urllib2.URLError as e:
                if self._debug: print 'Found error while crawling ', url, e
        else:
            if self._debug: print 'Maximum depth level reached, skipping'

    def show_childs(self):
        print len(self._links),' links were found, listing:'
        for link in self._links:
            print link

    def show_wp_entries(self):
        print len(self._wp_entries),' WP entries were found, listing:'
        for head in self._wp_entries:
            print head

# Create a new crawler with page and maximum depth level
crawler = Crawler('', 5)

# Get childs for that page (could have used defaults params in method)
crawler.get_childs('', 0)

# Show me what you found
crawler.show_wp_entries()
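
The script above targets Python 2 (urllib2, urlparse and print statements). For readers on Python 3, the branch that decides what to do with each link could be sketched with urllib.parse as below. This is an assumption-laden sketch rather than a port of the script: resolve_link is a hypothetical helper and example.com is a placeholder address.

```python
from urllib.parse import urljoin

def resolve_link(source, current_url, link):
    """Return the absolute URL to crawl, or None for external or odd links."""
    if link.startswith(source):
        # Subpage sharing the source prefix: use the link as it is
        return link
    if link.startswith(('http://', 'https://')):
        # External link: nothing to crawl
        return None
    if link[:1].isalnum() or link[:1] in ('_', '.'):
        # Relative link: build the full URL from the page being crawled
        return urljoin(current_url, link)
    # Anything else (anchors, empty links, ...) is ignored
    return None

# Placeholder blog address, invented for this example
source = 'http://example.com/blog/'
print(resolve_link(source, source, 'http://example.com/blog/page/2/'))
print(resolve_link(source, source, 'http://other.org/'))
print(resolve_link(source, 'http://example.com/blog/page/2/', 'about.html'))
```

The three calls cover the internal, external and relative cases that the crawler distinguishes.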

And now, let’s check the output when I ask the crawler to list six pages of this blog’s entries:

kets@ExoduS:~/programacion/sources/python$ python 
30  WP entries were found, listing:

It works!! As you can see, only about a hundred lines of Python were typed to achieve that, and much more could be done just by modifying some parameters. Hope it helps!
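
The depth limit is one of those parameters. As a rough illustration of why a maximum depth of 5 covers six pages, here is a minimal Python 3 sketch that walks a made-up in-memory site instead of the network; the crawl function and the site dictionary are hypothetical and only mimic the level check used in get_childs.

```python
def crawl(site, url, level, max_depth, visited):
    """Depth-limited walk over a dict that maps each URL to its links."""
    if level > max_depth or url in visited:
        return
    visited.append(url)
    for link in site.get(url, []):
        crawl(site, link, level + 1, max_depth, visited)

# Hypothetical blog with ten archive pages, each linking to the next one
site = {'/page/%d' % n: ['/page/%d' % (n + 1)] for n in range(1, 10)}

visited = []
crawl(site, '/page/1', 0, 5, visited)
print(len(visited))  # 6: one page for each level from 0 to 5
```

Raising or lowering the fourth argument changes how far back through the older entries the walk goes.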


