Basic WordPress crawler in Python

I’m developing a new personal web where I will probably include some links to this, my blog, I thought it would be a good idea to have an automatic way to extract them from this site. As I was a bit bored last Monday and wanted to remember my Python skills, I thought “Let’s do a really simple WP crawler to make that work for me!”.

The next Python crawler it’s a really simple script. I know it could have been coded showing a beautiful menu with options and some methods having default params and all that, but just wanted to code it quickly. Every tricky part it’s commented but I will explain it in a few lines.

  1. WordPress blogs show only the last entries and at the bottom, there’s a link which allows you to go to older entries (linking to something like …/page/number)
  2. The crawler will start seeking for every link in the webpage indicated as source
  3. For every link found, if it’s an ‘h2’ type, will consider it as a blog entry so will store it in the list wp_entries
  4. If it’s an internal link will crawl it as long as it contains the word ‘page’
  5. There’s a depth limit so it doesn’t crawl through every page if desire
  6. When it has finished crawling, it will show every entry found

Here is the code:

import urllib2
import urlparse
import re

class Crawler:
    _source = ''
    _depth = 0
    _links = []
    _wp_entries = []
    _debug = False

    def __init__(self, source, depth):
        self._source = source
        self._depth = depth
        self._links = [] 
        self._wp_entries = []
    
    def get_childs(self, url, level):
        if (level <= self._depth):
            if self._debug : print 'Crawling ', url
            try:
                page = urllib2.urlopen(url)

                # Discard non html files
                page_info = page.info()['content-type']
                if not(re.search('text\/html', page_info)):
                    if self._debug: print 'Found a non-html file ', url, ' :', page.info()['content-type']

                else:
                    for line in page:
                        # Find every link in source code
                        if ((re.search('a href=', line) != None)):
                            # If several links in a line -->  Iterate through all
                            links_in_line = re.findall('a href="(.*?)"', line)
                            for link in links_in_line:
                                # For each link found
                                #   0 - Check the link its internal (regex source or a page in same dir)
                                #   1 - Create the new page (only if it's without source in link)
                                #   2 - Check it doesnt been crawled (check self._links)
                                #   3 - Add to the list (self._links)
                                #   4 - Crawl the new one according to the rules (match 'page' in this case)
                                new_link = ''
                                crawl_link = False
                                if (re.match(self._source, link)):
                                    # Subpage sharing source (source/sthing) -> use that link
                                    if self._debug: print 'Found internal link ', link
                                    new_link = link
                                    crawl_link = True

                                elif (re.match('http:\/\/', link)):
                                    # External link -> do nothing
                                    if self._debug: print 'Found external link ', link

                                elif (re.match('\w|\_|\.', link)):
                                    # Internal link -> construct full url
                                    if self._debug: print 'Found internal link', link
                                    new_link = urlparse.urljoin(url, link)
                                    crawl_link = True

                                else:
                                    # Weird link
                                    if self._debug: print  'Found weird link ', link

                                if ((self._links.count(new_link) == 0) & crawl_link):
                                    # Found a new link, store & crawl it
                                    self._links.append(new_link)

                                    # WP specific actions
                                    # Store only h2 links (meaning entries)
                                    # Just crawl whenever link contains 'page' 
                                    if(re.search('h2', line)): 
                                        if self._debug: print 'WP entry ', new_link
                                        self._wp_entries.append(new_link)

                                    level += 1
                                    if(re.search('page', new_link)): self.get_childs(new_link, level)
                                    level -= 1

            except urllib2.HTTPError as e:
                if self._debug: print 'Found error while crawling ', url, e

            except urllib2.URLError as e:
                if self._debug: print 'Found error while crawling ', url, e
        else:
            if self._debug: print 'Maximum depth level reached, skipping'
                
    def show_childs(self):
        print len(self._links),' links were found, listing:'
        for link in self._links:
            print link

    def show_wp_entries(self):
        print len(self._wp_entries),' WP entries were found, listing:'
        for head in self._wp_entries:
            print head


# Create a new crawler with page and maximu depth level
crawler = Crawler('https://hoyhabloyo.wordpress.com/', 5)

# Get childs for that page (could have used defaults params in method)
crawler.get_childs('https://hoyhabloyo.wordpress.com/', 0)

# Show me what you found
crawler.show_wp_entries()

And now, let’s check the output where I’m asking the crawler to list this blog 6 pages of entries:

kets@ExoduS:~/programacion/sources/python$ python wordpres_crawl.py 
30  WP entries were found, listing:
https://hoyhabloyo.wordpress.com/2012/01/24/mitos-y-verdades-sobre-las-becas-icex-en-informatica/
https://hoyhabloyo.wordpress.com/2012/01/18/xfce-display-switching-dual-single-monitor/
https://hoyhabloyo.wordpress.com/2012/01/08/ano-nuevo-vida-nueva/
https://hoyhabloyo.wordpress.com/2011/12/31/adios-2011/
https://hoyhabloyo.wordpress.com/2011/12/23/100-000-visitas/
https://hoyhabloyo.wordpress.com/2011/12/19/de-dibujos-animados/
https://hoyhabloyo.wordpress.com/2011/12/01/hablemos-de-los-rumanos-vorbim-despre-romanii/
https://hoyhabloyo.wordpress.com/2011/12/01/reencuentro-de-becarios-ic3x-en-navaluenga/
https://hoyhabloyo.wordpress.com/2011/11/29/sigo-vivo/
https://hoyhabloyo.wordpress.com/2011/10/07/va-de-despedidas/
https://hoyhabloyo.wordpress.com/2011/09/27/los-rincones-de-bucarest-piata-matache/
https://hoyhabloyo.wordpress.com/2011/09/22/los-rincones-de-bucarest-parcul-carol-i/
https://hoyhabloyo.wordpress.com/2011/09/13/los-rincones-de-bucarest-piata-universitatii/
https://hoyhabloyo.wordpress.com/2011/09/12/receta-hummus/
https://hoyhabloyo.wordpress.com/2011/09/08/viaje-por-los-balcanes/
https://hoyhabloyo.wordpress.com/2011/09/01/receta-gazpacho/
https://hoyhabloyo.wordpress.com/2011/08/31/los-rincones-de-bucarest-parcul-herestrau/
https://hoyhabloyo.wordpress.com/2011/08/16/viaje-expres-a-espana/
https://hoyhabloyo.wordpress.com/2011/08/02/ruta-por-la-rumania-profunda-valaquia/
https://hoyhabloyo.wordpress.com/2011/07/17/de-paseo-por-los-carpatos/
https://hoyhabloyo.wordpress.com/2011/07/10/guia-para-vivir-en-bucarest/
https://hoyhabloyo.wordpress.com/2011/06/30/viaje-por-asia/
https://hoyhabloyo.wordpress.com/2011/06/28/receta-pescado-blanco-al-microondas/
https://hoyhabloyo.wordpress.com/2011/05/20/la-revolucion-espanola-spanishrevolution/
https://hoyhabloyo.wordpress.com/2011/05/17/viaje-a-belgrado/
https://hoyhabloyo.wordpress.com/2011/04/29/semana-santa-por-la-rumania-profunda/
https://hoyhabloyo.wordpress.com/2011/04/14/bruselas-y-amsterdam/
https://hoyhabloyo.wordpress.com/2011/04/04/roma/
https://hoyhabloyo.wordpress.com/2011/03/17/las-1000-grullas/
https://hoyhabloyo.wordpress.com/2011/03/02/receta-torretas-de-berenejena-queso-y-tomate-vegetarianas/

It works!! As you can see, with just a hundred lines of Python where typed to achieve that, and much more could be done just modifying some parameters. Hope it helps!

Anuncios

Responder

Introduce tus datos o haz clic en un icono para iniciar sesión:

Logo de WordPress.com

Estás comentando usando tu cuenta de WordPress.com. Cerrar sesión / Cambiar )

Imagen de Twitter

Estás comentando usando tu cuenta de Twitter. Cerrar sesión / Cambiar )

Foto de Facebook

Estás comentando usando tu cuenta de Facebook. Cerrar sesión / Cambiar )

Google+ photo

Estás comentando usando tu cuenta de Google+. Cerrar sesión / Cambiar )

Conectando a %s