I’m developing a new personal website where I will probably include some links to this blog, so I thought it would be a good idea to have an automatic way to extract them from this site. As I was a bit bored last Monday and wanted to brush up on my Python skills, I thought: “Let’s write a really simple WP crawler to do that work for me!”.
The following Python crawler is a really simple script. I know it could have been written with a nice menu of options, methods with default params and all that, but I just wanted to code it quickly. Every tricky part is commented, but I’ll explain it in a few lines:
- WordPress blogs show only the latest entries; at the bottom there’s a link that takes you to older entries (pointing to something like …/page/number)
- The crawler starts by looking for every link in the page given as the source
- For every link found, if it appears inside an ‘h2’ tag, the crawler considers it a blog entry and stores it in the _wp_entries list (the snippet right after this list shows the kind of markup this matches)
- If it’s an internal link, the crawler follows it as long as it contains the word ‘page’
- There’s a depth limit so it doesn’t crawl through every single page unless you want it to
- When it has finished crawling, it shows every entry found
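To make the ‘h2’ and ‘page’ rules concrete, here is a tiny sketch of the kind of markup the crawler keys on. The sample html lines are made up (every WordPress theme is different), but the regexes are the same ones the crawler uses:
import re

# Hypothetical WP markup: a post title wrapped in an h2 tag, and the
# 'older entries' pagination link (.../page/2)
entry = '<h2 class="entry-title"><a href="https://hoyhabloyo.wordpress.com/2012/01/a-post/">A post</a></h2>'
pager = '<a href="https://hoyhabloyo.wordpress.com/page/2/">Older entries</a>'

# A line containing both 'a href' and 'h2' is treated as a blog entry
if re.search('a href=', entry) and re.search('h2', entry):
    print re.findall('a href="(.*?)"', entry)  # the entry url

# Only links containing 'page' are crawled further
for link in re.findall('a href="(.*?)"', pager):
    print re.search('page', link) is not None  # True -> would be crawled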
Here is the code:
import urllib2
import urlparse
import re

class Crawler:

    _source = ''
    _depth = 0
    _links = []
    _wp_entries = []
    _debug = False

    def __init__(self, source, depth):
        self._source = source
        self._depth = depth
        self._links = []
        self._wp_entries = []

    def get_childs(self, url, level):
        if level <= self._depth:
            if self._debug: print 'Crawling ', url
            try:
                page = urllib2.urlopen(url)
                # Discard non-html files
                page_info = page.info()['content-type']
                if not re.search('text/html', page_info):
                    if self._debug: print 'Found a non-html file ', url, ' :', page_info
                else:
                    for line in page:
                        # Find every link in the source code
                        if re.search('a href=', line) != None:
                            # If there are several links in a line --> iterate through all of them
                            links_in_line = re.findall('a href="(.*?)"', line)
                            for link in links_in_line:
                                # For each link found:
                                # 0 - Check whether the link is internal (matches the source or is a page in the same dir)
                                # 1 - Build the full url (only if the link doesn't already include the source)
                                # 2 - Check it hasn't been crawled yet (check self._links)
                                # 3 - Add it to the list (self._links)
                                # 4 - Crawl it according to the rules (it must contain 'page' in this case)
                                new_link = ''
                                crawl_link = False
                                if re.match(self._source, link):
                                    # Subpage sharing the source (source/something) -> use that link
                                    if self._debug: print 'Found internal link ', link
                                    new_link = link
                                    crawl_link = True
                                elif re.match('https?://', link):
                                    # External link -> do nothing
                                    if self._debug: print 'Found external link ', link
                                elif re.match('\w|_|\.', link):
                                    # Relative internal link -> construct the full url
                                    if self._debug: print 'Found internal link ', link
                                    new_link = urlparse.urljoin(url, link)
                                    crawl_link = True
                                else:
                                    # Weird link
                                    if self._debug: print 'Found weird link ', link
                                if crawl_link and self._links.count(new_link) == 0:
                                    # Found a new link, store & crawl it
                                    self._links.append(new_link)
                                    # WP specific actions:
                                    # store only h2 links (meaning entries),
                                    # crawl a link only when it contains 'page'
                                    if re.search('h2', line):
                                        if self._debug: print 'WP entry ', new_link
                                        # Keep the post title (the anchor text) instead of the raw url
                                        title = re.search('a href="' + re.escape(link) + '"[^>]*>(.*?)</a>', line)
                                        self._wp_entries.append(title.group(1) if title else new_link)
                                    level += 1
                                    if re.search('page', new_link): self.get_childs(new_link, level)
                                    level -= 1
            except urllib2.HTTPError as e:
                if self._debug: print 'Found error while crawling ', url, e
            except urllib2.URLError as e:
                if self._debug: print 'Found error while crawling ', url, e
        else:
            if self._debug: print 'Maximum depth level reached, skipping'

    def show_childs(self):
        print len(self._links), 'links were found, listing:'
        for link in self._links:
            print link

    def show_wp_entries(self):
        print len(self._wp_entries), 'WP entries were found, listing:'
        for head in self._wp_entries:
            print head

# Create a new crawler with the start page and the maximum depth level
crawler = Crawler('https://hoyhabloyo.wordpress.com/', 5)
# Get childs for that page (could have used default params in the method)
crawler.get_childs('https://hoyhabloyo.wordpress.com/', 0)
# Show me what you found
crawler.show_wp_entries()
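By the way, if you want to watch the crawler work, the class carries a _debug flag: flipping that attribute before crawling prints every link decision. Nothing fancy, just something like this:
# Same crawler as above, but printing every link found, skipped or crawled
crawler = Crawler('https://hoyhabloyo.wordpress.com/', 5)
crawler._debug = True
crawler.get_childs('https://hoyhabloyo.wordpress.com/', 0)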
And now, let’s check the output, where I ask the crawler to list 6 pages of entries from this blog:
kets@ExoduS:~/programacion/sources/python$ python wordpres_crawl.py
30 WP entries were found, listing:
Mitos y verdades sobre las becas ICEX en Informática
XFCE display switching (dual & single monitor)
Año nuevo… ¿vida nueva?
¡Adiós 2011!
¡100.000 visitas!
¡De dibujos animados!
Hablemos de los rumanos // Vorbim despre Românii
Reencuentro de becarios IC3X en Navaluenga
¡Sigo vivo!
Va de despedidas…
Los rincones de Bucarest: Piata Matache
Los rincones de Bucarest: Parcul Carol I
Los rincones de Bucarest: Piata Universitatii
Receta: Hummus
Viaje por los Balcanes
Receta: Gazpacho
Los rincones de Bucarest: Parcul Herestrau
Viaje exprés a España
Ruta por la Rumanía profunda (Valaquia)
De paseo por los Cárpatos
Guía para vivir en Bucarest
Viaje por Asia
Receta: Pescado blanco al microondas
La revolución española (#spanishrevolution)
Viaje a Belgrado
Semana Santa por la Rumanía profunda
Bruselas y Amsterdam
Roma
Las 1000 grullas
Receta: Torretas de berenejena, queso y tomate (vegetarianas)
It works!! As you can see, just about a hundred lines of Python were enough to achieve that, and much more could be done just by modifying some parameters. Hope it helps!
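For example, a couple of parameter tweaks point it somewhere else entirely: a different blog (the url below is just a placeholder), a shallower depth, and the full list of links instead of only the entries:
# Crawl a hypothetical WordPress blog, only 2 pages deep
crawler = Crawler('https://example.wordpress.com/', 2)
crawler.get_childs('https://example.wordpress.com/', 0)
# List every link found, not just the WP entries
crawler.show_childs()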