I do a lot of technical SEO, and checking for broken links with Xenu is part of my routine work. So I wanted to import the list of URLs from a site's sitemap.xml and feed it to Xenu.
I implemented this in my favorite scripting language, Python. PyQuery is a Python DOM parsing library with an API similar to jQuery in JavaScript.
Let’s install the required dependency first.
- Get pyquery from https://pypi.python.org/pypi/pyquery#downloads
- Unzip pyquery-1.2.x.zip (where x is the latest version)
- cd pyquery-1.2.x (on Windows, open a command prompt and navigate to this directory)
- Type: python setup.py install
Verify pyquery setup
python
>>> import pyquery
>>> dir(pyquery)
['PyQuery','__builtins__','__doc__','__file__','__name__','__package__','__path__','cssselectpatch','openers','pyquery']
Now let’s look into pyquery basics.
In jQuery you select a DOM node as follows:
jQuery('element')
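In PyQuery you first load the document, and then select DOM nodes from it in the same way:
from pyquery import PyQuery as pq
jQuery = pq(url=remote_sitemap)
The statement above loads the document at remote_sitemap into a PyQuery object and assigns it to the variable jQuery, which can then be called with a selector, for example jQuery('loc').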
Now let’s code the sitemap grabber and parser.
Sample sitemap format
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
    <lastmod>2013-09-10</lastmod>
    <priority>1.0</priority>
  </url>
</urlset>
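To make the structure of this format concrete, the sketch below reads the same sample with Python's standard-library xml.etree.ElementTree rather than pyquery. This is a swapped-in technique for illustration only, written for Python 3; the helper name read_entries is hypothetical. Note that every element belongs to the namespace declared on urlset, so ElementTree lookups have to name it.

```python
import xml.etree.ElementTree as ET

# The sample sitemap from this post, as bytes.
SITEMAP = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://www.example.com/</loc><lastmod>2013-09-10</lastmod><priority>1.0</priority></url>
</urlset>"""

# Prefix-to-namespace mapping used in the lookups below ("sm" is an arbitrary prefix).
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def read_entries(xml_bytes):
    """Return one dict per <url> entry with its loc, lastmod and priority text."""
    root = ET.fromstring(xml_bytes)
    entries = []
    for url in root.findall("sm:url", NS):
        entries.append({
            "loc": url.findtext("sm:loc", default="", namespaces=NS),
            "lastmod": url.findtext("sm:lastmod", default="", namespaces=NS),
            "priority": url.findtext("sm:priority", default="", namespaces=NS),
        })
    return entries


if __name__ == "__main__":
    for entry in read_entries(SITEMAP):
        print(entry["loc"], entry["lastmod"], entry["priority"])
```

For feeding Xenu only the loc value matters; lastmod and priority are read here just to show where they sit in each entry.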
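Grabbing and parsing URLs with pyquery
from pyquery import PyQuery as pq
# remote_sitemap holds the URL of the sitemap to fetch.
# Fetch the sitemap, sending a Googlebot user-agent header.
jQuery = pq(url=remote_sitemap, headers={'user-agent': 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'})
# Print the text of every <loc> node.
for i in jQuery('loc'):
    print jQuery(i).text()
The code above fetches sitemap.xml and prints the value of every loc node.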
Now let's do a small Python file-handling exercise to save the list of URLs.
# Create the file, or overwrite it if it already exists.
f = open(filename, "w")
for i in jQuery('loc'):
    url = jQuery(i).text()
    print url
    # Write one URL per line.
    f.write(url)
    f.write("\n")
f.close()
The code above opens the file and writes the URL from each loc node on its own line.
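The file-writing step can also be sketched as a small reusable function. This is a minimal Python 3 sketch, not the post's Python 2 code; the name save_urls and the demo URLs are hypothetical. Opening the file in "w" mode creates it if it is missing and truncates it otherwise, so a shorter list never leaves stale lines behind from an earlier run.

```python
import os
import tempfile


def save_urls(urls, filename):
    """Write one URL per line, replacing any previous contents; return the count."""
    count = 0
    # The with-block closes the file even if a write fails part-way through.
    with open(filename, "w", encoding="utf-8") as f:
        for url in urls:
            f.write(url + "\n")
            count += 1
    return count


if __name__ == "__main__":
    # Demo: write two placeholder URLs to a throwaway file and show the result.
    path = os.path.join(tempfile.mkdtemp(), "urls.txt")
    save_urls(["http://www.example.com/", "http://www.example.com/about"], path)
    with open(path, encoding="utf-8") as f:
        print(f.read())
```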
Finally, let’s make it a command-line tool.
# Python 2 script: save every <loc> URL from a remote sitemap to a local file.
from pyquery import PyQuery as pq
import sys
import urllib2

# Expect exactly two arguments: the sitemap URL and the output file.
if len(sys.argv) != 3:
    print "Usage: sitemap_to_list.py remote_url localfile"
    sys.exit(1)

remote_sitemap = sys.argv[1]
local_file = sys.argv[2]

try:
    jQuery = pq(url=remote_sitemap, headers={'user-agent': 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'})
except urllib2.HTTPError as err:
    if err.code == 404:
        print "File " + str(remote_sitemap) + " not found"
    elif err.code == 403:
        print "\n403 Access Denied\n****************\nFile " + str(remote_sitemap) + " reading error"
    else:
        # Any other HTTP error also leaves nothing to parse, so report it and stop.
        print "HTTP error " + str(err.code) + " while reading " + str(remote_sitemap)
    sys.exit(1)

# Create the output file, or overwrite it if it already exists.
f = open(local_file, "w")
for i in jQuery('loc'):
    url = jQuery(i).text()
    print url
    f.write(url + "\n")
f.close()
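The argument check and the error messages in the script can also be sketched with the standard-library argparse module, which generates the usage text automatically. This is a Python 3 sketch under stated assumptions: the helper names parse_args and describe_http_error are hypothetical, and the fallback message for other status codes is an addition that the original script does not have.

```python
import argparse


def parse_args(argv):
    """Parse the two positional arguments the script expects."""
    parser = argparse.ArgumentParser(
        prog="sitemap_to_list.py",
        description="Save the <loc> URLs of a remote sitemap to a local file.",
    )
    parser.add_argument("remote_url", help="URL of the sitemap.xml to fetch")
    parser.add_argument("localfile", help="file that receives one URL per line")
    return parser.parse_args(argv)


def describe_http_error(code, url):
    """Map an HTTP status code to a message similar to the ones the script prints."""
    if code == 404:
        return "File %s not found" % url
    if code == 403:
        return "403 Access Denied: error reading %s" % url
    # Assumed fallback for every other status code.
    return "HTTP error %d while reading %s" % (code, url)


if __name__ == "__main__":
    args = parse_args(["http://www.example.com/sitemap.xml", "urls.txt"])
    print(args.remote_url, args.localfile)
    print(describe_http_error(404, args.remote_url))
```

With argparse, a wrong number of arguments prints the usage line and exits with a non-zero status, so the manual len(sys.argv) check is no longer needed.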