Sitemaps have many uses. Besides the obvious navigation advantage, they are practical for performance testing, finding bugs, and helping search engines index all your pages.

Pylot can help you find problematic pages on your site. It reports the number of errors returned, but most importantly it shows you the slowest-loading pages on the site so you know where to optimize. Pylot is written in Python, but it only makes HTTP requests to your server, so it is useful even if you are not working with, or don't know, Python.

First we create a simple sitemap class for the project. It just needs to hold a list of URLs and provide a few XML output methods.

class Sitemap(object):
    def __init__(self):
        self.entries = list()
    
    def append(self, url, lastmod=None, changefreq=None, priority=None):
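        """Add a URL entry; lastmod, changefreq and priority are optional."""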
        self.entries.append({   'url': url,
                                'lastmod': lastmod,
                                'changefreq': changefreq,
                                'priority': priority })

    def render(self):
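        """Render the entries as sitemaps.org-style sitemap XML."""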
        base = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
%s</urlset>
"""
        urls = ""
        for e in self.entries:
            urls += "  <url>\n"
            urls += "    <loc>%s</loc>\n" % e['url']
            if e['lastmod'] is not None:
                urls += "    <lastmod>%s</lastmod>\n" % e['lastmod']
            if e['changefreq'] is not None:
                urls += "    <changefreq>%s</changefreq>\n" % e['changefreq']
            if e['priority'] is not None:
                urls += "    <priority>%.1f</priority>\n" % e['priority']
            urls += "    </url>\n"
        return base % urls

    def render_pylot(self):
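        """Render the entries as a Pylot test-case file."""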
        base = """<testcases>
%s</testcases>
"""
        urls = ""
        for e in self.entries:
            urls += "  <case>\n"
            urls += "    <url>%s</url>\n" % e['url']
            urls += "  </case>\n"
        return base % urls

It runs like this:

sitemap = Sitemap()

sitemap.append("http://site.com/p0/", priority=0.5)
sitemap.append("http://site.com/p1/", lastmod='2010-01-01')
sitemap.append("http://site.com/p2/", changefreq='weekly', priority=1.0)

print(sitemap.render())
# <?xml version="1.0" encoding="UTF-8"?>
# <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
#   <url>
#     <loc>http://site.com/p0/</loc>
#     <priority>0.5</priority>
#   </url>
#   <url>
#     <loc>http://site.com/p1/</loc>
#     <lastmod>2010-01-01</lastmod>
#   </url>
#   <url>
#     <loc>http://site.com/p2/</loc>
#     <changefreq>weekly</changefreq>
#     <priority>1.0</priority>
#   </url>
# </urlset>

print(sitemap.render_pylot())
# <testcases>
#   <case>
#     <url>http://site.com/p0/</url>
#   </case>
#   <case>
#     <url>http://site.com/p1/</url>
#   </case>
#   <case>
#     <url>http://site.com/p2/</url>
#   </case>
# </testcases>

Obviously you will want to generate the URLs from your own address schema and database content.
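For example, if the pages live in a database, the entries might be built from a query. This is only a sketch; get_articles(), the slug and updated attributes, and the URL pattern are made-up placeholders for your own data-access code and address schema.

# Hypothetical example: build the sitemap from database rows.
# get_articles() and the URL layout are placeholders, not part of Sitemap.
sitemap = Sitemap()
for article in get_articles():
    sitemap.append("http://site.com/articles/%s/" % article.slug,
                   lastmod=article.updated.strftime('%Y-%m-%d'),
                   changefreq='weekly',
                   priority=0.5)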

The sitemap should be reachable via http://yoursite.com/sitemap.xml. For big sites (over 50,000 URLs) you need to split it into multiple files and point to them from a sitemap index. See sitemaps.org for more information.
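Assuming you keep the entries in a plain list as above, one way to split is to render one file per chunk of 50,000 entries; the file names here are only an illustration:

# Split the entries into files of at most 50,000 URLs each
# and write them out; list the resulting files in a sitemap index.
MAX_URLS = 50000
for i in range(0, len(sitemap.entries), MAX_URLS):
    part = Sitemap()
    part.entries = sitemap.entries[i:i + MAX_URLS]
    f = open("sitemap-%d.xml" % (i // MAX_URLS + 1), "w")
    f.write(part.render())
    f.close()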

To use the Pylot data, first save the render_pylot() output to an XML file.
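For example (the file name just has to match what you pass to Pylot):

f = open("your_map.xml", "w")
f.write(sitemap.render_pylot())
f.close()

Then 'cd' to the folder with Pylot and run: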

python run.py -x your_map.xml -a 5 -d 300

This will performance test the site by running 5 agents that hit the linked pages for 5 minutes (300 seconds). Pylot then generates an HTML report with useful charts and tables (charts require numpy and matplotlib).

That is all there is to it. Just a few minutes of work to improve your chances of getting your pages crawled correctly and to give yourself a good starting point for optimizing.