Parse Website XML Sitemap with Python and Pandas

Need to parse the XML sitemap of a website with Python and get all of its URLs as a Pandas DataFrame?

If so, this article shows several ways to do it.

Option 1: Parse XML Sitemap with Python and Pandas

The sitemap we are going to read is the one for this website:

/sitemap.xml

We are going to read it in Python with the requests library and then parse the URLs and the dates with the lxml module.

You can find the code below. First we read the sitemap with requests. Next we load the content into lxml, which creates a tree of all elements.

Finally we iterate over all elements in the sitemap and append each entry as a dict to a list:

import requests
import pandas as pd
from lxml import etree

main_sitemap = 'https://blog.softhints.com/sitemap.xml'

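# Despite its name, xmlDict is a list holding one dict per sitemap entry
xmlDict = []

r = requests.get(main_sitemap)
root = etree.fromstring(r.content)
print("The number of sitemap tags is {0}".format(len(root)))
for sitemap in root:
    # list(element) replaces the deprecated getchildren()
    children = list(sitemap)
    # Assumes <loc> is the first child and <lastmod> the second
    xmlDict.append({'url': children[0].text, 'date': children[1].text})

pd.DataFrame(xmlDict)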

The result is a Pandas DataFrame which contains the URLs and the dates of the sitemap.
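As a variation on the code above, looking up loc and lastmod by name avoids depending on the order of the child elements. The sketch below is a minimal example that runs on an inline sample instead of the live sitemap. The SAMPLE data, the NS mapping and the sitemap_to_df helper are assumptions made for illustration and are not part of the original code.

```python
import pandas as pd
from lxml import etree

# Hypothetical sample standing in for the downloaded r.content
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <sitemap>
        <loc>https://blog.softhints.com/sitemap-pages.xml</loc>
        <lastmod>2021-06-27T20:56:56.533Z</lastmod>
    </sitemap>
    <sitemap>
        <loc>https://blog.softhints.com/sitemap-posts.xml</loc>
        <lastmod>2021-08-16T09:50:34.531Z</lastmod>
    </sitemap>
</sitemapindex>"""

# Prefix-to-namespace mapping; the prefix "sm" is an arbitrary choice
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_to_df(xml_bytes):
    """Return a DataFrame with one row per sitemap entry (illustrative helper)."""
    root = etree.fromstring(xml_bytes)
    rows = []
    for entry in root:
        rows.append({
            # findtext returns None when the child element is missing
            "url": entry.findtext("sm:loc", namespaces=NS),
            "date": entry.findtext("sm:lastmod", namespaces=NS),
        })
    df = pd.DataFrame(rows, columns=["url", "date"])
    # Convert the ISO 8601 strings to timezone-aware timestamps
    df["date"] = pd.to_datetime(df["date"], utc=True)
    return df


df = sitemap_to_df(SAMPLE)
print(df)
```

With a real site, passing r.content from requests in place of SAMPLE should behave the same way, provided the sitemap uses the standard sitemaps.org namespace.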

Option 2: Parse compressed XML Sitemap with Python

This option handles a sitemap which is compressed. The parsing is much the same but includes one additional step, decompression:

import requests
import gzip
from io import BytesIO

r = requests.get('http://blog.softhints.com/sitemap.xml.gz')
# r.content is bytes, so it has to be wrapped in BytesIO (StringIO only accepts str)
sitemap = gzip.GzipFile(fileobj=BytesIO(r.content)).read()

We use BytesIO from io to give gzip a file-like object over the downloaded bytes of the compressed XML sitemap.

Then we can parse it the same way as in Option 1.
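To show the decompress-then-parse step end to end without a network call, here is a minimal sketch. It compresses a small sample in memory to play the role of r.content, and it uses gzip.decompress together with the standard library xml.etree.ElementTree instead of lxml. The sample data and the parse_gzipped_sitemap helper are assumptions made for illustration.

```python
import gzip
import xml.etree.ElementTree as ET

# Hypothetical sample standing in for the body of a downloaded sitemap.xml.gz
raw_xml = b"""<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <sitemap>
        <loc>https://blog.softhints.com/sitemap-posts.xml</loc>
        <lastmod>2021-08-16T09:50:34.531Z</lastmod>
    </sitemap>
</sitemapindex>"""
compressed = gzip.compress(raw_xml)  # plays the role of r.content


def parse_gzipped_sitemap(data):
    """Return a list of {'url', 'date'} dicts from (possibly gzipped) bytes."""
    # gzip streams start with the magic bytes 0x1f 0x8b; anything else is
    # treated as XML that has already been decompressed
    if data[:2] == b"\x1f\x8b":
        data = gzip.decompress(data)
    root = ET.fromstring(data)
    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    return [
        {"url": entry.findtext("sm:loc", namespaces=ns),
         "date": entry.findtext("sm:lastmod", namespaces=ns)}
        for entry in root
    ]


rows = parse_gzipped_sitemap(compressed)
print(rows)
```

The magic-byte check is a design choice for the case where the body has already been decompressed before it reaches your code. The resulting list of dicts can be passed to pd.DataFrame just like in Option 1.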

Option 3: Parse local XML Sitemap with Python - no namespaces

Sometimes you need to parse a sitemap which is stored locally on your machine.

Suppose you have a local file named XML Sitemap.xml with content like this:

<?xml version="1.0" encoding="UTF-8"?><?xml-stylesheet type="text/xsl" href="//blog.softhints.com/sitemap.xsl"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <sitemap>
        <loc>https://blog.softhints.com/sitemap-pages.xml</loc>
        <lastmod>2021-06-27T20:56:56.533Z</lastmod>
    </sitemap>
    <sitemap>
        <loc>https://blog.softhints.com/sitemap-posts.xml</loc>
        <lastmod>2021-08-16T09:50:34.531Z</lastmod>
    </sitemap>
    <sitemap>
        <loc>https://blog.softhints.com/sitemap-authors.xml</loc>
        <lastmod>2021-08-20T19:26:41.214Z</lastmod>
    </sitemap>
    <sitemap>
        <loc>https://blog.softhints.com/sitemap-tags.xml</loc>
        <lastmod>2021-08-16T09:50:34.641Z</lastmod>
    </sitemap>
</sitemapindex>

To parse the above sitemap without dealing with the namespaces, you can use the following Python code:

import lxml.etree

tree = lxml.etree.parse("/home/myuser/Desktop/XML Sitemap.xml")

# local-name() matches on the tag name and ignores its namespace
for url in tree.xpath("//*[local-name()='loc']/text()"):
    print(url)

for date in tree.xpath("//*[local-name()='lastmod']/text()"):
    print(date)

This will print the URLs first and then the dates:

https://blog.softhints.com/sitemap-pages.xml
https://blog.softhints.com/sitemap-posts.xml
https://blog.softhints.com/sitemap-authors.xml
https://blog.softhints.com/sitemap-tags.xml
2021-06-27T20:56:56.533Z
2021-08-16T09:50:34.531Z
2021-08-20T19:26:41.214Z
2021-08-16T09:50:34.641Z
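Because the two loops print URLs and dates separately, the link between a URL and its date is lost, which matters if some entry has no lastmod. A minimal sketch of keeping them paired is to select each entry first and then query its children with a relative XPath. The inline SAMPLE (with one entry deliberately missing its date) and the rows variable are assumptions made for illustration.

```python
import lxml.etree

# Hypothetical sample; the second entry has no <lastmod> on purpose
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <sitemap>
        <loc>https://blog.softhints.com/sitemap-pages.xml</loc>
        <lastmod>2021-06-27T20:56:56.533Z</lastmod>
    </sitemap>
    <sitemap>
        <loc>https://blog.softhints.com/sitemap-posts.xml</loc>
    </sitemap>
</sitemapindex>"""

root = lxml.etree.fromstring(SAMPLE)

rows = []
# Match <sitemap> (index files) or <url> (regular sitemaps) in any namespace
for entry in root.xpath("//*[local-name()='sitemap' or local-name()='url']"):
    # string() yields '' when the child element does not exist
    loc = entry.xpath("string(*[local-name()='loc'])").strip()
    lastmod = entry.xpath("string(*[local-name()='lastmod'])").strip()
    rows.append({"url": loc, "date": lastmod or None})

for row in rows:
    print(row["url"], row["date"])
```

The rows list has the same shape as the one in Option 1, so it can be handed to pd.DataFrame if a DataFrame is needed.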

Note: If you need to find all elements and print their values, you can use the iter() method of the root element:

root = tree.getroot()
for i in root.iter():
    print(i.text)
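Iterating over every element also visits container elements such as sitemapindex and sitemap, whose text is only whitespace. A small sketch that skips those and prints the tag name next to each value might look like this. The inline SAMPLE and the values list are assumptions made for illustration.

```python
from lxml import etree

# Hypothetical sample with a single entry
SAMPLE = b"""<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <sitemap>
        <loc>https://blog.softhints.com/sitemap-pages.xml</loc>
        <lastmod>2021-06-27T20:56:56.533Z</lastmod>
    </sitemap>
</sitemapindex>"""

root = etree.fromstring(SAMPLE)

values = []
# tag=etree.Element restricts the iteration to real elements,
# skipping comments and processing instructions
for el in root.iter(tag=etree.Element):
    text = (el.text or "").strip()
    if text:  # container elements hold only whitespace, so they are skipped
        name = etree.QName(el).localname  # tag name without the namespace
        values.append((name, text))
        print(name, text)
```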

Option 4: Parse local XML Sitemap with Python - namespaces

If you need to parse an XML sitemap with namespaces, you can use the method root.findall and prefix each tag name with the namespace.

To find out what your namespace is, you can inspect the root element:

tree.getroot()

The result is:

<Element '{http://www.sitemaps.org/schemas/sitemap/0.9}sitemapindex' at 0x7fe25fbdd720>

So the namespace which we are going to use is {http://www.sitemaps.org/schemas/sitemap/0.9} and the element we are searching for is sitemap.

The code below parses an XML sitemap which is stored locally, but the same approach also works with content downloaded by requests:

import xml.etree.ElementTree as ET
tree = ET.parse("/home/myuser/Desktop/XML Sitemap.xml")
root = tree.getroot()

# In find/findall, prefix namespaced tags with the full namespace in braces
for sitemap in root.findall('{http://www.sitemaps.org/schemas/sitemap/0.9}sitemap'):
    loc = sitemap.find('{http://www.sitemaps.org/schemas/sitemap/0.9}loc').text
    lastmod = sitemap.find('{http://www.sitemaps.org/schemas/sitemap/0.9}lastmod').text
    print(loc, lastmod)

The parsed sitemap content is shown below:

https://blog.softhints.com/sitemap-pages.xml 2021-06-27T20:56:56.533Z
https://blog.softhints.com/sitemap-posts.xml 2021-08-16T09:50:34.531Z
https://blog.softhints.com/sitemap-authors.xml 2021-08-20T19:26:41.214Z
https://blog.softhints.com/sitemap-tags.xml 2021-08-16T09:50:34.641Z
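Repeating the full namespace in every path gets verbose. ElementTree's find and findall also accept a prefix-to-namespace mapping, and the namespace can be read from the root tag instead of being typed by hand. The sketch below shows this under some assumptions: it parses an inline string with ET.fromstring (the same call would be used for downloaded content), the root element is assumed to be namespaced, and the SAMPLE data, the "sm" prefix and the pairs list are chosen for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; with requests this would be the downloaded content
SAMPLE = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <sitemap>
        <loc>https://blog.softhints.com/sitemap-pages.xml</loc>
        <lastmod>2021-06-27T20:56:56.533Z</lastmod>
    </sitemap>
    <sitemap>
        <loc>https://blog.softhints.com/sitemap-posts.xml</loc>
        <lastmod>2021-08-16T09:50:34.531Z</lastmod>
    </sitemap>
</sitemapindex>"""

root = ET.fromstring(SAMPLE)

# root.tag looks like '{namespace}sitemapindex'; keep the part inside the braces
namespace = root.tag.split("}")[0].lstrip("{")
ns = {"sm": namespace}  # "sm" is an arbitrary prefix used only in the paths below

pairs = []
for sitemap in root.findall("sm:sitemap", ns):
    loc = sitemap.find("sm:loc", ns).text
    lastmod = sitemap.find("sm:lastmod", ns).text
    pairs.append((loc, lastmod))
    print(loc, lastmod)
```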