How to Parse a Compressed Sitemap in Python and Pandas

To parse a compressed XML sitemap (.xml.gz) directly from a URL without saving it to disk, you can combine three Python libraries:

requests
gzip
pandas

Example 1

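import gzip
from io import BytesIO, StringIO

import pandas as pd
import requests

# Download the compressed sitemap into memory (nothing is written to disk)
r = requests.get('https://example.com/weekly_sitemap2024-11.xml.gz')
# Decompress the gzip payload to raw XML bytes
sitemap = gzip.GzipFile(fileobj=BytesIO(r.content)).read()
# Decode to str and rename the xhtml:link tag to avoid the undefined-prefix error
xml_text = sitemap.decode().replace('xhtml:link', 'xhtml')
# StringIO avoids the literal-string deprecation warning in newer pandas versions
df = pd.read_xml(StringIO(xml_text))
df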

Output

	loc	xhtml	lastmod	changefreq	priority
0	https://example.com/1	NaN	2024-03-11T08:00:30+02:00	weekly	0.7
1	https://example.com/2	NaN	2024-03-11T08:00:56+02:00	weekly	0.7
2	https://example.com/3	NaN	2024-03-11T08:01:26+02:00	weekly	0.7

Explanation

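Fetches the .gz file with requests.get().
Decompresses it in memory with gzip.GzipFile(...).read() (gzip.decompress(r.content) is equivalent).
Decodes the decompressed bytes to a string so that str.replace() can be applied; calling replace() with string arguments on bytes raises TypeError: a bytes-like object is required, not 'str'.
Replaces the 'xhtml:link' tag name with 'xhtml' to work around namespace errors such as XMLSyntaxError: Namespace prefix xhtml on link is not defined, line 1, column 302.
Parses the XML with pd.read_xml(), which uses the lxml parser by default.
Returns one row per <url> entry, with the URL from the <loc> tag in the loc column.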
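
The decompress-and-extract steps can also be sketched with the standard library only, which makes them easy to try without a network call. This is a minimal sketch, not the method used in Example 1: it assumes a well-formed sitemap whose namespaces are properly declared, and the sample XML and the function name sitemap_rows are made up for illustration.

```python
import gzip
import xml.etree.ElementTree as ET

# Standard sitemap namespace (sitemaps.org protocol)
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sample data standing in for a downloaded sitemap
SAMPLE_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/1</loc>
    <lastmod>2024-03-11T08:00:30+02:00</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.7</priority>
  </url>
  <url>
    <loc>https://example.com/2</loc>
    <lastmod>2024-03-11T08:00:56+02:00</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.7</priority>
  </url>
</urlset>"""


def sitemap_rows(gz_bytes):
    """Decompress gzipped sitemap bytes and return one dict per <url> entry."""
    xml_bytes = gzip.decompress(gz_bytes)
    root = ET.fromstring(xml_bytes)
    rows = []
    for url in root.findall("sm:url", NS):
        # Strip the "{namespace}" prefix that ElementTree puts on each tag name
        rows.append({child.tag.split("}", 1)[-1]: child.text for child in url})
    return rows


# Simulate the bytes that requests.get(...).content would return
rows = sitemap_rows(gzip.compress(SAMPLE_XML))
for row in rows:
    print(row["loc"], row["lastmod"])
```

Passing the resulting list of dicts to pd.DataFrame(rows) should give a table with the same loc, lastmod, changefreq and priority columns as the Output above.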

Reading Raw Sitemap Content

import gzip
from io import BytesIO

import requests

response = requests.get('https://example.com/weekly_sitemap2024-11.xml.gz')

# Wrap the downloaded bytes in a file-like buffer and decompress them
buffer = BytesIO(response.content)
decompressed_content = gzip.GzipFile(fileobj=buffer).read()
print(decompressed_content)  # raw XML as bytes
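
The raw content comes back as bytes, so it prints with a b'...' prefix. The sketch below shows, under the assumption that the payload is gzip-compressed UTF-8 XML, that gzip.GzipFile(...).read() and gzip.decompress() give the same bytes and that decoding yields readable text. The payload here is a made-up stand-in for response.content.

```python
import gzip
from io import BytesIO

# Hypothetical stand-in for response.content (gzip-compressed XML bytes)
payload = gzip.compress(
    b"<urlset><url><loc>https://example.com/1</loc></url></urlset>"
)

# Two equivalent ways to decompress in memory
via_file = gzip.GzipFile(fileobj=BytesIO(payload)).read()
via_func = gzip.decompress(payload)

# Decode the bytes to str for readable output (assumes UTF-8 content)
text = via_file.decode("utf-8")
print(text[:40])
```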

