To parse a compressed XML sitemap (.xml.gz) directly from a URL without saving it to disk, you can combine the following Python modules:

  • requests
  • gzip
  • pandas

Example 1

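import gzip
from io import BytesIO, StringIO

import pandas as pd
import requests

# Download the compressed sitemap into memory (nothing is written to disk).
r = requests.get('https://example.com/weekly_sitemap2024-11.xml.gz')
# Decompress the in-memory bytes.
sitemap = gzip.GzipFile(fileobj=BytesIO(r.content)).read()
# Decode to str, then rename the 'xhtml:link' tag so the parser accepts it.
xml_text = sitemap.decode().replace('xhtml:link', 'xhtml')
df = pd.read_xml(StringIO(xml_text))
df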

Output

                     loc  xhtml                    lastmod changefreq  priority
0  https://example.com/1    NaN  2024-03-11T08:00:30+02:00     weekly       0.7
1  https://example.com/2    NaN  2024-03-11T08:00:56+02:00     weekly       0.7
2  https://example.com/3    NaN  2024-03-11T08:01:26+02:00     weekly       0.7

Explanation

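  • Fetches the .gz file with requests.get(); r.content holds the compressed bytes in memory.
  • Decompresses it with gzip.GzipFile wrapped around a BytesIO buffer (gzip.decompress(r.content) gives the same result).
  • Decodes the decompressed bytes to str so that str.replace() can be applied. Calling replace() on the bytes with str arguments raises TypeError: a bytes-like object is required, not 'str'.
  • Replaces 'xhtml:link' with 'xhtml' to work around namespace errors such as XMLSyntaxError: Namespace prefix xhtml on link is not defined, line 1, column 302.
  • Parses the XML with pd.read_xml() (which uses lxml by default, not ElementTree), producing one row per <url> entry and one column per child tag: loc, lastmod, changefreq, priority.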
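If only the <loc> and <lastmod> values are needed, the same decompress-then-parse steps can be sketched without pandas, using the standard library's ElementTree and an explicit namespace map instead of the string replacement. This is a minimal sketch under assumptions: the function name parse_sitemap_gz and the sample document are hypothetical, and it assumes the sitemap declares the xhtml prefix properly. If the prefix is undeclared, as in the XMLSyntaxError above, ElementTree rejects the document too and the replace() workaround from Example 1 is still needed.

```python
import gzip
import xml.etree.ElementTree as ET

# Namespace URI defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap_gz(gz_bytes):
    """Decompress gzipped sitemap bytes and return one dict per <url> entry.

    Hypothetical helper for illustration; not part of any library.
    """
    xml_bytes = gzip.decompress(gz_bytes)
    root = ET.fromstring(xml_bytes)
    rows = []
    for url in root.findall("sm:url", SITEMAP_NS):
        rows.append({
            "loc": url.findtext("sm:loc", namespaces=SITEMAP_NS),
            "lastmod": url.findtext("sm:lastmod", namespaces=SITEMAP_NS),
        })
    return rows


# Made-up sample standing in for r.content, so the sketch runs offline.
sample = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <url>
    <loc>https://example.com/1</loc>
    <xhtml:link rel="alternate" hreflang="de" href="https://example.com/de/1"/>
    <lastmod>2024-03-11T08:00:30+02:00</lastmod>
  </url>
</urlset>"""

rows = parse_sitemap_gz(gzip.compress(sample))
print(rows)
```

With a real sitemap, pass r.content from requests.get() in place of the compressed sample.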

Reading Raw Sitemap Content

import gzip
from io import BytesIO

import requests

response = requests.get('https://example.com/weekly_sitemap2024-11.xml.gz')

buffer = BytesIO(response.content)
decompressed_content = gzip.GzipFile(fileobj=buffer).read()
print(decompressed_content)
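One caveat when reading the raw content: if a server delivers the file with a Content-Encoding: gzip header, requests decompresses the body automatically, so response.content may already be plain XML and a second decompression would fail. The sketch below guards against that by checking for the gzip magic bytes (0x1f 0x8b) first. The helper name to_xml_text and the sample payload are assumptions made for illustration, and UTF-8 is assumed as the encoding.

```python
import gzip

# Every gzip stream starts with these two bytes.
GZIP_MAGIC = b"\x1f\x8b"


def to_xml_text(payload, encoding="utf-8"):
    """Return the sitemap as str, decompressing only if the payload is gzipped.

    Hypothetical helper for illustration; not part of requests or gzip.
    """
    if payload[:2] == GZIP_MAGIC:
        payload = gzip.decompress(payload)
    return payload.decode(encoding)


# Made-up payload standing in for response.content, so the sketch runs offline.
raw_xml = b"<urlset><url><loc>https://example.com/1</loc></url></urlset>"

print(to_xml_text(gzip.compress(raw_xml)))  # compressed input
print(to_xml_text(raw_xml))                 # already-decompressed input
```

Both calls print the same XML text, whichever form the payload arrives in.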

Resources