To parse a compressed XML sitemap (.xml.gz) directly from a URL without downloading it to disk, you can use Python's:
- requests
- gzip
- pandas
Example 1
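import gzip
from io import BytesIO, StringIO
import pandas as pd
import requests
# Download the compressed sitemap into memory; nothing is written to disk
r = requests.get('https://example.com/weekly_sitemap2024-11.xml.gz')
r.raise_for_status()
# Decompress the in-memory bytes
sitemap = gzip.GzipFile(fileobj=BytesIO(r.content)).read()
# Decode to str and rename the undeclared xhtml:link prefix before parsing
xml_text = sitemap.decode().replace('xhtml:link', 'xhtml')
df = pd.read_xml(StringIO(xml_text))
df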
Output
| | loc | xhtml | lastmod | changefreq | priority |
|---|---|---|---|---|---|
| 0 | https://example.com/1 | NaN | 2024-03-11T08:00:30+02:00 | weekly | 0.7 |
| 1 | https://example.com/2 | NaN | 2024-03-11T08:00:56+02:00 | weekly | 0.7 |
| 2 | https://example.com/3 | NaN | 2024-03-11T08:01:26+02:00 | weekly | 0.7 |
Explanation
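- Fetches the .gz file using requests.get().
- Decompresses it in memory using gzip.GzipFile() wrapped around a BytesIO buffer.
- Decodes the decompressed bytes to a string before calling replace(). Calling replace() with string arguments on a bytes object raises TypeError: a bytes-like object is required, not 'str'
- Replaces xhtml:link with xhtml to work around XML namespace errors such as XMLSyntaxError: Namespace prefix xhtml on link is not defined, line 1, column 302
- Parses the XML structure using pd.read_xml(), which places the URLs from the <loc> tags in the loc column.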
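If only the URLs are needed, the same steps can be sketched with the standard library's xml.etree.ElementTree instead of pandas. This is a swapped-in alternative, not the method used in Example 1. The function name extract_locs and the sample XML are hypothetical; the sample stands in for the bytes of a downloaded r.content and assumes the sitemap uses the standard sitemaps.org namespace.

```python
import gzip
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org protocol for <urlset>, <url> and <loc>.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_locs(gz_bytes):
    """Return the text of every <loc> element in a gzipped sitemap (hypothetical helper)."""
    # Decompress and decode first, so the str-based replace() below works.
    xml_text = gzip.decompress(gz_bytes).decode("utf-8")
    # Same workaround as in Example 1: drop the undeclared "xhtml:" prefix.
    xml_text = xml_text.replace("xhtml:link", "xhtml")
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iterfind(".//sm:loc", SITEMAP_NS)]


# Hypothetical sample data standing in for r.content from requests.get().
sample_xml = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    '<url><loc>https://example.com/1</loc>'
    '<xhtml:link rel="alternate" hreflang="en" href="https://example.com/en/1"/>'
    '<lastmod>2024-03-11T08:00:30+02:00</lastmod></url>'
    '<url><loc>https://example.com/2</loc></url>'
    '</urlset>'
)
sample_gz = gzip.compress(sample_xml.encode("utf-8"))

print(extract_locs(sample_gz))
# ['https://example.com/1', 'https://example.com/2']
```

With a real sitemap, passing r.content to extract_locs() should give the same list that the loc column holds in the DataFrame.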
Reading Raw Sitemap Content
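import gzip
from io import BytesIO
import requests
# Download the compressed sitemap into memory
response = requests.get('https://example.com/weekly_sitemap2024-11.xml.gz')
# Wrap the bytes in a file-like buffer and decompress them
buffer = BytesIO(response.content)
decompressed_content = gzip.GzipFile(fileobj=buffer).read()
# Prints the raw XML as a bytes object
print(decompressed_content)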
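As a small aside, gzip.decompress() should give the same bytes in a single call, and decoding them makes the printed XML easier to read. The sketch below uses a hypothetical in-memory payload in place of response.content so that it runs without a network connection.

```python
import gzip
from io import BytesIO

# Hypothetical stand-in for response.content (a gzipped sitemap body).
payload = gzip.compress(
    b"<urlset><url><loc>https://example.com/1</loc></url></urlset>"
)

# File-object approach, as used above.
via_file = gzip.GzipFile(fileobj=BytesIO(payload)).read()

# One-call alternative from the same standard-library module.
via_call = gzip.decompress(payload)

# Both approaches return identical bytes.
print(via_file == via_call)

# Decode to str so the XML prints without the b'...' wrapper.
text = via_call.decode("utf-8")
print(text)
```

The GzipFile form is useful when the data should be read in chunks; for a sitemap that already fits in memory, either form works.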