How to read and test robots.txt with Python
In this quick tutorial, we'll cover how to test, read and extract information from robots.txt in Python. We are going to use two libraries: urllib.request and requests.
Step 1: Test if robots.txt exists
First, we will test whether robots.txt exists. To do so, we are going to use the requests library: we will request the robots.txt page and return the status code of the link:
import requests

def status_code(url):
    r = requests.get(url)
    return r.status_code

print(status_code('https://softhints.com/robots.txt'))
Result:
200
This means that robots.txt exists for this site: https://softhints.com/.
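If you need a reusable check, a slightly extended version of the same idea might look like the sketch below. The robots_exists helper and the 10-second timeout are assumptions for illustration, not part of the original snippet; it treats anything other than HTTP 200 (or a connection failure) as "no usable robots.txt":

import requests

def robots_exists(url):
    # Hypothetical helper: True only when robots.txt answers with HTTP 200
    robots_url = url.rstrip('/') + '/robots.txt'
    try:
        r = requests.get(robots_url, timeout=10)
        return r.status_code == 200
    except requests.RequestException:
        # connection error, timeout, invalid URL, etc.
        return False

print(robots_exists('https://softhints.com'))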
Step 2: Read robots.txt with Python
Now let's say that we would like to extract particular information from the robots.txt file, for example the sitemap link.
To read and parse robots.txt with Python we will use urllib.request.
The code for reading and parsing the robots.txt file looks like this:
from urllib.request import urlopen
import re

robots = 'https://softhints.com/robots.txt'
sitemap_ls = []

with urlopen(robots) as stream:
    # read the content once and go through it line by line
    for line in stream.read().decode("utf-8").split('\n'):
        if 'sitemap' in line.lower():
            sitemap_url = re.findall(r' (https.*xml)', line)[0]
            sitemap_ls.append(sitemap_url)
If the code works properly, you will get the sitemap link. In case of bot protection you will get a 403 error:
HTTPError: HTTP Error 403: Forbidden
Depending on the protection, you might need to use different techniques to bypass it.
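One common technique, for example, is to send a browser-like User-Agent header with the request. The sketch below is only an assumption of what that might look like with urllib.request; whether it helps depends entirely on the site's protection:

from urllib.request import Request, urlopen

robots = 'https://softhints.com/robots.txt'

# Hypothetical example header; not guaranteed to bypass every protection
req = Request(robots, headers={'User-Agent': 'Mozilla/5.0'})

with urlopen(req) as stream:
    content = stream.read().decode('utf-8')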
Step 3: Extract sitemap link from any URL
In this last section we are going to define a method which takes a URL and tries to extract the sitemap.xml link based on the robots.txt file.
The code reuses the steps above plus one additional step: converting the URL to a domain with urlparse:
from urllib.request import urlopen
from urllib.parse import urlparse
import re

test_url = "https://blog.softhints.com"

def get_robots(test_url):
    # build the robots.txt URL from the scheme and domain of the input URL
    domain = urlparse(test_url).netloc
    scheme = urlparse(test_url).scheme
    robots = f'{scheme}://{domain}/robots.txt'
    sitemap_ls = []

    with urlopen(robots) as stream:
        for line in stream.read().decode("utf-8").split('\n'):
            if 'sitemap' in line.lower():
                sitemap_url = re.findall(r' (https.*xml)', line)[0]
                sitemap_ls.append(sitemap_url)

    # remove duplicates before returning
    return list(set(sitemap_ls))

get_robots(test_url)
If the code works properly, you will get the sitemap or sitemaps of the site as a list.
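Since a site may not publish a robots.txt at all, it can be handy to guard the call against HTTP and connection errors. The get_robots_safe wrapper below is a hypothetical sketch, not part of the original code; it reuses get_robots from above and simply returns an empty list on failure:

from urllib.error import HTTPError, URLError

def get_robots_safe(url):
    # Hypothetical wrapper around get_robots(); returns [] if robots.txt is missing or blocked
    try:
        return get_robots(url)
    except HTTPError as e:
        print(f'robots.txt returned an error: {e.code}')
    except URLError as e:
        print(f'could not reach the site: {e.reason}')
    return []

print(get_robots_safe("https://blog.softhints.com"))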