In this quick tutorial, we'll cover how to test, read and extract information from robots.txt in Python. We are going to use two libraries: urllib.request and requests.
Step 1: Test if robots.txt exists
First we will test whether the robots.txt file exists or not. To do so we are going to use the requests library: we visit the robots.txt page and return the status code of the request:
import requests

def status_code(url):
    r = requests.get(url)
    return r.status_code

print(status_code('https://softhints.com/robots.txt'))
Output:
200
This means that robots.txt exists for this site: https://softhints.com/.
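A status code of 200 means the file is reachable, while 404 typically means the site has no robots.txt at all. As a minimal sketch (the helper name has_robots is ours, not part of requests), the check can be wrapped in a small function:

import requests

def has_robots(url):
    # a HEAD request is enough to check existence; follow redirects like a browser would
    r = requests.head(url, allow_redirects=True, timeout=10)
    return r.status_code == 200

print(has_robots('https://softhints.com/robots.txt'))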
Step 2: Read robots.txt with Python
Now let's say that we would like to extract a particular piece of information from the robots.txt file - the sitemap link.
To read and parse robots.txt with Python we will use urllib.request. The code for reading and parsing the robots.txt file looks like this:
from urllib.request import urlopen
import re

robots = 'https://softhints.com/robots.txt'

sitemap_ls = []
with urlopen(robots) as stream:
    # read the file once and go through it line by line
    for line in stream.read().decode("utf-8").split('\n'):
        if 'sitemap' in line.lower():
            sitemap_url = re.findall(r' (https.*xml)', line)[0]
            sitemap_ls.append(sitemap_url)
If the code is working properly you will get the sitemap link. In case of protection against bots you will get error 403:
HTTPError: HTTP Error 403: Forbidden
Depending on the protection you might need to use different techniques to bypass it.
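One common case is a server that rejects the default Python user agent. As a minimal sketch (assuming the block is user-agent based, which is not always the case), sending a browser-like User-Agent header via urllib.request.Request may help:

from urllib.request import Request, urlopen

robots = 'https://softhints.com/robots.txt'
# some servers reject the default Python user agent, so send a browser-like one
req = Request(robots, headers={'User-Agent': 'Mozilla/5.0'})
content = urlopen(req).read().decode('utf-8')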
Step 3: Extract sitemap link from any URL
In this last section we are going to define a method which takes a URL and tries to extract the sitemap.xml link based on the robots.txt file. The code reuses the steps above plus one additional step - converting the URL to a domain with urlparse:
from urllib.request import urlopen
from urllib.parse import urlparse
import re
test_url = "https://blog.softhints.com"

def get_robots(test_url):
    # build the robots.txt URL from the scheme and domain of the input URL
    domain = urlparse(test_url).netloc
    scheme = urlparse(test_url).scheme
    robots = f'{scheme}://{domain}/robots.txt'

    sitemap_ls = []
    with urlopen(robots) as stream:
        for line in stream.read().decode("utf-8").split('\n'):
            if 'sitemap' in line.lower():
                sitemap_url = re.findall(r' (https.*xml)', line)[0]
                sitemap_ls.append(sitemap_url)
    # deduplicate in case the same sitemap is listed more than once
    return list(set(sitemap_ls))

get_robots(test_url)
If the code works properly you will get the sitemap or sitemaps of the site as a list.
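As an alternative, the standard library module urllib.robotparser can do the same job. A minimal sketch (the site_maps() method requires Python 3.8+):

from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.set_url('https://softhints.com/robots.txt')
parser.read()
# returns the list of sitemap URLs declared in robots.txt, or None if there are none
print(parser.site_maps())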