How to read and test robots.txt with Python

In this quick tutorial, we'll cover how to test, read and extract information from robots.txt in Python. We are going to use two libraries: urllib.request and requests.

Step 1: Test if robots.txt exists

First we will test whether robots.txt exists. To do so we are going to use the requests library: we request the robots.txt page and return the status code of the response:

import requests

def status_code(url):
    r = requests.get(url)
    return r.status_code

print(status_code('https://softhints.com/robots.txt'))

Output:

200

This means that robots.txt exists for this site: https://softhints.com/.
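
Other status codes are possible as well. As a rough sketch (the helper name describe_status and its messages are illustrative and not part of requests), the value returned by status_code could be mapped to a short explanation like this:

```python
from http import HTTPStatus

def describe_status(code):
    """Map an HTTP status code to a short note about robots.txt (illustrative helper)."""
    if code == HTTPStatus.OK:
        return 'robots.txt exists'
    if code == HTTPStatus.NOT_FOUND:
        return 'robots.txt not found'
    if code == HTTPStatus.FORBIDDEN:
        return 'access forbidden, possibly bot protection'
    # Anything else (redirects, server errors, ...) is reported as-is
    return f'unexpected status: {code}'

print(describe_status(200))
```

In practice the argument would come from the function above, for example describe_status(status_code('https://softhints.com/robots.txt')).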

Step 2: Read robots.txt with Python

Now let's say that we would like to extract particular information from the robots.txt file, for example the sitemap link.

To read and parse robots.txt with Python we will use urllib.request.

The code for reading and parsing the robots.txt file looks like this:


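from urllib.request import urlopen
import re

robots = 'https://softhints.com/robots.txt'

sitemap_ls = []

# Download the file once and scan it line by line
with urlopen(robots) as stream:
    for line in stream.read().decode("utf-8").splitlines():
        if 'sitemap' in line.lower():
            # Keep the https link ending in xml, if the line has one
            matches = re.findall(r' (https.*xml)', line)
            if matches:
                sitemap_ls.append(matches[0])

print(sitemap_ls)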

If the code works properly, you will get the sitemap link. If the site is protected against bots, you will get a 403 error:

HTTPError: HTTP Error 403: Forbidden

Depending on the protection you might need to use different techniques to bypass it.
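
Some servers appear to reject the default user agent that urllib sends. A minimal sketch, assuming the site accepts requests that carry an explicit User-Agent header (the helper names build_request and read_robots and the agent string are placeholders, and this will not help against every kind of protection):

```python
from urllib.request import Request, urlopen

def build_request(url, user_agent='my-robots-reader/0.1'):
    # Attach an explicit User-Agent header (placeholder value) to the request
    return Request(url, headers={'User-Agent': user_agent})

def read_robots(url):
    # Performs the actual download; may still raise HTTPError if the site blocks it
    with urlopen(build_request(url), timeout=10) as stream:
        return stream.read().decode('utf-8')

# Building the request does not touch the network yet
req = build_request('https://softhints.com/robots.txt')
print(req.get_header('User-agent'))
```

Calling read_robots(robots) would then replace the plain urlopen(robots) call from the step above.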

Step 3: Extract sitemap link from any URL

In this last section we are going to define a function which takes a URL and tries to extract the sitemap links based on the robots.txt file.

The code reuses the steps above plus one additional step: building the robots.txt address from the URL's scheme and domain with urlparse:

from urllib.parse import urlparse
from urllib.request import urlopen
import re

test_url = "https://blog.softhints.com"

def get_robots(test_url):
    # Build the robots.txt URL from the scheme and domain of the input URL
    parsed = urlparse(test_url)
    robots = f'{parsed.scheme}://{parsed.netloc}/robots.txt'

    sitemap_ls = []

    # Download the file once and collect every sitemap link
    with urlopen(robots) as stream:
        for line in stream.read().decode("utf-8").splitlines():
            if 'sitemap' in line.lower():
                matches = re.findall(r' (https.*xml)', line)
                if matches:
                    sitemap_ls.append(matches[0])

    # Remove duplicates before returning
    return list(set(sitemap_ls))

get_robots(test_url)

If the code works properly you will get the sitemap or sitemaps of the site as a list:

['https://blog.xxx.com/sitemap.xml']
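
As an alternative to the regular expression, Python's standard library includes urllib.robotparser. The sketch below parses robots.txt text that is assumed to be already downloaded (the sample content and the example.com URLs are made up for illustration); site_maps() requires Python 3.8 or newer:

```python
from urllib.robotparser import RobotFileParser

# Sample robots.txt content (made up for illustration)
sample = """User-agent: *
Disallow: /private/
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-posts.xml
"""

parser = RobotFileParser()
parser.parse(sample.splitlines())

# Sitemap entries listed in the file, or None if there are none
sitemaps = parser.site_maps()
print(sitemaps)

# Check whether a given user agent may fetch a URL
blocked = parser.can_fetch('*', 'https://example.com/private/page')
allowed = parser.can_fetch('*', 'https://example.com/public/page')
print(blocked, allowed)
```

The parser also handles the Allow and Disallow rules, which the regular expression approach ignores.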
