How to read and test robots.txt with Python

In this quick tutorial, we'll cover how to test, read and extract information from robots.txt in Python. We are going to use two libraries: urllib.request and requests.

Step 1: Test if robots.txt exists

First we will test whether robots.txt exists. To do so we are going to use the requests library. We will request the robots.txt page and return the status code of the response:

import requests

def status_code(url):
    r = requests.get(url)
    return r.status_code

print(status_code('https://softhints.com/robots.txt'))

result:

200

A status code of 200 means that robots.txt exists for this site: https://softhints.com/.
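Other status codes are possible as well. As a rough sketch, the helper below maps the value returned by status_code to a short label. The name describe_status and the labels are hypothetical, chosen only for illustration:

```python
def describe_status(code):
    """Map an HTTP status code for /robots.txt to a short label."""
    if 200 <= code < 300:
        return "found"      # the file was served
    if code in (401, 403):
        return "blocked"    # the server refused the request
    if code == 404:
        return "missing"    # the site has no robots.txt
    return "unknown"        # anything else needs a closer look


print(describe_status(200))
print(describe_status(404))
```

It could be combined with the function above as describe_status(status_code('https://softhints.com/robots.txt')).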

Step 2: Read robots.txt with Python

Now let's say that we would like to extract particular information from the robots.txt file - i.e. the sitemap link.

To read and parse robots.txt with Python we will use urllib.request, plus re to pick out the sitemap URL.

So the code for reading and parsing the robots.txt file will look like:

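from urllib.request import urlopen
import re

robots = 'https://softhints.com/robots.txt'

sitemap_ls = []

# download robots.txt once and scan it line by line
with urlopen(robots) as stream:
    for line in stream.read().decode("utf-8").split('\n'):
        # the Sitemap directive may be written in any letter case
        if 'sitemap' in line.lower():
            # findall returns an empty list when the line holds no matching URL
            sitemap_ls.extend(re.findall(r' (https.*xml)', line))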

If the code works properly, sitemap_ls will contain the sitemap link. If the site is protected against bots, you will get a 403 error instead:

HTTPError: HTTP Error 403: Forbidden

Depending on the protection you might need to use different techniques to bypass it.
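Whatever the protection, it helps if the script does not crash on such a response. A minimal sketch, assuming we simply want None when the request fails. The helper name fetch_robots and its opener parameter are hypothetical; the parameter is there only so the function can be tried without network access:

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def fetch_robots(robots_url, opener=urlopen):
    """Return the robots.txt content as text, or None if the request fails."""
    try:
        with opener(robots_url) as stream:
            return stream.read().decode("utf-8", errors="replace")
    except HTTPError as err:
        # the server answered, but with an error status such as 403 or 404
        print(f"{robots_url}: HTTP {err.code} {err.reason}")
    except URLError as err:
        # the server could not be reached at all (DNS failure, refused connection)
        print(f"{robots_url}: {err.reason}")
    return None
```

HTTPError is a subclass of URLError, so it has to be caught first.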

Step 3: Extract sitemap link from any URL

In this last section we are going to define a function which takes a URL and tries to extract the sitemap.xml link from the site's robots.txt file.
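The URL handling for such a function can be tried on its own first. A minimal sketch, assuming urlparse from urllib.parse; the helper name robots_url is hypothetical and only used for illustration:

```python
from urllib.parse import urlparse


def robots_url(page_url):
    """Return the robots.txt address for the site that serves page_url."""
    parts = urlparse(page_url)
    # without a scheme, urlparse puts the host into .path and leaves .netloc empty
    if not parts.scheme or not parts.netloc:
        raise ValueError(f"expected an absolute URL, got {page_url!r}")
    return f"{parts.scheme}://{parts.netloc}/robots.txt"


print(robots_url("https://blog.softhints.com/some/post?page=2"))
# prints https://blog.softhints.com/robots.txt
```

The path and query string of the page are dropped, because robots.txt always lives at the root of the host.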

The code reuses the steps above plus one additional step: converting the URL to a domain with urlparse:

from urllib.request import urlopen
from urllib.parse import urlparse
import re

test_url = "https://blog.softhints.com"

def get_robots(test_url):
    # build the robots.txt address from the scheme and domain of the URL
    domain = urlparse(test_url).netloc
    scheme = urlparse(test_url).scheme
    robots = f'{scheme}://{domain}/robots.txt'

    sitemap_ls = []

    # download robots.txt once and scan it line by line
    with urlopen(robots) as stream:
        for line in stream.read().decode("utf-8").split('\n'):
            if 'sitemap' in line.lower():
                # findall returns an empty list when the line holds no matching URL
                sitemap_ls.extend(re.findall(r' (https.*xml)', line))
    # drop duplicate links
    return list(set(sitemap_ls))

get_robots(test_url)

If the code works properly you will get the sitemap or sitemaps of the site as a list:

['https://blog.xxx.com/sitemap.xml']
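The standard library also ships urllib.robotparser, which can read the file without a hand-written regex. A minimal sketch of that alternative, using made-up robots.txt content for example.com instead of a real request; site_maps() needs Python 3.8 or newer:

```python
from urllib.robotparser import RobotFileParser

# made-up robots.txt content, used here instead of a network request
sample = """\
User-agent: *
Disallow: /private/
Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(sample.splitlines())

# list of Sitemap URLs, or None when the file declares none
sitemaps = parser.site_maps()

# check whether a given user agent may fetch a URL
allowed = parser.can_fetch("*", "https://example.com/blog/")
blocked = parser.can_fetch("*", "https://example.com/private/page")

print(sitemaps, allowed, blocked)
```

For a live site, parser.set_url(...) followed by parser.read() downloads the file instead of calling parse().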
