How to check if URLs in XML sitemaps are blocked by robots.txt

robots.txt
XML sitemaps
SEO
advertools
A Python script using advertools to bulk-check whether the URLs in your XML sitemap are blocked by rules in your robots.txt file. The check runs for every combination of user-agent and URL.

In this guide we will build a script where you provide the URL of a robots.txt file; the script extracts the sitemap(s) listed in it and runs a robots.txt test for every URL and user-agent. Feel free to modify the defaults.

Prerequisites

Although you need Python and advertools, uv handles both for you, so the only prerequisite you need to take care of yourself is installing uv.

For Linux/macOS:

curl -LsSf https://astral.sh/uv/install.sh | sh

For Windows:

powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"
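
You can verify the installation by checking the version:

uv --version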

Run the imports

from urllib.parse import urlsplit

import advertools as adv
import pandas as pd

Define the variable robots_url with any robots.txt file you want

robots_url = "https://www.google.com/robots.txt"

Convert the robots file to a DataFrame

robots_df = adv.robotstxt_to_df(robots_url)
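
The resulting DataFrame contains the parsed rules, with a directive column (User-agent, Disallow, Allow, Sitemap, and so on) and a content column holding each rule's value; both are used in the filtering step below. As an optional inspection step, you can take a quick look at the rules and count the directive types:

# Optional: a quick look at the parsed rules
robots_df.head()

# How many rules of each type does the file contain?
robots_df['directive'].value_counts()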

Fetch the sitemap(s) from the robots file

sitemap_df = adv.sitemap_to_df(robots_url)
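
Since the sitemap(s) can be large, a quick sanity check before testing can be useful. This is optional, and assumes the standard loc column is present (it is used again below when running the test):

# Optional: how many URLs were collected, and are any of them duplicated?
print(sitemap_df.shape)
print(sitemap_df['loc'].duplicated().sum())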

Extract a list of available user-agents

This filters the robots.txt DataFrame for User-agent directives and extracts their values.

user_agents = robots_df[robots_df['directive'].str.contains("User-agent", case=False)]['content']
user_agents
0                         *
312           AdsBot-Google
319              Twitterbot
325     facebookexternalhit
332            anthropic-ai
333       Applebot-Extended
334              Bytespider
335                   CCBot
336            ChatGPT-User
337               ClaudeBot
338               cohere-ai
339                 Diffbot
340             FacebookBot
341                  GPTBot
342            ImagesiftBot
343      Meta-ExternalAgent
344    Meta-ExternalFetcher
345               Omgilibot
346           PerplexityBot
347                Timpibot
Name: content, dtype: object
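
The same user-agent can appear more than once in a robots.txt file, in which case the Series above would contain duplicates and the test below would simply repeat the same checks. An optional deduplication step keeps the report smaller:

# Optional: remove duplicate user-agents before running the test
user_agents = user_agents.drop_duplicates()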

Run the robotstxt_test function

robots_report = adv.robotstxt_test(
    robotstxt_url=robots_url,
    user_agents=user_agents,
    urls=sitemap_df['loc'].dropna())
robots_report[~robots_report['can_fetch']]

Sample output

Empty DataFrame
Columns: [robotstxt_url, user_agent, url_path, can_fetch]
Index: []

In this case we got an empty DataFrame, which means no sitemap URL is blocked for any of the user-agents. Otherwise we would have received the rows where can_fetch is False.
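
When the DataFrame is not empty, a short summary is often easier to read than the raw rows. Here is a minimal sketch, using the columns shown in the output above, that counts how many URL paths each user-agent is blocked from:

blocked = robots_report[~robots_report['can_fetch']]

# Number of blocked URL paths per user-agent, most blocked first
blocked.groupby('user_agent')['url_path'].count().sort_values(ascending=False)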

Putting it all together

This is a self-contained script that declares the required Python version as well as its dependencies. When you run it, uv takes care of the environment, installs the dependencies, and runs the script.

adv_robots_blocking_sitemap_urls.py
# /// script
# requires-python = ">=3.13"
# dependencies = [
#     "advertools",
# ]
# ///

import sys
from urllib.parse import urlsplit

import advertools as adv
import pandas as pd


if __name__ == "__main__":
    # The robots.txt URL is supplied as the first command-line argument
    robots_url = sys.argv[1]
    robots_df = adv.robotstxt_to_df(robots_url)
    sitemap = adv.sitemap_to_df(robots_url)
    # Extract the user-agents declared in the robots.txt file
    user_agents = robots_df[robots_df['directive'].str.contains("User-agent", case=False)]['content']
    # Test every user-agent against every sitemap URL and print the blocked combinations
    robots_report = adv.robotstxt_test(
        robotstxt_url=robots_url,
        user_agents=user_agents,
        urls=sitemap['loc'].dropna())
    print(robots_report[~robots_report['can_fetch']])

Having saved the above file, you can now run it with uv, passing the robots.txt URL as an argument:

uv run adv_robots_blocking_sitemap_urls.py https://www.google.com/robots.txt
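
If the robots.txt file declares many user-agents and the sitemap contains many URLs, the printed report can get unwieldy. As an optional tweak, you could write the blocked rows to a CSV file instead of printing them (the blocked_urls.csv name here is just an example):

# Replace the print() call at the end of the script with:
blocked = robots_report[~robots_report['can_fetch']]
blocked.to_csv("blocked_urls.csv", index=False)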