How to check if URLs in XML sitemaps are blocked by robots.txt
In this guide we will build a script to which you provide the URL of a robots.txt file; it will extract the sitemap(s) listed in it and run a robots.txt test for all URLs and user-agents. Feel free to modify the defaults.
Prerequisites
Although you need Python and advertools installed, this will all be handled by uv, so the main prerequisite that you need to take care of is installing uv.
For Linux/MacOS:
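The standalone installer from the uv documentation can be used (command as of writing; check the uv docs if it has changed):

curl -LsSf https://astral.sh/uv/install.sh | sh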
For Windows:
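Again, the standalone installer from the uv documentation:

powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"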
Run the imports
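Only advertools is needed for the interactive steps; the full script at the end also imports sys to read the command-line argument.

import advertools as adv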
Define the variable robots_url with any robots.txt file you want
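A sketch with a placeholder URL (substitute any robots.txt file you want to test):

robots_url = "https://www.example.com/robots.txt"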
Convert the robots file to a DataFrame
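This uses the robotstxt_to_df function, the same call as in the full script below:

robots_df = adv.robotstxt_to_df(robots_url)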
Fetch the sitemap(s) from the robots file
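Passing the robots.txt URL to sitemap_to_df fetches every sitemap it lists, again the same call as in the full script below:

sitemap = adv.sitemap_to_df(robots_url)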
This will print all non-200 URLs to the console while fetching the sitemaps.

Extract a list of available user-agents
user_agents = robots_df[robots_df['directive'].str.contains("User-agent", case=False)]['content']
user_agents
0 *
312 AdsBot-Google
319 Twitterbot
325 facebookexternalhit
332 anthropic-ai
333 Applebot-Extended
334 Bytespider
335 CCBot
336 ChatGPT-User
337 ClaudeBot
338 cohere-ai
339 Diffbot
340 FacebookBot
341 GPTBot
342 ImagesiftBot
343 Meta-ExternalAgent
344 Meta-ExternalFetcher
345 Omgilibot
346 PerplexityBot
347 Timpibot
Name: content, dtype: object
Run the robotstxt_test function
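This is the same call as in the full script below: it tests every sitemap URL against every user-agent found above, and we then keep only the rows where can_fetch is False:

robots_report = adv.robotstxt_test(
    robotstxt_url=robots_url,
    user_agents=user_agents,
    urls=sitemap['loc'].dropna())
robots_report[~robots_report['can_fetch']]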
Sample output
In this case we got an empty DataFrame. Otherwise we would have received a set of rows where can_fetch is False.
Putting it all together
This is a self-contained script that declares the required Python version as well as the dependencies. When you run it, uv will take care of setting up the environment, installing the dependencies, and running the script.
adv_robots_blocking_sitemap_urls.py
# /// script
# requires-python = ">=3.13"
# dependencies = [
# "advertools",
# ]
# ///
import sys
from urllib.parse import urlsplit

import advertools as adv
import pandas as pd

if __name__ == "__main__":
    # The robots.txt URL is taken from the first command-line argument
    robots_url = sys.argv[1]
    # Parse the robots.txt file into a DataFrame
    robots_df = adv.robotstxt_to_df(robots_url)
    # Fetch all sitemaps listed in the robots.txt file
    sitemap = adv.sitemap_to_df(robots_url)
    # Extract the user-agents declared in the robots.txt file
    user_agents = robots_df[robots_df['directive'].str.contains("User-agent", case=False)]['content']
    # Test every sitemap URL against every user-agent
    robots_report = adv.robotstxt_test(
        robotstxt_url=robots_url,
        user_agents=user_agents,
        urls=sitemap['loc'].dropna())
    # Print only the rows where fetching is not allowed
    print(robots_report[~robots_report['can_fetch']])
Having saved the above file, you can now simply run it using uv as follows (the robots.txt URL is just an example; substitute your own):
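uv run adv_robots_blocking_sitemap_urls.py https://www.example.com/robots.txt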