How to check if URLs in XML sitemaps are blocked by robots.txt
In this guide we will build a script: you give it the URL of a robots.txt file, it extracts the sitemap(s) listed in it, and then runs a robots.txt test for every sitemap URL against every user-agent declared in the file. Feel free to modify the defaults.
Prerequisites
Although you need Python and advertools installed, uv will handle both for you, so the only prerequisite you need to take care of yourself is installing uv.
For Linux/MacOS:
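At the time of writing, the installer one-liner given in the uv documentation is:

curl -LsSf https://astral.sh/uv/install.sh | sh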
For Windows:
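Again taken from the uv documentation at the time of writing:

powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"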
Run the imports
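Only advertools is needed for the interactive walkthrough; this is the same import the full script below uses:

import advertools as adv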
Define the variable robots_url with any robots.txt file you want
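The URL below is just a placeholder; point it at whichever robots.txt file you want to test:

robots_url = "https://www.example.com/robots.txt"  # placeholder, replace with your own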
Convert the robots file to a DataFrame
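adv.robotstxt_to_df fetches the file and returns one row per rule, with the directive and content columns we will use below:

robots_df = adv.robotstxt_to_df(robots_url)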
Fetch the sitemap(s) from the robots file
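Passing the robots.txt URL to adv.sitemap_to_df makes it retrieve every sitemap listed in the file and combine them into a single DataFrame with a loc column of URLs:

sitemap_df = adv.sitemap_to_df(robots_url)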
Extract a list of available user-agents
Filter the robots DataFrame for User-agent directives and take their content values:
user_agents = robots_df[robots_df['directive'].str.contains("User-agent", case=False)]['content']
user_agents
0                         *
312           AdsBot-Google
319              Twitterbot
325     facebookexternalhit
332            anthropic-ai
333       Applebot-Extended
334              Bytespider
335                   CCBot
336            ChatGPT-User
337               ClaudeBot
338               cohere-ai
339                 Diffbot
340             FacebookBot
341                  GPTBot
342            ImagesiftBot
343      Meta-ExternalAgent
344    Meta-ExternalFetcher
345               Omgilibot
346           PerplexityBot
347                Timpibot
Name: content, dtype: object

Run the robotstxt_test function
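This mirrors the call in the full script below: test every sitemap URL against every user-agent found above, then keep only the combinations that are not fetchable:

robots_report = adv.robotstxt_test(
    robotstxt_url=robots_url,
    user_agents=user_agents,
    urls=sitemap_df["loc"].dropna().drop_duplicates(),
)
# Rows where can_fetch is False are the blocked URL/user-agent combinations
blocked = robots_report[~robots_report["can_fetch"]]
blocked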
Sample output
In this case we got an empty DataFrame, meaning no sitemap URL is blocked for any user-agent. Otherwise we would have received the rows where can_fetch is False, one for each blocked URL/user-agent combination.
Putting it all together
This is a self-contained script that declares the required Python version as well as the dependencies. When you run it, uv takes care of the environment, the dependencies, and running the script, which prints any blocked URLs to the console.
adv_robots_blocking_sitemap_urls.py
# /// script
# requires-python = ">=3.13"
# dependencies = [
#     "advertools",
# ]
# ///
import argparse

import advertools as adv


def main():
    parser = argparse.ArgumentParser(
        description="Check if any robots.txt rules block any URLs in your XML sitemaps."
    )
    parser.add_argument("robots_url", help="The URL of the robots.txt file to analyze.")
    parser.add_argument(
        "-o",
        "--output-file",
        help="Optional output file to save the report.",
        default=None,
    )
    args = parser.parse_args()
    robots_url = args.robots_url
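    # Parse the robots.txt file and fetch all sitemaps listed in it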
    robots_df = adv.robotstxt_to_df(robots_url)
    sitemap_df = adv.sitemap_to_df(robots_url)
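    # Collect every user-agent declared in the robots.txt file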
    user_agents = robots_df[
        robots_df["directive"].str.contains("User-agent", case=False)
    ]["content"]
    robots_report = adv.robotstxt_test(
        robotstxt_url=robots_url,
        user_agents=user_agents,
        urls=sitemap_df["loc"].dropna().drop_duplicates(),
    )
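    # Keep only URL/user-agent combinations that are blocked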
    disallowed = robots_report[~robots_report["can_fetch"]]
    print(disallowed)
    if args.output_file:
        disallowed.to_csv(args.output_file, index=False)
        print(f"Report saved to {args.output_file}")

if __name__ == "__main__":
    main()

Having saved the above file, you can now simply run it using uv as follows:
uv run adv_robots_blocking_sitemap_urls.py https://example.com/robots.txt --output-file robots_report.csv

The -o / --output-file argument is optional; without it, the report of blocked URLs is only printed to the console.