Getting status codes of URLs in XML sitemaps
In this guide we will build a script where you provide the URL of a robots.txt file, and it will extract the sitemap(s) in it, run a status code check on all the URLs, and print out all URLs with a status code that is not 200. Feel free to modify the defaults.
Prerequisites
Although you need Python and advertools installed, this will all be handled by uv, so the main prerequisite that you need to take care of is installing uv.
For Linux/MacOS:
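# The official standalone installer, at the time of writing:
curl -LsSf https://astral.sh/uv/install.sh | sh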
For Windows:
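# The official PowerShell installer, at the time of writing:
powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"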
Run the imports
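import sys
from urllib.parse import urlsplit

import advertools as adv
import pandas as pd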
Define the variable robots_url with any robots.txt file you want
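For example (the URL here is just a placeholder), together with a domain-based prefix for the output file names used later:
robots_url = "https://www.example.com/robots.txt"
domain = urlsplit(robots_url).netloc.replace(".", "_")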
Fetch the sitemap(s) from the robots file
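sitemap_df = adv.sitemap_to_df(robots_url)
This reads the robots.txt file, follows the sitemap(s) listed in it, and returns all their URLs in a DataFrame with a loc column.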
Run the crawl_headers function to check status codes and retrieve all available response headers
This will print all non-200 URLs to the console.
adv.crawl_headers(sitemap_df["loc"].dropna(), f"{domain}.jl", custom_settings={"LOG_FILE": f"{domain}.log"})
pd.read_json(f"{domain}.jl", lines=True)[["url", "status"]].query("status != 200")
As a side effect you will have two files:
{domain}.jl: The full status code crawl file, which you can use to analyze further
{domain}.log: The logs of the crawl process, in case you have any issues that you want to audit
Sample output
If you get non-200 URLs, you will get a table like this:
Putting it all together
This is a self-contained script that declares the required Python version as well as the dependencies. When you run it, uv will take care of setting up the environment, installing the dependencies, and running the script.
adv_sitemap_status_codes.py
# /// script
# requires-python = ">=3.13"
# dependencies = [
# "advertools",
# ]
# ///
import sys
from urllib.parse import urlsplit
import advertools as adv
import pandas as pd
if __name__ == "__main__":
    robots_url = sys.argv[1]
    # Use the domain as a prefix for the output files, e.g. "example.com" -> "example_com"
    domain = urlsplit(robots_url).netloc.replace(".", "_")
    try:
        sitemap_df = adv.sitemap_to_df(robots_url)
    except Exception:
        print("There seems to be an issue with the sitemap or it doesn't exist")
        sys.exit()
    # Crawl the sitemap URLs, checking status codes and retrieving response headers
    adv.crawl_headers(
        sitemap_df["loc"].dropna(),
        f"{domain}.jl",
        custom_settings={"LOG_FILE": f"{domain}.log"},
    )
    # Print all URLs whose status code is not 200
    print(
        pd.read_json(f"{domain}.jl", lines=True)[["url", "status"]].query("status != 200")
    )
Having saved the above file, you can now simply run it using uv as follows:
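uv run adv_sitemap_status_codes.py https://www.example.com/robots.txt
The robots.txt URL above is just a placeholder; pass whichever one you want to check. uv reads the inline script metadata, sets up the environment with the dependencies, and runs the script.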