Getting status codes of URLs in XML sitemaps
In this guide we will build a script where you provide the URL of a robots.txt file, and it will extract the sitemap(s) in it, run a status code check on all the URLs, and print out all URLs with a status code that is not 200. Feel free to modify the defaults.
Prerequisites
Although you need Python and advertools installed, this will all be handled by uv, so the main prerequisite that you need to take care of is installing uv.
For Linux/MacOS:
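# The official standalone installer, at the time of writing:
curl -LsSf https://astral.sh/uv/install.sh | sh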
For Windows:
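# The official PowerShell installer, at the time of writing:
powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"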
Run the imports
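import sys
from urllib.parse import urlsplit

import advertools as adv
import pandas as pd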
Define the variable robots_url with any robots.txt file you want
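For example (the URL here is just a placeholder), together with a domain-based prefix for the output file names used later:
robots_url = "https://www.example.com/robots.txt"
domain = urlsplit(robots_url).netloc.replace(".", "_")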
Fetch the sitemap(s) from the robots file
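sitemap_df = adv.sitemap_to_df(robots_url)
This reads the robots.txt file, follows the sitemap(s) listed in it, and returns all their URLs in a DataFrame with a loc column.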
Run the crawl_headers function to check status codes and retrieve all available response headers
This will print all non-200 URLs to the console.
adv.crawl_headers(sitemap_df["loc"].dropna(), f"{domain}.jl", custom_settings={"LOG_FILE": f"{domain}.log"})
pd.read_json(f"{domain}.jl", lines=True)[["url", "status"]].query("status != 200")
As a side effect you will have two files:
{domain}.jl: The full status code crawl file, which you can use to analyze further
{domain}.log: The logs of the crawl process, in case you have any issues that you want to audit
Sample output
If you get non-200 URLs, you will get a table like this:
Putting it all together
This is a self-contained script that declares the required Python version as well as the dependencies. When you run it, uv will take care of setting up the environment, installing the dependencies, and running the script.
adv_sitemap_status_codes.py
# /// script
# requires-python = ">=3.13"
# dependencies = [
# "advertools",
# ]
# ///
import sys
from urllib.parse import urlsplit
import advertools as adv
import pandas as pd
if __name__ == "__main__":
    robots_url = sys.argv[1]
    # Use the domain as a prefix for the output files, e.g. "example.com" -> "example_com"
    domain = urlsplit(robots_url).netloc.replace(".", "_")
    try:
        sitemap_df = adv.sitemap_to_df(robots_url)
    except Exception:
        print("There seems to be an issue with the sitemap or it doesn't exist")
        sys.exit()
    # Crawl the sitemap URLs, checking status codes and retrieving response headers
    adv.crawl_headers(
        sitemap_df["loc"].dropna(),
        f"{domain}.jl",
        custom_settings={"LOG_FILE": f"{domain}.log"},
    )
    # Print all URLs whose status code is not 200
    print(
        pd.read_json(f"{domain}.jl", lines=True)[["url", "status"]].query("status != 200")
    )
Having saved the above file, you can now simply run it using uv as follows:
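uv run adv_sitemap_status_codes.py https://www.example.com/robots.txt
The robots.txt URL above is just a placeholder; pass whichever one you want to check. uv reads the inline script metadata, sets up the environment with the dependencies, and runs the script.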