How to Download, Parse, and Visualize XML Sitemaps

data visualization
analytics
SEO
XML sitemaps
URLs
plotly
A Python script that uses advertools to download and parse a website's sitemap(s), save them to a CSV file, and visualize the site's URL structure and publishing trends

In this guide we will build a script that takes the URL of a robots.txt file, extracts the sitemap(s) listed in it, parses and saves them to a CSV file, and produces two visualizations, each saved to an HTML file. Feel free to modify the defaults.

Prerequisites

Although the script needs Python and advertools, uv will install both automatically, so the only prerequisite you need to take care of yourself is installing uv.

Run the imports

from urllib.parse import urlsplit

import advertools as adv
import adviz
import pandas as pd

Define the variable robots_url with any robots.txt file you want

robots_url = "https://brightonseo.com/robots.txt"
domain = urlsplit(robots_url).netloc.replace(".", "_")
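Note that netloc returns the full host, including any subdomain, so the generated slug stays unique per host. A quick illustration:

```python
from urllib.parse import urlsplit

# netloc is the host part of the URL, including subdomains;
# replacing dots with underscores yields a filename-safe slug
for url in ["https://brightonseo.com/robots.txt",
            "https://www.example.com/robots.txt"]:
    print(urlsplit(url).netloc.replace(".", "_"))
# brightonseo_com
# www_example_com
```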

Fetch the sitemap(s) from the robots file

sitemap = adv.sitemap_to_df(robots_url)
sitemap.head()
2024-12-25 16:27:55,343 | INFO | sitemaps.py:623 | sitemap_to_df | Getting https://brightonseo.com/sitemap.xml
   loc                                               lastmod                    sitemap                              sitemap_size_mb  download_date
0  https://brightonseo.com                           2024-12-16 11:34:32+00:00  https://brightonseo.com/sitemap.xml  0.200686         2024-12-25 16:27:55.358044+00:00
1  https://brightonseo.com/ssas-october-2024         2024-12-10 14:47:25+00:00  https://brightonseo.com/sitemap.xml  0.200686         2024-12-25 16:27:55.358044+00:00
2  https://brightonseo.com/measurefest-october-2025  2024-12-10 15:10:42+00:00  https://brightonseo.com/sitemap.xml  0.200686         2024-12-25 16:27:55.358044+00:00
3  https://brightonseo.com/training                  2024-07-08 15:43:40+00:00  https://brightonseo.com/sitemap.xml  0.200686         2024-12-25 16:27:55.358044+00:00
4  https://brightonseo.com/ssas                      2024-12-10 14:46:46+00:00  https://brightonseo.com/sitemap.xml  0.200686         2024-12-25 16:27:55.358044+00:00
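If the robots.txt file lists several sitemaps (or a sitemap index), sitemap_to_df fetches them all and concatenates the results, with the sitemap column recording the source file for each URL. A minimal sketch of inspecting that split, using a hypothetical mini DataFrame in place of the real result:

```python
import pandas as pd

# Hypothetical stand-in for sitemap_to_df's output (loc + sitemap columns)
sitemap = pd.DataFrame({
    "loc": ["https://example.com/a",
            "https://example.com/b",
            "https://example.com/c"],
    "sitemap": ["https://example.com/sitemap1.xml",
                "https://example.com/sitemap1.xml",
                "https://example.com/sitemap2.xml"],
})

# Count URLs per source sitemap file
print(sitemap["sitemap"].value_counts())
```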

Visualize the website’s URL structure

url_structure = adviz.url_structure(
    sitemap["loc"],
    title=f"{domain.replace('_', '.')} URL Structure",
    domain=domain.replace("_", "."),
    theme="flatly")
url_structure
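Conceptually, a chart like this is built from the directory levels of each URL path; a minimal sketch of that splitting step (not adviz's actual implementation):

```python
from urllib.parse import urlsplit

urls = ["https://brightonseo.com/training",
        "https://brightonseo.com/ssas-october-2024"]

# Split each URL path into its directory segments; these segments are
# what a URL-structure chart groups and counts per level
levels = [urlsplit(u).path.strip("/").split("/") for u in urls]
print(levels)  # [['training'], ['ssas-october-2024']]
```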

Save the chart to an HTML file

url_structure.write_html(f"{domain}_url_structure.html")

Visualize the website’s publishing trends

pub_trend = adviz.ecdf(
    sitemap,
    x="lastmod",
    hover_name="loc",
    title=f"{domain.replace('_', '.')} Publishing Trends")
pub_trend

Save the chart to an HTML file

pub_trend.write_html(f"{domain}_pub_trends.html")

Save the DataFrame to a CSV file

sitemap.to_csv(f"{domain}_sitemap.csv", index=False)
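One thing to watch when you read the CSV back later: the lastmod and download_date columns come back as plain strings unless you ask pandas to parse them. A small round-trip sketch, in memory with a hypothetical one-row frame:

```python
import io

import pandas as pd

# Hypothetical one-row frame mimicking the sitemap DataFrame
df = pd.DataFrame({
    "loc": ["https://example.com/"],
    "lastmod": pd.to_datetime(["2024-12-16 11:34:32+00:00"]),
})

# Write to an in-memory buffer, then read back with date parsing enabled
buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)
restored = pd.read_csv(buf, parse_dates=["lastmod"])

print(restored["lastmod"].dtype)  # a datetime64 dtype, not object/string
```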

Putting it all together

The full steps are combined into one self-contained script, ready to run.

adv_sitemaps.py
# /// script
# requires-python = ">=3.12"
# dependencies = [
#     "adviz",
# ]
# ///

import sys
from urllib.parse import urlsplit

import advertools as adv
import adviz


if __name__ == "__main__":
    robots_url = sys.argv[1]
    domain = urlsplit(robots_url).netloc.replace(".", "_")
    sitemap = adv.sitemap_to_df(robots_url)
    url_structure = adviz.url_structure(
        sitemap["loc"],
        title=f"{domain.replace('_', '.')} URL Structure",
        height=None)
    url_structure.write_html(f"{domain}_url_structure.html")
    pub_trend = adviz.ecdf(
        sitemap,
        x="lastmod",
        hover_name="loc",
        title=f"{domain.replace('_', '.')} Publishing Trends")
    pub_trend.write_html(f"{domain}_pub_trends.html")
    sitemap.to_csv(f"{domain}_sitemap.csv", index=False)

Having saved the above file, you can now simply run it using uv as follows:

uv run adv_sitemaps.py https://example.com/robots.txt

Just replace the example URL with the robots.txt URL of interest.