How to Download, Parse, and Visualize XML Sitemaps

data visualization
analytics
SEO
XML sitemaps
URLs
plotly
A Python script that uses advertools to download and parse a website's sitemap(s), save them to a CSV file, and visualize the site's URL structure and publishing trends

In this guide we will build a script that takes the URL of a robots.txt file, extracts the sitemap(s) listed in it, parses and saves them to a CSV file, and produces two visualizations, each saved to an HTML file. Feel free to modify the defaults.

Prerequisites

Although the script needs Python and advertools, uv will install both automatically, so the only prerequisite you need to take care of yourself is installing uv.

Run the imports

from urllib.parse import urlsplit

import advertools as adv
import adviz
import pandas as pd

Define the variable robots_url with any robots.txt file you want

robots_url = "https://brightonseo.com/robots.txt"
domain = urlsplit(robots_url).netloc.replace(".", "_")
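Note that netloc returns the full host, including any subdomain, so the generated slug stays unique per host. A quick illustration:

```python
from urllib.parse import urlsplit

# netloc is the host part of the URL, including subdomains;
# replacing dots with underscores yields a filename-safe slug
for url in ["https://brightonseo.com/robots.txt",
            "https://www.example.com/robots.txt"]:
    print(urlsplit(url).netloc.replace(".", "_"))
# brightonseo_com
# www_example_com
```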

Fetch the sitemap(s) from the robots file

sitemap = adv.sitemap_to_df(robots_url)
sitemap.head()
2024-12-25 16:27:55,343 | INFO | sitemaps.py:623 | sitemap_to_df | Getting https://brightonseo.com/sitemap.xml
   loc                                               lastmod                    sitemap                              sitemap_size_mb  download_date
0  https://brightonseo.com                           2024-12-16 11:34:32+00:00  https://brightonseo.com/sitemap.xml  0.200686         2024-12-25 16:27:55.358044+00:00
1  https://brightonseo.com/ssas-october-2024         2024-12-10 14:47:25+00:00  https://brightonseo.com/sitemap.xml  0.200686         2024-12-25 16:27:55.358044+00:00
2  https://brightonseo.com/measurefest-october-2025  2024-12-10 15:10:42+00:00  https://brightonseo.com/sitemap.xml  0.200686         2024-12-25 16:27:55.358044+00:00
3  https://brightonseo.com/training                  2024-07-08 15:43:40+00:00  https://brightonseo.com/sitemap.xml  0.200686         2024-12-25 16:27:55.358044+00:00
4  https://brightonseo.com/ssas                      2024-12-10 14:46:46+00:00  https://brightonseo.com/sitemap.xml  0.200686         2024-12-25 16:27:55.358044+00:00
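If the robots.txt file lists several sitemaps (or a sitemap index), sitemap_to_df fetches them all and concatenates the results, with the sitemap column recording the source file for each URL. A minimal sketch of inspecting that split, using a hypothetical mini DataFrame in place of the real result:

```python
import pandas as pd

# Hypothetical stand-in for sitemap_to_df's output (loc + sitemap columns)
sitemap = pd.DataFrame({
    "loc": ["https://example.com/a",
            "https://example.com/b",
            "https://example.com/c"],
    "sitemap": ["https://example.com/sitemap1.xml",
                "https://example.com/sitemap1.xml",
                "https://example.com/sitemap2.xml"],
})

# Count URLs per source sitemap file
print(sitemap["sitemap"].value_counts())
```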

Visualize the website’s URL structure

url_structure = adviz.url_structure(
    sitemap["loc"],
    title=f"{domain.replace('_', '.')} URL Structure",
    domain=domain.replace("_", "."),
    theme="flatly")
url_structure
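Conceptually, a chart like this is built from the directory levels of each URL path; a minimal sketch of that splitting step (not adviz's actual implementation):

```python
from urllib.parse import urlsplit

urls = ["https://brightonseo.com/training",
        "https://brightonseo.com/ssas-october-2024"]

# Split each URL path into its directory segments; these segments are
# what a URL-structure chart groups and counts per level
levels = [urlsplit(u).path.strip("/").split("/") for u in urls]
print(levels)  # [['training'], ['ssas-october-2024']]
```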

Save the chart to an HTML file

url_structure.write_html(f"{domain}_url_structure.html")

Visualize the website’s publishing trends

pub_trend = adviz.ecdf(
    sitemap,
    x="lastmod",
    hover_name="loc",
    title=f"{domain.replace('_', '.')} Publishing Trends")
pub_trend

Save the chart to an HTML file

pub_trend.write_html(f"{domain}_pub_trends.html")

Save the DataFrame to a CSV file

sitemap.to_csv(f"{domain}_sitemap.csv", index=False)
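One thing to watch when you read the CSV back later: the lastmod and download_date columns come back as plain strings unless you ask pandas to parse them. A small round-trip sketch, in memory with a hypothetical one-row frame:

```python
import io

import pandas as pd

# Hypothetical one-row frame mimicking the sitemap DataFrame
df = pd.DataFrame({
    "loc": ["https://example.com/"],
    "lastmod": pd.to_datetime(["2024-12-16 11:34:32+00:00"]),
})

# Write to an in-memory buffer, then read back with date parsing enabled
buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)
restored = pd.read_csv(buf, parse_dates=["lastmod"])

print(restored["lastmod"].dtype)  # a datetime64 dtype, not object/string
```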

Putting it all together

The full steps are combined into one self-contained script, ready to run.

adv_sitemaps.py
# /// script
# requires-python = ">=3.12"
# dependencies = [
#     "adviz",
# ]
# ///

import sys
from urllib.parse import urlsplit

import advertools as adv
import adviz


if __name__ == "__main__":
    robots_url = sys.argv[1]
    domain = urlsplit(robots_url).netloc.replace(".", "_")
    sitemap = adv.sitemap_to_df(robots_url)
    url_structure = adviz.url_structure(
        sitemap["loc"],
        title=f"{domain.replace('_', '.')} URL Structure",
        height=None)
    url_structure.write_html(f"{domain}_url_structure.html")
    pub_trend = adviz.ecdf(
        sitemap,
        x="lastmod",
        hover_name="loc",
        title=f"{domain.replace('_', '.')} Publishing Trends")
    pub_trend.write_html(f"{domain}_pub_trends.html")
    sitemap.to_csv(f"{domain}_sitemap.csv", index=False)

Having saved the above file, you can now simply run it using uv as follows:

uv run adv_sitemaps.py https://example.com/robots.txt

Just replace the example URL with the robots.txt URL of interest.