How to Download, Parse, and Visualize XML Sitemaps
Tags: data visualization, analytics, SEO, XML sitemaps, URLs, plotly
A Python script that uses advertools to download and parse a website's XML sitemap(s) and save them to a CSV file, and that visualizes the site's URL structure and publishing trends.
In this guide we will build a script that takes the URL of a robots.txt file, extracts the sitemap(s) listed in it, parses them and saves the result to a CSV file, and produces two visualizations, both saved as HTML files. Feel free to modify the defaults.
Prerequisites
Although you need Python and advertools installed, `uv` takes care of all of that for you, so the only prerequisite you need to handle yourself is installing `uv`.
Run the imports
Define the variable `robots_url` with any robots.txt file you want.
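For example, the robots.txt file that produced the sample output below (any robots.txt URL works). The `domain` variable is a filename-safe prefix derived from the URL, reused later when naming the output files:

```python
from urllib.parse import urlsplit

robots_url = "https://brightonseo.com/robots.txt"

# Filename-safe prefix for the output files, e.g. "brightonseo_com"
domain = urlsplit(robots_url).netloc.replace(".", "_")
```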
Fetch the sitemap(s) from the robots file
2024-12-25 16:27:55,343 | INFO | sitemaps.py:623 | sitemap_to_df | Getting https://brightonseo.com/sitemap.xml
| | loc | lastmod | sitemap | sitemap_size_mb | download_date |
|---|---|---|---|---|---|
| 0 | https://brightonseo.com | 2024-12-16 11:34:32+00:00 | https://brightonseo.com/sitemap.xml | 0.200686 | 2024-12-25 16:27:55.358044+00:00 |
| 1 | https://brightonseo.com/ssas-october-2024 | 2024-12-10 14:47:25+00:00 | https://brightonseo.com/sitemap.xml | 0.200686 | 2024-12-25 16:27:55.358044+00:00 |
| 2 | https://brightonseo.com/measurefest-october-2025 | 2024-12-10 15:10:42+00:00 | https://brightonseo.com/sitemap.xml | 0.200686 | 2024-12-25 16:27:55.358044+00:00 |
| 3 | https://brightonseo.com/training | 2024-07-08 15:43:40+00:00 | https://brightonseo.com/sitemap.xml | 0.200686 | 2024-12-25 16:27:55.358044+00:00 |
| 4 | https://brightonseo.com/ssas | 2024-12-10 14:46:46+00:00 | https://brightonseo.com/sitemap.xml | 0.200686 | 2024-12-25 16:27:55.358044+00:00 |
Visualize the website’s URL structure
Save the chart to an HTML file
Visualize publishing trends
Save the chart to an HTML file
Save the DataFrame to a CSV file
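Finally, persist the parsed sitemap so you don't need to re-fetch it later. A minimal sketch with a stand-in DataFrame (the real script writes the full `sitemap` DataFrame):

```python
import pandas as pd

# Stand-in for the fetched sitemap DataFrame.
sitemap = pd.DataFrame({"loc": ["https://brightonseo.com"]})

# index=False keeps the numeric row index out of the CSV.
sitemap.to_csv("brightonseo_com_sitemap.csv", index=False)
```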
Putting it all together
The full steps are combined into one self-contained script, ready to run.
adv_sitemaps.py

```python
# /// script
# requires-python = ">=3.12"
# dependencies = [
#     "adviz",
# ]
# ///
import sys
from urllib.parse import urlsplit

import advertools as adv
import adviz

if __name__ == "__main__":
    robots_url = sys.argv[1]
    domain = urlsplit(robots_url).netloc.replace(".", "_")
    sitemap = adv.sitemap_to_df(robots_url)
    url_structure = adviz.url_structure(
        sitemap["loc"],
        title=f"{domain.replace('_', '.')} URL Structure",
        height=None,
    )
    url_structure.write_html(f"{domain}_url_structure.html")
    pub_trend = adviz.ecdf(
        sitemap,
        x="lastmod",
        hover_name="loc",
        title=f"{domain.replace('_', '.')} Publishing Trends",
    )
    pub_trend.write_html(f"{domain}_pub_trends.html")
    sitemap.to_csv(f"{domain}_sitemap.csv", index=False)
```
Having saved the above file, you can now simply run it with `uv`:

`uv run adv_sitemaps.py <robots_url>`

Just replace `<robots_url>` with the robots.txt URL of interest. `uv` reads the inline script metadata, sets up an environment with the declared dependencies, and runs the script.