Python for SEO

The ultimate guide to getting started with Python for SEO: get to know advertools, how to use it, the best tools, why this should not be your focus, the difference (and overlap) with Data Science, and anything else you’d like to see in this guide.

When I started learning Data Science, I quickly realized several things, chief among them that there was no Python library built specifically for SEO and digital marketing tasks.

So I decided to build one.

By now advertools has been downloaded millions of times, featured in many articles on several leading websites, and has become one of the leading Python tools that many SEOs use.

Data Science

One of the biggest developments for SEO and digital marketing professionals in general is the shift in how we work with data, enabled over the last two decades.

With all the developments in hardware and software, as well as the data libraries built on top of them, it is now possible for someone who is not a software engineer to use code for daily tasks, most importantly data tasks.

Yes, but AI!?

We can consider AI part of the whole data ecosystem, because it is based on a set of machine learning and deep learning technologies. Your data skills are even more crucial now, for several reasons:

  • Running prompts in bulk is what is going to differentiate you from people who run prompts one by one.
  • Incorporating AI in your data workflows will take them to the next level (versus running them manually).

There is a big difference between basic programming (which will be our focus) and software development/engineering, which is generally not accessible to hobbyists.

Why is Python one of the best languages for SEO (data work)?

  • It is the most widely used language for data tasks of all kinds: machine learning, deep learning, AI, and so on.
  • It has a clean and consistent syntax and logic that is also accessible for beginners.
  • It is the language used for data analysis by the major LLMs like ChatGPT and Google’s Gemini. When you use their data analysis features, your prompts are translated to Python code and either executed there or returned for you to run on your machine.
  • As a general-purpose programming language you can use it easily to extend your data pipelines for many other projects like building websites, generative AI, and whatever else you want to build.

So, if you can’t wait to get started…

How to get started with Python from scratch, and crawl a website in eight minutes

In the short video below, you’ll see how easy it is to start using the advertools crawler and crawl with one line of code. Then you can read the resulting file as a table (DataFrame) and start your audit/analysis.

import advertools as adv
import pandas as pd

# Crawl the site and save the results to a jsonlines file
adv.crawl("https://example.com", "output.jsonl")

# Read the crawl output into a DataFrame for auditing/analysis
crawldf = pd.read_json("output.jsonl", lines=True)
crawldf
(The crawl returns a single row; shown here transposed for readability.)

url                                https://example.com
title                              Example Domain
viewport                           width=device-width, initial-scale=1
charset                            utf-8
h1                                 Example Domain
body_text                          Example Domain \n This domain is fo...
size                               1256
download_timeout                   180
download_slot                      example.com
download_latency                   0.103382
depth                              0
status                             200
links_url                          https://www.iana.org/domains/example
links_text                         More information...
links_nofollow                     False
ip_address                         93.184.215.14
crawl_time                         2024-12-20 18:58:48
resp_headers_Content-Length        648
resp_headers_Age                   86402
resp_headers_Cache-Control         max-age=604800
resp_headers_Content-Type          text/html; charset=UTF-8
resp_headers_Date                  Fri, 20 Dec 2024 18:58:48 GMT
resp_headers_Etag                  "3147526947+gzip"
resp_headers_Expires               Fri, 27 Dec 2024 18:58:48 GMT
resp_headers_Last-Modified         Thu, 17 Oct 2019 07:18:26 GMT
resp_headers_Server                ECAcc (bsb/27D8)
resp_headers_Vary                  Accept-Encoding
resp_headers_X-Cache               HIT
request_headers_Accept             text/html,application/xhtml+xml,application/xm...
request_headers_Accept-Language    en
request_headers_User-Agent         advertools/0.16.3
request_headers_Accept-Encoding    gzip, deflate

As you saw in the video, most of your work with Python will be running a sequence of functions, each of which performs a certain task. If you can learn a new function in Excel, then with a bit of Python you should be able to easily learn and use a new Python function as well.

How to start moving from Excel to Python

Let’s say you want to learn how to generate an array of numbers in Excel. You start typing =randarray( and you’ll immediately see the parameters of the function (rows, columns, min, max, integer). You can enter those values separated by commas, and if anything is unclear, you can click through and read the documentation.
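For comparison, here is what the same idea might look like in Python. This is a minimal sketch using the numpy library (my choice for this example; it is not an SEO tool), mirroring RANDARRAY's rows, columns, min, max, and integer parameters:

import numpy as np

# 5 rows and 3 columns of random integers from 1 to 100,
# roughly equivalent to =RANDARRAY(5, 3, 1, 100, TRUE)
np.random.randint(low=1, high=101, size=(5, 3))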

If you are comfortable doing that in Excel, then you should find it easy to explore a new function in Python. Just put a question mark next to the function, object, or class and you can see its documentation right in your notebook.

import advertools as adv
adv.crawl?
Signature:
adv.crawl(
    url_list,
    output_file,
    follow_links=False,
    allowed_domains=None,
    exclude_url_params=None,
    include_url_params=None,
    exclude_url_regex=None,
    include_url_regex=None,
    css_selectors=None,
    xpath_selectors=None,
    custom_settings=None,
    meta=None,
)
Docstring:
Crawl a website or a list of URLs based on the supplied options.

Parameters
----------
url_list : url, list
  One or more URLs to crawl. If ``follow_links`` is True, the crawler will start
  with these URLs and follow all links on pages recursively.
output_file : str
  The path to the output of the crawl. Jsonlines only is supported to allow for
  dynamic values. Make sure your file ends with ".jl", e.g. `output_file.jl`.
follow_links : bool
  Defaults to False. Whether or not to follow links on crawled pages.
allowed_domains : list
  A list of the allowed domains to crawl. This ensures that the crawler does not
  attempt to crawl the whole web. If not specified, it defaults to the domains of
  the URLs provided in ``url_list`` and all their sub-domains. You can also specify
  a list of sub-domains, if you want to only crawl those.
exclude_url_params : list, bool
  A list of URL parameters to exclude while following links. If a link contains any
  of those parameters, don't follow it. Setting it to ``True`` will exclude links
  containing any parameter.
include_url_params : list
  A list of URL parameters to include while following links. If a link contains any
  of those parameters, follow it. Having the same parameters to include and exclude
  raises an error.
exclude_url_regex : str
  A regular expression of a URL pattern to exclude while following links. If a link
  matches the regex don't follow it.
include_url_regex : str
  A regular expression of a URL pattern to include while following links. If a link
  matches the regex follow it.
css_selectors : dict
  A dictionary mapping names to CSS selectors. The names will become column headers,
  and the selectors will be used to extract the required data/content.
xpath_selectors : dict
  A dictionary mapping names to XPath selectors. The names will become column
  headers, and the selectors will be used to extract the required data/content.
custom_settings : dict
  A dictionary of optional custom settings that you might want to add to the
  spider's functionality. There are over 170 settings for all kinds of options. For
  details please refer to the `spider settings <https://docs.scrapy.org/en/latest/topics/settings.html>`_
  documentation.
meta : dict
  Additional data to pass to the crawler; add arbitrary metadata, set custom request
  headers per URL, and/or enable some third party plugins.
Examples
--------
Crawl a website and let the crawler discover as many pages as available

>>> import advertools as adv
>>> adv.crawl("http://example.com", "output_file.jl", follow_links=True)
>>> import pandas as pd
>>> crawl_df = pd.read_json("output_file.jl", lines=True)

Crawl a known set of pages (on a single or multiple sites) without
following links (just crawl the specified pages) or "list mode":

>>> adv.crawl(
...     [
...         "http://example.com/product",
...         "http://example.com/product2",
...         "https://anotherexample.com",
...         "https://anotherexample.com/hello",
...     ],
...     "output_file.jl",
...     follow_links=False,
... )

Crawl a website, and in addition to standard SEO elements, also get the
required CSS selectors.
Here we will get three additional columns `price`, `author`, and
`author_url`. Note that you need to specify if you want the text attribute
or the `href` attribute if you are working with links (and all other
selectors).

>>> adv.crawl(
...     "http://example.com",
...     "output_file.jl",
...     css_selectors={
...         "price": ".a-color-price::text",
...         "author": ".contributorNameID::text",
...         "author_url": ".contributorNameID::attr(href)",
...     },
... )

Using the ``meta`` parameter:

**Adding custom meta data** for the crawler using the `meta` parameter for
tracking/context purposes. If you supply {"purpose": "pre-launch test"}, then you
will get a column called "purpose", and all its values will be "pre-launch test" in
the crawl DataFrame.

>>> adv.crawl(
...     "https://example.com",
...     "output_file.jl",
...     meta={"purpose": "pre-launch test"},
... )

Or maybe mention which device(s) you crawled with, which is much easier than reading
the user-agent string:

>>> adv.crawl(
...     "https://example.com",
...     "output.jsonl",
...     custom_settings={
...         "USER_AGENT": "Mozilla/5.0 (iPhone; CPU iPhone OS 14_7_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.2 Mobile/15E148 Safari/604.1"
...     },
...     meta={"device": "Apple iPhone 12 Pro (Safari)"},
... )

Of course you can combine any such metadata however you want:

>>> {"device": "iphone", "purpose": "initial audit", "crawl_country": "us", ...}

**Custom request headers**: Supply custom request headers per URL with the special
key ``custom_headers``. Its value is a dictionary whose keys are URLs, and each
URL's value is a dictionary of that URL's custom request headers.

>>> adv.crawl(
...     URL_LIST,
...     OUTPUT_FILE,
...     meta={
...         "custom_headers": {
...             "URL_A": {"HEADER_1": "VALUE_1", "HEADER_2": "VALUE_1"},
...             "URL_B": {"HEADER_1": "VALUE_2", "HEADER_2": "VALUE_2"},
...             "URL_C": {"HEADER_1": "VALUE_3"},
...         }
...     },
... )

OR:

>>> meta = {
...     "custom_headers": {
...         "https://example.com/A": {"If-None-Match": "Etag A"},
...         "https://example.com/B": {
...             "If-None-Match": "Etag B",
...             "User-Agent": "custom UA",
...         },
...         "https://example.com/C": {
...             "If-None-Match": "Etag C",
...             "If-Modified-Since": "Sat, 17 Oct 2024 16:24:00 GMT",
...         },
...     }
... }

**Long lists of requests headers:** In some cases you might have a very long list
and that might raise an `Argument list too long` error. In this case you can provide
the path of a Python script that contains a dictionary for the headers. Keep in
mind:

- The dictionary has to be named ``custom_headers`` with the same structure mentioned above
- The file has to be a Python script, having the extension ".py"
- The script can generate the dictionary programmatically to make it easier to
  incorporate in various workflows
- The path to the file can be absolute or relative to where the command is
  run from.

  >>> meta = {"custom_headers": "my_custom_headers.py"}

  OR

  >>> meta = {"custom_headers": "/full/path/to/my_custom_headers.py"}

**Use with third party plugins** like scrapy playwright. To enable it, set
``{"playwright": True}`` together with other settings.
File:      ~/uvpy313/lib/python3.13/site-packages/advertools/spider.py
Type:      function

Why “python for seo” is misleading, or why not “python for seo”

“When some field is just getting started and you don’t really understand it very well [Data Science], it’s very easy to confuse the essence of what you’re doing with the tools that you use [Python].”

– Harold Abelson, MIT

And this is exactly how things started. Many people confused the tool for the discipline, and consequently started focusing on “learning Python” instead of the principles and techniques of Data Science. The problem is that you end up on a path of software development (you said you wanted to learn a programming language, didn’t you?) and lose out on both worlds: you become a mediocre developer, and your software is too basic to be useful. Learning Data Science instead, you start boosting your current workflows immediately, and benefit immensely from the little programming you learn.

What are the common misconceptions about Python and SEO?

  • Programming vs software development: Many people think that because you are “coding” you are becoming a software developer. Read the simple explanation of the differences, which hopefully makes the distinction useful for you.
  • Using Python to do SEO vs using Python to build software for SEO: Related to the previous point, once it is clear that you are doing data work and not building software, it becomes easier to get on a smooth track and get productive real quick.
  • Python is for automating tasks (efficiency): While this is absolutely true, and there is really no need to debate it, it misses another important element: insights, analytics, and diagnosis. Automation means doing many things with one command, and its main value is the massive time savings, as well as the minimization of errors. But the insights you gain are a separate benefit that is not about automation: creating an insightful chart or running a machine learning algorithm can uncover great insights and enable you to make important decisions, and these are not about repeating a certain task many times.

Enough theory and let’s talk about the tools that you can use.

What are the best Python tools for SEO?

The list below is for tools that are directly made for SEO. While there are many major tools that could (and should) be used for SEO (like pandas, plotly, scikit-learn, and many others), these are not SEO tools. They are tools for general data processing, visualization, machine learning, etc.

Crawling and scraping

Take a look at the crawling tutorial showing how to use Python’s advertools to crawl websites, and the many options that you can use. These are the crawlers you can use from advertools:

Website crawl audit and analysis template:

Advanced crawling strategies and recipes

  • Crawlytics: a special advertools module with several powerful functions for the typical analysis tasks you run on crawl files; a quick sketch follows below.
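To give a flavor, here is a minimal sketch of what a crawlytics session might look like on the crawl file from earlier. The redirects and links functions and the internal_url_regex parameter are taken from the crawlytics module; check adv.crawlytics? for the exact signatures:

import advertools as adv
import pandas as pd

crawldf = pd.read_json("output.jsonl", lines=True)

# Summarize the redirects found in the crawl (origins, destinations, status codes)
redirects_df = adv.crawlytics.redirects(crawldf)

# One row per link: URL, anchor text, nofollow status, internal/external
links_df = adv.crawlytics.links(crawldf, internal_url_regex=r"example\.com")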

robots.txt

  • Downloading and parsing robots.txt files
  • Bulk robots.txt tester
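Both tasks are one-liners in advertools; a minimal sketch (the domain, user-agents, and URLs here are just placeholders):

import advertools as adv

# Download a robots.txt file and parse it into a DataFrame (one directive per row)
robots_df = adv.robotstxt_to_df("https://www.example.com/robots.txt")

# Test, in bulk, which user-agents may fetch which URLs according to that file
test_df = adv.robotstxt_test(
    robotstxt_url="https://www.example.com/robots.txt",
    user_agents=["Googlebot", "Bingbot", "*"],
    urls=["/", "/category", "/category?color=blue"],
)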

XML sitemaps

  • Downloading, parsing and analyzing XML sitemaps
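The sitemap_to_df function handles the downloading and parsing; the analysis is then regular DataFrame work:

import advertools as adv

# Works with a regular sitemap, a sitemap index (fetched recursively), or a robots.txt URL
sitemap_df = adv.sitemap_to_df("https://www.example.com/sitemap.xml")
sitemap_df.head()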

Auditing structured data

  • Learn how to parse, audit, and analyze JSON-LD, Twitter, and OpenGraph tags on a crawled website.
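The crawler extracts these tags into their own columns, so the audit is mostly a matter of filtering columns with pandas. A minimal sketch, assuming the jsonld_, og:, and twitter: column prefixes that advertools uses in its crawl output:

import pandas as pd

crawldf = pd.read_json("output.jsonl", lines=True)

# Keep only the structured data columns
structured = crawldf.filter(regex="^(jsonld_|og:|twitter:)")

# Count how many crawled pages declare each JSON-LD @type
crawldf.filter(regex=r"jsonld.*@type").stack().value_counts()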

URL analysis

  • Splitting URLs to their components and analyzing them
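This is what url_to_df does: give it a list of URLs and it returns one row per URL, with columns for the scheme, domain, path (split into directories), query parameters, and fragment:

import advertools as adv

urls = [
    "https://www.example.com/category/product?color=blue#reviews",
    "https://www.example.com/blog/python-for-seo",
]

url_df = adv.url_to_df(urls)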

Content and text analysis

  • Ngram analysis (absolute and weighted)
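In advertools this is the word_frequency function; it counts words or phrases (phrase_len), and if you pass a matching list of numbers such as clicks or pageviews, it also weights the counts. A minimal sketch with made-up numbers:

import advertools as adv

titles = [
    "python for seo guide",
    "python seo crawler",
    "log file analysis with python",
]
pageviews = [1000, 3000, 500]

# Absolute counts and pageview-weighted counts of two-word phrases
word_freq_df = adv.word_frequency(
    text_list=titles,
    num_list=pageviews,
    phrase_len=2,
)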

Extracting entities from text

  • Extracting structured elements from a text list (corpus), like hashtags, mentions, links, and so on.
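A few of the extract_ functions in action; each returns a dictionary with the extracted elements plus summary statistics (top items, counts per post, an overview, and so on):

import advertools as adv

posts = [
    "Loving the new #python course by @someuser https://example.com/course",
    "No hashtags here, just a link: https://example.com",
]

hashtag_summary = adv.extract_hashtags(posts)
mention_summary = adv.extract_mentions(posts)
url_summary = adv.extract_urls(posts)

# Inspect the dictionary keys to see the available summaries
hashtag_summary.keys()
hashtag_summary["top_hashtags"]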

Stopwords

  • Default lists of stopwords in forty languages are provided by advertools

import advertools as adv
adv.stopwords["english"]
adv.stopwords["arabic"]
adv.stopwords["portuguese"]

# etc

To get all available languages:

adv.stopwords.keys()

Log file analysis

A full workflow for analyzing log files

  • Converting and compressing log files, using the efficient parquet format
  • Running reverse DNS lookup in bulk to verify IPs
  • Analyzing URLs (request and referrer URLs)
  • Parsing User-agents in the log file
  • Status codes
  • Various interactive visualizations that enable you to ask any question you want
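The workflow starts with logs_to_df, which parses the raw log file and saves it to a parquet file, keeping any lines it couldn't parse in a separate errors file. A minimal sketch (the file names are placeholders; set log_format to match your server, e.g. "common", "combined", or a custom regex):

import advertools as adv
import pandas as pd

adv.logs_to_df(
    log_file="access.log",
    output_file="access_logs.parquet",
    errors_file="log_errors.txt",
    log_format="combined",
)

# From here it's regular DataFrame work: URLs, user-agents, status codes, etc.
logs_df = pd.read_parquet("access_logs.parquet")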

Reverse DNS lookup

  • A bulk tool for running this task, which also provides basic statistics in case you have duplicates in your dataset
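The function is reverse_dns_lookup; it takes a list of IP addresses (duplicates and all, for example the client IPs from your log file) and returns a DataFrame with one row per unique IP, including its hostname and how often it appeared:

import advertools as adv

ip_list = [
    "66.249.66.1",
    "66.249.66.1",
    "8.8.8.8",
]

host_df = adv.reverse_dns_lookup(ip_list)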

User-agent parsing and analysis

  • Recipe for converting a list of user-agent strings to a DataFrame
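One way to build such a DataFrame is with the third-party ua-parser package (an assumption for this sketch; it is not part of advertools), flattening the parsed results with pandas:

import pandas as pd
from ua_parser import user_agent_parser

user_agents = [
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "Mozilla/5.0 (iPhone; CPU iPhone OS 14_7_1 like Mac OS X) AppleWebKit/605.1.15",
]

# Parse each string into browser, OS, and device components, then flatten to a table
parsed = [user_agent_parser.Parse(ua) for ua in user_agents]
ua_df = pd.json_normalize(parsed)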

SERP analysis

  • Querying SERPs in bulk and visualizing them for better insights, and not just for monitoring
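Bulk querying is done with serp_goog, which uses Google's Custom Search API, so you need your own API key and custom search engine ID. Most parameters accept lists, and a request is sent for every combination:

import advertools as adv

serp_df = adv.serp_goog(
    q=["python for seo", "log file analysis", "xml sitemaps"],
    cx="YOUR_CUSTOM_SEARCH_ENGINE_ID",
    key="YOUR_GOOGLE_API_KEY",
    gl=["us", "uk"],  # one request per query/country combination
)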

Google Search Console

  • A Python template for auditing and analyzing GSC data:

Using Python with other SEO tools

As a “glue language”, Python is great for being your central interface to several tasks, and since it is very powerful for analyzing data, you can use it to gain more insights from tools you already use. Once you are a bit more fluent with data tasks, you will start to prefer exporting the data and analyzing it yourself, because you probably have specific questions in mind that the tools can’t answer.