A Python script that takes an advertools crawl file, maps the links on all pages, finds broken internal links, and locates the pages that contain them. It then does the same for external links.
It is assumed that you have already crawled a website and have the crawl file ready. You can run a simple crawl of a website as follows:
Crawling a website
import advertools as adv
import pandas as pd

adv.crawl(
    url_list="https://example.com",
    output_file="output_file.jsonl",
    follow_links=True,
)
crawl_df = pd.read_json("output_file.jsonl", lines=True)
Mapping links on the website
Now that we have read the crawl into a DataFrame, we will use the advertools.crawlytics.links function to create a mapping of all links on the site.
internal: Whether or not it is an internal link. This is determined by the internal_url_regex parameter, and it is up to you how to define it. You might include/exclude sub-domains, social media links to your properties, or even separate domains that you know belong to the same company/brand.
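As a minimal sketch of this step, assuming the crawl_df from above and treating anything matching "example.com" as internal:

import advertools as adv

# Map every link found on every crawled page. The internal_url_regex
# decides which links count as internal; adjust it to your own
# domain(s), sub-domains, or related brand domains.
link_df = adv.crawlytics.links(crawl_df, internal_url_regex="example.com")

# One row per link: the page it was found on, the link target,
# its anchor text, and the nofollow/internal flags.
link_df.head()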
Errors here are defined as anything with a status code other than 200, but it is up to you to define this by filtering for whichever status codes matter in your case. Examples:
Finding status codes between two numbers:
crawl_df[crawl_df['status'].between(400, 499)]
The between method optionally takes an inclusive parameter, to which you can supply any of "both", "neither", "left", or "right".
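For example, a half-open range that includes 400 but excludes 500:

# Status codes in [400, 500): include the left bound, exclude the right.
crawl_df[crawl_df['status'].between(400, 500, inclusive='left')]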
Finding status codes that belong to a defined set of codes
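For example, using pandas' isin with an illustrative set of codes:

# Keep only rows whose status code is in this set (adjust to your needs).
crawl_df[crawl_df['status'].isin([301, 302, 404, 500])]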
Finding status codes using the comparison operators <, >, <=, >=, ==, and !=, which are also available as method names (lt, gt, le, ge, eq, ne). I prefer the method names because they don't have any spaces and they enable method chaining.
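For example, the operator form and the method form express the same filter:

# Operator form:
crawl_df[crawl_df['status'] >= 400]

# Method form, which chains cleanly:
crawl_df[crawl_df['status'].ge(400)]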
Finding which URLs returned the status code(s) of interest is quite easy, but we also want to know which pages link to those URLs. You might want to change those links and point them somewhere else, for example if the broken URLs have expired.
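A minimal sketch of locating those links, assuming the link_df produced by adv.crawlytics.links above, by matching each link against the status code of the URL it points to:

# Status code of every crawled URL.
status_df = crawl_df[['url', 'status']]

# Match each link against the status of the URL it points to.
# link_df's 'url' is the page containing the link; 'link' is the target.
broken_links = (
    link_df
    .merge(status_df, left_on='link', right_on='url',
           suffixes=('', '_target'))
    .query('status != 200')
)

# Page containing the link, the broken target, and the target's status code.
broken_links[['url', 'link', 'status']]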
We now want to crawl the external links, get their status codes, and see which ones have issues. We can use the regular crawl function, or we can use the crawl_headers function. The latter only makes HEAD requests and retrieves status codes, redirects, and all available response headers. This is much faster and lighter on servers, and in this case we just want to make sure that each URL exists and has no issues.
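Before running the header crawl, we need the list of external link targets. A sketch, assuming the link_df from above (external_links_tocrawl is the variable name used in the call below):

# Unique external link targets found across the site.
external_links_tocrawl = (
    link_df.loc[link_df['internal'].eq(False), 'link']
    .dropna()
    .drop_duplicates()
    .tolist()
)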
adv.crawl_headers(
    external_links_tocrawl,
    output_file='output_file_external.jl',
    custom_settings={
        "CONCURRENT_REQUESTS": 64,            # run really fast
        "CONCURRENT_REQUESTS_PER_DOMAIN": 1,  # but go slow on each domain
    },
)
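Finally, a sketch of reading the header crawl back and locating the pages that link to problematic external URLs, using the same merge pattern as for internal links (column names assume the standard advertools crawl output):

import pandas as pd

external_df = pd.read_json('output_file_external.jl', lines=True)

# External URLs that did not return 200.
external_issues = external_df.loc[external_df['status'].ne(200), ['url', 'status']]

# Which of our pages link to these URLs?
external_broken_links = (
    link_df
    .merge(external_issues, left_on='link', right_on='url',
           suffixes=('', '_target'))
)
external_broken_links[['url', 'link', 'status']]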