Create a broken link checker with Python

crawling
scraping
auditing
advertools
A Python script that takes an advertools crawl file, maps the links found on all pages, finds broken internal links, and locates the pages that link to them. It then runs the same check for external links.

It is assumed that you have already crawled a website and have the crawl file ready. You can run a simple crawl of a website as follows:

Crawling a website

import advertools as adv
import pandas as pd

adv.crawl(
    url_list="https://example.com",
    output_file="output_file.jsonl",
    # follow links to crawl the whole site, not just the listed URL
    follow_links=True)

crawl_df = pd.read_json("output_file.jsonl", lines=True)

Getting error URLs

error_urls = crawl_df[crawl_df['status'].ne(200)]['url'].tolist()
error_urls
['https://supermetrics.com/blog/google-ads-key-lessons',
 'https://supermetrics.com/blog/marketing-data-pipeline',
 'https://supermetrics.com/case-studies/vuelo6',
 'https://supermetrics.com/docs/integration-google-analytics-fields/',
 'https://supermetrics.com/webinars/vanmoof-data-warehouse',
 'https://support.supermetrics.com/support/solutions/articles/19000092290-can-i-schedule-queries-to-refresh-automatically-excel-',
 'https://supermetrics.com/blog/bulk-url-checker-template',
 'https://supermetrics.com/blog/insightful-ppc-report']

Errors here are defined as anything with a status code other than 200. It is up to you to define what counts as an error by filtering for the status codes that make sense to you. Examples:

  • Finding status codes between two numbers:
crawl_df[crawl_df['status'].between(400, 499)]

The between method optionally takes an inclusive parameter, to which you can supply any of "both", "neither", "left", or "right" (illustrated in the sketch after this list).

  • Finding status codes that belong to a defined set of codes:
crawl_df[crawl_df['status'].isin([404, 403, 429])]
  • Finding status codes with any of the comparison operators <, >, <=, >=, ==, and !=, which are also available as method names. I prefer using the method names because they don't contain spaces and enable method chaining (also shown in the sketch after this list).
crawl_df[crawl_df['status'].lt(400)]  # < "less than"
crawl_df[crawl_df['status'].gt(400)]  # > "greater than"
crawl_df[crawl_df['status'].le(400)]  # <=
crawl_df[crawl_df['status'].ge(400)]  # >=
crawl_df[crawl_df['status'].eq(400)]  # ==
crawl_df[crawl_df['status'].ne(400)]  # !=
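
A quick illustration of both points, as a minimal sketch (the string values for inclusive assume pandas 1.3 or later):

crawl_df[crawl_df['status'].between(400, 500, inclusive="left")]  # 400 <= status < 500

# Method names also chain cleanly, for example inside loc with a lambda:
error_urls = (
    crawl_df
    .loc[lambda df: df['status'].ne(200), 'url']
    .tolist()
)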

Locating error URLs

Finding which URLs had the status code(s) of choice is quite easy, but we also want to know which pages link to these URLs. You might want to change those links and point them somewhere else, for example if the broken URLs have expired.
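
The links_df used below has one row per (page, link) pair. It isn't built above, so here is a minimal sketch of one way to construct it, assuming the crawl DataFrame has the advertools link columns (links_url, links_text, links_nofollow) with multiple values per cell joined by "@@":

from urllib.parse import urlsplit

# One row per link: split the @@-joined cells and explode them together
links_df = (
    crawl_df[['url', 'links_url', 'links_text', 'links_nofollow']]
    .rename(columns={'links_url': 'link',
                     'links_text': 'text',
                     'links_nofollow': 'nofollow'})
    .assign(link=lambda df: df['link'].str.split('@@'),
            text=lambda df: df['text'].str.split('@@'),
            nofollow=lambda df: df['nofollow'].str.split('@@'))
    .explode(['link', 'text', 'nofollow'])  # multi-column explode needs pandas >= 1.3
    .reset_index(drop=True)
)

# A link is internal if it points to the same domain as the page it appears on
links_df['internal'] = [
    pd.notna(link) and urlsplit(link).netloc == urlsplit(url).netloc
    for url, link in zip(links_df['url'], links_df['link'])
]

With links_df in place, filtering its link column for the error URLs shows every page that links to them, along with the anchor text: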

links_df[links_df['link'].fillna('').isin(error_urls)]
      url                                                link                                               text                                               nofollow  internal
540   https://supermetrics.com/blog/marketing-report...  https://supermetrics.com/blog/google-ads-key-l...  I discussed Google Trends in an earlier post a...  False     True
1613  https://supermetrics.com/blog/google-analytics...  https://supermetrics.com/docs/integration-goog...  463 metrics and 618 dimensions                     False     True
1650  https://supermetrics.com/blog/marketing-data-w...  https://supermetrics.com/webinars/vanmoof-data...  To hear more about VanMoof’s data warehousing ...  False     True
1815  https://supermetrics.com/blog/excel-marketing-...  https://support.supermetrics.com/support/solut...  schedule refresh and emailing                      False     True
1829  https://supermetrics.com/blog/ad-performance-r...  https://supermetrics.com/blog/bulk-url-checker...  HTTP header codes for multiple paid media acco...  False     True

Reading the external status code crawl file
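
The external status code file read below isn't created in the steps above. One way to produce it, as a sketch, is to take the external links from links_df and fetch their status codes with advertools' crawl_headers function, which only requests the response headers:

# External links found across the site, deduplicated
external_links = links_df[~links_df['internal']]['link'].dropna().unique().tolist()

# Request each external URL's headers and log the status codes
adv.crawl_headers(
    url_list=external_links,
    output_file='output_file_external.jl')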

status_codes_external = pd.read_json('output_file_external.jl', lines=True)
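
From here the process mirrors the internal check: filter for the status codes you care about, then locate the pages that link to those URLs:

external_error_urls = status_codes_external[
    status_codes_external['status'].ne(200)]['url'].tolist()

# Pages on the site linking to the broken external URLs
links_df[links_df['link'].fillna('').isin(external_error_urls)]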