A Python script that takes an advertools crawl file, maps the links on all pages, finds broken internal links, and locates the pages that contain them. It then does the same for external links.
It is assumed that you have already crawled a website and have the crawl file ready. You can run a simple crawl of a website as follows:
Crawling a website
import advertools as adv
import pandas as pd

adv.crawl(
    url_list="https://example.com",
    output_file="output_file.jsonl",
    follow_links=True,
)
crawl_df = pd.read_json("output_file.jsonl", lines=True)
Mapping links on the website
Now that we have read the crawl into a DataFrame, we will use the advertools.crawlytics.links function to create a mapping of all links on the site.
internal: Whether or not it is an internal link. This is determined by the internal_url_regex parameter, and it is up to you how to define it. You might include/exclude sub-domains, social media links to your properties, or even separate domains that you know belong to the same company/brand.
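As a minimal sketch of this step, assuming the crawl_df from above and treating anything matching "example.com" as internal:

import advertools as adv

# Map every link found on every crawled page. The internal_url_regex
# decides which links count as internal; adjust it to your own
# domain(s), sub-domains, or related brand domains.
link_df = adv.crawlytics.links(crawl_df, internal_url_regex="example.com")

# One row per link: the page it was found on, the link target,
# its anchor text, and the nofollow/internal flags.
link_df.head()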
Errors here are defined as anything with a status code other than 200, but it is up to you to define this by filtering for whichever status codes matter in your case. Examples:
Finding status codes between two numbers:
crawl_df[crawl_df['status'].between(400, 499)]
The between method optionally takes an inclusive parameter, to which you can supply any of "both", "neither", "left", or "right".
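For example, a half-open range that includes 400 but excludes 500:

# Status codes in [400, 500): include the left bound, exclude the right.
crawl_df[crawl_df['status'].between(400, 500, inclusive='left')]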
Finding status codes that belong to a defined set of codes
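For example, using pandas' isin with an illustrative set of codes:

# Keep only rows whose status code is in this set (adjust to your needs).
crawl_df[crawl_df['status'].isin([301, 302, 404, 500])]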
Finding status codes using the comparison operators <, >, <=, >=, ==, and !=, which are also available as method names (lt, gt, le, ge, eq, ne). I prefer the method names because they don't have any spaces and they enable method chaining.
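For example, the operator form and the method form express the same filter:

# Operator form:
crawl_df[crawl_df['status'] >= 400]

# Method form, which chains cleanly:
crawl_df[crawl_df['status'].ge(400)]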
Finding which URLs returned the status code(s) of interest is quite easy, but we also want to know which pages link to those URLs. You might want to change those links and point them somewhere else, for example if the broken URLs have expired.
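A minimal sketch of locating those links, assuming the link_df produced by adv.crawlytics.links above, by matching each link against the status code of the URL it points to:

# Status code of every crawled URL.
status_df = crawl_df[['url', 'status']]

# Match each link against the status of the URL it points to.
# link_df's 'url' is the page containing the link; 'link' is the target.
broken_links = (
    link_df
    .merge(status_df, left_on='link', right_on='url',
           suffixes=('', '_target'))
    .query('status != 200')
)

# Page containing the link, the broken target, and the target's status code.
broken_links[['url', 'link', 'status']]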
We now want to crawl the external links, get their status codes, and see which ones have issues. We can use the regular crawl function, or we can use the crawl_headers function. The latter only makes HEAD requests and retrieves status codes, redirects, and all available response headers. This is much faster and lighter on servers, and in this case we just want to make sure that each URL exists and has no issues.
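Before running the header crawl, we need the list of external link targets. A sketch, assuming the link_df from above (external_links_tocrawl is the variable name used in the call below):

# Unique external link targets found across the site.
external_links_tocrawl = (
    link_df.loc[link_df['internal'].eq(False), 'link']
    .dropna()
    .drop_duplicates()
    .tolist()
)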
adv.crawl_headers(
    external_links_tocrawl,
    output_file='output_file_external.jl',
    custom_settings={
        "CONCURRENT_REQUESTS": 64,            # run really fast
        "CONCURRENT_REQUESTS_PER_DOMAIN": 1,  # but go slow on each domain
    },
)
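Finally, a sketch of reading the header crawl back and locating the pages that link to problematic external URLs, using the same merge pattern as for internal links (column names assume the standard advertools crawl output):

import pandas as pd

external_df = pd.read_json('output_file_external.jl', lines=True)

# External URLs that did not return 200.
external_issues = external_df.loc[external_df['status'].ne(200), ['url', 'status']]

# Which of our pages link to these URLs?
external_broken_links = (
    link_df
    .merge(external_issues, left_on='link', right_on='url',
           suffixes=('', '_target'))
)
external_broken_links[['url', 'link', 'status']]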