Getting started with crawling in Python: how to use the advertools library for crawling and scraping websites.
This is a practical tutorial that starts with the simplest possible crawling process and builds up to a more complex crawl. Along the way, we will go through various options, see how to use them, and which combinations of options make sense. After completing this tutorial you can explore more options for crawl analysis. It is assumed that you have already set up Python and installed advertools. You might also be interested in the difference between crawling and scraping.
The crawl function
In its simplest form, this is a very straightforward function that only takes a URL (or a list of URLs) and an output file as parameters:
import advertools as adv

adv.crawl(url_list="https://example.com", output_file="output.jsonl")
When you run this, you’ll notice a lot of log messages being printed in the console, or the output area of a Jupyter notebook, until the crawling stops. These provide a lot of details on the crawling process, and can uncover some errors or issues. We’ll discuss what you can do with them later, but just take it as a sign that things are running correctly.
Note that the final few lines provide some useful summary statistics about the crawl: when it started, how many requests it made, why it ended, and so on.
Adding more URLs is straightforward: simply supply a list of URLs to the url_list parameter.
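For example (the extra URLs here are hypothetical, so substitute pages you actually want to crawl):

import advertools as adv

adv.crawl(
    url_list=[
        "https://example.com",
        "https://example.com/page-1",
        "https://example.com/page-2",
    ],
    output_file="output.jsonl",
)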
The crawl data is saved as a jsonlines file (with the extension .jsonl or .jl), each row completely and independently representing a crawled URL. We can read this file with pandas using the read_json function, specifying that it is a jsonlines file:
import pandas as pd

crawl_df = pd.read_json("output.jsonl", lines=True)
crawl_df
column                             value
url                                https://example.com
title                              Example Domain
viewport                           width=device-width, initial-scale=1
charset                            utf-8
h1                                 Example Domain
body_text                          \n Example Domain \n This domain is fo...
size                               1256
download_timeout                   180
download_slot                      example.com
download_latency                   0.110213
...                                ...
resp_headers_Etag                  "3147526947+gzip"
resp_headers_Expires               Tue, 07 Jan 2025 20:37:18 GMT
resp_headers_Last-Modified         Thu, 17 Oct 2019 07:18:26 GMT
resp_headers_Server                ECAcc (bsb/27D7)
resp_headers_Vary                  Accept-Encoding
resp_headers_X-Cache               HIT
request_headers_Accept             text/html,application/xhtml+xml,application/xm...
request_headers_Accept-Language    en
request_headers_User-Agent         advertools/0.16.3
request_headers_Accept-Encoding    gzip, deflate

1 rows × 33 columns
Now that we have a pandas DataFrame, we have its full power at our disposal to run any analysis we want. This also applies, of course, to all other libraries for data visualization, machine learning, text analysis, etc.
How to crawl in list mode
What we just did was essentially list mode, and we don’t need to do anything special, other than supplying a list of URLs to the url_list parameter.
How to crawl in spider mode
Spider mode is when the crawler discovers links on its own, keeps crawling the whole website, and stops when it’s done (or when a condition that you set is met).
Of course this has defaults, and we will discuss how to modify and manage them later.
To activate spider mode, all you have to do is set follow_links=True in the crawl function:
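For example, using the same start URL and output file as before:

import advertools as adv

adv.crawl(
    url_list="https://example.com",
    output_file="output.jsonl",
    follow_links=True,  # discover links on crawled pages and keep following them
)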
While following links, the crawler will by default follow links that belong to the same domain and its sub-domains. If the current domain is “example.com” then a link to “blog.example.com” will be followed, but a link to “facebook.com” will not.
You can use the allowed_domains parameter to control which domains the crawler is allowed to follow. Maybe you want to crawl only a certain sub-domain and not others, or just three of them? You can easily set this restriction with this parameter.
If you want to crawl this website’s blog only, you can do it explicitly like this:
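A minimal sketch, assuming the blog lives on a “blog.” sub-domain (substitute the domain you actually want to restrict the crawl to):

import advertools as adv

adv.crawl(
    url_list="https://blog.example.com",
    output_file="blog_crawl.jsonl",
    follow_links=True,
    allowed_domains=["blog.example.com"],  # hypothetical sub-domain; links outside it are not followed
)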
Another more flexible way to do so is through the parameters that give you fine-grained control over the process.
How to control which links get followed
The crawl function takes four parameters that should be self-explanatory in what they do:
include_url_regex: Only follow a link if it matches the given regex.
exclude_url_regex: Don’t follow a link if it matches the given regex.
include_url_params: Follow a link if it has any of the given parameters (as a list).
exclude_url_params: Don’t follow links that contain any of the given parameters (also as a list).
Here’s an example of how these might interact together. I encourage you to try any combination you want, explore how it works, and see how it might behave differently from what you thought:
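A sketch combining the two regex parameters, using the patterns discussed next (the start URL is illustrative):

import advertools as adv

adv.crawl(
    url_list="https://example.com",
    output_file="output.jsonl",
    follow_links=True,
    include_url_regex="/sunglasses/",  # only follow links matching this pattern
    exclude_url_regex="/blog/",        # never follow links matching this pattern
)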
In this case, it is guaranteed that you won’t get any URLs with “/blog/” in them (unless they were redirected), and you would get as many “/sunglasses/” pages as possible, depending on the linking structure of the site.
Assuming the start URL has links to at least one such page, your crawl will go on; otherwise it will stop. There is also a chance that there are a bunch of URLs with this pattern that are not linked to, and you would miss out on them. A good way to improve your chances in such an exercise is to download the site’s XML sitemap, get the URLs that match the pattern you want, and then crawl them with follow_links=True.
How to control which URL query parameters to include/exclude while crawling
The _url_params parameters take a list and not a pattern.
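A sketch with hypothetical parameter names (replace them with parameters that actually appear in the site’s URLs):

import advertools as adv

adv.crawl(
    url_list="https://example.com",
    output_file="output.jsonl",
    follow_links=True,
    include_url_params=["color", "size"],      # follow links containing any of these parameters
    exclude_url_params=["utm_source", "ref"],  # don't follow links containing any of these
)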
Again, we have the same issue: the exclude_ parameter is guaranteed not to follow any unwanted links, while the include_ parameter hopefully follows all the links you intend.
How to exclude all URL query parameters while crawling
You also have a simple option of excluding any link that has any URL parameter at all, by setting exclude_url_params=True. This is important because it’s difficult to know beforehand which parameters exist, and this option just makes things simpler.
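For example:

import advertools as adv

adv.crawl(
    url_list="https://example.com",
    output_file="output.jsonl",
    follow_links=True,
    exclude_url_params=True,  # don't follow any link that has a query parameter
)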
In both cases, the exclude_ parameters are easier to reason about and more likely to behave as expected. The reason is that you are allowing the crawler to crawl all pages, except the ones that you have excluded. The challenge with the include_ parameters is that you might not find any page that satisfies your conditions from the current start URLs, and the crawler would stop, even though there are many such pages on the website.
Assume you want to crawl the /books/ pages of a certain website, and you start from the home page. If there are no links that match this pattern, the crawler will stop and you might think that there are no such pages. This could be simply because there were no links to those pages from your start URLs. So please be careful when using include_.
After having decided how to “drive” the crawler, and which pages to tell it to follow and scrape, you probably want to extract custom content that is not extracted by default, because it is not available in standard HTML elements like h1, h2, and so on.
How to do custom extraction with XPath and/or CSS selectors
Custom extraction is the process of extracting certain data from a page that are not easily identified by standard HTML elements like h1, h2, div, p, etc.
Essentially, what you want to achieve with custom extraction is to specify something specific like, “the divs that have the class attribute equal to ‘price’”, or “the p elements that are right after h2 elements with class ‘author’”, and so on.
The process of setting what you want to extract can be achieved using a dictionary, for either of the two types of custom extraction:
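A sketch, assuming the xpath_selectors parameter; the XPath expressions here are illustrative (they reflect the “price” and “author” examples above), and the dictionary keys become the column names:

import advertools as adv

adv.crawl(
    url_list="https://example.com",
    output_file="output.jsonl",
    xpath_selectors={
        "column_1": "//div[@class='price']/text()",
        "column_2": "//h2[@class='author']/following-sibling::p[1]/text()",
        "column_3": "//span[@class='rating']/text()",
    },
)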
Once the crawl is finished you should end up with columns called column_1, column_2, column_3, or whatever you named them.
Under each column you will find the data extracted from the respective page, provided it exists on that page. Otherwise you would get a missing <NA> value.
How to specify CSS selectors while crawling
The process is exactly the same for CSS selectors, only the parameter is different, and of course, the selector patterns are different:
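A sketch, assuming the css_selectors parameter; again, the selectors themselves are illustrative:

import advertools as adv

adv.crawl(
    url_list="https://example.com",
    output_file="output.jsonl",
    css_selectors={
        "column_1": "div.price::text",
        "column_2": "h2.author + p::text",
        "column_3": "span.rating::text",
    },
)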
It can be very helpful to understand how advertools creates and updates the crawl file, as it discovers and saves pages to disk. It’s beyond the scope of this tutorial, but I encourage you to learn more about how the crawl file is structured.
How to extract the main content of a page
The advertools crawler attempts to extract the main content of a page (article text, product description, blog post, etc.) using a special XPath expression that essentially follows three rules:
Every tag in the <body> of a page
Any tag that is any of the specified tags (a, b, p, h1, …)
Any tag whose ancestor is not one of the specified tags (button, script, img, …)
From these tags it extracts the text attribute and joins everything into one string. After crawling, you should find this under a column called body_text.
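For example, to inspect the extracted text of the first crawled page:

import pandas as pd

crawl_df = pd.read_json("output.jsonl", lines=True)
print(crawl_df["body_text"].iloc[0][:200])  # first 200 characters of the extracted main content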
Custom settings
The crawling behavior can be massively modified using dozens of custom settings, which also enable the use of many third-party extensions for Scrapy. These settings are also specified using a dictionary. Typically, Scrapy settings are set using keys in ALL-CAPS.
For example, you can set CLOSESPIDER_PAGECOUNT to end crawling after reaching the desired number of pages. You can also set a file for saving the logs using the LOG_FILE setting, and so on:
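A sketch, assuming the custom_settings parameter:

import advertools as adv

adv.crawl(
    url_list="https://example.com",
    output_file="output.jsonl",
    follow_links=True,
    custom_settings={
        "CLOSESPIDER_PAGECOUNT": 1000,  # stop the crawl after 1,000 pages
        "LOG_FILE": "output.log",       # save all log messages to this file
    },
)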
This would stop crawling after reaching a thousand pages, and will save all log messages to the file output.log.
How to analyze a crawled website
Once the crawl has ended, you will obviously want to analyze it, and you have complete control over how and what to analyze by using any of the data packages you want. However, advertools provides a few handy functions under the crawlytics module that can help you further:
compare: Compare a specific page element between two crawl files.
images: Get a mapping of all crawled images, together with all attributes that they use.
jl_subset: Read only a subset of the jsonlines file by setting a list of columns and/or a regular expression for the patterns of columns that you want. This is handy for very large crawl files.
jl_to_parquet: Convert the jsonlines file to parquet format, massively reducing its size, and making it very easy to read/analyze the columns that you want.
links: Get a mapping of all the links on a website.
parquet_columns: Get a DataFrame showing you what columns are available, and what their data types are.
redirects: Get a mapping of all redirects on the crawled website.
running_crawls: Get a simple summary table showing you which crawl processes are running, together with some basic stats about them (only available on Linux and macOS for the moment).
The way to run each of these functions is simple. Assuming you have a crawl DataFrame named crawl_df:
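For example, a few of these functions take the crawl DataFrame directly (the variable names here are illustrative):

import advertools as adv
import pandas as pd

crawl_df = pd.read_json("output.jsonl", lines=True)

link_df = adv.crawlytics.links(crawl_df)          # all links found on the crawled pages
image_df = adv.crawlytics.images(crawl_df)        # all crawled images and their attributes
redirect_df = adv.crawlytics.redirects(crawl_df)  # all redirects encountered during the crawl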