Getting started with crawling in Python: how to use the advertools library for crawling and scraping websites.
This is a practical tutorial that starts with the simplest possible crawling process and builds up to a more complex crawl. Along the way, we will go through various options, see how to use them, and which combinations of options make sense. After completing this tutorial you can explore more options for crawl analysis. It is assumed that you have already set up Python and installed advertools. You might also be interested in the difference between crawling and scraping.
The crawl function
In its simplest form, this is a very straightforward function that only takes a URL (or a list of URLs) and an output file as parameters:
import advertools as adv

adv.crawl(url_list="https://example.com", output_file="output.jsonl")
When you run this, you’ll notice a lot of log messages being printed in the console, or the output area of a Jupyter notebook, until the crawling stops. These provide a lot of details on the crawling process, and can uncover some errors or issues. We’ll discuss what you can do with them later, but just take it as a sign that things are running correctly.
Note that the final few lines provide some useful summary statistics about the crawl: when it started, how many requests it made, why it ended, and so on.
Adding more URLs is straightforward: simply supply a list of URLs to the url_list parameter.
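For example (the extra URLs here are hypothetical, so substitute pages you actually want to crawl):

import advertools as adv

adv.crawl(
    url_list=[
        "https://example.com",
        "https://example.com/page-1",
        "https://example.com/page-2",
    ],
    output_file="output.jsonl",
)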
The crawl data is saved as a jsonlines file (with the extension .jsonl or .jl), each row completely and independently representing a crawled URL. We can read this file with pandas using the read_json function, specifying that it is a jsonlines file:
import pandas as pd

crawl_df = pd.read_json("output.jsonl", lines=True)
crawl_df
column                             value
url                                https://example.com
title                              Example Domain
viewport                           width=device-width, initial-scale=1
charset                            utf-8
h1                                 Example Domain
body_text                          \n Example Domain \n This domain is fo...
size                               1256
download_timeout                   180
download_slot                      example.com
download_latency                   0.110213
...                                ...
resp_headers_Etag                  "3147526947+gzip"
resp_headers_Expires               Tue, 07 Jan 2025 20:37:18 GMT
resp_headers_Last-Modified         Thu, 17 Oct 2019 07:18:26 GMT
resp_headers_Server                ECAcc (bsb/27D7)
resp_headers_Vary                  Accept-Encoding
resp_headers_X-Cache               HIT
request_headers_Accept             text/html,application/xhtml+xml,application/xm...
request_headers_Accept-Language    en
request_headers_User-Agent         advertools/0.16.3
request_headers_Accept-Encoding    gzip, deflate

1 rows × 33 columns
Now that we have a pandas DataFrame, we have its full power at our disposal to run any analysis we want. This also applies, of course, to all other libraries for data visualization, machine learning, text analysis, etc.
How to crawl in list mode
What we just did was essentially list mode, and we don’t need to do anything special, other than supplying a list of URLs to the url_list parameter.
How to crawl in spider mode
Spider mode is when the crawler discovers links on its own, keeps crawling the whole website, and stops when it’s done (or when a condition that you set is met).
Of course this has defaults, and we will discuss how to modify and manage them later.
To activate spider mode, all you have to do is set follow_links=True in the crawl function:
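For example, using the same start URL and output file as before:

import advertools as adv

adv.crawl(
    url_list="https://example.com",
    output_file="output.jsonl",
    follow_links=True,  # discover links on crawled pages and keep following them
)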
While following links, the crawler will by default follow links that belong to the same domain and its sub-domains. If the current domain is “example.com” then a link to “blog.example.com” will be followed, but a link to “facebook.com” will not.
You can use the allowed_domains parameter to control which domains the crawler is allowed to follow. Maybe you want to crawl only a certain sub-domain and not others, or just three of them? You can easily set this restriction with this parameter.
If you want to crawl this website’s blog only, you can do it explicitly like this:
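A minimal sketch, assuming the blog lives on a “blog.” sub-domain (substitute the domain you actually want to restrict the crawl to):

import advertools as adv

adv.crawl(
    url_list="https://blog.example.com",
    output_file="blog_crawl.jsonl",
    follow_links=True,
    allowed_domains=["blog.example.com"],  # hypothetical sub-domain; links outside it are not followed
)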
Another more flexible way to do so is through the parameters that give you fine-grained control over the process.
How to control which links get followed
The crawl function takes four parameters that should be self-explanatory in what they do:
include_url_regex: Only follow a link if it matches the given regex.
exclude_url_regex: Don’t follow a link if it matches the given regex.
include_url_params: Follow a link if it has any of the given parameters (as a list).
exclude_url_params: Don’t follow links that contain any of the given parameters (also as a list).
Here’s an example of how these might interact together. I encourage you to try any combination you want, explore how it works, and see how it might behave differently from what you thought:
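A sketch combining the two regex parameters, using the patterns discussed next (the start URL is illustrative):

import advertools as adv

adv.crawl(
    url_list="https://example.com",
    output_file="output.jsonl",
    follow_links=True,
    include_url_regex="/sunglasses/",  # only follow links matching this pattern
    exclude_url_regex="/blog/",        # never follow links matching this pattern
)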
In this case, it is guaranteed that you won’t get any URLs with “/blog/” in them (unless they were redirected), and you would get as many “/sunglasses/” pages as possible, depending on the linking structure of the site.
Assuming the start URL has links to at least one such page, your crawl will go on; otherwise it will stop. There is also a chance that there are a bunch of URLs with this pattern that are not linked to, and you would miss out on them. A good way to improve your chances in such an exercise is to download the site’s XML sitemap, get the URLs that match the pattern you want, and then crawl them with follow_links=True.
How to control which URL query parameters to include/exclude while crawling
The _url_params parameters take a list and not a pattern.
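A sketch with hypothetical parameter names (replace them with parameters that actually appear in the site’s URLs):

import advertools as adv

adv.crawl(
    url_list="https://example.com",
    output_file="output.jsonl",
    follow_links=True,
    include_url_params=["color", "size"],      # follow links containing any of these parameters
    exclude_url_params=["utm_source", "ref"],  # don't follow links containing any of these
)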
Again, we have the same issue: the exclude_ parameter is guaranteed not to follow any unwanted links, while the include_ parameter hopefully follows all the links you intend.
How to exclude all URL query parameters while crawling
You also have a simple option of excluding any link that has any URL parameter at all, by setting exclude_url_params=True. This is important because it’s difficult to know beforehand which parameters exist, and this option just makes things simpler.
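For example:

import advertools as adv

adv.crawl(
    url_list="https://example.com",
    output_file="output.jsonl",
    follow_links=True,
    exclude_url_params=True,  # don't follow any link that has a query parameter
)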
In both cases, the exclude_ parameters are easier to reason about and more likely to behave as expected. The reason is that you are allowing the crawler to crawl all pages, except the ones that you have excluded. The challenge with the include_ parameters is that you might not find any page that satisfies your conditions from the current start URLs, and the crawler would stop, even though there are many such pages on the website.
Assume you want to crawl the /books/ pages of a certain website, and you start from the home page. If there are no links that match this pattern, the crawler will stop and you might think that there are no such pages. This could be simply because there were no links to those pages from your start URLs. So please be careful when using include_.
After having decided how to “drive” the crawler, and which pages to tell it to follow and scrape, you probably want to extract custom content that is not extracted by default, because it is not available in standard HTML elements like h1, h2, and so on.
How to do custom extraction with XPath and/or CSS selectors
Custom extraction is the process of extracting certain data from a page that are not easily identified by standard HTML elements like h1, h2, div, p, etc.
Essentially, what you want to achieve with custom extraction is to specify something specific like, “the divs that have the class attribute equal to ‘price’”, or “the p elements that are right after h2 elements with class ‘author’”, and so on.
The process of setting what you want to extract can be achieved using a dictionary, for either of the two types of custom extraction:
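A sketch, assuming the xpath_selectors parameter; the XPath expressions here are illustrative (they reflect the “price” and “author” examples above), and the dictionary keys become the column names:

import advertools as adv

adv.crawl(
    url_list="https://example.com",
    output_file="output.jsonl",
    xpath_selectors={
        "column_1": "//div[@class='price']/text()",
        "column_2": "//h2[@class='author']/following-sibling::p[1]/text()",
        "column_3": "//span[@class='rating']/text()",
    },
)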
Once the crawl is finished you should end up with columns called column_1, column_2, column_3, or whatever you named them.
Under each column you will find the data extracted from the respective page, provided it exists on that page. Otherwise you would get a missing <NA> value.
How to specify CSS selectors while crawling
The process is exactly the same for CSS selectors, only the parameter is different, and of course, the selector patterns are different:
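A sketch, assuming the css_selectors parameter; again, the selectors themselves are illustrative:

import advertools as adv

adv.crawl(
    url_list="https://example.com",
    output_file="output.jsonl",
    css_selectors={
        "column_1": "div.price::text",
        "column_2": "h2.author + p::text",
        "column_3": "span.rating::text",
    },
)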
It can be very helpful to understand how advertools creates and updates the crawl file, as it discovers and saves pages to disk. It’s beyond the scope of this tutorial, but I encourage you to learn more about how the crawl file is structured.
How to extract the main content of a page
The advertools crawler attempts to extract the main content of a page (article text, product description, blog post, etc.) using a special XPath expression that essentially follows three rules:
Every tag in the <body> of a page
Any tag that is any of the specified tags (a, b, p, h1, …)
Any tag whose ancestor is not one of the specified tags (button, script, img, …)
From these tags it extracts the text attribute and joins everything into one string. After crawling, you should find this under a column called body_text.
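For example, to inspect the extracted text of the first crawled page:

import pandas as pd

crawl_df = pd.read_json("output.jsonl", lines=True)
print(crawl_df["body_text"].iloc[0][:200])  # first 200 characters of the extracted main content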
Custom settings
The crawling behavior can be massively modified using dozens of custom settings, which also enable the use of many third-party extensions for Scrapy. These settings are also specified using a dictionary. Typically, Scrapy settings are set using keys in ALL-CAPS.
For example, you can set CLOSESPIDER_PAGECOUNT to end crawling after reaching the desired number of pages. You can also set a file for saving the logs using the LOG_FILE setting, and so on:
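A sketch, assuming the custom_settings parameter:

import advertools as adv

adv.crawl(
    url_list="https://example.com",
    output_file="output.jsonl",
    follow_links=True,
    custom_settings={
        "CLOSESPIDER_PAGECOUNT": 1000,  # stop the crawl after 1,000 pages
        "LOG_FILE": "output.log",       # save all log messages to this file
    },
)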
This would stop crawling after reaching a thousand pages, and will save all log messages to the file output.log.
How to analyze a crawled website
Once the crawl has ended, you will obviously want to analyze it, and you have complete control over how and what to analyze by using any of the data packages you want. However, advertools provides a few handy functions under the crawlytics module that can help you further:
compare: Compare a specific page element between two crawl files.
images: Get a mapping of all crawled images, together with all attributes that they use.
jl_subset: Read only a subset of the jsonlines file by setting a list of columns and/or a regular expression for the patterns of columns that you want. This is handy for very large crawl files.
jl_to_parquet: Convert the jsonlines file to parquet format, massively reducing its size, and making it very easy to read/analyze the columns that you want.
links: Get a mapping of all the links on a website.
parquet_columns: Get a DataFrame showing you what columns are available, and what their data types are.
redirects: Get a mapping of all redirects on the crawled website.
running_crawls: Get a simple summary table showing you which crawl processes are running, together with some basic stats about them (only available on Linux and macOS for the moment).
The way to run each of these functions is simple. Assuming you have a crawl DataFrame named crawl_df:
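For example, a few of these functions take the crawl DataFrame directly (the variable names here are illustrative):

import advertools as adv
import pandas as pd

crawl_df = pd.read_json("output.jsonl", lines=True)

link_df = adv.crawlytics.links(crawl_df)          # all links found on the crawled pages
image_df = adv.crawlytics.images(crawl_df)        # all crawled images and their attributes
redirect_df = adv.crawlytics.redirects(crawl_df)  # all redirects encountered during the crawl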