Crawling vs scraping: What’s the difference?
Web scraping and crawling are two distinct yet related processes used to extract data from the web. While they are sometimes used interchangeably, understanding their differences is crucial for choosing the right approach for your needs.
In a nutshell, scraping means extracting certain elements from a downloaded HTML page, while crawling is the larger process of discovering links on those pages, following them, handling requests and errors, saving results to disk, and more; many of its details are outlined below.
Scraping
This is the process of extracting some content from a downloaded HTML page. The extraction can target standard elements like the title tag, meta description, or h1, or use custom extraction with a specialized mechanism.
A typical workflow uses requests and beautifulsoup, which in turn relies on a parser like lxml, for example.
import requests
from bs4 import BeautifulSoup

# Download the page
response = requests.get("https://example.com")
# Parse the HTML with the lxml parser
soup = BeautifulSoup(response.text, "lxml")
# Extract the text of the <title> tag
soup.find_all("title")[0].text
'Example Domain'
So basically, make a request, parse the response, and then query the BeautifulSoup object in order to find any element you want. This is great for a small set of URLs that you already know about.
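For custom extraction beyond standard elements, one common mechanism is CSS selectors. Here is a minimal sketch continuing the example above; the selector itself is a placeholder, so adapt it to the page you are scraping.
import requests
from bs4 import BeautifulSoup

response = requests.get("https://example.com")
soup = BeautifulSoup(response.text, "lxml")

# Standard elements
title = soup.find("title").text
h1_texts = [h1.get_text(strip=True) for h1 in soup.find_all("h1")]

# Custom extraction with a CSS selector (placeholder selector)
intro_paragraphs = [p.get_text(strip=True) for p in soup.select("div p")]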
There is much more included in crawling.
Crawling
This involves various tasks:
Downloading a page
This is similar to the first step discussed above, which is the basis of everything. Get a single HTML file.
Discovering links
As you have probably noticed, the process above stops at parsing the HTML and getting whichever elements we want; there is nothing beyond that. We can of course build the rest of the process ourselves, but then we are crawling, and we have to conceptually handle, and code, all the steps outlined here.
Not all links are “links”. Some point to JavaScript files, images, or PDF files, so there needs to be a mechanism for identifying the types of links that should be followed. There are also many cases with relative links, where you have to work out the full URL from something like ../how-to/crawl-a-website/, for example.
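As a rough sketch of what this involves, assuming the soup object from the scraping example above and an illustrative base URL:
from urllib.parse import urljoin, urlparse

base_url = "https://example.com/how-to/"

followable = []
for a in soup.find_all("a", href=True):
    # Resolve relative links like ../how-to/crawl-a-website/ to full URLs
    full_url = urljoin(base_url, a["href"])
    # Skip non-HTTP schemes and obvious non-page resources
    if urlparse(full_url).scheme not in ("http", "https"):
        continue
    if full_url.lower().endswith((".js", ".pdf", ".jpg", ".png")):
        continue
    followable.append(full_url)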
Following links based on certain conditions
Now that we have found links on the downloaded page, we want the crawler to follow them, but not all of them. We need to differentiate between internal and external links; otherwise our crawler will end up crawling the whole web. We also need to handle deduplication and make sure we don't re-crawl pages that were already crawled; otherwise we will end up in an infinite loop. Custom behavior is also required, for example to only follow links that (don't) match a certain pattern.
More on these issues below.
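A minimal sketch of such conditions, building on the followable list from the previous snippet (the domain and patterns are illustrative):
import re
from urllib.parse import urlparse

allowed_domain = "example.com"
include_pattern = re.compile(r"/blog/")    # only follow URLs matching this
exclude_pattern = re.compile(r"\?page=")   # ...unless they match this

to_follow = []
for url in followable:
    # Internal links only, so the crawler doesn't wander off into the whole web
    if urlparse(url).netloc != allowed_domain:
        continue
    if include_pattern.search(url) and not exclude_pattern.search(url):
        to_follow.append(url)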
Handling errors: crawl errors, HTTP errors, redirects, etc.
When crawling a website, you will most likely encounter issues: broken links, server errors, redirect problems, and many others. You need a policy for how to handle each error, what to do about it, and how to make sure these errors are logged or reported.
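One possible policy, sketched with requests and the standard logging module; timeouts, log destinations, and what counts as fatal are all choices you would make for your own crawler.
import logging
import requests

logging.basicConfig(filename="crawl_errors.log", level=logging.INFO)

def fetch(url):
    try:
        response = requests.get(url, timeout=10, allow_redirects=True)
    except requests.RequestException as exc:
        # Network-level errors: DNS failures, timeouts, connection resets, ...
        logging.error("Request failed for %s: %s", url, exc)
        return None
    if response.history:
        # The request was redirected one or more times
        logging.info("%s redirected to %s", url, response.url)
    if response.status_code >= 400:
        # Broken links, server errors, and other HTTP errors
        logging.warning("%s returned status %s", url, response.status_code)
        return None
    return response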
(Dis)obeying robots.txt rules
While you want to respect the rules in those files, you also need a mechanism to read and parse them, and to know which links you can (or cannot) follow. This has to be done on the fly for each URL.
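The Python standard library already ships a parser for this; a minimal sketch:
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://example.com/robots.txt")
rp.read()

# Check each discovered URL on the fly before requesting it
rp.can_fetch("my-crawler", "https://example.com/how-to/crawl-a-website/")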
Handling concurrency: requesting multiple pages at the same time
The code shown above handles a single URL. That would be very slow, and we need a way to make concurrent requests to make the process as fast as we want (or can). This adds another layer of complexity to the process.
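One way to do this with the standard library, reusing the fetch function sketched in the error-handling section (the worker count and URL list are arbitrary):
from concurrent.futures import ThreadPoolExecutor, as_completed

urls = ["https://example.com/page-1", "https://example.com/page-2"]  # illustrative

with ThreadPoolExecutor(max_workers=10) as executor:
    futures = {executor.submit(fetch, url): url for url in urls}
    for future in as_completed(futures):
        response = future.result()
        if response is not None:
            # Parse and extract from each response as shown earlier
            ...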
Ending the crawl
While there are usually defaults, like stopping when the whole website has been crawled, you need to be able to customize the conditions for ending the crawl. Maybe you want it to stop after crawling a thousand pages, for example, as an initial exploratory crawl. Maybe you want to end it if you get more than a certain number of errors, and so on.
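With a scrapy-based crawler, such conditions are usually expressed as settings; scrapy's CloseSpider extension, for example, supports page, error, and time limits (the values below are illustrative):
custom_settings = {
    "CLOSESPIDER_PAGECOUNT": 1000,   # stop after ~1,000 pages, e.g. for an exploratory crawl
    "CLOSESPIDER_ERRORCOUNT": 50,    # stop if errors pile up
    "CLOSESPIDER_TIMEOUT": 3600,     # or stop after an hour
}
These can be passed to advertools through the custom_settings parameter shown in the example further down.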
Making sure not to crawl pages that were already crawled
A mechanism needs to be available for keeping track of which URLs have been crawled (or dropped due to repeated errors), and for making sure they are not re-crawled.
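A minimal sketch of such a mechanism, normalizing URLs before recording them so that trivial variations don't trigger re-crawls:
from urllib.parse import urldefrag

crawled = set()

def should_crawl(url):
    # Drop the #fragment and the trailing slash so variants map to the same key
    normalized = urldefrag(url).url.rstrip("/")
    if normalized in crawled:
        return False
    crawled.add(normalized)
    return True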
Working with proxies
Crawling happens from the computer you are running it on, but many times you might want to crawl from another location or through a different type of connection. This is an important feature, and it can uncover interesting differences in content or behavior when a site is accessed from different countries.
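With requests, for example, this is a matter of passing a proxies mapping (the proxy address below is a placeholder):
import requests

proxies = {
    "http": "http://user:pass@proxy.example.com:8080",   # placeholder proxy
    "https": "http://user:pass@proxy.example.com:8080",
}
response = requests.get("https://example.com", proxies=proxies, timeout=10)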
Data storage
How do you save the extracted data? What format should it be in? Should it be a file, a database, or something else? Writing the data should be done periodically to free up memory.
These are important things to consider, because otherwise your memory would end up holding massive amounts of data, and the process would probably crash.
The format of the stored data is crucial and should take into consideration how users are going to analyze it afterwards. Ideally, the data should be saved in a “tidy data” (long form) format, to enable immediate and efficient analysis using various Data Science libraries and workflows.
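A rough sketch of one option: append batches of flat records to a JSON Lines file, which keeps memory use bounded and loads straight into a long-form DataFrame later.
import json

def write_batch(records, path="output.jsonl"):
    # Append one flat dict per crawled page, then free the batch from memory
    with open(path, "a", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

write_batch([{"url": "https://example.com", "title": "Example Domain", "status": 200}])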
Post-processing
The data storage part ensures we have the data in a useful and consistent format. After saving it, we need to start analyzing it and scaling this process. There need to be tools to handle common workflows, like:
- Auditing and analyzing redirects
- Auditing and analyzing images
- Auditing and finding broken links
- Auditing and visualizing status codes
- Auditing and analyzing structured data (JSON-LD, Twitter, and OpenGraph)
- Comparing crawl over crawl
Crawling with advertools
The advertools crawler handles all these details. This is possible because it is built with scrapy, the leading web crawling framework in Python. The advantage that advertools provides is that it puts the whole process in one function, crawl.
import advertools as adv

adv.crawl(
    url_list="https://example.com",
    output_file="output.jsonl",
    follow_links=True,
    xpath_selectors={
        "column_a": "xpath_expression_a",
        "column_b": "xpath_expression_b",
        # ...
    },
    css_selectors={
        "column_x": "css_selector_x",
        "column_y": "css_selector_y",
        # ...
    },
    include_url_regex="your-regex",
    exclude_url_regex="your-regex",
    include_url_params=["list", "of", "params"],
    exclude_url_params=["list", "of", "params"],
    custom_settings={
        "SETTING_A": "VALUE_A",
        "SETTING_B": "VALUE_B",
        "SETTING_C": "VALUE_C",
        # ...
    },
)
As you can see, the customizations available in a single function allow you to really tailor the crawling process, making it very easy to rerun the same crawls, not to mention debug them.
Under custom_settings there are over a hundred such settings that manage concurrent requests, crawl delays, when to stop, proxies, integration with third-party packages, and much more.
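Since the crawl writes a JSON Lines file, one way to start the post-processing workflows described earlier is to load it into pandas. The column names used here (url, title, status) reflect the advertools output, but check the file you actually get.
import pandas as pd

crawl_df = pd.read_json("output.jsonl", lines=True)

# Example audits: status code distribution and pages with missing titles
crawl_df["status"].value_counts()
crawl_df.loc[crawl_df["title"].isna(), ["url", "status"]]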
The advantage of using advertools over scrapy is that it is mainly geared toward website analysis, especially for SEO purposes, and there is a lot of customized and customizable behavior that you get out of the box. If this sounds interesting and you want all these options without writing them yourself, then advertools should be worth a try.
If you want to develop a scraping solution, or you have a very specific use case, then you are probably better off writing your own logic with scrapy.