The Structure of an advertools Crawl File
Whether you use the advertools no-code SEO crawler, crawl with Python, or use the CLI, the output file that you get is essentially the same.
The only difference is that the no-code app converts the JSON file to CSV. Still, the structure described here can help a lot in understanding what to do with the file, and how to analyze it.
File format jsonlines (.jsonl or .jl)
The output file of crawling with advertools uses the jsonlines (.jl) format, which is essentially the regular JSON file format, except that each line holds one independent JSON object.
It looks something like this:
output_file.jl
How does this format help?
- Independence: As you can see, every row/line contains the name of each column as well as its contents. So we can easily take the first row on its own, and know what the URL, title, status, and h1 are for that particular page. That means that each row on its own completely represents a scraped web page. Of course, we can select a set of rows satisfying a certain condition and audit those separately. Note that there are no commas at the end of the lines.
- Flexibility: You probably noticed that not all URLs contain the same data. Some have an h1, and some don’t. So this format allows us to work with pages that typically contain different data.
- It’s just one file: The whole crawl is saved in a single file. So you can easily read it, analyze it, and share it, as a single entity without having to use, relate, or merge multiple files.
The downside of using this format is a little extra overhead because column names are repeated in every line of the file, “url”, “url”, “url”, “h1”, “h1”,… It’s not a huge overhead, and there are ways to handle it, which we will get to in a separate article.
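As a minimal sketch of working with this format, the snippet below writes a few made-up rows as jsonlines and reads them back with pandas. The rows and the temporary file location are assumptions for illustration only, not real crawl output.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical rows, made up for illustration (not real crawl output).
rows = [
    {"url": "https://example.com/", "title": "Home", "status": 200, "h1": "Welcome"},
    {"url": "https://example.com/about", "title": "About", "status": 200},
    {"url": "https://example.com/old", "title": "Not found", "status": 404, "h1": "Oops"},
]

# Write one independent JSON object per line, with no trailing commas.
path = Path(tempfile.mkdtemp()) / "output_file.jl"
with open(path, "w", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")

# lines=True tells pandas to parse each line as a separate JSON object.
crawl_df = pd.read_json(path, lines=True)

# Each line also stands on its own: parse only the first one.
with open(path, encoding="utf-8") as f:
    first_page = json.loads(f.readline())
```

The second row has no h1 key, so pandas fills that cell with a missing value instead of failing.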
How the crawl file is created/updated
While crawling, every minute or so, the crawled and parsed pages are saved to the output file. It’s important to know that the new lines are appended to the file.
This means:
- The crawling process consumes very little memory. Because processed data is periodically dumped into the output file, at any point in time you will have very little memory consumed by the crawler.
- Running another crawl, and setting the same output_file will add newly crawled URLs to the end of the same file. So make sure you use a different file path for each crawl (unless you want to use the same file).
- While crawling you can already read the available lines and start analyzing.
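Because lines are appended, a file that is still being written can be read at any time. The sketch below simulates two appended batches and reads whatever is complete so far. The helper read_complete_lines is hypothetical (not part of advertools), and skipping a final line without a trailing newline is a defensive assumption, not documented crawler behavior.

```python
import json
import tempfile
from pathlib import Path


def read_complete_lines(path):
    """Parse every complete line currently in a jsonlines file.

    Hypothetical helper: a final line without a trailing newline is
    assumed to be still in the middle of being written, and is skipped.
    """
    pages = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.endswith("\n"):
                break
            if line.strip():
                pages.append(json.loads(line))
    return pages


path = Path(tempfile.mkdtemp()) / "output_file.jl"

# Simulate a first batch being saved to the output file.
with open(path, "a", encoding="utf-8") as f:
    f.write(json.dumps({"url": "https://example.com/A", "status": 200}) + "\n")
first_read = read_complete_lines(path)

# Simulate a later batch being appended, ending with a partial line.
with open(path, "a", encoding="utf-8") as f:
    f.write(json.dumps({"url": "https://example.com/B", "status": 200}) + "\n")
    f.write('{"url": "https://example.com/C", "sta')
second_read = read_complete_lines(path)
```

The same append behavior is why reusing an output path mixes two crawls: the second batch simply lands after the first.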
What columns are included
Various columns and types of columns can appear in a crawl file:
Standard columns
These are the columns that you should find in any crawl file, no matter what. Some of these are url, status, crawl_time, and several others.
Variable columns
Since there are no rules on what HTML elements a web page should have, there is no way to know or expect what data will be there. Some pages might contain h2 tags and some might not. Some pages will have JSON-LD, some won’t, and not all servers return the same response headers. In a crawl file where some pages have, and some don’t have, a certain element, that element will be represented as not available (NA) in the row belonging to the URL where it is missing. For example, URL_1 and URL_3 have an h3 tag, but URL_2 does not. In this case you will see a missing value:
url | h3 |
---|---|
URL_1 | price: $10 |
URL_2 | NA |
URL_3 | price: $20 |
Note that you might see missing values represented as <NA> (nullable integers) or np.nan (floats) as well. This depends on the pandas version you are using and on the other values in the same column, but all of these essentially represent missing values.
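Whichever representation shows up, pandas treats them all as missing, so isna() and notna() work the same way. A small sketch, using the table above plus a made-up img_count column to show the nullable integer case:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "url": ["URL_1", "URL_2", "URL_3"],
    # Missing value stored as np.nan
    "h3": ["price: $10", np.nan, "price: $20"],
    # Hypothetical column, missing value stored as <NA> (nullable integer)
    "img_count": pd.array([3, pd.NA, 5], dtype="Int64"),
})

# Both representations are detected the same way.
missing_h3 = df["h3"].isna()
missing_img = df["img_count"].isna()

# Keep only the URLs of pages that actually have an h3 tag.
pages_with_h3 = df.loc[df["h3"].notna(), "url"].tolist()
```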
Custom columns
Through custom extraction, using XPath and/or CSS selectors, you have the flexibility to extract custom data, as well as give those columns any name you want. For example, you might use the following dictionary to extract the price and availability from product pages:
import advertools as adv

# Custom column names mapped to the XPath expressions that extract them
xpath = {
    "product_price": "//span[@class='price']/text()",
    "product_availability": "//span[@class='availability']/text()",
}

adv.crawl(
    "https://example.com",
    "output_file.jl",
    follow_links=True,
    xpath_selectors=xpath,
)
After crawling this website, you will have two user-defined custom columns with the names product_price and product_availability. For pages that actually contain a price and availability you will see the data, and for pages that don’t, you will get NA values.
Column categories
There are also several columns that can be thought of as a group, because they are either similar to one another, mean something similar, or form parts of a whole.
Similar
The heading tags h1..h6 are similar to one another, but are independent.
Mean something similar
Response and request headers can be thought to belong to this category. You will see all these columns start with resp_headers_ and request_headers_.
Forming parts of a whole
Image tags, JSON-LD, redirects, and links fall under this category. A single image can have multiple attributes like src, alt, width, and several more. Although these would belong to the same image, they are spread across columns, one for each. Redirects have multiple components as well: the URLs, status codes, and number of redirects. The JSON-LD elements can be quite complicated and nested to multiple levels. Again, each component will have its own column in the crawl file.
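Since related columns share a naming pattern, one way to pull a whole category out of a crawl DataFrame is to filter column names with a regular expression. This is a sketch on made-up data: the specific header column names and values are assumptions, and only the resp_headers_, request_headers_, and img_ prefixes come from the description above.

```python
import pandas as pd

# Hypothetical crawl data; the exact columns depend on the crawled site.
df = pd.DataFrame({
    "url": ["https://example.com/A", "https://example.com/B"],
    "h1": ["Page A", "Page B"],
    "h2": ["one@@two", None],
    "img_src": ["a.png", "b.png"],
    "img_alt": ["Logo", None],
    "resp_headers_content-type": ["text/html", "text/html"],
    "resp_headers_server": ["nginx", "nginx"],
    "request_headers_user-agent": ["advertools", "advertools"],
})

# DataFrame.filter keeps the columns whose names match the pattern.
headings = df.filter(regex=r"^h[1-6]$")
resp_headers = df.filter(regex=r"^resp_headers_")
request_headers = df.filter(regex=r"^request_headers_")
images = df.filter(regex=r"^img_")
```

Each result is a smaller DataFrame holding one category, which is usually easier to inspect than the full crawl table.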
Now let’s see how we handle the case where we have multiple values of the same element on the same page.
Multiple values of the same element on one page
With most HTML elements, you can have more than one appearing on the same page. We typically see multiple images, multiple heading tags, or links on the same page.
Following “tidy data” principles, each column represents data about one, and only one, element. And there should be no other column that contains data for the same element.
In other words, the img_src column has all the img_src elements that appeared on the crawled page(s). Also, there is no other img_src column to be found in the crawl DataFrame.
How multiple elements are represented
Whenever this happens, you will see the multiple elements delimited by two “@” signs (@@):
output_file.jl
After you crawl a website, or a set of URLs, this is what those columns would look like in a DataFrame:
url | h2 |
---|---|
https://example.com/A | one@@two@@three |
https://example.com/B | <NA> |
https://example.com/C | one@@three@@four@@five |
https://example.com/D | nine@@ten |
On page A we have three h2 tags, on page B we don’t have any, and on pages C and D we have four and two h2 tags respectively.
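To work with the individual values, the delimited strings can be split on @@. The sketch below rebuilds the table above by hand and shows one possible approach with pandas: splitting into lists, counting, and expanding to one value per row.

```python
import pandas as pd

df = pd.DataFrame({
    "url": [
        "https://example.com/A",
        "https://example.com/B",
        "https://example.com/C",
        "https://example.com/D",
    ],
    "h2": ["one@@two@@three", None, "one@@three@@four@@five", "nine@@ten"],
})

# Split each cell into a list of values; missing cells stay missing.
h2_lists = df["h2"].str.split("@@")

# Number of h2 tags per page, counting a missing cell as zero.
h2_counts = h2_lists.map(lambda v: len(v) if isinstance(v, list) else 0)

# One row per (url, h2) pair, dropping pages without any h2 tag.
one_per_row = df.assign(h2=h2_lists).explode("h2").dropna(subset=["h2"])
```

The long form in one_per_row is handy for counting how often each heading text appears across the whole site.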
Interpretation of having multiple elements on the same page
Having more than one element per page can be:
- An error: In some cases, like the canonical link, which by definition needs to point to a single URL, having more than one on a single page is an issue. The same goes for the title tag and meta description, of which there should be one per URL.
- A bad practice: Generally we should have a single h1 tag on a page, to make it clear what the most important topic is. It’s not a technical issue (won’t crash your pages), but it’s good to keep it to one instance per page.
- Absolutely meaningless: In most cases, it doesn’t mean anything if you have multiple occurrences of the same element. Most pages have multiple images, links, h2, h3, and various other tags, and this has no meaning at all.
You might, however, check the counts of tags, as these might give some hints. If you find that a page has a thousand images, it could be a bug causing that to happen, or just bad usability.
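A rough sketch of such a check: count the delimited values per page for a few columns, then flag the pages where a single-value element appears more than once. The data is made up, the canonical column name is an assumption, and count_values is a hypothetical helper rather than an advertools function.

```python
import pandas as pd


def count_values(series, sep="@@"):
    """Count delimited values per row; missing values count as zero."""
    return series.str.count(sep).add(1).fillna(0).astype(int)


# Hypothetical crawl data for three pages.
df = pd.DataFrame({
    "url": ["https://example.com/A", "https://example.com/B", "https://example.com/C"],
    "canonical": [
        "https://example.com/A",
        "https://example.com/B@@https://example.com/b",
        None,
    ],
    "h1": ["Title A", "Title B@@Another title", "Title C"],
    "img_src": ["a.png@@b.png@@c.png", None, "d.png"],
})

# One count column per element.
counts = df[["canonical", "h1", "img_src"]].apply(count_values)

# Pages where an element that should appear once appears more than once.
multiple_canonicals = df.loc[counts["canonical"] > 1, "url"].tolist()
multiple_h1 = df.loc[counts["h1"] > 1, "url"].tolist()
```

Sorting counts by img_src in descending order would surface pages with unusually many images.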
This was an overview of how the crawl file is structured, and how you can understand it. I’ll share more practical tips and recipes for various use-cases, but this was intentionally more theoretical, and geared toward building an understanding of the crawl file.
Downsides of this format
An issue that comes out of this flexibility is that we pay for it with extra storage. If we want to flexibly and independently crawl websites, having a little extra information repeated on each row is what enables that. There are several solutions to this, like converting the crawl file to an efficient format like parquet, reading a subset of the columns, and reading sequentially in chunks (small big data). Check out the advertools.crawlytics module for more details on how to handle these issues.
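As an illustration of the chunked, subset-of-columns idea using only pandas (not the crawlytics functions themselves), the sketch below reads a made-up crawl file a few lines at a time and keeps two columns from each chunk. The file contents, the body_text column, and the chunk size are assumptions.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd

# Write a small hypothetical crawl file with one bulky column.
path = Path(tempfile.mkdtemp()) / "output_file.jl"
with open(path, "w", encoding="utf-8") as f:
    for i in range(10):
        row = {
            "url": f"https://example.com/{i}",
            "status": 200 if i % 3 else 404,
            "body_text": "lorem " * 50,
        }
        f.write(json.dumps(row) + "\n")

wanted = ["url", "status"]
chunks = []

# chunksize makes read_json return an iterator of small DataFrames,
# so only a few lines are held in memory at any time.
with pd.read_json(path, lines=True, chunksize=4) as reader:
    for chunk in reader:
        # reindex keeps the wanted columns even if a chunk lacks one.
        chunks.append(chunk.reindex(columns=wanted))

subset_df = pd.concat(chunks, ignore_index=True)
```

The combined subset could then be saved with DataFrame.to_parquet, which requires pyarrow or fastparquet to be installed.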