The Structure of an advertools Crawl File

crawling
scraping
advertools
How the crawl file is created, how its columns are structured, and how the various columns relate to one another.
Author

Elias Dabbas

Whether you like to use the advertools no-code SEO crawler, or like crawling with Python, or the CLI, the output file that you get is pretty much the same.

The only difference is that with the no-code app, you get the JSON file converted to CSV. Still, the structure described here can help a lot in understanding what to do with the file, and how to analyze it.

File format jsonlines (.jsonl or .jl)

The output file of crawling with advertools uses the jsonlines (.jl) format. It is the regular JSON format with one difference: each line holds one independent JSON object.

It looks something like this:

output_file.jl
{"url": "https://example.com/A", "title": "Page A", "status": 200, "h1": "Tutorial A"}
{"url": "https://example.com/B", "title": "Page B", "status": 200, "h1": "Tutorial B", "h2": "Today we will learn X"}
{"url": "https://example.com/C", "title": "Page C", "status": 200}

How does this format help?

  • Independence: As you can see, every row/line contains the name of each column as well as its contents. So we can take the first row on its own and know what the URL, title, status, and h1 are for that particular page. Each row on its own completely represents a scraped web page. This also means we can select a set of rows satisfying a certain condition and audit those separately. Note that there are no commas at the end of the lines.
  • Flexibility: You probably noticed that not all URLs contain the same data. Some have an h1, and some don’t. So this format allows us to work with pages that typically contain different data.
  • It’s just one file: The whole crawl is saved in a single file. So you can easily read it, analyze it, and share it, as a single entity without having to use, relate, or merge multiple files.
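The independence and flexibility points can be illustrated with a minimal sketch that uses only Python's standard `json` module. The three lines are copied from the sample above and held in a string (the variable names are illustrative); with a real crawl you would read them from the file instead.

```python
import json

# Sample lines copied from the example above; a real crawl file would be read from disk
jsonlines_text = """\
{"url": "https://example.com/A", "title": "Page A", "status": 200, "h1": "Tutorial A"}
{"url": "https://example.com/B", "title": "Page B", "status": 200, "h1": "Tutorial B", "h2": "Today we will learn X"}
{"url": "https://example.com/C", "title": "Page C", "status": 200}
"""

# Each non-empty line is a complete JSON object, so it can be parsed on its own
pages = [json.loads(line) for line in jsonlines_text.splitlines() if line.strip()]

# Independence: the first line alone tells us everything about page A
first_page = pages[0]
print(first_page["url"], first_page["h1"])

# Flexibility: different pages can carry different sets of keys
keys_per_page = [sorted(page) for page in pages]
print(keys_per_page)
```

Because no line depends on any other, the same loop works on a file that is still being written, or on any subset of its lines.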

The downside of using this format is a little extra overhead, because column names are repeated in every line of the file: “url”, “url”, “url”, “h1”, “h1”,… The overhead is not huge, and there are ways to handle it, which we will get to in a separate article.

How the crawl file is created/updated

While crawling, every minute or so, the crawled and parsed pages are saved to the output file. It’s important to know that the new lines are appended to the file.

This means:

  • The crawling process consumes very little memory. Because processed data is periodically dumped into the output file, the crawler only holds a small amount of data in memory at any point in time.
  • Running another crawl, and setting the same output_file will add newly crawled URLs to the end of the same file. So make sure you use a different file path for each crawl (unless you want to use the same file).
  • While crawling you can already read the available lines and start analyzing.
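The effect of appending can be sketched without running a crawl. The helper `append_pages` below is hypothetical and only imitates the append behavior described above by writing JSON lines to a temporary file; it is not how advertools writes its output internally.

```python
import json
import os
import tempfile

import pandas as pd


def append_pages(path, pages):
    """Hypothetical helper: append one JSON object per line to a file."""
    with open(path, "a", encoding="utf-8") as f:
        for page in pages:
            f.write(json.dumps(page) + "\n")


output_file = os.path.join(tempfile.mkdtemp(), "output_file.jl")

# First batch of pages lands in the file; it can already be read and analyzed
append_pages(output_file, [{"url": "https://example.com/A", "status": 200}])
partial_df = pd.read_json(output_file, lines=True)

# A second "crawl" pointed at the same path adds its lines to the end of the file
append_pages(
    output_file,
    [
        {"url": "https://example.com/B", "status": 200},
        {"url": "https://example.com/A", "status": 200},
    ],
)
full_df = pd.read_json(output_file, lines=True)

# URLs crawled more than once show up as repeated rows
duplicated_urls = full_df[full_df["url"].duplicated(keep=False)]
print(len(partial_df), len(full_df))
print(duplicated_urls)
```

If you reuse a file path on purpose, a check like `duplicated_urls` is one way to see which pages were crawled more than once.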

What columns are included

Various columns and types of columns can appear in a crawl file:

Standard columns

These are the columns that you should find in any crawl file, no matter what. Some of these are url, status, crawl_time, and several others.

Variable columns

Since there are no rules on what HTML elements a web page should have, there is no way to know or expect what data will be there. Some pages might contain h2 tags and some might not. Some pages will have JSON-LD, some won’t, and not all servers return the same response headers. In a crawl file where some pages have, and some don’t have, a certain element, that element will be represented as not available NA in the row belonging to the URL where it is missing. For example, URL_1 and URL_3 have an h3 tag, but URL_2 does not. In this case you will see a missing value:

url    h3
URL_1  price: $10
URL_2  NA
URL_3  price: $20

Note that you might see missing values represented as <NA> (nullable integers) or np.nan (floats) as well. This depends on the pandas version you are using and on the other values in the same column, but all of them essentially represent missing values.
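As a small sketch of how such gaps arise and how to find them, the table above can be rebuilt in pandas from records that do not all share the same keys. The placeholder values URL_1 to URL_3 are the ones used in the table, not real crawl data.

```python
import pandas as pd

# URL_2 has no "h3" key at all, just like a page without an h3 tag
records = [
    {"url": "URL_1", "h3": "price: $10"},
    {"url": "URL_2"},
    {"url": "URL_3", "h3": "price: $20"},
]
df = pd.DataFrame(records)

# isna() catches the missing value regardless of how it is displayed
missing_h3 = df[df["h3"].isna()]
print(missing_h3["url"].tolist())
```

Using `isna()` and `notna()` rather than comparing against a specific marker keeps the check independent of how the missing value happens to be represented.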

Custom columns

Through custom extraction, using XPath and/or CSS selectors, you have the flexibility to extract custom data, as well as give those columns any name you want. For example, you might use the following dictionary to extract the price and availability from product pages:

import advertools as adv

# Keys become column names in the crawl file; values are the XPath expressions to extract
xpath = {
    "product_price": "//span[@class='price']/text()",
    "product_availability": "//span[@class='availability']/text()",
}

adv.crawl(
    "https://example.com",
    "output_file.jl",
    follow_links=True,
    xpath_selectors=xpath,
)

After crawling this website, you will have two user-defined custom columns with the names product_price and product_availability. For pages that actually contain a price and availability you will see the data, and for pages that don’t, you will get NA values.

Column categories

There are also several columns that can be thought of as a group, because they are either similar to one another, mean something similar, or form parts of a whole.

Similar

The heading tags h1..h6 are similar to one another, but are independent.

Mean something similar

Response and request headers can be thought of as belonging to this category. All of these columns start with either resp_headers_ or request_headers_.

Forming parts of a whole

Image tags, JSON-LD, redirects, and links fall under this category. A single image can have multiple attributes like src, alt, width and several more. Although these would belong to the same image, they are spread across columns, one for each. Redirects have multiple components as well: the URLs, status codes, and number of redirects. The JSON-LD elements can be quite complicated and nested to multiple levels. Again, each component will have its own column in the crawl file.
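Because grouped columns share a prefix, one way to pull a whole group out of a crawl DataFrame is to select by column-name pattern. The sketch below builds a tiny DataFrame by hand; the exact column names after the prefixes (for example `resp_headers_Server`) are illustrative assumptions, and a real crawl will contain whatever headers and attributes the site actually returned.

```python
import pandas as pd

# Hand-made stand-in for a crawl DataFrame; column names are illustrative
crawl_df = pd.DataFrame(
    {
        "url": ["https://example.com/A", "https://example.com/B"],
        "h1": ["Tutorial A", "Tutorial B"],
        "h2": ["Intro", None],
        "img_src": ["/a.png", "/b.png"],
        "img_alt": ["Chart A", None],
        "resp_headers_Content-Type": ["text/html", "text/html"],
        "resp_headers_Server": ["nginx", "nginx"],
        "request_headers_User-Agent": ["advertools", "advertools"],
    }
)

# Select each group of related columns by its shared naming pattern
resp_headers = crawl_df.filter(regex=r"^resp_headers_")
img_columns = crawl_df.filter(regex=r"^img_")
headings = crawl_df.filter(regex=r"^h[1-6]$")

print(list(resp_headers.columns))
print(list(img_columns.columns))
print(list(headings.columns))
```

Selecting by pattern means you do not need to know in advance which specific headers or image attributes appeared in a given crawl.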

Now let’s see how we handle the case where we have multiple values of the same element on the same page.

Multiple values of the same element on one page

With most HTML elements, you can have more than one appearing on the same page. We typically see multiple images, multiple heading tags, or links on the same page.

Following “tidy data” principles, each column represents data about one, and only one, element. And there should be no other column that contains data for the same element.

In other words, the img_src column has all the img_src elements that appeared on the crawled page(s). Also, there is no other img_src column to be found in the crawl DataFrame.

How multiple elements are represented

Whenever this case happens, you will see multiple elements delimited by two “@” signs @@:

output_file.jl
{"url": "https://example.com/A", "h2": "one@@two@@three"}
{"url": "https://example.com/B", "h2": null}
{"url": "https://example.com/C", "h2": "one@@three@@four@@five"}
{"url": "https://example.com/D", "h2": "nine@@ten"}

After you crawl a website, or a set of URLs, this is what those columns would look like in a DataFrame:

import pandas as pd

crawl_df = pd.read_json("output_file.jl", lines=True)
crawl_df
url                    h2
https://example.com/A  one@@two@@three
https://example.com/B  <NA>
https://example.com/C  one@@three@@four@@five
https://example.com/D  nine@@ten

On page A we have three h2 tags, on page B we don’t have any, then we have four and two h2 tags respectively, on pages C and D.
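To work with the individual values, the delimited strings can be split back into lists. This is a minimal pandas sketch using the same four example pages; the DataFrame is built by hand here rather than read from a crawl file, and the variable names are illustrative.

```python
import pandas as pd

crawl_df = pd.DataFrame(
    {
        "url": [
            "https://example.com/A",
            "https://example.com/B",
            "https://example.com/C",
            "https://example.com/D",
        ],
        "h2": ["one@@two@@three", None, "one@@three@@four@@five", "nine@@ten"],
    }
)

# Split each delimited string into a list; missing values stay missing
h2_lists = crawl_df["h2"].str.split("@@")

# Number of h2 tags per page, with 0 for pages that have none
h2_counts = h2_lists.str.len().fillna(0).astype(int)

# Long format: one row per (url, h2) pair
h2_long = crawl_df.assign(h2=h2_lists).explode("h2")

print(h2_counts.tolist())
print(h2_long)
```

The long format is convenient for counting how often a given heading text appears across the whole site, while the per-page counts are useful for spotting outliers.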

Interpretation of having multiple elements on the same page

Having more than one element per page can be:

  • An error: In some cases, like the canonical link, which by definition needs to point to a single URL, having more than one on a single page is an issue. The same goes for the title tag and meta description, of which there should be one per URL.
  • A bad practice: Generally we should have a single h1 tag on a page, to make it clear what the most important topic is. It’s not a technical issue (won’t crash your pages), but it’s good to keep it to one instance per page.
  • Absolutely meaningless: In most cases, it doesn’t mean anything if you have multiple occurrences of the same element. Most pages have multiple images, links, h2, h3, and various other tags, and this has no meaning at all.

You might, however, check the counts of tags, as these might give some hints. If you find that a page has a thousand images, it could be a bug causing that to happen, or just bad usability.
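These checks can be sketched in pandas by looking for the `@@` delimiter. The DataFrame below is a hand-made example, and the choice of `title` and `h1` as the columns that should hold a single value is an assumption for illustration; adjust the list to the elements you care about.

```python
import pandas as pd

# Hand-made example; values are illustrative, not from a real crawl
crawl_df = pd.DataFrame(
    {
        "url": [
            "https://example.com/A",
            "https://example.com/B",
            "https://example.com/C",
        ],
        "title": ["Page A", "Page B@@Page B again", "Page C"],
        "h1": ["Tutorial A", None, "Intro@@Tutorial C"],
        "img_src": ["/a.png@@/b.png@@/c.png", None, "/d.png"],
    }
)

# Columns where more than one value per page deserves a closer look
should_be_single = ["title", "h1"]

# True wherever the delimiter appears, meaning the page has multiple values
has_multiple = crawl_df[should_be_single].apply(
    lambda col: col.str.contains("@@", regex=False, na=False)
)
flagged_urls = crawl_df.loc[has_multiple.any(axis=1), "url"].tolist()

# For elements where multiples are normal, counts can still reveal outliers
img_counts = crawl_df["img_src"].str.count("@@").add(1).fillna(0).astype(int)

print(flagged_urls)
print(img_counts.tolist())
```

The first check targets the error and bad-practice cases, while the counts help find pages with unusually many images or links.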

This was an overview of how the crawl file is structured, and how you can understand it. I’ll share more practical tips and recipes for various use-cases, but this was intentionally more theoretical, and geared toward building an understanding of the crawl file.

Downsides of this format

An issue that comes out of this flexibility is that we pay for it with extra storage. If we want to flexibly and independently crawl websites, having a little extra information repeated on each row is what enables that. There are several solutions to this, like converting the crawl file to an efficient format like parquet, reading a subset of the columns, and reading sequentially in chunks (small big data). Check out the advertools.crawlytics module for more details on how to handle these issues.
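As a rough illustration of two of these ideas, reading in chunks and keeping only a subset of columns, here is a sketch that uses plain pandas rather than the advertools.crawlytics helpers. The small file is written to a temporary path just for the example, and the chunk size and column choice are arbitrary.

```python
import json
import os
import tempfile

import pandas as pd

# Write a small stand-in crawl file; a real one would come from a crawl
pages = [
    {"url": "https://example.com/A", "status": 200, "h1": "Tutorial A"},
    {"url": "https://example.com/B", "status": 200, "h2": "one@@two"},
    {"url": "https://example.com/C", "status": 404},
    {"url": "https://example.com/D", "status": 200, "h1": "Tutorial D"},
]
crawl_path = os.path.join(tempfile.mkdtemp(), "output_file.jl")
with open(crawl_path, "w", encoding="utf-8") as f:
    for page in pages:
        f.write(json.dumps(page) + "\n")

wanted_columns = ["url", "status"]

# Read two lines at a time so only a small piece is in memory at once.
# reindex keeps just the wanted columns, and would add an all-missing
# column if a chunk happened not to contain one of them.
with pd.read_json(crawl_path, lines=True, chunksize=2) as reader:
    subset_df = pd.concat(
        (chunk.reindex(columns=wanted_columns) for chunk in reader),
        ignore_index=True,
    )

print(subset_df)
```

For large crawl files the same pattern applies with a much larger `chunksize`, and each chunk can be filtered or aggregated before it is combined with the rest.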