advertools v0.17.0 New Features

crawling
scraping
advertools
markdown
Release notes and how to use the new features in the new release.
Published

June 13, 2025

New Features

This release brings three main additions to advertools that can help in various ways:

  • A way to restrict which columns get saved to the output file while crawling
  • Converting a crawled website to markdown format
  • Text partitioning functionality powered by regex

How to control which columns are kept or discarded while crawling

Why?

Crawls of very large websites can take up a huge amount of disk space. In some cases the output file becomes so big that you can’t even open it, and/or you simply don’t need all those columns for your specific use case. While there are ways to handle large crawl files after the fact, this feature prevents the problem from happening in the first place.

Another reason you might want to use this is when you are crawling purely for data collection purposes. Maybe you just want the product name, price, description, and availability information, for example.

How to control which columns are kept/discarded while crawling

This feature comes as two new parameters to the crawl function: keep_columns and discard_columns. You can simply specify the columns that you want to keep this way:

import advertools as adv
import pandas as pd

adv.crawl(
    url_list="https://example.com",
    output_file="ourput_file.jsonl",
    follow_links=True,
    keep_columns=["h1", "size", "status", "body_text"],
)

In this case your crawl output file will only have the four columns that were supplied: ["h1", "size", "status", "body_text"]. Keep in mind that you will always have the url column as well, and an errors column if any errors occur, so you always know which pages the values refer to.

crawldf = pd.read_json("output_file.jsonl", lines=True)
crawldf
url h1 body_text size status
0 https://example.com Example Domain \n Example Domain \n This domain is fo... 1256 200

As you can see, our crawl DataFrame only has the columns we specified, together with url.

How to use regex to flexibly select which columns are kept/discarded while crawling

A crawl output file typically contains sets of columns that either belong to the same element or are similar to one another. For example:

  • heading tags h1-h6
  • JSON-LD tags
  • Response/request headers
  • Image attributes

With these columns you usually can’t know in advance what will be available on the crawled pages (which JSON-LD tags are they using?), so you can simply specify them as a regular expression, e.g. jsonld_.

You can also discard sets of columns using regular expressions, with the same logic.

How to keep all response headers except for ones we don’t want

Here is an example where we want to get the response headers (without knowing in advance which ones the server will return), but we know that we don’t want ["resp_headers_Cache-Control", "resp_headers_Vary"], for example.

adv.crawl(
    url_list="https://example.com",
    output_file="headers.jsonl",
    follow_links=True,
    # list items are evaluated as regular expressions:
    keep_columns=["h1", "size", "status", "resp_headers"],
    # list items are evaluated as regular expressions:
    discard_columns=["resp_headers_Cache-Control", "resp_headers_Vary"]
)
headers = pd.read_json("headers.jsonl", lines=True)

Check if the two discarded columns are included:

"resp_headers_Cache-Control" in headers
False
"resp_headers_Vary" in headers
False

Make sure other response headers are included:

headers.filter(regex="resp_headers")
resp_headers_Content-Length resp_headers_Accept-Ranges resp_headers_Content-Type resp_headers_Etag resp_headers_Last-Modified resp_headers_Date
0 648 bytes text/html "84238dfc8092e5d9c0dac8ef93371a07:1736799080.1... Mon, 13 Jan 2025 20:11:20 GMT Fri, 13 Jun 2025 11:32:33 GMT

Keep/discard crawl columns summary

With the regex flexibility we can use very powerful and flexible combinations to keep/discard columns.

  • You want to keep JSON-LD tags without knowing how many will be available on each page, but you know which ones you want to discard.
  • You want to extract all heading columns except for h5 and h6 (see the sketch after this list).
  • The discard_columns parameter overrides keep_columns. Keep this in mind when setting the options.
  • The body_text column is typically the largest one, and can take up 20-40% of the crawl file size. Discarding it might give you the largest reduction in file size if you don’t need the body text.
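
As an illustration of the headings example above, here is a minimal sketch (the file name is just a placeholder) that keeps all heading columns via a regex and then drops h5 and h6:

adv.crawl(
    url_list="https://example.com",
    output_file="headings.jsonl",
    follow_links=True,
    # regex: matches h1 through h6
    keep_columns=[r"h\d"],
    # discard overrides keep, so h5 and h6 are dropped
    discard_columns=["h5", "h6"],
)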

How to convert HTML pages to markdown in bulk

A new function is now available that converts all crawled URLs in a crawl DataFrame to markdown.

How does the HTML to markdown conversion happen

  • From the crawl DataFrame, the body_text column as well as all heading tag columns are selected.
  • Every heading tag’s text is located within the body_text string.
  • Heading tags are converted to their markdown counterparts: # for h1, ## for h2, and so on.
  • Newlines are inserted between headings and the text they separate.
  • The markdown strings are returned as a list.

To illustrate, assume we have this body text string with the following heading tags:

body_text = "First heading Today we will be talking about this topic. Second heading I hope you liked today's topic."

h1 = "First heading"
h2 = "Second heading"

After conversion the markdown string will look like this:

# First heading

Today we will be talking about this topic.

## Second heading

I hope you liked today's topic.
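
For intuition, here is a minimal sketch of that substitution logic using the body_text, h1, and h2 variables above (illustrative only, not the library’s actual implementation):

# Replace each heading occurrence with its markdown counterpart,
# surrounded by blank lines, as described in the steps above.
markdown = body_text
for level, heading in [(1, h1), (2, h2)]:
    markdown = markdown.replace(heading, f"\n\n{'#' * level} {heading}\n\n", 1)
print(markdown.strip())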

Let’s see how it works with our crawled URL:

crawldf.filter(regex=r"h\d|body_text")
h1 body_text
0 Example Domain \n Example Domain \n This domain is fo...
adv.crawlytics.generate_markdown(crawldf)
['# Example Domain\n\nThis domain is for use in illustrative examples in documents. You may use this\n    domain in literature without prior coordination or asking for permission. \n     More information...']

Printing for better readability:

print(adv.crawlytics.generate_markdown(crawldf)[0])
# Example Domain

This domain is for use in illustrative examples in documents. You may use this
    domain in literature without prior coordination or asking for permission. 
     More information...

HTML to markdown conversion example with real pages

Let’s crawl two pages from this website, and see how their markdown strings can be created.

adv.crawl(
    url_list=["https://adver.tools/seo-crawler/", "https://adver.tools/xml-sitemaps/"],
    output_file="crawler_sitemaps.jsonl")
crawler_sitemaps = pd.read_json("crawler_sitemaps.jsonl", lines=True)
crawler_sitemaps.filter(regex=r"h\d|body_text")
h1 h2 h4 body_text h3
0 SEO Crawler SEO Crawler Features@@Crawling features@@Crawl... Free up to five thousand URLs@@Custom extracti... \n \n SEO Crawler \n \n\n \n \n ... NaN
1 @@Download, Extract, and Parse XML Sitemaps Analyze XML sitemaps in bulk and gain immediat... Lastmod cumulative distribution and histogram ... \n \n \n \n\n \n \n \n Downl... XML sitemap types supported@@Extracted data@@E...

Markdown for the first page

crawler_sitemaps_md = adv.crawlytics.generate_markdown(crawler_sitemaps)
print(crawler_sitemaps_md[0])
# SEO Crawler

Loading...

## SEO Crawler Features

## Crawling features

#### Free up to five thousand URLs

You can also set a lower limit for exploratory purposes.

#### Custom extraction with XPath and/or CSS selectors

For custom extraction you need to enter two things in the tables above: 
- Column Name: This should be a descriptive name for the page elements you want to exract. Examples could be "product_price", "blog_author", et. 
- XPtah/CSS Selector: This is the selector pattern for the element(s) you want to extract.

#### Setting any User-agent

The names are given using human-readable device names like "iPhone 13", "Samsung Galaxy S22", etc. 
You can also get any user-agent and paste it in the input box, so don't need to only use the ones provided.

#### Spider and/or list mode

To activate spider mode you just need to select the "Follow links" checkbox. Add more than one URL to run in list mode. You can combine both of course.

#### Include/exclude URL parameters

When encountering new links, should the crawler follow them if they (don't) contain the URL parameters that you chose?

#### Include/exclude URL regular expression

Similar to the above but using a regex to match links.

## Crawl analytics features

Once you hit the  Start crawling  button, you will be given a URL where you can start auditing and analyzing the crawled website. The following are some of the available feature, together with a link to explore a live dashboard.

#### Visualize the URL structure

Using an interactive treemap chart you can see how the website's content is split. 

Get the count of URLs for each directory of URLs  /blog/ ,  /sports/ , etc. 
Get the percentages of each of the directories. 
Beyond ten directories the chart might look really cluttered, so all other directories are displayed under their own "Others" segment.

#### Structured data overview (JSON-LD, OpenGraph, and Twitter)

For each of the above structured data types, you can see the count and percentage of URLs that contain each tag of each type. For example, for the  @context  JSON-LD tag, you can see how many URLs contain it, their percentage of the crawled URLs.

#### Filter and export data based on whichever URLs you want

Once you see an interesting insight, you can export a subset of URLs and columns. For example, get the URL, title, and status of URLs whos status code is not 200. Get the URL, h1, and size, of pages whos size is larger than 300KB, and so on.

#### Count page elements' duplications if any (title, h1, h2, etc.)

For a selected page element get the counts of each element on the website. How many times was each h2, meta_desc, etc. element duplicated across the website?

#### N-gram analysis for any page element (set stop words, and select from 40 languages)

Select unigrams, bigrams, or trigrams 
Select the page element to analyze (title, body_text, h3, etc) 
Use editable default stopwords 
Chose stopwords from forty languages

#### Link analysis

External links : See which domains the crawled website links to the most. 
 Internal links : Get a score for each internal URL as a node in the network (website): 
 
 Links: Total number of links to and from the page. 
 Inlinks: Total incoming links to the page. 
 Outlinks: Total outgoing links to the page. 
 Degree centrality: The percentage of internal URLs that are connected to this URL. 
 Page rank: The PageRank 
 
 Note that it is up to you to define what "internal" means using a regular expression. 
 For an example of those features you can explore this  crawl audit and analytics dashboard

Markdown for the second page

print(crawler_sitemaps_md[1])
# Download, Extract, and Parse XML Sitemaps

Enter a sitemap URL and start analyzing immediately. See publishing trends, website structure, and export the full data to a CSV file. 

Loading...

## Analyze XML sitemaps in bulk and gain immediate insights

This app supports any kind of XML sitemap URL.

### XML sitemap types supported

### Extracted data

The following columns will typically be included in the converted sitemap: 

loc : The URLs (locations)  <loc> 
lastmod : If available, the  <lastmod>  tag as a datetime object 
sitemap : The URL of the sitemap to which this URL belongs 
sitemap_size_mb : Self explanatory 
download_date : When the sitemap was downloaded, so you can compare later 
sitemap_last_modified : If declared by the server, you also get this response header. 
ETag : If declared by the server you also get the  ETag  of the sitemap. 

Other columns could also be available, depending on the sitemap. For example, if the sitemap contained images, it might contain the following columns: 

image 
image_loc 
image_title 
image_caption

### Example Charts for XML Sitemaps

#### Lastmod cumulative distribution and histogram chart

#### URL structure treemap chart - first level

#### URL structure treemap chart - second level

HTML to markdown conversion summary and caveats

  • Headings and body_text are used to construct the markdown text.
  • No other elements are extracted (links, bullet points, etc.).
  • Repeated headings at the same level are handled properly (e.g. two h2 tags that have the same text).
  • Repeated headings at different levels will probably result in one of them being misplaced. If you have “click here” in two places in the document, once as an h1 and once as an h2, one of them will be incorrectly placed in the resulting string. Please get in touch if you have a solution.
  • In upcoming releases links might be supported as well.

Text partitioning with regular expressions

Another feature that can help in this workflow is flexible text partitioning. The main difference between splitting and partitioning is that splitting removes the splitting characters, while partitioning keeps them. The Python standard library’s str class already has a partition method, but it has two main limitations:

  • It supports partitioning only once
  • It does not support partitioning by regular expression

Here is how it works:

text = "# Heading 1\n\nSome text\n\n## Heading 2\n\nAnother text"
print(text.partition('#'))
('', '#', ' Heading 1\n\nSome text\n\n## Heading 2\n\nAnother text')

The text was correctly partitioned on the “#” character, but only once. In our case we would like to partition on any markdown heading, without knowing its level, and we want to do it multiple times in the same document.

import re
adv.partition(text, r"^#+ .*", flags=re.MULTILINE)
['# Heading 1', 'Some text', '## Heading 2', 'Another text']

Printing the lines for readability:

print(*adv.partition(text, r"^#+ .*", flags=re.MULTILINE), sep="\n")
# Heading 1
Some text
## Heading 2
Another text

Keep in mind that you might need to supply regex flags to the function. In the previous example I used re.MULTILINE so that ^ matches at the start of every line, not only at the beginning of the whole string.
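
For intuition, a similar result can be produced with re.split and a capturing group, which keeps the matched separators (a conceptual sketch, not necessarily how adv.partition works internally):

import re

# The capturing group keeps the matched headings in the result;
# empty and whitespace-only pieces are then filtered out.
parts = re.split(r"(^#+ .*)", text, flags=re.MULTILINE)
print([part.strip() for part in parts if part.strip()])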

Partitioning a markdown string

Here is a real-life example of how this works:

md_parts = adv.partition(
    text=crawler_sitemaps_md[0],
    regex=f"^#+ .*",
    flags=re.MULTILINE
)
md_parts
['# SEO Crawler',
 'Loading...',
 '## SEO Crawler Features',
 '## Crawling features',
 '#### Free up to five thousand URLs',
 'You can also set a lower limit for exploratory purposes.',
 '#### Custom extraction with XPath and/or CSS selectors',
 'For custom extraction you need to enter two things in the tables above: \n- Column Name: This should be a descriptive name for the page elements you want to exract. Examples could be "product_price", "blog_author", et. \n- XPtah/CSS Selector: This is the selector pattern for the element(s) you want to extract.',
 '#### Setting any User-agent',
 'The names are given using human-readable device names like "iPhone 13", "Samsung Galaxy S22", etc. \nYou can also get any user-agent and paste it in the input box, so don\'t need to only use the ones provided.',
 '#### Spider and/or list mode',
 'To activate spider mode you just need to select the "Follow links" checkbox. Add more than one URL to run in list mode. You can combine both of course.',
 '#### Include/exclude URL parameters',
 "When encountering new links, should the crawler follow them if they (don't) contain the URL parameters that you chose?",
 '#### Include/exclude URL regular expression',
 'Similar to the above but using a regex to match links.',
 '## Crawl analytics features',
 'Once you hit the  Start crawling  button, you will be given a URL where you can start auditing and analyzing the crawled website. The following are some of the available feature, together with a link to explore a live dashboard.',
 '#### Visualize the URL structure',
 'Using an interactive treemap chart you can see how the website\'s content is split. \n\nGet the count of URLs for each directory of URLs  /blog/ ,  /sports/ , etc. \nGet the percentages of each of the directories. \nBeyond ten directories the chart might look really cluttered, so all other directories are displayed under their own "Others" segment.',
 '#### Structured data overview (JSON-LD, OpenGraph, and Twitter)',
 'For each of the above structured data types, you can see the count and percentage of URLs that contain each tag of each type. For example, for the  @context  JSON-LD tag, you can see how many URLs contain it, their percentage of the crawled URLs.',
 '#### Filter and export data based on whichever URLs you want',
 'Once you see an interesting insight, you can export a subset of URLs and columns. For example, get the URL, title, and status of URLs whos status code is not 200. Get the URL, h1, and size, of pages whos size is larger than 300KB, and so on.',
 "#### Count page elements' duplications if any (title, h1, h2, etc.)",
 'For a selected page element get the counts of each element on the website. How many times was each h2, meta_desc, etc. element duplicated across the website?',
 '#### N-gram analysis for any page element (set stop words, and select from 40 languages)',
 'Select unigrams, bigrams, or trigrams \nSelect the page element to analyze (title, body_text, h3, etc) \nUse editable default stopwords \nChose stopwords from forty languages',
 '#### Link analysis',
 'External links : See which domains the crawled website links to the most. \n Internal links : Get a score for each internal URL as a node in the network (website): \n \n Links: Total number of links to and from the page. \n Inlinks: Total incoming links to the page. \n Outlinks: Total outgoing links to the page. \n Degree centrality: The percentage of internal URLs that are connected to this URL. \n Page rank: The PageRank \n \n Note that it is up to you to define what "internal" means using a regular expression. \n For an example of those features you can explore this  crawl audit and analytics dashboard']

Now that we have the markdown representation of each crawled page, partitioned into a list, we can extract the chunks of content that each page contains.

One way to define a chunk is to take each heading together with its subsequent text and consider that a chunk. You are free to chunk using another approach, of course.

How to get chunks of content from a markdown string

The following function implements this approach. It takes every heading together with its subsequent text and creates a chunk out of them.

def get_markdown_chunks(md_list):
    """Group a partitioned markdown list into (heading + following text) chunks."""
    chunks = []
    current_chunk = []

    for item in md_list:
        if item.strip().startswith("#"):
            # A heading starts a new chunk; close the previous one if any.
            if current_chunk:
                chunks.append(current_chunk)
            current_chunk = [item]
        else:
            if current_chunk:
                # Text following a heading belongs to the current chunk.
                current_chunk.append(item)
            else:
                # Text appearing before any heading becomes its own chunk.
                current_chunk = [item]
                chunks.append(current_chunk)
                current_chunk = []
    if current_chunk:
        chunks.append(current_chunk)

    return chunks
chunks = get_markdown_chunks(md_parts)

for i, chunk in enumerate(chunks, 1):
    print(f"Chunk: {i}")
    print(*chunk, sep="\n")
    print('-----\n')
Chunk: 1
# SEO Crawler
Loading...
-----

Chunk: 2
## SEO Crawler Features
-----

Chunk: 3
## Crawling features
-----

Chunk: 4
#### Free up to five thousand URLs
You can also set a lower limit for exploratory purposes.
-----

Chunk: 5
#### Custom extraction with XPath and/or CSS selectors
For custom extraction you need to enter two things in the tables above: 
- Column Name: This should be a descriptive name for the page elements you want to exract. Examples could be "product_price", "blog_author", et. 
- XPtah/CSS Selector: This is the selector pattern for the element(s) you want to extract.
-----

Chunk: 6
#### Setting any User-agent
The names are given using human-readable device names like "iPhone 13", "Samsung Galaxy S22", etc. 
You can also get any user-agent and paste it in the input box, so don't need to only use the ones provided.
-----

Chunk: 7
#### Spider and/or list mode
To activate spider mode you just need to select the "Follow links" checkbox. Add more than one URL to run in list mode. You can combine both of course.
-----

Chunk: 8
#### Include/exclude URL parameters
When encountering new links, should the crawler follow them if they (don't) contain the URL parameters that you chose?
-----

Chunk: 9
#### Include/exclude URL regular expression
Similar to the above but using a regex to match links.
-----

Chunk: 10
## Crawl analytics features
Once you hit the  Start crawling  button, you will be given a URL where you can start auditing and analyzing the crawled website. The following are some of the available feature, together with a link to explore a live dashboard.
-----

Chunk: 11
#### Visualize the URL structure
Using an interactive treemap chart you can see how the website's content is split. 

Get the count of URLs for each directory of URLs  /blog/ ,  /sports/ , etc. 
Get the percentages of each of the directories. 
Beyond ten directories the chart might look really cluttered, so all other directories are displayed under their own "Others" segment.
-----

Chunk: 12
#### Structured data overview (JSON-LD, OpenGraph, and Twitter)
For each of the above structured data types, you can see the count and percentage of URLs that contain each tag of each type. For example, for the  @context  JSON-LD tag, you can see how many URLs contain it, their percentage of the crawled URLs.
-----

Chunk: 13
#### Filter and export data based on whichever URLs you want
Once you see an interesting insight, you can export a subset of URLs and columns. For example, get the URL, title, and status of URLs whos status code is not 200. Get the URL, h1, and size, of pages whos size is larger than 300KB, and so on.
-----

Chunk: 14
#### Count page elements' duplications if any (title, h1, h2, etc.)
For a selected page element get the counts of each element on the website. How many times was each h2, meta_desc, etc. element duplicated across the website?
-----

Chunk: 15
#### N-gram analysis for any page element (set stop words, and select from 40 languages)
Select unigrams, bigrams, or trigrams 
Select the page element to analyze (title, body_text, h3, etc) 
Use editable default stopwords 
Chose stopwords from forty languages
-----

Chunk: 16
#### Link analysis
External links : See which domains the crawled website links to the most. 
 Internal links : Get a score for each internal URL as a node in the network (website): 
 
 Links: Total number of links to and from the page. 
 Inlinks: Total incoming links to the page. 
 Outlinks: Total outgoing links to the page. 
 Degree centrality: The percentage of internal URLs that are connected to this URL. 
 Page rank: The PageRank 
 
 Note that it is up to you to define what "internal" means using a regular expression. 
 For an example of those features you can explore this  crawl audit and analytics dashboard
-----

We have now come full circle (the whole pipeline is sketched as code after this list):

  • We crawled a website, optionally keeping only the columns we want, for example r"h\d|body_text" to keep heading and body text only
  • The crawl function automatically extracts the body text of the crawled pages for us
  • We converted its body text to markdown using adv.crawlytics.generate_markdown
  • We partitioned the text using its headings as a regex using adv.partition
  • We extracted chunks of content (heading + subsequent text)
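
Here is the whole pipeline in one place, as a sketch that reuses the calls shown above (file names are illustrative, and get_markdown_chunks is the helper defined earlier):

import re

import advertools as adv
import pandas as pd

# 1. Crawl, keeping only heading and body text columns (url is always kept).
adv.crawl(
    url_list="https://example.com",
    output_file="crawl.jsonl",
    follow_links=True,
    keep_columns=[r"h\d", "body_text"],
)

# 2. Read the crawl file and convert each page to a markdown string.
crawldf = pd.read_json("crawl.jsonl", lines=True)
md_list = adv.crawlytics.generate_markdown(crawldf)

# 3. Partition each page's markdown on its headings, then chunk it.
page_chunks = [
    get_markdown_chunks(adv.partition(md, r"^#+ .*", flags=re.MULTILINE))
    for md in md_list
]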

Many things can be done with those chunks, but the important thing is that we can now evaluate them independently, which can help answer some questions about our content (one possible approach is sketched after this list):

  • How coherent/similar are the chunks of the same article?
  • Which chunks are most similar to chunks from other pages?
  • Find clusters of similar chunks across the website
  • Enable search by chunk for the site
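
As an example of evaluating chunks independently, here is a minimal sketch using scikit-learn (not part of advertools, and just one possible approach) that compares the chunks extracted above with TF-IDF vectors and cosine similarity:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Join each chunk's lines into a single string.
chunk_texts = ["\n".join(chunk) for chunk in chunks]

# TF-IDF vectors and pairwise cosine similarities between chunks.
tfidf = TfidfVectorizer(stop_words="english").fit_transform(chunk_texts)
similarities = cosine_similarity(tfidf)

# similarities[i, j] is the similarity between chunk i and chunk j.
print(similarities.round(2))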

We are now moving from the classical “bag of words” approach to a “bag of chunks” approach.

These features have just been released, and any feedback or bug reports would be greatly appreciated. Feel free to submit an issue if you run into any problems.