advertools v0.17.0 New Features
This release brings three main additions to advertools that can help in various ways:
- A way to restrict which columns get saved to the output file while crawling
- Converting a crawled website to markdown format
- Text partitioning functionality powered by regex
How to control which columns are kept or discarded while crawling
Why?
Some crawls of very large websites can take up a huge amount of disk space. In some cases the file can get so big that you can't even open it, and/or you might not need all those columns for your specific use case. While there are some solutions for handling large crawl files, this feature prevents the problem from happening in the first place.
Another reason you might want to use this is when you are crawling purely for data collection purposes. Maybe you just want the product name, price, description, and availability information, for example.
How to control which columns are kept/discarded while crawling
This feature comes as two new parameters to the crawl function: keep_columns and discard_columns. You can simply specify the columns that you want to keep this way:
import advertools as adv
import pandas as pd

adv.crawl(
    url_list="https://example.com",
    output_file="output_file.jsonl",
    follow_links=True,
    keep_columns=["h1", "size", "status", "body_text"],
)
In this case your crawl output file will only have the four columns that were supplied: ["h1", "size", "status", "body_text"]. Keep in mind that you will always have the url column available, as well as an errors column if any errors occur. This is to make sure you know which pages those columns refer to.
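Reading the output file back into a DataFrame might look like this (crawl_df is just an assumed variable name):

crawl_df = pd.read_json("output_file.jsonl", lines=True)
crawl_df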
 | url | h1 | body_text | size | status
---|---|---|---|---|---
0 | https://example.com | Example Domain | \n Example Domain \n This domain is fo... | 1256 | 200

As you can see, our crawl DataFrame only has the columns we specified, together with url.
How to use regex to flexibly select which columns are kept/discarded while crawling
A crawl output file typically contains sets of columns that either belong to the same element or are similar to one another. For example:
- heading tags h1-h6
- JSON-LD tags
- Response/request headers
- Image attributes
With these columns you usually can't know in advance what will be available on the crawled pages (which JSON-LD tags are they using?), so you can simply specify them as a regular expression, e.g. jsonld_.
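For example, a minimal sketch (the output file name is illustrative) that keeps all JSON-LD columns, whatever they turn out to be:

adv.crawl(
    url_list="https://example.com",
    output_file="jsonld_crawl.jsonl",  # illustrative file name
    follow_links=True,
    # evaluated as a regular expression, so this keeps every column whose
    # name matches "jsonld_" (jsonld_@context, jsonld_@type, and so on)
    keep_columns=["h1", "jsonld_"],
)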
You can also discard sets of columns using regular expression with the same logic.
How to keep all response headers except for ones we don’t want
Here is an example where we want to get the response headers (we can't know which ones the server will return), but we know that we don't want ["resp_headers_Cache-Control", "resp_headers_Vary"], for example.
adv.crawl(
    url_list="https://example.com",
    output_file="headers.jsonl",
    follow_links=True,
    # list items are evaluated as regular expressions:
    keep_columns=["h1", "size", "status", "resp_headers"],
    # list items are evaluated as regular expressions:
    discard_columns=["resp_headers_Cache-Control", "resp_headers_Vary"],
)
headers = pd.read_json("headers.jsonl", lines=True)
Check if the two discarded columns are included:
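A quick way to verify (a sketch; the exact check in the original notebook may differ):

[col for col in ["resp_headers_Cache-Control", "resp_headers_Vary"] if col in headers.columns]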
Make sure other response headers are included:
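One way to inspect the remaining response header columns (a sketch):

headers.filter(regex="resp_headers").head(1)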
 | resp_headers_Content-Length | resp_headers_Accept-Ranges | resp_headers_Content-Type | resp_headers_Etag | resp_headers_Last-Modified | resp_headers_Date
---|---|---|---|---|---|---
0 | 648 | bytes | text/html | "84238dfc8092e5d9c0dac8ef93371a07:1736799080.1... | Mon, 13 Jan 2025 20:11:20 GMT | Fri, 13 Jun 2025 11:32:33 GMT
Keep/discard crawl columns summary
With the regex flexibility we can use very powerful combinations to keep/discard columns.
- You want JSON-LD tags to be kept, you don't know how many will be available on each page, but you know which ones you want to discard.
- You want to extract all heading tags except for h5 and h6 (see the sketch after this list).
- The discard_columns parameter overrides keep_columns. Keep this in mind when setting the options.
- The body_text column is typically the largest one, and can take up 20-40% of the crawl file size. Discarding it might get you the largest reduction in file size if you don't need the body text.
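As a sketch of the second scenario above (the output file name is illustrative), keeping all heading columns except h5 and h6:

adv.crawl(
    url_list="https://example.com",
    output_file="headings.jsonl",  # illustrative file name
    follow_links=True,
    keep_columns=[r"h\d"],         # matches h1 through h6
    discard_columns=["h5", "h6"],  # discard overrides keep
)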
How to convert HTML pages to markdown in bulk
A new function, adv.crawlytics.generate_markdown, is now available. It converts all crawled URLs in a crawl DataFrame to markdown.
How does the HTML to markdown conversion happen?
- From the crawl DataFrame, the body_text column as well as all heading tag columns are selected.
- Every heading tag is located in the body_text string.
- Heading tags are converted to their markdown counterparts: # for h1, ## for h2, and so on.
- Newlines are inserted between headings and the text they split.
- The markdown strings are returned as a list.
To illustrate, assume we have this body text string with the following heading tags:
body_text = "First heading Today we will be talking about this topic. Second heading I hope you liked today's topic."
h1 = "First heading"
h2 = "Second heading"
After conversion the markdown string will look like this:
# First heading
Today we will be talking about this topic.
## Second heading
I hope you liked today's topic.
Let’s see how it works with our crawled URL:
 | h1 | body_text
---|---|---
0 | Example Domain | \n Example Domain \n This domain is fo...
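The conversion call might look like this (crawl_df is the DataFrame we read earlier, and md_list is just an assumed name):

md_list = adv.crawlytics.generate_markdown(crawl_df)
md_list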
['# Example Domain\n\nThis domain is for use in illustrative examples in documents. You may use this\n domain in literature without prior coordination or asking for permission. \n More information...']
Printing for better readability:
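Using the md_list name from the sketch above:

print(md_list[0])

# Example Domain

This domain is for use in illustrative examples in documents. You may use this
 domain in literature without prior coordination or asking for permission. 
 More information...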
HTML to markdown conversion example with real pages
Let’s crawl two pages from this website, and see how their markdown strings can be created.
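The crawl producing that file might look something like this sketch (the two URLs are placeholders; only the output file name comes from the code below):

adv.crawl(
    url_list=[
        "https://example.com/seo-crawler",   # placeholder URL
        "https://example.com/xml-sitemaps",  # placeholder URL
    ],
    output_file="crawler_sitemaps.jsonl",
)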
crawler_sitemaps = pd.read_json("crawler_sitemaps.jsonl", lines=True)
crawler_sitemaps.filter(regex=r"h\d|body_text")
 | h1 | h2 | h4 | body_text | h3
---|---|---|---|---|---
0 | SEO Crawler | SEO Crawler Features@@Crawling features@@Crawl... | Free up to five thousand URLs@@Custom extracti... | \n \n SEO Crawler \n \n\n \n \n ... | NaN
1 | @@Download, Extract, and Parse XML Sitemaps | Analyze XML sitemaps in bulk and gain immediat... | Lastmod cumulative distribution and histogram ... | \n \n \n \n\n \n \n \n Downl... | XML sitemap types supported@@Extracted data@@E...
Markdown for the first page
crawler_sitemaps_md = adv.crawlytics.generate_markdown(crawler_sitemaps)
print(crawler_sitemaps_md[0])
# SEO Crawler
Loading...
## SEO Crawler Features
## Crawling features
#### Free up to five thousand URLs
You can also set a lower limit for exploratory purposes.
#### Custom extraction with XPath and/or CSS selectors
For custom extraction you need to enter two things in the tables above:
- Column Name: This should be a descriptive name for the page elements you want to exract. Examples could be "product_price", "blog_author", et.
- XPtah/CSS Selector: This is the selector pattern for the element(s) you want to extract.
#### Setting any User-agent
The names are given using human-readable device names like "iPhone 13", "Samsung Galaxy S22", etc.
You can also get any user-agent and paste it in the input box, so don't need to only use the ones provided.
#### Spider and/or list mode
To activate spider mode you just need to select the "Follow links" checkbox. Add more than one URL to run in list mode. You can combine both of course.
#### Include/exclude URL parameters
When encountering new links, should the crawler follow them if they (don't) contain the URL parameters that you chose?
#### Include/exclude URL regular expression
Similar to the above but using a regex to match links.
## Crawl analytics features
Once you hit the Start crawling button, you will be given a URL where you can start auditing and analyzing the crawled website. The following are some of the available feature, together with a link to explore a live dashboard.
#### Visualize the URL structure
Using an interactive treemap chart you can see how the website's content is split.
Get the count of URLs for each directory of URLs /blog/ , /sports/ , etc.
Get the percentages of each of the directories.
Beyond ten directories the chart might look really cluttered, so all other directories are displayed under their own "Others" segment.
#### Structured data overview (JSON-LD, OpenGraph, and Twitter)
For each of the above structured data types, you can see the count and percentage of URLs that contain each tag of each type. For example, for the @context JSON-LD tag, you can see how many URLs contain it, their percentage of the crawled URLs.
#### Filter and export data based on whichever URLs you want
Once you see an interesting insight, you can export a subset of URLs and columns. For example, get the URL, title, and status of URLs whos status code is not 200. Get the URL, h1, and size, of pages whos size is larger than 300KB, and so on.
#### Count page elements' duplications if any (title, h1, h2, etc.)
For a selected page element get the counts of each element on the website. How many times was each h2, meta_desc, etc. element duplicated across the website?
#### N-gram analysis for any page element (set stop words, and select from 40 languages)
Select unigrams, bigrams, or trigrams
Select the page element to analyze (title, body_text, h3, etc)
Use editable default stopwords
Chose stopwords from forty languages
#### Link analysis
External links : See which domains the crawled website links to the most.
Internal links : Get a score for each internal URL as a node in the network (website):
Links: Total number of links to and from the page.
Inlinks: Total incoming links to the page.
Outlinks: Total outgoing links to the page.
Degree centrality: The percentage of internal URLs that are connected to this URL.
Page rank: The PageRank
Note that it is up to you to define what "internal" means using a regular expression.
For an example of those features you can explore this crawl audit and analytics dashboard
Markdown for the second page
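Printing the markdown for the second crawled page (index 1 in the returned list):

print(crawler_sitemaps_md[1])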
# Download, Extract, and Parse XML Sitemaps
Enter a sitemap URL and start analyzing immediately. See publishing trends, website structure, and export the full data to a CSV file.
Loading...
## Analyze XML sitemaps in bulk and gain immediate insights
This app supports any kind of XML sitemap URL.
### XML sitemap types supported
### Extracted data
The following columns will typically be included in the converted sitemap:
loc : The URLs (locations) <loc>
lastmod : If available, the <lastmod> tag as a datetime object
sitemap : The URL of the sitemap to which this URL belongs
sitemap_size_mb : Self explanatory
download_date : When the sitemap was downloaded, so you can compare later
sitemap_last_modified : If declared by the server, you also get this response header.
ETag : If declared by the server you also get the ETag of the sitemap.
Other columns could also be available, depending on the sitemap. For example, if the sitemap contained images, it might contain the following columns:
image
image_loc
image_title
image_caption
### Example Charts for XML Sitemaps
#### Lastmod cumulative distribution and histogram chart
#### URL structure treemap chart - first level
#### URL structure treemap chart - second level
HTML to markdown conversion summary and caveats
- Headings and body_text are used to figure out the markdown text.
- No other elements are extracted (links, bullet points, etc.).
- Repeated headings at the same level are handled properly (e.g. two h2 tags that have the same text).
- Repeated headings at different levels will probably result in one of them being misplaced. If you have "click here" in two places in the document, once as an h1 and once as an h2, one of them will be incorrectly placed in the resulting string. Please get in touch if you have a solution.
- In upcoming releases, links might be supported as well.
Text partitioning with regular expressions
Another feature that can help in this workflow is flexible text partitioning. The main difference between splitting and partitioning is that splitting removes the splitting characters, while partitioning keeps them. The Python standard library's str class already has a partition method, but it has two main limitations:
- It supports partitioning only once.
- It does not support partitioning by regular expression.
Here is how it works:
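For example, using str.partition on a small markdown string (reconstructed from the output below):

text = "# Heading 1\n\nSome text\n\n## Heading 2\n\nAnother text"
text.partition("#")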
('', '#', ' Heading 1\n\nSome text\n\n## Heading 2\n\nAnother text')
The text was correctly partitioned on the "#" character, but only once. In this case we would like to split on any markdown heading, without knowing its level, and we want to do it multiple times in the same document.
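With adv.partition we can pass the heading pattern as a regex (the re.MULTILINE flag is discussed below):

import re
import advertools as adv

parts = adv.partition(text=text, regex=r"^#+ .*", flags=re.MULTILINE)
parts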
['# Heading 1', 'Some text', '## Heading 2', 'Another text']
Printing the lines for readability:
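One way to do that, using the parts list from the sketch above:

print(*parts, sep="\n")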
# Heading 1
Some text
## Heading 2
Another text
Keep in mind that you might need to pass regex flags to the function. In the previous example I used re.MULTILINE so that the ^ anchor matches at the beginning of every line, not just at the beginning of the whole string.
Partitioning a markdown string
Here is a real-life example of how this works:
md_parts = adv.partition(
    text=crawler_sitemaps_md[0],
    regex=r"^#+ .*",
    flags=re.MULTILINE,
)
md_parts
['# SEO Crawler',
'Loading...',
'## SEO Crawler Features',
'## Crawling features',
'#### Free up to five thousand URLs',
'You can also set a lower limit for exploratory purposes.',
'#### Custom extraction with XPath and/or CSS selectors',
'For custom extraction you need to enter two things in the tables above: \n- Column Name: This should be a descriptive name for the page elements you want to exract. Examples could be "product_price", "blog_author", et. \n- XPtah/CSS Selector: This is the selector pattern for the element(s) you want to extract.',
'#### Setting any User-agent',
'The names are given using human-readable device names like "iPhone 13", "Samsung Galaxy S22", etc. \nYou can also get any user-agent and paste it in the input box, so don\'t need to only use the ones provided.',
'#### Spider and/or list mode',
'To activate spider mode you just need to select the "Follow links" checkbox. Add more than one URL to run in list mode. You can combine both of course.',
'#### Include/exclude URL parameters',
"When encountering new links, should the crawler follow them if they (don't) contain the URL parameters that you chose?",
'#### Include/exclude URL regular expression',
'Similar to the above but using a regex to match links.',
'## Crawl analytics features',
'Once you hit the Start crawling button, you will be given a URL where you can start auditing and analyzing the crawled website. The following are some of the available feature, together with a link to explore a live dashboard.',
'#### Visualize the URL structure',
'Using an interactive treemap chart you can see how the website\'s content is split. \n\nGet the count of URLs for each directory of URLs /blog/ , /sports/ , etc. \nGet the percentages of each of the directories. \nBeyond ten directories the chart might look really cluttered, so all other directories are displayed under their own "Others" segment.',
'#### Structured data overview (JSON-LD, OpenGraph, and Twitter)',
'For each of the above structured data types, you can see the count and percentage of URLs that contain each tag of each type. For example, for the @context JSON-LD tag, you can see how many URLs contain it, their percentage of the crawled URLs.',
'#### Filter and export data based on whichever URLs you want',
'Once you see an interesting insight, you can export a subset of URLs and columns. For example, get the URL, title, and status of URLs whos status code is not 200. Get the URL, h1, and size, of pages whos size is larger than 300KB, and so on.',
"#### Count page elements' duplications if any (title, h1, h2, etc.)",
'For a selected page element get the counts of each element on the website. How many times was each h2, meta_desc, etc. element duplicated across the website?',
'#### N-gram analysis for any page element (set stop words, and select from 40 languages)',
'Select unigrams, bigrams, or trigrams \nSelect the page element to analyze (title, body_text, h3, etc) \nUse editable default stopwords \nChose stopwords from forty languages',
'#### Link analysis',
'External links : See which domains the crawled website links to the most. \n Internal links : Get a score for each internal URL as a node in the network (website): \n \n Links: Total number of links to and from the page. \n Inlinks: Total incoming links to the page. \n Outlinks: Total outgoing links to the page. \n Degree centrality: The percentage of internal URLs that are connected to this URL. \n Page rank: The PageRank \n \n Note that it is up to you to define what "internal" means using a regular expression. \n For an example of those features you can explore this crawl audit and analytics dashboard']
Now that we have the markdown representation of each crawled page converted to a list of parts, we can extract the chunks of content that each page has.
One way to define a chunk is to take each heading together with its subsequent text and consider that a chunk. You are free to chunk using another approach, of course.
How to get chunks of content from a markdown string
The following function implements this approach: it takes each heading and its subsequent text and creates a chunk out of them.
def get_markdown_chunks(md_list):
    chunks = []
    current_chunk = []
    for item in md_list:
        if item.strip().startswith("#"):
            # a heading starts a new chunk
            if current_chunk:
                chunks.append(current_chunk)
            current_chunk = [item]
        else:
            if current_chunk:
                current_chunk.append(item)
            else:
                # text with no preceding heading becomes its own chunk
                current_chunk = [item]
                chunks.append(current_chunk)
                current_chunk = []
    if current_chunk:
        chunks.append(current_chunk)
    return chunks
chunks = get_markdown_chunks(md_parts)

for i, chunk in enumerate(chunks, 1):
    print(f"Chunk: {i}")
    print(*chunk, sep="\n")
    print('-----\n')
Chunk: 1
# SEO Crawler
Loading...
-----
Chunk: 2
## SEO Crawler Features
-----
Chunk: 3
## Crawling features
-----
Chunk: 4
#### Free up to five thousand URLs
You can also set a lower limit for exploratory purposes.
-----
Chunk: 5
#### Custom extraction with XPath and/or CSS selectors
For custom extraction you need to enter two things in the tables above:
- Column Name: This should be a descriptive name for the page elements you want to exract. Examples could be "product_price", "blog_author", et.
- XPtah/CSS Selector: This is the selector pattern for the element(s) you want to extract.
-----
Chunk: 6
#### Setting any User-agent
The names are given using human-readable device names like "iPhone 13", "Samsung Galaxy S22", etc.
You can also get any user-agent and paste it in the input box, so don't need to only use the ones provided.
-----
Chunk: 7
#### Spider and/or list mode
To activate spider mode you just need to select the "Follow links" checkbox. Add more than one URL to run in list mode. You can combine both of course.
-----
Chunk: 8
#### Include/exclude URL parameters
When encountering new links, should the crawler follow them if they (don't) contain the URL parameters that you chose?
-----
Chunk: 9
#### Include/exclude URL regular expression
Similar to the above but using a regex to match links.
-----
Chunk: 10
## Crawl analytics features
Once you hit the Start crawling button, you will be given a URL where you can start auditing and analyzing the crawled website. The following are some of the available feature, together with a link to explore a live dashboard.
-----
Chunk: 11
#### Visualize the URL structure
Using an interactive treemap chart you can see how the website's content is split.
Get the count of URLs for each directory of URLs /blog/ , /sports/ , etc.
Get the percentages of each of the directories.
Beyond ten directories the chart might look really cluttered, so all other directories are displayed under their own "Others" segment.
-----
Chunk: 12
#### Structured data overview (JSON-LD, OpenGraph, and Twitter)
For each of the above structured data types, you can see the count and percentage of URLs that contain each tag of each type. For example, for the @context JSON-LD tag, you can see how many URLs contain it, their percentage of the crawled URLs.
-----
Chunk: 13
#### Filter and export data based on whichever URLs you want
Once you see an interesting insight, you can export a subset of URLs and columns. For example, get the URL, title, and status of URLs whos status code is not 200. Get the URL, h1, and size, of pages whos size is larger than 300KB, and so on.
-----
Chunk: 14
#### Count page elements' duplications if any (title, h1, h2, etc.)
For a selected page element get the counts of each element on the website. How many times was each h2, meta_desc, etc. element duplicated across the website?
-----
Chunk: 15
#### N-gram analysis for any page element (set stop words, and select from 40 languages)
Select unigrams, bigrams, or trigrams
Select the page element to analyze (title, body_text, h3, etc)
Use editable default stopwords
Chose stopwords from forty languages
-----
Chunk: 16
#### Link analysis
External links : See which domains the crawled website links to the most.
Internal links : Get a score for each internal URL as a node in the network (website):
Links: Total number of links to and from the page.
Inlinks: Total incoming links to the page.
Outlinks: Total outgoing links to the page.
Degree centrality: The percentage of internal URLs that are connected to this URL.
Page rank: The PageRank
Note that it is up to you to define what "internal" means using a regular expression.
For an example of those features you can explore this crawl audit and analytics dashboard
-----
We have now come full circle:
- We crawled a website, optionally keeping only the columns we want, for example r"h\d|body_text" to keep the headings and body text only.
- The crawl function automatically extracted the body text of the crawled pages for us.
- We converted the body text to markdown using adv.crawlytics.generate_markdown.
- We partitioned the markdown text on its headings as a regex using adv.partition.
- We extracted chunks of content (heading + subsequent text).
Many things can be done with those chunks, but the important thing is that we can now evaluate them independently, and this can help answer some questions about our content:
- How coherent/similar are the chunks of the same article?
- Which chunks are most similar to which chunks from other pages?
- Find clusters of similar chunks across the website.
- Enable search by chunk for the site.
We are now moving from the classical "bag of words" approach to a "bag of chunks" approach.
These features have just been released, and any feedback or bug reports would be greatly appreciated. Feel free to submit an issue if you run into any problems.