How to audit structured data on a website with Python

Python
advertools
Structured Data
Crawling
Auditing
Analysis
Intermediate
A Python script that takes a site crawled with advertools and provides mechanisms for understanding its structured data: which types are used (JSON-LD, Twitter, and Open Graph), which tags are present, and what they contain.

Read the crawl file

import advertools as adv
import adviz
import pandas as pd

crawldf = pd.read_json("output_file.jsonl", lines=True)
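
The crawl file is in JSON Lines format, with one JSON object per crawled page, which is why `lines=True` is passed. As a small illustration, the sketch below reads two made-up records (the URLs and values are hypothetical, not taken from the real crawl) and shows that every distinct key becomes a column, with missing values wherever a page lacks that key:

```python
import io
import pandas as pd

# Two hypothetical crawl records, one JSON object per line
sample_jsonl = io.StringIO(
    '{"url": "https://example.com/", "og:title": "Home"}\n'
    '{"url": "https://example.com/blog", "jsonld_@type": "Article"}\n'
)

sample_df = pd.read_json(sample_jsonl, lines=True)

# Each distinct key becomes a column; pages without that key get a missing value
print(sample_df.columns.tolist())
print(sample_df.isna().sum())
```

This is why the real crawl DataFrame is so wide and sparse: a tag that appears on a single page still gets its own column.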

Filter the columns that contain structured data

crawldf.filter(regex="jsonld|og:|twitter:").head(3)
og:image og:title og:description twitter:card og:url og:site_name og:type twitter:title twitter:url twitter:image og:image:secure_url twitter:image:alt jsonld_@context jsonld_@type jsonld_mainEntityOfPage jsonld_headline jsonld_text jsonld_url jsonld_datePublished jsonld_comment jsonld_author.@type jsonld_author.name jsonld_author.url jsonld_interactionStatistic.@type jsonld_interactionStatistic.interactionType jsonld_interactionStatistic.userInteractionCount jsonld_mainEntity.@type jsonld_mainEntity.name jsonld_mainEntity.text jsonld_mainEntity.dateCreated jsonld_mainEntity.upvoteCount jsonld_mainEntity.author.@type jsonld_mainEntity.author.name jsonld_mainEntity.answerCount jsonld_mainEntity.acceptedAnswer jsonld_mainEntity.suggestedAnswer twitter:description twitter:site twitter:creator og:image:width og:image:height jsonld_image jsonld_mainEntity.acceptedAnswer.@type jsonld_mainEntity.acceptedAnswer.text jsonld_mainEntity.acceptedAnswer.dateCreated jsonld_mainEntity.acceptedAnswer.upvoteCount jsonld_mainEntity.acceptedAnswer.author.@type jsonld_mainEntity.acceptedAnswer.author.name jsonld_mainEntity.acceptedAnswer.url jsonld_name jsonld_startDate jsonld_endDate jsonld_eventAttendanceMode jsonld_description jsonld_location.@type jsonld_location.url
0 https://supermetrics.com/images/supermetrics.png Supermetrics: Turn your marketing data into opportunity - Supermetrics Focus on growth, not data silos. Streamline your marketing data so you can take control of what matters. Start your ... summary_large_image NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 https://supermetrics.com/images/supermetrics.png Become a Supermetrics Affiliate - Supermetrics Refer Supermetrics to others and get 20% recurring commissions from each sale. Join now! summary_large_image NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 https://supermetrics.com/images/supermetrics.png About Supermetrics - Supermetrics Whether you’re a small business getting started on your data journey or a global enterprise working with business c... summary_large_image NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
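
`DataFrame.filter(regex=...)` keeps the columns whose names match the pattern anywhere in the name. The sketch below uses a toy DataFrame with hypothetical column names to show one consequence: an unanchored `og:` also matches a name such as `blog:topic`, so anchoring the pattern with `^` may be safer if your crawl contains custom columns.

```python
import pandas as pd

# Toy data; "blog:topic" is a hypothetical custom column
toy = pd.DataFrame({
    "url": ["https://example.com/a", "https://example.com/b"],
    "og:title": ["A", None],
    "twitter:card": ["summary", "summary"],
    "jsonld_@type": [None, "Article"],
    "blog:topic": ["news", "guides"],
})

# Unanchored: the pattern is searched anywhere in the column name
loose = toy.filter(regex="jsonld|og:|twitter:").columns.tolist()

# Anchored: the column name must start with one of the prefixes
strict = toy.filter(regex="^(jsonld|og:|twitter:)").columns.tolist()

print(loose)
print(strict)
```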

Define a function to get counts or averages of a set of columns

This function takes the following parameters:

  • df: A crawl DataFrame
  • regex: The pattern to match against column names; here we use “jsonld”, “twitter:”, and “og:”
  • func: Either “count” to get the number of pages on which each tag has a value, or “mean” to get the percentage of pages on which it does
def column_values(df, regex, func):
    """Return a styled table of tag usage for the columns matching `regex`."""
    func_dict = {
        "mean": {"fmt": "{:.1%}", "colname": "% usage"},
        "count": {"fmt": "{:,}", "colname": "count"},
    }
    return (
        df  # use the DataFrame passed in, not the global crawldf
        .filter(regex=regex)
        .notna()  # True where a page has a value for the tag
        .apply("sum" if func == "count" else func)
        .sort_values(ascending=False)
        .to_frame()
        .rename(columns={0: func_dict[func]["colname"]})
        .style
        .format(func_dict[func]["fmt"])
        .bar(color="lightgray")
    )
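
To make the core of the function easier to follow, here is a minimal sketch of the same computation without the styling, run on a toy DataFrame (the values are made up). `notna()` turns each cell into True or False, so summing gives the number of pages with the tag and averaging gives the share of pages with it:

```python
import pandas as pd

# Toy crawl data: four pages, values are hypothetical
toy = pd.DataFrame({
    "og:title": ["A", "B", "C", None],
    "og:image": ["a.png", None, None, None],
    "twitter:card": ["summary", "summary", None, None],
})

present = toy.filter(regex="og:").notna()  # True where a tag has a value

counts = present.sum().sort_values(ascending=False)   # pages per tag
shares = present.mean().sort_values(ascending=False)  # fraction of pages

print(counts)
print(shares.map("{:.1%}".format))
```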

Display counts and percentages of all three structured data types on the website

from itertools import product
for regex, func in product(["og:", "twitter:", "jsonld"], ["mean", "count"]):
    display(column_values(crawldf, regex, func))
    print('')
  % usage
og:title 99.3%
og:image 98.6%
og:description 94.0%
og:type 66.3%
og:url 65.6%
og:site_name 58.8%
og:image:secure_url 6.9%
og:image:width 0.0%
og:image:height 0.0%
  count
og:title 2,603
og:image 2,586
og:description 2,464
og:type 1,738
og:url 1,721
og:site_name 1,541
og:image:secure_url 182
og:image:width 1
og:image:height 1
  % usage
twitter:card 91.2%
twitter:title 43.1%
twitter:url 43.1%
twitter:image 43.1%
twitter:description 37.9%
twitter:image:alt 15.0%
twitter:site 0.1%
twitter:creator 0.1%
  count
twitter:card 2,391
twitter:title 1,131
twitter:url 1,131
twitter:image 1,131
twitter:description 994
twitter:image:alt 393
twitter:site 2
twitter:creator 2
  % usage
jsonld_@context 5.6%
jsonld_@type 5.6%
jsonld_headline 4.3%
jsonld_datePublished 4.3%
jsonld_author.url 4.3%
jsonld_author.name 4.3%
jsonld_author.@type 4.3%
jsonld_url 3.4%
jsonld_text 3.4%
jsonld_mainEntityOfPage 3.4%
jsonld_comment 3.4%
jsonld_interactionStatistic.@type 3.4%
jsonld_interactionStatistic.interactionType 3.4%
jsonld_interactionStatistic.userInteractionCount 3.4%
jsonld_mainEntity.@type 1.2%
jsonld_mainEntity.name 1.2%
jsonld_mainEntity.text 1.2%
jsonld_mainEntity.dateCreated 1.2%
jsonld_mainEntity.upvoteCount 1.2%
jsonld_mainEntity.author.@type 1.2%
jsonld_mainEntity.author.name 1.2%
jsonld_mainEntity.answerCount 1.2%
jsonld_mainEntity.suggestedAnswer 1.2%
jsonld_image 1.0%
jsonld_mainEntity.acceptedAnswer.url 0.2%
jsonld_mainEntity.acceptedAnswer.@type 0.2%
jsonld_mainEntity.acceptedAnswer.text 0.2%
jsonld_mainEntity.acceptedAnswer.dateCreated 0.2%
jsonld_mainEntity.acceptedAnswer.upvoteCount 0.2%
jsonld_mainEntity.acceptedAnswer.author.@type 0.2%
jsonld_mainEntity.acceptedAnswer.author.name 0.2%
jsonld_eventAttendanceMode 0.0%
jsonld_name 0.0%
jsonld_startDate 0.0%
jsonld_endDate 0.0%
jsonld_location.@type 0.0%
jsonld_description 0.0%
jsonld_location.url 0.0%
jsonld_mainEntity.acceptedAnswer 0.0%
  count
jsonld_@context 147
jsonld_@type 147
jsonld_headline 114
jsonld_datePublished 114
jsonld_author.url 114
jsonld_author.name 114
jsonld_author.@type 114
jsonld_url 90
jsonld_text 90
jsonld_mainEntityOfPage 90
jsonld_comment 90
jsonld_interactionStatistic.@type 90
jsonld_interactionStatistic.interactionType 90
jsonld_interactionStatistic.userInteractionCount 90
jsonld_mainEntity.@type 32
jsonld_mainEntity.name 32
jsonld_mainEntity.text 32
jsonld_mainEntity.dateCreated 32
jsonld_mainEntity.upvoteCount 32
jsonld_mainEntity.author.@type 32
jsonld_mainEntity.author.name 32
jsonld_mainEntity.answerCount 32
jsonld_mainEntity.suggestedAnswer 32
jsonld_image 25
jsonld_mainEntity.acceptedAnswer.url 6
jsonld_mainEntity.acceptedAnswer.@type 6
jsonld_mainEntity.acceptedAnswer.text 6
jsonld_mainEntity.acceptedAnswer.dateCreated 6
jsonld_mainEntity.acceptedAnswer.upvoteCount 6
jsonld_mainEntity.acceptedAnswer.author.@type 6
jsonld_mainEntity.acceptedAnswer.author.name 6
jsonld_eventAttendanceMode 1
jsonld_name 1
jsonld_startDate 1
jsonld_endDate 1
jsonld_location.@type 1
jsonld_description 1
jsonld_location.url 1
jsonld_mainEntity.acceptedAnswer 0
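
The tables above report usage per tag. A complementary view, sketched below as an optional extension rather than part of the original script, is the share of pages that have at least one tag of each type. The `patterns` dictionary and the toy data are assumptions for illustration:

```python
import pandas as pd

# Toy crawl data: four pages, values are hypothetical
toy = pd.DataFrame({
    "og:title": ["A", "B", None, None],
    "og:image": ["a.png", None, None, None],
    "twitter:card": ["summary", None, "summary", None],
    "jsonld_@type": [None, None, None, "Article"],
})

# Hypothetical mapping of structured data type to column-name pattern
patterns = {"Open Graph": "^og:", "Twitter": "^twitter:", "JSON-LD": "^jsonld_"}

# For each type: fraction of pages with a value in at least one matching column
coverage = pd.Series({
    name: toy.filter(regex=pattern).notna().any(axis=1).mean()
    for name, pattern in patterns.items()
})
print(coverage)
```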

Count actual values of the selected structured data column

adviz.value_counts(crawldf["og:title"], width=None)
fig = adviz.value_counts(crawldf["og:title"], width=None)
fig.data[1].hoverinfo = 'text'
fig.layout.margin.l = 10
fig.layout.margin.r = 0
fig
adviz.value_counts(crawldf["jsonld_headline"])
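
If you only need the numbers without the interactive chart, a similar summary can be built with plain pandas. This is a rough sketch under that assumption, not adviz's implementation, and the titles below are made up:

```python
import pandas as pd

# Hypothetical og:title values, including one page without the tag
titles = pd.Series(["Pricing", "Pricing", "Pricing", "About", None])

counts = titles.value_counts()  # missing values are excluded by default
summary = pd.DataFrame({
    "count": counts,
    "perc": counts / counts.sum(),
})
summary["cum_perc"] = summary["perc"].cumsum()
print(summary)
```

Values that repeat across many pages stand out at the top of the table, which is a quick way to spot duplicated titles or headlines.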

Count n-grams of the desired columns

adv.word_frequency(
    crawldf["og:title"].fillna(""),
    phrase_len=2).head(15)
word abs_freq
0 - supermetrics 866
1 how to 581
2 | supermetrics 574
3 supermetrics documentation 393
4 connection guide 190
5 supermetrics community 185
6 supermetrics connection 182
7 data warehouse 171
8 metrics and 149
9 and dimensions 148
10 dimensions | 146
11 looker studio 135
12 standard data 129
13 warehouse schema 129
14 schema | 129
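
For intuition about what a two-word phrase count does, here is a simplified sketch using only the standard library. It assumes lowercase, whitespace-based tokenization and does not reproduce how `adv.word_frequency` tokenizes or filters words. The `bigram_counts` helper and the titles are hypothetical:

```python
from collections import Counter

# Hypothetical titles; the empty string stands in for a page without og:title
titles = [
    "How to connect data - Supermetrics",
    "How to export data - Supermetrics",
    "",
]

def bigram_counts(texts):
    """Count two-word phrases using lowercase whitespace tokenization."""
    counter = Counter()
    for text in texts:
        words = text.lower().split()
        # Pair each word with the one that follows it
        counter.update(" ".join(pair) for pair in zip(words, words[1:]))
    return counter

bigrams = bigram_counts(titles)
print(bigrams.most_common(3))
```

As in the table above, separators such as "-" and "|" are counted as tokens, so site-wide title suffixes like "- supermetrics" tend to rise to the top.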