index | og:image | og:title | og:description | twitter:card | og:url | og:site_name | og:type | twitter:title | twitter:url | twitter:image | og:image:secure_url | twitter:image:alt | jsonld_@context | jsonld_@type | jsonld_mainEntityOfPage | jsonld_headline | jsonld_text | jsonld_url | jsonld_datePublished | jsonld_comment | jsonld_author.@type | jsonld_author.name | jsonld_author.url | jsonld_interactionStatistic.@type | jsonld_interactionStatistic.interactionType | jsonld_interactionStatistic.userInteractionCount | jsonld_mainEntity.@type | jsonld_mainEntity.name | jsonld_mainEntity.text | jsonld_mainEntity.dateCreated | jsonld_mainEntity.upvoteCount | jsonld_mainEntity.author.@type | jsonld_mainEntity.author.name | jsonld_mainEntity.answerCount | jsonld_mainEntity.acceptedAnswer | jsonld_mainEntity.suggestedAnswer | twitter:description | twitter:site | twitter:creator | og:image:width | og:image:height | jsonld_image | jsonld_mainEntity.acceptedAnswer.@type | jsonld_mainEntity.acceptedAnswer.text | jsonld_mainEntity.acceptedAnswer.dateCreated | jsonld_mainEntity.acceptedAnswer.upvoteCount | jsonld_mainEntity.acceptedAnswer.author.@type | jsonld_mainEntity.acceptedAnswer.author.name | jsonld_mainEntity.acceptedAnswer.url | jsonld_name | jsonld_startDate | jsonld_endDate | jsonld_eventAttendanceMode | jsonld_description | jsonld_location.@type | jsonld_location.url |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | https://supermetrics.com/images/supermetrics.png | Supermetrics: Turn your marketing data into opportunity - Supermetrics | Focus on growth, not data silos. Streamline your marketing data so you can take control of what matters. Start your ... | summary_large_image | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | https://supermetrics.com/images/supermetrics.png | Become a Supermetrics Affiliate - Supermetrics | Refer Supermetrics to others and get 20% recurring commissions from each sale. Join now! | summary_large_image | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | https://supermetrics.com/images/supermetrics.png | About Supermetrics - Supermetrics | Whether you’re a small business getting started on your data journey or a global enterprise working with business c... | summary_large_image | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
How to audit structured data on a website with Python
Python
advertools
Structured Data
Crawling
Auditing
Analysis
Intermediate
A Python script that takes a site crawled with advertools and provides ways to understand its structured data: which types are used (JSON-LD, Twitter, and Open Graph), which tags are present, and what they contain.
Read the crawl file
Filter the columns that contain structured data
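The code for these two steps is not shown here. A minimal sketch follows, assuming the crawl was saved as a JSON-lines file (one page per line) and that structured data columns start with `og:`, `twitter:`, or `jsonld_`. The helper names, the sample pages, and the file name `sample_crawl.jl` are hypothetical stand-ins so the example runs on its own.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd

# Column-name prefixes assumed to mark structured data in the crawl DataFrame
STRUCTURED_DATA_REGEX = r"^(og:|twitter:|jsonld_)"


def read_crawl(path):
    """Read a crawl output file stored as JSON lines (one page per line)."""
    return pd.read_json(path, lines=True)


def structured_data_columns(df):
    """Keep only the Open Graph, Twitter, and JSON-LD columns."""
    return df.filter(regex=STRUCTURED_DATA_REGEX)


# Tiny stand-in for a real crawl file (hypothetical data)
sample_pages = [
    {"url": "https://example.com/", "title": "Home",
     "og:title": "Home", "twitter:card": "summary"},
    {"url": "https://example.com/a", "title": "A",
     "og:title": "A", "jsonld_@type": "Article"},
]
sample_path = Path(tempfile.mkdtemp()) / "sample_crawl.jl"
sample_path.write_text("\n".join(json.dumps(page) for page in sample_pages))

crawldf = read_crawl(sample_path)
structured_df = structured_data_columns(crawldf)
print(structured_df.columns.tolist())
```

With a real crawl, `read_crawl` would be pointed at the crawl output file instead of the temporary sample.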
Define a function to get counts or averages of a set of columns
This function takes the following parameters:
df: A crawl DataFrame
regex: The pattern to find; in this case we will use “jsonld_”, “twitter:”, and “og:”
func: Either “count” to get the counts or “mean” to get the percentage for each tag
def column_values(df, regex, func):
    # Display format and column label for each supported aggregation
    func_dict = {
        "mean": {"fmt": "{:.1%}", "colname": "% usage"},
        "count": {"fmt": "{:,}", "colname": "count"},
    }
    return (
        df
        .filter(regex=regex)  # keep columns whose names match the pattern
        .notna()  # True where a page has the tag
        .apply("sum" if func == "count" else func)  # pages per tag, or share of pages
        .sort_values(ascending=False)
        .to_frame()
        .rename(columns={0: func_dict[func]["colname"]})
        .style
        .format(func_dict[func]["fmt"])
        .bar(color="lightgray")
    )
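To see what the two modes compute, here is a minimal sketch on a made-up three-page DataFrame (the data and the name `toy` are illustrative, not taken from the crawl). Because `notna()` yields booleans, `sum` counts the pages that carry a tag and `mean` gives the share of pages that carry it.

```python
import pandas as pd

# Hypothetical three-page crawl with two Open Graph columns
toy = pd.DataFrame({
    "url": ["/a", "/b", "/c"],
    "og:title": ["A", "B", None],
    "og:image": ["a.png", None, None],
})

# True where the tag is present on a page
present = toy.filter(regex="og:").notna()

counts = present.sum()   # number of pages that have each tag
shares = present.mean()  # fraction of pages that have each tag

print(counts.to_dict())
print(shares.round(3).to_dict())
```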
Display counts and percentages of all three structured data types on the website
from itertools import product

# Show the usage share and the page count for each structured data type
for regex, func in product(["og:", "twitter:", "jsonld"], ["mean", "count"]):
    display(column_values(crawldf, regex, func))
    print('')
tag | % usage |
---|---|
og:title | 99.3% |
og:image | 98.6% |
og:description | 94.0% |
og:type | 66.3% |
og:url | 65.6% |
og:site_name | 58.8% |
og:image:secure_url | 6.9% |
og:image:width | 0.0% |
og:image:height | 0.0% |
tag | count |
---|---|
og:title | 2,603 |
og:image | 2,586 |
og:description | 2,464 |
og:type | 1,738 |
og:url | 1,721 |
og:site_name | 1,541 |
og:image:secure_url | 182 |
og:image:width | 1 |
og:image:height | 1 |
tag | % usage |
---|---|
twitter:card | 91.2% |
twitter:title | 43.1% |
twitter:url | 43.1% |
twitter:image | 43.1% |
twitter:description | 37.9% |
twitter:image:alt | 15.0% |
twitter:site | 0.1% |
twitter:creator | 0.1% |
tag | count |
---|---|
twitter:card | 2,391 |
twitter:title | 1,131 |
twitter:url | 1,131 |
twitter:image | 1,131 |
twitter:description | 994 |
twitter:image:alt | 393 |
twitter:site | 2 |
twitter:creator | 2 |
tag | % usage |
---|---|
jsonld_@context | 5.6% |
jsonld_@type | 5.6% |
jsonld_headline | 4.3% |
jsonld_datePublished | 4.3% |
jsonld_author.url | 4.3% |
jsonld_author.name | 4.3% |
jsonld_author.@type | 4.3% |
jsonld_url | 3.4% |
jsonld_text | 3.4% |
jsonld_mainEntityOfPage | 3.4% |
jsonld_comment | 3.4% |
jsonld_interactionStatistic.@type | 3.4% |
jsonld_interactionStatistic.interactionType | 3.4% |
jsonld_interactionStatistic.userInteractionCount | 3.4% |
jsonld_mainEntity.@type | 1.2% |
jsonld_mainEntity.name | 1.2% |
jsonld_mainEntity.text | 1.2% |
jsonld_mainEntity.dateCreated | 1.2% |
jsonld_mainEntity.upvoteCount | 1.2% |
jsonld_mainEntity.author.@type | 1.2% |
jsonld_mainEntity.author.name | 1.2% |
jsonld_mainEntity.answerCount | 1.2% |
jsonld_mainEntity.suggestedAnswer | 1.2% |
jsonld_image | 1.0% |
jsonld_mainEntity.acceptedAnswer.url | 0.2% |
jsonld_mainEntity.acceptedAnswer.@type | 0.2% |
jsonld_mainEntity.acceptedAnswer.text | 0.2% |
jsonld_mainEntity.acceptedAnswer.dateCreated | 0.2% |
jsonld_mainEntity.acceptedAnswer.upvoteCount | 0.2% |
jsonld_mainEntity.acceptedAnswer.author.@type | 0.2% |
jsonld_mainEntity.acceptedAnswer.author.name | 0.2% |
jsonld_eventAttendanceMode | 0.0% |
jsonld_name | 0.0% |
jsonld_startDate | 0.0% |
jsonld_endDate | 0.0% |
jsonld_location.@type | 0.0% |
jsonld_description | 0.0% |
jsonld_location.url | 0.0% |
jsonld_mainEntity.acceptedAnswer | 0.0% |
tag | count |
---|---|
jsonld_@context | 147 |
jsonld_@type | 147 |
jsonld_headline | 114 |
jsonld_datePublished | 114 |
jsonld_author.url | 114 |
jsonld_author.name | 114 |
jsonld_author.@type | 114 |
jsonld_url | 90 |
jsonld_text | 90 |
jsonld_mainEntityOfPage | 90 |
jsonld_comment | 90 |
jsonld_interactionStatistic.@type | 90 |
jsonld_interactionStatistic.interactionType | 90 |
jsonld_interactionStatistic.userInteractionCount | 90 |
jsonld_mainEntity.@type | 32 |
jsonld_mainEntity.name | 32 |
jsonld_mainEntity.text | 32 |
jsonld_mainEntity.dateCreated | 32 |
jsonld_mainEntity.upvoteCount | 32 |
jsonld_mainEntity.author.@type | 32 |
jsonld_mainEntity.author.name | 32 |
jsonld_mainEntity.answerCount | 32 |
jsonld_mainEntity.suggestedAnswer | 32 |
jsonld_image | 25 |
jsonld_mainEntity.acceptedAnswer.url | 6 |
jsonld_mainEntity.acceptedAnswer.@type | 6 |
jsonld_mainEntity.acceptedAnswer.text | 6 |
jsonld_mainEntity.acceptedAnswer.dateCreated | 6 |
jsonld_mainEntity.acceptedAnswer.upvoteCount | 6 |
jsonld_mainEntity.acceptedAnswer.author.@type | 6 |
jsonld_mainEntity.acceptedAnswer.author.name | 6 |
jsonld_eventAttendanceMode | 1 |
jsonld_name | 1 |
jsonld_startDate | 1 |
jsonld_endDate | 1 |
jsonld_location.@type | 1 |
jsonld_description | 1 |
jsonld_location.url | 1 |
jsonld_mainEntity.acceptedAnswer | 0 |
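The tables above report each tag separately. To answer the broader question of which structured data types are used at all, one option is to count the pages that have at least one tag of each type. A minimal sketch, assuming the same column prefixes; the helper name `type_coverage` and the sample data are hypothetical.

```python
import pandas as pd


def type_coverage(df, prefixes=("og:", "twitter:", "jsonld_")):
    """Share of pages with at least one non-missing tag per prefix."""
    coverage = {}
    for prefix in prefixes:
        cols = [c for c in df.columns if c.startswith(prefix)]
        # any(axis=1) is True for a page if any tag of this type is present
        coverage[prefix] = df[cols].notna().any(axis=1).mean() if cols else 0.0
    return pd.Series(coverage, name="% of pages")


# Hypothetical four-page sample
sample = pd.DataFrame({
    "og:title": ["A", "B", None, "D"],
    "og:image": [None, "b.png", None, None],
    "twitter:card": ["summary", None, None, None],
})
print(type_coverage(sample))
```

Unlike the per-tag tables, this gives a single figure per type, which makes it easier to compare Open Graph, Twitter, and JSON-LD adoption side by side.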
Count actual values of the selected structured data column
import adviz

fig = adviz.value_counts(crawldf["og:title"], width=None)
fig.data[1].hoverinfo = 'text'
fig.layout.margin.l = 10
fig.layout.margin.r = 0
fig
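Outside a notebook, or without adviz installed, a similar summary can be produced with plain pandas. A minimal sketch, assuming a Series of titles; the helper name `value_counts_table` and the sample titles are illustrative.

```python
import pandas as pd


def value_counts_table(series, top=10):
    """Counts, shares, and cumulative shares of the most frequent values."""
    counts = series.value_counts(dropna=False).head(top)
    table = counts.rename("count").to_frame()
    table["%"] = table["count"] / len(series)
    table["cum. %"] = table["%"].cumsum()
    return table


# Hypothetical titles, including one missing value
titles = pd.Series(["Home", "Docs", "Docs", None, "Docs", "Home"])
table = value_counts_table(titles)
print(table)
```

Keeping missing values in the count (`dropna=False`) shows how many pages lack the tag entirely, which is often the most useful figure in an audit.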
Count n-grams of the desired columns
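The code for this step is not shown. advertools provides a `word_frequency` function with a `phrase_len` parameter that returns `word` and `abs_freq` columns, which may be what produced the table below, although that is an assumption. A library-free sketch of the same idea, assuming lowercase whitespace tokenization; the helper name `ngram_counts` and the sample titles are hypothetical.

```python
from collections import Counter


def ngram_counts(texts, n=2):
    """Count n-word phrases across texts, using lowercase whitespace tokens."""
    counter = Counter()
    for text in texts:
        if not isinstance(text, str):
            continue  # skip missing values such as None or NaN
        tokens = text.lower().split()
        counter.update(
            " ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
        )
    return counter


# Hypothetical page titles, including one missing value
titles = [
    "How to connect data - Supermetrics",
    "How to build reports - Supermetrics",
    None,
]
for phrase, freq in ngram_counts(titles).most_common(3):
    print(phrase, freq)
```

Because tokens are split on whitespace only, separators such as `-` and `|` are counted as words, which matches phrases like `- supermetrics` in the table.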
index | word | abs_freq |
---|---|---|
0 | - supermetrics | 866 |
1 | how to | 581 |
2 | \| supermetrics | 574 |
3 | supermetrics documentation | 393 |
4 | connection guide | 190 |
5 | supermetrics community | 185 |
6 | supermetrics connection | 182 |
7 | data warehouse | 171 |
8 | metrics and | 149 |
9 | and dimensions | 148 |
10 | dimensions \| | 146 |
11 | looker studio | 135 |
12 | standard data | 129 |
13 | warehouse schema | 129 |
14 | schema \| | 129 |