You can also set a lower limit for exploratory purposes.
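If you want to reproduce this in code, here is a minimal sketch assuming the advertools Python library (whose crawler runs on Scrapy); `CLOSESPIDER_PAGECOUNT` is a standard Scrapy setting that stops the crawl after a given number of pages. The URL and the limit of 100 are hypothetical:

```python
import advertools as adv

# Exploratory crawl capped at 100 pages via a standard Scrapy setting.
adv.crawl(
    "https://example.com",                            # hypothetical site
    "exploratory_crawl.jl",                           # output file (JSON lines)
    follow_links=True,
    custom_settings={"CLOSESPIDER_PAGECOUNT": 100},   # the lower limit
)
```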
For custom extraction, you need to enter two things in the tables above:
- Column Name: This should be a descriptive name for the page elements you want to extract. Examples could be "product_price", "blog_author", etc.
- XPath/CSS Selector: This is the selector pattern for the element(s) you want to extract, as shown in the sketch after this list.
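The same column-name-to-selector mapping can be expressed in code. A minimal sketch assuming advertools, whose `crawl` function accepts `xpath_selectors` and `css_selectors` dictionaries; the keys become column names in the output, and the example names and selectors are hypothetical:

```python
import advertools as adv

adv.crawl(
    "https://example.com",
    "custom_extraction.jl",
    # Extract with XPath: column "product_price" from a hypothetical price span.
    xpath_selectors={"product_price": '//span[@class="price"]/text()'},
    # Extract with CSS: column "blog_author" from a hypothetical author element.
    css_selectors={"blog_author": ".author-name::text"},
)
```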
The user-agents are listed under human-readable device names like "iPhone 13", "Samsung Galaxy S22", etc.
You can also paste any user-agent into the input box, so you are not limited to the ones provided.
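Programmatically, the same effect can be achieved with Scrapy's standard `USER_AGENT` setting. A sketch assuming advertools, with a sample iPhone user-agent string:

```python
import advertools as adv

adv.crawl(
    "https://example.com",
    "ua_crawl.jl",
    custom_settings={
        # Any user-agent string works here, not just the presets.
        "USER_AGENT": (
            "Mozilla/5.0 (iPhone; CPU iPhone OS 15_0 like Mac OS X) "
            "AppleWebKit/605.1.15 (KHTML, like Gecko) "
            "Version/15.0 Mobile/15E148 Safari/604.1"
        )
    },
)
```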
To activate spider mode, just select the "Follow links" checkbox. Add more than one URL to run in list mode. You can of course combine both.
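Combining both modes in code, a sketch assuming advertools: multiple start URLs give you list mode, and `follow_links=True` gives you spider mode. The URLs are hypothetical:

```python
import advertools as adv

adv.crawl(
    # List mode: several start URLs.
    ["https://example.com", "https://example.com/blog/"],
    "combined_crawl.jl",
    # Spider mode: discover and follow new links from each crawled page.
    follow_links=True,
)
```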
When the crawler encounters new links, should it follow them if they contain (or don't contain) the URL parameters that you chose?
Similar to the above, but using a regular expression to match links.
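These follow rules also exist as options in code. A sketch assuming a recent version of advertools, whose `crawl` function exposes parameter-based and regex-based filters; the parameter names and patterns below are hypothetical examples:

```python
import advertools as adv

adv.crawl(
    "https://example.com",
    "filtered_crawl.jl",
    follow_links=True,
    # Don't follow links whose URLs carry these query parameters.
    exclude_url_params=["utm_source", "sessionid"],
    # Only follow links whose URLs match this regex.
    include_url_regex="/blog/|/sports/",
)
```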
Once you hit the "Start crawling" button, you will be given a URL where you can start auditing and analyzing the crawled website. The following are some of the available features, together with a link to explore a live dashboard.
Using an interactive treemap chart, you can see how the website's content is split across sections like /blog/, /sports/, etc.
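One way to build a similar treemap yourself, sketched with advertools and Plotly; "crawl.jl" is a hypothetical crawl output file, and `url_to_df` splits each URL into its directory components:

```python
import advertools as adv
import pandas as pd
import plotly.express as px

crawl_df = pd.read_json("crawl.jl", lines=True)
# Split URLs into dir_1, dir_2, ... columns.
url_df = adv.url_to_df(crawl_df["url"])
# Root URLs have no first directory; label them so the treemap renders.
url_df["dir_1"] = url_df["dir_1"].fillna("(root)")

fig = px.treemap(url_df, path=["dir_1"], title="Content split by top URL directory")
fig.show()
```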
For each of the above structured data types, you can see the count and percentage of URLs that contain each tag of that type. For example, for the @context JSON-LD tag, you can see how many URLs contain it and their percentage of the crawled URLs.
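Computing such a count and percentage by hand, a sketch with pandas; it assumes an advertools crawl file where JSON-LD tags appear as "jsonld_"-prefixed columns (the exact column name depends on the crawled site):

```python
import pandas as pd

crawl_df = pd.read_json("crawl.jl", lines=True)

# URLs containing the @context JSON-LD tag, as a count and a percentage.
count = crawl_df["jsonld_@context"].notna().sum()
pct = count / len(crawl_df) * 100
print(f"{count} URLs ({pct:.1f}%) contain the @context JSON-LD tag")
```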
Once you see an interesting insight, you can export a subset of URLs and columns. For example, get the URL, title, and status of URLs whose status code is not 200, or get the URL, h1, and size of pages whose size is larger than 300KB, and so on.
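The same exports, sketched with pandas on a hypothetical crawl file; the column names follow advertools conventions, where "size" is the response size in bytes:

```python
import pandas as pd

crawl_df = pd.read_json("crawl.jl", lines=True)

# URL, title, and status of pages whose status code is not 200.
crawl_df[crawl_df["status"] != 200][["url", "title", "status"]].to_csv(
    "non_200.csv", index=False
)

# URL, h1, and size of pages larger than 300 KB.
crawl_df[crawl_df["size"] > 300_000][["url", "h1", "size"]].to_csv(
    "large_pages.csv", index=False
)
```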
For a selected page element, get the counts of each element on the website. How many times was each h2, meta_desc, etc. element duplicated across the website?
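Counting duplicated elements site-wide, a sketch with pandas; it assumes the advertools convention of joining multiple values per page with "@@":

```python
import pandas as pd

crawl_df = pd.read_json("crawl.jl", lines=True)

# One page can have several h2 tags, joined with "@@"; split, flatten, count.
h2_counts = (
    crawl_df["h2"]
    .str.split("@@")
    .explode()
    .value_counts()
)
print(h2_counts.head(10))  # the most repeated h2 values across the website
```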
External links: See which domains the crawled website links to the most.
Internal links: Get a score for each internal URL as a node in the network (the website).
Note that it is up to you to define what "internal" means using a regular expression.
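A sketch of classifying link targets as internal or external with a user-supplied regex, using pandas; the "links_url" column and "@@" separator follow advertools conventions, and the regex is a hypothetical example:

```python
import pandas as pd
from urllib.parse import urlsplit

crawl_df = pd.read_json("crawl.jl", lines=True)

# All outgoing links found on the crawled pages, one per row.
links = crawl_df["links_url"].str.split("@@").explode().dropna()

# You define what "internal" means with a regular expression.
internal_regex = r"https?://(?:www\.)?example\.com"
is_internal = links.str.match(internal_regex)

# Which external domains does the site link to the most?
external_domains = links[~is_internal].apply(lambda u: urlsplit(u).netloc)
print(external_domains.value_counts().head(10))
```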
For an example of those features, you can explore this crawl audit and analytics dashboard.