Checking the Health of Your Site with Web Crawlers

Connor Wade • Sep 27, 2022

Maintaining a website’s quality is an important part of driving traffic and keeping an engaged audience. A link that isn’t working, incorrect content on a page, or broken images will frustrate users, hinder your SEO, and create problems for your marketing team and developers. Hoodoo’s automation team removes these frustrations with custom web crawlers that check the health of your web pages. Despite some recent bad press, web crawlers can be tremendously helpful when used for good, and many businesses run their own crawlers to evaluate the health of their site and identify potential problem areas.


All that said, the negative reputation web crawlers have stems largely from the fact that many people don’t fully understand what they are or how they work. So, let's start at the top.


What is a web crawler? 
Web crawlers (also called web scrapers, depending on the context) have been around for decades. In a non-technical sense, imagine a secretary who goes through every page on your website, following each link and collecting information along the way. You can then use this information to identify errors on your site that need to be corrected.

Even if you do not use web crawling directly, your website is crawled at least once by search engine providers for indexing and SEO purposes, and more often than you might think by rogue bots looking for site vulnerabilities and personal information. While web crawlers can be used with malicious intent, in the right hands they are important tools for ensuring the stability of a website.


How can web crawlers help site owners?

Maintaining a web-crawling tool, especially for large sites with lots of content, goes a long way toward keeping site quality high. Some of the ways a web crawler can help you are:

  • Checking for broken pages, links, and images 
  • Running diagnostics on SEO content 
  • Checking that API data is showing on a page 
  • Checking for unused assets 


Broken pages, links, and images 
A broken page, a broken link, and a broken image are the stuff of nightmares for a marketing team. A quality web crawler is already going through every page and every link anyway, so you might as well have it keep track of which links aren't working and where they originated. Running regular scans for broken server requests as part of your deployment pipeline helps ensure that developers aren’t pushing code to a live site that would interfere with the user experience.
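As a rough sketch, here is what that kind of check could look like with Scrapy. The domain and the status list are placeholders, and a real spider would want throttling and proper reporting, but the shape is the same: follow every internal link and image, and log anything that comes back with an error status.

```python
import scrapy
from scrapy.http import HtmlResponse


class BrokenLinkSpider(scrapy.Spider):
    """Crawl a site, follow every internal link and image, and log broken ones."""

    name = "broken_links"
    # Placeholders -- point these at your own site
    allowed_domains = ["example.com"]
    start_urls = ["https://www.example.com/"]
    # Let error responses reach parse() instead of being silently dropped
    handle_httpstatus_list = [404, 410, 500, 502, 503]

    def parse(self, response):
        if response.status >= 400:
            referer = response.request.headers.get("Referer", b"").decode()
            self.logger.warning("broken: %s (linked from %s)", response.url, referer)
            return
        if not isinstance(response, HtmlResponse):
            return  # images and other binary assets: the status check above is enough
        # Queue every link and image on the page; Scrapy dedupes repeated URLs
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)
        for src in response.css("img::attr(src)").getall():
            yield response.follow(src, callback=self.parse)
```

Run it with `scrapy crawl broken_links` and search the log for "broken:" to get each failing URL alongside the page that linked to it.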


Running diagnostics on SEO content 
A crawler you’ve built specifically for your site is a great way to run regular internal audits on your SEO without paying for third-party evaluations. A crawler that belongs to your team can also be customized to gather the information most relevant to your marketing efforts: it can scrape all your header content and check it against data from your SEO team to ensure that your content is on point to drive traffic. Because most crawlers don't run in a real browser, they aren't well suited to performance testing, but they are excellent for quickly checking the integrity of your content.
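Here is a sketch of that kind of audit, again with Scrapy. The spider below just collects the title, meta description, and first h1 from every page; the domain is a placeholder, and the comparison against your SEO team's expected values would happen on the exported file.

```python
import scrapy
from scrapy.http import HtmlResponse


class SeoAuditSpider(scrapy.Spider):
    """Collect SEO-relevant header content from every page on the site."""

    name = "seo_audit"
    allowed_domains = ["example.com"]          # placeholder
    start_urls = ["https://www.example.com/"]  # placeholder

    def parse(self, response):
        if not isinstance(response, HtmlResponse):
            return
        # One record per page; export with: scrapy crawl seo_audit -O audit.csv
        yield {
            "url": response.url,
            "title": (response.css("title::text").get() or "").strip(),
            "meta_description": response.css(
                'meta[name="description"]::attr(content)'
            ).get() or "",
            "h1": (response.css("h1::text").get() or "").strip(),
        }
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)
```

From there, a simple diff between the exported file and your team's intended titles and descriptions will surface pages that have drifted.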


Checking that API data is showing on a page 
By setting up calls to your APIs and cross-checking the responses against the content crawled from a page, you can quickly verify that API data is actually appearing on the pages where it should. Note that because most crawlers do not use a browser or web engine, content rendered client-side may not be present in the single HTML resource the server returns; it depends on how your website is built. If you are having trouble crawling because content isn’t loaded, consider using a UI testing library, such as Puppeteer, for crawling.
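A minimal sketch of that cross-check, assuming a hypothetical product API and the page that should render it. For a server-rendered page a plain HTTP fetch is enough; for a client-rendered page you would swap the second request for a browser-driven fetch, as discussed below.

```python
import requests

# Hypothetical endpoints -- replace with your own API and the page that renders it
API_URL = "https://api.example.com/v1/products/123"
PAGE_URL = "https://www.example.com/products/123"


def api_data_is_on_page() -> bool:
    """Fetch a field from the API and confirm it appears in the page's HTML."""
    product = requests.get(API_URL, timeout=10).json()
    page_html = requests.get(PAGE_URL, timeout=10).text
    # This substring check only works when the page is rendered server-side;
    # a client-rendered page needs a real browser before the field shows up.
    return product["name"] in page_html


if __name__ == "__main__":
    print("API data present on page:", api_data_is_on_page())
```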


Checking for unused assets 
Improving build time for a project means improving developer workflow and getting a better return on developer time. Having a crawler gather information on which assets are actually used on a site, then comparing that list with the project directory, is a quick way to find out which legacy assets are still in use and which are dead weight. Deleting unused files can lead to serious improvements in build performance for large websites.
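The comparison itself can be simple. The sketch below assumes your crawler has already written the asset URLs it saw to a crawled_assets.txt file, one per line, and that static/ is the directory your build pulls assets from; matching on file name alone is naive, but it makes a quick first pass.

```python
from pathlib import Path
from urllib.parse import urlparse

ASSET_LIST = Path("crawled_assets.txt")  # one asset URL per line, from your crawler
ASSET_DIR = Path("static")               # the directory your build copies assets from

# File names the live site actually references
used = {
    Path(urlparse(line.strip()).path).name
    for line in ASSET_LIST.read_text().splitlines()
    if line.strip()
}

# Everything in the project directory the crawl never saw
for path in sorted(ASSET_DIR.rglob("*")):
    if path.is_file() and path.name not in used:
        print(f"possibly unused: {path}")
```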


What tools are used for web-crawling? 
There are three tools we really like here at Hoodoo. This is by no means an exhaustive list, but it should give you a good starting point:

  • Scrapy (https://scrapy.org/), a Python library - Scrapy is a well-maintained and robust library that is very accessible for most engineers and organizations. 
  • Colly (http://go-colly.org/), a Go library - Colly is an open-source crawler that excels at speed. If you are working with a large site where performance is an issue, Colly should be your first choice. 
  • End-to-end UI testing libraries such as Puppeteer or Playwright - While they are not made for crawling, end-to-end UI testing libraries can be modified to act as web crawlers. Performance will suffer and you may have to cobble things together a little, but the advantages are that nothing is missed and the crawl takes place in a real browser. For instance, if your site uses lazy-loaded images, a plain crawler cannot see them because they are not part of the initial response from the server; a UI testing library can (see the sketch below). 
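To keep the examples in one language, here is that lazy-loading case sketched with Playwright's Python API rather than Puppeteer; the URL is a placeholder. The browser scrolls the page so lazy-loaded images actually fire, then reports the image URLs it really loaded.

```python
# pip install playwright && playwright install chromium
from playwright.sync_api import sync_playwright

PAGE_URL = "https://www.example.com/"  # placeholder -- point at your own page

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto(PAGE_URL, wait_until="networkidle")
    # Scroll toward the bottom so lazy-loaded images are actually requested
    page.mouse.wheel(0, 20000)
    page.wait_for_timeout(2000)  # give late image requests a moment to finish
    # currentSrc reflects the image the browser actually loaded
    srcs = page.eval_on_selector_all(
        "img", "imgs => imgs.map(i => i.currentSrc || i.src)"
    )
    for src in srcs:
        print(src)
    browser.close()
```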


Here at Hoodoo, we use crawlers as part of our automation toolkit to deliver great developer and customer experiences. Crawlers make managing quality, pages, and content a much simpler process, especially for large websites.
