About the Author: Connor manages the Software Development Engineer in Test (SDET) team at Hoodoo Digital. He has two years of experience building end-to-end automated tests designed specifically for customers’ websites. As a key part of the QA team, Connor tests websites and apps. He has experience with Playwright (TypeScript and Node.js), HTML, CSS, TypeScript, React, Flutter, and Go. Connor holds several Adobe certifications, including Adobe Certified Expert, Professional AEM Sites Business Practitioner, and AEM Sites Front-End Developer.
A web crawler can make an excellent tool for quality assurance for a large website. The ability to check the health of every page and most content on a large site is useful when you don’t have the resources otherwise to cover everything.
For this tutorial, we are going to be using Go, a popular programming language from Google, and one of its packages, Colly. Making a crawler with Go and Colly is easy. In this article, I’ll show you how to create a basic script that can check the health of links and images across a website. I’ll also share some things that will help you build a professional crawler.
The steps that we’ll go over for this are:
Installing Go
Starting your project and installing Colly
Starting your Colly project
Checking all links on a site
Making it go fast
Installing Go
If you haven’t installed Go, download it here and follow the installation instructions. If you have installed Go but haven’t updated it recently, do so before you start. Preferably, you’ll use the latest stable version of Go (at the time of writing this article, I am using version 1.17), but at the very least you will need a Go version greater than 1.13.
Starting your project and installing Colly
Now that you have Go installed, we can start our project. Enter your local-repo directory (wherever you keep your code on your machine) and create a new directory for this project. In a bash terminal it would look like this:
mkdir example-crawler
cd example-crawler
Next, let’s set up our Go module to manage packages:
go mod init example.com/crawler
You can replace “example.com/crawler” with wherever you are hosting your code (such as GitHub or Bitbucket). Or, if you don’t intend to share your code, just write whatever you want. The name doesn’t matter; it just provides others with a module path to your project.
Now we are ready to install the Colly package and its dependencies. To do so, run the “go get” command (ellipsis is part of the command):
go get -u github.com/gocolly/colly/...
This should fetch all the packages for Colly and its dependencies. Take some time and ensure that you now have “go.mod” and “go.sum” files.
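If you open go.mod at this point, it should look roughly like the sketch below. The exact version numbers will depend on when you run the command, and go.sum will list checksums for each dependency:

```
module example.com/crawler

go 1.17

require github.com/gocolly/colly v1.2.0
```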
Starting Your Colly Project
To start the project, we will need a main.go file. Technically you can name this file whatever you want, but you must have a “main” function. The main function is where all Go programs start. I’ll create the main.go file with the touch command:
touch main.go
Now let’s open main.go and add some code to it:
package main
func main() {
c := colly.NewCollector()
}
If you save the file, you should see that your editor (assuming it runs goimports on save, as most Go setups do) adds the import for Colly automatically for you.
package main
import "github.com/gocolly/colly"
func main() {
c := colly.NewCollector()
}
Now, in main, I’m going to add some boilerplate code from the Colly website:
package main
import (
"fmt"
"log"
"github.com/gocolly/colly"
)
func main() {
c := colly.NewCollector()
c.OnRequest(func(r *colly.Request) {
fmt.Println("Visiting", r.URL)
})
c.OnError(func(_ *colly.Response, err error) {
log.Println("Something went wrong:", err)
})
c.OnResponse(func(r *colly.Response) {
fmt.Println("Visited", r.Request.URL)
})
c.OnHTML("a[href]", func(e *colly.HTMLElement) {
e.Request.Visit(e.Attr("href"))
})
c.OnHTML("tr td:nth-of-type(1)", func(e *colly.HTMLElement) {
fmt.Println("First column of a table row:", e.Text)
})
c.OnXML("//h1", func(e *colly.XMLElement) {
fmt.Println(e.Text)
})
c.OnScraped(func(r *colly.Response) {
fmt.Println("Finished", r.Request.URL)
})
}
It probably looks like I just added a whole lot of code, but do not fear. Here is a quick, surface-level explanation of each of these methods:
OnRequest: called right before each request is sent.
OnError: called if an error occurs while making a request.
OnResponse: called after a response is received.
OnHTML: called after OnResponse when the response body is HTML, once for each element that matches the given CSS selector.
OnXML: the XML counterpart of OnHTML, using XPath expressions instead of CSS selectors.
OnScraped: called after all OnHTML callbacks have run, once the page is fully processed.
Checking all links on a site
For our simple link-checker, we’re only going to need OnRequest, OnResponse, OnError, and OnHTML.
Let’s remove all the other boilerplate code besides those functions:
...
c.OnRequest(func(r *colly.Request) {
fmt.Println("Visiting", r.URL)
})
c.OnError(func(_ *colly.Response, err error) {
log.Println("Something went wrong:", err)
})
c.OnResponse(func(r *colly.Response) {
fmt.Println("Visited", r.Request.URL)
})
c.OnHTML("a[href]", func(e *colly.HTMLElement) {
e.Request.Visit(e.Attr("href"))
})
}
Now, we are just missing a few things. We haven’t set what site we would like to crawl, and more importantly, we haven’t even started our crawl. I’m going to set my domain at the top of the main function. Feel free to set yours as a global variable if that makes more sense for your program:
...
func main() {
baseURL := "www.example.com"
startingURL := "https://" + baseURL
allowedUrls := []string{baseURL}
c := colly.NewCollector(
colly.AllowedDomains(allowedUrls...),
)
...
I have created a variable to hold my domain and then used that base URL to build a starting URL and an allowed-URLs array. I then set Colly’s “AllowedDomains” property to the allowed URLs. The ellipsis following “allowedUrls” is Go’s variadic expansion: it spreads the array’s entries out into the function’s parameter list.
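As a quick aside, the same “...” expansion works with any variadic Go function. Here is a tiny self-contained illustration (the function and names are my own, purely for demonstration):

```go
package main

import (
	"fmt"
	"strings"
)

// joinAll is variadic: it accepts any number of string arguments.
func joinAll(parts ...string) string {
	return strings.Join(parts, ", ")
}

func main() {
	// A slice is expanded into the argument list with "...",
	// exactly like allowedUrls... in the colly.AllowedDomains call.
	domains := []string{"www.example.com", "blog.example.com"}
	fmt.Println(joinAll(domains...)) // prints "www.example.com, blog.example.com"
}
```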
Now all that’s left is to add the “c.Visit” function to the end of the main function. This will start our crawl of the specified URL.
...
c.OnHTML("a[href]", func(e *colly.HTMLElement) {
e.Request.Visit(e.Attr("href"))
})
fmt.Println("Starting crawl at: ", startingURL)
if err := c.Visit(startingURL); err != nil {
fmt.Println("Error on start of crawl: ", err)
}
}
Preferably, for all professional Go code, you should always catch errors and handle them appropriately. In this case, we just print the error to the console. You could also choose to use “log.Fatal” to terminate the program; however, in this case, our script terminates regardless. Also, while I use fmt prints to report data from the program, for a professional use case I would suggest logging errors with the log package rather than fmt.
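As another aside, printing errors is fine for a first pass, but for QA reporting you will usually want to collect failing URLs rather than just print them. A thread-safe recorder like this sketch (my own type, not part of Colly, and not added to our example code) is safe to call from concurrent callbacks, which will matter once we make the crawler asynchronous later:

```go
package main

import (
	"fmt"
	"sync"
)

// brokenLinks collects failing URLs safely across goroutines.
type brokenLinks struct {
	mu   sync.Mutex
	urls []string
}

// add records one failing URL; the mutex makes it concurrency-safe.
func (b *brokenLinks) add(url string) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.urls = append(b.urls, url)
}

// count reports how many failures were recorded.
func (b *brokenLinks) count() int {
	b.mu.Lock()
	defer b.mu.Unlock()
	return len(b.urls)
}

func main() {
	broken := &brokenLinks{}
	// Inside c.OnError you would call broken.add(r.Request.URL.String()).
	broken.add("https://www.example.com/missing-page")
	fmt.Println("broken links found:", broken.count()) // prints "broken links found: 1"
}
```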
Stop and look at your code. It should look like this now:
package main
import (
"fmt"
"log"
"github.com/gocolly/colly"
)
func main() {
baseURL := "www.example.com"
startingURL := "https://" + baseURL
allowedUrls := []string{baseURL}
c := colly.NewCollector(
colly.AllowedDomains(allowedUrls...),
)
c.OnRequest(func(r *colly.Request) {
fmt.Println("Visiting", r.URL)
})
c.OnError(func(_ *colly.Response, err error) {
log.Println("Something went wrong:", err)
})
c.OnResponse(func(r *colly.Response) {
fmt.Println("Visited", r.Request.URL)
})
c.OnHTML("a[href]", func(e *colly.HTMLElement) {
e.Request.Visit(e.Attr("href"))
})
fmt.Println("Starting crawl at: ", startingURL)
if err := c.Visit(startingURL); err != nil {
fmt.Println("Error on start of crawl: ", err)
}
}
Let’s sum up what our code is doing so far:
It creates a collector restricted to our domain.
It prints each URL as it is requested and visited.
It logs any errors that occur along the way.
It finds every link on each page it visits and follows it, repeating until there are no new pages left.
At this point, you have a basic crawler that can check if any of your pages return an error when visited. For most smaller websites, this crawler is perfectly fine. However, for very large sites with lots of content, we would probably prefer a more performant crawler.
Making it go fast
I said that the reason I use Go and Colly is that they are very fast at what they do. If speed and performance are not an issue, you can use an alternative like Scrapy with Python. But if your website is thousands of pages, it might be worth implementing some of these changes to the code.
Async
Colly supports asynchronous calls to servers to parallelize our requests. This is great because it allows us to keep sending new requests while we are still processing earlier responses. Now, while this does give us a huge performance boost, I will say, I usually notice a bottleneck at some point with responses: I can send lots of requests, but then the responses come in faster than Colly can process them all. If you experience that, Colly does support queues, but they are beyond the scope of this tutorial.
To add asynchronous calls, we need to set Colly’s Async property to be true:
...
c := colly.NewCollector(
colly.AllowedDomains(url...),
colly.Async(true),
)
...
DO NOT RUN THIS YET. Hopefully, you’ve recognized that allowing a piece of code to attack a server with thousands of requests per second is a bad idea. We need to make sure that your crawler doesn’t inadvertently become a DDoS tool and take down the site whose health you were just checking. (Well, unless you’ve done your load testing and already know the site can handle it; then go for it.) We need to limit how many requests we send at one time. We do this by setting Colly’s Limit property with a LimitRule.
...
c := colly.NewCollector(
colly.AllowedDomains(url...),
colly.Async(true),
)
c.Limit(&colly.LimitRule{DomainGlob: "*", Parallelism: 2})
...
DomainGlob specifies which domains the rule applies to. In this case, I want it to apply to every domain that will be visited, so I set it to “*”. Parallelism defines how many requests are in flight at a time. For this example, I’m going to use 2, just so no one comes blaming my article for testing their server load. If you know how many will work for your site, feel free to use that. I will say, though, that I can get away with using 5 for websites that are thousands of pages and still finish crawls in under 30 minutes. If you don’t know, start small and go higher until you hit good enough times without taxing the server too much. However, as I mentioned, at some point you may experience a response bottleneck. If that happens, adding more parallel requests won’t help.
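To build an intuition for what Parallelism: 2 actually does, here is a toy sketch of the idea. This is my own model using a buffered channel as a semaphore, not Colly’s real implementation: each fake “request” must claim one of two slots before it runs, so no more than two are ever in flight at once.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// maxConcurrency runs `jobs` fake requests through a semaphore of size
// `parallelism` and reports the highest concurrency it ever observed.
func maxConcurrency(jobs, parallelism int) int64 {
	sem := make(chan struct{}, parallelism) // one slot per allowed request
	var inFlight, peak int64
	var wg sync.WaitGroup
	for i := 0; i < jobs; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			sem <- struct{}{} // acquire: blocks while all slots are taken
			n := atomic.AddInt64(&inFlight, 1)
			for { // record the peak concurrency seen so far
				p := atomic.LoadInt64(&peak)
				if n <= p || atomic.CompareAndSwapInt64(&peak, p, n) {
					break
				}
			}
			atomic.AddInt64(&inFlight, -1)
			<-sem // release the slot for the next waiting goroutine
		}()
	}
	wg.Wait()
	return peak
}

func main() {
	// With a cap of 2, the observed peak never exceeds 2.
	fmt.Println("peak concurrency:", maxConcurrency(10, 2))
}
```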
There’s just one more thing we need to do before our code will work. We need to add a “Wait” call to the end of the main function so that our crawler waits for all of its requests to finish before the program ends.
...
if err := c.Visit(startingURL); err != nil {
fmt.Println("Error on start of crawl: ", err)
}
c.Wait()
...
Let’s take a look at our entire code once again:
package main
import (
"fmt"
"log"
"github.com/gocolly/colly"
)
func main() {
baseURL := "www.example.com"
startingURL := "https://" + baseURL
url := []string{baseURL}
c := colly.NewCollector(
colly.AllowedDomains(url...),
colly.Async(true),
)
c.Limit(&colly.LimitRule{DomainGlob: "*", Parallelism: 2})
c.OnRequest(func(r *colly.Request) {
fmt.Println("Visiting", r.URL)
})
c.OnError(func(_ *colly.Response, err error) {
log.Println("Something went wrong:", err)
})
c.OnResponse(func(r *colly.Response) {
fmt.Println("Visited", r.Request.URL)
})
c.OnHTML("a[href]", func(e *colly.HTMLElement) {
e.Request.Visit(e.Attr("href"))
})
fmt.Println("Starting crawl at: ", startingURL)
if err := c.Visit(startingURL); err != nil {
fmt.Println("Error on start of crawl: ", err)
}
c.Wait()
}
Now our crawler is ready to run. At this stage, it is perfectly serviceable for basic quality checks on a site’s health.
Where to go from here: