How to Make a Web Crawler Using Go and Colly Tutorial

Connor Wade • Sep 15, 2022

About the Author: Connor manages the Software Development Engineer in Test (SDET) team at Hoodoo Digital. He has 2 years of experience building end-to-end automated tests designed specifically for customers' websites. As a key part of the QA team, Connor tests websites and apps. He has experience in Playwright.js (TypeScript and Node.js), HTML, CSS, TypeScript, React, Flutter, and Go. Connor holds several Adobe Certifications, including Adobe Certified Expert, Professional AEM Sites Business Practitioner, and AEM Sites Front-End Developer.


A web crawler can be an excellent quality assurance tool for a large website. The ability to check the health of every page and most of the content on a large site is useful when you don't otherwise have the resources to cover everything.

 

For this tutorial, we are going to use Go, a popular programming language from Google, and one of its packages, Colly. Making a crawler with Go and Colly is easy. In this article, I'll show you how to create a basic script that can check the health of links and images across a website. I'll also share some things that will help you build a professional crawler.

The steps that we’ll go over for this are: 

  

  1. Installing Go to your machine  
  2. Starting your project and installing Colly  
  3. Coding the crawler to check links  
  4. Making it go fast  

 

Installing Go  

 If you haven’t installed Go, download it here and follow the installation instructions. If you have installed Go but haven’t recently updated it, do it before you start. Preferably, you’ll use the latest stable version of Go (at the time of writing this article, I am using version 1.17), but at the very least you will need a Go version greater than 1.13.  
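You can check which version is installed by running the following command in your terminal (the exact output will vary by version and platform):

go version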

 

Starting your project and installing Colly  

 Now that you have Go installed, we can start our project. Enter your local-repo directory (wherever you keep your code on your machine) and create a new directory for this project. In a bash terminal it would look like this:  

 

mkdir example-crawler  

cd example-crawler 

 

Next, let’s set up our Go module to manage packages:  

 

go mod init example.com/crawler

 

You can replace "example.com/crawler" with the path where you host your code (such as a GitHub or Bitbucket repository). Or, if you don't intend to share your code, just write whatever you want. The name doesn't matter much; it simply gives others a module path to your project.

 

Now we are ready to install the Colly package and its dependencies. To do so, run the “go get” command (ellipsis is part of the command):  

 

go get -u github.com/gocolly/colly/...

 

This should fetch all the packages for Colly and its dependencies. Take some time and ensure that you now have “go.mod” and “go.sum” files.   
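If everything worked, your go.mod should now contain the module path you chose plus a require entry for Colly, roughly like this (version numbers will vary):

module example.com/crawler

go 1.17

require github.com/gocolly/colly v1.2.0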

 

Starting Your Colly Project  

To start the project, we will need a main.go file. Technically you can name this file whatever you want, but it must declare package main and contain a "main" function, which is where every Go program starts. I'll create the main.go file with the touch command:

 

touch main.go 

  

Now let’s open main.go and add some code to it: 

 

package main

func main() {
    c := colly.NewCollector()
}

 

If you save the file, your editor's Go tooling (for example, goimports or gopls) should add the import for Colly automatically for you:

package main

import "github.com/gocolly/colly"

func main() {
    c := colly.NewCollector()
}

 

Now, in main, I’m going to add some boilerplate code from the Colly website:  

package main

import (
    "fmt"
    "log"

    "github.com/gocolly/colly"
)

func main() {
    c := colly.NewCollector()

    c.OnRequest(func(r *colly.Request) {
        fmt.Println("Visiting", r.URL)
    })

    c.OnError(func(_ *colly.Response, err error) {
        log.Println("Something went wrong:", err)
    })

    c.OnResponse(func(r *colly.Response) {
        fmt.Println("Visited", r.Request.URL)
    })

    c.OnHTML("a[href]", func(e *colly.HTMLElement) {
        e.Request.Visit(e.Attr("href"))
    })

    c.OnHTML("tr td:nth-of-type(1)", func(e *colly.HTMLElement) {
        fmt.Println("First column of a table row:", e.Text)
    })

    c.OnXML("//h1", func(e *colly.XMLElement) {
        fmt.Println(e.Text)
    })

    c.OnScraped(func(r *colly.Response) {
        fmt.Println("Finished", r.Request.URL)
    })
}

 

It probably looks like I just added a whole lot of code, but do not fear. I will give you a quick, surface-level explanation for each of these methods: 

 

  • OnRequest – runs when our program sends a request to the server.  

 

  • OnError – runs when or if we receive an error from the server. In Colly, that is any response with a status code outside the 2xx range.

 

  • OnResponse – runs when our program receives a response from the server.  

 

  • OnHTML – runs when our program accesses the HTML resource that was served to it. It takes in a selector as the first argument. Selectors in Colly use the goquery library [https://github.com/PuerkitoBio/goquery] which uses the same syntax as jQuery selectors. If you know your CSS selectors, you should be good to go. 

 

  • OnXML – runs if our program receives an XML resource rather than an HTML resource. This is usually helpful for scraping sitemaps and other site resources. For XML, you usually use XPath selectors (see the short sitemap sketch after this list).

 

  • OnScraped – runs after the program stops scraping a resource.  I have rarely ever seen this method utilized. Usually, any functionality you may want to associate with it is better handled by one of the other methods. 
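As a rough sketch of OnXML in practice (assuming your site exposes a standard sitemap at /sitemap.xml; the XPath and URL here are illustrative and may need adjusting for your site), you could feed every page listed in the sitemap to the crawler:

    c.OnXML("//urlset/url/loc", func(e *colly.XMLElement) {
        // Each <loc> element contains a page URL; queue it for a visit.
        e.Request.Visit(e.Text)
    })

    // Kicked off with something like:
    // c.Visit("https://www.example.com/sitemap.xml")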

 

Checking all links on a site  

For our simple link-checker, we’re only going to need OnRequest, OnResponse, OnError, and OnHTML.  

Let’s remove all the other boilerplate code besides those functions:  

 

...

    c.OnRequest(func(r *colly.Request) {
        fmt.Println("Visiting", r.URL)
    })

    c.OnError(func(_ *colly.Response, err error) {
        log.Println("Something went wrong:", err)
    })

    c.OnResponse(func(r *colly.Response) {
        fmt.Println("Visited", r.Request.URL)
    })

    c.OnHTML("a[href]", func(e *colly.HTMLElement) {
        e.Request.Visit(e.Attr("href"))
    })

  

Now we are just missing a few things: we haven't set which site we would like to crawl, and, more importantly, we haven't actually started our crawl. I'm going to set my domain at the top of the main function. Feel free to set yours as a global variable if that makes more sense for your program:

...

func main() {
    baseURL := "www.example.com"
    startingURL := "https://" + baseURL
    allowedUrls := []string{baseURL}

    c := colly.NewCollector(
        colly.AllowedDomains(allowedUrls...),
        colly.MaxDepth(0),
        colly.IgnoreRobotsTxt(),
    )

...

 

I have created a variable to hold my domain and then used that base URL to create a starting URL and a slice of allowed URLs. I then passed the allowed URLs to Colly's "AllowedDomains" option. The ellipsis following "allowedUrls" is Go's variadic expansion: it spreads the slice's entries out as individual arguments to the function.
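For example, if your site's internal links sometimes use the bare domain and sometimes the www subdomain (an assumption; adjust for your own site), you could allow both forms so neither gets filtered out:

    // Hypothetical: allow both the www and bare forms of the domain.
    allowedUrls := []string{"www.example.com", "example.com"}

    c := colly.NewCollector(
        // Equivalent to colly.AllowedDomains("www.example.com", "example.com")
        colly.AllowedDomains(allowedUrls...),
    )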

 

Now all that's left is to add a call to "c.Visit" at the end of the main function. This will start our crawl at the specified URL.

 

...

    c.OnHTML("a[href]", func(e *colly.HTMLElement) {
        e.Request.Visit(e.Attr("href"))
    })

    fmt.Println("Starting crawl at: ", startingURL)

    if err := c.Visit(startingURL); err != nil {
        fmt.Println("Error on start of crawl: ", err)
    }
}

 

Preferably, in all professional Go code, you should catch errors and handle them appropriately. In this case, we are just going to print the error to the console. You could also choose to use "log.Fatal" to terminate the program; however, in this case, our script will terminate regardless. And while I'm using fmt prints to surface data from the program here, for professional use I'd suggest logging errors with the log package rather than fmt.
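If you would rather stop immediately when the initial visit fails, a "log.Fatal" variant might look like this (a sketch; log.Fatal prints the message and then exits the program):

    if err := c.Visit(startingURL); err != nil {
        // log.Fatal logs the error and calls os.Exit(1), ending the program immediately.
        log.Fatal("Error on start of crawl: ", err)
    }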

 

Stop and look at your code. It should look like this now:  

  

package main

import (
    "fmt"
    "log"

    "github.com/gocolly/colly"
)

func main() {
    baseURL := "www.example.com"
    startingURL := "https://" + baseURL
    allowedUrls := []string{baseURL}

    c := colly.NewCollector(
        colly.AllowedDomains(allowedUrls...),
        colly.MaxDepth(0),
        colly.IgnoreRobotsTxt(),
    )

    c.OnRequest(func(r *colly.Request) {
        fmt.Println("Visiting", r.URL)
    })

    c.OnError(func(_ *colly.Response, err error) {
        log.Println("Something went wrong:", err)
    })

    c.OnResponse(func(r *colly.Response) {
        fmt.Println("Visited", r.Request.URL)
    })

    c.OnHTML("a[href]", func(e *colly.HTMLElement) {
        e.Request.Visit(e.Attr("href"))
    })

    fmt.Println("Starting crawl at: ", startingURL)

    if err := c.Visit(startingURL); err != nil {
        fmt.Println("Error on start of crawl: ", err)
    }
}

 

Let’s sum up what our code is doing so far:  

  1. We are setting our base URL.  
  2. We are then using that base URL to constrain our crawler to only crawl URLs with the same domain name.  
  3. We are initializing a new Colly crawler and setting its allowed domains properties.  
  4. We are starting a crawl by visiting the starting URL.  
  5. We are requesting resources from the web server where our website is hosted, and on each request we print the URL we are visiting.
  6. If there is an error code received from the server after our request, we are printing that error.  
  7. Once we have the HTML resource from the site, we are looking for anchor tags with href properties. We grab the URL from the href and use that to request a new page.  
  8. The crawler is then going to repeat steps 4-7 until there are no more unique links to follow.  

 

At this point, you have a basic crawler that can check whether any of your pages return an error when visited. For most smaller websites, this crawler is perfectly fine. However, for very large sites with lots of content, we would probably prefer a more performant crawler.

 

Making it go fast  

 

I said earlier that the reason I use Go and Colly is that they are very fast at what they do. If speed and performance are not an issue, you can use an alternative like Scrapy with Python. But if your website has thousands of pages, it might be worth implementing some of these changes to the code.

 

Async  

 

Colly supports asynchronous calls to servers to parallelize our requests. This is great because it allows us to send new requests while we are still processing the HTML responses we have already received. While this gives us a huge performance boost, I will say I usually notice a bottleneck at some point with responses: I can send lots of requests, but then the responses come in so fast that Colly can't quite keep up with them all. If you experience that, Colly does support queues, but they are beyond the scope of this tutorial.

 

To add asynchronous calls, we need to set Colly’s Async property to be true: 

 

...

    c := colly.NewCollector(
        colly.AllowedDomains(allowedUrls...),
        colly.Async(true),
    )

...

  

DO NOT RUN THIS YET. Hopefully, you've recognized that allowing a piece of code to attack a server with thousands of requests per second is a bad idea. We need to make sure that your crawler doesn't inadvertently become a DDoS tool and take down the site you were just checking the health of. (Well, unless you've done your load testing and already know the site can handle it, then go for it.) We need to limit how many requests we send at one time. We do this by setting Colly's Limit property with a LimitRule.

 

...

    c := colly.NewCollector(
        colly.AllowedDomains(allowedUrls...),
        colly.Async(true),
    )

    c.Limit(&colly.LimitRule{DomainGlob: "*", Parallelism: 2})

...

 

DomainGlob specifies which domains the rule applies to. In this case, I want it to apply to every domain the crawler visits, so I set it to "*". Parallelism defines how many requests to send at a time. For this example, I'm going to use 2 just so no one comes blaming my article for testing their server load. If you know what number works for your site, feel free to use that. I will say, though, that I can get away with using 5 for websites that are thousands of pages and still finish crawls in under 30 minutes. If you don't know, start small and increase it until crawl times are good enough without taxing the server too much. However, as I mentioned, at some point you may experience a response bottleneck. If that happens, adding more parallel requests won't help.
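If even a small Parallelism value feels too aggressive, LimitRule also has Delay and RandomDelay fields that space requests out over time. A sketch (the values here are arbitrary, and the "time" package must be imported):

    // Two parallel requests per matching domain, plus up to a second of
    // delay between requests to spread out the load on the server.
    if err := c.Limit(&colly.LimitRule{
        DomainGlob:  "*",
        Parallelism: 2,
        Delay:       500 * time.Millisecond,
        RandomDelay: 500 * time.Millisecond,
    }); err != nil {
        log.Println("Could not set limit rule:", err)
    }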

 

There's just one more thing we need to do before our code will work. We need to add a call to "c.Wait()" at the end of the main function so that our crawler waits for all of its requests to finish before the program ends.

  

...

    if err := c.Visit(startingURL); err != nil {
        fmt.Println("Error on start of crawl: ", err)
    }

    c.Wait()
}

Let’s take a look at our entire code once again:  

package main

import (
    "fmt"
    "log"

    "github.com/gocolly/colly"
)

func main() {
    baseURL := "www.example.com"
    startingURL := "https://" + baseURL
    allowedUrls := []string{baseURL}

    c := colly.NewCollector(
        colly.AllowedDomains(allowedUrls...),
        colly.MaxDepth(0),
        colly.IgnoreRobotsTxt(),
        colly.Async(true),
    )

    c.Limit(&colly.LimitRule{DomainGlob: "*", Parallelism: 2})

    c.OnRequest(func(r *colly.Request) {
        fmt.Println("Visiting", r.URL)
    })

    c.OnError(func(_ *colly.Response, err error) {
        log.Println("Something went wrong:", err)
    })

    c.OnResponse(func(r *colly.Response) {
        fmt.Println("Visited", r.Request.URL)
    })

    c.OnHTML("a[href]", func(e *colly.HTMLElement) {
        e.Request.Visit(e.Attr("href"))
    })

    fmt.Println("Starting crawl at: ", startingURL)

    if err := c.Visit(startingURL); err != nil {
        fmt.Println("Error on start of crawl: ", err)
    }

    c.Wait()
}

 

Now our crawler is ready to run. At this stage, it is perfectly serviceable for basic quality checks on a site’s health. 

Where to go from here: 

  • Create reports for broken links for your development and content teams (a rough starting point is sketched below)
  • Check for broken images (also sketched below)
  • Check header content and create reports for the SEO team 
  • Check important content across pages for content teams 
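As a starting point for the first two items, here is a rough sketch built on the handlers above (it is not part of the tutorial's final code): it records failing URLs for a simple report and also follows image sources so that broken images surface through OnError. If you keep Async enabled, guard the shared slice with a sync.Mutex, since handlers can run concurrently.

    // Collect failing URLs so they can be reported after the crawl.
    // NOTE: with colly.Async(true), protect this slice with a sync.Mutex.
    var brokenLinks []string

    c.OnError(func(r *colly.Response, err error) {
        brokenLinks = append(brokenLinks, fmt.Sprintf("%s (%d): %v", r.Request.URL, r.StatusCode, err))
    })

    // Request each image the same way we request linked pages; a broken
    // image then shows up in OnError like any other failed request.
    // AllowedDomains still applies, so off-site images are skipped.
    c.OnHTML("img[src]", func(e *colly.HTMLElement) {
        e.Request.Visit(e.Attr("src"))
    })

    // After c.Wait(), print a simple report:
    for _, link := range brokenLinks {
        fmt.Println("Broken:", link)
    }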

