Speed Up Web Crawling with Parallel Processing in Go (Golang)

Crawling multiple websites in parallel can significantly speed up the process of gathering data from the web. Go (Golang) is well-suited for this task due to its concurrency model based on goroutines. Below is an example of how you can crawl multiple websites in parallel using Go.

Here is a complete program: the package declaration and imports, a fetch function, and the main entry point.

package main

import (
	"fmt"
	"io"
	"net/http"
	"sync"
	"time"
)

// fetchURL fetches the content of a given URL and sends a short
// summary (or an error message) on the results channel.
func fetchURL(url string, wg *sync.WaitGroup, results chan<- string) {
	defer wg.Done()

	// Fetch the URL.
	resp, err := http.Get(url)
	if err != nil {
		results <- fmt.Sprintf("Error fetching URL %s: %v", url, err)
		return
	}
	defer resp.Body.Close()

	// Read the body of the response.
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		results <- fmt.Sprintf("Error reading response body from URL %s: %v", url, err)
		return
	}

	// Report at most the first 100 bytes; slicing a shorter body
	// directly with body[:100] would panic.
	if len(body) > 100 {
		body = body[:100]
	}
	results <- fmt.Sprintf("Fetched URL %s: %s", url, body)
}

func main() {
	start := time.Now()

	// List of URLs to crawl
	urls := []string{
		"http://example.com",
		"http://example.org",
		"http://example.net",
		// Add more URLs as needed
	}

	var wg sync.WaitGroup
	results := make(chan string, len(urls))

	// Create a goroutine for each URL
	for _, url := range urls {
		wg.Add(1)
		go fetchURL(url, &wg, results)
	}

	// Close the results channel when all goroutines are done
	go func() {
		wg.Wait()
		close(results)
	}()

	// Gather results
	for result := range results {
		fmt.Println(result)
	}

	fmt.Printf("Crawling completed in %s\n", time.Since(start))
}

Explanation:

  1. Imports:

    • The fmt package for formatted I/O.
    • The io package to read the response body (io.ReadAll replaced the deprecated io/ioutil.ReadAll in Go 1.16).
    • The net/http package for making HTTP requests.
    • The sync package to manage a WaitGroup.
    • The time package to measure the time taken to complete the operation.
  2. fetchURL Function:

    • This function takes a URL, WaitGroup, and a channel for sending results.
    • It performs the HTTP GET request, reads the response, and sends a formatted string with part of the content to the results channel.
  3. main Function:

    • A list of URLs is defined.
    • A WaitGroup and a channel for results are initialized.
    • A goroutine is spawned for each URL to fetch the content concurrently.
    • A separate goroutine waits for all fetches to finish and then closes the results channel, so the range loop in main terminates.
    • Results are printed as they are received from the results channel.

This example demonstrates a basic web crawler that can be expanded further to handle more complex tasks, such as error handling, rate limiting, and parsing specific data from the fetched web pages.