Speed Up Web Crawling with Parallel Processing in Go (Golang)
Crawling multiple websites in parallel can significantly speed up the process of gathering data from the web. Go (Golang) is well-suited for this task due to its concurrency model based on goroutines. Below is an example of how you can crawl multiple websites in parallel using Go.
First, you'll need to import the necessary packages:
package main

import (
	"fmt"
	"io"
	"net/http"
	"sync"
	"time"
)

// fetchURL fetches the content of a given URL and sends a result string on the channel.
func fetchURL(url string, wg *sync.WaitGroup, results chan<- string) {
	defer wg.Done()
	// Fetch the URL.
	resp, err := http.Get(url)
	if err != nil {
		results <- fmt.Sprintf("Error fetching URL %s: %v", url, err)
		return
	}
	defer resp.Body.Close()
	// Read the body of the response.
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		results <- fmt.Sprintf("Error reading response body from URL %s: %v", url, err)
		return
	}
	// Guard the slice so responses shorter than 100 bytes don't cause a panic.
	n := 100
	if len(body) < n {
		n = len(body)
	}
	results <- fmt.Sprintf("Fetched URL %s: %s", url, string(body[:n])) // first 100 chars at most
}
func main() {
	start := time.Now()
	// List of URLs to crawl
	urls := []string{
		"http://example.com",
		"http://example.org",
		"http://example.net",
		// Add more URLs as needed
	}
	var wg sync.WaitGroup
	results := make(chan string, len(urls))
	// Create a goroutine for each URL
	for _, url := range urls {
		wg.Add(1)
		go fetchURL(url, &wg, results)
	}
	// Close the results channel when all goroutines are done
	go func() {
		wg.Wait()
		close(results)
	}()
	// Gather results
	for result := range results {
		fmt.Println(result)
	}
	fmt.Printf("Crawling completed in %s\n", time.Since(start))
}
Explanation:
- Imports:
  - The fmt package for formatted I/O.
  - The io package to read the response body (io/ioutil is deprecated since Go 1.16).
  - The net/http package for making HTTP requests.
  - The sync package to manage a WaitGroup.
  - The time package to measure how long the crawl takes.
- fetchURL function:
  - Takes a URL, a WaitGroup, and a channel for sending results.
  - Performs the HTTP GET request, reads the response, and sends a formatted string with part of the content to the results channel.
- main function:
  - A list of URLs is defined.
  - A WaitGroup and a buffered results channel are initialized.
  - A goroutine is spawned for each URL to fetch the content concurrently.
  - A separate goroutine closes the results channel once all fetches are done.
  - Results are printed as they arrive on the results channel.
This example demonstrates a basic web crawler that can be expanded further to handle more complex tasks, such as retries, rate limiting, and parsing specific data from the fetched web pages.
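Spawning one goroutine per URL is fine for a handful of sites, but with thousands of URLs it can exhaust file descriptors or hammer the target servers. A common way to bound concurrency is a buffered channel used as a semaphore; here is a sketch (crawlBounded is a hypothetical helper, and the fetch callback stands in for real fetching logic like fetchURL above):

```go
package main

import (
	"fmt"
	"sync"
)

// crawlBounded runs fetch for every URL, but at most limit at a time.
func crawlBounded(urls []string, limit int, fetch func(string) string) []string {
	var wg sync.WaitGroup
	sem := make(chan struct{}, limit) // semaphore: at most limit tokens in flight
	results := make(chan string, len(urls))

	for _, url := range urls {
		wg.Add(1)
		go func(u string) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a slot
			defer func() { <-sem }() // release it when done
			results <- fetch(u)
		}(url)
	}
	wg.Wait()
	close(results)

	var out []string
	for r := range results {
		out = append(out, r)
	}
	return out
}

func main() {
	urls := []string{"a", "b", "c", "d", "e"}
	got := crawlBounded(urls, 2, func(u string) string { return "fetched " + u })
	fmt.Println(len(got), "results")
}
```

The semaphore only gates how many fetches run simultaneously; for per-host politeness you would additionally need a rate limiter keyed by hostname.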