Webcrawling tradeoffs

September 6, 2022

A couple of years ago I built our internal crawling platform at Globality, which needed to scale to two billion pages per crawl. We had to consider some early design tradeoffs that influenced the rest of the architecture. The most fundamental was which rendering engine to adopt - and like everything in systems design, each option came with its own tradeoffs.

The two main types of crawlers deployed in the wild are raw HTTP and headless:

Raw HTTP

Request raw html over the wire. Open a socket to the host and issue a GET for the page of interest, then continue to discovered links in a breadth-first search (BFS).

  • Pros
    • Fast. Only downloads the html text payload (still measured in KB on the largest sites).
    • Trivially parallelizable through async processing or threading in your language of choice. An average server can usually accommodate tens of thousands of requests in parallel.
  • Cons
    • SPAs break: they often don't render at all, or don't wrap their links in <a> tags.
    • Pages whose data is populated by JS often come back blank.
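
To make the raw approach concrete, here is a minimal sketch of the BFS loop using only the Python standard library. The names `bfs_crawl`, `extract_links` and the injected `fetch` callable are my own placeholders rather than anything from the production system; in a real crawler `fetch` would wrap an HTTP client and return the response body, or None on failure.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags in a raw html payload."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def extract_links(base_url, html):
    """Return absolute http(s) links found in the page, fragments removed."""
    parser = LinkExtractor()
    parser.feed(html)
    links = []
    for href in parser.hrefs:
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if urlparse(absolute).scheme in ("http", "https"):
            links.append(absolute)
    return links


def bfs_crawl(seed_url, fetch, max_pages=100):
    """Visit pages breadth-first. `fetch(url)` returns html text or None."""
    seen = {seed_url}
    queue = deque([seed_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # fetch failed or page missing; skip it
        visited.append(url)
        for link in extract_links(url, html):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited
```

Passing `fetch` in as a parameter keeps the traversal logic separate from the network layer, which is also what makes it easy to swap in an async client later.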

Headless Browsers

Using a runtime like WebKit or Google Chrome, execute the logic of the webpage so the crawler sees what a user would see. Run javascript, background scripts, etc.

  • Pros
    • More comprehensive coverage of pages, guaranteed to render SPAs or conditional elements in javascript.
    • Can selectively download image content or reject these requests to save on download time.
  • Cons
    • Quite slow in comparison to raw http requests; has to download often unnecessary scripts (tracker scripts, ads, etc.) and execute javascript.
    • Scaling isn't as trivial. Chromium and WebKit need sizable CPU and memory to launch and run quickly.

Both approaches operate on the extrema of webpage richness. One supports simple pages, the other the most complicated. If you have a particular set of target domains, you can make a decision that is best for your end use. Building a generic crawler that can deal with any domain you throw at it is a more difficult task. An increasing number of sites are built as SPAs or React applications even if they have static content, likely due to the growing popularity of JS on the frontend and Node on the backend.

Still - most websites render fine with plaintext. Going with a headless browser is a waste of resources when dealing with the general case; you're burning CPU to render mostly raw html. But going with raw http then misses out on capturing a solid proportion of these new webpages. You seem stuck with the highest common denominator of rendering requirements. Are headless browsers the only answer?

Hybrid Crawlers

I'm surprised there isn't more discussion of crawlers that blend the two approaches. I call this approach hybrid crawling since it makes use of the strengths of both while trying to minimize their weaknesses. It relies on the notion that pages are plaintext until proven otherwise. If pages don't appear valid when retrieved plain, we can delegate the rendering to a headless browser to pull in additional dependencies.
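
The control flow is small enough to sketch in a few lines. Everything here is illustrative: `hybrid_fetch` and the three callables it takes (`fetch_raw`, `fetch_rendered`, `looks_complete`) are hypothetical hooks that you would back with your own HTTP client, headless cluster and validity check.

```python
def hybrid_fetch(url, fetch_raw, fetch_rendered, looks_complete):
    """Try the cheap raw fetch first and fall back to a headless render.

    Hypothetical hooks supplied by the caller:
      fetch_raw(url) -> str         raw html payload
      fetch_rendered(url) -> str    DOM serialized by a headless browser
      looks_complete(html) -> bool  True when the raw payload seems valid

    Returns (html, engine) so callers can track how often the fallback fires.
    """
    html = fetch_raw(url)
    if looks_complete(html):
        return html, "raw"
    return fetch_rendered(url), "headless"
```

Returning the engine alongside the payload is a deliberate choice in this sketch: the fallback rate is the number you want to watch, since it determines how much headless capacity you actually need.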

I'd like to propose two main strategies for making this identification:

Page-Level Classification: Fetch the page of interest through a raw http request. Featurize the content and determine whether it contains meaningful data. Since this approach likely uses tag counts or page attribute thresholding, it can be very fast and therefore conducted on every page. Some featurization strategies include:

  • Check for the presence of a <noscript> tag, which usually indicates there's some content that can only be revealed when rendered with javascript.
  • Count the number of embedded <script> tags that come from the same domain, which usually render some additional content.
  • Count the number of <a> links that are identified within the body. If there are none or only a handful, it's likely that the page contents aren't fully captured via the raw payload and need to be re-crawled by a headless browser.
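
A rough sketch of these three features with the standard library's `html.parser` follows. The function names and the default thresholds (`min_links=3`, `max_same_domain_scripts=5`) are placeholders I picked for illustration and should be calibrated on your own data. For simplicity it counts every <a> tag with an href, not only those inside the body.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class PageFeatures(HTMLParser):
    """Single pass over the raw html that collects the three signals."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.host = urlparse(page_url).netloc
        self.has_noscript = False
        self.same_domain_scripts = 0
        self.link_count = 0

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "noscript":
            self.has_noscript = True
        elif tag == "script" and attrs.get("src"):
            # Resolve relative src values before comparing hosts.
            src_host = urlparse(urljoin(self.page_url, attrs["src"])).netloc
            if src_host == self.host:
                self.same_domain_scripts += 1
        elif tag == "a" and attrs.get("href"):
            self.link_count += 1


def featurize(page_url, html):
    parser = PageFeatures(page_url)
    parser.feed(html)
    parser.close()
    return {
        "has_noscript": parser.has_noscript,
        "same_domain_scripts": parser.same_domain_scripts,
        "link_count": parser.link_count,
    }


def needs_rendering(features, min_links=3, max_same_domain_scripts=5):
    """Placeholder rule: any one suspicious signal sends the page to headless."""
    return (
        features["has_noscript"]
        or features["link_count"] < min_links
        or features["same_domain_scripts"] > max_same_domain_scripts
    )
```

Treating any single signal as sufficient is the most conservative combination and will over-delegate on sites that only use <noscript> for tracking pixels; a weighted vote is a reasonable next step once you have calibration data.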

Domain-Level Classification: Assume that a domain either uses raw html or rich rendering. Fetch the raw http and full rendering concurrently. Since we are extrapolating for the whole domain, we only have to expend the headless resources a minimum of once per domain.

  • Compare the payload sizes of the body by bytes, by dom tag count, or by words. If the difference is greater than some percent threshold, tag the domain as requiring rich-text and delegate all subsequent pages to the headless browser cluster. Otherwise, continue to crawl as plain text.
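
Here is one way the comparison could look, assuming you already have both payloads for a probe page in hand. `measure`, `domain_needs_headless` and `DomainRouter` are names I made up for this sketch, and the 0.5 threshold is a placeholder rather than a recommended value.

```python
from html.parser import HTMLParser


class PageSize(HTMLParser):
    """Counts DOM start tags and visible words, ignoring script/style bodies."""

    def __init__(self):
        super().__init__()
        self.tags = 0
        self.words = 0
        self._skipping = False

    def handle_starttag(self, tag, attrs):
        self.tags += 1
        if tag in ("script", "style"):
            self._skipping = True

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skipping = False

    def handle_data(self, data):
        if not self._skipping:
            self.words += len(data.split())


def measure(html):
    parser = PageSize()
    parser.feed(html)
    parser.close()
    return {"bytes": len(html.encode("utf-8")), "tags": parser.tags, "words": parser.words}


def domain_needs_headless(raw_html, rendered_html, metric="words", threshold=0.5):
    """True when the raw payload misses more than `threshold` of the rendered one."""
    raw = measure(raw_html)[metric]
    rendered = measure(rendered_html)[metric]
    if rendered == 0:
        return False  # nothing rendered either, so no evidence headless helps
    return (rendered - raw) / rendered > threshold


class DomainRouter:
    """Remembers one decision per domain so the headless probe runs once."""

    def __init__(self):
        self.decisions = {}

    def probe(self, domain, raw_html, rendered_html):
        self.decisions[domain] = domain_needs_headless(raw_html, rendered_html)

    def engine_for(self, domain):
        # Unprobed domains default to raw: plaintext until proven otherwise.
        return "headless" if self.decisions.get(domain, False) else "raw"
```

In practice you would probably probe a few pages per domain rather than one, since a sparse landing page can misrepresent the rest of the site.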

To calibrate some of these hyperparameters, select an initial sample of 100-500 websites that you know are part of your crawl seed set. Crawl these with both the http crawler and the headless browser. For each page, featurize the elements and compare them.

Feature               HTTP    Headless    Ratio (HTTP / Headless)
<noscript> present    yes     no          N/A
<a> counts            2       10          0.2
<script> counts       10      10          1
word counts           250     560         0.44
bytes                 2500    6000        0.42

For page-level classification you'll have to rely on absolute quantities since each page is evaluated in a vacuum. For domain-level classification, you'll be better off relying on the ratios since these contain more signal. If you're lucky there will be a clear bimodal separation between the two ranges, which will let you choose clear hyperparameters by eyeballing it.
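
If you'd rather not eyeball it, a crude stand-in is to sort the ratios from your calibration sample and cut at the widest gap. This is a heuristic I'm assuming for illustration, not a rigorous method, and it only makes sense when the distribution really is bimodal. Both function names are hypothetical.

```python
def ratio(http_value, headless_value):
    """HTTP / headless for one feature; None when the headless count is zero."""
    if headless_value == 0:
        return None
    return http_value / headless_value


def widest_gap_threshold(ratios):
    """Midpoint of the largest gap between consecutive sorted ratios."""
    ordered = sorted(ratios)
    if len(ordered) < 2:
        raise ValueError("need at least two ratios to find a gap")
    # Each candidate is (gap size, lower ratio, upper ratio).
    _gap, low, high = max((b - a, a, b) for a, b in zip(ordered, ordered[1:]))
    return (low + high) / 2
```

You would call `ratio` once per page and feature (for example on the HTTP and headless word counts), collect the results across the sample, and pass them to `widest_gap_threshold`. It's worth plotting a histogram anyway to confirm that the gap it finds is a real valley and not an artifact of a small sample.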

If you want to get even more precise, you can label the datasets by whether the raw html contains enough information for your crawler and train a simple model that weights the input criteria.
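
As an illustration of what "simple" can mean here, below is a dependency-free logistic regression trained with batch gradient descent. This is one possible choice, not necessarily the model I'd recommend for your data. It assumes each row is a tuple of feature ratios and each label is 1 when the raw html was sufficient and 0 otherwise; the learning rate and epoch count are arbitrary defaults.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def score(weights, bias, x):
    """Estimated probability that the raw html is sufficient for this page."""
    return sigmoid(bias + sum(w * v for w, v in zip(weights, x)))


def train_logistic(rows, labels, lr=0.5, epochs=2000):
    """Fit one weight per feature plus a bias using batch gradient descent."""
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, labels):
            error = score(weights, bias, x) - y
            for i, v in enumerate(x):
                grad_w[i] += error * v
            grad_b += error
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(rows)
    return weights, bias


def raw_is_sufficient(weights, bias, x, cutoff=0.5):
    return score(weights, bias, x) >= cutoff
```

The learned weights are a useful byproduct: they tell you which features carry the signal on your seed set, which in turn tells you which ones are worth computing on every page.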

When you're done, you should have a hybrid crawler that balances the best of both worlds.

  • Pros
    • Faster than headless browsers, by my measure an order of magnitude
    • Works for SPAs and javascript-heavy webpages
    • Saves on bandwidth
  • Cons
    • Some initial work and data preparation

Hybrid crawlers shift some of the implementation burden to R&D and data analysis. But if that investment yields a crawler that performs at the sweet spot of time and coverage, the savings accumulate every time it runs. For a crawler that runs perpetually that's typically worth the tradeoff.