The State of Web Data Collection Tools 🕷️

The best strategies + an extensive comparison of the most popular tools

Luca Rossi
Apr 02, 2025

When an industry trend is growing fast, it is often compared to a gold rush.

This is mainly a reference to the California Gold Rush, which took place in the mid-19th century. A famous related quote is "During a gold rush, sell shovels", attributed to Sam Brannan, a merchant who became a millionaire by selling supplies to miners.

It is well known that the gold rush drove shovel demand, but it is less well known how much that same rush challenged the shovel industry and forced it to improve on every axis: design, production, distribution, and more.

After centuries in which shovels had gone largely unchanged, in less than ten years they became almost unrecognizable: more durable materials, reinforced handles, ergonomic grips, plus a handful of specialized new tools that hadn't existed before.

Fast forward to today: AI is doing the same to many complementary industries. Some of these have existed for a long time, but they now have a renewed, central role in the internet economy because of the unique needs of AI.

Out of these, one of my favorites is web data collection.

As a former co-founder of a startup in the travel space, I have relied on scraping and data extraction of various kinds for almost ten years, so I have been following this space closely. Back then, though, this was a niche need: few products and companies legitimately needed large-scale data from the web.

AI has changed everything for two reasons:

  1. 🧱 Foundational Models — need an unprecedented amount of data for pre-training. We are talking trillions of tokens, which translates to billions of web pages.

  2. 🤖 AI Applications — companies, in turn, build on top of these models, either by fine-tuning them or by creating RAG systems, in both cases feeding huge amounts of data into the already data-hungry base models.

This shift is happening across all domains and industries: e-commerce, finance, security, healthcare, social media, you name it.

In other words, AI finally lets companies make sense of a lot of data that was already available but hard to take advantage of — and much of this data comes from the web.
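To make this concrete, here is a minimal, do-it-yourself extraction sketch of the kind of work every tool discussed later ultimately automates. The URL, page structure, and CSS selector are hypothetical placeholders, not taken from any vendor.

```python
# Minimal DIY extraction sketch. The URL and CSS selector below are hypothetical
# placeholders, not taken from any of the tools discussed in this article.
import requests
from bs4 import BeautifulSoup


def extract_titles(url: str) -> list[str]:
    # Fetch the page; a realistic User-Agent avoids the most trivial blocks.
    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
    response.raise_for_status()

    # Parse the HTML and collect the text of every element matching the selector.
    soup = BeautifulSoup(response.text, "html.parser")
    return [el.get_text(strip=True) for el in soup.select("h2.product-title")]


if __name__ == "__main__":
    print(extract_titles("https://example.com/products"))
```

At a small scale, a script like this may be all you need; the tools compared in this article come into play when scale, blocking, and maintenance grow beyond it.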


🎯 Outline

Last year we partnered with Bright Data on a primer on web scraping fundamentals — that article got a fantastic reception, so this year we are back with a big, extended version that includes our updated outlook on the industry in 2025, plus an in-depth comparison of the best tools that exist today.

And that's because there are a lot of solutions for extracting data from the web, and the best option for your use case depends on a huge number of factors.

So we are going to map everything out in four steps:

  1. 🛠️ Extraction techniques — we cover the four main approaches to collecting web data today.

  2. ✅ Foundational qualities — the high-level qualities that you should look for in a vendor.

  3. 🔍 How to choose — how to determine the best solution for your scenario in practice, based on scale, frequency, type of data, skills, and budget.

  4. 🔭 What’s next — how you can expect the market to evolve in the next few years, and what the main challenges are.

We start with the first principles that should guide your decisions, then move on to practical advice, including a comparison of the most popular tools, such as Bright Data, Oxylabs, Webz, Zyte, Apify, Smartproxy, and NetNut.

The goal is to give you all the tools (pun intended) to make an informed decision for your use case.

Let’s dive in!

