The State of Web Data Collection Tools 🕷️

The best strategies + an extensive comparison of the most popular tools

Luca Rossi
Apr 02, 2025

When an industry trend is growing fast, it is often compared to a gold rush.

This is mainly a reference to the California Gold Rush, which took place in the mid-19th century. A famous related quote is "During a gold rush, sell shovels", attributed to Sam Brannan, a merchant who became a millionaire by selling supplies to miners.

It is well known that the gold rush drove shovel demand, but it is less well known how much that same rush challenged the shovel industry and forced it to improve on every axis: design, production, distribution, and more.

After centuries in which shovels had gone largely unchanged, in less than ten years they became almost unrecognizable: more durable materials, reinforced handles, ergonomic grips, plus a handful of specialized new tools that hadn't existed before.

Fast forward to today: AI is doing the same to many complementary industries. Some of these have existed for a long time, but they now have a renewed, central role in the internet economy because of the unique needs of AI.

Out of these, one of my favorites is web data collection.

As a former co-founder of a startup in the travel space, I have relied on scraping and data extraction of various kinds for almost ten years, so I have been following this space closely. Back then, though, this was a niche need: few products and companies legitimately needed large-scale data from the web.

AI has changed everything for two reasons:

  1. 🧱 Foundational Models — need an unprecedented amount of data for pre-training. We are talking trillions of tokens, which translates to billions of web pages.

  2. 🤖 AI Applications — companies, in turn, build on top of these models, either by fine-tuning them or by creating RAG systems, in both cases feeding huge amounts of data into the already data-hungry base models.

This shift is happening across all domains and industries: e-commerce, finance, security, healthcare, social media, you name it.

In other words, AI finally lets companies make sense of a lot of data that was already available but hard to take advantage of — and much of this data comes from the web.
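To make this concrete, here is a minimal, do-it-yourself extraction sketch of the kind of work every tool discussed later ultimately automates. The URL, page structure, and CSS selector are hypothetical placeholders, not taken from any vendor.

```python
# Minimal DIY extraction sketch. The URL and CSS selector below are hypothetical
# placeholders, not taken from any of the tools discussed in this article.
import requests
from bs4 import BeautifulSoup


def extract_titles(url: str) -> list[str]:
    # Fetch the page; a realistic User-Agent avoids the most trivial blocks.
    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
    response.raise_for_status()

    # Parse the HTML and collect the text of every element matching the selector.
    soup = BeautifulSoup(response.text, "html.parser")
    return [el.get_text(strip=True) for el in soup.select("h2.product-title")]


if __name__ == "__main__":
    print(extract_titles("https://example.com/products"))
```

At a small scale, a script like this may be all you need; the tools compared in this article come into play when scale, blocking, and maintenance grow beyond it.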


🎯 Outline

Last year we partnered with Bright Data on a primer on web scraping fundamentals — that article got a fantastic reception, so this year we are back with a big, extended version that includes our updated outlook on the industry in 2025, plus an in-depth comparison of the best tools that exist today.

And that's because there are a lot of solutions for extracting data from the web, and the best option for your use case depends on a huge number of factors.

So we are going to map everything out in four steps:

  1. 🛠️ Extraction techniques — we cover the four main approaches to collecting web data today.

  2. ✅ Foundational qualities — the high-level qualities that you should look for in a vendor.

  3. 🔍 How to choose — how to determine the best solution for your scenario in practice, based on scale, frequency, type of data, skills, and budget.

  4. 🔭 What’s next — how you can expect the market to evolve in the next few years, and what the main challenges are.

We start with the first principles that should guide your decisions, then move on to practical advice, including a comparison of the most popular tools, such as Bright Data, Oxylabs, Webz, Zyte, Apify, Smartproxy, and NetNut.

The goal is to give you all the tools (pun intended) to make an informed decision for your use case.

Let’s dive in!

