Agentic Web Scraping in 2024
Web scraping best practices have evolved significantly in the past couple of years, with the rise of agentic web scraping marking a new era in data collection and analysis. In this post, we'll explore the concept of agentic web scraping, its benefits, and how it is transforming the landscape of data-driven decision-making.
Evolution of Web Scraping
Traditionally, web scraping involved extracting data from websites by mimicking browser behavior through HTTP requests and web automation frameworks like Selenium, Puppeteer, or Playwright. This process required developers to write site-specific code for each website, making it time-consuming, error-prone, and brittle whenever a site's structure changed. So much so that, early on, 50% to 70% of engineering resources in data aggregation teams were spent maintaining scraping systems. With the advent of agentic web scraping, however, this approach has been revolutionized. LLMs are able to make sense of almost any data thrown at them, allowing them to understand large amounts of raw HTML and make decisions based on it.
This comes with a drawback, however: the more unstructured data you throw at an LLM, the more likely it is to make mistakes, and the more tokens it consumes. This is why it's important to feed the model data that is as structured and human-readable as possible.
Structuring Data for Agentic Web Scraping
To use LLM Scraper Agents and Reasoning Agents, we first need to convert raw HTML into a more structured format. Markdown is a great choice here: it is human-readable and easily parsed by LLMs. After converting scraped data into structured markdown, we can feed it into Scraper Agents and Reasoning Agents to make sense of it and extract insights.
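As a minimal sketch of this conversion step, the snippet below fetches a page and converts its HTML to markdown locally using the Python `requests`, `beautifulsoup4`, and `markdownify` libraries (the URL and the choice of `markdownify` are illustrative; the hosted APIs covered below do the same job at scale):

```python
import requests
from bs4 import BeautifulSoup
from markdownify import ATX, markdownify as md

def page_to_markdown(url: str) -> str:
    """Fetch a page and convert its raw HTML into LLM-friendly markdown."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    # Drop script/style blocks so their contents don't leak into the markdown
    for tag in soup(["script", "style"]):
        tag.decompose()
    # ATX heading style emits "#"-style headings, which LLMs parse reliably
    return md(str(soup), heading_style=ATX)

if __name__ == "__main__":
    print(page_to_markdown("https://example.com"))
```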
Web Scraper Agents for Public Data
Public data is data that is freely available on the web, such as news articles, blog posts, and product descriptions. It can be scraped and used for various purposes without special access steps such as bypassing CAPTCHAs or logging in.
Some APIs that can be used to convert raw HTML data into structured markdown include:
Firecrawl
Firecrawl turns entire websites into clean, LLM-ready markdown or structured data, letting you scrape, crawl, and extract the web with a single API.
- Output: Good quality markdown with most hyperlinks preserved
- Rate limit: 1,000 requests per minute
- Cost: $0.06 per 100 pages
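As a hedged sketch, here is how a Firecrawl call might look from Python over its REST API. The v1 `/scrape` endpoint, request body, and response shape follow Firecrawl's public docs, but verify against the current documentation before relying on them:

```python
import requests

FIRECRAWL_API_KEY = "fc-..."  # placeholder; substitute your own key

def firecrawl_markdown(url: str) -> str:
    """Scrape a single page with Firecrawl and return LLM-ready markdown."""
    response = requests.post(
        "https://api.firecrawl.dev/v1/scrape",
        headers={"Authorization": f"Bearer {FIRECRAWL_API_KEY}"},
        json={"url": url, "formats": ["markdown"]},
        timeout=60,
    )
    response.raise_for_status()
    # The markdown rendering lives under the "data" key of the response
    return response.json()["data"]["markdown"]
```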
Jina
Jina turns a website into structured, LLM-friendly text when you add r.jina.ai in front of the URL.
- Output: Focuses primarily on extracting content rather than preserving hyperlinks
- Rate limit: 1,000 requests per minute
- Cost: Free
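Because Jina's reader works by URL prefixing alone, using it is essentially a one-liner; a minimal sketch:

```python
import requests

def jina_read(url: str) -> str:
    """Fetch an LLM-friendly plain-text rendering of a page via Jina."""
    # Prepending r.jina.ai/ to any URL returns the page as readable text
    response = requests.get(f"https://r.jina.ai/{url}", timeout=60)
    response.raise_for_status()
    return response.text
```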
Spider Cloud
Spider is a leading web crawling tool designed for speed and cost-effectiveness, supporting various data formats including LLM-ready markdown.
- Output: Happy medium between Firecrawl and Jina with good quality markdown
- Rate limit: 50,000 requests per minute
- Cost: $0.03 per 100 pages
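A sketch of a Spider call is below; note that the endpoint, the `return_format` parameter, and the response shape are assumptions based on Spider's public docs, so treat this as illustrative rather than authoritative:

```python
import requests

SPIDER_API_KEY = "sk-..."  # placeholder; substitute your own key

def spider_markdown(url: str) -> str:
    """Crawl one page with Spider and return its markdown rendering."""
    response = requests.post(
        "https://api.spider.cloud/crawl",  # assumed endpoint
        headers={"Authorization": f"Bearer {SPIDER_API_KEY}"},
        # limit=1 restricts the crawl to the seed page only (assumed parameter)
        json={"url": url, "return_format": "markdown", "limit": 1},
        timeout=60,
    )
    response.raise_for_status()
    # Assumed response shape: a list of page objects with a "content" field
    return response.json()[0]["content"]
```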
Web Scraper Agents for Private Data
As mentioned earlier, web automation frameworks like Selenium, Puppeteer, or Playwright are used to scrape private data, where interaction is required to reach restricted areas of a website. These tools can now be used to build agentic web scraping systems that understand and reason about the data they collect. The challenge, however, is determining which UI elements to interact with to reach those restricted areas, especially when a site's markup changes. This is where AgentQL comes in.
AgentQL
AgentQL allows web automation frameworks to accurately navigate websites, even when the website structure changes.
- Rate limit: 10 API calls per minute
- Cost: $0.02 per API call
Using AgentQL in conjunction with web automation frameworks enables developers to build agentic web scraping systems that can access and reason about private data, making the process more efficient and reliable.
Some examples of actions we're able to perform with AgentQL alongside Playwright or Selenium include (see the sketch after this list):
- Save and load authenticated state
- Wait for a page to load
- Close a cookie dialog
- Close popup windows
- Compare product prices across multiple websites
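As a sketch of the pattern, the snippet below pairs AgentQL's Python SDK with Playwright to close a cookie dialog by describing it semantically rather than via a CSS selector. The `wrap`/`query_elements` usage follows AgentQL's documented API, but the query and URL are illustrative:

```python
import agentql
from playwright.sync_api import sync_playwright

# An AgentQL query describes elements by meaning, not by brittle selectors
COOKIE_QUERY = """
{
    cookie_dialog {
        accept_button
    }
}
"""

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    # wrap() adds AgentQL's query methods to a regular Playwright page
    page = agentql.wrap(browser.new_page())
    page.goto("https://example.com")  # illustrative URL

    # Because elements are located semantically, this keeps working
    # even after the site's markup changes
    elements = page.query_elements(COOKIE_QUERY)
    if elements.cookie_dialog and elements.cookie_dialog.accept_button:
        elements.cookie_dialog.accept_button.click()

    browser.close()
```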
Conclusion
Agentic web scraping is transforming the way data is collected and analyzed, enabling developers to build systems that can understand and reason about the data they collect. By structuring data in a human-readable format like markdown and using tools like LLM Scraper Agents, Reasoning Agents, and AgentQL, developers can create efficient and reliable web scraping systems that can access both public and private data. This new approach to web scraping is revolutionizing the field of data-driven decision-making and opening up new possibilities for data analysis and insights.