Skip to content

Programming

Building an Agentic Web Scraping Pipeline for Crypto and Meme Coins

How to Build an Agentic Web Scraping Pipeline for Crypto and Meme Coins

Agentic web scraping revolutionizes data collection by leveraging advanced scraping tools and LLM-based reasoning to analyze websites for actionable insights. This guide demonstrates how to build a closed-loop pipeline for analyzing popular crypto and meme coin websites to enhance trading strategies.


Websites to Scrape

The following websites will serve as data inputs for the pipeline:

  1. Movement Market
    Facilitates buying and selling meme coins with email and credit card integration.

  2. Raydium
    A decentralized exchange for trading tokens and coins.

  3. Jupiter
    A platform for seamless token swaps.

  4. Rugcheck
    A tool for evaluating meme coins and identifying scams.

  5. Photon Sol
    A browser-based solution for trading low-cap coins.

  6. Cielo Finance
    Offers a copy-trading platform to follow top-performing wallets.


Step 1: Structuring Data for Public Websites

For effective analysis, raw HTML data from these websites must be structured into human-readable Markdown.

Tool: Firecrawl

Use Firecrawl to scrape and format the websites:

Example: Scraping Movement Market

import requests

FIRECRAWL_API = "https://api.firecrawl.com/v1/scrape"
API_KEY = "your_firecrawl_api_key"

def scrape_with_firecrawl(url):
    headers = {"Authorization": f"Bearer {API_KEY}"}
    data = {"url": url, "output": "markdown"}
    response = requests.post(FIRECRAWL_API, json=data, headers=headers)

    if response.status_code == 200:
        return response.json().get("markdown")
    else:
        print(f"Error: {response.status_code} - {response.text}")
        return None

markdown_data = scrape_with_firecrawl("https://movement.market/")
print(markdown_data)

Repeat the process for all listed websites to create structured Markdown files.


Step 2: Analyze Public Data with Reasoning Agents

Once the data is structured, LLMs can be used to analyze trends, extract features, and provide actionable insights.

Example: Analyzing Data with OpenAI API
import openai

openai.api_key = "your_openai_api_key"

def analyze_markdown(markdown_data):
    response = openai.Completion.create(
        model="text-davinci-003",
        prompt=f"Analyze this Markdown data to identify trading opportunities and community sentiment:\n\n{markdown_data}",
        max_tokens=1000
    )
    return response.choices[0].text.strip()

markdown_example = "# Example Markdown\nThis is an example of markdown content for analysis."
analysis = analyze_markdown(markdown_example)
print(analysis)

Step 3: Scraping Private Data with Web Automation

For websites requiring interaction (e.g., logins or dynamic content), use Python's Playwright library with AgentQL for advanced navigation and data extraction.

Example: Scraping Photon Sol with Playwright and AgentQL

Install Playwright and AgentQL:

pip install playwright
playwright install

Write the Python Script:

from playwright.sync_api import sync_playwright

def scrape_photon_sol():
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()

        # Navigate to Photon Sol
        page.goto("https://photon-sol.tinyastro.io/")

        # Simulate interactions if needed
        page.wait_for_timeout(3000)  # Wait for the page to load completely
        content = page.content()

        print(content)  # Print or save the page content
        browser.close()

scrape_photon_sol()

This approach ensures data can be extracted even from dynamic websites.


Step 4: Automating the Pipeline

Use Python-based automation tools like Apache Airflow to schedule and run the scraping and analysis pipeline.

Example: Airflow Configuration for the Pipeline
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from datetime import datetime

def scrape():
    # Add scraping logic for all websites here
    print("Scraping data...")

def analyze():
    # Add analysis logic here
    print("Analyzing data...")

with DAG('crypto_pipeline', start_date=datetime(2024, 11, 25), schedule_interval='@daily') as dag:
    scrape_task = PythonOperator(task_id='scrape', python_callable=scrape)
    analyze_task = PythonOperator(task_id='analyze', python_callable=analyze)

    scrape_task >> analyze_task

Insights from Websites

Here's what you can focus on while analyzing the scraped data:

  1. Movement Market: Review ease of use, transaction speed, and user feedback.
  2. Raydium: Analyze liquidity and trading fees for tokens.
  3. Jupiter: Evaluate swap rates and platform efficiency.
  4. Rugcheck: Identify red flags in meme coin projects to avoid scams.
  5. Photon Sol: Assess platform usability for low-cap token trading.
  6. Cielo Finance: Analyze wallet strategies and portfolio performance.

Step 5: Closing the Loop

To maintain a closed-loop pipeline, configure the workflow to automatically re-scrape websites at regular intervals and update analyses with new data. This ensures decisions are based on the latest information.


Conclusion

By integrating structured scraping, advanced analysis, and automation, this agentic pipeline enables real-time insights into the crypto and meme coin ecosystem. Use the steps outlined above to stay ahead in the volatile world of meme coins while minimizing risks and maximizing returns. 🚀

Agentic Web Scraping in 2024

Web scraping best practices have evolved significantly in the past couple of years, with the rise of agentic web scraping marking a new era in data collection and analysis. In this post, we'll explore the concept of agentic web scraping, its benefits, and how it is transforming the landscape of data-driven decision-making.

Evolution of Web Scraping

Typically, web scraping involved extracting data from websites by mimiking browser behaviour through HTTP requests and web automation frameworks like Selenium, Puppeteer, or Playwright. This process required developers to write specific code for each website, making it time-consuming, error-prone, and susceptible to changes in website structure. So much so that 50% to 70% of engineering resources in data aggregation teams were spent on scraping stystems early on. However, with the advent of agentic web scraping, this approach has been revolutionized. LLMs are able to make sense of any data thrown at them, allowing them to understand large amounts of raw HTML and make decisions based on it.

This comes with a drawback, however. The more unstructured data you throw at an LLM, the more likely it is to make mistakes and the more tokens are consumed. This is why it's important to have as close to structured, human-readable data as possible.

Structuring Data for Agentic Web Scraping

In order to be able to use LLM Scraper Agents and Reasoning Agents, we need to convert raw HTML data into a more structured format. Markdown is a great choice for this, as it is human-readable and easily parsed by LLMs. After converting scraped data into structured markdown, we can feed it into LLM Scraper Agents and Reasoning Agents to make sense of it and extract insights.

Web Scraper Agents for Public Data

Public data is data that is freely available on the web, such as news articles, blog posts, and product descriptions. This data can be scraped and used for various purposes and does not require any special permissions such as bypassing CAPTCHAs or logging in.

Some APIs that can be used to convert raw HTML data into structured markdown include:

Firecrawl

Firecrawl turns entire websites into clean, LLM-ready markdown or structured data. Scrape, crawl and extract the web with a single API

Output: Good quality markdown with most hyperlinks preserved

Rate limit: 1000 requests per minute

Cost: $0.06 per 100 pages

Jina

Turn a website into a structured data by adding r.jina.ai in front of the URL.

Output: Focuses primarily on extracting content rather than preserving hyperlinks

Rate limit: 1000 requests per minute

Cost: Free

Spider Cloud

Spider is a leading web crawling tool designed for speed and cost-effectiveness, supporting various data formats including LLM-ready markdown.

Output: Happy medium between Firecrawl and Jina with good quality markdown

Rate limit: 50000 requests per minute

Cost: $0.03 per 100 pages

Web Scraper Agents for Private Data

As mentioned earlier, web automation frameworks like Selenium, Puppeteer, or Playwright are used to scrape private data that requires interaction to access restricted areas of a website. These tools can now be used to build agentic web scraping systems that can understand and reason about the data they collect. However, the issue with these tools is determining which UI elements to interact with to access the abovementioned restricted areas of a site. This is where AgentQL comes in.

AgentQL

AgentQL allows web automation frameworks to accurately navigate websites, even when the website structure changes.

Rate limit: 10 API calls per minute

Cost: $0.02 per API call

Using AgentQL in conjunction with web automation frameworks enables developers to build agentic web scraping systems that can access and reason about private data, making the process more efficient and reliable.

How AgentQL Works

Some examples of actions we're able to perform with AgentQL along with Playwright or Selenium include:

  • Save and load authenticated state
  • Wait for a page to load
  • Close a cookie dialog
  • Close popup windows
  • Compare product prices across multiple websites

Conclusion

Agentic web scraping is transforming the way data is collected and analyzed, enabling developers to build systems that can understand and reason about the data they collect. By structuring data in a human-readable format like markdown and using tools like LLM Scraper Agents, Reasoning Agents, and AgentQL, developers can create efficient and reliable web scraping systems that can access both public and private data. This new approach to web scraping is revolutionizing the field of data-driven decision-making and opening up new possibilities for data analysis and insights.

Transferring Script Files to Local System or VPS

Transferring Script Files to Local System or VPS

This guide explains the process of transferring a Python script for a Facebook Marketplace Scraper and setting it up on either a local system or a VPS. This scraper helps you collect and manage data from online listings efficiently.

Features of the Facebook Marketplace Scraper

  • Data Storage: Uses SQLite for local storage and integration with Google Sheets for cloud-based storage.
  • Notifications: Optional Telegram Bot integration for updates.
  • Proxy Support: Includes compatibility with services like Smartproxy to manage requests.

Local System Setup Process (Windows)

This section outlines the steps to set up the scraper on your local machine.

Prerequisites

Before proceeding, ensure you have:

  • Python 3.6 or higher installed.
  • Access to Google Cloud with credentials for Google Sheets API.
  • An SQLite-supported system.
  • A Telegram bot token (optional).
  • Dependencies listed in the requirements.txt.

Setup Steps

Step 1: Obtain Script Files
  • Download the script files (typically a ZIP archive) and extract them.
  • Ensure the following files are present:
  • fb_parser.py: The main script.
  • requirements.txt: Python dependencies.
Step 2: Install Dependencies

Open a terminal, navigate to the script folder, and run:

pip install -r requirements.txt
Step 3: Configure Google Sheets API
  1. Create a Google Cloud project and enable the Sheets API.
  2. Download the credentials.json file and place it in the script folder.
Step 4: Initialize the Database

Run the following command to create the SQLite database:

python fb_parser.py --initdb
Step 5: Configure Telegram Notifications (Optional)

Edit fb_parser.py and add your bot_token and bot_chat_id.

Step 6: Run the Scraper

Start the scraper with:

python fb_parser.py
Step 7: Automation (Optional)

Use Task Scheduler to automate script execution.


VPS Setup Process

VPS Requirements

  • VPS with SSH access and Python 3.6+ installed.
  • Linux OS (Ubuntu or CentOS preferred).
  • Necessary script files and dependencies.

Setup Steps

Step 1: Log in to VPS

Access your VPS via SSH:

ssh username@hostname
Step 2: Transfer Script Files

Upload files using SCP or SFTP:

scp fb_parser.py requirements.txt username@hostname:/path/to/directory
Step 3: Install Python and Dependencies

Update your system and install Python dependencies:

sudo apt update
sudo apt install python3-pip
pip3 install -r requirements.txt
Step 4: Configure Credentials

Follow the same steps as the local setup to configure Google Sheets and Telegram credentials.

Step 5: Run the Scraper

Navigate to the script directory and execute:

python3 fb_parser.py
Step 6: Automate with Cron

Use cron to schedule periodic script execution:

crontab -e
# Add the line below to run daily at midnight
0 0 * * * python3 /path/to/fb_parser.py

Conclusion

By following this guide, you can effectively transfer and set up the Facebook Marketplace Scraper on your local system or VPS. This tool simplifies the process of collecting and managing online listing data.


References

Transferring Files Between WSL and Windows

Transferring Files Between WSL and Windows

This guide provides a step-by-step approach to transferring files between Windows Subsystem for Linux (WSL) and Windows using tools like SCP (Secure Copy). It includes commands for file management and efficient navigation in both environments.


Logging into Zomro VPS using WSL in Ubuntu CLI

To access your Zomro VPS using WSL’s Ubuntu terminal:

  1. Open the Ubuntu terminal via the Windows Start menu.
  2. Use the ssh command to connect to your VPS:

ssh your_username@your_server_ip

Replace your_username with your VPS username and your_server_ip with the server’s IP address.

  1. Enter your VPS password when prompted. After logging in, you can manage your VPS from the Ubuntu CLI.

Locating File Paths in Ubuntu CLI

Navigating and identifying file paths in Ubuntu is essential for transferring files. Use these commands for efficient file management:

1. Present Working Directory (pwd)

Displays the absolute path of the current directory:

pwd

2. List Directory Contents (ls)

Shows files and directories in the current location:

ls

3. Find a File (find)

Searches for a file in the system:

find / -name example.txt

4. Change Directory (cd)

Navigates through directories:

cd /path/to/directory

5. Access Windows Files

WSL allows access to Windows files via /mnt. For example:

cd /mnt/c/Users/YourUsername/Desktop


File Transfer Methods Using SCP

Using SCP (Secure Copy)

The scp command securely copies files between WSL and Windows.

Syntax
scp username@source:/path/to/source/file /path/to/destination/
Example: Copy Files from WSL to Windows

To copy screenshots from a remote VPS to your Windows Desktop:

scp root@45.88.107.136:/root/zomro-selenium-base/screenshots/* "/mnt/c/Users/Harminder Nijjar/Desktop/"

This command transfers all files from the VPS directory to the specified Windows Desktop folder.


Conclusion

Transferring files between WSL and Windows is simple and efficient using commands like scp. Mastering these techniques will streamline your workflow and enhance your productivity across WSL and Windows environments.