Programming¶

January 15, 2025
in Programming
2 min read

Leveraging Selenium with Undetected-Chromedriver for Cloudflare Mitigation

Leveraging Selenium with Undetected-Chromedriver for CAPTCHA and Cloudflare Mitigation

By combining Selenium with undetected-chromedriver (UC), you can overcome common automation challenges like Cloudflare's browser verification. This guide explores practical workflows and techniques to enhance your web automation projects.

Why Use Selenium with Undetected-Chromedriver?

Cloudflare protections are designed to block bots, posing challenges for developers. By using undetected-chromedriver with Selenium, you can:

Bypass Browser Fingerprinting: UC modifies ChromeDriver to avoid detection.
Handle Cloudflare Challenges: Seamlessly bypass "wait while your browser is verified" messages.
Mitigate CAPTCHA Issues: Reduce interruptions caused by automated bot checks.

Detection Challenges in Web Automation

Websites employ multiple strategies to detect and prevent automated interactions:

CAPTCHA Challenges: Validating user authenticity.
Cloudflare Browser Verification: Infinite loading screens or token-based checks.
Bot Detection Mechanisms: Browser fingerprinting, behavioral analytics, and cookie validation.

These barriers often require advanced techniques to maintain automation workflows.

The Solution: Selenium and Undetected-Chromedriver

The undetected-chromedriver library modifies the default ChromeDriver to emulate human-like behavior and evade detection. When integrated with Selenium, it allows:

Seamless CAPTCHA Bypass: Minimize interruptions by automating responses or avoiding challenges.
Cloudflare Token Handling: Automatically manage verification processes.
Cookie Reuse for Session Preservation: Skip repetitive verifications by reusing authenticated cookies.

Implementation Guide: Setting Up Selenium with Undetected-Chromedriver

Step 1: Install Required Libraries

Install Selenium and undetected-chromedriver:

pip install selenium undetected-chromedriver

Step 2: Initialize the Browser Driver

Set up a Selenium session with UC:

import undetected_chromedriver.v2 as uc

# Initialize the driver
driver = uc.Chrome()

# Navigate to a website
driver.get("https://example.com")
print("Page Title:", driver.title)

# Quit the driver
driver.quit()

Step 3: Handle CAPTCHA and Cloudflare Challenges

Use UC to bypass passive bot checks.

Extract and reuse cookies to maintain session continuity:

cookies = driver.get_cookies()
driver.add_cookie(cookies)

Advanced Automation Workflow with Cookies

Step 1: Attempt Standard Automation

Use Selenium with UC to navigate and interact with the website.

Step 2: Use Cookies for Session Continuity

Manually authenticate once, extract cookies, and reuse them for automated sessions:

# Save cookies after manual login
cookies = driver.get_cookies()

# Use cookies in future sessions
for cookie in cookies:
    driver.add_cookie(cookie)
driver.refresh()

Step 3: Fall Back to Manual Assistance

Prompt users to resolve CAPTCHA or login challenges in a separate session and capture the cookies for automation.

Proposed Workflow for Automation

Initial Attempt: Start with Selenium and UC for automation.
Fallback to Cookies: Reuse cookies for continuity if CAPTCHA or Cloudflare challenges arise.
Manual Assistance: Open a browser session for user input, capture cookies, and resume automation.

This iterative process ensures maximum efficiency and minimizes disruptions.

Conclusion

Selenium and undetected-chromedriver provide a powerful toolkit for overcoming automation barriers like CAPTCHA and Cloudflare protections. By leveraging cookies and manual fallbacks, you can create robust workflows that streamline automation processes.

Ready to enhance your web automation? Start integrating Selenium with UC today and unlock new possibilities!

References

January 5, 2025
in Programming
3 min read

Setting Up Venom for WhatsApp Translation

Automating WhatsApp messaging can be a powerful tool for customer service, personal projects, or language translation. Using Venom and Google Translate, this guide will show you how to build a script that translates incoming Spanish messages to English and replies in Spanish.

Why Use Venom?

Venom is a robust Node.js library that allows you to interact with WhatsApp Web. It’s perfect for creating bots, automating tasks, or building translation systems like the one we’ll create here.

Prerequisites

Before diving in, ensure you have the following installed:

Node.js: Install from Node.js Official Website.
npm or yarn: Installed alongside Node.js.
Google Translate Library: For text translation.
Venom: For WhatsApp automation.

Install Required Packages

Run the following commands to install the required libraries:

npm install venom-bot translate-google crypto

Implementation

Here’s how to set up and use Venom to translate WhatsApp messages:

1. Initialize the Project

Create a new file named whatsapp_translator.js and start with the following boilerplate:

const venom = require('venom-bot');
const translate = require('translate-google');
const crypto = require('crypto');

2. Set Up Your WhatsApp Contacts

Define your own WhatsApp ID (for self-messages) and the target contact:

const MY_CONTACT_ID = '12345678900@c.us'; // Your number
const TARGET_CONTACT_ID = '01234567890@c.us'; // Target contact's number

3. Implement the Translation Logic

Here’s the full script for translating messages and avoiding duplicates using a hash set:

// Hash sets to prevent duplicate message processing
const processedMessageHashes = new Set();

venom
  .create({
    session: 'my-whatsapp-session',
    multidevice: true,
  })
  .then((client) => start(client))
  .catch((err) => console.error('Error starting Venom:', err));

function start(client) {
  console.log(`Listening for messages between yourself (${MY_CONTACT_ID}) and ${TARGET_CONTACT_ID}.`);

  const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

  // Function to generate a hash for deduplication
  function generateHash(messageBody) {
    return crypto.createHash('sha256').update(messageBody).digest('hex');
  }

  // Periodically check for new messages in the self-chat
  setInterval(async () => {
    try {
      const messages = await client.getAllMessagesInChat(MY_CONTACT_ID, true, true);
      for (const message of messages) {
        processMessage(client, message, generateHash);
      }
    } catch (err) {
      console.error('Error retrieving self-chat messages:', err);
    }
  }, 2000); // Check every 2 seconds

  // Handle incoming messages
  client.onMessage((message) => processMessage(client, message, generateHash));
}

async function processMessage(client, message, generateHash) {
  const messageHash = generateHash(message.body);

  // Skip if the message has already been processed
  if (processedMessageHashes.has(messageHash)) {
    return;
  }

  // Mark the message as processed
  processedMessageHashes.add(messageHash);

  try {
    if (message.from === MY_CONTACT_ID && message.to === MY_CONTACT_ID) {
      console.log('Message is from you (self-chat).');

      // Translate English to Spanish and send to the target contact
      const translatedToSpanish = await translate(message.body, { to: 'es' });
      console.log(`Translated (English → Spanish): ${translatedToSpanish}`);

      await client.sendText(TARGET_CONTACT_ID, translatedToSpanish);
      console.log(`Sent translated message to ${TARGET_CONTACT_ID}: ${translatedToSpanish}`);
    } else if (message.from === TARGET_CONTACT_ID && !message.isGroupMsg) {
      console.log('Message is from the target contact.');

      // Translate Spanish to English and send to the self-chat
      const translatedToEnglish = await translate(message.body, { to: 'en' });
      console.log(`Translated (Spanish → English): ${translatedToEnglish}`);

      const response = `*Translation (Spanish → English):*\nOriginal: ${message.body}\nTranslated: ${translatedToEnglish}`;
      await client.sendText(MY_CONTACT_ID, response);
      console.log(`Posted translation to yourself: ${MY_CONTACT_ID}`);
    }
  } catch (error) {
    console.error('Error processing message:', error);
    // Remove the hash if processing fails
    processedMessageHashes.delete(messageHash);
  }
}

4. Run the Script

Execute the script using Node.js:

node whatsapp_translator.js

5. What Happens?

Messages you send to yourself (in English) are translated to Spanish and sent to the target contact.
Messages from the target contact (in Spanish) are translated to English and sent to your self-chat.

Debugging Tips

Verify Contact IDs: Ensure MY_CONTACT_ID and TARGET_CONTACT_ID are correctly defined.
Check Logs: Use console.log statements to debug the flow of messages.
Dependency Issues: Reinstall packages with npm install if you encounter errors.

Conclusion

This script automates translation for WhatsApp messages, enabling seamless communication across languages. By leveraging Venom and Google Translate, you can extend this setup to support additional languages, integrate with databases, or even build advanced customer service tools. With this foundation, the possibilities are endless!

December 15, 2024
in Programming
3 min read

Building an Agentic Web Scraping Pipeline for Crypto and Meme Coins

How to Build an Agentic Web Scraping Pipeline for Crypto and Meme Coins

Agentic web scraping revolutionizes data collection by leveraging advanced scraping tools and LLM-based reasoning to analyze websites for actionable insights. This guide demonstrates how to build a closed-loop pipeline for analyzing popular crypto and meme coin websites to enhance trading strategies.

Websites to Scrape

The following websites will serve as data inputs for the pipeline:

Movement Market
Facilitates buying and selling meme coins with email and credit card integration.
Raydium
A decentralized exchange for trading tokens and coins.
Jupiter
A platform for seamless token swaps.
Rugcheck
A tool for evaluating meme coins and identifying scams.
Photon Sol
A browser-based solution for trading low-cap coins.
Cielo Finance
Offers a copy-trading platform to follow top-performing wallets.

Step 1: Structuring Data for Public Websites

For effective analysis, raw HTML data from these websites must be structured into human-readable Markdown.

Tool: Firecrawl

Use Firecrawl to scrape and format the websites:

Example: Scraping Movement Market

import requests

FIRECRAWL_API = "https://api.firecrawl.com/v1/scrape"
API_KEY = "your_firecrawl_api_key"

def scrape_with_firecrawl(url):
    headers = {"Authorization": f"Bearer {API_KEY}"}
    data = {"url": url, "output": "markdown"}
    response = requests.post(FIRECRAWL_API, json=data, headers=headers)

    if response.status_code == 200:
        return response.json().get("markdown")
    else:
        print(f"Error: {response.status_code} - {response.text}")
        return None

markdown_data = scrape_with_firecrawl("https://movement.market/")
print(markdown_data)

Repeat the process for all listed websites to create structured Markdown files.

Step 2: Analyze Public Data with Reasoning Agents

Once the data is structured, LLMs can be used to analyze trends, extract features, and provide actionable insights.

Example: Analyzing Data with OpenAI API

import openai

openai.api_key = "your_openai_api_key"

def analyze_markdown(markdown_data):
    response = openai.Completion.create(
        model="text-davinci-003",
        prompt=f"Analyze this Markdown data to identify trading opportunities and community sentiment:\n\n{markdown_data}",
        max_tokens=1000
    )
    return response.choices[0].text.strip()

markdown_example = "# Example Markdown\nThis is an example of markdown content for analysis."
analysis = analyze_markdown(markdown_example)
print(analysis)

Step 3: Scraping Private Data with Web Automation

For websites requiring interaction (e.g., logins or dynamic content), use Python's Playwright library with AgentQL for advanced navigation and data extraction.

Example: Scraping Photon Sol with Playwright and AgentQL

Install Playwright and AgentQL:

pip install playwright
playwright install

Write the Python Script:

from playwright.sync_api import sync_playwright

def scrape_photon_sol():
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()

        # Navigate to Photon Sol
        page.goto("https://photon-sol.tinyastro.io/")

        # Simulate interactions if needed
        page.wait_for_timeout(3000)  # Wait for the page to load completely
        content = page.content()

        print(content)  # Print or save the page content
        browser.close()

scrape_photon_sol()

This approach ensures data can be extracted even from dynamic websites.

Step 4: Automating the Pipeline

Use Python-based automation tools like Apache Airflow to schedule and run the scraping and analysis pipeline.

Example: Airflow Configuration for the Pipeline

from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from datetime import datetime

def scrape():
    # Add scraping logic for all websites here
    print("Scraping data...")

def analyze():
    # Add analysis logic here
    print("Analyzing data...")

with DAG('crypto_pipeline', start_date=datetime(2024, 11, 25), schedule_interval='@daily') as dag:
    scrape_task = PythonOperator(task_id='scrape', python_callable=scrape)
    analyze_task = PythonOperator(task_id='analyze', python_callable=analyze)

    scrape_task >> analyze_task

Insights from Websites

Here's what you can focus on while analyzing the scraped data:

Movement Market: Review ease of use, transaction speed, and user feedback.
Raydium: Analyze liquidity and trading fees for tokens.
Jupiter: Evaluate swap rates and platform efficiency.
Rugcheck: Identify red flags in meme coin projects to avoid scams.
Photon Sol: Assess platform usability for low-cap token trading.
Cielo Finance: Analyze wallet strategies and portfolio performance.

Step 5: Closing the Loop

To maintain a closed-loop pipeline, configure the workflow to automatically re-scrape websites at regular intervals and update analyses with new data. This ensures decisions are based on the latest information.

Conclusion

By integrating structured scraping, advanced analysis, and automation, this agentic pipeline enables real-time insights into the crypto and meme coin ecosystem. Use the steps outlined above to stay ahead in the volatile world of meme coins while minimizing risks and maximizing returns. 🚀

November 25, 2024
in Programming
4 min read

Agentic Web Scraping in 2024

Web scraping best practices have evolved significantly in the past couple of years, with the rise of agentic web scraping marking a new era in data collection and analysis. In this post, we'll explore the concept of agentic web scraping, its benefits, and how it is transforming the landscape of data-driven decision-making.

Evolution of Web Scraping

Typically, web scraping involved extracting data from websites by mimiking browser behaviour through HTTP requests and web automation frameworks like Selenium, Puppeteer, or Playwright. This process required developers to write specific code for each website, making it time-consuming, error-prone, and susceptible to changes in website structure. So much so that 50% to 70% of engineering resources in data aggregation teams were spent on scraping stystems early on. However, with the advent of agentic web scraping, this approach has been revolutionized. LLMs are able to make sense of any data thrown at them, allowing them to understand large amounts of raw HTML and make decisions based on it.

This comes with a drawback, however. The more unstructured data you throw at an LLM, the more likely it is to make mistakes and the more tokens are consumed. This is why it's important to have as close to structured, human-readable data as possible.

Structuring Data for Agentic Web Scraping

In order to be able to use LLM Scraper Agents and Reasoning Agents, we need to convert raw HTML data into a more structured format. Markdown is a great choice for this, as it is human-readable and easily parsed by LLMs. After converting scraped data into structured markdown, we can feed it into LLM Scraper Agents and Reasoning Agents to make sense of it and extract insights.

Web Scraper Agents for Public Data

Public data is data that is freely available on the web, such as news articles, blog posts, and product descriptions. This data can be scraped and used for various purposes and does not require any special permissions such as bypassing CAPTCHAs or logging in.

Some APIs that can be used to convert raw HTML data into structured markdown include:

Firecrawl

Firecrawl turns entire websites into clean, LLM-ready markdown or structured data. Scrape, crawl and extract the web with a single API

Output: Good quality markdown with most hyperlinks preserved

Rate limit: 1000 requests per minute

Cost: $0.06 per 100 pages

Jina

Turn a website into a structured data by adding r.jina.ai in front of the URL.

Output: Focuses primarily on extracting content rather than preserving hyperlinks

Rate limit: 1000 requests per minute

Cost: Free

Spider Cloud

Spider is a leading web crawling tool designed for speed and cost-effectiveness, supporting various data formats including LLM-ready markdown.

Output: Happy medium between Firecrawl and Jina with good quality markdown

Rate limit: 50000 requests per minute

Cost: $0.03 per 100 pages

Web Scraper Agents for Private Data

As mentioned earlier, web automation frameworks like Selenium, Puppeteer, or Playwright are used to scrape private data that requires interaction to access restricted areas of a website. These tools can now be used to build agentic web scraping systems that can understand and reason about the data they collect. However, the issue with these tools is determining which UI elements to interact with to access the abovementioned restricted areas of a site. This is where AgentQL comes in.

AgentQL

AgentQL allows web automation frameworks to accurately navigate websites, even when the website structure changes.

Rate limit: 10 API calls per minute

Cost: $0.02 per API call

Using AgentQL in conjunction with web automation frameworks enables developers to build agentic web scraping systems that can access and reason about private data, making the process more efficient and reliable.

How AgentQL Works

Some examples of actions we're able to perform with AgentQL along with Playwright or Selenium include:

Save and load authenticated state
Wait for a page to load
Close a cookie dialog
Close popup windows
Compare product prices across multiple websites

Conclusion

Agentic web scraping is transforming the way data is collected and analyzed, enabling developers to build systems that can understand and reason about the data they collect. By structuring data in a human-readable format like markdown and using tools like LLM Scraper Agents, Reasoning Agents, and AgentQL, developers can create efficient and reliable web scraping systems that can access both public and private data. This new approach to web scraping is revolutionizing the field of data-driven decision-making and opening up new possibilities for data analysis and insights.

November 27, 2023
in Programming
2 min read

Transferring Script Files to Local System or VPS

This guide explains the process of transferring a Python script for a Facebook Marketplace Scraper and setting it up on either a local system or a VPS. This scraper helps you collect and manage data from online listings efficiently.

Features of the Facebook Marketplace Scraper

Data Storage: Uses SQLite for local storage and integration with Google Sheets for cloud-based storage.
Notifications: Optional Telegram Bot integration for updates.
Proxy Support: Includes compatibility with services like Smartproxy to manage requests.

Local System Setup Process (Windows)

This section outlines the steps to set up the scraper on your local machine.

Prerequisites

Before proceeding, ensure you have:

Python 3.6 or higher installed.
Access to Google Cloud with credentials for Google Sheets API.
An SQLite-supported system.
A Telegram bot token (optional).
Dependencies listed in the requirements.txt.

Setup Steps

Step 1: Obtain Script Files

Download the script files (typically a ZIP archive) and extract them.
Ensure the following files are present:
fb_parser.py: The main script.
requirements.txt: Python dependencies.

Step 2: Install Dependencies

Open a terminal, navigate to the script folder, and run:

pip install -r requirements.txt

Step 3: Configure Google Sheets API

Create a Google Cloud project and enable the Sheets API.
Download the credentials.json file and place it in the script folder.

Step 4: Initialize the Database

Run the following command to create the SQLite database:

python fb_parser.py --initdb

Step 5: Configure Telegram Notifications (Optional)

Edit fb_parser.py and add your bot_token and bot_chat_id.

Step 6: Run the Scraper

Start the scraper with:

python fb_parser.py

Step 7: Automation (Optional)

Use Task Scheduler to automate script execution.

VPS Setup Process

VPS Requirements

VPS with SSH access and Python 3.6+ installed.
Linux OS (Ubuntu or CentOS preferred).
Necessary script files and dependencies.

Setup Steps

Step 1: Log in to VPS

Access your VPS via SSH:

ssh username@hostname

Step 2: Transfer Script Files

Upload files using SCP or SFTP:

scp fb_parser.py requirements.txt username@hostname:/path/to/directory

Step 3: Install Python and Dependencies

Update your system and install Python dependencies:

sudo apt update
sudo apt install python3-pip
pip3 install -r requirements.txt

Step 4: Configure Credentials

Follow the same steps as the local setup to configure Google Sheets and Telegram credentials.

Step 5: Run the Scraper

Navigate to the script directory and execute:

python3 fb_parser.py

Step 6: Automate with Cron

Use cron to schedule periodic script execution:

crontab -e
# Add the line below to run daily at midnight
0 0 * * * python3 /path/to/fb_parser.py

Conclusion

By following this guide, you can effectively transfer and set up the Facebook Marketplace Scraper on your local system or VPS. This tool simplifies the process of collecting and managing online listing data.

References

November 21, 2023
in Programming
2 min read

Transferring Files Between WSL and Windows

This guide provides a step-by-step approach to transferring files between Windows Subsystem for Linux (WSL) and Windows using tools like SCP (Secure Copy). It includes commands for file management and efficient navigation in both environments.