Blog

Harminder Singh Nijjar's Digital Art Catalog

2024-11-25: While sitting at the dining table drinking a Celsius Peach Vibe, I decided to create a quick digital drawing of the can next to a container of JIF peanut butter. The drawing was done on my MobiScribe WAVE using the stylus that came with the device. The MobiScribe WAVE is a great tool for digital art, and I enjoy using it for quick sketches and drawings. JIF + Celsius Peach Vibe

2024-11-26 20:17: Today I drew a quick sketch of two wolf pups howling at the moon. Full Moon Pups

Agentic Web Scraping in 2024

Web scraping best practices have evolved significantly in the past couple of years, with the rise of agentic web scraping marking a new era in data collection and analysis. In this post, we'll explore the concept of agentic web scraping, its benefits, and how it is transforming the landscape of data-driven decision-making.

Evolution of Web Scraping

Traditionally, web scraping involved extracting data from websites by mimicking browser behavior through HTTP requests and web automation frameworks like Selenium, Puppeteer, or Playwright. This process required developers to write site-specific code for each website, making it time-consuming, error-prone, and fragile whenever a site's structure changed. So much so that, early on, 50% to 70% of engineering resources in data aggregation teams were spent on scraping systems. With the advent of agentic web scraping, however, this approach has been revolutionized: LLMs can make sense of nearly any data thrown at them, allowing them to interpret large amounts of raw HTML and make decisions based on it.

This comes with a drawback, however: the more unstructured data you throw at an LLM, the more likely it is to make mistakes, and the more tokens it consumes. This is why it's important to feed it data that is as structured and human-readable as possible.

Structuring Data for Agentic Web Scraping

To use LLM Scraper Agents and Reasoning Agents, we need to convert raw HTML into a more structured format. Markdown is a great choice, as it is human-readable and easily parsed by LLMs. After converting scraped data into structured markdown, we can feed it to LLM Scraper Agents and Reasoning Agents to make sense of it and extract insights.
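
As a concrete illustration, here is a minimal local sketch of that conversion using Python's requests library and the markdownify package (my choice for this example; the post doesn't prescribe a specific converter):

import requests
from markdownify import markdownify as md  # pip install markdownify requests
html = requests.get("https://example.com", timeout=30).text
markdown = md(html, strip=["script", "style"])  # drop non-content tags during conversion
print(markdown[:500])  # human-readable markdown, ready to hand to an LLM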

Web Scraper Agents for Public Data

Public data is data that is freely available on the web, such as news articles, blog posts, and product descriptions. This data can be scraped and used for various purposes without any special access steps such as bypassing CAPTCHAs or logging in.

Some APIs that can be used to convert raw HTML data into structured markdown include:

Firecrawl

Firecrawl turns entire websites into clean, LLM-ready markdown or structured data, letting you scrape, crawl, and extract the web with a single API.

Output: Good quality markdown with most hyperlinks preserved

Rate limit: 1000 requests per minute

Cost: $0.06 per 100 pages
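
As a rough sketch, a scrape request might look like the following; the v1 endpoint and response shape are assumptions on my part, so verify against Firecrawl's current documentation:

import requests
FIRECRAWL_API_KEY = "your-api-key"
resp = requests.post(
    "https://api.firecrawl.dev/v1/scrape",  # assumed v1 endpoint
    headers={"Authorization": f"Bearer {FIRECRAWL_API_KEY}"},
    json={"url": "https://example.com", "formats": ["markdown"]},
    timeout=60,
)
markdown = resp.json().get("data", {}).get("markdown", "")  # assumed response shape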

Jina

Turn a website into structured data by prepending r.jina.ai to the URL.

Output: Focuses primarily on extracting content rather than preserving hyperlinks

Rate limit: 1000 requests per minute

Cost: Free
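
Since the reader works by URL prefixing, a minimal sketch needs nothing beyond Python's requests library:

import requests
target = "https://example.com/article"
resp = requests.get("https://r.jina.ai/" + target, timeout=60)  # prefix the target URL as described above
markdown = resp.text  # content-focused markdown, per the note on hyperlinks
print(markdown[:500])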

Spider Cloud

Spider is a leading web crawling tool designed for speed and cost-effectiveness, supporting various data formats including LLM-ready markdown.

Output: Happy medium between Firecrawl and Jina with good quality markdown

Rate limit: 50000 requests per minute

Cost: $0.03 per 100 pages

Web Scraper Agents for Private Data

As mentioned earlier, web automation frameworks like Selenium, Puppeteer, and Playwright are used to scrape private data that requires interaction to access restricted areas of a website. These tools can now be used to build agentic web scraping systems that understand and reason about the data they collect. The challenge with these tools, however, is determining which UI elements to interact with to reach those restricted areas. This is where AgentQL comes in.

AgentQL

AgentQL allows web automation frameworks to accurately navigate websites, even when the website structure changes.

Rate limit: 10 API calls per minute

Cost: $0.02 per API call

Using AgentQL in conjunction with web automation frameworks enables developers to build agentic web scraping systems that can access and reason about private data, making the process more efficient and reliable.

How AgentQL Works

Some examples of actions you can perform with AgentQL alongside Playwright or Selenium include the following (a minimal sketch follows this list):

  • Save and load authenticated state
  • Wait for a page to load
  • Close a cookie dialog
  • Close popup windows
  • Compare product prices across multiple websites
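
Below is a rough sketch of the cookie-dialog case using AgentQL's Python SDK with Playwright. The wrap() and query_elements() helpers and the query field names are assumptions based on how the SDK is typically used; check AgentQL's documentation for the exact API:

import agentql  # pip install agentql playwright; assumes an AgentQL API key is configured in the environment
from playwright.sync_api import sync_playwright
COOKIE_QUERY = """
{
    cookie_dialog {
        accept_button
    }
}
"""  # hypothetical query field names describing the elements to find
with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = agentql.wrap(browser.new_page())  # assumed wrapper from AgentQL's Python SDK
    page.goto("https://example.com")
    response = page.query_elements(COOKIE_QUERY)  # AgentQL resolves the query to live elements
    if response.cookie_dialog.accept_button:
        response.cookie_dialog.accept_button.click()
    browser.close()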

Conclusion

Agentic web scraping is transforming the way data is collected and analyzed, enabling developers to build systems that can understand and reason about the data they collect. By structuring data in a human-readable format like markdown and using tools like LLM Scraper Agents, Reasoning Agents, and AgentQL, developers can create efficient and reliable web scraping systems that can access both public and private data. This new approach to web scraping is revolutionizing the field of data-driven decision-making and opening up new possibilities for data analysis and insights.

My First Blog Post

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.

Using Crosshair.AHK to Assist with Aiming on Xbox Cloud Gaming

Crosshair.AHK

I recently started playing games on Xbox Cloud Gaming on PC, and I noticed that the aim assist with reWASD wasn't as powerful as I had initially expected, so I decided to use Crosshair.AHK to compensate. Crosshair.AHK is a simple AutoHotkey script that overlays a crosshair on your screen to help you aim in games. In this post, I will show you how to use Crosshair.AHK to assist with aiming in Fortnite on Xbox Cloud Gaming.

Features of Crosshair.AHK

10 different crosshair variations, customizable colors, and fullscreen support.

Crosshair.AHK has several features that make it a great tool for improving your aim in games. Some of the key features include:

  • 10 different crosshair styles
  • Customizable crosshair colors
  • Fullscreen crosshair support

Crosshair Styles

Crosshair.AHK offers 10 different crosshair styles to choose from, allowing you to find the one that works best for you. The crosshair styles range from simple dots to more complex designs, giving you plenty of options to customize your crosshair to your liking. Crosshair styles can be easily changed by pressing the F10 key.

Customizable Crosshair Colors

Crosshair.AHK allows you to customize the color of your crosshair to suit your preferences. You can choose from a wide range of colors to find the one that stands out the most against your game's background. Crosshair colors can be easily changed by pressing the F10 key and using the color change widget to select the desired color.

Fullscreen Crosshair Support

Crosshair in fullscreen mode.

Crosshair.AHK supports fullscreen mode, allowing you to use the crosshair in games that run in fullscreen. This feature is particularly useful for games that don't have built-in crosshairs or where the crosshair is difficult to see against the game's background. To enable fullscreen mode, simply press the F11 key.

Transferring Script Files to Local System or VPS

This guide explains the process of transferring a Python script for a Facebook Marketplace Scraper and setting it up on either a local system or a VPS. This scraper helps you collect and manage data from online listings efficiently.

Features of the Facebook Marketplace Scraper

  • Data Storage: Uses SQLite for local storage and integration with Google Sheets for cloud-based storage.
  • Notifications: Optional Telegram Bot integration for updates.
  • Proxy Support: Includes compatibility with services like Smartproxy to manage requests.

Local System Setup Process (Windows)

This section outlines the steps to set up the scraper on your local machine.

Prerequisites

Before proceeding, ensure you have:

  • Python 3.6 or higher installed.
  • Access to Google Cloud with credentials for Google Sheets API.
  • An SQLite-supported system.
  • A Telegram bot token (optional).
  • Dependencies listed in requirements.txt.

Setup Steps

Step 1: Obtain Script Files
  • Download the script files (typically a ZIP archive) and extract them.
  • Ensure the following files are present:
      • fb_parser.py: The main script.
      • requirements.txt: Python dependencies.
Step 2: Install Dependencies

Open a terminal, navigate to the script folder, and run:

pip install -r requirements.txt
Step 3: Configure Google Sheets API
  1. Create a Google Cloud project and enable the Sheets API.
  2. Download the credentials.json file and place it in the script folder (see the sketch below).
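
How fb_parser.py reads these credentials isn't shown here, but as an assumption, a typical Google Sheets connection with the gspread library looks roughly like this (the spreadsheet name is hypothetical):

import gspread  # pip install gspread
gc = gspread.service_account(filename="credentials.json")  # the file downloaded in step 2
sheet = gc.open("Marketplace Listings").sheet1  # hypothetical spreadsheet name
sheet.append_row(["title", "price", "url"])  # append one scraped listing as a row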
Step 4: Initialize the Database

Run the following command to create the SQLite database:

python fb_parser.py --initdb
Step 5: Configure Telegram Notifications (Optional)

Edit fb_parser.py and add your bot_token and bot_chat_id.
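
The scraper's internals aren't reproduced here, but a Telegram notification usually reduces to a single HTTP call to the Bot API's sendMessage method; the variable names below mirror the settings mentioned above:

import requests
bot_token = "YOUR_BOT_TOKEN"
bot_chat_id = "YOUR_CHAT_ID"
requests.post(
    f"https://api.telegram.org/bot{bot_token}/sendMessage",  # standard Bot API method
    json={"chat_id": bot_chat_id, "text": "New Marketplace listing found"},
    timeout=30,
)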

Step 6: Run the Scraper

Start the scraper with:

python fb_parser.py
Step 7: Automation (Optional)

Use Task Scheduler to automate script execution.


VPS Setup Process

VPS Requirements

  • VPS with SSH access and Python 3.6+ installed.
  • Linux OS (Ubuntu or CentOS preferred).
  • Necessary script files and dependencies.

Setup Steps

Step 1: Log in to VPS

Access your VPS via SSH:

ssh username@hostname
Step 2: Transfer Script Files

Upload files using SCP or SFTP:

scp fb_parser.py requirements.txt username@hostname:/path/to/directory
Step 3: Install Python and Dependencies

Update your system and install Python dependencies:

sudo apt update
sudo apt install python3-pip
pip3 install -r requirements.txt
Step 4: Configure Credentials

Follow the same steps as the local setup to configure Google Sheets and Telegram credentials.

Step 5: Run the Scraper

Navigate to the script directory and execute:

python3 fb_parser.py
Step 6: Automate with Cron

Use cron to schedule periodic script execution:

crontab -e
# Add the line below to run daily at midnight
0 0 * * * python3 /path/to/fb_parser.py

Conclusion

By following this guide, you can effectively transfer and set up the Facebook Marketplace Scraper on your local system or VPS. This tool simplifies the process of collecting and managing online listing data.


Transferring Files Between WSL and Windows

This guide provides a step-by-step approach to transferring files between Windows Subsystem for Linux (WSL) and Windows using tools like SCP (Secure Copy). It includes commands for file management and efficient navigation in both environments.


Logging into Zomro VPS using WSL in Ubuntu CLI

To access your Zomro VPS using WSL’s Ubuntu terminal:

  1. Open the Ubuntu terminal via the Windows Start menu.
  2. Use the ssh command to connect to your VPS:

ssh your_username@your_server_ip

Replace your_username with your VPS username and your_server_ip with the server’s IP address.

  3. Enter your VPS password when prompted. After logging in, you can manage your VPS from the Ubuntu CLI.

Locating File Paths in Ubuntu CLI

Navigating and identifying file paths in Ubuntu is essential for transferring files. Use these commands for efficient file management:

1. Present Working Directory (pwd)

Displays the absolute path of the current directory:

pwd

2. List Directory Contents (ls)

Shows files and directories in the current location:

ls

3. Find a File (find)

Searches for a file in the system:

find / -name example.txt

4. Change Directory (cd)

Navigates through directories:

cd /path/to/directory

5. Access Windows Files

WSL allows access to Windows files via /mnt. For example:

cd /mnt/c/Users/YourUsername/Desktop


File Transfer Methods Using SCP

Using SCP (Secure Copy)

The scp command securely copies files between WSL and Windows.

Syntax
scp username@source:/path/to/source/file /path/to/destination/
Example: Copy Files from a Remote VPS to Windows via WSL

To copy screenshots from a remote VPS to your Windows Desktop:

scp root@45.88.107.136:/root/zomro-selenium-base/screenshots/* "/mnt/c/Users/Harminder Nijjar/Desktop/"

This command transfers all files from the VPS directory to the specified Windows Desktop folder.


Conclusion

Transferring files between WSL and Windows is simple and efficient using commands like scp. Mastering these techniques will streamline your workflow and enhance your productivity across WSL and Windows environments.