Advanced Python Scripting: Scraping and Storing Facebook Marketplace Vehicle Listings in Canada
Introduction
Web scraping is a powerful way to gather specific information from vast online sources. This post walks through a Python script I developed to scrape vehicle listings from Facebook Marketplace in Canadian cities and store them systematically in a SQL database, and shows how Python's versatility can be harnessed for complex web scraping and data management tasks.
The Concept
The script addresses a common challenge: extracting structured data from unstructured web sources like Facebook Marketplace. By automating the extraction and storage process, it supports a variety of purposes, from market research to personal data aggregation.
Technical Breakdown
Core Technologies
- Python: A versatile programming language ideal for scripting and automation.
- BeautifulSoup: A Python library for parsing HTML and XML documents.
- SQLite: A lightweight, disk-based database that doesn't require a separate server process.
- Asyncio: Python’s built-in library for writing concurrent code using the async/await syntax.
Script Structure
DatabaseManager Class
This class forms the backbone of the data storage mechanism, handling all interactions with the SQLite database: creating the schema, deduplicating listings by URL, and inserting new rows.
```python
class DatabaseManager:
    def __init__(self):
        self.conn = sqlite3.connect('market_listings.db')
        self.cursor = self.conn.cursor()
        self._prepare_database()

    def _prepare_database(self):
        """Create the database table if it does not exist."""
        self.cursor.execute('''
            CREATE TABLE IF NOT EXISTS market_listings (
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                title TEXT NOT NULL,
                mileage REAL,
                price REAL NOT NULL,
                location TEXT NOT NULL,
                url TEXT NOT NULL UNIQUE,
                image TEXT,
                created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
            )
        ''')
        self.conn.commit()

    def listing_exists(self, url):
        self.cursor.execute("SELECT COUNT(1) FROM market_listings WHERE url = ?", (url,))
        return self.cursor.fetchone()[0] > 0

    def create_market_listing(self, title, mileage, price, location, url, image):
        if self.listing_exists(url):
            logger.info(f"Listing with URL {url} already exists. Skipping insert.")
            return None
        try:
            self.cursor.execute('''
                INSERT INTO market_listings (title, mileage, price, location, url, image)
                VALUES (?, ?, ?, ?, ?, ?)
            ''', (title, mileage, price, location, url, image))
            self.conn.commit()
            return self.cursor.lastrowid
        except sqlite3.IntegrityError as e:
            logger.error(f"Unique constraint failed while inserting into database: {e}")
            return None
        except Exception as e:
            logger.error(f"An error occurred while inserting into database: {e}")
            return None

    def retrieve_all_listings(self):
        self.cursor.execute("SELECT * FROM market_listings")
        return self.cursor.fetchall()

    def close_connection(self):
        self.conn.close()
```
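To see the deduplication behaviour in isolation, here is a condensed, self-contained sketch of the same pattern; it uses an in-memory database and made-up sample values rather than the script's real data:

```python
import sqlite3

# In-memory database standing in for market_listings.db, same schema
# (minus the timestamp column) so the example is self-contained.
conn = sqlite3.connect(":memory:")
cursor = conn.cursor()
cursor.execute("""
    CREATE TABLE IF NOT EXISTS market_listings (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        title TEXT NOT NULL,
        mileage REAL,
        price REAL NOT NULL,
        location TEXT NOT NULL,
        url TEXT NOT NULL UNIQUE,
        image TEXT
    )
""")

def listing_exists(url):
    cursor.execute("SELECT COUNT(1) FROM market_listings WHERE url = ?", (url,))
    return cursor.fetchone()[0] > 0

def create_market_listing(title, mileage, price, location, url, image):
    # Skip duplicates instead of hitting the UNIQUE constraint.
    if listing_exists(url):
        return None
    cursor.execute(
        "INSERT INTO market_listings (title, mileage, price, location, url, image) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (title, mileage, price, location, url, image),
    )
    conn.commit()
    return cursor.lastrowid

first = create_market_listing("2015 Honda Civic", 120.0, 12500.0,
                              "Toronto, ON", "https://example.com/listing/1", None)
dup = create_market_listing("2015 Honda Civic", 120.0, 12500.0,
                            "Toronto, ON", "https://example.com/listing/1", None)
print(first, dup)  # the second insert is skipped and returns None
```

The UNIQUE constraint on `url` is the safety net; the `listing_exists` check just keeps the happy path free of exceptions.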
FacebookMarketplaceScraper Class
The scraper class is the script's workhorse, responsible for sending HTTP requests through a scraping API, parsing the returned HTML, and extracting the listing data from Facebook Marketplace.
```python
class FacebookMarketplaceScraper:
    def __init__(self, city, query, db_manager):
        self.city = city
        self.query = query
        self.db_manager = db_manager

    def scrape_city(self, city, query):
        """Scrape a single city."""
        url = "https://scraper-api.smartproxy.com/v2/scrape"
        logger.info(f"Scraping {city}.")
        payload = {
            "target": "universal",
            "locale": "en-US",
            "device_type": "desktop",
            "headless": "html",
            "url": f"https://www.facebook.com/marketplace/{city}/search/?query={query}&exact=false",
        }
        headers = {
            "accept": "application/json",
            "content-type": "application/json",
            "authorization": "YOUR_API_KEY",
        }
        logger.info(f"payload: {payload}")
        # Note: avoid logging `headers` -- it contains the API key.
        response = requests.post(url, data=json.dumps(payload), headers=headers)
        logger.info(f"response.status_code: {response.status_code}")
        logger.debug(f"response.text: {response.text}")
        if not response.content:
            logger.error(
                f"Empty response while scraping: {response.status_code}"
            )
            return []
        try:
            json_response = response.json()
        except ValueError as e:
            logger.error(f"Error decoding JSON: {e}")
            return []
        listings_content = json_response.get("results", [])
        if not listings_content:
            logger.info("No results found in the response.")
            return []
        first_result_content = listings_content[0].get("content")
        if not first_result_content:
            logger.info("No content found in the first result.")
            return []
        soup = BeautifulSoup(first_result_content, "html.parser")
        soup_listings = soup.find_all(
            "div",
            class_="x9f619 x78zum5 x1r8uery xdt5ytf x1iyjqo2 xs83m0k x1e558r4 x150jy0e x1iorvi4 xjkvuk6 xnpuxes x291uyu x1uepa24",
        )
        if not soup_listings:
            logger.info("No listings found in the parsed HTML.")
            return []
        logger.info(f"Found {len(soup_listings)} listings.")
        return soup_listings
```
```python
    def parse_listings(self, soup_listings):
        new_listings = []  # Collect new listings here
        for soup_listing in soup_listings:
            try:
                # Extract each field from the listing
                price = self.extract_price(soup_listing)
                mileage = self.extract_mileage(soup_listing)
                title = self.extract_title(soup_listing)
                image = self.extract_image(soup_listing)
                location = self.extract_location(soup_listing)
                post_url = self.extract_post_url(soup_listing)
                # Validate extracted data
                if not self.is_valid_listing(title, price, location, post_url):
                    continue
                # Skip listings already in the database
                if self.db_manager.listing_exists(post_url):
                    continue
                # Add the new listing to the database
                listing_id = self.db_manager.create_market_listing(
                    title, mileage, price, location, post_url, image
                )
                if listing_id:
                    new_listings.append(
                        (title, mileage, price, location, post_url, image)
                    )
            except Exception as e:
                logger.error(f"Error processing listing: {e}")
                continue
        logger.info(f"Found {len(new_listings)} new listings.")
        return new_listings

    def extract_price(self, soup_listing):
        text = soup_listing.get_text(strip=True)
        # Match a price optionally followed by a year in the range 1950-2024
        price_match = re.search(
            r"(\$\d{1,3}(?:,\d{3})?)(?=(1950|19[6-9]\d|20[0-1]\d|202[0-4])?)", text
        )
        if price_match:
            return price_match.group(1)
        # Fall back to simpler patterns: a price with a thousands separator...
        price_match = re.search(r"(\$\d+,\d+)", text)
        if price_match:
            return price_match.group(1)
        # ...and a bare dollar amount
        price_match = re.search(r"(\$\d+)", text)
        if price_match:
            return price_match.group(1)
        return None

    def extract_mileage(self, soup_listing):
        mileage_match = re.search(r"(\d+K) km", soup_listing.get_text(strip=True))
        return mileage_match.group(1) if mileage_match else None

    def extract_title(self, soup_listing):
        title_elem = soup_listing.find(
            "span", class_="x1lliihq x6ikm8r x10wlt62 x1n2onr6"
        )
        return title_elem.get_text(strip=True) if title_elem else None

    def extract_image(self, soup_listing):
        image_elem = soup_listing.find(
            "img", class_="xt7dq6l xl1xv1r x6ikm8r x10wlt62 xh8yej3"
        )
        return image_elem["src"] if image_elem else None

    def extract_location(self, soup_listing):
        # Match a capitalized place name followed by a comma and then a
        # two-letter province code, e.g. "Toronto, ON"
        location_match = re.search(
            r"([A-Z][a-z]+(?: [A-Z][a-z]+)*), [A-Z]{2}",
            soup_listing.get_text(strip=True),
        )
        return location_match.group(1) if location_match else None

    def extract_post_url(self, soup_listing):
        url_elem = soup_listing.find(
            "a",
            class_="x1i10hfl xjbqb8w x6umtig x1b1mbwd xaqea5y xav7gou x9f619 x1ypdohk xt0psk2 xe8uvvx xdj266r x11i5rnm xat24cr x1mh8g0r xexx8yu x4uap5 x18d9i69 xkhd6sd x16tdsg8 x1hl2dhg xggy1nq x1a2a7pz x1heor9g x1lku1pv",
        )
        return "https://www.facebook.com" + url_elem["href"] if url_elem else None

    def is_valid_listing(self, title, price, location, url):
        """Check the required fields, logging which ones are missing."""
        missing_info = [
            name
            for name, value in (
                ("title", title),
                ("price", price),
                ("location", location),
                ("url", url),
            )
            if not value
        ]
        if missing_info:
            # Elements with nothing at all are layout noise; only log
            # listings that are partially populated.
            if len(missing_info) < 4:
                logger.info(f"Skipping listing; missing: {', '.join(missing_info)}")
            return False
        return True
```
Asynchronous Operation
Utilizing asyncio, the script is capable of performing scraping operations at predetermined intervals, allowing for up-to-date data collection without manual intervention.
```python
    async def scrape_city_and_save_periodically(self, city, query, interval, duration):
        start_time = datetime.now()
        logger.info(f"Starting periodic scraping at {start_time}.")
        end_time = start_time + timedelta(hours=duration)
        logger.info(f"Periodic scraping will end at {end_time}.")
        while datetime.now() < end_time:
            try:
                # Scrape the city
                soup_listings = self.scrape_city(city, query)
                logger.info(f"Scraped {len(soup_listings)} listings from {city} at {datetime.now()}.")
                if not soup_listings:
                    logger.info("No listings found to process.")
                    continue
                # Parse the listings; parse_listings also writes each new
                # listing to the database as it goes, so no second insert
                # pass is needed here.
                new_listings = self.parse_listings(soup_listings)
                if not new_listings:
                    logger.info("No new listings found to upload.")
                    continue
                logger.info(f"Saved {len(new_listings)} new listings: {new_listings}")
            except Exception as e:
                logger.error(f"Error while scraping: {e}")
            finally:
                # Wait for the specified interval; the finally block runs
                # even when a `continue` above skips the rest of the iteration.
                await asyncio.sleep(interval)
```
Deep Dive into Code
Setting Up Logging
Effective logging is crucial for monitoring the script's performance and debugging. The script is configured to log both to a file and the console, ensuring comprehensive tracking of events.
```python
import logging

# Set up logging to both a file and the console
logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)

log_format = logging.Formatter(
    "%(asctime)s - %(name)s - [%(levelname)s] [%(pathname)s:%(lineno)d] - %(message)s - [%(process)d:%(thread)d]"
)

file_handler = logging.FileHandler("scraper.log")
file_handler.setFormatter(log_format)
logger.addHandler(file_handler)

console_handler = logging.StreamHandler()
console_handler.setFormatter(log_format)
logger.addHandler(console_handler)

logger.info("Logger initialized.")
```
Database Operations
The script initializes the database and defines a schema for storing listing data. It includes robust error handling and data validation to maintain database integrity.
```python
# Initialize the database; DatabaseManager's constructor creates the
# market_listings table if it does not already exist.
db_manager = DatabaseManager()
logger.info("Database initialized.")

# Retrieve all listings from the database
listings = db_manager.retrieve_all_listings()
logger.info(f"Retrieved {len(listings)} listings from the database.")
```
Data Scraping and Parsing
The script makes use of BeautifulSoup to parse HTML content fetched from Facebook Marketplace. It navigates the DOM structure to extract details such as title, price, and location of the vehicle listings.
```python
# Scrape the city
soup_listings = scraper.scrape_city(city, query)
logger.info(f"Scraped {len(soup_listings)} listings from {city} at {datetime.now()}.")

if soup_listings:
    # Parse the listings (this also stores new ones in the database)
    new_listings = scraper.parse_listings(soup_listings)
    logger.info(f"Found {len(new_listings)} new listings.")
```
Handling Asynchronous Tasks
The main entry point runs the periodic scraper on asyncio's event loop, so data collection continues at the chosen interval without manual intervention.
```python
# Scrape the city and save periodically. asyncio.run() creates and
# closes the event loop itself, replacing the older
# get_event_loop()/run_until_complete() pattern.
asyncio.run(
    scraper.scrape_city_and_save_periodically(city, query, interval, duration)
)
```
Usage Scenario
The script can be used to collect data for a variety of purposes, from market research to personal data aggregation. It can be adapted to scrape data from other Facebook Marketplace categories and locations.
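The periodic pattern at the script's core can be reduced to a few lines. This sketch is a simplified stand-in, not the script itself: it swaps the real scraper for a stub function and measures duration in seconds so it finishes quickly:

```python
import asyncio
from datetime import datetime, timedelta

async def scrape_periodically(scrape, interval, duration):
    """Call scrape() every `interval` seconds until `duration` seconds elapse."""
    end_time = datetime.now() + timedelta(seconds=duration)
    results = []
    while datetime.now() < end_time:
        try:
            results.append(scrape())
        except Exception as exc:
            # A failed run is logged and skipped; the loop keeps going.
            print(f"Error while scraping: {exc}")
        await asyncio.sleep(interval)
    return results

# Stub standing in for FacebookMarketplaceScraper.scrape_city.
def fake_scrape():
    return ["listing"]

runs = asyncio.run(scrape_periodically(fake_scrape, interval=0.1, duration=0.25))
print(len(runs))  # roughly duration / interval iterations
```

The real script does the same thing with hours instead of seconds and a database write instead of a list append.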
Challenges and Solutions
Challenge: Scraping Dynamic Content
Facebook Marketplace loads its listings dynamically with JavaScript, so a plain HTTP request returns little of the data visible in a browser, which defeats traditional scraping methods.
Solution: Smartproxy
Smartproxy's Scraper API sends requests through a rotating pool of over 40 million residential proxies and can render the page in a headless browser (the "headless": "html" field in the payload), returning the fully rendered HTML. This bypasses anti-scraping measures and makes the dynamic content scrapable.
Challenge: Parsing HTML Content
The rendered HTML identifies elements with long, auto-generated class names, and get_text() flattens each listing's title, price, mileage, and location into one unbroken string, which makes reliable extraction difficult.
Solution: Regular Expressions
Regular expressions pull the individual fields (price, mileage, location) out of the flattened listing text, giving efficient and accurate extraction without depending on fragile markup structure.
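As a concrete illustration, the script's price, mileage, and location patterns can be exercised on an invented sample of the flattened text that get_text(strip=True) produces (real listing text may differ):

```python
import re

# Invented sample of a flattened listing string: price, title, mileage,
# and location run together with no separators.
text = "$12,5002015 Honda Civic120K kmToronto, ON"

price_match = re.search(r"(\$\d{1,3}(?:,\d{3})?)", text)
mileage_match = re.search(r"(\d+K) km", text)
location_match = re.search(
    r"([A-Z][a-z]+(?: [A-Z][a-z]+)*), [A-Z]{2}", text
)

print(price_match.group(1))     # $12,500
print(mileage_match.group(1))   # 120K
print(location_match.group(1))  # Toronto
```

Note how the price pattern stops after one thousands group, which is what keeps it from swallowing the model year that follows it in the flattened text.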
Challenge: Data Storage
Scraped listings must be stored durably and without duplicates across repeated scraping runs.
Solution: DatabaseManager Class
The DatabaseManager class handles all interactions with the SQLite database and enforces a UNIQUE constraint on the listing URL, so repeated runs never insert the same listing twice.
Future Enhancements
- Scrape Other Categories: adapt the script to other Facebook Marketplace categories, such as electronics and clothing.
- Scrape Other Locations: extend coverage beyond Canada, for example to the United States and the United Kingdom.
- Scrape Other Platforms: apply the same pattern to other platforms, such as Kijiji and Craigslist.
Conclusion
This Python script shows how much of a data collection pipeline can be automated with a modest amount of code. It is not just a tool but a framework that can be adapted to various other scraping needs, showcasing Python's flexibility and efficiency.
GitHub Repository
For a detailed view of the code and to try it out, visit the GitHub repository.
Disclaimer
This tool is designed for educational and research purposes. Users must comply with Facebook's terms of service and applicable laws when scraping data.
Created: February 5, 2024