The Selector object cannot be serialized into JSON format

I'm trying to scrape a dynamic website, which requires the use of Selenium.

The links I want to scrape only become visible when clicked. They are generated by jQuery and have no href attribute or URL, so clicking is the only option.

My current approach involves:

# -*- coding: utf-8 -*-
import scrapy

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from scrapy.selector import Selector
from scrapy_selenium import SeleniumRequest

class AnofmSpider(scrapy.Spider):
    name = 'anofm'
    
    def start_requests(self):
        yield SeleniumRequest(
            url='https://www.anofm.ro/lmvw.html?agentie=Covasna&categ=3&subcateg=1',
            callback=self.parse
        )

    def parse(self, response):  
        driver = response.meta['driver'] 
        try:
            element = WebDriverWait(driver, 10).until(
                EC.presence_of_element_located((By.ID, "tableRepeat2"))
            )
        finally:
            html = driver.page_source
            response_obj = Selector(text=html)
            
            links = response_obj.xpath("//tbody[@id='tableRepeat2']")
            for link in links:
                driver.execute_script("arguments[0].click();", link)
                
                yield {
                    'Ocupatia': response_obj.xpath("//div[@id='print']/p/text()[1]")
                }

However, this approach doesn't work.

An error occurs while attempting to click on the element:

TypeError: Object of type Selector is not JSON serializable

I understand roughly what the error means, but I'm unsure how to fix it. Is there a way to turn the Selector result into an element that Selenium can actually click?
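For context, this error can be reproduced outside Scrapy: the JSON feed exporter effectively runs json.dumps over each yielded item, and any object the json module doesn't understand triggers the same message. A minimal stand-in class is used here instead of a real Selector, purely for illustration:

```python
import json

class FakeSelector:
    """Stand-in for scrapy.selector.Selector, just to show the error shape."""
    pass

item = {'Ocupatia': FakeSelector()}  # like yielding a Selector instead of a string

try:
    json.dumps(item)  # what a JSON feed exporter effectively does with the item
except TypeError as e:
    print(e)  # Object of type FakeSelector is not JSON serializable

# The fix is to yield plain data (e.g. the result of .get() / .getall()):
json.dumps({'Ocupatia': 'some extracted text'})  # serializes fine
```

In other words, the item should contain strings or numbers, not Selector objects.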

After scouring online resources and documentation, I haven't been able to find a solution.

Any assistance in understanding and resolving this issue would be greatly appreciated.

Thank you.

Answer №1

The data is actually loaded through API requests that return JSON, so you can extract it directly from the API without Selenium at all. Below is a working solution that includes pagination: each page returns 8 items, and there are 32 items in total.

CODE:

import scrapy
import json

class AnofmSpider(scrapy.Spider):

    name = 'anofm'

    def start_requests(self):
        yield scrapy.Request(
            url='https://www.anofm.ro/dmxConnect/api/oferte_bos/oferte_bos_query2L_Test.php?offset=8&cauta=&select=Covasna&limit=8&localitate=',
            method='GET',
            callback=self.parse,
            meta={'limit': 8}
        )


    def parse(self, response):
        resp = json.loads(response.body)
        hits = resp.get('lmv').get('data')
        for h in hits:
            yield {
                'Ocupatia': h.get('OCCUPATION')
            }


        total_limit = resp.get('lmv').get('total')
        next_limit = response.meta['limit'] + 8
        if next_limit <= total_limit:
            yield scrapy.Request(
                url=f'https://www.anofm.ro/dmxConnect/api/oferte_bos/oferte_bos_query2L_Test.php?offset=8&cauta=&select=Covasna&limit={next_limit}&localitate=',
                method='GET',
                callback=self.parse,
                meta={'limit': next_limit}
            )
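The pagination above simply grows the limit query parameter by the page size until it passes the total reported by the API. The stopping logic can be isolated as a small helper (next_limit and PAGE_SIZE below are illustrative names, not part of the answer's code):

```python
PAGE_SIZE = 8

def next_limit(current_limit, total):
    """Return the limit for the next API request, or None when all pages are done."""
    candidate = current_limit + PAGE_SIZE
    return candidate if candidate <= total else None

# Walk through the pages for a total of 32 items, starting at limit=8.
limits = []
limit = PAGE_SIZE
while limit is not None:
    limits.append(limit)
    limit = next_limit(limit, 32)

print(limits)  # [8, 16, 24, 32]
```

This mirrors the spider's condition `next_limit <= total_limit`: the final request is the one whose limit equals the total, and the next candidate (40 here) stops the loop.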

Answer №2

The problem is that you are mixing Scrapy objects with Selenium functions: a Scrapy Selector only parses HTML text, while Selenium's click functions expect live WebElement objects. A better approach is to rely solely on Selenium for this part.

        finally:

            # Selenium 4 syntax (the old find_elements_by_xpath was removed)
            links = driver.find_elements(By.XPATH, "//tbody[@id='tableRepeat2']/tr")
            print('len(links):', len(links))
            
            for link in links:
                # these didn't work for me:
                #driver.execute_script("arguments[0].scrollIntoView();", link)
                #link.click()
                
                # open information
                driver.execute_script("arguments[0].click();", link)
                
                # javascript may need some time to display it
                time.sleep(1)
                
                # get data
                ocupatia = driver.find_element(By.XPATH, ".//div[@id='print']/p").text
                ocupatia = ocupatia.split('\n', 1)[0]        # first line
                ocupatia = ocupatia.split(':', 1)[1].strip() # text after first `:`
                print('Ocupatia -->', ocupatia)

                # close information
                driver.find_element(By.XPATH, '//button[text()="Inchide"]').click()

                yield {
                    'Ocupatia': ocupatia
                }

Below is the complete, runnable script. You can execute it directly from a single file with python script.py; there is no need to create a Scrapy project.

Ensure that you update the SELENIUM_DRIVER_EXECUTABLE_PATH variable with the correct path before running the code.

import scrapy

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from scrapy_selenium import SeleniumRequest
import time

class AnofmSpider(scrapy.Spider):
    name = 'anofm'
    
    def start_requests(self):
        yield SeleniumRequest(
            url='https://www.anofm.ro/lmvw.html?agentie=Covasna&categ=3&subcateg=1',
            #callback=self.parse
        )

    def parse(self, response):  
        driver = response.meta['driver'] 
        try:
            print("try")
            element = WebDriverWait(driver, 20).until(
                EC.presence_of_element_located((By.XPATH, "//tbody[@id='tableRepeat2']/tr/td"))
            )
        finally:
            print("finally")

            links = driver.find_elements(By.XPATH, "//tbody[@id='tableRepeat2']/tr")
            print('len(links):', len(links))
            
            for link in links:
                #driver.execute_script("arguments[0].scrollIntoView();", link)
                #link.click()
                
                # open information
                driver.execute_script("arguments[0].click();", link)
                
                # javascript may need some time to display it
                time.sleep(1)
                
                # get data
                ocupatia = driver.find_element(By.XPATH, ".//div[@id='print']/p").text
                ocupatia = ocupatia.split('\n', 1)[0]        # first line
                ocupatia = ocupatia.split(':', 1)[1].strip() # text after first `:`
                print('Ocupatia -->', ocupatia)

                # close information
                driver.find_element(By.XPATH, '//button[text()="Inchide"]').click()

                yield {
                    'Ocupatia': ocupatia
                }

# --- run without project and save in `output.csv` ---

from scrapy.crawler import CrawlerProcess

c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',
    # save in file CSV, JSON or XML
    'FEEDS': {'output.csv': {'format': 'csv'}},  # new in 2.1

    'DOWNLOADER_MIDDLEWARES': {'scrapy_selenium.SeleniumMiddleware': 800},

    'SELENIUM_DRIVER_NAME': 'firefox',
    'SELENIUM_DRIVER_EXECUTABLE_PATH': '/home/furas/bin/geckodriver',
    'SELENIUM_DRIVER_ARGUMENTS': [], # ['-headless']
})
c.crawl(AnofmSpider)
c.start() 
