The Selector object cannot be serialized into JSON format

I'm trying to scrape a dynamic website, which requires the use of Selenium.

The links I want to scrape only become visible when clicked. They are generated by jQuery and have no href attribute or URL, so clicking is the only option.

My current approach involves:

# -*- coding: utf-8 -*-
import scrapy

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from scrapy.selector import Selector
from scrapy_selenium import SeleniumRequest

class AnofmSpider(scrapy.Spider):
    name = 'anofm'
    
    def start_requests(self):
        yield SeleniumRequest(
            url='https://www.anofm.ro/lmvw.html?agentie=Covasna&categ=3&subcateg=1',
            callback=self.parse
        )

    def parse(self, response):  
        driver = response.meta['driver'] 
        try:
            element = WebDriverWait(driver, 10).until(
                EC.presence_of_element_located((By.ID, "tableRepeat2"))
            )
        finally:
            html = driver.page_source
            response_obj = Selector(text=html)
            
            links = response_obj.xpath("//tbody[@id='tableRepeat2']")
            for link in links:
                driver.execute_script("arguments[0].click();", link)
                
                yield {
                    'Ocupatia': response_obj.xpath("//div[@id='print']/p/text()[1]")
                }

However, this approach doesn't work.

An error occurs while attempting to click on the element:

TypeError: Object of type Selector is not JSON serializable

I understand roughly what the error means, but I'm unsure how to fix it. Is there a way to turn the Selector result into an element that Selenium can actually click?
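For context, this error can be reproduced outside Scrapy: the JSON feed exporter effectively runs json.dumps over each yielded item, and any object the json module doesn't understand triggers the same message. A minimal stand-in class is used here instead of a real Selector, purely for illustration:

```python
import json

class FakeSelector:
    """Stand-in for scrapy.selector.Selector, just to show the error shape."""
    pass

item = {'Ocupatia': FakeSelector()}  # like yielding a Selector instead of a string

try:
    json.dumps(item)  # what a JSON feed exporter effectively does with the item
except TypeError as e:
    print(e)  # Object of type FakeSelector is not JSON serializable

# The fix is to yield plain data (e.g. the result of .get() / .getall()):
json.dumps({'Ocupatia': 'some extracted text'})  # serializes fine
```

In other words, the item should contain strings or numbers, not Selector objects.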

After scouring online resources and documentation, I haven't been able to find a solution.

Any assistance in understanding and resolving this issue would be greatly appreciated.

Thank you.

Answer №1

The data is actually loaded through API requests that return JSON, so you can extract it directly from the API without Selenium at all. Below is a working solution that includes pagination: each page returns 8 items, and there are 32 items in total.

CODE:

import scrapy
import json

class AnofmSpider(scrapy.Spider):

    name = 'anofm'

    def start_requests(self):
        yield scrapy.Request(
            url='https://www.anofm.ro/dmxConnect/api/oferte_bos/oferte_bos_query2L_Test.php?offset=8&cauta=&select=Covasna&limit=8&localitate=',
            method='GET',
            callback=self.parse,
            meta={'limit': 8}
        )


    def parse(self, response):
        resp = json.loads(response.body)
        hits = resp.get('lmv').get('data')
        for h in hits:
            yield {
                'Ocupatia': h.get('OCCUPATION')
            }


        total_limit = resp.get('lmv').get('total')
        next_limit = response.meta['limit'] + 8
        if next_limit <= total_limit:
            yield scrapy.Request(
                url=f'https://www.anofm.ro/dmxConnect/api/oferte_bos/oferte_bos_query2L_Test.php?offset=8&cauta=&select=Covasna&limit={next_limit}&localitate=',
                method='GET',
                callback=self.parse,
                meta={'limit': next_limit}
            )
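The pagination above simply grows the limit query parameter by the page size until it passes the total reported by the API. The stopping logic can be isolated as a small helper (next_limit and PAGE_SIZE below are illustrative names, not part of the answer's code):

```python
PAGE_SIZE = 8

def next_limit(current_limit, total):
    """Return the limit for the next API request, or None when all pages are done."""
    candidate = current_limit + PAGE_SIZE
    return candidate if candidate <= total else None

# Walk through the pages for a total of 32 items, starting at limit=8.
limits = []
limit = PAGE_SIZE
while limit is not None:
    limits.append(limit)
    limit = next_limit(limit, 32)

print(limits)  # [8, 16, 24, 32]
```

This mirrors the spider's condition `next_limit <= total_limit`: the final request is the one whose limit equals the total, and the next candidate (40 here) stops the loop.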

Answer №2

The problem is that you are mixing Scrapy objects with Selenium functions: a Scrapy Selector only parses HTML text, while Selenium's click functions expect live WebElement objects. A better approach is to rely solely on Selenium for this part.

        finally:

            # Selenium 4 syntax (the old find_elements_by_xpath was removed)
            links = driver.find_elements(By.XPATH, "//tbody[@id='tableRepeat2']/tr")
            print('len(links):', len(links))
            
            for link in links:
                # these didn't work for me:
                #driver.execute_script("arguments[0].scrollIntoView();", link)
                #link.click()
                
                # open information
                driver.execute_script("arguments[0].click();", link)
                
                # javascript may need some time to display it
                time.sleep(1)
                
                # get data
                ocupatia = driver.find_element(By.XPATH, ".//div[@id='print']/p").text
                ocupatia = ocupatia.split('\n', 1)[0]        # first line
                ocupatia = ocupatia.split(':', 1)[1].strip() # text after first `:`
                print('Ocupatia -->', ocupatia)

                # close information
                driver.find_element(By.XPATH, '//button[text()="Inchide"]').click()

                yield {
                    'Ocupatia': ocupatia
                }

Below is the complete, runnable script. You can execute it directly from a single file with python script.py; there is no need to create a Scrapy project.

Ensure that you update the SELENIUM_DRIVER_EXECUTABLE_PATH variable with the correct path before running the code.

import scrapy

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from scrapy_selenium import SeleniumRequest
import time

class AnofmSpider(scrapy.Spider):
    name = 'anofm'
    
    def start_requests(self):
        yield SeleniumRequest(
            url='https://www.anofm.ro/lmvw.html?agentie=Covasna&categ=3&subcateg=1',
            #callback=self.parse
        )

    def parse(self, response):  
        driver = response.meta['driver'] 
        try:
            print("try")
            element = WebDriverWait(driver, 20).until(
                EC.presence_of_element_located((By.XPATH, "//tbody[@id='tableRepeat2']/tr/td"))
            )
        finally:
            print("finally")

            links = driver.find_elements(By.XPATH, "//tbody[@id='tableRepeat2']/tr")
            print('len(links):', len(links))
            
            for link in links:
                #driver.execute_script("arguments[0].scrollIntoView();", link)
                #link.click()
                
                # open information
                driver.execute_script("arguments[0].click();", link)
                
                # javascript may need some time to display it
                time.sleep(1)
                
                # get data
                ocupatia = driver.find_element(By.XPATH, ".//div[@id='print']/p").text
                ocupatia = ocupatia.split('\n', 1)[0]        # first line
                ocupatia = ocupatia.split(':', 1)[1].strip() # text after first `:`
                print('Ocupatia -->', ocupatia)

                # close information
                driver.find_element(By.XPATH, '//button[text()="Inchide"]').click()

                yield {
                    'Ocupatia': ocupatia
                }

# --- run without project and save in `output.csv` ---

from scrapy.crawler import CrawlerProcess

c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',
    # save in file CSV, JSON or XML
    'FEEDS': {'output.csv': {'format': 'csv'}},  # new in 2.1

    'DOWNLOADER_MIDDLEWARES': {'scrapy_selenium.SeleniumMiddleware': 800},

    'SELENIUM_DRIVER_NAME': 'firefox',
    'SELENIUM_DRIVER_EXECUTABLE_PATH': '/home/furas/bin/geckodriver',
    'SELENIUM_DRIVER_ARGUMENTS': [], # ['-headless']
})
c.crawl(AnofmSpider)
c.start() 
