Guidelines for transferring Selenium WebDriver response to Scrapy's parse method

I am currently facing two challenges that I need to overcome:

First: effectively locating and interacting with an element on a website using a driver. Second: passing the link generated from this interaction to a parse method or LinkExtractor.

ad. 1.

My task involves finding and clicking the "Load more" button in order to crawl the subsequent page.

<div class="col-sm-4 col-sm-offset-4 col-md-2 col-md-offset-5 col-xs-12 col-xs-offset-0">
            <button class="btn btn-secondary">Load more</button>
        </div>

ad. 2.

I have set up LinkExtractor rules that function correctly on static websites, as well as a parse method. While I have come across similar examples in my research, I struggle to piece it all together seamlessly.

This is my latest attempt:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from os.path import join as path_join
import json
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
import time

# Code shortened for readability...

I have omitted a significant portion of the parse_page method for clarity (it does not impact functionality).

Here is another approach I experimented with, but without success:

    driver = webdriver.Chrome(ChromeDriverManager().install())

    def parse_url(self, response):
        self.driver.get(response.url)

        while True:
            self.driver.implicitly_wait(2)
            next = self.driver.find_element_by_class_name('btn btn-secondary')

            try:
                next.click()
                time.sleep(2)
                # Extract and process data into Scrapy items
            except:
                break

        self.driver.close()

Any insights or guidance on either of these challenges would be greatly appreciated.

Answer №1

To route selected requests through Selenium and get an HtmlResponse back, you can write a custom downloader middleware. Start with a Request subclass that marks the requests Selenium should handle:

from scrapy import Request

class SeleniumRequest(Request):
    pass

This subclass helps the middleware determine if Selenium should be utilized. Remember to close the driver when the spider closes.

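from scrapy import signals
from scrapy.http import HtmlResponse
from selenium import webdriver
from selenium.common.exceptions import WebDriverException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from webdriver_manager.chrome import ChromeDriverManager

class SeleniumMiddleware:
    def __init__(self):
        # On Selenium 4+, pass service=Service(ChromeDriverManager().install()) instead
        self.driver = webdriver.Chrome(ChromeDriverManager().install())

    @classmethod
    def from_crawler(cls, crawler):
        middleware = cls()
        # Quit the browser when the spider finishes
        crawler.signals.connect(middleware.spider_closed, signals.spider_closed)
        return middleware

    def process_request(self, request, spider):
        # Ordinary requests are left to Scrapy's own downloader
        if not isinstance(request, SeleniumRequest):
            return None
        self.driver.get(request.url)

        # Keep clicking "Load more" until the button can no longer be found
        while True:
            try:
                button = WebDriverWait(self.driver, 10).until(
                    # 'btn btn-secondary' is two classes, so a CSS selector is needed
                    EC.presence_of_element_located((By.CSS_SELECTOR, 'button.btn.btn-secondary'))
                )
                self.driver.execute_script("arguments[0].click();", button)
            except WebDriverException:
                # Timed out waiting for the button, or the click failed: stop clicking
                break

        return HtmlResponse(
            self.driver.current_url,
            body=self.driver.page_source.encode('utf-8'),
            encoding='utf-8',
            request=request
        )

    def spider_closed(self):
        self.driver.quit()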
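
A note on locating the button: its class attribute, "btn btn-secondary", lists two separate classes. As far as I know, Selenium's class-name locator expects a single class name, so a value containing a space will not match the button, and a CSS selector that chains the classes is the safer choice. The sketch below shows the conversion; classes_to_css is a hypothetical helper written for illustration and is not part of Selenium or Scrapy.

```python
def classes_to_css(class_attr, tag=""):
    """Turn an HTML class attribute value into a CSS selector.

    Hypothetical helper for illustration only; not a Selenium or Scrapy API.
    """
    names = class_attr.split()  # "btn btn-secondary" -> ["btn", "btn-secondary"]
    if not names:
        raise ValueError("class attribute is empty")
    # Chain every class with a leading dot: .btn.btn-secondary
    return tag + "".join("." + name for name in names)


selector = classes_to_css("btn btn-secondary", tag="button")
print(selector)  # button.btn.btn-secondary
```

The resulting string can then be passed to Selenium as a (By.CSS_SELECTOR, selector) locator.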

Enable the middleware in your settings file:

DOWNLOADER_MIDDLEWARES = {
    'your_project.middleware_location.SeleniumMiddleware': 500
}

You can now use SeleniumRequest in your Spider's start_requests method:

def start_requests(self):
    yield SeleniumRequest(your_url, your_callback)

With this setup, every SeleniumRequest is rendered by Selenium and comes back as an HtmlResponse, while ordinary requests are still downloaded by Scrapy as usual. You can use LinkExtractors and your parse method on these responses just as you would on any other response. For further guidance, check out this GitHub repository.
