Navigating through js/ajax(href="#") based pagination using Scrapy - A helpful guide

I want to loop through all the category URLs and extract the content from each page. So far this code retrieves only the first category URL, using

urls = [response.xpath('//ul[@class="flexboxesmain categorieslist"]/li/a/@href').extract()[0]]
but my ultimate objective is to fetch all URLs and their corresponding content.

I am currently utilizing the scrapy_selenium library. However, the Selenium page source is not being passed to the 'scrape_it' function as expected. As a newcomer to the scrapy framework, I would appreciate it if you could review my code and point out any potential issues.

Below is the spider code snippet:

# -*- coding: utf-8 -*-
import scrapy
from scrapy import Request
from scrapy import Selector
from scrapy_selenium import SeleniumRequest
from ..items import CouponcollectItem

class Couponsite6SpiderSpider(scrapy.Spider):
    name = 'couponSite6_spider'
    allowed_domains = ['www.couponcodesme.com']
    start_urls = ['https://www.couponcodesme.com/ae/categories']
    
    def parse(self, response):   
        urls = [response.xpath('//ul[@class="flexboxesmain categorieslist"]/li/a/@href').extract()[0]]
        for url in urls:
            yield SeleniumRequest(
                url=response.urljoin(url),
                wait_time=3,
                callback=self.parse_urls
            ) 

    def parse_urls(self, response):
        driver = response.meta['driver']
        while True:
            next_page = driver.find_element_by_xpath('//a[@class="category_pagination_btn next_btn bottom_page_btn"]')
            try:
                html = driver.page_source
                response_obj = Selector(text=html)
                self.scrape_it(response_obj)
                next_page.click()
            except:
                break
        driver.close()

    def scrape_it(self, response):
        items = CouponcollectItem()
        print('Hi there')
        items['store_img_src'] = response.css('#temp1 > div > div.voucher_col_left.flexbox.spaceBetween > div.vouchercont.offerImg.flexbox.column1 > div.column.column1 > div > div > a > img::attr(src)').extract()
        yield items  

The following code has been added to the settings.py file:

DOWNLOADER_MIDDLEWARES = {
    'scrapy_selenium.SeleniumMiddleware': 800
}

#SELENIUM
from shutil import which

SELENIUM_DRIVER_NAME = 'chrome'
SELENIUM_DRIVER_EXECUTABLE_PATH = which('chromedriver')
SELENIUM_DRIVER_ARGUMENTS = ['--headless']  # '--headless' for chrome; firefox uses '-headless'

I have included a link to a terminal_output screenshot for reference. Thank you for your assistance in resolving this matter.

Answer №1

One issue is that a single driver ends up shared between requests that are processed asynchronously, and multiple drivers cannot be run in parallel here. To resolve this, stop yielding a separate SeleniumRequest per category and drive one browser directly, which processes the categories one after another:

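Begin with:

from selenium import webdriver
from selenium.webdriver.common.by import By
import time

# One module-level browser, shared by every category and driven sequentially.
driver = webdriver.Chrome()

Then within your class:

def parse(self, response):
    # Collect every category link, not only the first one.
    urls = response.xpath('//ul[@class="flexboxesmain categorieslist"]/li/a/@href').extract()
    for url in urls:
        # urljoin resolves relative hrefs; yield from hands the items to Scrapy.
        yield from self.do_category(response.urljoin(url))

def do_page(self):
    time.sleep(1)  # crude wait for the JavaScript-rendered content to load
    html = driver.page_source
    response_obj = Selector(text=html)
    # scrape_it is a generator, so it has to be iterated for its body to run.
    yield from self.scrape_it(response_obj)

def do_category(self, url):
    driver.get(url)
    yield from self.do_page()
    # find_elements returns an empty list when there is no "next" link.
    next_links = driver.find_elements(By.CSS_SELECTOR, 'a.next_btn')
    while len(next_links) > 0:
        next_links[0].click()
        yield from self.do_page()
        next_links = driver.find_elements(By.CSS_SELECTOR, 'a.next_btn')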
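
A detail that is easy to miss here: scrape_it in the question ends with `yield items`, which makes it a generator function. Calling a generator function only creates a generator object; its body, including the print('Hi there'), runs only once something iterates over it. The sketch below illustrates this in plain Python, without Scrapy or Selenium. The names `call_and_discard` and `call_and_forward` are hypothetical, `scrape_it` here is only a stand-in for the spider method, and a dict takes the place of CouponcollectItem.

```python
calls = []  # records each time the generator body actually runs


def scrape_it(html):
    """Stand-in for the spider's scrape_it: a generator because it yields."""
    calls.append(html)
    yield {"store_img_src": html}


def call_and_discard(pages):
    """Mirrors calling scrape_it(...) as a plain statement."""
    for html in pages:
        scrape_it(html)  # creates a generator object; the body never runs
    return []


def call_and_forward(pages):
    """Forwards every item to the caller, as a Scrapy callback must."""
    for html in pages:
        yield from scrape_it(html)


pages = ["<p>page 1</p>", "<p>page 2</p>"]

discarded = call_and_discard(pages)
runs_after_discard = len(calls)

forwarded = list(call_and_forward(pages))
runs_after_forward = len(calls)

print(runs_after_discard, len(forwarded), runs_after_forward)  # prints: 0 2 2
```

In a spider, the same idea means every method between the callback and scrape_it has to pass the items upward (for example with `yield from`), since Scrapy only collects what the callback itself yields or returns.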

If speed becomes an issue, consider transitioning to Puppeteer.
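
Whichever browser automation tool drives the page, the pagination loop keeps the same shape: read the current page, look for a "next" link, click it, and stop when none is found. Below is a minimal sketch of that loop as a generator, assuming a Selenium-style driver that exposes `page_source` and `find_elements(by, value)`. The names `iter_page_sources`, `FakeDriver`, and `FakeLink` are hypothetical, and the fake driver exists only so the sketch runs without a browser. The `max_pages` cap is an added assumption, not part of the answer above. The string "css selector" is the value held by Selenium's `By.CSS_SELECTOR` constant.

```python
def iter_page_sources(driver, next_selector, max_pages=100):
    """Yield the HTML of each page, clicking the "next" link until it is gone."""
    for _ in range(max_pages):  # hard cap guards against an endless loop
        yield driver.page_source
        next_links = driver.find_elements("css selector", next_selector)
        if not next_links:
            return
        next_links[0].click()


class FakeLink:
    """Hypothetical stand-in for a clickable WebElement."""

    def __init__(self, driver):
        self._driver = driver

    def click(self):
        self._driver.index += 1


class FakeDriver:
    """Hypothetical stand-in for a WebDriver holding a fixed list of pages."""

    def __init__(self, pages):
        self.pages = pages
        self.index = 0

    @property
    def page_source(self):
        return self.pages[self.index]

    def find_elements(self, by, value):
        # A "next" link exists on every page except the last one.
        has_next = self.index < len(self.pages) - 1
        return [FakeLink(self)] if has_next else []


driver = FakeDriver(["<p>1</p>", "<p>2</p>", "<p>3</p>"])
sources = list(iter_page_sources(driver, "a.next_btn"))
print(sources)  # prints: ['<p>1</p>', '<p>2</p>', '<p>3</p>']
```

With a real browser, some wait after each click (the answer uses time.sleep(1)) would still be needed so that the new content has loaded before `page_source` is read.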
