Navigating through js/ajax(href="#") based pagination using Scrapy - A helpful guide

Question

I am looking to loop through all the category URLs and extract the content from each page. Although I have attempted to retrieve only the first category URL using

urls = [response.xpath('//ul[@class="flexboxesmain categorieslist"]/li/a/@href').extract()[0]]

in this code, my ultimate objective is to fetch all URLs and their corresponding content.

I am currently utilizing the scrapy_selenium library. However, the Selenium page source is not being passed to the 'scrape_it' function as expected. As a newcomer to the scrapy framework, I would appreciate it if you could review my code and point out any potential issues.

Below is the spider code snippet:

# -*- coding: utf-8 -*-
import scrapy
from scrapy import Request
from scrapy import Selector
from scrapy_selenium import SeleniumRequest
from ..items import CouponcollectItem

class Couponsite6SpiderSpider(scrapy.Spider):
    name = 'couponSite6_spider'
    allowed_domains = ['www.couponcodesme.com']
    start_urls = ['https://www.couponcodesme.com/ae/categories']
    
    def parse(self, response):   
        urls = [response.xpath('//ul[@class="flexboxesmain categorieslist"]/li/a/@href').extract()[0]]
        for url in urls:
            yield SeleniumRequest(
                url=response.urljoin(url),
                wait_time=3,
                callback=self.parse_urls
            ) 

    def parse_urls(self, response):
        driver = response.meta['driver']
        while True:
            next_page = driver.find_element_by_xpath('//a[@class="category_pagination_btn next_btn bottom_page_btn"]')
            try:
                html = driver.page_source
                response_obj = Selector(text=html)
                self.scrape_it(response_obj)
                next_page.click()
            except:
                break
        driver.close()

    def scrape_it(self, response):
        items = CouponcollectItem()
        print('Hi there')
        items['store_img_src'] = response.css('#temp1 > div > div.voucher_col_left.flexbox.spaceBetween > div.vouchercont.offerImg.flexbox.column1 > div.column.column1 > div > div > a > img::attr(src)').extract()
        yield items

The following code has been added to the settings.py file:

DOWNLOADER_MIDDLEWARES = {
    'scrapy_selenium.SeleniumMiddleware': 800
}

#SELENIUM
from shutil import which

SELENIUM_DRIVER_NAME = 'chrome'
SELENIUM_DRIVER_EXECUTABLE_PATH = which('chromedriver')
SELENIUM_DRIVER_ARGUMENTS=['-headless']  # '--headless' if using chrome instead of firefox

I have included a link to a terminal_output screenshot for reference. Thank you for your assistance in resolving this matter.

python selenium web-scraping scrapy

Answer 1

One problem is that a single driver ends up shared between callbacks that Scrapy runs asynchronously, and you cannot run several drivers in parallel this way. If you stop yielding one SeleniumRequest per category and drive a single browser yourself, the categories are processed one after another:

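Begin with, at module level:

from selenium import webdriver
from selenium.webdriver.common.by import By
import time

# One browser instance, reused for every category.
driver = webdriver.Chrome()

Then within your class:

def parse(self, response):
  # Take every category link, not just the first one.
  urls = response.xpath('//ul[@class="flexboxesmain categorieslist"]/li/a/@href').extract()
  for url in urls:
    # No Request is yielded here, so categories run sequentially;
    # yield from only forwards the scraped items to Scrapy.
    yield from self.do_category(response.urljoin(url))

def do_page(self):
  time.sleep(1)  # crude wait for the AJAX content to render
  html = driver.page_source
  response_obj = Selector(text=html)
  # scrape_it is a generator, so it has to be iterated for its body to run.
  yield from self.scrape_it(response_obj)

def do_category(self, url):
  driver.get(url)
  yield from self.do_page()
  # find_elements returns an empty list (no exception) when there is no next button.
  next_links = driver.find_elements(By.CSS_SELECTOR, 'a.next_btn')
  while len(next_links) > 0:
    next_links[0].click()
    yield from self.do_page()
    next_links = driver.find_elements(By.CSS_SELECTOR, 'a.next_btn')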
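
The loop in do_category stops cleanly because the plural find_elements call returns an empty list on the last page, whereas the singular find_element call in the question raises when the button is missing. To follow the control flow without launching a browser, here is a rough sketch using a made-up stand-in for the driver. FakeDriver, find_next_links, click_next and iter_pages are hypothetical names for illustration only, not part of Selenium or Scrapy.

```python
class FakeDriver:
    """Hypothetical stand-in for a Selenium driver, for illustration only."""

    def __init__(self, pages):
        self.pages = pages
        self.index = 0

    @property
    def page_source(self):
        return self.pages[self.index]

    def find_next_links(self):
        # Mimics a plural find_elements call: a list, empty on the last page.
        return ["next"] if self.index < len(self.pages) - 1 else []

    def click_next(self):
        self.index += 1


def iter_pages(driver):
    """Yield the source of every page, following 'next' until it disappears."""
    yield driver.page_source
    while driver.find_next_links():
        driver.click_next()
        yield driver.page_source


pages = list(iter_pages(FakeDriver(["page 1", "page 2", "page 3"])))
print(pages)  # ['page 1', 'page 2', 'page 3']
```

The first page is handled before the loop, so a category with a single page still yields one result.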

If speed becomes an issue, consider transitioning to Puppeteer.
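
One more thing worth checking in the question's code: scrape_it contains a yield, which makes it a generator function. Calling a generator function without iterating over the result does not run its body, which may be why the page source never seemed to reach scrape_it. The sketch below shows the behaviour in plain Python. This scrape_it is a simplified stand-in with a made-up return value, and calls is a hypothetical list used only to record when the body runs.

```python
calls = []  # records every time the function body actually runs


def scrape_it(html):
    # Stand-in for the spider method: the yield makes this a generator function.
    calls.append(html)
    yield {"length": len(html)}


# Calling it only creates a generator object; the body has not run yet.
pending = scrape_it("<html></html>")
print(calls)  # []

# Iterating (or using `yield from` inside another generator) runs the body.
items = list(pending)
print(calls)  # ['<html></html>']
print(items)  # [{'length': 13}]
```

In a spider, this means every caller between parse and scrape_it has to pass the items along (for example with yield from), otherwise Scrapy never receives them.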
