The Failure of Scrapy Pagination

Hello everyone, I am excited to share my first post!

Currently, I am working on developing a Web Spider that is capable of following links on invia.cz and extracting all the hotel titles.

import scrapy

y=0
class invia(scrapy.Spider):
    name = 'Kreta'
    start_urls = ['https://dovolena.invia.cz/?d_start_from=13.01.2017&sort=nl_sell&page=1']

    def parse(self, response):

        for x in range(1, 9):
            yield {
             'titles':response.css("#main > div > div > div > div.col.col-content > div.product-list > div > ul > li:nth-child(%d)>div.head>h2>a>span.name::text"%(x)).extract() ,
             }

        if (response.css('#main > div > div > div > div.col.col-content > div.product-list > div > p > a.next').extract_first()):
         y=y+1
         go = ["https://dovolena.invia.cz/d_start_from=13.01.2017&sort=nl_sell&page=%d" % y] 
         print go
         yield scrapy.Request(
                response.urljoin(go),
                callback=self.parse
         )

The pages load asynchronously via AJAX, so instead of following a link I build the next URL myself, incrementing the page number whenever the next button is present.

While testing in the scrapy shell, everything functions as expected. However, during actual spider execution, it only scrapes the initial page.

This project marks my debut into the world of web spiders, so any helpful advice is greatly appreciated!

For reference, here are the error logs: Error Log1, Error Log

Answer №1

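Your use of the "global" y variable is unusual and fragile. Because parse assigns to y without declaring it global, Python treats y as a local name, so y=y+1 will most likely raise UnboundLocalError as soon as that line runs.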
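To illustrate the scoping problem, here is a minimal sketch in plain Python (no Scrapy involved; the function names are made up for illustration). It shows what happens when a function assigns to a module-level name without declaring it global, as the y=y+1 line in the question does:

```python
y = 0  # module-level counter, as in the question


def bump_without_global():
    # Assigning to y makes it a local name in this function, so reading it
    # on the right-hand side fails before any local value has been bound.
    y = y + 1
    return y


def bump_with_global():
    global y  # explicitly rebind the module-level name
    y = y + 1
    return y


try:
    bump_without_global()
    error_name = None
except UnboundLocalError as exc:
    error_name = type(exc).__name__

print(error_name)
print(bump_with_global())
```

Declaring the name global makes the snippet run, but the counter is then shared by every callback of the spider, which is why carrying it with the request is usually the cleaner choice.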

Instead of relying on y to count parse calls, carry the counter in the request's meta dict, which is available inside the callback as response.meta:

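def parse(self, response):
    # read the index carried by the request that produced this response
    y = response.meta.get('index', 1)
    y += 1
    url = 'http://example.com/?p={}'.format(y)
    # pass the updated index on to the next callback
    yield scrapy.Request(url, self.parse, meta={'index': y})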

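As for the pagination itself, the CSS selector for the next-page link cannot give you a URL, because the selected <a> node has no href attribute. The page number is stored in its data-page attribute instead, so you can substitute it into the current URL: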

def parse(self, response):
    next_page = response.css("a.next::attr(data-page)").extract_first()
    if next_page:  # extract_first() returns None when no a.next link is present
        url = re.sub(r'page=\d+', 'page=' + next_page, response.url)
        yield scrapy.Request(url, self.parse)

Putting it together, here is the revised spider:

import scrapy
import re

class InviaSpider(scrapy.Spider):
    name = 'invia'
    start_urls = ['https://dovolena.invia.cz/?d_start_from=13.01.2017&sort=nl_sell&page=1']

    def parse(self, response):
        names = response.css('span.name::text').extract()
        for name in names:
            yield {'name': name}

        # navigate to the next page; extract_first() returns None when there is no a.next link
        next_page = response.css("a.next::attr(data-page)").extract_first()
        if next_page:
            url = re.sub(r'page=\d+', 'page=' + next_page, response.url)
            yield scrapy.Request(url, self.parse)
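
The URL rewriting step can be tried on its own, without running the spider. A minimal sketch, assuming a hypothetical helper named next_page_url (it is not part of Scrapy) and reusing the start URL from the question:

```python
import re


def next_page_url(current_url, next_page):
    """Return current_url with its page=N value replaced.

    next_page is the string read from the data-page attribute, or None
    when no next link was found; in that case None is returned.
    """
    if next_page is None:
        return None
    # Raw string so that \d is passed to the regex engine unchanged.
    return re.sub(r'page=\d+', 'page=' + next_page, current_url)


url = 'https://dovolena.invia.cz/?d_start_from=13.01.2017&sort=nl_sell&page=1'
print(next_page_url(url, '2'))
print(next_page_url(url, None))
```

Returning None for a missing next link keeps the caller's check simple: only yield a new request when the helper gives back a URL.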
