Extracting data from dynamic URLs using Scrapy

I've run into a problem with a Scrapy spider written in Python. It scrapes every page except those whose URLs carry query parameters (specifically, URLs containing & symbols), such as this one:

http://www.amazon.co.uk/gp/product/B003ZDXHSG/ref=s9_simh_gw_p23_d0_i3?pf_rd_m=A3P5ROKL5A1OLE&pf_rd_s=center-2&pf_rd_r=1NWN2VXCA63R7TDYC3KQ&pf_rd_t=101&pf_rd_p=467128533&pf_rd_i=468294

The error log shows this message:

[scrapy] ERROR: xxx matching query does not exist.

I am using a CrawlSpider with the following SgmlLinkExtractor rule:

rules = (
    Rule(
        SgmlLinkExtractor(allow='[a-zA-Z0-9.:\/=_?&-]+$'),
        'parse',
        follow=True,
    ),
)

Your assistance in resolving this matter would be greatly appreciated. Thank you in advance for your time and help.

Answer №1

Following up on my question: it turned out that all of my code was correct. The problem was the way I was calling scrapy from the command line. The command broke at the & symbol because I had wrapped the URL in single quotes. Using double quotes when calling the spider solved it.
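
One way to sidestep shell quoting altogether is to start the crawl from Python with an argument list, so that no shell ever interprets the & characters. The sketch below is only an illustration under assumptions: the spider name myspider and the start_url spider argument are hypothetical and do not come from the question, and the helper build_crawl_command is made up for this example.

```python
import subprocess


def build_crawl_command(spider_name, url):
    """Build the argument list for `scrapy crawl`, keeping the URL in one piece."""
    # `-a name=value` passes a spider argument; `start_url` is a hypothetical name.
    return ["scrapy", "crawl", spider_name, "-a", f"start_url={url}"]


url = (
    "http://www.amazon.co.uk/gp/product/B003ZDXHSG/ref=s9_simh_gw_p23_d0_i3"
    "?pf_rd_m=A3P5ROKL5A1OLE&pf_rd_s=center-2&pf_rd_r=1NWN2VXCA63R7TDYC3KQ"
    "&pf_rd_t=101&pf_rd_p=467128533&pf_rd_i=468294"
)

# "myspider" is a placeholder spider name.
command = build_crawl_command("myspider", url)
print(command)

# Passing a list (and leaving shell=False, the default) means the URL reaches
# scrapy as a single argument. Uncomment to actually launch the crawl:
# subprocess.run(command, check=True)
```

When typing the command by hand instead, the fix described above still applies: wrap the URL in double quotes.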

Answer №2

The expression does match the URL when tested with re.search(). You may still want to write it as r'regexpression' so that Python treats it as a raw string. In this case it works with both raw and regular strings, but it is advisable to write regular expressions as raw strings in Python.

>>> import re
>>> url="http://www.amazon.co.uk/gp/product/B003ZDXHSG/ref=s9_simh_gw_p23_d0_i3?pf_rd_m=A3P5ROKL5A1OLE&pf_rd_s=center-2&pf_rd_r=1NWN2VXCA63R7TDYC3KQ&pf_rd_t=101&pf_rd_p=467128533&pf_rd_i=468294" 
>>> m = re.search(r'[a-zA-Z0-9.:\/=_?&-]+$', url) 
>>> m.group()
'http://www.amazon.co.uk/gp/product/B003ZDXHSG/ref=s9_simh_gw_p23_d0_i3?pf_rd_m=A3P5ROKL5A1OLE&pf_rd_s=center-2&pf_rd_r=1NWN2VXCA63R7TDYC3KQ&pf_rd_t=101&pf_rd_p=467128533&pf_rd_i=468294'

>>> m = re.search('[a-zA-Z0-9.:\/=_?&-]+$', url)
>>> m.group()
'http://www.amazon.co.uk/gp/product/B003ZDXHSG/ref=s9_simh_gw_p23_d0_i3?pf_rd_m=A3P5ROKL5A1OLE&pf_rd_s=center-2&pf_rd_r=1NWN2VXCA63R7TDYC3KQ&pf_rd_t=101&pf_rd_p=467128533&pf_rd_i=468294'
