Troubleshooting: HtmlResponse works in Scrapy shell but fails in a script

While working on a scraping project, I decided to use scraperAPI.com for IP rotation. When I tried to use their new POST request method, I got this error: 'HtmlResponse' object has no attribute 'dont_filter'. Here is my custom start_requests method:

def start_requests(self):
    S_API_KEY = {'key': 'eifgvaiejfvbailefvbaiefvbialefgilabfva5465461654685312165465134654311'}
    url = "XXXXXXXXXXXXXX.com"
    payload = {}
    headers = {
        'content-type': 'application/x-www-form-urlencoded; charset=UTF-8',
        'x-requested-with': 'XMLHttpRequest',
        'Access-Control-Allow-Origin': '*',
        'accept': 'application/json, text/javascript, */*; q=0.01',
        'referer': 'XXXXXXXXXXX.com'
    }
    client = ScraperAPIClient(S_API_KEY['key'])
    resp = client.post(url=url, body=payload, headers=headers)
    yield HtmlResponse(resp.url, body=resp.text, encoding='utf-8')

Interestingly, when I run this script step by step in scrapy shell, it functions properly and returns the expected data. Any help or insights regarding this issue would be incredibly helpful as I have been troubleshooting for 4 hours now.

Notes:

  • The client.post method returns a response object
  • The API key mentioned above is not real, just a placeholder
  • The client.post method does not include a body option

Answer №1

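The error occurs because start_requests yields the wrong type: a Response instead of a Request. Scrapy treats whatever start_requests yields as a Request and reads request attributes such as dont_filter when scheduling it, which an HtmlResponse does not have.
According to the documentation for start_requests: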

This method should return an iterable containing the initial Requests to be crawled by this spider.

An easy solution would be to utilize a scrapy request (such as FormRequest) to access the API url instead of relying on ScraperAPIClient.post().
It may be possible to use ScraperAPIClient.scrapyGet() to create the correct URL, although I have not verified this.

If you prefer to stick with the official API library, a slightly more complex option is to create your own downloader middleware.
