Troubleshooting: HtmlResponse works in the Scrapy shell but fails in a script

Question

I am using scraperAPI.com for IP rotation in a scraping project. When I tried to use their new POST request method, I got the error 'HtmlResponse' object has no attribute 'dont_filter'. Here is my custom start_requests method:

def start_requests(self):
    S_API_KEY = {'key': 'eifgvaiejfvbailefvbaiefvbialefgilabfva5465461654685312165465134654311'}
    url = "XXXXXXXXXXXXXX.com"
    payload = {}
    headers = {
        'content-type': 'application/x-www-form-urlencoded; charset=UTF-8',
        'x-requested-with': 'XMLHttpRequest',
        'Access-Control-Allow-Origin': '*',
        'accept': 'application/json, text/javascript, */*; q=0.01',
        'referer': 'XXXXXXXXXXX.com'
    }
    client = ScraperAPIClient(S_API_KEY['key'])
    resp = client.post(url=url, body=payload, headers=headers)
    yield HtmlResponse(resp.url, body=resp.text, encoding='utf-8')

Interestingly, when I run these same steps one at a time in the Scrapy shell, they work and return the expected data. Any help or insight would be appreciated, as I have been troubleshooting this for 4 hours.

Notes:

The client.post method returns a response object
The API key mentioned above is not real, just a placeholder
The client.post method does not include a body option

python scrapy

Answer 1

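The error occurs because start_requests yields the wrong type: a Response instead of a Request. Scrapy passes everything yielded by start_requests to its scheduler, which reads the request's dont_filter attribute; an HtmlResponse has no such attribute, hence the error. In the Scrapy shell you only build the HtmlResponse and never hand it to the scheduler, which is why the same steps work there.
According to the documentation for start_requests: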

This method should return an iterable containing the initial Requests to be crawled by this spider.

An easy solution is to send a Scrapy request (such as FormRequest) to the API URL instead of calling ScraperAPIClient.post().
ScraperAPIClient.scrapyGet() may be able to build the correct URL, although I have not verified this.
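
As a rough illustration of the FormRequest approach, here is a minimal sketch. The endpoint http://api.scraperapi.com/ and its api_key and url query parameters are assumptions based on how such proxy APIs are commonly addressed, so check them against the provider's documentation. The helper name build_api_url, the target URL and the form field are hypothetical placeholders.

```python
from urllib.parse import urlencode

# Assumed proxy-API endpoint; verify it in the provider's documentation.
API_ENDPOINT = "http://api.scraperapi.com/"


def build_api_url(api_key, target_url):
    """Wrap target_url in an API URL (the parameter names are assumptions)."""
    return API_ENDPOINT + "?" + urlencode({"api_key": api_key, "url": target_url})


def start_requests(self):
    # Meant to be pasted into the spider class, like the method in the question.
    # Local import so build_api_url stays usable without Scrapy installed.
    from scrapy import FormRequest

    # Yield a Request subclass, never a Response, so the scheduler accepts it.
    yield FormRequest(
        url=build_api_url("YOUR_API_KEY", "https://example.com/endpoint"),
        method="POST",  # explicit, so an empty formdata cannot fall back to GET
        formdata={"page": "1"},  # hypothetical POST field
        headers={
            "x-requested-with": "XMLHttpRequest",
            "accept": "application/json, text/javascript, */*; q=0.01",
        },
        callback=self.parse,
    )
```

Scrapy then downloads the page itself and passes an ordinary response to self.parse, so there is no need to construct an HtmlResponse by hand.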

If you prefer to stick with the official API library, a slightly more complex option is to write your own downloader middleware.
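
A downloader middleware along these lines might look like the sketch below. It is untested against the real service and rests on several assumptions: the import path scraper_api for the SDK, the setting name SCRAPERAPI_KEY and the meta key scraperapi_post are all hypothetical or unverified, and client.post is called with the same arguments as in the question.

```python
class ScraperApiPostMiddleware:
    """Fetch flagged requests through ScraperAPIClient instead of Scrapy's downloader."""

    META_KEY = "scraperapi_post"  # hypothetical meta key, not part of Scrapy

    def __init__(self, client):
        self.client = client

    @classmethod
    def from_crawler(cls, crawler):
        # Assumed import path of the official SDK; SCRAPERAPI_KEY is a
        # hypothetical setting name you would define in settings.py.
        from scraper_api import ScraperAPIClient

        return cls(ScraperAPIClient(crawler.settings.get("SCRAPERAPI_KEY")))

    def process_request(self, request, spider):
        options = request.meta.get(self.META_KEY)
        if options is None:
            return None  # not flagged: let Scrapy download the request as usual

        from scrapy.http import HtmlResponse  # imported only when needed

        # Same call as in the question: url, body and headers keyword arguments.
        resp = self.client.post(
            url=request.url,
            body=options.get("payload", {}),
            headers=options.get("headers", {}),
        )
        # Returning a Response from process_request skips the normal download
        # and hands this response to the request's callback.
        return HtmlResponse(
            request.url, body=resp.text, encoding="utf-8", request=request
        )
```

With the class enabled in DOWNLOADER_MIDDLEWARES, start_requests would yield an ordinary scrapy.Request whose meta contains {"scraperapi_post": {"payload": {}, "headers": headers}}. If client.post makes a blocking HTTP call, each such request stalls Scrapy's event loop while it runs, which is the main trade-off against the FormRequest approach.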
