Questions tagged [scrapy]

Scrapy is a fast, open-source, high-level web crawling and data-extraction framework written in Python. It lets users crawl websites and efficiently extract structured data from their pages, with applications ranging from data mining to monitoring and automated testing.

Is it acceptable to utilize the Django-Scrapy app for live production environments?

I'm currently handling multiple tasks using scrapy on our production server. My manager has requested the ability to add or remove URLs for scraping and is interested in having a web interface for this purpose. I am considering developing a web appl ...

What is the accurate Scrapy XPath for identifying <p> elements that are mistakenly nested within <h> tags?

I am setting up my first Scrapy spider and running into some challenges using XPath to extract specific elements. My focus is on a site (a Chinese website akin to Box Office Mojo). Extracting the Chine ...

Scraping data from tags containing <script nonce> using scrapy

Does anyone have a solution for extracting latitude and longitude values from the website "https://pfchangsmexico.com.mx/ubicaciones/" for all restaurants where this data can be found? The Xpath for these values is /html/body/script[1]. I need help in writ ...

Pressing the Google Scholar Button using Scrapy

I have been experimenting with web scraping data from Google Scholar using the scrapy library, and here is my current code: import scrapy class TryscraperSpider(scrapy.Spider): name = 'tryscraper' start_urls = ['https://scholar.google.com/citations ...

Extracting information from an API

Hey there, good morning! I'm currently working on gathering car data from this website: My process involves sending a request through the search bar on the homepage for a specific location and date. This generates a page like this: From there, I use the ...

Python and Scrapy encounter issues with locating website links

I've been searching for the URLs of all events listed on this page: https://www.eventshigh.com/delhi/food?src=exp However, I can only locate the URL in JSON format: { "@context":"http://schema.org", "@type":"Event", "name":"DANDIYA NIGHT 20 ...
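The JSON fragment in this question is schema.org JSON-LD embedded in a script tag, so one common approach is to parse it with the standard json module rather than fight the rendered HTML. A minimal sketch, assuming the blocks have been pulled out with something like response.xpath('//script[@type="application/ld+json"]/text()').getall() (the real page layout may differ):

```python
import json

def extract_event_urls(ld_blocks):
    """Pull event URLs out of schema.org JSON-LD blocks.

    ld_blocks: a list of raw JSON strings, one per <script> tag.
    Handles both a single object and a list of objects per block.
    """
    urls = []
    for raw in ld_blocks:
        data = json.loads(raw)
        items = data if isinstance(data, list) else [data]
        for item in items:
            # Keep only Event entries that carry a url field.
            if item.get("@type") == "Event" and "url" in item:
                urls.append(item["url"])
    return urls
```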

Is it possible to utilize Scrapy for automating form submissions and performing all the functionalities of a web browser?

Currently, I am faced with a task that requires submitting a form to a website without the availability of an API. My current approach using WebDriver has been problematic due to the asynchronous nature between my code and the browser. I am in search of a ...

Retrieving URLs for CrawlSpider using Scrapy

I have developed a script to systematically catalog all URLs on a website. Currently, I am utilizing CrawlSpider with a rules handler to manage the scraped URLs. The "filter_links" function checks an existing table for each URL and writes a new entry if i ...

How can a custom format structure be established for the JSON export feature in Scrapy? If it is possible, what is the process for doing so?

As a beginner in the world of Python and Scrapy, I am struggling with the complexities of Scrapy documentation. Despite successfully creating a spider for my school project to scrape data, I am facing issues with the formatting in JSON export. Here is a sn ...
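One way to get full control over the exported JSON shape is to skip the built-in feed export and write the file from an item pipeline. A minimal sketch (the output layout here, a dict with "count" and "items" keys, is just an example, not anything Scrapy mandates):

```python
import json

class CustomJsonPipeline:
    """Collects items during the crawl and dumps them in a custom
    JSON structure when the spider closes."""

    def __init__(self, path="output.json"):
        self.path = path
        self.items = []

    def open_spider(self, spider):
        self.items = []

    def process_item(self, item, spider):
        # dict() works for both plain dicts and Scrapy Item objects.
        self.items.append(dict(item))
        return item

    def close_spider(self, spider):
        with open(self.path, "w", encoding="utf-8") as f:
            json.dump({"count": len(self.items), "items": self.items}, f, indent=2)
```

The pipeline would then be enabled via ITEM_PIPELINES in settings.py; an alternative is subclassing JsonItemExporter if you only need small tweaks to the default format.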

Scrapy is adept at gathering visible content that may appear intermittently

Currently, I am extracting information from zappos.com, specifically targeting the section on the product details page that showcases what other customers who viewed the same item have also looked at. One of the item listings I am focusing on is found her ...

Uncovering complete hyperlink text using Scrapy

When using scrapy to extract data from a webpage, I encountered the following issue: <li> <a href="NEW-IMAGE?type=GENE&amp;object=EG10567"> <b> man </b> X - <i> Escherichia coli </i> </a> <br> </li> ...

Should I employ Scrapy Selenium to scrape the initial request page?

I have successfully implemented a solution using scrapy_selenium to scrape a website that uses JavaScript for loading content. In the code snippet provided below, you can see that I am using SeleniumRequest when yielding detailPage with parseDetails. Howe ...

Scrapy: Verifying the data in a CSV document prior to incorporation

My objective is to verify the title of an item listed in a csv file, and if it does not already exist, append it to the file. I have extensively researched various solutions for handling duplicate values but most of them pertain to DuplicatesPipeline which ...
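Since the check is against a CSV file rather than a Scrapy DuplicatesPipeline, a small stdlib helper can read the existing titles and append only new rows. A sketch (the key column and file layout are assumptions based on the question):

```python
import csv
import os

def append_if_new(path, row, key_index=0):
    """Append `row` to the CSV at `path` only if no existing row has the
    same value in column `key_index` (the title column, in this setup).
    Returns True if the row was written, False if it was a duplicate."""
    existing = set()
    if os.path.exists(path):
        with open(path, newline="", encoding="utf-8") as f:
            for r in csv.reader(f):
                if r:
                    existing.add(r[key_index])
    if row[key_index] in existing:
        return False
    with open(path, "a", newline="", encoding="utf-8") as f:
        csv.writer(f).writerow(row)
    return True
```

For large files, reading the keys once in open_spider and keeping the set in memory avoids re-reading the CSV on every item.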

What is the process for providing arguments to a class at the time of object instantiation?

I am in the process of developing a graphical user interface that features two buttons, "Choose Input File" and "Execute". Upon clicking on "Open Input File", users have the ability to select a file from their computer containing URLs in a single column. S ...

Automatically include fields in scrapy when they are detected

Currently facing a challenge while attempting to write a spider for web scraping. The issue arises when trying to extract the href attribute from all the <li> tags within the <ul> tag and storing them in incrementally named variables like Field ...

Extracting data from dynamic URLs using scrapy

I've encountered a spider issue while running Python Scrapy. The spider is able to scrape all pages except those with parameters (specifically, pages containing & symbols), like this one: http://www.amazon.co.uk/gp/product/B003ZDXHSG/ref=s9_simh_g ...

TikTok pages are failing to load with Selenium

I'm currently working on a TikTok crawler project that uses both selenium and scrapy start_urls = ['https://www.tiktok.com/trending'] .... def parse(self, response): options = webdriver.ChromeOptions() from fake_useragent import UserAgent ua = ...

Scrapy 1.0.3: extracting data from <value> tags using XPath

I've written some scrapy code to scrape embedded YouTube videos from certain pages. For example: item['video'] = response.xpath('//div[@class="imobile-body"]/iframe').extract() However, when I output the scraped data to an XML f ...

Using Scrapy and Selenium to scrape a website that needs authentication

I've encountered a challenge while attempting to extract data from a webpage that heavily relies on AJAX calls and javascript for its rendering. My approach involves using scrapy in combination with selenium to achieve this task. Here's the metho ...

Spider login page

My attempt to automate a log in form using Scrapy's formrequest method is running into some issues. The website I am working with does not have a simple HTML form "fieldset" containing separate "divs" for the username and password fields. I need to identif ...

Python can be used to extract data from Highcharts through scraping techniques

I have been attempting to extract information from the chart located at . I made an effort to gather the data by utilizing the corresponding XPaths for the data in the sections, but unfortunately, it was not successful. I experimented with using Scrapy: d ...

Navigating through js/ajax(href="#") based pagination using Scrapy - A helpful guide

I am looking to loop through all the category URLs and extract the content from each page. Although I have attempted to retrieve only the first category URL using urls = [response.xpath('//ul[@class="flexboxesmain categorieslist"]/li/a/@href').ex ...

Is it possible to save and utilize cookies in a program without relying on the selenium driver.add_cookie

In the midst of a project, I find myself faced with the task of extracting URLs for all products on a given page and utilizing Scrapy to sift through each URL for product data. The challenge arises when a pop-up emerges 3-5 seconds after loading every URL, ...

At what point should I end the session with the webdriver?

My web scraping process involves using both Scrapy and Selenium. When I run my spider, I notice the instance of my webdriver in htop. I am unsure about when exactly I should close the webdriver in my code. Would it be best to do so after processing each l ...

Tips for creating a Scrapy spider to extract .CSS information from a table

Currently, I am utilizing Python.org version 2.7 64-bit on Windows Vista 64-bit for building a recursive web scraper using Scrapy. Below is the snippet of code that successfully extracts data from a table on a specific page: rows = sel.xpath('//table ...

Scrapy: sending an AJAX request to receive the dynamically generated response

Is there a way to efficiently extract data from websites like this? To display all available offers, the "Show More Results" button at the bottom of the page needs to be clicked multiple times until all offers are shown. Each click triggers an AJAX reques ...

Getting the h1 text from a div class using scrapy or selenium

I am working with Python using Scrapy and Selenium. My goal is to extract the text from the h1 tag that is within a div class. For instance: <div class = "example"> <h1> This is an example </h1> </div> Here is my implementat ...

What is the process for moving to the following page with crawlspider?

Currently, I am using Scrapy crawlspider to scrape data from . Can someone advise me on how to configure and set up the LinkExtractor to successfully scrape all pages? class SephoraSpider(CrawlSpider): name = "sephora" # custom_settings = {"IMAGES_STORE" ...

Troubleshooting issues with loading scripts in a scrapy (python) web scraper for a react/typescript application

I've been facing challenges with web scraping a specific webpage (beachvolleyball.nrw). In the past couple of days, I've experimented with various libraries but have struggled to get the script-tags to load properly. Although using the developer tools all ...

Scrapy fails to retrieve closing prices from Yahoo! Finance

When attempting to extract closing prices and percentage changes for three tickers from Yahoo! Finance using Scrapy, I am encountering an issue where no data is being retrieved. I have verified that my XPaths are correct and successfully navigate to the de ...

Utilizing Scrapy to Fetch Data from a MySQL Database and Web Scraping

I currently have a spider and pipeline set up to extract data from the web and insert it into MySQL. This code is up and running smoothly. class AmazonAllDepartmentSpider(scrapy.Spider): name = "amazon" allowed_domains = ["amazon.com"] start_ ...

Navigating infinite scroll pages with scrapy and selenium: a comprehensive guide

Struggling with using scrapy + selenium to extract data from a webpage that loads content dynamically as we scroll down? Take a look at the code snippet below where I encounter an issue with getting the page source and end up stuck in a loop. import scrap ...

The Scrapy CSS selector is not fetching any prices from the list

Having trouble with the price CSS-selector while scraping an interactive website using Scrapy. Check out the HTML screenshot I captured: https://i.stack.imgur.com/oxULz.png Here are a few selectors that I've already experimented with: price = respon ...

Is there a way to extract the text that is displayed when I hover over a specific element?

When I hover over a product on the e-commerce webpage (), the color name is displayed. I was able to determine the new line in the HTML code that appears when hovering, but I'm unsure how to extract the text ('NAVY'). <div class="ui top left popu ...

Encountering insurmountable obstacles in accessing AliExpress

I'm looking to extract specific information from aliexpress using scrapy and selenium. However, I've encountered an issue where the HTML code appears differently when inspecting with Chrome compared to viewing the source. It seems that the content is load ...

Acquire session cookies using scrapy

I've been utilizing scrapy for web scraping on sites that require login, but I'm unsure about the specific fields needed to save and load in order to maintain the session. With selenium, I am saving the cookies like so: import pickle import sele ...
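Scrapy's cookie middleware tracks expiry, path, and domain on its own, so when replaying a saved selenium session the name/value pairs are usually the only fields that need to be carried over. A small helper sketch converting the list of cookie dicts that selenium's driver.get_cookies() returns (the shape being pickled in the question) into the mapping Scrapy's Request(cookies=...) expects:

```python
def selenium_cookies_to_scrapy(cookie_list):
    """Convert selenium-style cookie dicts
    (e.g. {"name": ..., "value": ..., "domain": ..., "path": ...})
    into the {name: value} mapping used by scrapy.Request(cookies=...)."""
    return {c["name"]: c["value"] for c in cookie_list}
```

Usage would then look like yield scrapy.Request(url, cookies=selenium_cookies_to_scrapy(saved_cookies)); after the first request, Scrapy keeps the session alive by itself unless COOKIES_ENABLED is off.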

Scrapy spider encountering issues during the crawling process

Having trouble crawling coupons on the Cuponation website. Whenever I try to run the crawler, it shows an error. Can someone please assist me? Thanks. import scrapy from scrapy.http import Request from scrapy.selector import HtmlXPathSelector from scrap ...

How many objects have been collected per initial URL?

With the help of scrapy, I have been able to crawl through 1000 different URLs and store the scraped items in a mongodb database. However, I am interested in knowing how many items have been found for each URL individually. Currently, from the scrapy stats ...
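The default Scrapy stats are crawl-wide, so a per-URL breakdown needs its own bookkeeping. One approach is a small pipeline that counts items keyed by the start URL; this sketch assumes each item carries its origin under a 'source_url' field, a convention you set in the spider (e.g. via cb_kwargs or response.meta), not something Scrapy provides for free:

```python
from collections import Counter

class PerUrlCounterPipeline:
    """Sketch: tally scraped items per originating start URL."""

    def open_spider(self, spider):
        self.counts = Counter()

    def process_item(self, item, spider):
        # 'source_url' is a field the spider is assumed to attach.
        self.counts[item.get("source_url", "unknown")] += 1
        return item

    def close_spider(self, spider):
        # Log the per-URL tallies once at the end of the crawl.
        for url, n in self.counts.items():
            spider.logger.info("%s -> %d items", url, n)
```

The same counts could instead be pushed into spider.crawler.stats with per-URL keys, or written alongside the items into MongoDB.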

"Troubleshooting: HtmlResponse functioning correctly in Scrapy Shell, yet encountering issues in script

While working on a scraping project, I decided to use scraperAPI.com for IP rotation. In my attempt to incorporate their new post request method, I encountered an error stating 'HtmlResponse' object has no attribute 'dont_filter'. Below is the custom start ...

Need help with redirecting after a form post using the scrapy_splash package?

I am working with Python, Scrapy, Splash, and the scrapy_splash package to extract data from a website. After successfully logging in using the SplashRequest object in scrapy_splash, I obtain a cookie that grants me access to a portal page. Everything has ...

Issues with Selector Scrapy Selenium functionality

My goal is to extract rank numbers from a website like Currently, I am utilizing python with selenium and scrapy. However, the following code does not yield any output. What could be the reason for this? sel=Selector(response) rank=sel.xpath('//span[@cla ...

What is the most effective Xpath for retrieving text from <td> elements when there is text present in both?

I have the following HTML that needs to be extracted: <div class="tab_product_details"> <table> <tbody> <tr>...</tr> <tr>...</tr> <tr>...</tr> <tr> ...

Using a custom filename for image downloads with Scrapy

In my current project with scrapy, I am using the ImagesPipeline to handle downloaded images. The Images are stored in the file system using a SHA1 hash of their URL as the file names. Is there a way to customize the file names for storage? If I want to ...
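Custom names are possible by overriding ImagesPipeline.file_path, which by default returns "full/<sha1-of-url>.jpg". The naming logic itself can be kept in a small helper; a sketch, where naming the file after an item field like item['name'] is an assumption about your item layout:

```python
import os
from urllib.parse import urlparse

def custom_image_path(url, item_name):
    """Naming logic for a file_path override: keep the original file
    extension from the image URL, but name the file after the item
    instead of the default SHA1 hash of the URL."""
    ext = os.path.splitext(urlparse(url).path)[1] or ".jpg"
    return f"full/{item_name}{ext}"
```

In the pipeline subclass this would be wired up roughly as: def file_path(self, request, response=None, info=None, *, item=None): return custom_image_path(request.url, item["name"]) — note that colliding names overwrite each other, so the item field must be unique.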

Utilizing Scrapy for Extracting Size Information obscured by Ajax Requests

I am currently trying to extract information about the sizes available for a specific product from this URL: However, I am encountering difficulty in locating the details hidden within the Select Size Dropdown on the webpage (e.g., 7 - In Stock, 7.5 - In ...

Tips for utilizing Scrapy and Selenium to extract information from a website employing javascript and php features

I've been trying to collect data from a website that provides details on accidents. I attempted using Scrapy and Selenium for this task, but unfortunately, it's not working as expected. As a beginner in this field, I'm struggling to grasp wh ...

When utilizing Beautiful Soup and Scrapy, I encountered an error indicating a reference issue prior to assignment

I encountered an issue while trying to scrape data which resulted in the error message UnboundLocalError: local variable 'd3' referenced before assignment . Can anyone provide a solution to resolve this error? I have searched extensively for a so ...

Tips for getting data for the following page with Scrapy

Recently, I created a script with two essential functions: parse: It pulls out URLs from the main URL and directs them to parse_city() for further extraction of details. Afterwards, parse() moves on to extract the next page and recursively calls itself t ...

Guidelines for transferring Selenium WebDriver response to Scrapy's parse method

I am currently facing two challenges that I need to overcome: First: effectively locating and interacting with an element on a website using a driver. Second: passing the link generated from this interaction to a parse method or LinkExtractor. ad. 1. My ...

Boost the efficiency of my code by implementing multithreading/multiprocessing to speed up the scraping process

Is there a way to optimize my scrapy code using multithreading or multiprocessing? I'm not well-versed in threading with Python and would appreciate any guidance on how to implement it. import scrapy import logging domain = 'https://www.spdigit ...

Unable to render page with scrapy and javascript using splash

I am currently trying to crawl this specific page. Following a guide on Stack Overflow to complete this task, I attempted to render the webpage but faced issues. How can I resolve this problem? This is the command I used: scrapy shell 'http://localhost: ...

Having trouble with XPath in Scrapy?

Looking to extract data from the XPath provided below: /html/body/div[2]/div[2]/div/div/div[4]/ul[2]/li/div Currently testing this with Scrapy Shell using the following commands: scrapy shell "https://www.rentler.com/listing/520583" and running: hxs.s ...

Scrapy spider malfunctioning when trying to crawl the homepage

I'm currently using a scrapy scrawler I wrote to collect data from from scrapy.contrib.spiders import CrawlSpider, Rule from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor from scrapy.selector import Selector from .. import items clas ...

Scrapy with integrated Selenium is experiencing difficulties

I currently have a scrapy Crawlspider set up to parse links and retrieve html content successfully. However, I encountered an issue when trying to scrape javascript pages, so I decided to use Selenium to access the 'hidden' content. The problem arises wh ...

"Executing a Scrapy spider to scrape data from various sources using multiple

Currently, I am facing an issue while trying to extract email IDs. I have a list of email IDs and I need to execute multiple search queries consecutively. However, when trying to use a list, it shows me an indentation error. Can anyone assist me in resolvi ...

Scraping from the web: How to selectively crawl and eliminate duplicate items

What is the most effective method for ensuring that Scrapy does not store duplicate items in a database when running periodically to retrieve new content? Would assigning items a hash help prevent this issue? Your advice on avoiding duplicates would be g ...
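Hashing identifying fields works well for this: compute a stable fingerprint per item and skip anything already seen. A pipeline sketch (the fields "title" and "url" are assumptions about the item; for periodic runs you would check the fingerprint against the database instead of an in-memory set, and in a real Scrapy project you would raise scrapy.exceptions.DropItem, stubbed here so the sketch runs standalone):

```python
import hashlib

class DropItem(Exception):
    """Stand-in for scrapy.exceptions.DropItem."""

class DedupPipeline:
    """Sketch: drop items whose identifying fields were already seen."""

    def open_spider(self, spider):
        self.seen = set()

    def fingerprint(self, item):
        # Stable hash over the fields that identify an item.
        key = "|".join(str(item.get(f, "")) for f in ("title", "url"))
        return hashlib.sha1(key.encode("utf-8")).hexdigest()

    def process_item(self, item, spider):
        fp = self.fingerprint(item)
        if fp in self.seen:
            raise DropItem(f"duplicate item {fp}")
        self.seen.add(fp)
        return item
```

Storing the fingerprint column with a UNIQUE constraint in the database gives the same guarantee across runs without loading all hashes into memory.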

Error message: IndexError: list index out of range while attempting to trigger a click event using selenium

There are a total of 41 category checkboxes on the page, with only 12 initially visible while the rest are hidden. To reveal the hidden checkboxes, one must click on "show more." This simple code accomplishes that: [step 1] Loop through all checkboxes >> ...

When attempting to extract the image URL using XPath, the results obtained are disorganized and cluttered

Upon attempting to retrieve the URL of an image using XPath @src from the following link: https://www.amazon.com/dp/B07FK8SQDQ/ref=twister_B00WS2T4ZA?_encoding=UTF8&th=1 I expected it to return an HTML element URL. However, what I received was a jumb ...

Unable to load Environment Variables in ScrapingHub Configuration

Recently, I encountered a puzzling situation while deploying spiders on ScrapingHub. The spider itself functions properly, but I am facing challenges in changing the output feed based on whether the spider is running locally or on ScrapingHub. My goal is t ...

Each start URL in Scrapy generates its own unique CSV file as output

I need help figuring out how to output multiple CSV files for each start_url in my Scrapy pipeline. Currently, I am only able to generate one file with information from all the URLs. pipeline.py class CSVPipeline(object): def __init__(self): self.fi ...

What is the best way to extract data from a website that shuffles its media files every time it is refreshed?

Trying to extract media files from a specific website with notes has been quite the challenge. Despite easily downloading the files, they are not in the correct order. It seems that the website makes an Ajax call after scrolling to page 30 and then loads ...

Is it possible to integrate Scrapy with the Chrome Browser?

Scraping a web page that utilizes javascript-rendered AngularJS can be tricky. The developers of the site have implemented a feature to detect Safari/Firefox in private browsing mode and prevent scraping. Interestingly, this warning does not appear when us ...

The Python code encountered a "stale element reference" error, indicating that the element is no longer attached to the DOM

My current project involves scraping Amazon products, specifically focusing on clicking through various categories. However, I am encountering an issue where the code only works for the first category in the loop and throws an error. Despite researching so ...

Tips for implementing an Item loader in my Scrapy spider script?

After working on a Scrapy spider to extract news articles and data from a website, I encountered an issue with excessive whitespace in one of the items. Seeking a solution, I came across recommendations for using an Item Loader in the Scrapy documentation ...

Tips for extracting the URL of a fresh webpage using Selenium and Scrapy

I'm currently working on a web-scraping project to extract data from a platform known as "Startup India," which facilitates connections with various startups. I have set up filters to select specific criteria and then click on each startup to access detail ...

Having trouble selecting content in an HTML document using Xpath and response.css with Scrapy?

Something is quite puzzling me and I've been pondering over it for almost a week now. Perhaps the solution is staring right at me and I just can't see it clearly... Any hints for alternative approaches would be greatly appreciated. I have no control ...

Learn the steps to automate clicking on the "next" button using Selenium or Scrapy in Python

While attempting to gather data from flipkart.com using scrapy, I successfully collected everything except for navigating to the next page. Initially, I attempted to use scrapy followed by selenium. Interestingly, a class contains two links - one for the p ...

The soft time limit for celery was not activated

I am facing an issue with a celery task where the soft limit is set at 10 and the hard limit at 32: from celery.exceptions import SoftTimeLimitExceeded from scrapy.crawler import CrawlerProcess from scrapy.utils.project import get_project_settings @app.ta ...

Leveraging the Content-Length Header in Scrapy Requests

This is the specific page I am attempting to crawl, and this is the corresponding AJAX request that fetches the necessary data. I have replicated the same AJAX request with identical headers and request payload. While the request does not error out, it re ...

Scraping Secrets: Unraveling the Art of Procuring User Links

I need assistance with extracting links from the group members' page on Meetup: response.css('.text--ellipsisOneLine::attr(href)').getall() Can you please help me understand why this code is not functioning correctly? This is the HTML structure I am work ...

The Selector object cannot be serialized into JSON format

Currently, I'm facing the challenge of scraping a dynamic website which requires the use of Selenium. The specific links that I'm interested in scraping only become visible when clicked on. These links are generated by jQuery, and there is no hr ...

Navigate one level up or down from the current tag that contains a specified value by utilizing Scrapy

To extract the price text from within the custom-control / label / font style, I must use the data-number attribute data-number="025.00286R". This unique identifier differentiates between control section divs based on the letter at the end. <d ...

Querying HTML wrapped in a JSON response using Scrapy: a step-by-step guide

I'm currently in the process of scraping a website that relies on dynamically loaded content through JavaScript. In my attempts to request the data source, I received a JSON response where a key 'results_html' holds all the HTML necessary for querying an ...

The Failure of Scrapy Pagination

Hello everyone, I am excited to share my first post! Currently, I am working on developing a Web Spider that is capable of following links on invia.cz and extracting all the hotel titles. import scrapy y=0 class invia(scrapy.Spider): name = 'Kreta' ...

Transferring cookie data between requests in CrawlSpider

My current project involves scraping a bridge website to gather data from recent tournaments. I have previously asked for help on this issue here. Thanks to assistance from @alecxe, the scraper now successfully logs in while rendering JavaScript with Phant ...

Python Scrapy | Techniques for transferring the response data to the main function within the spider

I have been searching extensively for a solution using various search engines, but I may not be entering the correct keywords. Although I am aware that I can use the shell to manipulate CSS and XPath selectors right away, I am curious to know if it is poss ...