Using Scrapy to Fetch URLs from a MySQL Database and Scrape Them

I currently have a spider and pipeline set up to extract data from the web and insert it into MySQL. This code is up and running smoothly.

class AmazonAllDepartmentSpider(scrapy.Spider):

    name = "amazon"
    allowed_domains = ["amazon.com"]
    start_urls = [
        "http://www.amazon.com/gp/site-directory/ref=nav_sad/187-3757581-3331414"
    ]
    def parse(self, response):
        for sel in response.xpath('//ul[@class="nav_cat_links"]/li'):
            item = AmazoncrawlerItem()
            # extract() returns a list such as [u'...']; pop() takes the string out of it
            item['title'] = sel.xpath('a/text()').extract().pop()
            item['link'] = sel.xpath('a/@href').extract().pop()
            item['desc'] = sel.xpath('text()').extract()
            yield item

and

import MySQLdb


class AmazoncrawlerPipeline(object):
    host = 'qwerty.com'
    user = 'qwerty'
    password = 'qwerty123'
    db = 'amazon_project'

    def __init__(self):
        self.connection = MySQLdb.connect(self.host, self.user, self.password, self.db)
        self.cursor = self.connection.cursor()

    def process_item(self, item, spider):    
        try:
            self.cursor.execute("""INSERT INTO amazon_project.ProductDepartment (ProductTitle,ProductDepartmentLilnk)
                            VALUES (%s,%s)""", 
                           (item['title'],'amazon.com' + str(item.get('link'))))


            self.connection.commit()

        except MySQLdb.Error as e:
            print(f"Error {e.args[0]}: {e.args[1]}")
        return item

Now I want to read the stored rows that contain URLs back from the database and run the spider again on those URLs to extract additional information. If anyone can explain how to achieve this, it would be greatly appreciated. Thank you!

Answer №1

This should be handled at the spider level.

To follow the links, yield a Request right after yielding the item instance:

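from scrapy.http import Request

def parse(self, response):
    for sel in response.xpath('//ul[@class="nav_cat_links"]/li'):
        item = AmazoncrawlerItem()
        item['title'] = sel.xpath('a/text()').extract().pop()
        item['link'] = sel.xpath('a/@href').extract().pop()
        item['desc'] = sel.xpath('text()').extract()
        yield item
        # Request needs an absolute URL, so resolve a relative href first;
        # parse_link is the callback you write to extract the additional data
        yield Request(response.urljoin(item['link']), callback=self.parse_link)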
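
Request needs an absolute URL, while href values taken from a page are often relative. As a minimal sketch of how such a value resolves against the page it came from, the standard library's urljoin can be used; inside a callback, Scrapy's response.urljoin(href) does comparable resolution against the response. The href values below are made up for illustration.

```python
from urllib.parse import urljoin

# URL of the page the links were extracted from (the spider's start URL).
page_url = "http://www.amazon.com/gp/site-directory/ref=nav_sad/187-3757581-3331414"

# Hypothetical href values, as they might come out of sel.xpath('a/@href').
hrefs = [
    "/gp/browse.html?node=1000",    # root-relative path
    "http://www.amazon.com/books",  # already absolute
]

# A relative href is resolved against the page URL; an absolute one is kept as-is.
absolute_links = [urljoin(page_url, href) for href in hrefs]

for link in absolute_links:
    print(link)
```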

Alternatively, you can restructure the spider to use Link Extractors, which collect the links to follow for you.


UPDATE (following discussions in comments):

If the links are already stored in the database, you will need a separate spider that reads the links from the database in start_requests() and yields a Request for each one:

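import MySQLdb
import scrapy
from scrapy.http import Request

class AmazonSpider(scrapy.Spider):
    name = "amazon"
    allowed_domains = ["amazon.com"]

    def start_requests(self):
        connection = MySQLdb.connect(<connection params here>)
        cursor = connection.cursor()

        cursor.execute("SELECT ProductDepartmentLilnk FROM amazon_project.ProductDepartment")
        # fetchall() returns one tuple per row, so each URL is the row's first element
        links = cursor.fetchall()
        cursor.close()
        connection.close()

        for (link,) in links:
            # the pipeline in the question stores 'amazon.com' + path without a scheme,
            # and Request rejects URLs that have no scheme
            yield Request("http://" + link, callback=self.parse)

    ...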
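
Two details are easy to miss when turning stored rows into requests: fetchall() returns one tuple per row rather than bare strings, and the pipeline in the question stores 'amazon.com' plus the path, so the stored values have no scheme. The following is a rough sketch of handling both, under stated assumptions: sqlite3 stands in for MySQLdb so the snippet runs without a server, the sample rows are invented, and ensure_scheme and iter_stored_links are hypothetical helper names, not part of Scrapy.

```python
import sqlite3


def ensure_scheme(url, default_scheme="http"):
    """Return url unchanged if it has a scheme, else prepend default_scheme."""
    url = url.strip()
    if "://" in url:
        return url
    return f"{default_scheme}://{url.lstrip('/')}"


def iter_stored_links(connection):
    """Yield absolute URLs from the ProductDepartment table, skipping empty rows."""
    cursor = connection.cursor()
    try:
        cursor.execute("SELECT ProductDepartmentLilnk FROM ProductDepartment")
        # Each row is a one-element tuple, so unpack it to get the string.
        for (link,) in cursor.fetchall():
            if link:
                yield ensure_scheme(link)
    finally:
        cursor.close()


# Stand-in database with invented rows (assumption: the real table is in MySQL).
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE ProductDepartment (ProductTitle TEXT, ProductDepartmentLilnk TEXT)"
)
connection.executemany(
    "INSERT INTO ProductDepartment VALUES (?, ?)",
    [
        ("Books", "amazon.com/books"),
        ("Music", "https://www.amazon.com/music"),
        ("Empty", None),
    ],
)

urls = list(iter_stored_links(connection))
for url in urls:
    print(url)
```

In a spider, start_requests() could loop over such a generator and yield Request(url, callback=self.parse) for each value.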
