Utilizing Scrapy to Fetch Data from a MySQL Database and Web Scraping

Question

I currently have a spider and pipeline set up to extract data from the web and insert it into MySQL. This code is up and running smoothly.

import scrapy

class AmazonAllDepartmentSpider(scrapy.Spider):

    name = "amazon"
    allowed_domains = ["amazon.com"]
    start_urls = [
        "http://www.amazon.com/gp/site-directory/ref=nav_sad/187-3757581-3331414"
    ]

    def parse(self, response):
        for sel in response.xpath('//ul[@class="nav_cat_links"]/li'):
            item = AmazoncrawlerItem()
            # extract() returns a list; pop() takes the single string out of it
            item['title'] = sel.xpath('a/text()').extract().pop()
            item['link'] = sel.xpath('a/@href').extract().pop()
            item['desc'] = sel.xpath('text()').extract()
            yield item

and

import MySQLdb

class AmazoncrawlerPipeline(object):
    host = 'qwerty.com'
    user = 'qwerty'
    password = 'qwerty123'
    db = 'amazon_project'

    def __init__(self):
        self.connection = MySQLdb.connect(self.host, self.user, self.password, self.db)
        self.cursor = self.connection.cursor()

    def process_item(self, item, spider):
        try:
            self.cursor.execute("""INSERT INTO amazon_project.ProductDepartment (ProductTitle,ProductDepartmentLilnk)
                            VALUES (%s,%s)""",
                           (item['title'],'amazon.com' + str(item.get('link'))))
            self.connection.commit()
        except MySQLdb.Error as e:
            # e.args holds the MySQL error code and message
            print("Error %d: %s" % (e.args[0], e.args[1]))
        return item

Now I want to read the stored URLs back from the database and run a spider on them again to extract additional information. How can I achieve this? Any help would be greatly appreciated. Thank you!

python mysql python-2.7 web-scraping scrapy

Answer 1

This should be handled at the spider level.

To follow the links, yield a Request right after yielding the item instance:

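from scrapy.http import Request

def parse(self, response):
    for sel in response.xpath('//ul[@class="nav_cat_links"]/li'):
        item = AmazoncrawlerItem()
        item['title'] = sel.xpath('a/text()').extract().pop()
        item['link'] = sel.xpath('a/@href').extract().pop()
        item['desc'] = sel.xpath('text()').extract()
        yield item
        # Request needs an absolute URL, so resolve the href against the page URL.
        # parse_link is a second callback that you define on the spider.
        yield Request(response.urljoin(item['link']), callback=self.parse_link)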
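
A Request needs an absolute URL, while href values pulled from a page are often relative. As a minimal sketch of that resolution step outside Scrapy, the standard library's urljoin can be used (Python 3 import shown). The page URL and hrefs below are made-up examples, not values taken from the site:

```python
from urllib.parse import urljoin

# Hypothetical page URL and href values, for illustration only.
page_url = "http://www.amazon.com/gp/site-directory/ref=nav_sad"
hrefs = [
    "/gp/browse.html?node=123",      # relative to the site root
    "http://www.amazon.com/books",   # already absolute
]

# urljoin resolves a relative href against the page it came from
# and leaves an already absolute URL unchanged.
absolute_urls = [urljoin(page_url, href) for href in hrefs]

for url in absolute_urls:
    print(url)
```

Inside a spider callback, response.urljoin(href) should give the same result, using response.url as the base.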

Alternatively, you can switch to a CrawlSpider with Link Extractors, which find and follow links for you.

UPDATE (following discussions in comments):

If the links are already stored in the database, you need to start a separate spider, read the links from the database in start_requests(), and yield a request for each one:

import MySQLdb
import scrapy
from scrapy.http import Request

class AmazonSpider(scrapy.Spider):
    name = "amazon"
    allowed_domains = ["amazon.com"]

    def start_requests(self):
        connection = MySQLdb.connect(<connection params here>)
        cursor = connection.cursor()

        cursor.execute("SELECT ProductDepartmentLilnk FROM amazon_project.ProductDepartment")
        links = cursor.fetchall()

        cursor.close()
        connection.close()

        # fetchall() returns one tuple per row, so unpack the single column.
        for (link,) in links:
            # Request needs an absolute URL with a scheme, e.g. "http://...".
            yield Request(link, callback=self.parse)

    ...
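
One thing to watch: the pipeline in the question stores links as 'amazon.com' + link, with no scheme, and Scrapy rejects request URLs that lack one. Below is a minimal sketch of turning the fetched rows into usable URLs, assuming "http" is the right scheme to add. The helper name to_request_url and the sample rows are hypothetical; the rows only mimic the one-column tuples that cursor.fetchall() returns:

```python
from urllib.parse import urlsplit


def to_request_url(stored_link, default_scheme="http"):
    """Return stored_link with a scheme, adding default_scheme if it is missing."""
    link = stored_link.strip()
    if urlsplit(link).scheme in ("http", "https"):
        return link
    return "{}://{}".format(default_scheme, link.lstrip("/"))


# Hypothetical rows shaped like the result of cursor.fetchall().
rows = [
    ("amazon.com/gp/browse.html?node=1",),
    ("http://www.amazon.com/books",),
]

urls = [to_request_url(link) for (link,) in rows]

for url in urls:
    print(url)
```

In start_requests() you would then yield Request(to_request_url(link), callback=self.parse) for each row. Storing full URLs in the pipeline to begin with would make this step unnecessary.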
