Python Web Scraping: Issues with Duplication and Displaying Outputs

My scraper is not producing the output I expect from its loops, and the same problem affects what gets inserted into my database. I have tried to troubleshoot it but cannot pinpoint the cause. What I want is for each scraped listing to be printed and then inserted into a table in my database. Instead, a single incorrect entry is repeated many times.

Current Output:

Ford C-MAX 2019 1.1 Petrol 0
Ford C-MAX 2019 1.1 Petrol 0
Ford C-MAX 2019 1.1 Petrol 0
...

Desired Output Based on Website Listings (example only as it varies):

Ford C-MAX 2019 1.1 Petrol 15950
Ford C-MAX 2014 1.6 Diesel 12000
Ford C-MAX 2011 1.6 Diesel 9000
...

Code:

from __future__ import print_function
import requests
import re
import locale
import time
from time import sleep
from random import randint
from currency_converter import CurrencyConverter
c = CurrencyConverter()
from bs4 import BeautifulSoup
import pandas as pd
from datetime import date, datetime, timedelta
import mysql.connector
import numpy as np
import itertools

locale.setlocale( locale.LC_ALL, 'en_US.UTF-8' )

pages = np.arange(0, 210, 30)

entered = datetime.now()
make = "Ford"
model = "C-MAX"


def insertvariablesintotable(make, model, year, liter, fuel, price, entered):
    try:
        cnx = mysql.connector.connect(user='root', password='', database='FYP', host='127.0.0.2', port='8000')
        cursor = cnx.cursor()

        cursor.execute('CREATE TABLE IF NOT EXISTS ford_cmax ( make VARCHAR(15), model VARCHAR(20), '
                       'year INT(4), liter VARCHAR(3), fuel VARCHAR(6), price INT(6), entered TIMESTAMP) ')

        insert_query = """INSERT INTO ford_cmax (make, model, year, liter, fuel, price, entered) VALUES (%s,%s,%s,%s,%s,%s,%s)"""
        record = (make, model, year, liter, fuel, price, entered)

        cursor.execute(insert_query, record)

        cnx.commit()

    finally:
        if (cnx.is_connected()):
            cursor.close()
            cnx.close()

for response in pages:

    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
    response = requests.get("https://www.donedeal.ie/cars/Ford/C-MAX?start=" + str(response), headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')

    cnx = mysql.connector.connect(user='root', password='', database='FYP', host='127.0.0.2', port='8000')
    cursor = cnx.cursor()

    for details in soup.findAll('ul', attrs={'class': 'card__body-keyinfo'}):

        details = details.text
        #print(details)
        year = details[:4]
        liter = details[4:7]
        fuel = details[8:14] # excludes electric, which has 2 extra
        mileage = re.findall("[0-9]*,[0-9][0-9][0-9]..." , details)
        mileage = ''.join(mileage)
        mileage = mileage.replace(",", "")
        if "mi" in mileage:
            mileage = mileage.rstrip('mi')
            mileage = round(float(mileage) * 1.609)
        mileage = str(mileage)
        if "km" in mileage:
            mileage = mileage.rstrip('km')
        mileage = mileage.replace("123" or "1234" or "12345" or "123456", "0")

    for price in soup.findAll('p', attrs={'class': 'card__price'}):

        price = price.text
        price = price.replace("No Price", "0")
        price = price.replace("123" or "1234" or "12345" or "123456", "0")
        price = price.replace(",","")
        price = price.replace("€", "")
        if "p/m" in price:
            #price = price[:-3]
            price = price.rstrip('p/m')
            price = "0"
        if "£" in price:
            price = price.replace("£", "")
            price = c.convert(price, 'GBP', 'EUR')
            price = round(price)

    print(make, model, year, liter, fuel, price)

    #insertvariablesintotable(make, model, year, liter, fuel, price, entered) #same result as above

Answer №1

Looking at your code and the site you are scraping, the problem is in how the retrieved values are handled. The variables details and price (and the values derived from them, such as year, liter and fuel) are overwritten on every pass through the inner loops, while print is called only once per page, after both loops have finished. It therefore only ever sees the values from the last listing on each page.
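A minimal illustration of that effect, using made-up sample strings rather than real scraped data (the names listings, last_seen and collected are only for this sketch):

```python
# Hypothetical sample data standing in for the scraped listings.
listings = ["2019 Petrol", "2014 Diesel", "2011 Diesel"]

# Pattern from the question: the variable is overwritten on every pass
# and only read after the loop has finished.
for item in listings:
    last_seen = item
only_last = [last_seen]

# Collecting inside the loop keeps every value.
collected = []
for item in listings:
    collected.append(item)

print(only_last)   # ['2011 Diesel']
print(collected)   # ['2019 Petrol', '2014 Diesel', '2011 Diesel']
```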

One way to address this is to collect the values in lists while looping and print them once the scraping is done:

import re

import requests
from bs4 import BeautifulSoup

make = "Ford"
model = "C-MAX"
price_list = []    # one price per listing
details_list = []  # one (year, liter, fuel, mileage) tuple per listing

headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36"
}

for start in range(1, 60, 30):
    # Keep the page offset in its own variable so the response does not overwrite it
    response = requests.get(
        "https://www.donedeal.ie/cars/Ford/C-MAX?start=" + str(start),
        headers=headers,
    )
    soup = BeautifulSoup(response.text, "html.parser")
    count = 0
    for details in soup.findAll("ul", attrs={"class": "card__body-keyinfo"}):
        if count == 30:
            break 
        details = details.text
        year = details[:4]
        liter = details[4:7]
        fuel = details[8:14]  
        mileage = re.findall("[0-9]*,[0-9][0-9][0-9]...", details)
        mileage = "".join(mileage)
        mileage = mileage.replace(",", "")
        # Further processing of mileage data
        details_list.append((year, liter, fuel, mileage)) 
        count += 1 
        
    count = 0
    for price in soup.findAll("p", attrs={"class": "card__price"}):
        if count == 30:
            break 
        price = price.text
        # Further processing of price data
        price_list.append(price) 
        count += 1 

# zip pairs each details tuple with its price and stops at the shorter list
for (year, liter, fuel, mileage), price in zip(details_list, price_list):
    print(make, model, year, liter, fuel, price)
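The "# Further processing of price data" placeholder above could be filled in along the following lines. This is only a sketch: clean_price and its gbp_to_eur parameter are names invented here, and the rules (treating "No Price", p/m prices and placeholder values such as 123456 as 0) are taken from the question's code rather than from anything the site documents. Note that in Python the expression "123" or "1234" or "12345" or "123456" evaluates to just "123", so the replace call in the question only ever looks for that one substring and would also alter a genuine price that happens to contain it.

```python
PLACEHOLDER_PRICES = {"123", "1234", "12345", "123456"}


def clean_price(text, gbp_to_eur=None):
    """Return the price in whole euro, or 0 when no usable price is listed.

    gbp_to_eur is a hypothetical callable that converts a sterling amount
    to euro; it is only needed for prices marked with a pound sign.
    """
    text = text.strip()
    # "No Price" and monthly finance prices are treated as missing.
    if "No Price" in text or "p/m" in text:
        return 0
    cleaned = text.replace("€", "").replace("£", "").replace(",", "").strip()
    # Compare the whole string, so 12300 is not mistaken for a placeholder.
    if cleaned in PLACEHOLDER_PRICES:
        return 0
    try:
        amount = float(cleaned)
    except ValueError:
        return 0
    if "£" in text:
        if gbp_to_eur is None:
            raise ValueError("a gbp_to_eur converter is needed for: " + text)
        amount = gbp_to_eur(amount)
    return round(amount)


print(clean_price("€15,950"))    # 15950
print(clean_price("€250 p/m"))   # 0
```

With the CurrencyConverter instance c from the question, the converter could be passed as gbp_to_eur=lambda amount: c.convert(amount, 'GBP', 'EUR').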

Note that p/m prices were excluded so that the details and price lists stay the same length; adjust this if you need them. Also check that the extracted rows really are Ford C-MAX listings, because other models or manufacturers can appear at the end of the page and would otherwise be mixed in.
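For the database step in the question, the paired lists can be turned into rows and written over a single connection. The sketch below is an assumption about how that might look, not a tested setup: build_rows and insert_rows are hypothetical helpers, the ford_cmax table is assumed to exist already with the columns from the question, and the connection arguments are left to the caller. insert_rows is defined but not called here, since it needs a running MySQL server.

```python
from datetime import datetime


def build_rows(make, model, details_list, price_list, entered):
    """Pair each (year, liter, fuel, mileage) tuple with its price."""
    rows = []
    for (year, liter, fuel, _mileage), price in zip(details_list, price_list):
        # Mileage is dropped because the question's table has no column for it.
        rows.append((make, model, year, liter, fuel, price, entered))
    return rows


def insert_rows(rows, **connect_args):
    """Insert all rows using one connection and one executemany call."""
    import mysql.connector  # imported here so build_rows works without MySQL

    insert_query = (
        "INSERT INTO ford_cmax (make, model, year, liter, fuel, price, entered) "
        "VALUES (%s, %s, %s, %s, %s, %s, %s)"
    )
    cnx = mysql.connector.connect(**connect_args)
    try:
        cursor = cnx.cursor()
        cursor.executemany(insert_query, rows)
        cnx.commit()
        cursor.close()
    finally:
        cnx.close()


# Made-up sample values, only to show the shape of a row.
rows = build_rows(
    "Ford", "C-MAX",
    [("2019", "1.1", "Petrol", "20000")],
    ["15950"],
    datetime(2020, 1, 1),
)
print(rows)
```

Opening the connection once, rather than once per listing as insertvariablesintotable does, avoids reconnecting for every row. It also sidesteps the case where connect fails and the finally block then refers to a cnx that was never assigned.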
