What is the best method to retrieve information from two tables on a webpage that have identical classes?

I am facing a challenge in retrieving or selecting data from two different tables that share the same class.

My attempts to access this information using 'soup.find_all' have proven to be difficult due to the formatting of the data.

There are multiple tables with identical classes, and I only need to extract the values (without labels) from each table.

URL:

TABLE 1:

<div class="bh_collapsible-body" style="display: none;">
    <table border="0" cellpadding="2" cellspacing="2" class="prop-list">
        <tbody>
            <tr>
                <td class="item">
                    <table>
                        <tbody>
                            <tr>
                                <td class="label">Rim Material</td>
                                <td class="value">Alloy</td>
                            </tr>
                        </tbody>
                    </table>
                </td>
                <td class="item">
                    <table>
                        <tbody>
                            <tr>
                                <td class="label">Front Tyre Description</td>
                                <td class="value">215/55 R16</td>
                            </tr>
                        </tbody>
                    </table>
                </td>
            </tr>

            <tr>
                <td class="item">
                    <table>
                        <tbody>
                            <tr>
                                <td class="label">Front Rim Description</td>
                                <td class="value">16x7.0</td>
                            </tr>
                        </tbody>
                    </table>
                </td>
                <td class="item">
                    <table>
                        <tbody>
                            <tr>
                                <td class="label">Rear Tyre Description</td>
                                <td class="value">215/55 R16</td>
                            </tr>
                        </tbody>
                    </table>
                </td>
            </tr>

            <tr>
                <td class="item">
                    <table>
                        <tbody>
                            <tr>
                                <td class="label">Rear Rim Description</td>
                                <td class="value">16x7.0</td>
                            </tr>
                        </tbody>
                    </table>
                </td>
                <td></td>
            </tr>
        </tbody>
    </table>
</div>

TABLE 2:

<div class="bh_collapsible-body" style="display: none;">
    <table border="0" cellpadding="2" cellspacing="2" class="prop-list">
        <tbody>
            <tr>
                <td class="item">
                    <table>
                        <tbody>
                            <tr>
                                <td class="label">Steering</td>
                                <td class="value">Rack and Pinion</td>
                            </tr>
                        </tbody>
                    </table>
                </td>
                <td></td>
            </tr>
        </tbody>
    </table>
</div>

What I have attempted:

I tried to retrieve the contents of the first table using XPath, but it returned both values and labels together.

table1 = driver.find_element_by_xpath("//*[@id='features']/div/div[5]/div[2]/div[1]/div[1]/div/div[2]/table/tbody/tr[1]/td[1]/table/tbody/tr/td[2]")

I also attempted to split the data, but unfortunately, my efforts were unsuccessful. I provided the page URL in case you would like to investigate further.

Answer №1

If you're up for a bit of data exploration, one approach is the read_html function in pandas.

read_html extracts every HTML table from a page and returns them as a list of pandas DataFrames.

The following code snippet fetches all 82 table elements from the provided URL:

from io import StringIO

import pandas as pd
import requests

url = "https://www.redbook.com.au/cars/details/2019-honda-civic-50-years-edition-auto-my19/SPOT-ITM-524208/"

# Add a user-agent header to prevent a 403 Forbidden error
header = {
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
        "X-Requested-With": "XMLHttpRequest"
        }

resp = requests.get(url, headers=header)

# Wrap the markup in StringIO: recent pandas versions warn when given a raw HTML string
table_dataframes = pd.read_html(StringIO(resp.text))


for i, df in enumerate(table_dataframes):
    print(f"================Table {i}=================\n")
    print(df)

This script will display all 82 tables found on the webpage. However, it requires manual identification and manipulation of the desired table, which appears to be either table 71 or table 74 based on your request.

To automate this process, additional logic would need to be implemented.
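One way that additional logic might look is a small filter that keeps only the DataFrames containing a known label. This is a minimal sketch, not a tested solution for the real page: the helper name find_tables_with_label is hypothetical, the three frames below are stand-ins built by hand rather than output of read_html, and how read_html flattens the nested tables on the actual site is not verified here.

```python
import pandas as pd


def find_tables_with_label(dataframes, label):
    """Return (index, DataFrame) pairs in which some cell equals `label`."""
    matches = []
    for i, df in enumerate(dataframes):
        # Compare every cell as a string so mixed dtypes do not raise.
        cells = df.astype(str).to_numpy().ravel()
        if any(cell.strip() == label for cell in cells):
            matches.append((i, df))
    return matches


# Stand-in frames (assumption): in practice this would be the list
# returned by pd.read_html for the page.
table_dataframes = [
    pd.DataFrame([["Engine Type", "Piston"]]),
    pd.DataFrame([["Steering", "Rack and Pinion"]]),
    pd.DataFrame([["Rim Material", "Alloy"],
                  ["Front Tyre Description", "215/55 R16"]]),
]

for wanted in ("Steering", "Rim Material"):
    for i, df in find_tables_with_label(table_dataframes, wanted):
        print(f"'{wanted}' found in table {i}")
        print(df)
```

Matching on a label that is unique to each table avoids hard-coding positions such as 71 or 74, which could shift if the page layout changes.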

Answer №2

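Working with these tables is a bit tricky because they are nested. I used the CSS selector

table:has(td:contains("Rim Material")):has(table) tr:not(:has(tr))

to target the first table, and the same selector with the text "Steering" to target the second table. Note that newer releases of soupsieve, the selector engine behind soup.select, deprecate :contains in favour of :-soup-contains, so the selector may need that spelling there.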

from bs4 import BeautifulSoup
import requests

url = 'https://www.redbook.com.au/cars/details/2019-honda-civic-50-years-edition-auto-my19/SPOT-ITM-524208/'

headers = {'User-Agent':'Mozilla/5.0'}
soup = BeautifulSoup(requests.get(url, headers=headers).text, 'lxml')

rows = []
for tr in soup.select('table:has(td:contains("Rim Material")):has(table) tr:not(:has(tr)), table:has(td:contains("Steering")):has(table) tr:not(:has(tr))'):
    rows.append([td.get_text(strip=True) for td in tr.select('td')])

for label, text in rows:
    print('{: <30}: {}'.format(label, text))

This will output:

Steering                      : Rack and Pinion
Rim Material                  : Alloy
Front Tyre Description        : 215/55 R16
Front Rim Description         : 16x7.0
Rear Tyre Description         : 215/55 R16
Rear Rim Description          : 16x7.0

Note: To extract data from multiple URLs, you can use the following script:

from bs4 import BeautifulSoup
import requests

headers = {'User-Agent':'Mozilla/5.0'}

urls = ['https://www.redbook.com.au/cars/details/2019-honda-civic-50-years-edition-auto-my19/SPOT-ITM-524208/',
        'https://www.redbook.com.au/cars/details/2019-genesis-g80-38-ultimate-auto-my19/SPOT-ITM-520697/']

for url in urls:
    soup = BeautifulSoup(requests.get(url, headers=headers).text, 'lxml')

    rows = []
    for tr in soup.select('table:has(td:contains("Rim Material")):has(table) tr:not(:has(tr)), table:has(td:contains("Steering")):has(table) tr:not(:has(tr))'):
        rows.append([td.get_text(strip=True) for td in tr.select('td')])

    print('{: <30}: {}'.format('Title', soup.h1.text))
    print('-' * (len(soup.h1.text.strip())+32))
    for label, text in rows:
        print('{: <30}: {}'.format(label, text))

    print('*' * 80)

When executed, this will display:

Title                         : 2019 Honda Civic 50 Years Edition Auto MY19
---------------------------------------------------------------------------
Steering                      : Rack and Pinion
Rim Material                  : Alloy
Front Tyre Description        : 215/55 R16
Front Rim Description         : 16x7.0
Rear Tyre Description         : 215/55 R16
Rear Rim Description          : 16x7.0
********************************************************************************
Title                         : 2019 Genesis G80 3.8 Ultimate Auto MY19
-----------------------------------------------------------------------
Steering                      : Rack and Pinion
Rim Material                  : Alloy
Front Tyre Description        : 245/40 R19
Front Rim Description         : 19x8.5
Rear Tyre Description         : 275/35 R19
Rear Rim Description          : 19x9.0
********************************************************************************
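Since the question asks for the values only, without labels, the label cells can be skipped by selecting td.value inside each table.prop-list. The following is a minimal sketch under stated assumptions: it runs on a shortened inline copy of the HTML from the question rather than the live page, and it uses the built-in 'html.parser' so that only BeautifulSoup is required. The variable name values_per_table is made up for illustration.

```python
from bs4 import BeautifulSoup

# Shortened copy of the two tables from the question (assumption: the
# live page uses the same class names).
html = """
<table class="prop-list"><tbody><tr>
  <td class="item"><table><tbody><tr>
    <td class="label">Rim Material</td><td class="value">Alloy</td>
  </tr></tbody></table></td>
  <td class="item"><table><tbody><tr>
    <td class="label">Front Tyre Description</td><td class="value">215/55 R16</td>
  </tr></tbody></table></td>
</tr></tbody></table>
<table class="prop-list"><tbody><tr>
  <td class="item"><table><tbody><tr>
    <td class="label">Steering</td><td class="value">Rack and Pinion</td>
  </tr></tbody></table></td>
  <td></td>
</tr></tbody></table>
"""

soup = BeautifulSoup(html, "html.parser")

# Only the outer tables carry the 'prop-list' class, so each outer table
# yields one list of values; the nested tables are searched as descendants.
values_per_table = [
    [td.get_text(strip=True) for td in table.select("td.value")]
    for table in soup.select("table.prop-list")
]

print(values_per_table)
# prints [['Alloy', '215/55 R16'], ['Rack and Pinion']]
```

Keeping one list per table preserves which value came from which table even though both tables share the same class.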

Answer №3

You don't have to retrieve all the data in a single xpath query. Instead, you can use multiple queries to extract specific information from different elements on the page. For example, you could first select all tables with the class 'prop-list' and then target individual tables by index to obtain values from them using another xpath expression.

In my case, I used BeautifulSoup for this task, but the process should be similar when using xpath.
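As a rough illustration of the xpath variant of this two-step idea, here is a sketch using lxml, assuming it is installed. It runs on a shortened inline copy of the question's HTML rather than the live page, and the variable names are made up for the example. The first query selects the tables, the second query is evaluated relative to each table.

```python
from lxml import html as lxml_html

# Shortened copy of the two tables from the question (assumption: the
# live page uses the same class names).
page = """<html><body>
<table class="prop-list"><tbody><tr>
  <td class="item"><table><tbody><tr>
    <td class="label">Rim Material</td><td class="value">Alloy</td>
  </tr></tbody></table></td>
  <td class="item"><table><tbody><tr>
    <td class="label">Front Tyre Description</td><td class="value">215/55 R16</td>
  </tr></tbody></table></td>
</tr></tbody></table>
<table class="prop-list"><tbody><tr>
  <td class="item"><table><tbody><tr>
    <td class="label">Steering</td><td class="value">Rack and Pinion</td>
  </tr></tbody></table></td>
  <td></td>
</tr></tbody></table>
</body></html>"""

tree = lxml_html.fromstring(page)

# Step 1: select every table with the shared class.
tables = tree.xpath('//table[@class="prop-list"]')

# Step 2: run a second, relative query (note the leading dot) on each table.
values_by_table = [
    [text.strip() for text in table.xpath('.//td[@class="value"]/text()')]
    for table in tables
]

print(values_by_table)
```

The BeautifulSoup code used for this answer follows.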

import requests
from bs4 import BeautifulSoup as BS

url = 'https://www.redbook.com.au/cars/details/2019-honda-civic-50-years-edition-auto-my19/SPOT-ITM-524208/'

text = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}).text

soup = BS(text, 'html.parser')

all_tables = soup.find_all('table', {'class': 'prop-list'}) # Similar to xpath: '//table[@class="prop-list"]'

print("\n--- Engine ---\n")
all_labels = all_tables[3].find_all('td', {'class': 'label'}) 
all_values = all_tables[3].find_all('td', {'class': 'value'}) 
for label, value in zip(all_labels, all_values):
    print('{}: {}'.format(label.text, value.text))

print("\n--- Fuel ---\n")
all_labels = all_tables[4].find_all('td', {'class': 'label'})
all_values = all_tables[4].find_all('td', {'class': 'value'})
for label, value in zip(all_labels, all_values):
    print('{}: {}'.format(label.text, value.text))

print("\n--- Steering ---\n")
all_labels = all_tables[7].find_all('td', {'class': 'label'})
all_values = all_tables[7].find_all('td', {'class': 'value'})
for label, value in zip(all_labels, all_values):
    print('{}: {}'.format(label.text, value.text))

print("\n--- Wheels ---\n")
all_labels = all_tables[8].find_all('td', {'class': 'label'})
all_values = all_tables[8].find_all('td', {'class': 'value'})
for label, value in zip(all_labels, all_values):
    print('{}: {}'.format(label.text, value.text))

Result:

--- Engine ---

Engine Type: Piston
Valves/Ports per Cylinder: 4
Engine Location: Front
Compression ratio: 10.6
Engine Size (cc) (cc): 1799
Engine Code: R18Z1
Induction: Aspirated
Power: 104kW @ 6500rpm
Engine Configuration: In-line
Torque: 174Nm @ 4300rpm
Cylinders: 4
Power to Weight Ratio (W/kg): 82.6
Camshaft: OHC with VVT & Lift

--- Fuel ---

Fuel Type: Petrol - Unleaded ULP
Fuel Average Distance (km): 734
Fuel Capacity (L): 47
Fuel Maximum Distance (km): 940
RON Rating: 91
Fuel Minimum Distance (km): 540
Fuel Delivery: Multi-Point Injection
CO2 Emission Combined (g/km): 148
Method of Delivery: Electronic Sequential
CO2 Extra Urban (g/km): 117
Fuel Consumption Combined (L/100km): 6.4
CO2 Urban (g/km): 202
Fuel Consumption Extra Urban (L/100km): 5
Emission Standard: Euro 5
Fuel Consumption Urban (L/100km): 8.7

--- Steering ---

Steering: Rack and Pinion

--- Wheels ---

Rim Material: Alloy
Front Tyre Description: 215/55 R16
Front Rim Description: 16x7.0
Rear Tyre Description: 215/55 R16
Rear Rim Description: 16x7.0

I am assuming that all pages follow the same structure with consistent table numbers.
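If that assumption does not hold on some pages, a table can be looked up by one of its labels instead of by its index. This is a minimal sketch on a shortened inline copy of the question's HTML; the helper names table_for_label and values_of are hypothetical, and whether each label is unique to a single table on the real site is an assumption.

```python
from bs4 import BeautifulSoup


def table_for_label(soup, label):
    """Return the first 'prop-list' table with a label cell equal to `label`, else None."""
    for table in soup.find_all("table", {"class": "prop-list"}):
        labels = [td.get_text(strip=True)
                  for td in table.find_all("td", {"class": "label"})]
        if label in labels:
            return table
    return None


def values_of(table):
    """Return the text of every value cell in the given table."""
    return [td.get_text(strip=True)
            for td in table.find_all("td", {"class": "value"})]


# Shortened copy of the two tables from the question.
html = """
<table class="prop-list"><tbody><tr>
  <td class="item"><table><tbody><tr>
    <td class="label">Rim Material</td><td class="value">Alloy</td>
  </tr></tbody></table></td>
  <td class="item"><table><tbody><tr>
    <td class="label">Front Tyre Description</td><td class="value">215/55 R16</td>
  </tr></tbody></table></td>
</tr></tbody></table>
<table class="prop-list"><tbody><tr>
  <td class="item"><table><tbody><tr>
    <td class="label">Steering</td><td class="value">Rack and Pinion</td>
  </tr></tbody></table></td>
  <td></td>
</tr></tbody></table>
"""

soup = BeautifulSoup(html, "html.parser")

print(values_of(table_for_label(soup, "Steering")))      # prints ['Rack and Pinion']
print(values_of(table_for_label(soup, "Rim Material")))  # prints ['Alloy', '215/55 R16']
```

Because table_for_label returns None when nothing matches, callers should check the result before passing it to values_of.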
