What is the best method to retrieve information from two tables on a webpage that have identical classes?

I am having trouble selecting data from two different tables that share the same class.

Using 'soup.find_all' has proven difficult because of the way the data is formatted.

The page contains multiple tables with identical classes, and I only need the values (without the labels) from each table.

URL:

TABLE 1:

<div class="bh_collapsible-body" style="display: none;">
    <table border="0" cellpadding="2" cellspacing="2" class="prop-list">
        <tbody>
            <tr>
                <td class="item">
                    <table>
                        <tbody>
                            <tr>
                                <td class="label">Rim Material</td>
                                <td class="value">Alloy</td>
                            </tr>
                        </tbody>
                    </table>
                </td>
                <td class="item">
                    <table>
                        <tbody>
                            <tr>
                                <td class="label">Front Tyre Description</td>
                                <td class="value">215/55 R16</td>
                            </tr>
                        </tbody>
                    </table>
                </td>
            </tr>

            <tr>
                <td class="item">
                    <table>
                        <tbody>
                            <tr>
                                <td class="label">Front Rim Description</td>
                                <td class="value">16x7.0</td>
                            </tr>
                        </tbody>
                    </table>
                </td>
                <td class="item">
                    <table>
                        <tbody>
                            <tr>
                                <td class="label">Rear Tyre Description</td>
                                <td class="value">215/55 R16</td>
                            </tr>
                        </tbody>
                    </table>
                </td>
            </tr>

            <tr>
                <td class="item">
                    <table>
                        <tbody>
                            <tr>
                                <td class="label">Rear Rim Description</td>
                                <td class="value">16x7.0</td>
                            </tr>
                        </tbody>
                    </table>
                </td>
                <td></td>
            </tr>
        </tbody>
    </table>
</div>

TABLE 2:

<div class="bh_collapsible-body" style="display: none;">
    <table border="0" cellpadding="2" cellspacing="2" class="prop-list">
        <tbody>
            <tr>
                <td class="item">
                    <table>
                        <tbody>
                            <tr>
                                <td class="label">Steering</td>
                                <td class="value">Rack and Pinion</td>
                            </tr>
                        </tbody>
                    </table>
                </td>
                <td></td>
            </tr>
        </tbody>
    </table>
</div>

What I have attempted:

I tried to retrieve the contents of the first table using XPath, but it returned both the values and the labels together.

table1 = driver.find_element_by_xpath("//*[@id='features']/div/div[5]/div[2]/div[1]/div[1]/div/div[2]/table/tbody/tr[1]/td[1]/table/tbody/tr/td[2]")

I also tried splitting the data, without success. I have provided the page URL in case you would like to investigate further.

python html selenium selenium-webdriver beautifulsoup

Answer №1

If you're up for a bit of data exploration, one approach is to use the read_html function in pandas.

read_html extracts all HTML tables from a page and returns them as a list of pandas DataFrames.

The following snippet fetches all 82 table elements from the provided URL:

from io import StringIO

import pandas as pd
import requests

url = "https://www.redbook.com.au/cars/details/2019-honda-civic-50-years-edition-auto-my19/SPOT-ITM-524208/"

# Send a browser-like User-Agent header to avoid a 403 Forbidden response
headers = {
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
        "X-Requested-With": "XMLHttpRequest"
        }

resp = requests.get(url, headers=headers)

# Wrap the HTML in StringIO: recent pandas versions deprecate passing a literal HTML string
table_dataframes = pd.read_html(StringIO(resp.text))

# Print every table with its index so the wanted one can be picked out by eye
for i, df in enumerate(table_dataframes):
    print(f"================Table {i}=================\n")
    print(df)

This script prints all 82 tables found on the page. You still have to identify the table you want by hand; based on your request, it appears to be table 71 or table 74.

Automating that step would require additional logic.
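
One way to sketch that additional logic, assuming you know a label that appears in the table you want (for example "Steering"), is to scan the list of DataFrames for a cell equal to that label. The helper name tables_containing and the sample DataFrames below are hypothetical stand-ins for whatever read_html returns on the real page; because the tables there are nested, the parsed frames may be shaped differently, so treat this as a starting point only:

```python
import pandas as pd


def tables_containing(dataframes, label):
    """Return (index, DataFrame) pairs that have a cell equal to `label`."""
    matches = []
    for i, df in enumerate(dataframes):
        # Compare as strings so that numeric cells do not cause surprises.
        if (df.astype(str) == label).any().any():
            matches.append((i, df))
    return matches


# Hypothetical stand-in for the list returned by pd.read_html(...)
sample = [
    pd.DataFrame([["Engine Type", "Piston"]]),
    pd.DataFrame([["Steering", "Rack and Pinion"]]),
    pd.DataFrame([["Rim Material", "Alloy"], ["Front Rim Description", "16x7.0"]]),
]

for i, df in tables_containing(sample, "Steering"):
    # Column 1 holds the values in this sample layout (label, value).
    print(i, df.iloc[:, 1].tolist())
```

Matching on a label rather than a table number should keep working if the page gains or loses a table, as long as the label text itself stays the same.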

Answer №2

These tables are a bit tricky to work with because they are nested. I used the CSS selector

table:has(td:-soup-contains("Rim Material")):has(table) tr:not(:has(tr))

to target the first table, and the same selector with the text "Steering" to target the second table. The selector matches each outer table that contains the given text and a nested table, then keeps only the innermost rows (rows that contain no other row):

from bs4 import BeautifulSoup
import requests

url = 'https://www.redbook.com.au/cars/details/2019-honda-civic-50-years-edition-auto-my19/SPOT-ITM-524208/'

headers = {'User-Agent':'Mozilla/5.0'}
soup = BeautifulSoup(requests.get(url, headers=headers).text, 'lxml')

# :-soup-contains is the current name of the deprecated :contains pseudo-class (Soup Sieve 2.1+)
rows = []
for tr in soup.select('table:has(td:-soup-contains("Rim Material")):has(table) tr:not(:has(tr)), table:has(td:-soup-contains("Steering")):has(table) tr:not(:has(tr))'):
    rows.append([td.get_text(strip=True) for td in tr.select('td')])

# Each innermost row holds two cells: the label and its value
for label, text in rows:
    print('{: <30}: {}'.format(label, text))

This will output:

Steering                      : Rack and Pinion
Rim Material                  : Alloy
Front Tyre Description        : 215/55 R16
Front Rim Description         : 16x7.0
Rear Tyre Description         : 215/55 R16
Rear Rim Description          : 16x7.0

Note: To extract data from multiple URLs, you can use the following script:

from bs4 import BeautifulSoup
import requests

headers = {'User-Agent':'Mozilla/5.0'}

urls = ['https://www.redbook.com.au/cars/details/2019-honda-civic-50-years-edition-auto-my19/SPOT-ITM-524208/',
        'https://www.redbook.com.au/cars/details/2019-genesis-g80-38-ultimate-auto-my19/SPOT-ITM-520697/']

for url in urls:
    soup = BeautifulSoup(requests.get(url, headers=headers).text, 'lxml')

    # :-soup-contains is the current name of the deprecated :contains pseudo-class (Soup Sieve 2.1+)
    rows = []
    for tr in soup.select('table:has(td:-soup-contains("Rim Material")):has(table) tr:not(:has(tr)), table:has(td:-soup-contains("Steering")):has(table) tr:not(:has(tr))'):
        rows.append([td.get_text(strip=True) for td in tr.select('td')])

    # Strip the title once so the printed text and the underline length agree
    title = soup.h1.get_text(strip=True)
    print('{: <30}: {}'.format('Title', title))
    print('-' * (len(title)+32))
    for label, text in rows:
        print('{: <30}: {}'.format(label, text))

    print('*' * 80)

When executed, this will display:

Title                         : 2019 Honda Civic 50 Years Edition Auto MY19
---------------------------------------------------------------------------
Steering                      : Rack and Pinion
Rim Material                  : Alloy
Front Tyre Description        : 215/55 R16
Front Rim Description         : 16x7.0
Rear Tyre Description         : 215/55 R16
Rear Rim Description          : 16x7.0
********************************************************************************
Title                         : 2019 Genesis G80 3.8 Ultimate Auto MY19
-----------------------------------------------------------------------
Steering                      : Rack and Pinion
Rim Material                  : Alloy
Front Tyre Description        : 245/40 R19
Front Rim Description         : 19x8.5
Rear Tyre Description         : 275/35 R19
Rear Rim Description          : 19x9.0
********************************************************************************
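
Since the question asks only for the values, here is a minimal sketch that collects just the td.value cells, grouped by outer prop-list table. It runs on a trimmed copy of the HTML from the question rather than the live page, and the variable names are illustrative:

```python
from bs4 import BeautifulSoup

# Trimmed copy of the two tables shown in the question
html = """
<table class="prop-list"><tbody><tr>
  <td class="item"><table><tbody><tr>
    <td class="label">Rim Material</td><td class="value">Alloy</td>
  </tr></tbody></table></td>
  <td class="item"><table><tbody><tr>
    <td class="label">Front Tyre Description</td><td class="value">215/55 R16</td>
  </tr></tbody></table></td>
</tr></tbody></table>
<table class="prop-list"><tbody><tr>
  <td class="item"><table><tbody><tr>
    <td class="label">Steering</td><td class="value">Rack and Pinion</td>
  </tr></tbody></table></td>
  <td></td>
</tr></tbody></table>
"""

soup = BeautifulSoup(html, "html.parser")

# Only the outer tables carry the prop-list class, so each one yields
# exactly one list of values; the label cells are never selected.
values_per_table = [
    [td.get_text(strip=True) for td in table.select("td.value")]
    for table in soup.select("table.prop-list")
]

print(values_per_table)  # [['Alloy', '215/55 R16'], ['Rack and Pinion']]
```

Selecting by the value class sidesteps the nesting entirely, at the cost of relying on the class names staying the same across pages.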

Answer №3

You don't have to retrieve all the data with a single XPath query. You can first select all tables with the class 'prop-list', then pick individual tables by index and run a second query on each one to get its values.

I used BeautifulSoup here, but the process should be similar with XPath.

import requests
from bs4 import BeautifulSoup as BS

url = 'https://www.redbook.com.au/cars/details/2019-honda-civic-50-years-edition-auto-my19/SPOT-ITM-524208/'

text = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}).text

soup = BS(text, 'html.parser')

all_tables = soup.find_all('table', {'class': 'prop-list'})  # Similar to XPath: '//table[@class="prop-list"]'

# Section name -> index of its table in all_tables (indexes observed on this page)
sections = {'Engine': 3, 'Fuel': 4, 'Steering': 7, 'Wheels': 8}

for name, index in sections.items():
    print("\n--- {} ---\n".format(name))
    all_labels = all_tables[index].find_all('td', {'class': 'label'})
    all_values = all_tables[index].find_all('td', {'class': 'value'})
    # Labels and values appear in the same order, so they can be paired with zip
    for label, value in zip(all_labels, all_values):
        print('{}: {}'.format(label.text, value.text))

Result:

--- Engine ---

Engine Type: Piston
Valves/Ports per Cylinder: 4
Engine Location: Front
Compression ratio: 10.6
Engine Size (cc) (cc): 1799
Engine Code: R18Z1
Induction: Aspirated
Power: 104kW @ 6500rpm
Engine Configuration: In-line
Torque: 174Nm @ 4300rpm
Cylinders: 4
Power to Weight Ratio (W/kg): 82.6
Camshaft: OHC with VVT & Lift

--- Fuel ---

Fuel Type: Petrol - Unleaded ULP
Fuel Average Distance (km): 734
Fuel Capacity (L): 47
Fuel Maximum Distance (km): 940
RON Rating: 91
Fuel Minimum Distance (km): 540
Fuel Delivery: Multi-Point Injection
CO2 Emission Combined (g/km): 148
Method of Delivery: Electronic Sequential
CO2 Extra Urban (g/km): 117
Fuel Consumption Combined (L/100km): 6.4
CO2 Urban (g/km): 202
Fuel Consumption Extra Urban (L/100km): 5
Emission Standard: Euro 5
Fuel Consumption Urban (L/100km): 8.7

--- Steering ---

Steering: Rack and Pinion

--- Wheels ---

Rim Material: Alloy
Front Tyre Description: 215/55 R16
Front Rim Description: 16x7.0
Rear Tyre Description: 215/55 R16
Rear Rim Description: 16x7.0

This assumes that every page follows the same structure, so each table appears at the same index.
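
If the table indexes turn out to differ between pages, a possible alternative is to look a table up by one of its labels instead of by position. The sketch below assumes the markup shown in the question; table_for_label is a hypothetical helper, not part of BeautifulSoup, and the sample HTML is a trimmed stand-in for the real page:

```python
from bs4 import BeautifulSoup

# Trimmed stand-in for the page: two prop-list tables with nested rows
html = """
<table class="prop-list"><tbody><tr>
  <td class="item"><table><tbody><tr>
    <td class="label">Rim Material</td><td class="value">Alloy</td>
  </tr></tbody></table></td>
</tr></tbody></table>
<table class="prop-list"><tbody><tr>
  <td class="item"><table><tbody><tr>
    <td class="label">Steering</td><td class="value">Rack and Pinion</td>
  </tr></tbody></table></td>
</tr></tbody></table>
"""


def table_for_label(soup, label):
    """Return the first prop-list table with a td.label equal to `label`, or None."""
    for table in soup.find_all("table", class_="prop-list"):
        labels = [td.get_text(strip=True) for td in table.find_all("td", class_="label")]
        if label in labels:
            return table
    return None


soup = BeautifulSoup(html, "html.parser")
steering = table_for_label(soup, "Steering")

# Pair labels with values in document order, as in the answer above.
pairs = {
    label.get_text(strip=True): value.get_text(strip=True)
    for label, value in zip(
        steering.find_all("td", class_="label"),
        steering.find_all("td", class_="value"),
    )
}

print(pairs)  # {'Steering': 'Rack and Pinion'}
```

The helper returns None when no table has the label, so a real script would want to check for that before calling find_all on the result.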
