What is the best way to extract data from a nested web table with rows within rows?

I am trying to extract data from a table where each row contains several lines of information within a single cell.

My goal is to scrape each individual line inside the main row and build a data frame from it, carrying the content enclosed in <strong> </strong> tags alongside each line as its group label (see the desired output below).

Is there a way to achieve this with Python? I have been experimenting with Selenium and pandas.read_html, but I have hit a roadblock. Ultimately, I want to combine all of the scraped information into a single data frame.

The structure of the HTML resembles the following:

<td>
    <strong>    Important Information1  </strong>
    <br>    Some information
    <br>    Some information
    <br>    Some information
    <br>    Some information
    <br>    Some information
    <br>    Some information
</td>
<td>
    <strong>    Important Information 2 </strong>
    <br>    Some information 2
    <br>    Some information 2
    <br>    Some information 2
    <br>    Some information 2
    <br>    Some information 2  
    <br>    Some information 2
    <br>    Some information 2  
    <br>    Some information 2
    <br>    Some information 2  
    <br>    Some information 2  
</td>
<td>
    <strong>    Important Information 3 </strong>
    <br>    Some information 3
    <br>    Some information 3
    <br>    Some information 3
    <br>    Some information 3  
</td>

Desired Output:

           Important Header Some Information Header
0   Important Information1         Some information
1   Important Information1         Some information
2   Important Information1         Some information
3   Important Information1         Some information
4   Important Information1         Some information
5   Important Information1         Some information
6    Important Information2      Some information 2
7    Important Information2      Some information 2
8    Important Information2      Some information 2
9    Important Information2      Some information 2
10   Important Information2      Some information 2
11   Important Information2      Some information 2
12   Important Information2      Some information 2
13   Important Information2      Some information 2
14   Important Information2      Some information 2
15   Important Information2      Some information 2
16   Important Information3      Some information 3
17   Important Information3      Some information 3
18   Important Information3      Some information 3
19   Important Information3      Some information 3

Answer №1

If you're looking to transform the HTML content into a pandas DataFrame with three columns:

import pandas as pd
from bs4 import BeautifulSoup

html_doc = """
<td>
    <strong>    Important Information1  </strong>
    <br>    Some information
    <br>    Some information
    <br>    Some information
    <br>    Some information
    <br>    Some information
    <br>    Some information
</td>
<td>
    <strong>    Important Information 2 </strong>
    <br>    Some information 2
    <br>    Some information 2
    <br>    Some information 2
    <br>    Some information 2
    <br>    Some information 2  
    <br>    Some information 2
    <br>    Some information 2  
    <br>    Some information 2
    <br>    Some information 2  
    <br>    Some information 2  
</td>
<td>
    <strong>    Important Information 3 </strong>
    <br>    Some information 3
    <br>    Some information 3
    <br>    Some information 3
    <br>    Some information 3  
</td>
"""

soup = BeautifulSoup(html_doc, "html.parser")

columns = []
for td in soup.select("td"):
    # the first text fragment is the <strong> header; the rest are the <br>-separated lines
    col_name, *data = td.get_text(strip=True, separator="|").split("|")
    columns.append(pd.Series(data, name=col_name))

print(pd.concat(columns, axis=1))

This will display:

  Important Information1 Important Information 2 Important Information 3
0       Some information      Some information 2      Some information 3
1       Some information      Some information 2      Some information 3
2       Some information      Some information 2      Some information 3
3       Some information      Some information 2      Some information 3
4       Some information      Some information 2                     NaN
5       Some information      Some information 2                     NaN
6                    NaN      Some information 2                     NaN
7                    NaN      Some information 2                     NaN
8                    NaN      Some information 2                     NaN
9                    NaN      Some information 2                     NaN
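If you instead want the long, two-column layout from your desired output (the <strong> text repeated next to each line), a small variation on the same loop works. This is a sketch that reuses the soup object above; the column names are simply taken from your desired output:

rows = []
for td in soup.select("td"):
    header, *values = td.get_text(strip=True, separator="|").split("|")
    rows.extend({"Important Header": header, "Some Information Header": v} for v in values)

print(pd.DataFrame(rows))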

Answer №2

To recommend the most effective way to scrape these elements, we would need a concrete example of your current approach. If you are starting from scratch, I suggest retrieving a parent element and then accessing its children.

It is also important to implement error handling for robustness. Many people prefer CSS selectors as locators, but I personally advocate for XPaths.

An implementation could resemble the following:

parent_elements = driver.find_elements_by_xpath('xpath to parent')
for parent in parent_elements:
    for child in parent.find_elements_by_xpath('./*'):
        pass  # do something with each child element

The logic for selecting each parent element will vary based on the webpage being scraped.

This process is further explained in detail in this Stack Overflow post: Get all child elements
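Applied to the markup in the question, that pattern might look roughly like the sketch below. It is untested against your page: the XPath //td[strong] is an assumption, and the per-line values are read from the cell text rather than from the <br> elements, because <br> tags carry no text of their own.

records = []
for td in driver.find_elements_by_xpath('//td[strong]'):  # assumed locator for the parent cells
    header = td.find_element_by_tag_name('strong').text.strip()
    # every non-empty line after the first (the <strong> line) is one value
    for line in td.text.splitlines()[1:]:
        if line.strip():
            records.append((header, line.strip()))

The records list can then be turned into the frame from the question with pd.DataFrame(records, columns=["Important Header", "Some Information Header"]).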

Answer №3

  • Import the required packages and set up the Chrome driver:

    # >> Prepare: import the packages you are using
    import os
    from selenium import webdriver
    
    # >> Set up the Chrome browser (raw string so the backslashes are kept literally)
    chromedriver = r"C:\Program Files\Python39\Scripts\chromedriver"
    os.environ["webdriver.chrome.driver"] = chromedriver
    driver = webdriver.Chrome(chromedriver)

  • Snippet of code for web scraping:

    # - Programming: web scraping
    element_list = driver.find_elements_by_tag_name('td')
    Data = []
    for i, item in enumerate(element_list, start=1):
        Title = item.find_element_by_xpath('./strong').text.strip()
        row = [i, Title]
        # <br> tags carry no text of their own, so read the visible cell text
        # and skip its first line (the <strong> title) to get the values
        for Value in item.text.splitlines()[1:]:
            if Value.strip():
                row.append(Value.strip())
        Data.append(row)
    
    # - Display results:
    print('- Data[1] = ', Data[0])
    print('- Data[2] = ', Data[1])
    print('- Data[3] = ', Data[2])

  • Update: Code to export data as CSV

import csv

def pad(data):
    max_n = max([len(x) for x in data.values()])
    for field in data:
        data[field] += [''] * (max_n - len(data[field]))
    return data

def merge_dicts(*dict_args):
    """
    Given any number of dictionaries, shallow copy and merge into a new dict,
    precedence goes to key-value pairs in latter dictionaries.
    """
    result = {}
    for dictionary in dict_args:
        result.update(dictionary)
    return result

Data_1 = Data[0]
Data_2 = Data[1]
Data_3 = Data[2]

sdata_1 = {"Data_1":Data_1, "Data_2":Data_2}
sdata_2 = { "Data_3":Data_3}
data = merge_dicts(sdata_1, sdata_2)
print(data)

import pandas as pd
df = pd.DataFrame(pad(data))
df.to_csv("output.csv", index=False)

print('>> Completed exporting to CSV')
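As a quick, standalone illustration of what pad and merge_dicts do (the lists below are made up and independent of the scraped Data):

example = merge_dicts({"Data_1": ["a", "b", "c"]}, {"Data_2": ["x"]})
print(pad(example))
# {'Data_1': ['a', 'b', 'c'], 'Data_2': ['x', '', '']}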
