What is the best way to extract data from a nested web table with rows within rows?

I am trying to extract data from a table where each row contains several lines of information within a single cell.

My goal is to scrape each individual line inside the main row and build a data frame from it, carrying the content enclosed in <strong> </strong> tags alongside each line as its group label (see the desired output below).

Is there a way to achieve this with Python? I have been experimenting with Selenium and pandas.read_html, but I have hit a roadblock. Ultimately, I want to combine all of the scraped information into a single data frame.

The structure of the HTML resembles the following:

<td>
    <strong>    Important Information1  </strong>
    <br>    Some information
    <br>    Some information
    <br>    Some information
    <br>    Some information
    <br>    Some information
    <br>    Some information
</td>
<td>
    <strong>    Important Information 2 </strong>
    <br>    Some information 2
    <br>    Some information 2
    <br>    Some information 2
    <br>    Some information 2
    <br>    Some information 2  
    <br>    Some information 2
    <br>    Some information 2  
    <br>    Some information 2
    <br>    Some information 2  
    <br>    Some information 2  
</td>
<td>
    <strong>    Important Information 3 </strong>
    <br>    Some information 3
    <br>    Some information 3
    <br>    Some information 3
    <br>    Some information 3  
</td>

Desired Output:

           Important Header Some Information Header
0   Important Information1         Some information
1   Important Information1         Some information
2   Important Information1         Some information
3   Important Information1         Some information
4   Important Information1         Some information
5   Important Information1         Some information
6    Important Information2      Some information 2
7    Important Information2      Some information 2
8    Important Information2      Some information 2
9    Important Information2      Some information 2
10   Important Information2      Some information 2
11   Important Information2      Some information 2
12   Important Information2      Some information 2
13   Important Information2      Some information 2
14   Important Information2      Some information 2
15   Important Information2      Some information 2
16   Important Information3      Some information 3
17   Important Information3      Some information 3
18   Important Information3      Some information 3
19   Important Information3      Some information 3

Answer №1

If you're looking to transform the HTML content into a pandas DataFrame with three columns:

import pandas as pd
from bs4 import BeautifulSoup

html_doc = """
<td>
    <strong>    Important Information1  </strong>
    <br>    Some information
    <br>    Some information
    <br>    Some information
    <br>    Some information
    <br>    Some information
    <br>    Some information
</td>
<td>
    <strong>    Important Information 2 </strong>
    <br>    Some information 2
    <br>    Some information 2
    <br>    Some information 2
    <br>    Some information 2
    <br>    Some information 2  
    <br>    Some information 2
    <br>    Some information 2  
    <br>    Some information 2
    <br>    Some information 2  
    <br>    Some information 2  
</td>
<td>
    <strong>    Important Information 3 </strong>
    <br>    Some information 3
    <br>    Some information 3
    <br>    Some information 3
    <br>    Some information 3  
</td>
"""

soup = BeautifulSoup(html_doc, "html.parser")

columns = []
for td in soup.select("td"):
    # the first text fragment is the <strong> header; the rest are the <br>-separated lines
    col_name, *data = td.get_text(strip=True, separator="|").split("|")
    columns.append(pd.Series(data, name=col_name))

print(pd.concat(columns, axis=1))

This will display:

  Important Information1 Important Information 2 Important Information 3
0       Some information      Some information 2      Some information 3
1       Some information      Some information 2      Some information 3
2       Some information      Some information 2      Some information 3
3       Some information      Some information 2      Some information 3
4       Some information      Some information 2                     NaN
5       Some information      Some information 2                     NaN
6                    NaN      Some information 2                     NaN
7                    NaN      Some information 2                     NaN
8                    NaN      Some information 2                     NaN
9                    NaN      Some information 2                     NaN
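If you instead want the long, two-column layout from your desired output (the <strong> text repeated next to each line), a small variation on the same loop works. This is a sketch that reuses the soup object above; the column names are simply taken from your desired output:

rows = []
for td in soup.select("td"):
    header, *values = td.get_text(strip=True, separator="|").split("|")
    rows.extend({"Important Header": header, "Some Information Header": v} for v in values)

print(pd.DataFrame(rows))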

Answer №2

To recommend the most effective way to scrape these elements, we would need a concrete example of your current approach. If you are starting from scratch, I suggest retrieving a parent element and then accessing its children.

It is also important to implement error handling for robustness. Many people prefer CSS selectors as locators, but I personally advocate for XPaths.

An implementation could resemble the following:

parent_elements = driver.find_elements_by_xpath('xpath to parent')
for parent in parent_elements:
    for child in parent.find_elements_by_xpath('./*'):
        pass  # do something with each child element

The logic for selecting each parent element will vary based on the webpage being scraped.

This process is further explained in detail in this Stack Overflow post: Get all child elements
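Applied to the markup in the question, that pattern might look roughly like the sketch below. It is untested against your page: the XPath //td[strong] is an assumption, and the per-line values are read from the cell text rather than from the <br> elements, because <br> tags carry no text of their own.

records = []
for td in driver.find_elements_by_xpath('//td[strong]'):  # assumed locator for the parent cells
    header = td.find_element_by_tag_name('strong').text.strip()
    # every non-empty line after the first (the <strong> line) is one value
    for line in td.text.splitlines()[1:]:
        if line.strip():
            records.append((header, line.strip()))

The records list can then be turned into the frame from the question with pd.DataFrame(records, columns=["Important Header", "Some Information Header"]).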

Answer №3

  • Import the required packages and set up the Chrome driver:

    # >> Prepare: import the packages you are using
    import os
    from selenium import webdriver
    
    # >> Set up the Chrome browser (raw string so the backslashes are kept literally)
    chromedriver = r"C:\Program Files\Python39\Scripts\chromedriver"
    os.environ["webdriver.chrome.driver"] = chromedriver
    driver = webdriver.Chrome(chromedriver)

  • Snippet of code for web scraping:

    # - Programming: web scraping
    element_list = driver.find_elements_by_tag_name('td')
    Data = []
    for i, item in enumerate(element_list, start=1):
        Title = item.find_element_by_xpath('./strong').text.strip()
        row = [i, Title]
        # <br> tags carry no text of their own, so read the visible cell text
        # and skip its first line (the <strong> title) to get the values
        for Value in item.text.splitlines()[1:]:
            if Value.strip():
                row.append(Value.strip())
        Data.append(row)
    
    # - Display results:
    print('- Data[1] = ', Data[0])
    print('- Data[2] = ', Data[1])
    print('- Data[3] = ', Data[2])

  • Update: Code to export data as CSV

import csv

def pad(data):
    max_n = max([len(x) for x in data.values()])
    for field in data:
        data[field] += [''] * (max_n - len(data[field]))
    return data

def merge_dicts(*dict_args):
    """
    Given any number of dictionaries, shallow copy and merge into a new dict,
    precedence goes to key-value pairs in latter dictionaries.
    """
    result = {}
    for dictionary in dict_args:
        result.update(dictionary)
    return result

Data_1 = Data[0]
Data_2 = Data[1]
Data_3 = Data[2]

sdata_1 = {"Data_1":Data_1, "Data_2":Data_2}
sdata_2 = { "Data_3":Data_3}
data = merge_dicts(sdata_1, sdata_2)
print(data)

import pandas as pd
df = pd.DataFrame(pad(data))
df.to_csv("output.csv", index=False)

print('>> Completed exporting to CSV')
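As a quick, standalone illustration of what pad and merge_dicts do (the lists below are made up and independent of the scraped Data):

example = merge_dicts({"Data_1": ["a", "b", "c"]}, {"Data_2": ["x"]})
print(pad(example))
# {'Data_1': ['a', 'b', 'c'], 'Data_2': ['x', '', '']}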
