What's the best way to loop through a complete web table using Beautiful Soup?

Question

I need help scraping a web table with Selenium and BeautifulSoup. The table has 10 'resultMainRow' elements, each containing 4 'resultMainCell' elements. The 4th resultMainCell in each row contains 8 spans, each holding an img with a src attribute. Below is an excerpt of the HTML for one table row, showing only the relevant parts of the source. How can I iterate through the entire table and extract the img src attributes?

<div class="resultMainTable">
   <div class="resultMainRow">
      <div class="resultMainCell_1 tableResult2">
           <a href="javascript:genResultDetails(2);" 
           title="Best of the date">20/006 </a></div>
      <div class="resultMainCell_2 tableResult2">21/01/2020</div>
      <div class="resultMainCell_3 tableResult2"></div>
      <div class="resultMainCell_4 tableResult2">
          <span class="resultMainCellInner"> 
              <img height="25" src="/info/images/icon/no_3abc"> </span>
          <span class="resultMainCellInner"> 
              <img height="25" src = "/info/images/icon/no_14 " ></span>
          <span class="resultMainCellInner"> 
               <img height="25" src "/info/images/icon/no_21 " ></span>
          <span class="resultMainCellInner">
               <img height="25" src="/info/images/icon/no_28 " ></span>
          <span class="resultMainCellInner">
               <img height="25" src=" /info/images/icon/no_37 "></span>
          <span class="resultMainCellInner">   
               <img height="25" src="/info/images/icon/no_44 "></span>
          <span class="resultMainCellInner">             
               <img height="6" src="/info/images/icon_happy " ></span>
          <span class="resultMainCellInner" 
               <img height="25" src="/info/images/icon/smile "></span>
    </div>
       </div>

To summarize: the table consists of 10 'resultMainRow' elements, each with 4 'resultMainCell' elements. The 4th resultMainCell contains 8 spans, each with an img whose src attribute I want to extract.

My current code snippet for this task is:

soup = BeautifulSoup(driver.page_source, 'lxml')
sixsix = soup.findAll("div", {"class": "resultMainTable"})
print(sixsix)

for row in sixsix:
    images = soup.findAll('img')
    for image in images:
        if len(images) == 8:
            aaa = images[1].find('src')
            bbb = images[2].find('src')
            ccc = images[3].find('src')
            ddd = images[4].find('src')
            eee = images[5].find('src')
            fff = images[6].find('src')
            ggg = images[7].find('src')
            hhh = images[8].find('src')
            print((row.text), (image('src')))

python selenium web-scraping beautifulsoup html-table

Answer 1

The script below iterates over each row of the table, extracts the text of the first three cells, and collects the src values of the 8 images in the fourth cell:

from bs4 import BeautifulSoup

html_code = '''
<div class="resultMainTable">
    <div class="resultMainRow">
       <div class="resultMainCell">text1</div>
       <div class="resultMainCell">text2</div>
       <div class="resultMainCell">text3</div>
       <div class="resultMainCell">
            <div>
                 <div>
                      <span>
                           <img src="1" />
                           <img src="2" />
                           <img src="3" />
                           <img src="4" />
                           <img src="5" />
                           <img src="6" />
                           <img src="7" />
                           <img src="8" />
                      </span>
                 </div>
            </div>
       </div>
    </div>
    <div class="resultMainRow">
       <div class="resultMainCell">text3</div>
       <div class="resultMainCell">text4</div>
       <div class="resultMainCell">text5</div>
       <div class="resultMainCell">
            <div>
                 <div>
                      <span>
                           <img src="9" />
                           <img src="10" />
                           <img src="11" />
                           <img src="12" />
                           <img src="13" />
                           <img src="14" />
                           <img src="15" />
                           <img src="16" />
                      </span>
                 </div>
            </div>
       </div>
    </div>
</div>'''

soup = BeautifulSoup(html_code, 'html.parser')

for row in soup.select('div.resultMainTable .resultMainRow'):
    cell1, cell2, cell3, cell4 = row.select('div.resultMainCell')
    image_sources = [img['src'] for img in cell4.select('img')]
    print(cell1.text, cell2.text, cell3.text, *image_sources)

The output will be:

text1 text2 text3 1 2 3 4 5 6 7 8
text3 text4 text5 9 10 11 12 13 14 15 16
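
If you need the values for further processing rather than just printing them, the same loop can be wrapped in a function. The following is only a sketch: the function name `parse_rows`, the dictionary keys, and the choice to skip rows that do not have exactly four cells are assumptions for illustration, not part of the original script.

```python
from bs4 import BeautifulSoup


def parse_rows(html):
    """Return one dict per .resultMainRow: cell texts plus image sources."""
    soup = BeautifulSoup(html, 'html.parser')
    records = []
    for row in soup.select('div.resultMainTable .resultMainRow'):
        cells = row.select('div.resultMainCell')
        if len(cells) != 4:
            # Assumption: rows without exactly four cells are skipped
            # instead of raising on tuple unpacking.
            continue
        records.append({
            'texts': [cell.get_text(strip=True) for cell in cells[:3]],
            'images': [img['src'] for img in cells[3].select('img')],
        })
    return records


# Shortened sample markup in the same shape as the example above.
sample_html = '''
<div class="resultMainTable">
    <div class="resultMainRow">
       <div class="resultMainCell">text1</div>
       <div class="resultMainCell">text2</div>
       <div class="resultMainCell">text3</div>
       <div class="resultMainCell"><span><img src="1" /><img src="2" /></span></div>
    </div>
    <div class="resultMainRow">
       <div class="resultMainCell">only one cell</div>
    </div>
</div>'''

records = parse_rows(sample_html)
for record in records:
    print(record['texts'], record['images'])
```

With Selenium, you would pass `driver.page_source` (as in your question) to `parse_rows` instead of the sample string.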

EDIT (Using actual HTML code from your updated question):

from bs4 import BeautifulSoup

html_code = '''<div class="resultMainTable">
   <div class="resultMainRow">
      <div class="resultMainCell_1 tableResult2">
           <a href="javascript:genResultDetails(2);" 
           title="Best of the date">20/006 </a></div>
      <div class="resultMainCell_2 tableResult2">21/01/2020</div>
      <div class="resultMainCell_3 tableResult2"></div>
      <div class="resultMainCell_4 tableResult2">
          <span class="resultMainCellInner"> 
              <img height="25" src="/info/images/icon/no_3abc"> </span>
          <span class="resultMainCellInner"> 
              <img height="25" src = "/info/images/icon/no_14 " ></span>
          <span class="resultMainCellInner"> 
               <img height="25" src "/info/images/icon/no_21 " ></span>
          <span class="resultMainCellInner">
               <img height="25" src="/info/images/icon/no_28 " ></span>
          <span class="resultMainCellInner">
               <img height="25" src=" /info/images/icon/no_37 "></span>
          <span class="resultMainCellInner">   
               <img height="25" src="/info/images/icon/no_44 "></span>
          <span class="resultMainCellInner">             
               <img height="6" src="/info/images/icon_happy " ></span>
          <span class="resultMainCellInner" 
               <img height="25" src="/info/images/icon/smile "></span>
    </div>
       </div>'''


soup = BeautifulSoup(html_code, 'html.parser')

for row in soup.select('div.resultMainTable .resultMainRow'):
    cell1, cell2, cell3, cell4 = row.select('div[class^="resultMainCell"]')
    image_sources = [img['src'] for img in cell4.select('img')]
    print(cell1.text, cell2.text, cell3.text, *image_sources)

The updated output would be:

20/006  21/01/2020  /info/images/icon/no_3abc /info/images/icon/no_14   /info/images/icon/no_28   /info/images/icon/no_37  /info/images/icon/no_44  /info/images/icon_happy
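
Note that the output lists fewer than 8 image paths, and some of them carry stray spaces. This appears to come from the markup in the question itself: one img has `src "/info/..."` without an `=`, the last span is missing its closing `>`, and several src values have leading or trailing whitespace inside the quotes. A more defensive way to read the attribute might look like the sketch below. The helper name `clean_image_sources` and the shortened sample markup are assumptions for illustration.

```python
from bs4 import BeautifulSoup


def clean_image_sources(cell):
    """Collect non-empty, whitespace-stripped src values from a cell."""
    sources = []
    for img in cell.select('img'):
        # .get() returns None instead of raising KeyError when src is absent.
        src = (img.get('src') or '').strip()
        if src:
            sources.append(src)
    return sources


# Shortened sample: one img has no src attribute at all,
# the others have padded src values.
sample_html = '''
<div class="resultMainRow">
  <div class="resultMainCell_4 tableResult2">
    <span class="resultMainCellInner"><img height="25" src="/info/images/icon/no_14 "></span>
    <span class="resultMainCellInner"><img height="25"></span>
    <span class="resultMainCellInner"><img height="25" src=" /info/images/icon/no_37 "></span>
  </div>
</div>'''

soup = BeautifulSoup(sample_html, 'html.parser')
cell = soup.select_one('div[class^="resultMainCell_4"]')
print(clean_image_sources(cell))
```

This only tidies up what the parser can see. An img swallowed by a malformed tag in the page source will still be missing, so it is worth checking the number of sources per row against the 8 you expect.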
