What's the best way to loop through a complete web table using Beautiful Soup?

I need help scraping a web table with Selenium and BeautifulSoup. The table has 10 'resultMainRow' elements, each with 4 'resultMainCell' elements. The 4th resultMainCell of every row contains 8 spans, each wrapping an img with a src attribute. Below is an excerpt of the HTML for one table row, showing only the relevant parts of the source. How can I iterate over the entire table and extract the img src attributes?

<div class="resultMainTable">
   <div class="resultMainRow">
      <div class="resultMainCell_1 tableResult2">
           <a href="javascript:genResultDetails(2);" 
           title="Best of the date">20/006 </a></div>
      <div class="resultMainCell_2 tableResult2">21/01/2020</div>
      <div class="resultMainCell_3 tableResult2"></div>
      <div class="resultMainCell_4 tableResult2">
          <span class="resultMainCellInner"> 
              <img height="25" src="/info/images/icon/no_3abc"> </span>
          <span class="resultMainCellInner"> 
              <img height="25" src = "/info/images/icon/no_14 " ></span>
          <span class="resultMainCellInner"> 
               <img height="25" src "/info/images/icon/no_21 " ></span>
          <span class="resultMainCellInner">
               <img height="25" src="/info/images/icon/no_28 " ></span>
          <span class="resultMainCellInner">
               <img height="25" src=" /info/images/icon/no_37 "></span>
          <span class="resultMainCellInner">   
               <img height="25" src="/info/images/icon/no_44 "></span>
          <span class="resultMainCellInner">             
               <img height="6" src="/info/images/icon_happy " ></span>
          <span class="resultMainCellInner" 
               <img height="25" src="/info/images/icon/smile "></span>
    </div>
       </div>

The table consists of 10 'resultMainRow' elements, each with 4 'resultMainCell' elements. Within the 4th resultMainCell, there are 8 spans, each containing an img with a src attribute.

My current code snippet for this task is:

soup = BeautifulSoup(driver.page_source, 'lxml')
         sixsix = soup.findAll("div", {"class": "resultMainTable"})
         print (sixsix)

        for row in sixsix:
            images = soup.findAll('img')
            for image in images:
                if len(images) == 8:
                aaa = images[1].find('src')
                bbb = images[2].find('src')
                ccc = images[3].find('src')
                ddd = images[4].find('src')
                eee = images[5].find('src')
                fff = images[6].find('src')
                ggg = images[7].find('src')
                hhh = images[8].find('src')
                print ((row.text), (image('src')))

Answer №1

Here is a script that iterates over each row of the table, extracts the text of the first three cells, and collects the src attributes of the 8 images in the fourth cell:

from bs4 import BeautifulSoup

html_code = '''
<div class="resultMainTable">
    <div class="resultMainRow">
       <div class="resultMainCell">text1</div>
       <div class="resultMainCell">text2</div>
       <div class="resultMainCell">text3</div>
       <div class="resultMainCell">
            <div>
                 <div>
                      <span>
                           <img src="1" />
                           <img src="2" />
                           <img src="3" />
                           <img src="4" />
                           <img src="5" />
                           <img src="6" />
                           <img src="7" />
                           <img src="8" />
                      </span>
                 </div>
            </div>
       </div>
    </div>
    <div class="resultMainRow">
       <div class="resultMainCell">text3</div>
       <div class="resultMainCell">text4</div>
       <div class="resultMainCell">text5</div>
       <div class="resultMainCell">
            <div>
                 <div>
                      <span>
                           <img src="9" />
                           <img src="10" />
                           <img src="11" />
                           <img src="12" />
                           <img src="13" />
                           <img src="14" />
                           <img src="15" />
                           <img src="16" />
                      </span>
                 </div>
            </div>
       </div>
    </div>
</div>'''

soup = BeautifulSoup(html_code, 'html.parser')

for row in soup.select('div.resultMainTable .resultMainRow'):
    cell1, cell2, cell3, cell4 = row.select('div.resultMainCell')
    image_sources = [img['src'] for img in cell4.select('img')]
    print(cell1.text, cell2.text, cell3.text, *image_sources)

The output will be:

text1 text2 text3 1 2 3 4 5 6 7 8
text3 text4 text5 9 10 11 12 13 14 15 16
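
As for why the snippet in the question printed nothing useful: `find('src')` searches for a descendant tag named `src`, not an attribute, so it returns `None`. Attributes are read with subscript syntax or `.get()`. A minimal sketch of the difference, using a made-up one-image fragment (the `sample` markup is an assumption for illustration only):

```python
from bs4 import BeautifulSoup

# Hypothetical fragment, used only to illustrate attribute access.
sample = '<span><img height="25" src="/info/images/icon/no_14"></span>'
img = BeautifulSoup(sample, 'html.parser').find('img')

as_child_tag = img.find('src')   # looks for a <src> child tag and finds none
as_attribute = img['src']        # raises KeyError if the attribute is absent
as_optional = img.get('alt')     # returns None instead of raising

print(as_child_tag, as_attribute, as_optional)
```

Note also that Python lists are zero-indexed, so a list of 8 images is addressed as `images[0]` through `images[7]`.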

EDIT (Using actual HTML code from your updated question):

from bs4 import BeautifulSoup

html_code = '''<div class="resultMainTable">
   <div class="resultMainRow">
      <div class="resultMainCell_1 tableResult2">
           <a href="javascript:genResultDetails(2);" 
           title="Best of the date">20/006 </a></div>
      <div class="resultMainCell_2 tableResult2">21/01/2020</div>
      <div class="resultMainCell_3 tableResult2"></div>
      <div class="resultMainCell_4 tableResult2">
          <span class="resultMainCellInner"> 
              <img height="25" src="/info/images/icon/no_3abc"> </span>
          <span class="resultMainCellInner"> 
              <img height="25" src = "/info/images/icon/no_14 " ></span>
          <span class="resultMainCellInner"> 
               <img height="25" src "/info/images/icon/no_21 " ></span>
          <span class="resultMainCellInner">
               <img height="25" src="/info/images/icon/no_28 " ></span>
          <span class="resultMainCellInner">
               <img height="25" src=" /info/images/icon/no_37 "></span>
          <span class="resultMainCellInner">   
               <img height="25" src="/info/images/icon/no_44 "></span>
          <span class="resultMainCellInner">             
               <img height="6" src="/info/images/icon_happy " ></span>
          <span class="resultMainCellInner" 
               <img height="25" src="/info/images/icon/smile "></span>
    </div>
       </div>'''


soup = BeautifulSoup(html_code, 'html.parser')

for row in soup.select('div.resultMainTable .resultMainRow'):
    cell1, cell2, cell3, cell4 = row.select('div[class^="resultMainCell"]')
    image_sources = [img['src'] for img in cell4.select('img')]
    print(cell1.text, cell2.text, cell3.text, *image_sources)

The updated output would be:

20/006  21/01/2020  /info/images/icon/no_3abc /info/images/icon/no_14   /info/images/icon/no_28   /info/images/icon/no_37  /info/images/icon/no_44  /info/images/icon_happy 
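
In that output the third image prints as an empty string and the eighth is absent, most likely because the pasted excerpt has one `src` without an equals sign and one span whose opening tag is never closed. Assuming the real page may contain similar gaps, a hedged variant that strips whitespace and skips rows or images it cannot read might look like this (the helper name `extract_rows` and the sample markup are made up for illustration):

```python
from bs4 import BeautifulSoup

def extract_rows(html):
    """Return one (cell_texts, image_sources) tuple per table row."""
    soup = BeautifulSoup(html, 'html.parser')
    rows = []
    for row in soup.select('div.resultMainTable div.resultMainRow'):
        cells = row.select('div[class^="resultMainCell"]')
        if len(cells) < 4:
            continue  # skip rows without the expected four cells
        texts = [cell.get_text(strip=True) for cell in cells[:3]]
        sources = []
        for img in cells[3].select('img'):
            src = (img.get('src') or '').strip()
            if src:  # ignore missing or empty src values
                sources.append(src)
        rows.append((texts, sources))
    return rows

# Hypothetical markup modeled on the excerpt in the question.
sample = '''
<div class="resultMainTable">
  <div class="resultMainRow">
    <div class="resultMainCell_1 tableResult2"><a href="#">20/006 </a></div>
    <div class="resultMainCell_2 tableResult2">21/01/2020</div>
    <div class="resultMainCell_3 tableResult2"></div>
    <div class="resultMainCell_4 tableResult2">
      <span class="resultMainCellInner"><img height="25" src="/info/images/icon/no_14 "></span>
      <span class="resultMainCellInner"><img height="25" src></span>
      <span class="resultMainCellInner"><img height="25" src=" /info/images/icon/no_37 "></span>
    </div>
  </div>
</div>'''

for texts, sources in extract_rows(sample):
    print(texts, sources)
```

With Selenium, the same function could be fed the live page as `extract_rows(driver.page_source)`, as in the question.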
