Can HTML comments prevent Selenium from running?

Question

Currently, I am attempting to scrape a website with the following layout:

<div>
<!---->
<!---->
<!---->
   <div class="block">
      <h3 class="subtitle is-4"></h3>
      ...
      ...
   </div>
<!---->
</div>

However, I cannot extract the content within the <div class="block"> element. Does Selenium have trouble with this HTML structure?

You can find the website here: Link

Below is my Python script:

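from selenium.webdriver.common.by import By
from selenium.webdriver.remote.webdriver import WebDriver
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

CSS_EXP = "div.submission-viewer-card-content div"

def scrape(url: str) -> None:
    driver = _get_driver()  # helper (not shown) that returns a configured WebDriver
    driver.get(url)  # originally self.URL; passed in here so the function stands alone
    driver.maximize_window()
    # Wait up to 10 seconds for the first matching div to be present in the DOM.
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, CSS_EXP))
    )
    fetch(driver=driver)

def fetch(driver: WebDriver) -> None:
    rows = driver.find_elements(By.CSS_SELECTOR, CSS_EXP)
    print("rows", rows)
    for row in rows:
        # Scroll each row into view before reading it.
        driver.execute_script("arguments[0].scrollIntoView(false);", row)
        print(
            "- text: \n",
            row.text,
            "\n- tag_name: \n",
            row.tag_name,
            "\n- inner HTML: \n",
            row.get_attribute("innerHTML"),
            "\n- element: \n",
            row,
        )
        print("Script execution: ", driver.execute_script("return arguments[0].innerHTML;", row))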

Here is the resulting output:

rows [<selenium.webdriver.remote.webelement.WebElement (session="bcee6497731314cd7030e6f9b9b2f823", element="2E7B7723420E954F7D13C37F2AF0EF31_element_45")>, <selenium.webdriver.remote.webelement.WebElement (session="bcee6497731314cd7030e6f9b9b2f823", element="2E7B7723420E954F7D13C37F2AF0EF31_element_46")>]
- text: 
 Nombre: Planillas
Descripción: Listado del personal de la entidad con su respectivo salario, código de cargo y otros.
Periodo: 4/2023
Entregado en:
- tag_name:
 div
- inner HTML:
 <p><strong>Nombre:</strong> Planillas</p><p><strong>Descripción:</strong> Listado del personal de la entidad con su respectivo salario, código de cargo y otros.</p><p><strong>Periodo:</strong> 4/2023</p><p><strong>Entregado en:</strong> </p>
- element:
 <selenium.webdriver.remote.webelement.WebElement (session="bcee6497731314cd7030e6f9b9b2f823", element="2E7B7723420E954F7D13C37F2AF0EF31_element_45")>
Script execution:  <p><strong>Nombre:</strong> Planillas</p><p><strong>Descripción:</strong> Listado del personal de la entidad con su respectivo salario, código de cargo y otros.</p><p><strong>Periodo:</strong> 4/2023</p><p><strong>Entregado en:</strong> </p>
- text: 
  
- tag_name:
 div
- inner HTML:
 <!----><!----><!----><!---->
- element:
 <selenium.webdriver.remote.webelement.WebElement (session="bcee6497731314cd7030e6f9b9b2f823", element="2E7B7723420E954F7D13C37F2AF0EF31_element_46")>
Script execution:  <!----><!----><!----><!---->

I have also attempted to download the HTML and utilize BeautifulSoup for scraping, but unfortunately, it did not yield successful results.

python selenium-webdriver web-scraping selenium-chromedriver webdriver

Answer №1

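If you want to extract comments from a webpage, note that Selenium's find_element can only return element nodes, so passing it an XPath expression like //comment() raises InvalidSelectorException. Instead, locate the parent element and collect its comment nodes (nodeType 8) with JavaScript:

parent = driver.find_element(By.XPATH, '(your_path_to_comment_in_dom)')
script = "return [...arguments[0].childNodes].filter(n => n.nodeType === 8).map(n => n.nodeValue);"
comment_contents = driver.execute_script(script, parent)  # list of strings, direct children only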

Here is a helpful reference for more information.
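
If the comments only need to be read, not located as WebElements, another option is to parse the HTML text itself. The following is a minimal sketch using only the standard library's html.parser; CommentCollector and extract_comments are illustrative names, and the sample string stands in for what driver.page_source would return.

```python
from html.parser import HTMLParser


class CommentCollector(HTMLParser):
    """Collects the text of every HTML comment fed to the parser."""

    def __init__(self) -> None:
        super().__init__()
        self.comments: list[str] = []

    def handle_comment(self, data: str) -> None:
        # Called once per <!-- ... --> node; data excludes the delimiters.
        self.comments.append(data)


def extract_comments(html: str) -> list[str]:
    """Return the contents of all comments in the given HTML string."""
    parser = CommentCollector()
    parser.feed(html)
    parser.close()
    return parser.comments


# Stand-in for driver.page_source (assumption: a live page would be used instead).
sample = '<div><!----><!-- note --><div class="block"></div><!----></div>'
print(extract_comments(sample))
```

Empty placeholders such as <!----> come back as empty strings, which makes it easy to see that they carry no data of their own.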

Answer №2

The code below adds an input() call to control when the data is fetched. This is necessary because

WebDriverWait(self.driver, 10).until(EC.presence_of_element_located(...))

returns as soon as a matching element exists in the DOM, which can happen before the page has finished rendering its content. Wait until the page has fully loaded, then press Enter in the Python terminal:

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

class Scraper:
    def __init__(self, url):
        self.URL = url
        self.driver = webdriver.Chrome()
    
    def scrape(self):
        self.driver.get(self.URL)
        self.driver.maximize_window()
        WebDriverWait(self.driver, 10).until(
            EC.presence_of_element_located(
                (By.CSS_SELECTOR, "div.submission-viewer-card-content div")
            )
        )
        input("Wait until page loads completely, then press enter...")
        self.fetch()

    def fetch(self):
        css_exp = "div.submission-viewer-card-content div"
        rows = self.driver.find_elements(By.CSS_SELECTOR, css_exp)
        for row in rows:
            self.driver.execute_script("arguments[0].scrollIntoView(false);", row)
            print("- text:\n", row.text,
                  "\n- tag_name:\n", row.tag_name,
                  "\n- inner HTML:\n", row.get_attribute("innerHTML"),
                  "\n- element:\n", row,
                  "\nScript execution:", self.driver.execute_script("return arguments[0].innerHTML;", row))

    def close(self):
        self.driver.quit()

if __name__ == "__main__":
    url = "https://monitoreo.antai.gob.pa/transparencia/97/6-2023/entregas/2320"
    scraper = Scraper(url)
    scraper.scrape()
    scraper.close()
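
The manual input() pause could in principle be replaced by an explicit wait on the rendered content. The sketch below assumes the data ends up in a div.block inside the card and that non-empty text means rendering has finished; the selector, rendered_block, and wait_for_block are illustrative names, not part of the original code.

```python
from typing import Any, Callable

# Assumption: the rendered data lives in a div.block inside the card content.
BLOCK_SELECTOR = "div.submission-viewer-card-content div.block"


def rendered_block(selector: str = BLOCK_SELECTOR) -> Callable[[Any], Any]:
    """Build a wait condition that passes once `selector` matches an element with text."""

    def condition(driver: Any) -> Any:
        # "css selector" is the string value of By.CSS_SELECTOR.
        for element in driver.find_elements("css selector", selector):
            if element.text.strip():
                return element  # truthy result ends the wait
        return False  # keep polling

    return condition


def wait_for_block(driver: Any, timeout: float = 30) -> Any:
    """Poll until the block has rendered; raises TimeoutException otherwise."""
    # Imported here so rendered_block can be exercised without Selenium installed.
    from selenium.webdriver.support.ui import WebDriverWait

    return WebDriverWait(driver, timeout).until(rendered_block())
```

In the Scraper class above, calling wait_for_block(self.driver) in place of input(...) would be one way to try this; the timeout may need tuning for slow pages.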
