Can HTML comments prevent Selenium from running?

Currently, I am attempting to scrape a website with the following layout:

<div>
<!---->
<!---->
<!---->
   <div class="block">
      <h3 class="subtitle is-4"></h3>
      ...
      ...
   </div>
<!---->
</div>

However, I cannot extract the content that sits between the <!----> comments. Does Selenium have trouble with this kind of HTML structure?

You can find the website here: Link

Below is my Python script:

def scrape(self):
        driver = _get_driver()
        driver.get(self.URL)
        driver.maximize_window()
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located(
                (By.CSS_SELECTOR, "div.submission-viewer-card-content div")
            )
        )
        fetch(driver=driver)

def fetch(driver: webdriver):
        css_exp = "div.submission-viewer-card-content div"
        rows = driver.find_elements(By.CSS_SELECTOR, css_exp)  # type: ignore
        print("rows", rows)
        for row in rows:
            driver.execute_script(  
                "arguments[0].scrollIntoView(false);", row
            )
            print(
                "- text: \n",
                row.text,
                "\n- tag_name: \n",
                row.tag_name,
                "\n- inner HTML: \n",
                row.get_attribute("innerHTML"),
                "\n- element: \n",
                row,
            )
            print("Script execution: ", driver.execute_script("return arguments[0].innerHTML;", row))

Here is the resulting output:

rows [<selenium.webdriver.remote.webelement.WebElement (session="bcee6497731314cd7030e6f9b9b2f823", element="2E7B7723420E954F7D13C37F2AF0EF31_element_45")>, <selenium.webdriver.remote.webelement.WebElement (session="bcee6497731314cd7030e6f9b9b2f823", element="2E7B7723420E954F7D13C37F2AF0EF31_element_46")>]
- text: 
 Nombre: Planillas
Descripción: Listado del personal de la entidad con su respectivo salario, código de cargo y otros.
Periodo: 4/2023
Entregado en:
- tag_name:
 div
- inner HTML:
 <p><strong>Nombre:</strong> Planillas</p><p><strong>Descripción:</strong> Listado del personal de la entidad con su respectivo salario, código de cargo y otros.</p><p><strong>Periodo:</strong> 4/2023</p><p><strong>Entregado en:</strong> </p>
- element:
 <selenium.webdriver.remote.webelement.WebElement (session="bcee6497731314cd7030e6f9b9b2f823", element="2E7B7723420E954F7D13C37F2AF0EF31_element_45")>
Script execution:  <p><strong>Nombre:</strong> Planillas</p><p><strong>Descripción:</strong> Listado del personal de la entidad con su respectivo salario, código de cargo y otros.</p><p><strong>Periodo:</strong> 4/2023</p><p><strong>Entregado en:</strong> </p>
- text: 
  
- tag_name:
 div
- inner HTML:
 <!----><!----><!----><!---->
- element:
 <selenium.webdriver.remote.webelement.WebElement (session="bcee6497731314cd7030e6f9b9b2f823", element="2E7B7723420E954F7D13C37F2AF0EF31_element_46")>
Script execution:  <!----><!----><!----><!---->

I also tried downloading the HTML and parsing it with BeautifulSoup, but that did not work either.

Answer №1

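If you want to extract comments from a webpage, you can target them with an XPath expression such as //comment(), as in the snippet below. Be aware that Selenium locators are meant to return elements, so find_element may reject an XPath that resolves to a comment node (typically with an InvalidSelectorException).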

comment_element = driver.find_element(By.XPATH, '(your_path_to_comment_in_dom)//comment()')
comment_content = comment_element.get_attribute('textContent')

Here is a helpful reference for more information.
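If locating comment nodes through Selenium proves awkward, one alternative is to parse the rendered markup yourself. The following is only a sketch using the standard library's html.parser. The names CommentCollector and extract_comments are made up for illustration, and the sample string stands in for driver.page_source.

```python
from html.parser import HTMLParser


class CommentCollector(HTMLParser):
    """Collects the text of every HTML comment the parser encounters."""

    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        # `data` is whatever sits between "<!--" and "-->"
        self.comments.append(data)


def extract_comments(html: str) -> list:
    """Return the contents of all comments found in `html`, in document order."""
    collector = CommentCollector()
    collector.feed(html)
    collector.close()
    return collector.comments


# Stand-in for driver.page_source (assumption: markup shaped like the question's layout)
sample = (
    '<div><!----><!---->'
    '<div class="block"><h3 class="subtitle is-4"></h3></div>'
    '<!----></div>'
)
print(extract_comments(sample))
```

For markup like the question's, every collected comment is an empty string, which suggests the <!----> nodes are placeholders rather than containers holding hidden data.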

Answer №2

The code below adds an input() call to control when the data is fetched. This is necessary because WebDriverWait(self.driver, 10).until(EC.presence_of_element_located(...)) does not always work as expected: it can return as soon as a matching div is present, before the rest of the content has been rendered. Wait until the page has fully loaded, then press Enter in the Python terminal:

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

class Scraper:
    def __init__(self, url):
        self.URL = url
        self.driver = webdriver.Chrome()
    
    def scrape(self):
        self.driver.get(self.URL)
        self.driver.maximize_window()
        WebDriverWait(self.driver, 10).until(
            EC.presence_of_element_located(
                (By.CSS_SELECTOR, "div.submission-viewer-card-content div")
            )
        )
        input("Wait until page loads completely, then press enter...")
        self.fetch()

    def fetch(self):
        css_exp = "div.submission-viewer-card-content div"
        rows = self.driver.find_elements(By.CSS_SELECTOR, css_exp)
        for row in rows:
            self.driver.execute_script("arguments[0].scrollIntoView(false);", row)
            print("- text:\n", row.text,
                  "\n- tag_name:\n", row.tag_name,
                  "\n- inner HTML:\n", row.get_attribute("innerHTML"),
                  "\n- element:\n", row,
                  "\nScript execution:", self.driver.execute_script("return arguments[0].innerHTML;", row))

    def close(self):
        self.driver.quit()

if __name__ == "__main__":
    url = "https://monitoreo.antai.gob.pa/transparencia/97/6-2023/entregas/2320"
    scraper = Scraper(url)
    scraper.scrape()
    scraper.close()
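
If pressing Enter by hand is not practical, the pause could be replaced by an explicit wait on the content itself. This is a rough sketch, not a tested fix for this site. The selector "div.submission-viewer-card-content div.block" is an assumption based on the layout shown in the question, and block_is_rendered and wait_for_blocks are hypothetical helper names.

```python
# Assumed selector: the "block" div from the question's layout, inside the card content.
BLOCK_SELECTOR = "div.submission-viewer-card-content div.block"


def block_is_rendered(driver):
    """Return the blocks that contain visible text, or False if there are none yet."""
    # "css selector" is the string value of selenium's By.CSS_SELECTOR
    blocks = driver.find_elements("css selector", BLOCK_SELECTOR)
    rendered = [block for block in blocks if block.text.strip()]
    return rendered or False


def wait_for_blocks(driver, timeout=30):
    """Poll until block_is_rendered is truthy; raises TimeoutException otherwise."""
    # Imported here so the predicate above can be tried out without a browser.
    from selenium.webdriver.support.ui import WebDriverWait

    return WebDriverWait(driver, timeout).until(block_is_rendered)
```

In the Scraper class above, calling wait_for_blocks(self.driver) in place of input(...) would be the intended usage. WebDriverWait.until accepts any callable that takes the driver, so the predicate does not need to come from expected_conditions.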
