Troubleshooting issues with loading scripts in a Scrapy (Python) web scraper for a React/TypeScript application

Question

I've been facing challenges with web scraping a specific webpage (beachvolleyball.nrw). In the past couple of days, I've experimented with various libraries but have struggled to get the script-tags to load properly.

In the browser's developer tools I can see the data being retrieved when I select a tournament, but I haven't managed to replicate this with Selenium or other tools.

The elements I'm trying to scrape can be viewed here: https://i.stack.imgur.com/dXAu1.png

You can find the elements in the DOM here: https://i.stack.imgur.com/BExii.png

I've attempted multiple strategies without much luck. It would probably help most if you looked at the DOM directly while the page is loading and helped me retrieve the data, using Splash 3.5 or any other solution you prefer :)

Thank you for your assistance! In the meantime, I'll continue my efforts to resolve this issue.

TLDR: I'm having trouble loading scripts from this website using Splash or alternative methods. Navigating within the DOM is not the problem!

python reactjs selenium scrapy splash-screen

Answer №1

The page is generated by JavaScript, so use Selenium to wait until the table has loaded before extracting values from it.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

TABLE = ".table-tournaments.table.table-hover"

# Selenium 4 style: the find_elements_by_* helpers and the executable_path
# argument are gone; recent releases locate chromedriver themselves.
driver = webdriver.Chrome()
try:
    driver.get("https://www.beachvolleyball.nrw/")
    # Wait up to 10 seconds for the JavaScript-rendered table to become visible.
    WebDriverWait(driver, 10).until(
        EC.visibility_of_element_located((By.CSS_SELECTOR, TABLE))
    )
    rows = driver.find_elements(By.CSS_SELECTOR, TABLE + ">tbody>tr")
    for index, row in enumerate(rows):
        # Header rows (month names) use <th>; tournament rows use <td>.
        for tag in ("th", "td"):
            cells = row.find_elements(By.XPATH, "./" + tag)
            if cells:
                print("Row number: " + str(index))
                for cell in cells:
                    print(cell.text)
                print("====================================")
finally:
    # Close the browser even if the wait times out.
    driver.quit()

Output in Console:

Row number: 0
OCTOBER 2020
====================================
Row number: 1
03.10. Sat.

C
Hürth
30/32

====================================
Row number: 2
03.10. Sat.

C
Münster
2/12

====================================
Row number: 3
03.10. Sat.

S
Brühl Seniors
3/8

====================================
Row number: 4
03.10. Sat.

S
Brühl Seniors
6/8

====================================
Row number: 5
04.10. Sun.

C
Hürth
11/16

...

====================================
Row number: 13
31.12. Thu.

B
Beachliga Castrop-Rauxel
29/35

====================================
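
If you would rather collect the rows than print them, here is a minimal sketch that groups the cell texts into dictionaries. It assumes each row has already been turned into a list of strings, for example with [cell.text for cell in row.find_elements(By.XPATH, "./*")]. The function name rows_to_records and the field names (date, category, name, slots) are my own guesses based on the console output above, not names used by the site.

```python
def rows_to_records(rows):
    """Group lists of cell texts into dicts, tagging each with the last month header seen."""
    records = []
    month = None
    for cells in rows:
        # Drop the empty cells that show up as blank lines in the console output.
        texts = [text.strip() for text in cells if text.strip()]
        if len(texts) == 1:
            # A single non-empty cell is a month header such as "OCTOBER 2020".
            month = texts[0]
        elif len(texts) >= 4:
            # Assumed column order: date, category, name, slots (e.g. "30/32").
            date, category, name, slots = texts[:4]
            records.append({
                "month": month,
                "date": date,
                "category": category,
                "name": name,
                "slots": slots,
            })
    return records
```

Rows with two or three non-empty cells are skipped, so check the result against the live table before relying on it.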

Answer №2

If the table you are trying to extract is loaded via WebSockets, you can investigate it in the Network tab of your browser's developer tools, which you can usually open by pressing F12 or [CTRL] + [SHIFT] + 'C'. On the "Network" tab you can observe the WebSocket connections and the messages exchanged between the server and your browser. To scrape this data, either use Selenium or connect to the WebSocket directly with a library such as websocket-client.
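
As a rough sketch of the websocket-client route: the snippet below opens a connection, reads a few frames, and decodes any that are JSON. The helper names decode_frame and read_frames are made up for illustration, and the URL in the usage note is a placeholder; copy the real one from the Network tab. I'm also assuming the server starts sending data without a handshake message, which may not hold for this site.

```python
import json


def decode_frame(raw):
    """Return parsed JSON if the frame holds JSON, otherwise the raw frame unchanged."""
    try:
        return json.loads(raw)
    except (TypeError, ValueError):
        return raw


def read_frames(ws_url, max_frames=5, timeout=10):
    """Connect to ws_url and return up to max_frames decoded frames."""
    # Imported here so decode_frame works without websocket-client installed.
    from websocket import create_connection, WebSocketTimeoutException

    ws = create_connection(ws_url, timeout=timeout)
    frames = []
    try:
        while len(frames) < max_frames:
            try:
                frames.append(decode_frame(ws.recv()))
            except WebSocketTimeoutException:
                break  # nothing more arrived within the timeout
    finally:
        ws.close()
    return frames
```

You would call it as read_frames("wss://example.invalid/socket"), replacing the placeholder with the URL shown in the developer tools. If the server expects a subscription message first, send it with ws.send(...) before reading.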
