Troubleshooting issues with loading scripts in a Scrapy (Python) web scraper for a React/TypeScript application

I've been facing challenges with web scraping a specific webpage (beachvolleyball.nrw). In the past couple of days, I've experimented with various libraries but have struggled to get the script-tags to load properly.

Although I can retrieve the data through the developer tools when I select a tournament, I haven't managed to replicate this with Selenium or other tools.

The elements I'm trying to scrape can be viewed here: https://i.stack.imgur.com/dXAu1.png

You can find the elements in the DOM here: https://i.stack.imgur.com/BExii.png

I've attempted multiple strategies without much luck. It might be more helpful to take a look at the DOM directly when the page is loading and assist me in retrieving the data using Splash 3.5 or any other preferred solution :)

Thank you for your assistance! In the meantime, I'll continue my efforts to resolve this issue.

TLDR: I'm having trouble loading scripts from this website using Splash or alternative methods. Navigating within the DOM is not the problem!

Answer №1

The page is generated by JavaScript, so the table is only there once the scripts have run. With Selenium, wait explicitly for the table to become visible before extracting values from it.

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

TABLE = ".table-tournaments.table.table-hover"

# Selenium 4 takes the driver path via Service and uses find_elements(By..., ...)
# in place of the removed executable_path and find_elements_by_* helpers.
driver = webdriver.Chrome(service=Service("/path/to/chromedriver"))
try:
    driver.get("https://www.beachvolleyball.nrw/")
    # Block for up to 10 seconds until the JavaScript-rendered table is visible.
    WebDriverWait(driver, 10).until(
        EC.visibility_of_element_located((By.CSS_SELECTOR, TABLE))
    )
    rows = driver.find_elements(By.CSS_SELECTOR, TABLE + ">tbody>tr")
    for index, row in enumerate(rows):
        # Print header cells (<th>) first, then data cells (<td>), per row.
        for tag in ("th", "td"):
            cells = row.find_elements(By.XPATH, "./" + tag)
            if cells:
                print("Row number: " + str(index))
                for cell in cells:
                    print(cell.text)
                print("====================================")
finally:
    # Always close the browser, even if the wait times out.
    driver.quit()

Output in Console:

Row number: 0
OCTOBER 2020
====================================
Row number: 1
03.10. Sat.

C
Hürth
30/32

====================================
Row number: 2
03.10. Sat.

C
Münster
2/12

====================================
Row number: 3
03.10. Sat.

S
Brühl Seniors
3/8

====================================
Row number: 4
03.10. Sat.

S
Brühl Seniors
6/8

====================================
Row number: 5
04.10. Sun.

C
Hürth
11/16

...

====================================
Row number: 13
31.12. Thu.

B
Beachliga Castrop-Rauxel
29/35

====================================

Answer №2

If the table data is loaded via WebSockets, you can investigate it in the Network tab of your browser's developer tools. The developer tools usually open with F12 or Ctrl+Shift+C. In the "Network" tab you can observe the WebSocket connections and the messages exchanged between the server and your browser. To scrape this data, either use Selenium or connect to the WebSocket directly with a library such as websocket-client.
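A minimal sketch of the websocket-client route follows. It is an illustration under assumptions, not the site's actual protocol: the WebSocket URL, the JSON payload shape, and the "tournaments" key are all hypothetical and must be replaced with what the Network tab actually shows. The helper names extract_rows and fetch_rows are made up for this example.

```python
import json


def extract_rows(raw_message):
    """Return the list stored under "tournaments" in one JSON text frame.

    ASSUMPTION: the "tournaments" key is a hypothetical payload shape.
    Check the real frames in the Network tab and adjust accordingly.
    """
    try:
        payload = json.loads(raw_message)
    except (TypeError, ValueError):
        return []  # not JSON, e.g. a heartbeat or an unexpected frame
    if isinstance(payload, dict):
        rows = payload.get("tournaments", [])
        return rows if isinstance(rows, list) else []
    return []


def fetch_rows(url, max_messages=10, timeout=10):
    """Read up to max_messages frames from url and collect the parsed rows."""
    # Imported here so the parsing helper works without the dependency.
    import websocket  # pip install websocket-client

    ws = websocket.create_connection(url, timeout=timeout)
    rows = []
    try:
        for _ in range(max_messages):
            try:
                message = ws.recv()
            except (
                websocket.WebSocketTimeoutException,
                websocket.WebSocketConnectionClosedException,
            ):
                break  # no more data within the timeout, or server closed
            rows.extend(extract_rows(message))
    finally:
        ws.close()
    return rows
```

To use it, copy the WebSocket URL from the Network tab and call fetch_rows with it, for example fetch_rows("wss://<host>/<path>"). Some servers only send data after the client sends a subscription message; if the Network tab shows such a frame, send the same text with ws.send(...) before the receive loop.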
