Leveraging Selenium for extracting data from a webpage containing JavaScript

I am trying to extract data from a Google Scholar page that has a 'show more' button. From what I have read, the additional rows are not part of the initial HTML; they are loaded by JavaScript when the button is clicked. There are several ways to scrape such pages, and I tried Selenium with the following code:

from selenium import webdriver
from bs4 import BeautifulSoup

options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')
options.add_argument('--incognito')
options.add_argument('--headless')
chrome_path = r"....path....."
driver = webdriver.Chrome(chrome_path)

driver.get("https://scholar.google.com/citations?user=TBcgGIIAAAAJ&hl=en")

driver.find_element_by_xpath('/html/body/div/div[13]/div[2]/div/div[4]/form/div[2]/div/button/span/span[2]').click()

soup = BeautifulSoup(driver.page_source,'html.parser')

papers = soup.find_all('tr',{'class':'gsc_a_tr'})

for paper in papers:
    title = paper.find('a',{'class':'gsc_a_at'}).text
    author = paper.find('div',{'class':'gs_gray'}).text
    journal = [a.text for a in paper.select("td:nth-child(1) > div:nth-child(3)")]

    print('Paper Title:', title, '\nAuthor:', author, '\nJournal:', journal)

Even though the browser now clicks the 'show more' button and displays the complete page, I am only able to retrieve information for the first 20 papers. I am puzzled by this limitation and would appreciate any assistance to resolve it.

Thank you!

Answer №1

I suspect the issue is timing: page_source is read immediately after the click, before the newly requested rows have been added to the page. Have you tried adding a delay so the new elements can finish loading before you scrape? Here's an example without the headless option:

from selenium import webdriver
from selenium.webdriver.common.by import By
import time
from bs4 import BeautifulSoup

options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')
options.add_argument('--incognito')

# Pass the options to the driver; otherwise they are never applied.
driver = webdriver.Chrome(options=options)

driver.get("https://scholar.google.com/citations?user=TBcgGIIAAAAJ&hl=en")
time.sleep(3)  # let the initial page finish loading
# Selenium 4 removed find_element_by_id; use find_element(By.ID, ...) instead.
driver.find_element(By.ID, "gsc_bpf_more").click()
time.sleep(4)  # give the extra rows time to load before reading the page source
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()

papers = soup.find_all('tr', {'class': 'gsc_a_tr'})

for paper in papers:
    title = paper.find('a', {'class': 'gsc_a_at'}).text
    author = paper.find('div', {'class': 'gs_gray'}).text
    journal = [a.text for a in paper.select("td:nth-child(1) > div:nth-child(3)")]

    print('Paper Title:', title, '\nAuthor:', author, '\nJournal:', journal)
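
Fixed sleeps are either longer than needed or too short on a slow connection. As a rough sketch of an alternative, the loop below keeps clicking until the button is disabled and, after each click, polls until the row count has grown. The names load_all_rows and fetch_full_page_source are hypothetical, and the sketch assumes (this is not confirmed in the thread) that Scholar disables the 'show more' button once every paper is listed. Selenium is imported inside the second function so that the helper itself depends only on the standard library.

```python
import time


def load_all_rows(click_more, more_enabled, row_count, timeout=10.0, poll=0.25):
    """Click 'show more' until it is disabled; return the final row count.

    All three arguments are zero-argument callables (a hypothetical
    interface), so the loop can be driven by Selenium or by a stub.
    """
    while more_enabled():
        before = row_count()
        click_more()
        deadline = time.monotonic() + timeout
        # Poll until the click has actually added rows to the table.
        while row_count() <= before:
            if time.monotonic() > deadline:
                return row_count()  # nothing new arrived in time; give up
            time.sleep(poll)
    return row_count()


def fetch_full_page_source(url):
    """Wire the helper to Selenium 4 (assumes Chrome and its driver are installed)."""
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    try:
        driver.get(url)

        def button():
            return driver.find_element(By.ID, "gsc_bpf_more")

        load_all_rows(
            click_more=lambda: button().click(),
            # Assumption: the button is disabled once all rows are shown.
            more_enabled=lambda: button().is_enabled(),
            row_count=lambda: len(driver.find_elements(By.CSS_SELECTOR, "tr.gsc_a_tr")),
        )
        return driver.page_source
    finally:
        driver.quit()
```

The HTML returned by fetch_full_page_source can be handed to BeautifulSoup and parsed with the same find_all('tr', {'class': 'gsc_a_tr'}) loop used in the question.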

Answer №2

import time
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

options = Options()
options.page_load_strategy = 'normal'
driver = webdriver.Chrome(options=options)

driver.get("https://scholar.google.com/citations?user=TBcgGIIAAAAJ&hl=en")

# Click 'show more' twice to load additional articles
for _ in range(2):
    # Selenium 4 removed find_element_by_css_selector; use find_element(By.CSS_SELECTOR, ...)
    driver.find_element(By.CSS_SELECTOR, '#gsc_bpf_more').click()
    # wait for the new rows to load
    time.sleep(3)

# Extract title, authors and publication from each row of the table
for result in driver.find_elements(By.CSS_SELECTOR, '#gsc_a_b .gsc_a_t'):
    title = result.find_element(By.CSS_SELECTOR, '.gsc_a_at').text
    authors = result.find_element(By.CSS_SELECTOR, '.gsc_a_at+ .gs_gray').text
    publication = result.find_element(By.CSS_SELECTOR, '.gs_gray+ .gs_gray').text
    print(title)
    print(authors)
    print(publication)
    # blank line between entries
    print()

driver.quit()

Sample of the retrieved output:

Tax/subsidy policies in the presence of environmentally aware consumers
S Bansal, S Gangopadhyay
Journal of Environmental Economics and Management 45 (2), 333-355

Choice and design of regulatory instruments in the presence of green consumers
S Bansal
Resource and Energy economics 30 (3), 345-368
