Extract data from multiple pages that share a static URL

I previously asked for help navigating multiple pages that share a static URL, and I appreciate the assistance so far! My goal now is to extract the ethnicity information for every celebrity listed by clicking on each name. I can navigate through all the pages, but my code only scrapes data from the first page.

Here's what I have attempted:

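import time

import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

url = 'https://ethnicelebs.com/all-celeb'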
driver = webdriver.Chrome()
driver.get(url)
while True:

    page = requests.post('https://ethnicelebs.com/all-celebs')
    soup = BeautifulSoup(page.text, 'html.parser')
    for href in soup.find_all('a', href=True)[18:]:
        print('Found the URL:{}'.format(href['href']))
        request_href = requests.get(href['href'])
        soup2 = BeautifulSoup(request_href.content)
        for each in soup2.find_all('strong')[:-1]:
            print(each.text)

    Next_button = (By.XPATH, "//*[@title='Go to next page']")
    WebDriverWait(driver, 50).until(EC.element_to_be_clickable(Next_button)).click()
    url = driver.current_url
    time.sleep(5)

(Credit goes to @Sureshmani!)

My objective is for the code to scrape data from each subsequent page as it navigates instead of being limited to the first page. How can I ensure that the scraping process continues with each new page? Thank you!

Answer №1

It seems there was some confusion in understanding your question, possibly due to the nested loop in the previous response. Here is an improved approach that should address the issue:

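import time

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

url = 'https://ethnicelebs.com/all-celeb'
driver = webdriver.Chrome()
driver.get(url)  # load the listing once; its URL stays the same on every page
listing = driver.current_window_handle
while True:
    # Parse whatever listing page the browser is currently showing
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    for href in soup.find_all('a', href=True)[18:]:
        print('Found the URL:{}'.format(href['href']))
        # Open the detail page in a separate tab (Selenium 4) so the listing keeps its position
        driver.switch_to.new_window('tab')
        driver.get(href['href'])
        soup2 = BeautifulSoup(driver.page_source, 'html.parser')
        for each in soup2.find_all('strong')[:-1]:
            print(each.text)
        driver.close()
        driver.switch_to.window(listing)
    Next_button = (By.XPATH, "//*[@title='Go to next page']")
    try:
        WebDriverWait(driver, 50).until(EC.element_to_be_clickable(Next_button)).click()
    except TimeoutException:
        break  # no clickable next-page button, so assume this was the last page
    time.sleep(5)  # give the next listing page time to render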

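Your original code opened the page with selenium but then fetched the listing with requests on every iteration. Because requests always asked for the same URL, soup was built from the first page no matter which page the browser was showing. To navigate and scrape at the same time, stick to selenium and parse driver.page_source, as shown above.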
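The control flow behind this fix can be sketched independently of the browser. The following is a minimal illustration, not part of the original answer: the names scrape_paginated, get_source, go_to_next_page, parse and FakeBrowser are all hypothetical, and the fake browser only stands in for a real driver so the loop can be run without a network connection.

```python
from typing import Callable, Iterator, List


def scrape_paginated(
    get_source: Callable[[], str],
    go_to_next_page: Callable[[], bool],
    parse: Callable[[str], List[str]],
    max_pages: int = 1000,
) -> Iterator[str]:
    """Yield parsed items from every page of a listing whose URL never changes.

    get_source      -- returns the HTML currently displayed
    go_to_next_page -- advances the listing; returns False when there is no next page
    parse           -- turns one page of HTML into a list of items
    """
    for _ in range(max_pages):
        # Re-read the source on every iteration. Requesting the same URL
        # again instead would always return the first page.
        yield from parse(get_source())
        if not go_to_next_page():
            break


class FakeBrowser:
    """In-memory stand-in for a browser, used only to show the control flow."""

    def __init__(self, pages: List[str]) -> None:
        self.pages = pages
        self.index = 0

    def source(self) -> str:
        return self.pages[self.index]

    def next_page(self) -> bool:
        if self.index + 1 >= len(self.pages):
            return False
        self.index += 1
        return True


browser = FakeBrowser(["a,b", "c", "d,e"])
items = list(
    scrape_paginated(browser.source, browser.next_page, lambda html: html.split(","))
)
print(items)  # ['a', 'b', 'c', 'd', 'e']
```

With selenium, get_source would presumably be lambda: driver.page_source, and go_to_next_page would wrap the WebDriverWait click, returning False when the wait times out.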
