extract data from numerous pages using fixed link

Question

I previously asked for help navigating multiple pages that share a static URL, and I appreciate the assistance received so far! My current goal is to extract the ethnicity information for every listed character by clicking on each name. Although I can navigate through all the pages, my code only scrapes data from the first page.

Here's what I have attempted:

url = 'https://ethnicelebs.com/all-celeb'
driver = webdriver.Chrome()
driver.get(url)
while True:

    page = requests.post('https://ethnicelebs.com/all-celebs')
    soup = BeautifulSoup(page.text, 'html.parser')
    for href in soup.find_all('a', href=True)[18:]:
        print('Found the URL:{}'.format(href['href']))
        request_href = requests.get(href['href'])
        soup2 = BeautifulSoup(request_href.content)
        for each in soup2.find_all('strong')[:-1]:
            print(each.text)

    Next_button = (By.XPATH, "//*[@title='Go to next page']")
    WebDriverWait(driver, 50).until(EC.element_to_be_clickable(Next_button)).click()
    url = driver.current_url
    time.sleep(5)

(Credit goes to @Sureshmani!)

My objective is for the code to scrape data from each subsequent page as it navigates instead of being limited to the first page. How can I ensure that the scraping process continues with each new page? Thank you!

python selenium web-scraping

Answer 1

It seems there was some confusion in understanding your question, possibly due to the nested loop in the previous response. Here is an improved approach that should address the issue:

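import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

url = 'https://ethnicelebs.com/all-celeb'
driver = webdriver.Chrome()
driver.get(url)  # load the listing once; reloading it would reset the pagination
listing = driver.current_window_handle
while True:
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    # [18:] skips the leading anchors, as in the question's code
    for href in soup.find_all('a', href=True)[18:]:
        print('Found the URL:{}'.format(href['href']))
        driver.switch_to.new_window('tab')  # Selenium 4+: leave the listing tab untouched
        driver.get(href['href'])
        soup2 = BeautifulSoup(driver.page_source, 'html.parser')
        for each in soup2.find_all('strong')[:-1]:
            print(each.text)
        driver.close()
        driver.switch_to.window(listing)

    next_button = (By.XPATH, "//*[@title='Go to next page']")
    try:
        WebDriverWait(driver, 50).until(EC.element_to_be_clickable(next_button)).click()
    except TimeoutException:
        break  # no clickable "next" button, so assume the last page was reached
    time.sleep(5)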

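In your code, you opened the site with Selenium but then fetched the listing with requests. The two are independent: clicking "next" in the Selenium browser has no effect on what requests.post returns, so every iteration parsed the first page again. To navigate and scrape at the same time, read the HTML from driver.page_source and stick to Selenium, as shown above.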
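
One way to make this pattern easier to reason about is to separate the pagination loop from the browser details. The sketch below is only an illustration under some assumptions: crawl_listing and run_with_selenium are hypothetical names, a second Chrome window is used for the detail pages so the listing window keeps its pagination state, and the [18:] and [:-1] slices are simply carried over from the question's code.

```python
import time
from urllib.parse import urljoin


def crawl_listing(driver, get_links, scrape_detail, go_to_next_page):
    """Scrape every detail page linked from a paginated listing.

    Hypothetical helper for this sketch. `driver` only needs `page_source`
    and `current_url`; the other three arguments are callables.
    """
    results = {}
    while True:
        # Read links from whatever page the browser shows right now,
        # rather than re-requesting the fixed listing URL.
        base = driver.current_url
        for href in get_links(driver.page_source):
            link = urljoin(base, href)  # tolerate relative hrefs
            if link not in results:
                results[link] = scrape_detail(link)
        # Advance the same browser session; stop on the last page.
        if not go_to_next_page(driver):
            return results


def run_with_selenium(start_url):
    """Wire crawl_listing to Selenium and BeautifulSoup (both must be installed)."""
    from bs4 import BeautifulSoup
    from selenium import webdriver
    from selenium.common.exceptions import TimeoutException
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    listing = webdriver.Chrome()  # keeps the pagination state
    details = webdriver.Chrome()  # separate window for the detail pages

    def get_links(html):
        soup = BeautifulSoup(html, 'html.parser')
        # [18:] mirrors the question's code, which skips the leading anchors.
        return [a['href'] for a in soup.find_all('a', href=True)[18:]]

    def scrape_detail(link):
        details.get(link)
        soup = BeautifulSoup(details.page_source, 'html.parser')
        return [tag.text for tag in soup.find_all('strong')[:-1]]

    def go_to_next_page(drv):
        locator = (By.XPATH, "//*[@title='Go to next page']")
        try:
            WebDriverWait(drv, 10).until(EC.element_to_be_clickable(locator)).click()
        except TimeoutException:
            return False  # no clickable "next" button, assume last page
        time.sleep(5)  # crude wait for the new page, as in the original code
        return True

    try:
        listing.get(start_url)
        return crawl_listing(listing, get_links, scrape_detail, go_to_next_page)
    finally:
        listing.quit()
        details.quit()
```

Calling run_with_selenium with the listing URL should return a dictionary that maps each detail-page URL to the list of strings found in its strong tags. Because crawl_listing only touches page_source and current_url, it can also be exercised with a small fake driver before pointing it at the real site.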
