Retrieve all href links using the Python selenium module

I've recently been exploring Selenium with Python, and I wanted to extract all the links present on a particular web page.

Specifically, I wanted to retrieve the value of the href= attribute of every <a> tag on the page.

I have managed to write a script that runs without errors. However, instead of returning the actual link values, it prints the element objects' representations (their addresses). I initially tried extracting the values via the id attribute, but that approach did not yield the desired results.

This is the current version of my script:

from selenium import webdriver
from selenium.webdriver.common.keys import Keys


driver = webdriver.Firefox()
driver.get("http://psychoticelites.com/")

assert "Psychotic" in driver.title

continue_link = driver.find_element_by_tag_name('a')
elem = driver.find_elements_by_xpath("//*[@href]")
#x = str(continue_link)
#print(continue_link)
print(elem)

Answer №1

If you're looking to extract specific attributes from a list of elements, one way to do it is by looping through the list like this:

elements = driver.find_elements_by_xpath("//a[@href]")
for element in elements:
    print(element.get_attribute("href"))

The find_elements_by_* methods return a list of elements (note the plural elements, as opposed to find_element_by_*, which returns a single element). By iterating through this list you can access each element and read the desired attribute value (href in this example).
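
For Selenium 4.3 and later, where the find_elements_by_* helpers have been removed, the equivalent call uses find_elements together with the By class. A minimal sketch of the same idea, assuming a local Firefox driver (URL taken from the question):

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get("http://psychoticelites.com/")

# Select every anchor carrying an href attribute, then read the attribute value.
for element in driver.find_elements(By.XPATH, "//a[@href]"):
    print(element.get_attribute("href"))

driver.quit()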

Answer №2

You can also use find_elements_by_tag_name(). In my own testing, this example worked without any issues.

elems = driver.find_elements_by_tag_name('a')
for elem in elems:
    href = elem.get_attribute('href')
    if href is not None:
        print(href)

Answer №3

import time

driver.get(URL)  # URL is a placeholder for the page you want to scrape
time.sleep(7)  # crude fixed wait for the page to finish loading
elems = driver.find_elements_by_xpath("//a[@href]")
for elem in elems:
    print(elem.get_attribute("href"))
driver.close()

Note: It's crucial to include a delay in the script. Run it in debug mode first to ensure the URL page loads successfully. If the page loading is slow, adjust the delay (sleep time) accordingly before extracting the data.
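
A more robust alternative to a fixed sleep is an explicit wait, which blocks only until the page satisfies a condition. A sketch of the same loop using WebDriverWait and the Selenium 4 locator API (the 10-second timeout is an arbitrary choice, and URL is still your target address):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver.get(URL)
# Wait up to 10 seconds until at least one anchor with an href is present.
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.XPATH, "//a[@href]"))
)
for elem in driver.find_elements(By.XPATH, "//a[@href]"):
    print(elem.get_attribute("href"))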

If you encounter any issues, feel free to check out the explanation provided in the link below or leave a comment.

Learn how to extract links from a webpage using selenium webdriver

Answer №4

Here is a possible solution:

links = driver.find_elements_by_partial_link_text('')
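
Since every link text contains the empty string, this should match every anchor on the page. A sketch of how you might then read the hrefs, using the Selenium 4 spelling of the same locator (an already-initialized driver is assumed):

from selenium.webdriver.common.by import By

# An empty partial-link-text string matches every anchor on the page.
links = driver.find_elements(By.PARTIAL_LINK_TEXT, '')
for link in links:
    print(link.get_attribute("href"))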

Answer №5

The old driver.find_elements_by_* helpers were deprecated in Selenium 4 and have since been removed. The current recommended approach is to use find_elements() together with the By class.

Approach 1: For loop

This example builds two lists, one with By.XPATH and one with By.TAG_NAME, purely to compare the two locators; you only need one of them in practice.

Personally, I find By.XPATH more straightforward, since the //a[@href] selector never returns anchors without an href, whereas By.TAG_NAME can yield None values that have to be filtered out. This snippet also removes duplicates from the results.

from selenium.webdriver.common.by import By

driver.get("https://www.amazon.com/")

href_links = []
href_links2 = []

elems = driver.find_elements(by=By.XPATH, value="//a[@href]")
elems2 = driver.find_elements(by=By.TAG_NAME, value="a")

for elem in elems:
    link = elem.get_attribute("href")
    if link not in href_links:
        href_links.append(link)

for elem in elems2:
    link = elem.get_attribute("href")
    if link is not None and link not in href_links2:
        href_links2.append(link)

print(len(href_links))  # 360
print(len(href_links2))  # 360

print(href_links == href_links2)  # True

Approach 2: List Comprehension

If duplicate links are acceptable, a single line list comprehension can be employed.

from selenium.webdriver.common.by import By

driver.get("https://www.amazon.com/")

elems = driver.find_elements(by=By.XPATH, value="//a[@href]")
href_links = [e.get_attribute("href") for e in elems]

elems2 = driver.find_elements(by=By.TAG_NAME, value="a")
# href_links2 = [e.get_attribute("href") for e in elems2]  # Does not remove None values
href_links2 = [e.get_attribute("href") for e in elems2 if e.get_attribute("href") is not None]

print(len(href_links))  # 387
print(len(href_links2))  # 387

print(href_links == href_links2)  # True
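
As a side note on the deduplication in Approach 1, dict.fromkeys is a common idiom that drops duplicates while preserving insertion order and avoids the linear membership test on a list. A sketch under the same assumption of an already-initialized driver:

from selenium.webdriver.common.by import By

elems = driver.find_elements(By.XPATH, "//a[@href]")
# dict keys are unique and (since Python 3.7) preserve insertion order.
href_links = list(dict.fromkeys(e.get_attribute("href") for e in elems))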

Answer №6

You can also parse the HTML DOM with the htmldom library, which can be installed with pip:

https://pypi.python.org/pypi/htmldom/2.0

from htmldom import htmldom
dom = htmldom.HtmlDom("https://www.github.com/")  
dom = dom.createDom()

The snippet above initializes an HtmlDom object. HtmlDom takes a single parameter: the URL of the page you want to parse. After creating the dom object, call its createDom method; this parses the HTML and builds a parse tree that you can search and modify. The only requirement the library imposes is that the data (HTML or XML) must have a root element.

You can retrieve elements using the "find" method of the HtmlDom object:

p_links = dom.find("a")
for link in p_links:
    print("URL: " + link.attr("href"))

This code prints all the links/URLs present on the page.

Answer №7

Unfortunately, it seems that the original link provided by OP is no longer working...

If you're interested in extracting links from a webpage, here's a method to fetch all the "Hot Network Questions" links from this page using gazpacho:

from gazpacho import Soup

url = "https://stackoverflow.com/q/34759787/3731467"

soup = Soup.get(url)
a_tags = soup.find("div", {"id": "hot-network-questions"}).find("a")

[a.attrs["href"] for a in a_tags]

Answer №8

You can also extract the links using BeautifulSoup, which is straightforward and effective. I have personally tested the snippet below and it worked for this purpose.

Add the following lines after the driver.get("http://example.com/") call:

import requests
from bs4 import BeautifulSoup

response = requests.get(driver.current_url)  # re-fetch the page the browser is on
soup = BeautifulSoup(response.content, 'html.parser')
for link in soup.find_all('a'):
    if link.get('href'):
        print(link.get('href'))
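
One caveat: requests.get fetches the URL again over plain HTTP, so content that only appears after JavaScript runs in the browser will be missing. A sketch that instead parses the DOM Selenium has already rendered, via driver.page_source:

from bs4 import BeautifulSoup

# Parse the HTML of the page as currently rendered in the browser.
soup = BeautifulSoup(driver.page_source, 'html.parser')
for link in soup.find_all('a', href=True):
    print(link['href'])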

Answer №9

As of 2023:

from selenium.webdriver.common.by import By

target_url = "https://example.com"
driver.get(target_url)
all_links = driver.find_elements(By.XPATH, "//a[@href]")
for ind_link in all_links:
    current_link = ind_link.get_attribute("href")
    print("Found link:{}".format(current_link))

Answer №10

import requests
from selenium import webdriver
import bs4

driver = webdriver.Chrome(r'C:\chromedrivers\chromedriver')  # filepath to your ChromeDriver; note the driver is not actually used below
data = requests.get('https://google.co.in/')  # any website you want to scrape
soup = bs4.BeautifulSoup(data.text, 'html.parser')
for link in soup.find_all('a'):
    print(link)

