What is the best way to locate the XPath expression for this specific website?

Question

I am looking to extract data from this particular website.

I have a large collection of resumes and want to gather the skill sets listed in each one. Here is a screenshot of the page for reference:

https://i.stack.imgur.com/lBoym.png

selenium xpath web-scraping css-selectors

Answer №1

If you want to extract this information without Selenium, you can do it with requests and BeautifulSoup. Here is the code:

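import requests
from bs4 import BeautifulSoup

# Fetch the search results page and parse it.
r = requests.get('https://www.livecareer.com/resume-search/search?jt=software%20engineer').text
soup = BeautifulSoup(r, 'html.parser')

# The results are <li> items inside this <ul>; the first <li> is skipped.
ul = soup.find('ul', class_='resume-list list-unstyled')
li_items = ul.find_all('li')[1:]

# Build the URL of each resume from the href of its link.
links = []
for li in li_items:
    links.append('https://www.livecareer.com/' + li.a['href'])

# Visit each resume and collect the text of the first matching div.
skills = []
for link in links:
    r = requests.get(link).text
    soup = BeautifulSoup(r, 'html.parser')
    div = soup.find('div', class_='field singlecolumn')
    # Keep one entry per link even if a page has no matching div.
    skills.append(div.text if div else '')

print(skills)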

Output:

['agile, AutoCAD, C++, CAD, Oral, data entry, database, Engineer in Training, EIT, Engineering analysis, XML, functional, GUI, HTML, JavaScript, Team leadership, Lockheed Martin, macros, Manufacturing processes, MATLAB, mechanical, meetings, Excel, Organizational skills, presentations, Process improvement, program management, programming, Project planning, Python, research, scrum, Six Sigma, Software development, Solidworks, SQL, switches, telemetry, video, Web design, website, Written communication', "Senior Outreach at Senior Center\xa0Planned and organized a joint celebration of the Chinese New Year with the collaboration of the Westborough Public Schools. \xa0Promoted cultural awareness and broke the language barrier of different races of backgrounds.Volunteer at ChurchIdentified problems and implemented a process to eliminate a data-entry camp registration process by 100% by building new online Registration Forms for registration and the student's cultural classes arrangement.Designed posters, flyers and presentation slides with graphics and photos for the different organizational events with Microsoft Word and PowerPointDeveloped a structural documentation on publishing an annual report with detailed steps and instructions on the process that are easy to follow and quickly learn by others. \xa0Implemented a 30th Anniversary Special Edition project in a commercial quality of work with excellent time management skills to meet the deadline.Cayenne SoftwareExperienced team spirit in effort of reducing the workload of bugs fixes on the software product.\u200bAllmerica Financial CompanyProvided a sole support to the Hanover 1099 system with strong commitment and responsibility. Fined tuned the system resulting in cost savings for the Allmerica Financial Company. Winner of the Gold Crown Customer Recognition award.", 'Motivated Software Engineer seeking employment as part of a dynamic software development team. 
Fluent in C,C++,JAVA and python.', 'Developed peer-to-peer secure file transfer system in JAVA.This involved the application of symmetric\r\n     and asymmetric key cryptography algorithms, and JAVA concepts like multi-threading, socket\r\n     programming, etc.Implemented a system to query XML in JAVA.The query language was a subset of XPath\r\n    Modeled a project "Personal Health Management System" using UML and implemented it in Visual C#.The code was tested using NUnit.Object oriented software development process was used for this\r\n     project\r\n    Developed a \'license plate game\' in C on LINUX o/s using client/server architecture.This required the\r\n     application of distributed programming concepts like Sockets, RPC, multi-threading, etc.RESEARCH PAPER:\r\n    XMorph: A Shape-Polymorphic, Domain-Specific XML Data Transformation Language,\r\n     International Conference on Data Engineering (ICDE 2010), IEEE CS, Los Angeles, USA, March 2010.', 'Performance evaluation of In-Kernel System Call Implemented and evaluated In-kernel system call using dynamic loadable kernel module on x_86_64 architecture.Re-Development free approach to migrate Java applications to cloud at College of Engineering, Pune Implemented file access sub-system of a WebJDK which leverages the File-System API provided by HTML5.This allows the use of standard Java APIs for accessing client files.2016 2013.', 'Accomplished Computer Technician with a rapidly increasing range of industry experience looking to bring strong instincts and a proven record of procedural compliance, process management and strong operational skills to a rapidly growing company. 
', 'Seeking a fulltime position as a Developer / Systems Admin / DBA for a company needing a hard working, \r\ntaskoriented person with an indepth understanding of software development and database tuning.', '3 Years of experience in Information Technology with emphasis on Design, Development and End to End Implementation of Consulting based solutions with expertise on working with Object Orient Analysis and Design using Java/J2EE Technologies viz. JSP/Servlets/EJB,JDBC, Web services , Web sockets, Spring Frameworks, Spring-boot, Angular, JQuery, XML/XSLT, JSON, Integration Developer Service Component Architecture, Service Data Objects, Rational Application Developer, Test Driven Development using JUnit, Jenkins, GIT, Cloud Foundry,  Eclipse/Intelij IDE, UNIX, Gradle Scripts, DB2/Oracle/MySQL Databases.', '.NET 3.5, .NET, ASP .NET 3.5, ASP.NET 2.0, ASP.NET 3.5, AJAX, ASM, Banking, Basic, Business Objects, c, CSS, CSS 2, customer satisfaction, data analysis, Database, delivery, EBusiness, editor, Electronics, HP, HTML 4, HTML, IDE, IIS 7.0, ITIL, JavaScript, C#, C# 3.0, Windows, windows applications, 2000, 3.1, Windows 98, Enterprise, Oct, Operating systems, Oracle 9, Oracle database, PL/SQL, personnel, programming, recording, reporting, sales, Servers, Service Level Agreement, SLA, Visual SourceSafe, Visual  SourceSafe, SQL, SQL Server, technical support, TOAD, UNIX, vi, Microsoft Visual Studio, Visual studio, Windows server', 'Represent Stanford  Ballroom Dance team in various competitions in the Bay area.\r\n*Represented University of Maryland in Ballroom dance competitions in UMD, UPenn, MIT, Columbia University & Ohio \r\n*Have a keen interest in photography, especially of dancers in motion.']
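The raw strings above still contain non-breaking spaces (\xa0), zero-width spaces (\u200b) and runs of \r\n plus indentation. As a minimal sketch, assuming you only want single spaces between words, something like this could tidy each entry before further processing. The helper name clean_text and the sample string are made up for illustration.

```python
def clean_text(raw):
    """Collapse whitespace debris in scraped text (hypothetical helper)."""
    # Zero-width spaces are not treated as whitespace by str.split(),
    # so remove them explicitly first.
    without_zero_width = raw.replace('\u200b', '')
    # split() with no arguments splits on any run of whitespace,
    # which includes \xa0, \r and \n.
    return ' '.join(without_zero_width.split())


# Made-up sample that mimics the debris seen in the output above.
sample = 'Senior Center\xa0Planned and organized.\r\n     Fluent in C,C++ \u200b'
print(clean_text(sample))
```

You could then apply it while collecting, for example skills.append(clean_text(div.text)).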

You can also organize the data in a DataFrame for better readability by incorporating these lines into your code:

import pandas as pd

dictionary = {'Links': links,
              'Skills': skills}

df = pd.DataFrame(dictionary)

print(df)

Output:

                                                                                                           
                                               Links                                             Skills
0  https://www.livecareer.com//resume-search/r/so...  agile, AutoCAD, C++, CAD, Oral, data entry, da...
1  https://www.livecareer.com//resume-search/r/so...  Senior Outreach at Senior Center Planned and o...
2  https://www.livecareer.com//resume-search/r/so...  Motivated Software Engineer seeking employment...
3  https://www.livecareer.com//resume-search/r/so...  Developed peer-to-peer secure file transfer sy...
4  https://www.livecareer.com//resume-search/r/so...  Performance evaluation of In-Kernel System Cal...
5  https://www.livecareer.com//resume-search/r/so...  Accomplished Computer Technician with a rapidl...
6  https://www.livecareer.com//resume-search/r/so...  Seeking a fulltime position as a Developer / S...
7  https://www.livecareer.com//resume-search/r/so...  3 Years of experience in Information Technolog...
8  https://www.livecareer.com//resume-search/r/so...  .NET 3.5, .NET, ASP .NET 3.5, ASP.NET 2.0, ASP...
9  https://www.livecareer.com//resume-search/r/so...  Represent Stanford  Ballroom Dance team in var...
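The links in the table contain a double slash after the domain, because the base URL ends in "/" and the href values start with one. The pages still loaded here, but if you prefer clean URLs, a small sketch using urljoin from the standard library could look like this. The name absolute_url and the example path are assumptions for illustration, not real resume URLs.

```python
from urllib.parse import urljoin

BASE_URL = 'https://www.livecareer.com/'


def absolute_url(href):
    """Join a site-relative href onto the base URL (hypothetical helper)."""
    # urljoin resolves the leading slash itself, so no '//' is produced.
    return urljoin(BASE_URL, href)


# Hypothetical href, only to show the shape of the result.
print(absolute_url('/resume-search/r/example-resume'))
```

In the loop above this would replace the string concatenation: links.append(absolute_url(li.a['href'])).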

I hope this information proves useful!

Answer №2

Just a thought to consider...

In Chrome, follow these steps:

Right-click the element you want to target.
Select "Inspect".
Press Ctrl+F in the Elements panel to open the search box, which accepts XPath expressions.

Now write your own XPath expression that uniquely identifies the desired element, and test it in that search box.

For example:

//a[contains(text(), 'my text')] 
//div[@id='myDivID']

It's important to write your XPath by hand and avoid the "Copy XPath" option, because it can generate long positional paths like the one below, which break as soon as the page layout changes:

//*[@id="wrapper"]/div[2]/div[2]/div[1]/aside[1]/div/div/div[2]/div/div[1]/a
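To see why positional paths are fragile, here is a small sketch using Python's standard xml.etree.ElementTree, which supports only a limited subset of XPath. The two markup snippets, the id skills-link and the helper found are invented for the demo. The second snippet adds one extra div, as a redesign might.

```python
import xml.etree.ElementTree as ET

# Two made-up versions of the same page; the second has one extra <div>.
ORIGINAL = """
<body>
  <div id="wrapper">
    <div><p>banner</p></div>
    <div><aside><a id="skills-link">Skills</a></aside></div>
  </div>
</body>
""".strip()

REDESIGNED = """
<body>
  <div id="wrapper">
    <div><p>banner</p></div>
    <div><p>new promo block</p></div>
    <div><aside><a id="skills-link">Skills</a></aside></div>
  </div>
</body>
""".strip()

# Positional path, similar in spirit to what "Copy XPath" produces.
POSITIONAL = "./div[@id='wrapper']/div[2]/aside/a"
# Hand-written path that relies on a stable attribute instead.
BY_ATTRIBUTE = ".//a[@id='skills-link']"


def found(markup, path):
    """Return True if the path matches an element in the markup."""
    return ET.fromstring(markup).find(path) is not None


for name, markup in (('original', ORIGINAL), ('redesigned', REDESIGNED)):
    print(name, found(markup, POSITIONAL), found(markup, BY_ATTRIBUTE))
```

The positional path only matches the original layout, while the attribute-based one matches both.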

If you're unfamiliar with writing XPath, refer to this resource for guidance: https://www.w3schools.com/xml/xpath_intro.asp

Note that on this page the text sits directly inside a div tag, when ideally it would be enclosed in a span. You could try the following XPath:

//div[@class='field singlecolumn']/text()
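As a rough way to try this expression outside the browser, the sketch below runs the element part of it against a made-up, well-formed snippet with Python's standard xml.etree.ElementTree. That module implements only a subset of XPath and has no text() step, so the text is read from the matched element instead. The markup and the skill values are assumptions, not the real page.

```python
import xml.etree.ElementTree as ET

# Made-up markup that mimics the class names used on the page.
PAGE = """
<html><body>
  <div class="field singlecolumn">Python, SQL, JavaScript</div>
  <div class="field twocolumn">Unrelated</div>
</body></html>
""".strip()

root = ET.fromstring(PAGE)

# Same predicate as the XPath above, written relative to the root element.
skill_divs = root.findall(".//div[@class='field singlecolumn']")

# Read the text of each matched element instead of using text().
skills = [div.text.strip() for div in skill_divs if div.text]
print(skills)
```

Two things to keep in mind: the predicate is an exact match on the whole class attribute, so it stops matching if the classes are reordered or another class is added. And in Selenium a locator has to resolve to elements, so there you would drop the trailing /text() and read the element's .text.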

Answer №3

To find the XPath of an HTML element, you can use your browser's Developer Tools. The steps below are for Chrome, but other browsers work similarly:

Right-click the item on the page whose XPath you want.
Select "Inspect", which opens the Dev Tools with that element highlighted.
If the highlighted element isn't the one you want, hover over elements in the HTML tree until the desired item is highlighted on the page.
Right-click the HTML element in the Elements panel.
Choose 'Copy -> Copy XPath'.

One problem you may face when scraping these pages is that your target may sit in a different place from one visit to the next. User-generated documents often have varying layouts, so a fixed XPath will not match every one. You then need an approach (for example with jQuery, Selenium, or Cypress) that searches by text content or navigates between parent and child elements.
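To illustrate searching by text content rather than by position, here is a sketch with Python's standard xml.etree.ElementTree standing in for the tools named above. It assumes, purely for the example, that each resume has a section whose h2 heading reads "Skills". The tag names, layouts and the helper skills_text are all invented.

```python
import xml.etree.ElementTree as ET

# Two made-up layouts: the Skills section sits in different places.
LAYOUT_A = (
    "<body>"
    "<section><h2>Summary</h2><div>Engineer</div></section>"
    "<section><h2>Skills</h2><div>Python, SQL</div></section>"
    "</body>"
)
LAYOUT_B = (
    "<body><div class='wrapper'>"
    "<section><h2>Skills</h2><div>Java, Go</div></section>"
    "<section><h2>Summary</h2><div>Developer</div></section>"
    "</div></body>"
)


def skills_text(markup):
    """Find the div inside the section whose h2 child reads 'Skills'."""
    # [h2='Skills'] keeps only sections with an h2 child of that text,
    # then /div steps down to the sibling block that holds the content.
    node = ET.fromstring(markup).find(".//section[h2='Skills']/div")
    return node.text if node is not None else None


print(skills_text(LAYOUT_A))
print(skills_text(LAYOUT_B))
```

Because the path is anchored on the heading text, it keeps working when the section moves, and it returns None when a document has no such heading.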
