Finding a value by its label with XPath in Scrapy

I am struggling to extract data from the following HTML, which is built from nested div elements:

<div class="parameters">
    <div class="property">property 1</div>
    <div class="value">value</div>
</div>
<div class="parameters">
    <div class="property">property 2</div>
    <div class="value">value</div>
</div>
<div class="parameters">
    <div class="property">property 3</div>
    <div class="value">value</div>
</div>
<div class="parameters">
    <div class="property">property 4</div>
    <div class="value">value</div>
</div>

Specifically, I am trying to retrieve the value associated with property 4...

for item in response.css('div.parameters'):
    name = item.xpath('//div[text()[contains(.,"property 4")]]/following::div[1]/text()').get()

However, this code is not working as expected. Can someone please help me identify the error?

Answer №1

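//div[contains(.,"details 4")]/./div//text()

This XPath query matches every div whose text content contains "details 4", which includes the enclosing section div. The /. step stays on the matched node (it does not move up a level), /div then selects its child div elements, and //text() returns their text nodes, giving details 4 and value. For the markup in the question, search for "property 4" instead.

Updated XPath query, joined into a single string:

' '.join(response.xpath('//div[contains(.,"details 4")]/./div//text()').getall())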

Demonstrated using a scrapy shell:

In [1]: from scrapy.selector import Selector

In [2]: %paste
html ='''
<div class="section">
    <div class="details">details 1</div>
    <div class="info">info 1</div>
</div>
<div class="section">
    <div class="details">details 2</div>
    <div class="info">info 2</div>
</div>
<div class="section">
    <div class="details">details 3</div>
    <div class="info">info 3</div>
</div>
<div class="section">
    <div class="details">details 4</div>
    <div class="info">value</div>
</div>
'''

## -- End pasted text --

In [3]: sel = Selector(text=html)

In [4]: 
   ...: ' '.join(sel.xpath('//div[contains(.,"details 4")]/./div//text()').getall())
Out[4]: 'details 4 value'

Answer №2

Give this a shot:

# .xpath() is provided by lxml; standard-library ElementTree elements do not have it
from lxml import etree as ET

xml_data = """
<root>

<div class="parameters">
    <div class="property">property A</div>
    <div class="value">value A</div>
  </div>
  <div class="parameters">
      <div class="property">property B</div>
      <div class="value">value B</div>
  </div>
  <div class="parameters">
    <div class="property">property C</div>
    <div class="value">value C</div>
  </div>
  <div class="parameters">
    <div class="property">property D</div>
    <div class="value">value D</div>
  </div>

</root>
"""


parsed_xml = ET.fromstring(xml_data)

properties = parsed_xml.xpath('//div[contains(@class, "property")]')
values = parsed_xml.xpath('//div[contains(@class, "value")]')

result = {p.text: v.text for p, v in zip(properties, values)}
print(result["property D"])

Displays:

value D
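
Pairing two separate lists with zip relies on every block containing both a property and a value, in matching order. A sketch that instead reads each pair inside its own parameters block, using only the standard library (whose find and findall accept a limited XPath subset), might look like this. The shortened sample data and the pairs name are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

xml_data = """
<root>
  <div class="parameters">
    <div class="property">property A</div>
    <div class="value">value A</div>
  </div>
  <div class="parameters">
    <div class="property">property D</div>
    <div class="value">value D</div>
  </div>
</root>
"""

root = ET.fromstring(xml_data)

pairs = {}
for block in root.findall(".//div[@class='parameters']"):
    # Look up both children relative to the current block, so a block
    # that lacks one of them cannot shift the pairing of the others.
    prop = block.find("div[@class='property']")
    value = block.find("div[@class='value']")
    if prop is not None and value is not None:
        pairs[prop.text] = value.text

print(pairs["property D"])  # prints: value D
```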
