Utilize Python to merge various XML files into a single Excel workbook, each file imported into a separate worksheet

I am looking to combine 20 XML files into a single Excel file, with each XML file as an individual worksheet.

Although the structure of the XML files is consistent, the values vary.
You can see how the XML file appears here. Only the X and Y values are required.

My plan involves importing the necessary data (X and Y values) into Python first to create a table, which will then be imported into an Excel file.

Initially, I started by importing one XML file and creating a table in Python. Below is the code I wrote:

import xml.etree.ElementTree as ET
import pandas as pd

file_path = file_path
root = ET.parse(file_path).getroot()
for Point in root.findall('Point'):
    X = float(Point.find('X').text)
    Y = float(Point.find('Y').text)
    kraft_all = [Y]
    data = {'Weg/mm': [X], 'Kraft/N': [Y],'Max. Kraft/N':max(kraft_all)}
    df = pd.DataFrame.from_dict(data)
    print(df)

The output I obtained looks like this:

I have a couple of questions:

  1. Why does each value have its own column name? How can I display the column names only in the first row?
  2. How can I extract the maximum value from the Y values and place it in the first 'cell' of the third column 'Max. Kraft/N'?

Prior to this, I successfully imported multiple CSV files into an Excel file using pd.read_csv and df.to_excel. So once I address these initial questions, I aim to work on handling multiple files independently. However, any advice or suggestions would be highly appreciated :)

Thank you for your time!


Update_2023.10.09

Following guidance from @Edo Akse, I modified my code and added some additional lines to import the dataframe into an Excel file. It worked smoothly for one XML file :). Here's the updated code:

import xml.etree.ElementTree as ET
import pandas as pd

file_path = r'C:\Users\xli\OneDrive - TFI Aachen GmbH\Excel_Beige und Weiß_Tru\tru Beige\tru beige-1.xml'
root = ET.parse(file_path).getroot()
as_list = []
for Point in root.findall("Point"):
    X = float(Point.find("X").text)
    Y = float(Point.find("Y").text)
    line = {"Weg/mm": X, "Kraft/N": Y}
    as_list.append(line)
df = pd.DataFrame.from_dict(as_list)
print(df)
excel_path = r'C:\Users\xli\OneDrive - TFI Aachen GmbH\Excel_Beige und Weiß_Tru\beige.xlsx'
writer = pd.ExcelWriter(excel_path,engine='openpyxl')
df.to_excel(excel_writer=writer,sheet_name='tru beige-1')
writer.close()

Now, my objective is to

  1. import multiple XML files from a folder into distinct dataframes
  2. import these dataframes into separate sheets within a single Excel file

With reference to @Hermann12, I constructed the following code:

import glob
import pandas as pd
import os
from pathlib import Path

def load_xml(files):
    column = ["Weg[mm]","Kraft[N]"]
    df1 = pd.concat([pd.read_xml(file, names=column) for file in files])
    return df1

excel_path = r'C:\Users\xli\OneDrive - TFI Aachen GmbH\Excel_Beige und Weiß_Tru\beige.xlsx'
writer = pd.ExcelWriter(excel_path,engine='openpyxl')
num=1
for root,dirs,files in os.walk(r"C:\Users\xli\OneDrive - TFI Aachen GmbH\Excel_Beige und Weiß_Tru\tru Beige"):
    print(root)
    print(dirs)
    print(files)
    for file in files:
        xml_files = Path(r"C:\Users\xli\OneDrive - TFI Aachen GmbH\Excel_Beige und Weiß_Tru\tru Beige").glob('*.xml') 
        df = load_xml(xml_files)
        df.to_excel(excel_writer=writer,sheet_name=file)
    writer.close()

An error "ValueError: names does not match length of child elements in xpath" appeared.

If anyone could review my code and guide me on rectifying the errors, that would be greatly appreciated.

Many thanks for your time!

Answer №1

The primary concern lies within this specific section:

    kraft_all = [Y]
    data = {'Weg/mm': [X], 'Kraft/N': [Y],'Max. Kraft/N':max(kraft_all)}
    df = pd.DataFrame.from_dict(data)
    print(df)

Given that this code snippet is enclosed in a for loop, it results in the variables kraft_all, data, and df being constantly overwritten in each iteration of the loop. Consequently, by the end of the loop, these variables will only retain the final assigned values.

A solution would be to append values to the variables rather than overwriting them entirely.

According to this response, the proper way to create a DataFrame is to first construct a list and then convert it into a DataFrame.

When determining the maximum value, it's not recommended to store the max for every column in each row since it essentially stores the same value repeatedly. The optimal approach is to fetch the value when necessary or after loading the DataFrame.

Based on this information, I revised the code as shown below:

import xml.etree.ElementTree as ET
import pandas as pd

file_path = "tst.xml"
root = ET.parse(file_path).getroot()

as_list = []
for Point in root.findall("Point"):
    X = float(Point.find("X").text)
    Y = float(Point.find("Y").text)
    line = {"Weg/mm": X, "Kraft/N": Y}
    as_list.append(line)

df = pd.DataFrame.from_dict(as_list)

m = max(df["Kraft/N"])
print(m)
print(df)

Expected output:

42.0
   Weg/mm  Kraft/N
0     1.0      5.0
1     2.0      4.0
2     3.0     42.0
3     4.0     11.0
4     5.0     11.0

The XML structure used for this demonstration:

<Measurement xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
    <Point>
        <X>1</X>
        <Y>5</Y>
        <UnitX>mm</UnitX>
        <UnitY>N</UnitY>
    </Point>
    <Point>
        <X>2</X>
        <Y>4</Y>
        <UnitX>mm</UnitX>
        <UnitY>N</UnitY>
    </Point>
    <Point>
        <X>3</X>
        <Y>42</Y>
        <UnitX>mm</UnitX>
        <UnitY>N</UnitY>
    </Point>
    <Point>
        <X>4</X>
        <Y>11</Y>
        <UnitX>mm</UnitX>
        <UnitY>N</UnitY>
    </Point>
    <Point>
        <X>5</X>
        <Y>11</Y>
        <UnitX>mm</UnitX>
        <UnitY>N</UnitY>
    </Point>
</Measurement>

Answer №2

If you provide the XML data instead of as a picture, we can use it to give a more detailed explanation. Using pandas, I would approach this in the following way if all XML files are located in the same folder as the Python script:

import pathlib
import pandas as pd

def load_xml(files):
    column = ["Weg[mm]","Kraft[N]","Unit X","Unit Y"]
    df1 = pd.concat([pd.read_xml(file, names=column) for file in files])
    res = df1.sort_values('Kraft[N]', ascending=False)
    return res
    

if __name__ == "__main__":
    xml_files = [f for f in pathlib.Path().glob("*.xml")]
    df = load_xml(xml_files)
    print(df)

Output:

   Weg[mm]  Kraft[N] Unit X Unit Y
2     3.00     42.00     mm      N
3     4.00     11.00     mm      N
4     5.00     11.00     mm      N
0     1.00      5.45     mm      N
1     2.00      4.00     mm      N
5    19.29      0.25     mm      N
   ...

Similar questions

If you have not found the answer to your question or you are interested in this topic, then look at other similar questions below or use the search

Python - Break a string into an array of two characters

str = "zyx wvut srqp onml opq pno lmnk" What is the best way to achieve the following output? array = ['zyx wvut', 'wvut srqp' , 'srqp onml' ,'onml opq', 'opq pno', 'pno lmnk'] ...

Comparison of Socket Latency between TCP and UDP

I recently created a small test to measure the speed of TCP and UDP socket communication in Python. Surprisingly, TCP performed almost twice as fast as UDP. To eliminate any potential routing effects, I ran the server and client on the same Unix machine b ...

Troubleshooting Issue with Post/Get Request in AJAX and Flask Framework

My current project involves building a YouTube scraper webpage purely for educational purposes. I have created a web page with a text box to enter search queries and a search button. When the button is clicked, an Ajax post request is sent with the text bo ...

Is Jackson a suitable tool for conducting XSLT transformations?

Within our projects, we utilize Jackson for mapping between JSON and Java objects, as well as Jettison for converting XML input streams to JSON objects. One common scenario involves applying an XSLT transformation on an XML document to create a "JSONized" ...

Discovering the clickable widget index in pyqt4: A beginner's guide

In my application, I am working on creating multiple widget orders using a list of dictionaries to create a list of widgets. I have implemented a mouse release event for clickable widgets, but I am facing issues in identifying the clicked widget index. C ...

Is there a way to extract the text that is displayed when I hover over a specific element?

When I hover over a product on the e-commerce webpage (), the color name is displayed. I was able to determine the new line in the HTML code that appears when hovering, but I'm unsure how to extract the text ('NAVY'). <div class="ui ...

Utilizing multiprocessing.Pool to distribute a counter across processes

Exploring the use of multiprocessing.Value + multiprocessing.Lock to share a counter between separate processes. Take a look at the following example: import itertools as it import multiprocessing def func(x, val, lock): for i in range(x): i ...

What is the best method to retrieve all symbols and last prices from this JSON file?

My attempt at this solution didn't yield the desired results. Interestingly, it did work with a different JSON file (). I suspect the issue lies in the differences at the beginning and end of the current file, as there are some additional parts before ...

What are some effective ways to enhance the efficiency of searching and matching in multi-dimensional arrays?

My goal is to compare elements in two separate arrays: Array_A, a 3d map of A_Clouds, and Array_B, a 3d map of B_Clouds. In these maps, each "cloud" consists of continuous pixels with unique integer values representing the cloud, while non-cloud values are ...

Gather updated information from a hover popup using Selenium and Python for a fresh data table integration

A few years ago, I successfully managed to scrape hover elements using Selenium. It was a bit challenging back then to select the correct hover table element that only appeared on hover. Recently, the website underwent a complete style overhaul, which seem ...

How can I retrieve the complete dictionary entry (word + phoneme) using pocketsphinx in Python?

Below is the code snippet: #!/usr /bin/env python import os import sphinxbase as sb import pocketsphinx as ps MODELDIR = 'deps/pocketsphinx/model' DATADIR = 'deps/pocketsphinx/test/data' # Create a decoder with specific model confi ...

Unexpectedly large dataset for the Test and Training Sets

Currently, I am in the process of developing a predictive model using linear regression on a dataset containing 157673 records. The data is stored in a CSV file and follows this format: Timestamp,Signal_1,Signal_2,Signal_3,Signal_4,Signal_5 2021-04-13 ...

Avoiding PyOSC from catching exceptions

PyOSC handles exceptions in a very polite manner, however, it can be challenging to debug. How can I overcome this issue? As an example, when encountering a coding bug, PyOSC may report it as: OSCServer: NameError on request from localhost:50542: global ...

The Python Plotly package does not support the "click event" functionality

I attempted to implement the code found on this webpage: Unfortunately, I encountered issues when trying to run it in Visual Studio Code and a Jupyter Notebook. Could you provide some insight into what might be causing this problem and any possible solut ...

Considering `null` as a separate entity when applying a unique constraint in a database table

In my database, I have a table that stores default and custom options for clients. The custom_id field is used to differentiate between default and unique custom jobs. When the custom_id field has a value, it represents a unique custom job record. If it is ...

tensorflow-addon is not designed to be compatible with the previous versions of tensorflow, specifically

Currently, I am dealing with code that utilizes TensorFlow 1.14 along with tensorflow-addons. However, it appears that the available versions of tensorflow-addons for installation are only compatible with tensorflow >= 2. When attempting to install an ol ...

Unable to import cvxpy and unable to import the name SolvingChain

I am encountering issues with importing cvxpy. Despite trying to uninstall and reinstall the package, the problem persists. Previously, it was functioning well but suddenly stopped without any changes being made. I managed to resolve the issue before witho ...

To properly document the results, I must utilize a button to record the identification of a specific color

I'm working on a code that creates a clickable box which changes colors from black to green to white by going through different shades of green whenever the mouse is clicked. What I now need to do is implement a feature that displays the current shade ...

Parsing XML using Java's E Mapper library and converting it to JSON with Objectmapper

I have a question for you. I am trying to convert a simple XML file to a JSON file. The conversion is successful, but the output I receive is not properly formatted. Here are the files: modalities.xml <cons> <modalities type="mod"> ...

Python's BeautifulSoup is throwing a KeyError for 'href' in the current scenario

Utilizing bs4 for designing a web scraper to gather funding news data. The initial section of the code extracts the title, link, summary, and date of each article from n number of pages. The subsequent part of the code iterates through the link column and ...