Use Python to merge multiple XML files into a single Excel workbook, with each file in a separate worksheet

Question

I am looking to combine 20 XML files into a single Excel file, with each XML file as an individual worksheet.

Although the structure of the XML files is consistent, the values vary.
You can see how the XML file appears here. Only the X and Y values are required.

My plan is to read the required data (the X and Y values) into Python first and build a table, which will then be written to an Excel file.

Initially, I started by importing one XML file and creating a table in Python. Below is the code I wrote:

import xml.etree.ElementTree as ET
import pandas as pd

file_path = file_path
root = ET.parse(file_path).getroot()
for Point in root.findall('Point'):
    X = float(Point.find('X').text)
    Y = float(Point.find('Y').text)
    kraft_all = [Y]
    data = {'Weg/mm': [X], 'Kraft/N': [Y],'Max. Kraft/N':max(kraft_all)}
    df = pd.DataFrame.from_dict(data)
    print(df)

The output I obtained looks like this:

I have a couple of questions:

Why does each value have its own column name? How can I display the column names only in the first row?
How can I extract the maximum value from the Y values and place it in the first 'cell' of the third column 'Max. Kraft/N'?

Prior to this, I successfully imported multiple CSV files into an Excel file using pd.read_csv and df.to_excel. So once I address these initial questions, I aim to work on handling multiple files independently. However, any advice or suggestions would be highly appreciated :)

Thank you for your time!

Update_2023.10.09

Following guidance from @Edo Akse, I modified my code and added some additional lines to import the dataframe into an Excel file. It worked smoothly for one XML file :). Here's the updated code:

import xml.etree.ElementTree as ET
import pandas as pd

file_path = r'C:\Users\xli\OneDrive - TFI Aachen GmbH\Excel_Beige und Weiß_Tru\tru Beige\tru beige-1.xml'
root = ET.parse(file_path).getroot()
as_list = []
for Point in root.findall("Point"):
    X = float(Point.find("X").text)
    Y = float(Point.find("Y").text)
    line = {"Weg/mm": X, "Kraft/N": Y}
    as_list.append(line)
df = pd.DataFrame.from_dict(as_list)
print(df)
excel_path = r'C:\Users\xli\OneDrive - TFI Aachen GmbH\Excel_Beige und Weiß_Tru\beige.xlsx'
writer = pd.ExcelWriter(excel_path,engine='openpyxl')
df.to_excel(excel_writer=writer,sheet_name='tru beige-1')
writer.close()

Now, my objective is to

import multiple XML files from a folder into distinct dataframes
import these dataframes into separate sheets within a single Excel file

With reference to @Hermann12, I constructed the following code:

import glob
import pandas as pd
import os
from pathlib import Path

def load_xml(files):
    column = ["Weg[mm]","Kraft[N]"]
    df1 = pd.concat([pd.read_xml(file, names=column) for file in files])
    return df1

excel_path = r'C:\Users\xli\OneDrive - TFI Aachen GmbH\Excel_Beige und Weiß_Tru\beige.xlsx'
writer = pd.ExcelWriter(excel_path,engine='openpyxl')
num=1
for root,dirs,files in os.walk(r"C:\Users\xli\OneDrive - TFI Aachen GmbH\Excel_Beige und Weiß_Tru\tru Beige"):
    print(root)
    print(dirs)
    print(files)
    for file in files:
        xml_files = Path(r"C:\Users\xli\OneDrive - TFI Aachen GmbH\Excel_Beige und Weiß_Tru\tru Beige").glob('*.xml') 
        df = load_xml(xml_files)
        df.to_excel(excel_writer=writer,sheet_name=file)
    writer.close()

An error "ValueError: names does not match length of child elements in xpath" appeared.

If anyone could review my code and guide me on rectifying the errors, that would be greatly appreciated.

Many thanks for your time!

python pandas excel xml

Answer №1

The main problem is in this section:

    kraft_all = [Y]
    data = {'Weg/mm': [X], 'Kraft/N': [Y],'Max. Kraft/N':max(kraft_all)}
    df = pd.DataFrame.from_dict(data)
    print(df)

Because this snippet sits inside the for loop, kraft_all, data, and df are overwritten on every iteration, and a new one-row DataFrame (with its own header) is printed each time. That is why every value appears with its own column names. When the loop ends, the variables hold only the values from the last Point.

The fix is to append each row to a collection instead of overwriting the variables.

According to this response, the recommended way to build a DataFrame is to collect the rows in a list first and then convert the list into a DataFrame once.

As for the maximum: storing it in every row of a column is not recommended, since that just repeats the same value. It is better to compute it when it is needed, after the DataFrame has been built.

Based on this information, I revised the code as shown below:

import xml.etree.ElementTree as ET
import pandas as pd

file_path = "tst.xml"
root = ET.parse(file_path).getroot()

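# Collect one dict per <Point>, then build the DataFrame once after the loop
as_list = []
for Point in root.findall("Point"):
    X = float(Point.find("X").text)
    Y = float(Point.find("Y").text)
    line = {"Weg/mm": X, "Kraft/N": Y}
    as_list.append(line)

df = pd.DataFrame(as_list)

# Maximum force, computed once from the finished column
m = df["Kraft/N"].max()
print(m)
print(df)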

Expected output:

42.0
   Weg/mm  Kraft/N
0     1.0      5.0
1     2.0      4.0
2     3.0     42.0
3     4.0     11.0
4     5.0     11.0
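
If the maximum should still appear in the table, as the question asks, one option is to put it only in the first row of a third column and leave the other rows empty. The following is a minimal sketch under that assumption; the helper name add_max_in_first_row is made up for illustration, and the data is the demonstration data from this answer.

```python
import pandas as pd


def add_max_in_first_row(df, source="Kraft/N", target="Max. Kraft/N"):
    """Return a copy of df with max(source) in the first row of a new column.

    Hypothetical helper for illustration; not part of pandas.
    """
    out = df.copy()
    # A Series indexed by the first row label only. Column assignment aligns
    # on the index, so every other row of the new column becomes NaN.
    out[target] = pd.Series(out[source].max(), index=out.index[:1])
    return out


df = pd.DataFrame(
    {"Weg/mm": [1.0, 2.0, 3.0, 4.0, 5.0], "Kraft/N": [5.0, 4.0, 42.0, 11.0, 11.0]}
)
result = add_max_in_first_row(df)
print(result)
```

When such a frame is written with df.to_excel, the NaN entries end up as empty cells by default, so only the first cell of the third column is filled.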

The XML structure used for this demonstration:

<Measurement xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
    <Point>
        <X>1</X>
        <Y>5</Y>
        <UnitX>mm</UnitX>
        <UnitY>N</UnitY>
    </Point>
    <Point>
        <X>2</X>
        <Y>4</Y>
        <UnitX>mm</UnitX>
        <UnitY>N</UnitY>
    </Point>
    <Point>
        <X>3</X>
        <Y>42</Y>
        <UnitX>mm</UnitX>
        <UnitY>N</UnitY>
    </Point>
    <Point>
        <X>4</X>
        <Y>11</Y>
        <UnitX>mm</UnitX>
        <UnitY>N</UnitY>
    </Point>
    <Point>
        <X>5</X>
        <Y>11</Y>
        <UnitX>mm</UnitX>
        <UnitY>N</UnitY>
    </Point>
</Measurement>
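
For the second goal in the question (one worksheet per XML file), the same parsing loop can be wrapped in a function and called once per file while a single ExcelWriter stays open. This is only a sketch: it assumes every file has the structure of the demonstration XML above and that openpyxl is installed. The function names xml_to_frame, safe_sheet_name and xml_folder_to_workbook are invented for this example.

```python
import re
import xml.etree.ElementTree as ET
from pathlib import Path

import pandas as pd


def xml_to_frame(xml_path):
    """Read the X/Y values of every <Point> into a two-column DataFrame."""
    root = ET.parse(xml_path).getroot()
    rows = [
        {"Weg/mm": float(p.findtext("X")), "Kraft/N": float(p.findtext("Y"))}
        for p in root.findall("Point")
    ]
    return pd.DataFrame(rows, columns=["Weg/mm", "Kraft/N"])


def safe_sheet_name(stem):
    """Make a file stem usable as an Excel sheet name."""
    # Excel limits sheet names to 31 characters and rejects [ ] : * ? / and \
    return re.sub(r"[\[\]:*?/\\]", "_", stem)[:31]


def xml_folder_to_workbook(xml_dir, excel_path):
    """Write each *.xml file in xml_dir to its own sheet of one workbook."""
    xml_files = sorted(Path(xml_dir).glob("*.xml"))
    sheet_names = []
    # The context manager saves and closes the workbook once, after the loop.
    with pd.ExcelWriter(excel_path, engine="openpyxl") as writer:
        for xml_file in xml_files:
            sheet = safe_sheet_name(xml_file.stem)
            xml_to_frame(xml_file).to_excel(writer, sheet_name=sheet, index=False)
            sheet_names.append(sheet)
    return sheet_names
```

A call would look like xml_folder_to_workbook(r"path\to\xml_folder", r"path\to\beige.xlsx"), with both paths as placeholders. Two things this sketch does not handle: a folder without any XML files (the workbook then has no sheet to save), and file names that become identical after truncation to 31 characters.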

Answer №2

If you post the XML data as text rather than as a picture, we can give a more detailed explanation. With pandas, I would approach it as follows, assuming all XML files are located in the same folder as the Python script:

import pathlib
import pandas as pd

def load_xml(files):
    # names must supply one entry per child element of each row node
    column = ["Weg[mm]", "Kraft[N]", "Unit X", "Unit Y"]
    df1 = pd.concat([pd.read_xml(file, names=column) for file in files])
    res = df1.sort_values('Kraft[N]', ascending=False)
    return res


if __name__ == "__main__":
    xml_files = list(pathlib.Path().glob("*.xml"))
    df = load_xml(xml_files)
    print(df)

Output:

   Weg[mm]  Kraft[N] Unit X Unit Y
2     3.00     42.00     mm      N
3     4.00     11.00     mm      N
4     5.00     11.00     mm      N
0     1.00      5.45     mm      N
1     2.00      4.00     mm      N
5    19.29      0.25     mm      N
   ...
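
The ValueError quoted in the question ("names does not match length of child elements in xpath") is consistent with names listing two columns while each Point has four children (X, Y, UnitX and UnitY in the demonstration XML above). One way around it, sketched here under the assumption that the real files have that structure, is to let read_xml load all children and then keep and rename only X and Y. The function name read_points is made up for this example, and parser="etree" is chosen only so that lxml is not required.

```python
import pandas as pd


def read_points(xml_path):
    """Load one XML file and keep only the X and Y columns, renamed."""
    # Without names=, read_xml takes the column names from the child tags,
    # so the number of children per <Point> no longer has to match a list.
    df = pd.read_xml(xml_path, xpath=".//Point", parser="etree")
    return df[["X", "Y"]].rename(columns={"X": "Weg[mm]", "Y": "Kraft[N]"})
```

To get one sheet per file rather than one concatenated table, read_points would be called once per file inside the loop, and each resulting frame passed to to_excel with its own sheet_name on a single pd.ExcelWriter that is closed only after the loop has finished.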
