I'm faced with a predicament in Python where I need to split a portion of text using the line-ending characters at the end of each

In my program, I am working on analyzing XML files and one of the tasks is to split the data into sentences. However, I have encountered an issue where my line end characters are missing. I need to include them in order to add annotations with XML tags at the beginning and end of each sentence.

Currently, I have the following code snippet:

import re

line_end_chars = "!", "?", ".",">"


regexPattern = '|'.join(map(re.escape, line_end_chars))

line_list = re.split(regexPattern, texte)

Challenge:

When I run this code snippet with the text:

"Je pense que cela est compliqué de coder. Où puis-je apprendre?"

The output provides:

["Je pense que cela est compliqué de coder",
"Où puis-je apprendre"] 

Instead, the desired output should be:

["Je pense que cela est compliqué de coder.",
"Où puis-je apprendre?"] 

After resolving this issue, I plan to use a .replace method to add my XML tags accordingly.

Answer №1

A potential strategy involves utilizing the re.sub function in place of re.split, followed by using str.splitlines():

import re

line_end_symbols = "!", "?", ".",">"
s = "I think coding is complicated. Where can I learn?"

print( re.sub('(' + '|'.join(re.escape(ch) for ch in line_end_symbols) + ')\s*', r'\1\n', s).splitlines() )

This code snippet will output:

['I think coding is complicated.', 'Where can I learn?']

Answer №2

There are a couple of approaches to tackle this challenge.

import re

# Approach 1)
ending_characters = "!", "?", ".", ">"
regex_pattern = '|'.join(map(re.escape, ending_characters))
string = "I think coding is complicated. Where can I learn?"
line_list = []

for substring, delimiter in zip(re.split(regex_pattern, string), re.findall(regex_pattern, string)):
    line_list.append(substring + delimiter)

# Approach 2)
ending_characters = ["!", "?", ".", ">"]
string = "I think coding is complicated. Where can I learn?"
line_list = []

temp_string = ""
for char in string:
    if char in ending_characters:
        line_list.append(temp_string + char)
        temp_string = ""
    else:
        temp_string += char

Both methods output:

['I think coding is complicated.', 'Where can I learn?']

Similar questions

If you have not found the answer to your question or you are interested in this topic, then look at other similar questions below or use the search

I'm having trouble getting the play command to work on my Discord bot using Wavelink. Does anyone know how to troubleshoot and fix this

This is the play command for my bot: @client.tree.command(name="play", description="Play some music") @app_commands.describe(musicname="The name of the music you want to play") async def play(interaction:Interaction, musicname ...

Executing Python code in Visual Studio without the need for the Python extension by compiling and running the program

Currently, I am utilizing Visual Studio 2019 as my chosen text editor for writing Python programs. My goal is to compile and run the program directly within Visual Studio without having to download the VS Python extension. Python 3.8 has been successfully ...

How can I exclude specific lines from an XML file using filters?

Let's say I have the following data: <div class="info"><p><b>Orange</b>, <b>One</b>, ... <div class="info"><p><b>Blue</b>, <b>Two</b>, ... <div class="info"><p><b& ...

How can I use regular expressions to locate a React JSX element that contains a specific attribute?

Currently conducting an audit within a vast codebase, my task involves searching for all instances of a component where it is utilized with a specific prop. I believe that using regex could prove beneficial in this situation; however, the challenge lies in ...

Ways to access the Page Source of a subsequent page

My current goal is to convert the driver into html for use with beautiful soup. However, I am encountering an issue where the output from the prettifier (the one in the driver) shows the HTML of the login page instead of the page that should come after it ...

Implementing an array of functions on an array of elements based on their positions

Imagine the scenario: def x_squared(x): result = x * x return result def twice_x(x): result = 2 * x return result def x_cubed(x): result = x * x * x return result x_values = np.array([1, 2, 3]) functions = np.array([x_squared, t ...

Python allows for the sending of HTML content in the body of an email

Is there a way to show the content of an HTML file in an email body using Python, without having to manually copy and paste the HTML code into the script? ...

Developing jpg/png images from .ppt/pptx files with Django

I currently have a PowerPoint file saved on Dropbox at "". My goal is to convert this presentation file into individual slide images (jpg/png..) within my Django template or using a Django def function. Once I have converted the slides, I plan to utilize ...

Tips on scraping content with no identifiable attributes in Selenium using Python

Looking to extract electricity prices data from the following website: . When trying to locate the web elements for the date and price, this is the structure: The first date: td class="row-name ng-binding ng-scope" ng-if="tableData.dataType ...

What steps should I take to update the path and locate the geckodriver on my system?

I'm a complete beginner in the programming world. After researching for hours, I managed to fix most errors except one. It seems simple, but I can't figure it out. I tried to use selenium to open a webpage. from selenium import webdriver driver ...

What are the steps to display a graph on a webpage using AJAX in Django?

I'm currently working on developing a weather forecast application, and I have implemented a date input box on the home page. When users enter a date and click submit, a graph displaying temperature data should be displayed. I have set up a URL that r ...

What is the method to initiate a Python thread from C++?

In my current project, I am specifically limited to using Python 2.6. The software involves a Python 2.6 application that interfaces with a C++ multi-threaded API library created with boost-python. I have been trying to implement a Python function callback ...

Combining Graphical User Interface with Scripting

I am facing a challenge while trying to integrate two separate scripts, one with a GUI and the other being a bot script. Individually, both scripts run as expected but when combined, the bot function gets called immediately without displaying the user in ...

I encountered a PermissionError while trying to log in as an administrator

Just starting out with django and I'm trying to create a superuser. After following all the necessary steps, the terminal confirms that I have successfully created the superuser. However, when I go to the server, I encounter a permission error. I eve ...

Here are the steps to resolve the error message: "selenium.common.exceptions.NoSuchElementException: Message: no such element: Unable to locate element"

I am currently in the process of learning Selenium. My focus is on creating a script that can add multiple items to a shopping cart. Interestingly, when I removed the class from my script, it allowed me to successfully add an item to the cart. However, on ...

Basic paddleball game with an unresponsive ball

I've been learning Python through a course designed for kids, and one of the projects we worked on was creating a simple paddleball game. I managed to get the ball bouncing off the walls earlier, but now it's not working as expected after complet ...

The issue states that the executable 'geckodriver' must be located within the PATH directory

I'm currently working on a Python project that involves using the selenium library. My preferred browser for this project is firefox, so I went ahead and downloaded the geckodriver. After downloading, I made sure to add it to my Path: https://i.stack ...

Automate image clicking on web pages using Selenium and JavaScript with Python

I'm currently using Selenium to attempt clicking on a JavaScript image within a website, but I'm struggling to accomplish this task. Here is the code I have thus far: from selenium import webdriver from selenium.webdriver.common.keys import Key ...

Is it possible to webscrape a jTable that contains hidden columns?

I'm currently working on setting up a Python web scraper for this webpage: specifically targeting the 'team-players jTable' I've successfully scraped the visible table using BeautifulSoup and selenium, but I'm facing difficulties ...

Searching for ways to filter out specific tags using regular expressions

Can anyone provide a quick solution to help me with this issue? I am trying to remove all html tags from a string, except for the ones specified in a whitelist (variable). This is my current code: whitelist = 'p|br|ul|li|strike|em|strong|a', ...