What is the best method for retrieving text before and after a specific word in Excel or Python?

I have a large block of text with various lines like:

ksjd 234first special 34-37xy kjsbn
sde 89second special 22-23xh ewio
647red special 55fg dsk
uuire another special 98
another special 107r
green special 55-59 ewk
blue special 31-39jkl

My goal is to identify the word before "special" and the number (or range of numbers) on the right. Basically, I want to organize this information into a table format:

https://i.stack.imgur.com/ojs0E.jpg

This table would then look like this:

https://i.stack.imgur.com/8lSgs.jpg

Answer №1

An efficient way to do this is with regular expressions:

In [1]: import re

In [2]: text = '''234first special 34-37xy
   ...: 89second special 22-23xh
   ...: 647red special 55fg
   ...: another special 98
   ...: another special 107r
   ...: green special 55-59
   ...: blue special 31-39jkl'''

In [3]: [re.findall(r'\d*\s*(\S+)\s+(special)\s+(\d+(?:-\d+)?)', line)[0] for line in text.splitlines()]
Out[3]: 
[('first', 'special', '34-37'),
 ('second', 'special', '22-23'),
 ('red', 'special', '55'),
 ('another', 'special', '98'),
 ('another', 'special', '107'),
 ('green', 'special', '55-59'),
 ('blue', 'special', '31-39')]
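
Two caveats apply to the one-liner in In [3]: it raises IndexError on any line where the pattern finds nothing (re.findall returns an empty list), and on lines that begin with an extra token, such as "ksjd 234first special ..." from the question, the \S+ group can pick up the digits as well. The following is a minimal sketch of a more tolerant variant, assuming the word before "special" is purely alphabetic. The names PATTERN, extract_rows and sample are illustrative and not part of the original answer.

```python
import re

# Letters immediately before "special", then a number or a number range.
# Assumption: the word of interest consists only of ASCII letters.
PATTERN = re.compile(r"([A-Za-z]+)\s+(special)\s+(\d+(?:-\d+)?)")


def extract_rows(text):
    """Return (word, 'special', number) tuples, skipping lines with no match."""
    rows = []
    for line in text.splitlines():
        match = PATTERN.search(line)
        if match:  # lines without a match are skipped instead of raising
            rows.append(match.groups())
    return rows


# The raw lines from the question, plus one line that should be ignored.
sample = """ksjd 234first special 34-37xy kjsbn
sde 89second special 22-23xh ewio
647red special 55fg dsk
uuire another special 98
another special 107r
green special 55-59 ewk
blue special 31-39jkl
no keyword on this line"""

for row in extract_rows(sample):
    print(row)
```

Using search on a letters-only group lets the pattern skip leading tokens and digits without having to describe them explicitly.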

Answer №2

If you're working in Excel and need to extract text between two specific words, here's how you can do it:

  1. Select an empty cell, enter the formula =MID(A1,SEARCH("START",A1)+5,SEARCH("END",A1)-SEARCH("START",A1)-6), and press Enter.

  2. Drag the fill handle to apply the formula to the desired range. This extracts the text between "START" and "END."

Please keep the following points in mind:

  1. In the formula, A1 is the cell from which you want to extract text.

  2. "START" and "END" are the words that delimit the text you wish to extract.

  3. The number 5 is the character length of "START," and the number 6 is one more than that length. Adjust both numbers if you use a different start word.
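
For the Python side of the question, the same idea (take whatever lies between two marker words) can be sketched with the re module. This is only an illustration: text_between is a hypothetical helper name, and the match is made case-insensitive on the assumption that it should behave like Excel's SEARCH, which ignores case.

```python
import re


def text_between(s, start, end):
    """Return the text between the first `start` and the next `end`, or None."""
    # re.escape treats the markers as literal text, not as regex syntax.
    pattern = re.escape(start) + r"(.*?)" + re.escape(end)
    match = re.search(pattern, s, flags=re.IGNORECASE | re.DOTALL)
    # strip() removes the spaces that usually surround the extracted text.
    return match.group(1).strip() if match else None


print(text_between("foo START hello END bar", "START", "END"))
```

Returning None when either marker is missing avoids the error value that the worksheet formula produces in that case.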

Answer №3

In addition to @RolandSmith's explanation, here is a way to use regular expressions in Excel with VBA:


Option Explicit

' Returns submatch number Index from the first match in S:
' 1 = word before "special", 2 = "special", 3 = the number or range.
Function ExtractSpecialCharacters(S As String, Index As Long) As String
    Dim RE As Object, MC As Object
    Const sPattern As String = "([a-z]+)\s+(special)\s+([^a-z]+)"

    Set RE = CreateObject("vbscript.regexp")
    With RE
        .Global = True
        .IgnoreCase = True
        .MultiLine = False
        .Pattern = sPattern
        If .Test(S) = True Then
            Set MC = .Execute(S)
            ExtractSpecialCharacters = MC(0).SubMatches(Index - 1)
        End If
    End With

End Function

The Index parameter in this UDF allows you to retrieve the 1st, 2nd, or 3rd submatch from the matched collection, enabling you to easily separate the original string into your desired components.

https://i.stack.imgur.com/y2nXs.png

Since you mentioned having "thousands of lines", a macro may be preferable. The macro below runs much faster than the UDF, but its output is static: it does not update when the source data changes. It assumes the original data is in column A of Sheet2 and writes the results to columns C:E of the same sheet; both locations can be adjusted:


Sub ExtractSpecial()
    Dim RE As Object, MC As Object
    Dim wsSource As Worksheet, wsResult As Worksheet, rResult As Range
    Dim vSource As Variant, vResult As Variant
    Dim J As Long

    ' Source and destination sheets; results start at C1
    Set wsSource = Worksheets("sheet2")
    Set wsResult = Worksheets("sheet2")
    Set rResult = wsResult.Cells(1, 3)

    ' Read column A into an array in a single operation
    With wsSource
        vSource = .Range(.Cells(1, 1), .Cells(.Rows.Count, 1).End(xlUp))
    End With

    Set RE = CreateObject("vbscript.regexp")
    With RE
        .Global = True
        .MultiLine = False
        .IgnoreCase = True
        .Pattern = "([a-z]+)\s+(special)\s+([^a-z]+)"

        ' One result row per source row; rows without a match stay empty
        ReDim vResult(1 To UBound(vSource), 1 To 3)
        For J = 1 To UBound(vSource)
            If .Test(vSource(J, 1)) = True Then
                Set MC = .Execute(vSource(J, 1))
                vResult(J, 1) = MC(0).SubMatches(0)
                vResult(J, 2) = MC(0).SubMatches(1)
                vResult(J, 3) = MC(0).SubMatches(2)
            End If
        Next J
    End With

    ' Write the whole result array back in a single operation
    Set rResult = rResult.Resize(UBound(vResult, 1), UBound(vResult, 2))
    With rResult
        .EntireColumn.Clear
        .Value = vResult
        .EntireColumn.AutoFit
    End With

End Sub
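
If the lines live in a text file rather than a worksheet, a comparable batch pass can be sketched in Python, writing a CSV file that Excel can open. This is a rough equivalent under stated assumptions, not a translation of the macro: write_rows is a hypothetical helper, the third group is narrowed to a number or range (\d+(?:-\d+)?) instead of the macro's [^a-z]+, and lines without a match produce an empty row so that output rows stay aligned with input lines, as in the macro.

```python
import csv
import io
import re

# Word before "special", the keyword itself, then a number or number range.
PATTERN = re.compile(r"([a-z]+)\s+(special)\s+(\d+(?:-\d+)?)", re.IGNORECASE)


def write_rows(lines, out):
    """Write one CSV row per input line; unmatched lines give an empty row."""
    writer = csv.writer(out)
    for line in lines:
        match = PATTERN.search(line)
        writer.writerow(match.groups() if match else ["", "", ""])


# In-memory demonstration; a real file would work the same way.
buffer = io.StringIO()
write_rows(
    ["green special 55-59 ewk", "no match here", "647red special 55fg dsk"],
    buffer,
)
print(buffer.getvalue())
```

To write to disk instead, pass a file opened with open(path, "w", newline="") in place of the StringIO buffer.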
