Differences between BeautifulSoup and re when searching with regular expressions

I use urllib2.urlopen to fetch the source code of websites like this one, decode the bytes, and use BeautifulSoup to extract the code marked as applet. The extracted code may contain lines such as:

<param name="G_00" value="espacio='E1' tipo='macro' expresión='dinamica/resorte'">

I want to capture every value of "expresión=" in lines whose value contains tipo='macro' (here, dinamica/resorte and dinamica/masa).

Using beautifulsoup, I locate these lines as tags with tipo='macro'. To be succinct, I focus on extracting the content following expresión=:

key_macro = ['expresión=', 'expresion=', 'expresión='....] # this presents a potential issue.
for y in key_macro:
    if string.find(tag, y) != -1:
        # 1: value that may contain dots, e.g. dinamica/resorte.txt
        mexpression = r"%s'([\w\./]+)'" % y
        mpatron = re.compile(mexpression)
        mresult = mpatron.search(tag['value'])
        if mresult:
            macroslist.append(mresult.group(1))
        # 2: word/word value with no extension, e.g. dinamica/resorte
        wexpression = r"%s'([\w/]+)'" % y
        wpatron = re.compile(wexpression)
        wresult = wpatron.search(tag['value'])
        if wresult:
            macroslist.append(wresult.group())

The problem: pattern #1 finds the value when it ends in .txt, but pattern #2, which looks for word/word, often fails to find values such as dinamica/resorte. Is there something wrong with my regular expressions? How do I specify word/word in a regex?

I tried searching with BeautifulSoup alone, but that is awkward because 'macro' is contained within the value. Using re with search looks promising (pattern #1 works on values like dinamica/resorte.txt), but it fails when there is no file extension.

Thank you for your assistance.

Answer №1

Apologies for the simplistic solution. It would help if you clarified which keys you need to search for, and I do not consider the code below optimal. Still, you can try the following approach:

# -*- coding: utf-8 -*-
import re

def simple_solution(s, rex=re.compile(r"expresion='([a-zA-Z./]+)'")):
    # Normalise the key: replace the HTML entity and the accented letter with a plain 'o'
    s = s.replace('&oacute;', 'o')
    s = s.replace('ó', 'o')
    print s  # show the normalised string
    m = rex.search(s)
    if m:
        return m.group(1)
    return None

tag = "<param name=\"G_00\" value=\"espacio='E1' tipo='macro' expresi&oacute;n='dinamica/resorte'\">"
print tag
print simple_solution(tag)

This code snippet will output on the console:

c:\tmp\___python\Antonio\so10295276>python a.py
<param name="G_00" value="espacio='E1' tipo='macro' expresi&oacute;n='dinamica/resorte'">
<param name="G_00" value="espacio='E1' tipo='macro' expresion='dinamica/resorte'">
dinamica/resorte
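
As for how to specify word/word in a regex: a character class such as [\w/]+ already accepts a slash between word characters, so the pattern itself is probably not what fails. The sketch below is only an illustration, written in Python 3 syntax rather than the Python 2 used above; the helper name find_expression and the two sample strings are made up. It suggests that the result depends on how the key in front of the quote is spelled in the page source.

```python
import re

# Value pattern #2 from the question, with the key fixed to the plain spelling.
PATTERN = re.compile(r"expresion='([\w/]+)'")

def find_expression(value):
    """Return the quoted text after expresion=, or None if the key does not match."""
    m = PATTERN.search(value)
    return m.group(1) if m else None

# Hypothetical sample strings: same value, different spelling of the key.
plain = "espacio='E1' tipo='macro' expresion='dinamica/resorte'"
entity = "espacio='E1' tipo='macro' expresi&oacute;n='dinamica/resorte'"

print(find_expression(plain))   # dinamica/resorte
print(find_expression(entity))  # None
```

If the second call is the case you are hitting, the fix belongs in the key (or in decoding the entity first), not in the word/word part of the pattern.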

A more refined approach uses unicode strings and a single regular expression that accepts all three spellings of the key:

# -*- coding: utf-8 -*-
import re

# The key may be spelled with 'o', 'ó' or the HTML entity '&oacute;'
rex = re.compile(ur"expresi(o|ó|&oacute;)n='(?P<text>[a-zA-Z./]+)'")

tag = u"<param name=\"G_00\" value=\"espacio='E1' tipo='macro' expresi&oacute;n='dinamica/resorte'\">"
print tag

m = rex.search(tag)
if m:
    print m.group('text')
else:
    print None

The above code segment will print:

c:\tmp\___python\Antonio\so10295276>python b.py
<param name="G_00" value="espacio='E1' tipo='macro' expresi&oacute;n='dinamica/resorte'">
dinamica/resorte
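
Another option, sketched here under a few assumptions, is to decode HTML entities before matching so the pattern only has to deal with o and ó. This version is written in Python 3 (html.unescape does not exist in Python 2), the function name macro_expressions is hypothetical, and the sample values are made up from the line in the question. It also keeps only the values that contain tipo='macro', as the question asks.

```python
import html
import re

# Accept both the accented and the unaccented spelling of the key.
EXPRESSION = re.compile(r"expresi[oó]n='([\w./]+)'")

def macro_expressions(values):
    """Collect expresión values from attribute strings that contain tipo='macro'."""
    found = []
    for raw in values:
        value = html.unescape(raw)  # turns &oacute; into ó
        if "tipo='macro'" not in value:
            continue  # skip entries that are not macros
        m = EXPRESSION.search(value)
        if m:
            found.append(m.group(1))
    return found

# Hypothetical sample values modelled on the line in the question.
samples = [
    "espacio='E1' tipo='macro' expresi&oacute;n='dinamica/resorte'",
    "espacio='E1' tipo='macro' expresión='dinamica/masa'",
    "espacio='E1' tipo='texto' expresion='ignorado'",
]

print(macro_expressions(samples))  # ['dinamica/resorte', 'dinamica/masa']
```

In the real script the strings would come from tag['value'] for each param tag that BeautifulSoup returns; whether the entity is still present at that point depends on the parser, so treat the unescape step as a precaution.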
