How can I automate scraping Wikipedia infoboxes and displaying the data with Python, for any Wikipedia page?

My current project involves automating the extraction and printing of infobox data from Wikipedia pages. For example, I am currently working on scraping the Star Trek Wikipedia page (https://en.wikipedia.org/wiki/Star_Trek) to extract the infobox section displayed on the right-hand side. Using Python, I aim to print each piece of information row by row on the screen. So far, my code looks like this:

from bs4 import BeautifulSoup
import urllib.request

urlpage =  'https://en.wikipedia.org/wiki/Star_Trek'
page = urllib.request.urlopen(urlpage)
soup = BeautifulSoup(page, 'html.parser')
table = soup.find('table', attrs={'class': 'infobox vevent'})
results = table.find_all('tr')
print(type(results))
print('Number of results', len(results))
print(results)

The code above retrieves the full contents of the infobox. Here is a snippet of the output:

[<tr><th class="summary" colspan="2" style="text-align:center;font- 
size:125%;font-weight:bold;font-style: italic; background: lavender;"> 
<i>Star Trek</i></th></tr>, <tr><td colspan="2" style="text-align:center">
<a class="image" href="/wiki/File:Star_Trek_TOS_logo.svg"><img alt="Star 
Trek TOS logo.svg" data-file-height="132" data-file-width="560" height="59" 

My objective now is to extract only the data and print it on the screen in a structured manner. I would like the output to include details such as the creator, original work, and other relevant information present in the infobox.

I want to find a way to automatically extract and print every row of data from an infobox so that I can apply this process to any Wikipedia page containing an infobox with the class "infobox vevent".

Answer №1

If you want to extract text from HTML without the tags, see the related question "Using BeautifulSoup Extract Text without Tags".

This code snippet is credited to @0605002

>>> html = """
<p>
    <strong class="offender">YOB:</strong> 1987<br />
    <strong class="offender">RACE:</strong> WHITE<br />
    <strong class="offender">GENDER:</strong> FEMALE<br />
    <strong class="offender">HEIGHT:</strong> 5'05''<br />
    <strong class="offender">WEIGHT:</strong> 118<br />
    <strong class="offender">EYE COLOR:</strong> GREEN<br />
    <strong class="offender">HAIR COLOR:</strong> BROWN<br />
</p>
"""
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html, 'html.parser')
>>> print(soup.text)


YOB: 1987
RACE: WHITE
GENDER: FEMALE
HEIGHT: 5'05''
WEIGHT: 118
EYE COLOR: GREEN
HAIR COLOR: BROWN
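
Applied to the infobox from the question, the same idea could be sketched as follows. This is only an illustration, not a definitive implementation: the SAMPLE_HTML string is a hypothetical, trimmed-down stand-in for the real page, the helper name infobox_rows is made up for this example, and it assumes each data row holds a label in a <th> and a value in a <td>, which may not hold for every infobox.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down infobox markup used only for illustration;
# the real Wikipedia page is much larger and its layout may differ.
SAMPLE_HTML = """
<table class="infobox vevent">
  <tr><th colspan="2" class="summary"><i>Star Trek</i></th></tr>
  <tr><th scope="row">Created by</th><td><a href="/wiki/Gene_Roddenberry">Gene Roddenberry</a></td></tr>
  <tr><th scope="row">Original work</th><td><i>Star Trek: The Original Series</i></td></tr>
</table>
"""


def infobox_rows(html):
    """Return (label, value) text pairs for each infobox row that has a header cell."""
    soup = BeautifulSoup(html, "html.parser")
    # class_="infobox" matches any table whose class list contains "infobox"
    table = soup.find("table", class_="infobox")
    if table is None:
        return []
    rows = []
    for tr in table.find_all("tr"):
        header = tr.find("th")
        value = tr.find("td")
        if header is None:
            continue  # skip rows without a label, e.g. image-only rows
        label = header.get_text(" ", strip=True)
        # Title rows have a header but no value cell
        text = value.get_text(" ", strip=True) if value is not None else ""
        rows.append((label, text))
    return rows


for label, value in infobox_rows(SAMPLE_HTML):
    print(f"{label}: {value}" if value else label)
```

To run this against the live page, the HTML returned by urllib.request.urlopen in the question could be passed in place of SAMPLE_HTML.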

Answer №2

With BeautifulSoup you can reshape the data however you like. Use

fresult = [e.text for e in results]
to extract the text of each row in results.

To read an HTML table directly, you can also use pandas:

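import pandas

urlpage = 'https://en.wikipedia.org/wiki/Star_Trek'
# read_html returns one DataFrame per <table>; this assumes the infobox is the first one
data = pandas.read_html(urlpage)[0]
null = data.isnull()

for x in range(len(data)):
    first = data.iloc[x, 0]
    # Use an empty string where the second cell is missing
    second = data.iloc[x, 1] if not null.iloc[x, 1] else ""
    print(first, second, "\n")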
