Is it possible to extract data from several tables on a Wikipedia page, including their headers, using Python's requests and BeautifulSoup libraries?

I am scraping the tables from the following Wikipedia page with requests and BeautifulSoup: https://en.wikipedia.org/wiki/Mobile_country_code. I can already retrieve all the data in the tables, but I also want an additional column, Country, filled with the name of the table (the heading above it) that each row came from.

The code below fetches all the data, but without the Country column:

import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

wiki = requests.get('https://en.wikipedia.org/wiki/Mobile_country_code')
soup = BeautifulSoup(wiki.content, 'html.parser')

# Extract all the tables
tables = soup.find_all('table',class_="wikitable")

# Extracting the column names
column_names = [item.get_text() for item in tables[0].find_all('th')]

# Extracting the content
contents = [item.get_text() for item in tables[0].find_all('td')]

# Placing all the content into a list
values=[]
for table in tables:
    for item in table.select('td'):
        temp = item.get_text()
        values.append(temp)

# As there are 7 columns, determining the number of rows and reshaping the table
len(values)/7   # Returns 2452 rows

# Modifying the shape of the table
data = np.reshape(values,(2452,7))

# Creating a dataframe to store all the data
# (column_names holds the 7 header cells extracted above; header_list was never defined here)
df = pd.DataFrame(data=data, columns=column_names)
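
A side note on the reshape step, as a minimal sketch: the row count does not have to be hard-coded, because passing -1 lets NumPy infer it from the length of the list. The cell values below are placeholders, not data from the page, and N_COLUMNS is a name introduced only for this example.

```python
import numpy as np

N_COLUMNS = 7  # the tables in the question have 7 columns

# Placeholder cell texts standing in for the scraped <td> values (14 cells = 2 rows).
values = [f"cell{i}" for i in range(14)]

# Reshaping only works when the cell count is an exact multiple of the column count.
if len(values) % N_COLUMNS != 0:
    raise ValueError("cell count is not a multiple of the column count")

# -1 asks NumPy to work out the number of rows itself.
data = np.reshape(values, (-1, N_COLUMNS))
n_rows = data.shape[0]
```

Flattening every cell into one list still discards which table a cell came from, which is why the Country column cannot be recovered from this array alone.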

Answer №1

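Try this: collect the country headings (the h3 elements, plus the parent of the International_operators span), find the table that follows each heading, and prepend the heading text to every row of that table: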

#Extracting Data from HTML Tables
# Finding all tables
tables = soup.find_all('table',class_="wikitable")

# Extracting column names
column_names = [item.get_text() for item in tables[0].find_all('th')]

# Extracting content
contents = [item.get_text() for item in tables[0].find_all('td')]

# Initializing list for values
values_list = []
# Finding all countries
countries = soup.find_all('h3')
international = [soup.find('span',{"id":"International_operators"}).parent]
countries = countries+international
for c in countries:
    table = c.find_next_sibling("table")
    if table is not None: #Checking if the country has a table
        for item in table.select('tr')[1:]:
            values = [e.get_text() for e in item.select('td')]
            values = [c.text]+values
            values_list.append(values)

header_list = ["COUNTRY"]+ column_names

# Creating a dataframe with the extracted data
df = pd.DataFrame(values_list, columns=header_list)
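
A variation on the same idea, as a minimal sketch: instead of walking from each heading to the next table, start from each table and look backwards for the nearest h3 with find_previous. A sibling search from the heading has one caveat: if a heading had no table of its own, find_next_sibling("table") would return the table that belongs to a later heading. The HTML string, the second country and the helper name rows_with_heading are made up for illustration, and the markup of the live page may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the page: each country heading is followed by a table.
# The first table reuses two rows from the output shown in this answer; the second
# country and its row are placeholders.
SAMPLE_HTML = """
<h3>Abkhazia - GE-AB</h3>
<table class="wikitable">
  <tr><th>MCC</th><th>MNC</th><th>Brand</th></tr>
  <tr><td>289</td><td>67</td><td>Aquafon</td></tr>
  <tr><td>289</td><td>88</td><td>A-Mobile</td></tr>
</table>
<h3>Exampleland - XX</h3>
<table class="wikitable">
  <tr><th>MCC</th><th>MNC</th><th>Brand</th></tr>
  <tr><td>999</td><td>01</td><td>ExampleNet</td></tr>
</table>
"""


def rows_with_heading(html):
    """Return (header, rows); every row starts with the text of its table's heading."""
    soup = BeautifulSoup(html, "html.parser")
    tables = soup.find_all("table", class_="wikitable")
    if not tables:
        return ["COUNTRY"], []

    # Column names come from the first table, as in the answer.
    header = ["COUNTRY"] + [th.get_text(strip=True) for th in tables[0].find_all("th")]

    rows = []
    for table in tables:
        # Nearest h3 that appears before this table in document order.
        heading = table.find_previous("h3")
        country = heading.get_text(strip=True) if heading else ""
        for tr in table.find_all("tr")[1:]:  # skip the header row
            cells = [td.get_text(strip=True) for td in tr.find_all("td")]
            if cells:
                rows.append([country] + cells)
    return header, rows


header, rows = rows_with_heading(SAMPLE_HTML)
```

The returned header and rows can be passed to pd.DataFrame(rows, columns=header) in the same way as values_list and header_list above.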

The resulting df will look like this:

    COUNTRY             MCC MNC Brand    Operator       Status       Bands (MHz)                                        References and notes
0   Abkhazia - GE-AB    289 67  Aquafon  Aquafon JSC    Operational GSM 900 / GSM 1800 / UMTS 2100 / LTE 800            MCC is not listed by ITU;[85] LTE band 20[95]
1   Abkhazia - GE-AB    289 88  A-Mobile A-Mobile LLSC  Operational GSM 900 / GSM 1800 / UMTS 2100 / LTE 800 / LTE...   MCC is not listed by ITU[85]
...
