BeautifulSoup raises KeyError: 'href' when the scraper is scaled from 40 to 80 pages

I'm using bs4 to build a web scraper that gathers funding news data.

  1. The first part of the code extracts the title, link, summary, and date of each article from n pages.
  2. The second part iterates through the link column and passes each URL to a second function that retrieves the company's URL from the article page.

Overall, the code works well (it scraped 40 pages without errors). When I stress-test it by increasing the range to 80 pages, however, I get KeyError: 'href' and can't work out how to resolve it.

import requests
import pandas as pd
import numpy as np

from bs4 import BeautifulSoup
from tqdm import tqdm

def clean_data(column):
    df[column]= df[column].str.encode('ascii', 'ignore').str.decode('ascii')

#extract

def extract(page):
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.2 Safari/605.1.15'}
    url = f'https://www.uktechnews.info/category/investment-round/series-a/page/{page}/'
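    # headers must be passed by keyword; the second positional argument of requests.get is params
    r = requests.get(url, headers=headers)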
    soup = BeautifulSoup(r.content, 'html.parser')
    
    return soup

#transform

def transform(soup):
    
    for item in soup.find_all('div', class_ = 'post-block-style'):
        title = item.find('h3', {'class': 'post-title'}).text.replace('\n','')
        link = item.find('a')['href']
        summary = item.find('p').text
        date = item.find('span', {'class': 'post-meta-date'}).text.replace('\n','')
        
        news = {
            'title': title,
            'link': link,
            'summary': summary,
            'date': date
        }
        newslist.append(news)
    return

newslist = []

#subpage
def extract_subpage(url):
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.2 Safari/605.1.15'}
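    # headers must be passed by keyword; the second positional argument of requests.get is params
    r = requests.get(url, headers=headers)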
    soup_subpage = BeautifulSoup(r.text, 'html.parser')
    
    return soup_subpage

def transform_subpage(soup_subpage):
    main_data = soup_subpage.select("div.entry-content.clearfix > p > a")
    
    if len(main_data) > 0:
        subpage_link = {
            'subpage_link': main_data[0]['href']
        }
        subpage.append(subpage_link)
    else:
        subpage_link = {
            'subpage_link': '--'
        }
        subpage.append(subpage_link)
    return
    
subpage = []

#load

page = np.arange(0, 80, 1).tolist()

for page in tqdm(page):
    try:
        c = extract(page)
        transform(c)
    except:
        None

df1 = pd.DataFrame(newslist)   

for url in tqdm(df1['link']):
    t = extract_subpage(url)
    transform_subpage(t)

df2 = pd.DataFrame(subpage)

I suspect the problem is the if statement in transform_subpage: it doesn't handle the case where main_data is non-empty but its first element has no href attribute. I'm a Python beginner, so any guidance is appreciated!

Answer №1

Yes, the error occurs because main_data[0] sometimes has no 'href' attribute. An <a> tag is not required to have one, and indexing a missing attribute on a bs4 tag raises KeyError. One way to handle it is to check for the attribute before reading it:

def transform_subpage(soup_subpage):
    main_data = soup_subpage.select("div.entry-content.clearfix > p > a")

    # Use the first matched anchor's href if it has one; otherwise fall back
    # to '--' so that every article still adds exactly one row to subpage.
    if main_data and 'href' in main_data[0].attrs:
        subpage_link = {'subpage_link': main_data[0]['href']}
    else:
        subpage_link = {'subpage_link': '--'}
    subpage.append(subpage_link)
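
To see why the check is needed, here is a minimal sketch that reproduces the error outside the scraper. The HTML snippet is made up for illustration (it is not taken from the site), and the a[href] selector at the end is an alternative I am suggesting, not something from the original code.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML: the first anchor is a named anchor without an href.
html = """
<div class="entry-content clearfix">
  <p><a name="top">Intro</a></p>
  <p><a href="https://example.com">Company</a></p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
anchors = soup.select("div.entry-content.clearfix > p > a")
first = anchors[0]

# Indexing a missing attribute on a tag raises KeyError, like a dict.
try:
    first["href"]
    raised = False
except KeyError:
    raised = True

# Tag.get() returns a default instead of raising.
fallback = first.get("href", "--")

# Alternative: only select anchors that actually carry an href attribute.
with_href = soup.select("div.entry-content.clearfix > p > a[href]")
first_link = with_href[0]["href"] if with_href else "--"
```

Note that the a[href] variant changes behavior slightly: it returns the first anchor that has a link rather than '--' when the very first anchor lacks one, which may or may not be what you want.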

Also, avoid using the same name for a list and for the loop variable that iterates over it. In for page in tqdm(page), the name page is rebound from the list to each of its items. Distinct names are clearer:

pages_list = np.arange(0, 80, 1).tolist()

for page_num in tqdm(pages_list):
    soup = extract(page_num)
    transform(soup)
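
A small stand-alone illustration of the naming issue, using short made-up lists rather than the scraper's page range:

```python
# Reusing the list's name as the loop variable rebinds that name.
page = list(range(3))
for page in page:
    pass

# After the loop, `page` is the last item (an int); the list is no longer reachable by name.
shadowed_type = type(page).__name__

# With distinct names, the list is still available afterwards.
pages_list = list(range(3))
for page_num in pages_list:
    pass
```

The loop itself still runs correctly in the first case, because the iterator is created before the name is rebound. The problem only appears when later code expects page to still be the list.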
