Python's BeautifulSoup is throwing a KeyError for 'href' in the current scenario

Question

I'm using bs4 to build a web scraper that gathers funding news data.

The first part of the code extracts the title, link, summary, and date of each article from n pages.
The second part iterates through the link column and passes each URL to a function that retrieves the company's URL.

Overall, the code works well (it scraped 40 pages without errors). However, when I stress test it by increasing the range to 80 pages, I get KeyError: 'href' and can't work out how to resolve it.

import requests
import pandas as pd
import numpy as np

from bs4 import BeautifulSoup
from tqdm import tqdm

def clean_data(column):
    df[column]= df[column].str.encode('ascii', 'ignore').str.decode('ascii')

#extract

def extract(page):
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.2 Safari/605.1.15'}
    url = f'https://www.uktechnews.info/category/investment-round/series-a/page/{page}/'
    # Pass headers by keyword; the second positional argument of requests.get is params.
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.content, 'html.parser')
    
    return soup

#transform

def transform(soup):
    
    for item in soup.find_all('div', class_ = 'post-block-style'):
        title = item.find('h3', {'class': 'post-title'}).text.replace('\n','')
        link = item.find('a')['href']
        summary = item.find('p').text
        date = item.find('span', {'class': 'post-meta-date'}).text.replace('\n','')
        
        news = {
            'title': title,
            'link': link,
            'summary': summary,
            'date': date
        }
        newslist.append(news)
    return

newslist = []

#subpage
def extract_subpage(url):
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.2 Safari/605.1.15'}
    # Pass headers by keyword; the second positional argument of requests.get is params.
    r = requests.get(url, headers=headers)
    soup_subpage = BeautifulSoup(r.text, 'html.parser')
    
    return soup_subpage

def transform_subpage(soup_subpage):
    main_data = soup_subpage.select("div.entry-content.clearfix > p > a")
    
    if len(main_data) > 0:
        subpage_link = {
            'subpage_link': main_data[0]['href']
        }
        subpage.append(subpage_link)
    else:
        subpage_link = {
            'subpage_link': '--'
        }
        subpage.append(subpage_link)
    return
    
subpage = []

#load

page = np.arange(0, 80, 1).tolist()

for page in tqdm(page):
    try:
        c = extract(page)
        transform(c)
    except:
        None

df1 = pd.DataFrame(newslist)   

for url in tqdm(df1['link']):
    t = extract_subpage(url)
    transform_subpage(t)

df2 = pd.DataFrame(subpage)

[Screenshot of the error: KeyError: 'href']

It seems the if statement in transform_subpage does not account for cases where main_data is non-empty but its first element has no href attribute. I'm a Python beginner, so any guidance is appreciated!

python web-scraping beautifulsoup

Answer 1

Indeed, the issue arises because main_data[0] sometimes lacks an 'href' attribute. One possible fix is to check for the attribute before reading it:

def transform_subpage(soup_subpage):
    main_data = soup_subpage.select("div.entry-content.clearfix > p > a")

    # Fall back to '--' when no anchor matched or the first anchor has no href,
    # so that every scraped URL still produces exactly one row.
    link = '--'
    if main_data and 'href' in main_data[0].attrs:
        link = main_data[0]['href']
    subpage.append({'subpage_link': link})
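
As a rough sketch of two other ways to avoid the KeyError, the example below uses Tag.get, which returns a default instead of raising, and the CSS selector a[href], which only matches anchors that carry the attribute. The HTML sample and the helper names first_anchor_href and first_link are made up for illustration and are not part of the original code. Note that the two helpers can return different results: the a[href] version skips ahead to the first anchor that has a link, which may or may not be the company link you want.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML: the first anchor in the paragraph has no href attribute.
SAMPLE_HTML = """
<div class="entry-content clearfix">
  <p><a name="top">Acme</a> raised a Series A. <a href="https://example.com">Site</a></p>
</div>
"""

def first_anchor_href(soup, default='--'):
    """Return the href of the first matching anchor, or default if it has none."""
    anchor = soup.select_one("div.entry-content.clearfix > p > a")
    # Tag.get returns the default instead of raising KeyError.
    return anchor.get('href', default) if anchor is not None else default

def first_link(soup, default='--'):
    """Return the href of the first matching anchor that has one, else default."""
    # a[href] only matches anchors that actually carry an href attribute.
    anchor = soup.select_one("div.entry-content.clearfix > p > a[href]")
    return anchor['href'] if anchor is not None else default

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
print(first_anchor_href(soup))  # first anchor lacks href, so this prints --
print(first_link(soup))         # prints https://example.com
```

Either helper returns a plain string, so it could be dropped into the dictionary that gets appended to subpage.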

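It is also advisable not to reuse the name of a list as the loop variable when iterating over it, as in for page in tqdm(page): the loop rebinds the name, so the list can no longer be reached through it afterwards. A clearer version would be: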

pages_list = np.arange(0, 80, 1).tolist()

for page_num in tqdm(pages_list):
    ...  # same loop body as before, calling extract(page_num)
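
To illustrate why the naming matters, here is a small standalone sketch (plain lists, no scraping involved) showing what happens to the name when the list and the loop variable share it:

```python
# Reusing the list's name as the loop variable rebinds it on every iteration.
page = list(range(3))
for page in page:
    pass
print(page)  # the name now refers to the last item (2), not the list

# With distinct names the list is still available after the loop.
pages_list = list(range(3))
for page_num in pages_list:
    pass
print(pages_list)  # still [0, 1, 2]
```

The loop itself still runs correctly in the first case because the iterator is created before the name is rebound, but any later code that expects page to be the list will break.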
