Not all HREF links are being captured by BeautifulSoup while scraping this website... no results are being returned

Question

I want to extract every link from a website so that I can build a complete list of its products. The code below does not return the links I expect; in fact, it returns no results at all.

import requests
from bs4 import BeautifulSoup
import pandas as pd


baseurl = "https://www.examplewebsite.com/"
headers = {'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36'}

for x in range(1,6):
    r = requests.get(f'https://www.examplewebsite.com/items/?swoof=1&paged={x}', verify = False)
    soup = BeautifulSoup(r.text, 'html.parser')
    for link in soup.find_all('a'):
        print(link.get('href'))

python web-scraping href

Answer 1

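It's not clear what the problem is. When I run the code with minor adjustments (passing the headers to the request, moving the query string into params, and silencing the InsecureRequestWarning that verify=False triggers), all the expected product links are returned: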

import requests
import urllib3
from bs4 import BeautifulSoup

# verify=False raises an InsecureRequestWarning on every request; silence it.
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

url = "https://www.ercotires.com/tienda/"
headers = {'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36'}

for x in range(1, 6):
    # Send the browser-like User-Agent and let requests build the query string.
    r = requests.get(url, params={'swoof': '1', 'paged': x}, headers=headers, verify=False)
    soup = BeautifulSoup(r.text, 'html.parser')
    # Print the href of every anchor tag on the page (None if it has no href).
    for link in soup.find_all('a'):
        print(link.get('href'))

Output:

// Output intentionally omitted for brevity //
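
If the loop still prints nothing, it may help to confirm that the request itself succeeded before parsing. The sketch below is one possible way to do that, not part of the original answer: the helper names extract_links and fetch_page_links are hypothetical, the timeout value is an assumption, and the URL, query parameters, and User-Agent are copied from the code above. It also de-duplicates links and resolves relative ones against the page URL.

```python
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

# Values copied from the answer above.
LISTING_URL = "https://www.ercotires.com/tienda/"
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36"
}


def extract_links(html, base_url):
    """Return the sorted, de-duplicated absolute URLs of all <a href> tags."""
    soup = BeautifulSoup(html, "html.parser")
    links = set()
    for anchor in soup.find_all("a", href=True):  # skip anchors without href
        href = anchor["href"].strip()
        # Ignore empty targets, in-page fragments, and non-HTTP schemes.
        if not href or href.startswith(("#", "javascript:", "mailto:")):
            continue
        links.add(urljoin(base_url, href))  # resolve relative links
    return sorted(links)


def fetch_page_links(page):
    """Fetch one listing page (hypothetical helper) and return its links."""
    response = requests.get(
        LISTING_URL,
        params={"swoof": "1", "paged": page},
        headers=HEADERS,
        verify=False,  # kept from the answer; disables TLS certificate checks
        timeout=30,    # assumed value, so a stalled request fails visibly
    )
    # A 403 or 404 here would explain an empty result from the parser.
    response.raise_for_status()
    return extract_links(response.text, response.url)
```

Calling fetch_page_links(x) for x in range(1, 6) and merging the results into a single set would give the combined list of links across the five pages. If raise_for_status() raises, the server rejected the request, and the parser is not at fault.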
