BeautifulSoup is not capturing the href links while scraping this website: no results are returned

My goal is to extract all the links from a specific website in order to compile a comprehensive repository of its associated products.

import requests
from bs4 import BeautifulSoup
import pandas as pd


baseurl = "https://www.examplewebsite.com/"
headers = {'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36'}

for x in range(1,6):
    r = requests.get(f'https://www.examplewebsite.com/items/?swoof=1&paged={x}', verify = False)
    soup = BeautifulSoup(r.text, 'html.parser')
    for link in soup.find_all('a'):
        print(link.get('href'))

Answer №1

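It is not clear what the issue is. With minor adjustments (the headers are actually passed to requests.get, the query string is supplied through params, and the InsecureRequestWarning is suppressed), this code returns links for all the desired products. Note that the code in the question defines headers but never sends them, which may be why the site returned nothing.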

import requests
from bs4 import BeautifulSoup
import pandas as pd
import urllib3

# verify=False skips TLS certificate checks, so silence the resulting warning.
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

baseurl = "https://www.ercotires.com/"
headers = {'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36'}

for x in range(1, 6):
    # Let requests build the query string and send the browser-like User-Agent.
    r = requests.get(
        'https://www.ercotires.com/tienda/',
        params={'swoof': '1', 'paged': x},
        headers=headers,
        verify=False,
    )
    soup = BeautifulSoup(r.text, 'html.parser')
    # Print the href of every anchor tag on the page (None if it has no href).
    for link in soup.find_all('a'):
        print(link.get('href'))

Output:

// Output intentionally omitted for brevity //
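
Since the stated goal is a repository of product links rather than printed output, the href values can be collected instead of printed. The following is only a minimal sketch, assuming that relative links should be resolved against the page URL and that duplicates should be dropped. The helper name extract_links and the /producto/... sample paths are hypothetical and not taken from the site.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_links(html, base_url):
    """Return the absolute, de-duplicated href values found in html."""
    soup = BeautifulSoup(html, "html.parser")
    seen = set()
    links = []
    # href=True skips <a> tags that have no href attribute at all.
    for tag in soup.find_all("a", href=True):
        href = tag["href"].strip()
        # Ignore empty values, in-page anchors and javascript: pseudo-links.
        if not href or href.startswith(("#", "javascript:")):
            continue
        absolute = urljoin(base_url, href)
        if absolute not in seen:
            seen.add(absolute)
            links.append(absolute)
    return links


# Hypothetical sample markup, standing in for r.text from the loop above.
sample_html = """
<a href="/producto/a/">A</a>
<a href="https://www.ercotires.com/producto/b/">B</a>
<a href="/producto/a/">A again</a>
<a href="#top">Top</a>
<a>No href</a>
"""

links = extract_links(sample_html, "https://www.ercotires.com/tienda/")
print(links)
```

Inside the pagination loop, the same helper could be called as extract_links(r.text, r.url) and the results appended to one list, which can then be handed to pandas, for example pd.DataFrame(all_links, columns=["url"]).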
