I encountered a problem extracting title URLs with Python during web scraping

I am having trouble scraping title URLs with the code below. Can someone please help me troubleshoot it?

import requests
from bs4 import BeautifulSoup
# import pandas as pd
import csv


def get_page(url):
    response = requests.get(url)
    if not response.ok:
        print('server responded:', response.status_code)
        return None  # nothing to parse; avoids returning an unbound name
    # 1. html , 2. parser
    return BeautifulSoup(response.text, 'html.parser')


def get_index_data(soup):
    try:
        titles_link = soup.find_all('a', class_="body_link_11")
    except:
        titles_link = []
    # urls = [item.get('href') for item in titles_link]
    print(titles_link)


def main():
    mainurl = "http://cgsc.cdmhost.com/cdm/search/collection/p4013coll8/" \
              "searchterm/1/field/all/mode/all/conn/and/order/nosort/page/1"
    get_index_data(get_page(mainurl))


if __name__ == '__main__':
    main()

Answer №1

To retrieve all the title links, try the code below. It keeps only the anchors that have an item_id attribute and prepends the site host to each relative href:

import requests
from bs4 import BeautifulSoup

BASE_URL = "http://cgsc.cdmhost.com"


def obtain_page_content(url):
    response = requests.get(url)
    if not response.ok:
        print('server responded:', response.status_code)
        return None  # no page to parse
    # 1. html , 2. parser
    return BeautifulSoup(response.text, 'html.parser')


def extract_data(soup):
    if soup is None:  # the request failed, so there are no links
        return []
    titles_link_output = []
    for link in soup.find_all('a', class_="body_link_11"):
        href = link.get('href')
        # keep only anchors that have an item_id attribute and an href
        if link.get('item_id') and href:
            titles_link_output.append("{}{}".format(BASE_URL, href))
    print(titles_link_output)
    return titles_link_output


def start():
    mainurl = "http://cgsc.cdmhost.com/cdm/search/collection/p4013coll8/searchterm/1/field/all/mode/all/conn/and/order/nosort/page/1"
    extract_data(obtain_page_content(mainurl))


start()

Output:

['http://cgsc.cdmhost.com/cdm/singleitem/collection/p4013coll8/id/2653/rec/1', 'http://cgsc.cdmhost.com/cdm/singleitem/collection/p4013coll8/id/2385/rec/2', 'http://cgsc.cdmhost.com/cdm/singleitem/collection/p4013coll8/id/3309/rec/3', 'http://cgsc.cdmhost.com/cdm/singleitem/collection/p4013coll8/id/2425/rec/4', 'http://cgsc.cdmhost.com/cdm/singleitem/collection/p4013coll8/id/150/rec/5', 'http://cgsc.cdmhost.com/cdm/compoundobject/collection/p4013coll8/id/2501/rec/6', 'http://cgsc.cdmhost.com/cdm/compoundobject/collection/p4013coll8/id/2495/rec/7', 'http://cgsc.cdmhost.com/cdm/singleitem/collection/p4013coll8/id/3672/rec/8', 'http://cgsc.cdmhost.com/cdm/singleitem/collection/p4013coll8/id/3407/rec/9', 'http://cgsc.cdmhost.com/cdm/singleitem/collection/p4013coll8/id/4393/rec/10', 'http://cgsc.cdmhost.com/cdm/singleitem/collection/p4013coll8/id/3445/rec/11', 'http://cgsc.cdmhost.com/cdm/singleitem/collection/p4013coll8/id/3668/rec/12', 'http://cgsc.cdmhost.com/cdm/singleitem/collection/p4013coll8/id/3703/rec/13', 'http://cgsc.cdmhost.com/cdm/singleitem/collection/p4013coll8/id/2952/rec/14', 'http://cgsc.cdmhost.com/cdm/singleitem/collection/p4013coll8/id/2898/rec/15', 'http://cgsc.cdmhost.com/cdm/singleitem/collection/p4013coll8/id/3502/rec/16', 'http://cgsc.cdmhost.com/cdm/singleitem/collection/p4013coll8/id/3553/rec/17', 'http://cgsc.cdmhost.com/cdm/singleitem/collection/p4013coll8/id/4052/rec/18', 'http://cgsc.cdmhost.com/cdm/singleitem/collection/p4013coll8/id/3440/rec/19', 'http://cgsc.cdmhost.com/cdm/singleitem/collection/p4013coll8/id/3583/rec/20']
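As a possible variation on the string formatting used above, the standard library's urljoin can build the absolute URLs. The sketch below is only an illustration: the helper name to_absolute_urls and the sample hrefs are made up for this example, and it assumes the hrefs have already been pulled out of the anchors. It skips missing hrefs, leaves already-absolute links untouched, and drops duplicates while keeping their order.

```python
from urllib.parse import urljoin

BASE_URL = "http://cgsc.cdmhost.com"  # host used in the answer above


def to_absolute_urls(hrefs, base=BASE_URL):
    """Join each href onto base, skipping missing values and duplicates.

    Hypothetical helper for illustration; not part of the original answer.
    """
    seen = set()
    result = []
    for href in hrefs:
        if not href:  # the anchor had no href attribute
            continue
        url = urljoin(base, href)  # absolute hrefs pass through unchanged
        if url not in seen:
            seen.add(url)
            result.append(url)
    return result


# Sample input made up for this example: one relative link, one missing
# href, one repeat of the first link, and one link that is already absolute.
sample_hrefs = [
    "/cdm/singleitem/collection/p4013coll8/id/2653/rec/1",
    None,
    "/cdm/singleitem/collection/p4013coll8/id/2653/rec/1",
    "http://cgsc.cdmhost.com/cdm/compoundobject/collection/p4013coll8/id/2501/rec/6",
]
print(to_absolute_urls(sample_hrefs))
```

In the answer's extract_data, the hrefs could be collected first and passed through a helper like this in one step instead of being formatted one at a time.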
