Tips on selectively crawling specific URLs from a CSV document using Python

I am working with a CSV file that contains URLs with various domain extensions such as .com, .eu, .org, and more. I only want to crawl the domains with the .nl extension, and I tried to filter for them with the condition if '.nl' in row: (Python 2.7).

from selenium import webdriver
import csv

fieldnames = ['Website', '@media', 'googleadservices.com/pagead/conversion']

def csv_writerheader(path):
    with open(path, 'w') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames, lineterminator='\n')
        writer.writeheader()

def csv_writer(dictdata, path):
    with open(path, 'a') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames, lineterminator='\n')
        writer.writerow(dictdata)

csv_output_file = 'output!.csv'

driver = webdriver.Chrome(executable_path=r'C:\Users\Jacob\PycharmProjects\Testing\chromedriver_win32\chromedriver.exe')

keywords = ['@media', 'googleadservices.com/pagead/conversion']

csv_writerheader(csv_output_file)

with open('top1m-edited.csv') as example_file:
    example_reader = csv.reader(example_file)
    for row in example_reader:

        # INITIALIZE DICT
        data = {'Website': row}

        if '.nl' in row:  # FILTERING DOMAINS WITH .NL EXTENSION
            try:
                driver.get(row[0])
                html = driver.page_source    

                for searchstring in keywords:
                    if searchstring.lower() in html.lower():
                        print (row, searchstring, 'FOUND!')
                        data[searchstring] = 'FOUND!'
                    else:
                        print (row, searchstring, 'not found')
                        data[searchstring] = 'not found'    

                csv_writer(data, csv_output_file)

            except:
                pass

Printed result:

C:\Python27\python.exe "C:/Users/Jacob/PycharmProjects/Testing/fooling around 2.py"

Process finished with exit code 0

With this condition in place, the script prints nothing and the generated CSV file is almost empty. When I remove the condition if '.nl' in row:, the script works as expected.

What do I need to change so that only URLs with the .nl domain extension are crawled?

Answer №1

for row in example_reader:

Each row is a list. The condition '.nl' in row therefore checks whether some element of the list is exactly equal to '.nl'; it does not look inside the strings. You have a few options. If the file has only one column, holding the URLs, you can change this:

if '.nl' in row:

to this:

if '.nl' in row[0]:

ADDENDUM: the same change from row to row[0] has to be made wherever the URL itself is needed, such as

data = {'Website': row[0]}
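
The difference between testing the list and testing its first element can be checked in isolation, without Selenium. The following is a minimal sketch, not part of the original script: the URLs are made-up placeholders, and is_nl_url is a hypothetical helper added here to show a stricter alternative that inspects only the hostname (a plain substring test would also accept '.nl' appearing elsewhere in the URL). It assumes every URL includes a scheme such as http://.

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse      # Python 2.7

# Rows as csv.reader would yield them: each row is a list of column strings.
# These URLs are placeholders, not taken from the original CSV file.
rows = [
    ['http://example.nl'],
    ['http://example.com'],
    ['http://example.org'],
]

# List membership: true only if some element equals '.nl' exactly.
list_matches = [row for row in rows if '.nl' in row]

# Substring test on the first column: true if the URL text contains '.nl'.
substring_matches = [row[0] for row in rows if '.nl' in row[0]]


def is_nl_url(url):
    """Hypothetical helper: True if the URL's hostname ends with '.nl'."""
    host = urlparse(url).hostname or ''
    return host.endswith('.nl')


print(list_matches)       # []
print(substring_matches)  # ['http://example.nl']
print(is_nl_url('http://example.com/page.nl'))  # False
```

In the crawl loop, either '.nl' in row[0] or a check like is_nl_url(row[0]) could serve as the filter; which one fits depends on how the URLs in the file are written.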
