Tips on selectively crawling specific URLs from a CSV document using Python

Question

I am working with a CSV file that contains URLs with various top-level domains such as .com, .eu, and .org. My goal is to crawl only the domains ending in .nl, using the condition if '.nl' in row: in Python 2.7.

from selenium import webdriver
import csv

fieldnames = ['Website', '@media', 'googleadservices.com/pagead/conversion']

def csv_writerheader(path):
    with open(path, 'w') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames, lineterminator='\n')
        writer.writeheader()

def csv_writer(dictdata, path):
    with open(path, 'a') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames, lineterminator='\n')
        writer.writerow(dictdata)

csv_output_file = 'output!.csv'

driver = webdriver.Chrome(executable_path=r'C:\Users\Jacob\PycharmProjects\Testing\chromedriver_win32\chromedriver.exe')

keywords = ['@media', 'googleadservices.com/pagead/conversion']

csv_writerheader(csv_output_file)

with open('top1m-edited.csv') as example_file:
    example_reader = csv.reader(example_file)
    for row in example_reader:

        # INITIALIZE DICT
        data = {'Website': row}

        if '.nl' in row:  # FILTERING DOMAINS WITH .NL EXTENSION
            try:
                driver.get(row[0])
                html = driver.page_source    

                for searchstring in keywords:
                    if searchstring.lower() in html.lower():
                        print (row, searchstring, 'FOUND!')
                        data[searchstring] = 'FOUND!'
                    else:
                        print (row, searchstring, 'not found')
                        data[searchstring] = 'not found'    

                csv_writer(data, csv_output_file)

            except:
                pass

Printed result:

C:\Python27\python.exe "C:/Users/Jacob/PycharmProjects/Testing/fooling around 2.py"

Process finished with exit code 0

Currently, the script produces no meaningful results, and the CSV file it generates contains next to nothing. However, when I remove the condition if '.nl' in row:, the script works properly.

How should I change the script so that it only crawls and scrapes URLs with the .nl domain extension?

python csv selenium-webdriver web-crawler

Answer 1

for row in example_reader:

Each row that csv.reader yields is a list of column values, not a string. The test '.nl' in row therefore checks whether some element of the list is exactly equal to '.nl', which is never true for a list of full URLs, so the body of the if never runs. You have a few options. If the file contains only one column with the URLs, you can change this:

if '.nl' in row:

to this:

if '.nl' in row[0]:

EDIT: also, wherever else you use row you should use row[0] instead, such as

data = {'Website': row[0]}
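Note that a substring test such as '.nl' in row[0] also matches URLs that merely contain ".nl" somewhere, for example in a path or query string. A stricter check could look at the host name only. The following is a minimal sketch, assuming the URL sits in the first column and may or may not include a scheme; the helper names is_nl_domain and nl_rows are hypothetical and not part of the original script.

```python
import csv

try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2.7


def is_nl_domain(url):
    """Return True if the host name of url ends in .nl (hypothetical helper)."""
    url = url.strip()
    if '://' not in url:
        # Assumption: a value without a scheme is a bare domain such as
        # "www.example.nl"; the "//" prefix makes urlparse treat it as the host.
        url = '//' + url
    host = urlparse(url).hostname or ''
    return host.endswith('.nl')


def nl_rows(lines):
    """Yield the first-column URL of every CSV row that points at a .nl host."""
    for row in csv.reader(lines):
        if row and is_nl_domain(row[0]):
            yield row[0]


# csv.reader accepts any iterable of strings, so a list stands in for the file.
sample = [
    'http://example.nl/page',
    'http://example.com/?ref=site.nl',
    'example.org',
    'www.example.nl',
]
print(list(nl_rows(sample)))  # ['http://example.nl/page', 'www.example.nl']
```

In the original loop, the condition would then read if is_nl_domain(row[0]):, with driver.get(row[0]) left unchanged.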
