Questions tagged [web-crawler]

A Web crawler (also known as a Web spider) is a computer program that browses the World Wide Web in a methodical, automated manner or in an orderly fashion. Other terms for Web crawlers are ants, automatic indexers, bots, Web spiders, Web robots, or, especially in the FOAF community, Web scutters.

How to use XPath in Python to extract and separate text from a href within the same td element

My webpage contains HTML code similar to this: <tr><td style="text-align:center;">7</td><td class="multi_row" style="line-height:15px;">Loaded on 'NYK LEO 303W' at Port of Loading<br> <a href="JavaScript:void(0); ...
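
With lxml-style XPath the two pieces would be addressed as `td/text()` versus `td/a/text()`. As a browserless illustration, here is a stdlib sketch on a cleaned-up, well-formed version of the snippet (the link label `details` is invented; real, messy HTML would need lxml, which also accepts full XPath):

```python
import xml.etree.ElementTree as ET

# A cleaned-up, well-formed stand-in for the markup in the question.
html = (
    "<tr><td>7</td>"
    "<td>Loaded on 'NYK LEO 303W' at Port of Loading "
    "<a href='#'>details</a></td></tr>"
)

row = ET.fromstring(html)
cell = row.findall("td")[1]

# .text holds the text that precedes the child <a>; the link's own label
# lives on the <a> element, so the two can be read separately.
status = (cell.text or "").strip()
link_text = (cell.find("a").text or "").strip()

print(status)     # Loaded on 'NYK LEO 303W' at Port of Loading
print(link_text)  # details
```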

Tips for extracting a website's dynamic HTML content after AJAX modifications using Perl and Selenium

When dealing with websites that use Ajax or JavaScript to display data, I'm facing a challenge in saving this data using WWW::Selenium. Although my code successfully navigates through the webpage and interacts with the elements, I've encounte ...

Tips for selecting a button on Yahoo Finance with Selenium

My goal was to automatically extract "quarterly" data from financial reports on Yahoo Finance, but I couldn't figure out how to do it. I initially tried clicking on "the quarterly button" on the financial page (), however, the code below didn't w ...

Extracting data from websites using Python's Selenium module, focusing on dynamic links generated through Javascript

Currently, I am in the process of developing a webcrawler using Selenium and Python. However, I have encountered an issue that needs to be addressed. The crawler functions by identifying all links with ListlinkerHref = self.browser.find_elements_by_xpath( ...

Vue.js | Web crawlers struggle to follow v-for-generated URLs

I am currently managing a small website that utilizes Laravel and Vue.js to display a list of items. You can check it out here. It seems like the Google crawler is having trouble following the links generated by the v-for function. In my Google Search Con ...

Setting up the npm crawler package

I am attempting to set up NPM crawler on my Windows system using the command npm install crawler The installation is failing, and I am getting the following debug information 5874 error Error: ENOENT, lstat 'E:\Project\test\node_modu ...

Scraping from the web: How to selectively crawl and eliminate duplicate items

What is the most effective method for ensuring that Scrapy does not store duplicate items in a database when running periodically to retrieve new content? Would assigning items a hash help prevent this issue? Your advice on avoiding duplicates would be g ...
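
Hashing each item is indeed the usual answer: compute a stable fingerprint and check it before inserting. A minimal stdlib sketch, independent of Scrapy (the function names `item_fingerprint` and `is_new` are illustrative, not part of any library; in a real pipeline the `seen` set would be a database table or persisted file):

```python
import hashlib

def item_fingerprint(item: dict) -> str:
    """SHA-256 over sorted key=value pairs, so key order doesn't matter.
    Assumes a flat dict of stringifiable values."""
    canonical = "|".join(f"{k}={item[k]}" for k in sorted(item))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

seen = set()  # in production: a DB table or a persistent set across runs

def is_new(item: dict) -> bool:
    fp = item_fingerprint(item)
    if fp in seen:
        return False
    seen.add(fp)
    return True

print(is_new({"title": "A", "url": "x"}))  # True  - first sighting
print(is_new({"url": "x", "title": "A"}))  # False - same item, different key order
```

Scrapy also ships its own `DUPEFILTER_CLASS` hook for request-level deduplication, but item-level dedup across periodic runs needs a persisted fingerprint store like the one sketched above.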

Troubleshooting problems with encoding in Python Selenium's get_attribute method

Currently, I am utilizing Selenium with Python to crawl the drop-down menu of this particular page. By employing the find_elements_by_css_selector function, I have successfully obtained all the data from the second drop-down menu. However, when attempting ...

Printing the IMDb rating of a movie or series to the terminal after the automation finishes

I prefer using Google search to find the element because I find it easier to navigate compared to IMDB. import selenium.webdriver as webdriver print("This script is designed to retrieve the IMDb rating of a movie or TV series!!!") def get_results(search_ ...

Automatically include fields in scrapy when they are detected

Currently facing a challenge while attempting to write a spider for web scraping. The issue arises when trying to extract the href attribute from all the <li> tags within the <ul> tag and store them in incrementally named variables like Field ...

Encountering problems with the Python and Selenium code I used to create my Twitter scraper

I have developed a Python script that extracts information like name, tweets, followers, and following from the profiles available in the "view all" section of my Twitter profile page. The script is currently functioning as intended. However, I have encoun ...

Scrapy spider encountering issues during the crawling process

Having trouble crawling coupons on the Cuponation website. Whenever I try to run the crawler, it shows an error. Can someone please assist me? Thanks. import scrapy from scrapy.http import Request from scrapy.selector import HtmlXPathSelector from scrap ...

Which programming languages are recommended for building a web crawler?

My background includes extensive experience with PHP, but I have come to understand that PHP may not be the most suitable language for building a large-scale web crawler due to its limitations on running processes indefinitely. Can anyone recommend alter ...

Extracting information from an API

Hey there, good morning! I'm currently working on gathering car data from this website: My process involves sending a request through the search bar on the homepage for a specific location and date. This generates a page like this: From there, I use the ...

Guide to cloning a webdriver in Selenium

Currently, I am engaged in Web Scraping using selenium webdriver. One challenge I face is the need to navigate to numerous subpages from the main page in order to gather data. Rather than constantly returning to the main page, I am exploring the idea of ...

Cheerio - Ensure accurate text retrieval for selectors that produce multiple results

Visit this link for more information https://i.stack.imgur.com/FfYeg.png I am trying to extract specific market data from the given webpage. Specifically, I need to retrieve "Sábado, 14 de Abril de 2018" and "16:00". Here is how I did it using Kotlin an ...

Tips on selectively crawling specific URLs from a CSV document using Python

I am working with a CSV file that contains various URLs with different domain extensions such as .com, .eu, .org, and more. My goal is to only crawl domains with the .nl extension by using the condition if '.nl' in row: in Python 2.7. from selen ...
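
The pitfall with `if '.nl' in row:` is that it is a substring test over the whole row, so a `.com` URL containing `.nl` anywhere in its path also matches. Checking the parsed hostname avoids that. A sketch with invented CSV content (a real run would `open()` the file instead of the `io.StringIO` stand-in):

```python
import csv
import io
from urllib.parse import urlparse

# Hypothetical CSV content standing in for the real file.
csv_text = (
    "https://example.nl/page\n"
    "https://example.com/about.nl.html\n"
    "https://shop.nl/item\n"
)

nl_urls = []
for row in csv.reader(io.StringIO(csv_text)):
    host = urlparse(row[0]).netloc
    if host.endswith(".nl"):  # test the domain, not the whole string
        nl_urls.append(row[0])

print(nl_urls)  # ['https://example.nl/page', 'https://shop.nl/item']
```

Note that the `.com` URL with `.nl` in its path is correctly skipped, which the plain substring test would not do.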

Obtain pictures from Google image search using Python

As a newcomer to web scraping, I initially turned to https://www.youtube.com/watch?v=ZAUNEEtzsrg for guidance on downloading images with specific tags (e.g. cat), and it proved successful! However, I have encountered a new issue where I can only download a ...

Is there a way to display a list of dropdown menu options using Selenium with Python?

I am currently attempting to utilize Selenium Python and its PhantomJS function in order to print out all the items from the first drop-down menu on this particular page (). Unfortunately, I keep encountering a No Attribute error message. If anyone could ...

Tips for retrieving page source with Selenium Remote Control

Looking to develop a basic Java web crawler. WebDriver driver = new HtmlUnitDriver(); driver.get("https://examplewebsite.com"); String pageSource=driver.getPageSource(); System.out.println(pageSource); The result is as follows: <!DOCTYPE html PUBLIC ...

Efficiently removing and re-adding elements in a list in Python based on a specific condition

I have a list of URLs to crawl. ['','', ''] Here is the code snippet: prev_domain = '' while urls: url = urls.pop() if base_url(url) == prev_domain: urls.append(url) continue else: cr ...
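
One hazard in the pop-and-re-append pattern is an infinite loop once only one domain is left: the same URL keeps getting popped and pushed back. A sketch that defers a same-domain URL only while a different domain is still waiting (the helper names and sample URLs are invented):

```python
from collections import deque
from urllib.parse import urlparse

def domain(url: str) -> str:
    return urlparse(url).netloc

def crawl_order(urls):
    """Defer a URL whose domain matches the one just processed, but only
    while a different domain is still queued - otherwise the loop would
    rotate the same entry forever."""
    queue = deque(urls)
    prev = None
    order = []
    while queue:
        url = queue.popleft()
        if domain(url) == prev and any(domain(u) != prev for u in queue):
            queue.append(url)  # put it back; crawl another domain first
            continue
        order.append(url)
        prev = domain(url)
    return order

print(crawl_order(["http://a.com/1", "http://a.com/2", "http://b.com/1"]))
# ['http://a.com/1', 'http://b.com/1', 'http://a.com/2']
```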

Python and Scrapy encounter issues with locating website links

I've been searching for the URLs of all events listed on this page: https://www.eventshigh.com/delhi/food?src=exp However, I can only locate the URL in JSON format: { "@context":"http://schema.org", "@type":"Event", "name":"DANDIYA NIGHT 20 ...
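
When listings ship as schema.org JSON-LD rather than plain anchors, the event URLs can be pulled straight out of the `<script type="application/ld+json">` blocks with a JSON parser. A stdlib sketch on a made-up page fragment (the event name and URL below are invented):

```python
import json
from html.parser import HTMLParser

class JsonLdCollector(HTMLParser):
    """Pull structured data out of <script type="application/ld+json"> blocks."""
    def __init__(self):
        super().__init__()
        self.in_ld = False
        self.blocks = []
    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self.in_ld = True
    def handle_endtag(self, tag):
        if tag == "script":
            self.in_ld = False
    def handle_data(self, data):
        if self.in_ld and data.strip():
            self.blocks.append(json.loads(data))

# Invented fragment in the shape the question describes.
page = ('<script type="application/ld+json">'
        '{"@context": "http://schema.org", "@type": "Event",'
        ' "name": "DANDIYA NIGHT", "url": "https://example.com/e/dandiya"}'
        '</script>')

collector = JsonLdCollector()
collector.feed(page)
print([b["url"] for b in collector.blocks if b.get("@type") == "Event"])
# ['https://example.com/e/dandiya']
```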

How to transform domain.com/#!/url into domain.com/url

Can someone help me figure out how to change the link structure from domain.com/#!/home to domain.com/home? I tried using htaccess but had no success. After researching online, I learned it is referred to as "ajax crawling" but I couldn't find a solut ...

Retrieve data from an Angular application embedded within an MVC C# application using a web crawler

UPDATE: I successfully resolved the issue by sending requests through cURL using AJAX URLs that Angular uses to communicate with .NET. To find these URLs, I used the Inspect tool in my browser. Here's what I did: curl_setopt($ch, CURLOPT_URL, "http://ww ...

Verification forms and age detection algorithms

Recently, I designed a website dedicated to a particular beer brand which required an age verification page. The PHP script that handles the verification utilizes sessions to store the verification status. This script redirects all visitors to the verifica ...

Scrapy spider malfunctioning when trying to crawl the homepage

I'm currently using a Scrapy crawler I wrote to collect data from from scrapy.contrib.spiders import CrawlSpider, Rule from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor from scrapy.selector import Selector from .. import items clas ...

Selenium - How to effectively obtain hyperlinks?

There are numerous web elements similar to this example <a data-control-name="browsemap_profile" href="/in/quyen-nguyen-63098b123/" id="ember278" class="pv-browsemap-section__member ember-view"> <img widt ...

Making Angular applications crawlable by search engines on ASP.NET

I am currently using AngularJS for my website's front end, and ASP.NET for the back end. I am in need of a headless browser that can easily render content on the server side for web crawlers. I have been looking into Awesomium.NET and WebKit.NET, but t ...

Is it possible to determine if a web page is designed for a personal computer or a mobile device?

Is there a way to identify whether a given URL is intended for PC or mobile devices? I have a collection of URLs, and I need to determine if each one corresponds to a PC or mobile version. Is there any specific HTML element or marker within the page sour ...
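
One commonly used heuristic marker is the `<meta name="viewport">` tag, which mobile-oriented pages almost always declare and legacy desktop-only layouts usually lack. It is only a heuristic (many sites serve different HTML per User-Agent instead), but it can be checked with the stdlib parser:

```python
from html.parser import HTMLParser

class ViewportSniffer(HTMLParser):
    """Looks for <meta name="viewport" ...> in the page source."""
    def __init__(self):
        super().__init__()
        self.has_viewport = False
    def handle_starttag(self, tag, attrs):
        if tag == "meta" and dict(attrs).get("name", "").lower() == "viewport":
            self.has_viewport = True

def looks_mobile_friendly(html: str) -> bool:
    sniffer = ViewportSniffer()
    sniffer.feed(html)
    return sniffer.has_viewport

print(looks_mobile_friendly('<meta name="viewport" content="width=device-width">'))         # True
print(looks_mobile_friendly('<table width="960"><tr><td>desktop layout</td></tr></table>'))  # False
```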

Exploring the use of Selenium WebDriver to extract data based on class name

<li class='example'> content1 </li> <li class='example'> content2 </li> <li class='example'> content3 </li> Within this HTML snippet, I am aiming to gather all the text within elements that have the 'example' class. dr ...
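
In Selenium itself this is `driver.find_elements(By.CLASS_NAME, "example")` followed by reading each element's `.text`. Since that needs a live browser, here is a browserless stand-in using the stdlib parser on the exact snippet above (flat markup assumed; nested matches would need a depth counter):

```python
from html.parser import HTMLParser

class ClassTextCollector(HTMLParser):
    """Gather the text of every element carrying the given class attribute."""
    def __init__(self, cls):
        super().__init__()
        self.cls = cls
        self.inside = False
        self.texts = []
    def handle_starttag(self, tag, attrs):
        if self.cls in dict(attrs).get("class", "").split():
            self.inside = True
            self.texts.append("")
    def handle_endtag(self, tag):
        self.inside = False
    def handle_data(self, data):
        if self.inside:
            self.texts[-1] += data

p = ClassTextCollector("example")
p.feed("<li class='example'> content1 </li>"
       "<li class='example'> content2 </li>"
       "<li class='example'> content3 </li>")
print([t.strip() for t in p.texts])  # ['content1', 'content2', 'content3']
```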

What could be the reason for receiving empty content when using requests.get with the correct header?

While attempting to crawl a website, I copied the Request Headers information from Chrome directly. However, after using requests.get, the returned content is empty. Surprisingly, the header printed from requests is correct. Any idea why this might be happ ...

How can one remove surrounding paragraph texts using BeautifulSoup in Python?

I'm a beginner in Python, working on a web crawler to extract text from articles online. I'm facing an issue with the get_text(url) function, as it's returning extra unnecessary text along with the main article content. I find this extra text to be irrele ...

Encountered an error while web crawling in JavaScript: Error - Connection timeout

I encountered an error while following a tutorial on web crawling using JavaScript. When I execute the script, I receive the following errors: Visiting page https://arstechnica.com/ testcrawl ...

Python crawling loop for web scraping

I'm interested in retrieving all the reviews from the Google Play Store using Python, but I need to click on the "view more" buttons. I believe a loop might be necessary for this task. import time from selenium import webdriver from selenium.webdriver.c ...

Is it possible to log in to a social media website such as Facebook or Twitter and gather data using C#?

I'm embarking on a journey to create a console application using C# in Visual Studio, but I find myself lost at the starting line. My first goal is to implement a login feature using either PhantomJS or Selenium, then navigate to a specified website URL ...

Is there a method to instruct crawlers to overlook specific sections of a document?

I understand that there are various methods to control the access of crawlers/spiders to documents such as robots.txt, meta tags, link attributes, etc. However, in my particular case, I am looking to exclude only a specific portion of a document. This por ...
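
There is no cross-crawler standard for hiding only part of a page. The closest real mechanisms are Google's `data-nosnippet` attribute, which keeps a block out of search result snippets (it does not prevent indexing), and the `googleon`/`googleoff` comments that were honoured only by the now-retired Google Search Appliance. A sketch of both:

```html
<!-- data-nosnippet keeps this block out of Google's snippets: -->
<div data-nosnippet>
  Related-articles box that should not surface in search snippets.
</div>

<!-- The retired Google Search Appliance honoured these comments: -->
<!--googleoff: index-->
<p>Section the appliance would skip while indexing.</p>
<!--googleon: index-->
```

For other crawlers, the usual workaround is to load the excluded portion via JavaScript from a URL disallowed in robots.txt.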

Selenium often gets stuck when an element is not found

Struggling to gather data from the IMDB website and save it to a CSV file using my code. Whenever I encounter an element that is missing, the process gets stuck. Check out my script below: from selenium import webdriver from selenium.common.exceptions im ...

Scrape embedded content within an IFrame on a webpage with Java

Currently, I am interested in crawling the dynamic content within an IFramed webpage; unfortunately, none of the crawlers I have tested so far (Aperture, Crawl4j) seem to support this feature. The result I typically get is: <iframe id="template_cont ...

Repeatedly clicking on a pagination 'load more' button in Selenium (Python) to load dynamically generated JavaScript data

I have been attempting to crawl the content of a website by dynamically clicking a 'load-more' button. Despite researching similar questions on Stack Overflow, I have not found a solution that addresses my specific issue when parsing the website ...

Evade Incapsula's JS test

Seeking assistance with updating my status using cURL on a website secured by Incapsula. I am encountering difficulty accessing the main page due to their JS test security measures, despite cloning headers, useragent, and IP. Can anyone suggest a solution ...

Challenges encountered in retrieving 'text' with Selenium in Python

After attempting to extract the list data from every drop-down menu on this page, I managed to access the 'li' tag section and retrieve the 'href' data using Selenium Python 3.6. However, I encountered an issue when trying to obtain the ...

Guide on automatically logging in to a specific website using Selenium with Kakao, Google, or Naver credentials

I encountered an issue with my selenium login code for a particular website, where the keys seem to be generating errors. https://i.stack.imgur.com/wJ3Nw.png Upon clicking each button, it brings up the login tab. However, I keep encountering this error ...

A guide to identifying search engine crawlers with Express

I have been on the lookout for npm packages to assist me in detecting crawlers, but all I find are outdated and unmaintained ones that rely on obsolete user-agent databases. Is there a dependable and current package available that can aid me in this task? ...
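
Whatever package is chosen, the core of crawler detection is a User-Agent match against known signatures. A sketch of that core (the pattern list is illustrative and deliberately incomplete; production use needs a maintained database, and in Express the same check would sit in a middleware reading `req.get('User-Agent')`):

```python
import re

# Illustrative, non-exhaustive signature list.
BOT_PATTERN = re.compile(
    r"googlebot|bingbot|duckduckbot|baiduspider|yandex|slurp|crawler|spider|bot\b",
    re.IGNORECASE,
)

def is_crawler(user_agent: str) -> bool:
    """True when the User-Agent string matches a known crawler signature."""
    return bool(BOT_PATTERN.search(user_agent or ""))

print(is_crawler("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"))  # True
print(is_crawler("Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0"))                    # False
```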

Policy regarding copyright content removal and protection of privacy

In the majority of musical websites I have come across, they often mention the following in their policies: At this particular website (www.Indiamp3.com), you will find links that redirect to audio files. These files are hosted elsewhere on the internet a ...

Dealing with StaleElementReferenceException while using Selenium

Trying to create a web crawler using Selenium, I encountered a StaleElementReferenceException in my program. Initially, I suspected that the exception was caused by crawling pages recursively and navigating to the next page without first returning to the p ...
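
The usual cure is to re-locate the element and retry the action whenever the DOM has been re-rendered underneath a cached reference. A browserless sketch of that retry pattern (the exception class below is a stand-in for `selenium.common.exceptions.StaleElementReferenceException`, and `flaky` simulates an action that succeeds on the third attempt):

```python
class StaleElementReferenceException(Exception):
    """Stand-in for selenium.common.exceptions.StaleElementReferenceException."""

def retry_on_stale(action, attempts=3):
    """Run `action` (which should re-locate its element each call, not reuse
    a cached reference) and retry when the DOM was re-rendered mid-action."""
    for i in range(attempts):
        try:
            return action()
        except StaleElementReferenceException:
            if i == attempts - 1:
                raise

calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise StaleElementReferenceException()
    return "clicked"

print(retry_on_stale(flaky))  # clicked
```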

The specified selector is invalid or illegal in HTMLUnit

Attempting to mimic a login using htmlunit has presented me with an issue despite following examples. The console messages I have gathered are as follows: runtimeError: message=[An invalid or illegal selector was specified (selector: '*,:x' erro ...

Using Python to extract all the links on a dynamic webpage

I am struggling to develop a universal crawler that can analyze a webpage and compile a list of all links within it, with the goal of examining an entire domain and all its internal links. I have attempted using HtmlUnit in Java and Selenium in Python, bu ...

PHP will not refresh the page until it has fully loaded

I am interested in writing a crawler script using PHP, but I have encountered an issue with displaying pages in real-time. It seems that PHP does not update the page immediately, sometimes outputting several echos at once and waiting for the page to finish ...

Exploring solutions for table counter in Web Crawler using Selenium and Webdriver

I have a question regarding web crawling and specifically targeting the hkgolden website. I utilized Selenium and WebDriver (Chromedriver) to create this web crawler. My current goal is to determine the number of tables located at: https://forumd.hkgold ...

What is the best way to automate the crawling process without manually entering the number of entries to parse?

I have developed a Python script using Selenium for scraping restaurant names from a webpage. The script is working well when I manually input the number of entries to parse. The webpage uses lazy-loading to display 40 names with each scroll, but my scri ...

Trouble arises when implementing AJAX in conjunction with PHP!

I am facing an issue with my PHP page which collects mp3 links from downloads.nl. The results are converted to XML and display correctly. However, the problem arises when trying to access this XML data using ajax. Both the files are on the same domain, b ...

Excluding certain URLs from the PHP crawler's navigation

I'm currently working on a GenerateSitemap.php file for configuring the crawler, but I'm struggling to figure out how to make it skip specific URLs like https://example.com/noindex-url. I've tried reading up on it, but I can't seem to g ...
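
Independent of the specific sitemap library, the skip logic boils down to a predicate over each discovered URL. A sketch of such a filter (the exclusion lists and sample URLs are invented; the crawler's "should I visit this?" hook would call `should_crawl`):

```python
from urllib.parse import urlparse

EXCLUDED_PATHS = {"/noindex-url"}          # hypothetical exact-match exclusions
EXCLUDED_PREFIXES = ("/admin/", "/tmp/")   # hypothetical prefix exclusions

def should_crawl(url: str) -> bool:
    """False for any URL whose path is excluded exactly or by prefix."""
    path = urlparse(url).path
    if path in EXCLUDED_PATHS:
        return False
    return not path.startswith(EXCLUDED_PREFIXES)

urls = [
    "https://example.com/",
    "https://example.com/noindex-url",
    "https://example.com/admin/login",
]
print([u for u in urls if should_crawl(u)])  # ['https://example.com/']
```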