Maintain the use of Selenium Webdriver throughout Celery tasks

Recently, I created a Flask web application that uses Celery for task handling. One of the tasks scrapes approximately 200 pages using a custom class derived from the Selenium Chrome driver.

import os

@celery_app.task
def scrape_async():
    driver = MyDriver(exec_path=os.environ.get('CHROMEDRIVER_PATH'), options=some_chrome_options)

    # Update 'urls_to_scrape' attribute by finding the urls to scrape from a main url
    driver.find_urls_to_scrape_from_main_url()

    # Loop over each page and store the data in database
    for url in driver.urls_to_scrape:
        driver.scrape_page(url)

    # Exit driver
    driver.quit()

This worked flawlessly both locally and in production until the number of pages to scrape increased. At that point, I hit a memory usage error on Heroku and realized how much memory this task consumes.

After conducting some research, I discovered that implementing a subtask within my loop (i.e., executing this subtask for each individual page) could be a more efficient approach, as illustrated below.

import os

@celery_app.task
def subtask(url):
    # 'driver' is not defined in this task; sharing it is the open question
    driver.scrape_page(url)

@celery_app.task
def scrape_async():
    driver = MyDriver(exec_path=os.environ.get('CHROMEDRIVER_PATH'), options=some_chrome_options)

    # Update 'urls_to_scrape' attribute by finding the urls to scrape from a main url
    driver.find_urls_to_scrape_from_main_url()

    # Loop over each page and dispatch one subtask per url
    for url in driver.urls_to_scrape:
        subtask.delay(url)

    # Exit driver
    driver.quit()

I am now unsure how to share the driver object between the main task and the subtask. Although I came across this resource on instantiating Tasks, I could not figure out how to create one driver instance that spans multiple tasks.

Do you have any suggestions on how I can address this issue and achieve my objective?

Thank you

Answer №1

If you're embarking on distributed crawling, consider setting up a Selenium Grid. The grid nodes load and render the HTML and execute the JavaScript on other servers, which reduces resource usage on the machines running your Celery workers.

With this approach, you no longer need to worry about which worker picks up a task, since no task depends on a driver held by a particular worker. For crawling nested links and multiple pages, multiple workers can operate independently, passing the required metadata through the result backend together with Celery's group, chain, and chord primitives.
