Create a flexible framework for extracting data from online sources

My server receives requests, and I need a way to trigger scraping on a separate, dynamically created resource. I do not want to scrape on my primary instance, to avoid overloading it, and I do not want the secondary instance to keep running (and costing money) when it is not actively scraping. I am therefore looking for a solution that lets me request a scraping run and automatically terminates the instance once the task is complete.

I looked at Google Cloud Functions, but they have a maximum runtime of 9 minutes per function, which is not enough for my scraping jobs. I also looked at the AWS SDK, which can create and terminate VMs, but I have trouble getting my API scripts onto the new AWS instance. Since I have multiple scripts that scrape different websites, I need a solution that is scalable and flexible enough to accommodate all of them.

I am flexible in terms of technology and open to any suggestions or assistance that could provide a suitable answer to this issue. Your help and guidance are highly appreciated. Thank you.

Answer №1

You wrote: "I'm struggling with getting my API script onto the freshly created AWS instance."

You can accomplish this by utilizing UserData:

When you start an instance in Amazon EC2, you have the option to provide user data that can be used for automated configuration tasks and even running scripts once the instance is up and running.

In essence, you would write your UserData so that it installs your scripts and all necessary dependencies and then executes them. This runs automatically whenever a new instance is launched.
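As a rough illustration of the UserData approach, here is a minimal Python sketch using boto3. It assumes the scraper can be downloaded from a URL, that the AMI ships with `curl` and `python3`, and that AWS credentials are already configured; the URL, the AMI ID, the `/opt/scraper.py` path, and the helper names are all hypothetical. Setting `InstanceInitiatedShutdownBehavior` to `terminate` means the `shutdown` at the end of the script terminates the instance, so it stops costing money once the scrape is done.

```python
def build_user_data(script_url, run_command="python3"):
    """Build a shell script that EC2 runs as root on first boot.

    script_url is a hypothetical location of the scraper script.
    """
    lines = [
        "#!/bin/bash",
        # Download the scraper onto the fresh instance.
        f'curl -fsSL "{script_url}" -o /opt/scraper.py',
        # Run it; there is no `set -e`, so the shutdown below
        # still runs even if the scraper exits with an error.
        f"{run_command} /opt/scraper.py",
        # Power off; with shutdown behavior "terminate" this
        # terminates the instance.
        "shutdown -h now",
    ]
    return "\n".join(lines) + "\n"


def build_launch_params(ami_id, user_data, instance_type="t3.micro"):
    """Keyword arguments for the EC2 run_instances call."""
    return {
        "ImageId": ami_id,
        "InstanceType": instance_type,
        "MinCount": 1,
        "MaxCount": 1,
        "UserData": user_data,
        "InstanceInitiatedShutdownBehavior": "terminate",
    }


def launch_scraper(ami_id, script_url):
    """Launch one instance that scrapes and then terminates itself."""
    import boto3  # imported lazily so the helpers above work without it

    ec2 = boto3.client("ec2")
    params = build_launch_params(ami_id, build_user_data(script_url))
    response = ec2.run_instances(**params)
    return response["Instances"][0]["InstanceId"]
```

Your server would call something like `launch_scraper("ami-...", "https://example.com/scraper.py")` when a request arrives. Because each launch gets its own UserData, the same helper can start a different scraping script per website.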

If scalability is a requirement, you can launch your instances within an Auto Scaling Group and adjust the scaling as needed.

Another option is to package your scripts as Docker containers and run them on a container service such as AWS Fargate.
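For the container route, a minimal Python sketch with boto3 could start a one-off Fargate task per scraping job. It assumes you have already registered an ECS task definition for your scraper image; the cluster, task definition, container name, command, and subnet IDs are hypothetical placeholders. A Fargate task stops when its container exits, so nothing keeps running after the scrape finishes.

```python
def build_run_task_params(cluster, task_definition, container_name,
                          command, subnets):
    """Keyword arguments for the ECS run_task call (Fargate launch type)."""
    return {
        "cluster": cluster,
        "taskDefinition": task_definition,
        "launchType": "FARGATE",
        "count": 1,
        "networkConfiguration": {
            "awsvpcConfiguration": {
                "subnets": list(subnets),
                # Assumption: the task needs a public IP to reach the
                # websites it scrapes.
                "assignPublicIp": "ENABLED",
            }
        },
        # Override the container command so one image can run
        # different scraping scripts.
        "overrides": {
            "containerOverrides": [
                {"name": container_name, "command": list(command)}
            ]
        },
    }


def start_scrape_task(**kwargs):
    """Start one Fargate task and return its ARN."""
    import boto3  # imported lazily so the helper above works without it

    ecs = boto3.client("ecs")
    response = ecs.run_task(**build_run_task_params(**kwargs))
    if not response["tasks"]:
        # run_task reports placement problems in "failures".
        raise RuntimeError(f"Task did not start: {response['failures']}")
    return response["tasks"][0]["taskArn"]
```

Overriding the command per task is one way to handle several site-specific scripts with a single image; separate task definitions per scraper would work as well.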

Just keep in mind that AWS Lambda also has a runtime cap (15 minutes), much like Google Cloud Functions, so it is unlikely to suit long scraping jobs either.
