What is the process for extracting the download button URL and parsing a CSV file in Python?

Question

In my Python Google Colab project, I am attempting to access a CSV file from the following link:

After scrolling down slightly on the page, there is a download button visible. My goal is to extract the link using Selenium or BeautifulSoup in order to read the CSV file. The code snippet I am working with looks like this:

# Installing necessary packages
!pip install selenium
!apt-get update # Update Ubuntu for proper apt installation
!apt install chromium-chromedriver
!cp /usr/lib/chromium-browser/chromedriver /usr/bin

# Import required libraries
import pandas as pd
from selenium import webdriver
import sys

# Using Selenium to fetch and read the CSV file
sys.path.insert(0,'/usr/lib/chromium-browser/chromedriver')
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
driver = webdriver.Chrome('chromedriver',chrome_options=chrome_options)
driver.get('https://www.macrotrends.net/stocks/charts/AAPL/apple/stock-price-history')  # URL of the page to load
btn = driver.find_element_by_tag_name('button')
btn.click()
df = pd.read_csv('##.csv')

Everything seems to be functioning properly up to the btn.click() step, but I encounter an error afterwards because I'm not able to locate the download button's link or the file name. Can anyone provide guidance on how to do this successfully? Any help would be greatly appreciated.

python selenium csv selenium-webdriver web-scraping

Answer 1

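You don't need Selenium for this. The data is already embedded in a <script> tag on the page, assigned to a JavaScript variable named dataDaily, so you can request the page and parse it directly.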

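import json

import pandas as pd
import requests
from bs4 import BeautifulSoup

ticker = 'AAPL'
# The price history is embedded in the page served by this endpoint
url = 'https://www.macrotrends.net/assets/php/stock_price_history.php?t={}'.format(ticker)

response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# Find the <script> tag that defines `var dataDaily = [...]` and parse the array as JSON
jsonData = None
scripts = soup.find_all('script', {'type': 'text/javascript'})
for script in scripts:
    if 'var dataDaily' in str(script):
        # Keep only the text between the first '[' and the closing '];'
        jsonStr = '[' + str(script).split('[', 1)[-1].split('];')[0] + ']'
        jsonData = json.loads(jsonStr)
        break

# Fail with a clear message instead of a NameError if the variable is missing
if jsonData is None:
    raise ValueError('var dataDaily was not found in the response')

# Rename the short keys to readable column names and save as CSV
df = pd.DataFrame(jsonData)
df = df.rename(columns={'o':'open','h':'high','l':'low','c':'close','d':'date','v':'volume'})
df.to_csv('MacroTrends_Data_Download_{}.csv'.format(ticker), index=False)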
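
If the string splitting feels fragile, a regular expression is one alternative for pulling the array out of the script text. The sketch below is only an illustration: SAMPLE_HTML is a made-up stand-in for the real response (its values and exact layout are assumptions), and extract_data_daily is a hypothetical helper name, not something the site or any library provides.

```python
import json
import re

# Made-up stand-in for response.text; the real page layout may differ
SAMPLE_HTML = """
<script type="text/javascript">
var dataDaily = [{"d":"1980-12-12","o":0.1012,"h":0.1016,"l":0.1012,"c":0.1012,"v":469.034},
                 {"d":"1980-12-15","o":0.0964,"h":0.0964,"l":0.0959,"c":0.0959,"v":175.885}];
</script>
"""

def extract_data_daily(html):
    """Return the list assigned to `var dataDaily` in the given HTML text."""
    # Non-greedy match from the opening '[' to the first '];' after it
    match = re.search(r'var\s+dataDaily\s*=\s*(\[.*?\]);', html, re.DOTALL)
    if match is None:
        raise ValueError('var dataDaily was not found')
    return json.loads(match.group(1))

records = extract_data_daily(SAMPLE_HTML)
print(len(records), records[0]['d'])
```

With the real page you would pass response.text instead of SAMPLE_HTML and hand the resulting list to pd.DataFrame as above.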

Result:

print(df)
             date      open      high  ...   volume     ma50    ma200
0      1980-12-12    0.1012    0.1016  ...  469.034      NaN      NaN
1      1980-12-15    0.0964    0.0964  ...  175.885      NaN      NaN
2      1980-12-16    0.0893    0.0893  ...  105.728      NaN      NaN
3      1980-12-17    0.0910    0.0915  ...   86.442      NaN      NaN
4      1980-12-18    0.0937    0.0941  ...   73.450      NaN      NaN
          ...       ...       ...  ...      ...      ...      ...
10135  2021-02-25  124.6800  126.4585  ...  148.200  131.845  112.241
10136  2021-02-26  122.5900  124.8500  ...  164.560  131.838  112.460
10137  2021-03-01  123.7500  127.9300  ...  116.308  131.840  112.716
10138  2021-03-02  128.4100  128.7200  ...  102.261  131.790  112.957
10139  2021-03-03  124.8100  125.7100  ...  111.514  131.661  113.184

[10140 rows x 8 columns]
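
Since the original goal was to end up with pd.read_csv, here is a small sketch of reading the saved data back with the date column parsed. It uses an in-memory buffer and two made-up rows (both are assumptions for illustration) so it runs without touching the site; with the real file you would pass the CSV file name written above instead of the buffer.

```python
import io

import pandas as pd

# Made-up rows in the same shape as the renamed columns above
sample = pd.DataFrame({
    'date': ['1980-12-12', '1980-12-15'],
    'open': [0.1012, 0.0964],
    'close': [0.1012, 0.0959],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
buffer.seek(0)

# Parse the date column as datetimes and use it as the index
loaded = pd.read_csv(buffer, parse_dates=['date'], index_col='date')
print(loaded)
```

Using the dates as the index makes it easier to slice by date range or resample later.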

