Expanding Excel Spreadsheets Following Data Extraction from the Internet

Question

I managed to extract data successfully from the given website and created an Excel file with the information for one product. However, when I scraped data for a second product, I could not add another sheet to the existing Excel file. Any guidance would be greatly appreciated. Thank you in advance. Here is my code:

import re
import time

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import Select

chrome_path = r"C:\Users\user\Desktop\chromedriver_win32\chromedriver.exe"
driver = webdriver.Chrome(chrome_path)

driver.get("https://fcainfoweb.nic.in/Reports/Report_Menu_Web.aspx")

html_source = driver.page_source
results=[]

driver.find_element_by_xpath("""//*[@id="ctl00_MainContent_Rbl_Rpt_type_1"]""").click()
element_variation = driver.find_element_by_id ("ctl00_MainContent_Ddl_Rpt_Option1")
drp_variation = Select(element_variation)
drp_variation.select_by_visible_text("Daily Variation")

driver.find_element_by_id("ctl00_MainContent_Txt_FrmDate").send_keys("01/05/2020")
driver.find_element_by_id("ctl00_MainContent_Txt_ToDate").send_keys("27/05/2020")

element_commodity = driver.find_element_by_id ("ctl00_MainContent_Lst_Commodity")
drp_commodity = Select(element_commodity)
drp_commodity.select_by_visible_text("Rice")

driver.find_element_by_xpath("""//*[@id="ctl00_MainContent_btn_getdata1"]""").click()

soup = BeautifulSoup(driver.page_source, 'html.parser')
table = pd.read_html(driver.page_source)[2]  # index 2 (the third table on the page) is the one that we want
print(len(table))
print(table)

results.append(table)
driver.back()
time.sleep(1)
with pd.ExcelWriter(r'C:\Users\user\Desktop\python.xlsx') as writer:
 table.to_excel(writer, sheet_name = "rice", index=False) # Rice results on sheet named rice
 writer.save() 

driver.find_element_by_xpath("""//*[@id="btn_back"]""").click()
driver.find_element_by_xpath("""//*[@id="ctl00_MainContent_Rbl_Rpt_type_1"]""").click()
element_variation = driver.find_element_by_id ("ctl00_MainContent_Ddl_Rpt_Option1")
drp_variation = Select(element_variation)
drp_variation.select_by_visible_text("Daily Variation")

driver.find_element_by_id("ctl00_MainContent_Txt_FrmDate").send_keys("01/05/2020")
driver.find_element_by_id("ctl00_MainContent_Txt_ToDate").send_keys("27/05/2020")

element_commodity = driver.find_element_by_id ("ctl00_MainContent_Lst_Commodity")
drp_commodity = Select(element_commodity)
drp_commodity.select_by_visible_text("Wheat")

driver.find_element_by_xpath("""//*[@id="ctl00_MainContent_btn_getdata1"]""").click()

soup = BeautifulSoup(driver.page_source, 'html.parser')
table = pd.read_html(driver.page_source)[2]  # index 2 (the third table on the page) is the one that we want
print(len(table))
print(table)

results.append(table)
driver.back()
time.sleep(1)

with pd.ExcelWriter(r'C:\Users\user\Desktop\python.xlsx') as writer:
 table.to_excel(writer, sheet_name = "wheat", index=False) # Wheat results on sheet named wheat
 writer.save()

python excel selenium beautifulsoup

Answer 1


With some file types you have to read all the data into memory, add the new data, and then write everything back to the file. Other file types support an "append" mode.

For more information, refer to the documentation for ExcelWriter, which includes an option mode="a" for appending to an existing file.

with pd.ExcelWriter(r'C:\Users\user\Desktop\python.xlsx') as writer:
    table.to_excel(writer, sheet_name="rice", index=False)
    #writer.save() 

with pd.ExcelWriter(r'C:\Users\user\Desktop\python.xlsx', mode='a') as writer:
    table.to_excel(writer, sheet_name="wheat", index=False)
    #writer.save()
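
To try the append approach without running the scraper, here is a minimal sketch. The two small DataFrames, their column names, and the temporary file path are assumptions standing in for the scraped tables; the engine is set to openpyxl explicitly because append mode depends on it.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-ins for the scraped tables (columns are made up).
rice = pd.DataFrame({"Centre": ["Delhi", "Mumbai"], "Price": [30, 32]})
wheat = pd.DataFrame({"Centre": ["Delhi", "Mumbai"], "Price": [25, 27]})

# Write to a temporary folder so nothing on the Desktop is touched.
path = os.path.join(tempfile.mkdtemp(), "python.xlsx")

# The first write creates the file (the default mode "w" replaces any existing file).
with pd.ExcelWriter(path, engine="openpyxl") as writer:
    rice.to_excel(writer, sheet_name="rice", index=False)

# The second write opens the existing file and adds a sheet to it.
with pd.ExcelWriter(path, engine="openpyxl", mode="a") as writer:
    wheat.to_excel(writer, sheet_name="wheat", index=False)

# Read the sheet names back to check that both sheets are in the file.
with pd.ExcelFile(path) as xls:
    sheet_names = xls.sheet_names
print(sheet_names)
```

Opening the file a second time without mode="a" would start a new workbook, so only the last sheet written would remain.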

Alternatively, you can skip append mode and write both sheets in a single with block.

with pd.ExcelWriter(r'C:\Users\user\Desktop\python.xlsx') as writer:
    table.to_excel(writer, sheet_name="rice", index=False)
    table.to_excel(writer, sheet_name="wheat", index=False)
    #writer.save()
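
In a real script each product needs its own table, so one way to use the single-block approach is to collect the tables first and write them all at the end. This is only a sketch: the `tables` dictionary, its contents, and the temporary path are assumptions, not part of the original script.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-ins: in the scraper, each value would be the table
# returned by pd.read_html() for that product.
tables = {
    "rice": pd.DataFrame({"Centre": ["Delhi", "Mumbai"], "Price": [30, 32]}),
    "wheat": pd.DataFrame({"Centre": ["Delhi", "Mumbai"], "Price": [25, 27]}),
}

path = os.path.join(tempfile.mkdtemp(), "python.xlsx")

# One writer, one file, one sheet per product.
with pd.ExcelWriter(path, engine="openpyxl") as writer:
    for sheet_name, table in tables.items():
        table.to_excel(writer, sheet_name=sheet_name, index=False)

# sheet_name=None loads every sheet into a dict keyed by sheet name.
written = pd.read_excel(path, sheet_name=None)
print(list(written))
```

Because the file is opened only once, nothing gets overwritten and no append mode is needed.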

By the way: I found that append mode does not work with the xlsxwriter engine, so I had to switch to the openpyxl engine (which requires installing the openpyxl module with pip).

with pd.ExcelWriter(r'python.xlsx', engine='openpyxl', mode='a') as writer:
    table.to_excel(writer, sheet_name="wheat", index=False)

I found a list of available engines in the answers to this question: Engines available for to_excel function in pandas
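
Append mode expects the file to exist already, and by default it raises an error if the sheet name is already taken, which matters when the script is run more than once. The sketch below works around both. The `save_sheet` helper is hypothetical, the sample data is made up, and it assumes pandas 1.3 or later, where ExcelWriter accepts `if_sheet_exists` (valid only together with mode="a").

```python
import os
import tempfile

import pandas as pd


def save_sheet(path, table, sheet_name):
    """Hypothetical helper: write `table` to `sheet_name`, keeping other sheets."""
    if os.path.exists(path):
        # The file is already there: append, replacing a sheet with the same name.
        writer = pd.ExcelWriter(
            path, engine="openpyxl", mode="a", if_sheet_exists="replace"
        )
    else:
        # First run: create the file in the default write mode.
        writer = pd.ExcelWriter(path, engine="openpyxl")
    with writer:
        table.to_excel(writer, sheet_name=sheet_name, index=False)


path = os.path.join(tempfile.mkdtemp(), "python.xlsx")

# Made-up data standing in for the scraped tables.
save_sheet(path, pd.DataFrame({"Price": [30, 32]}), "rice")
save_sheet(path, pd.DataFrame({"Price": [25, 27]}), "wheat")
# Saving "rice" again replaces that sheet and leaves "wheat" untouched.
save_sheet(path, pd.DataFrame({"Price": [31, 33]}), "rice")

result = pd.read_excel(path, sheet_name=None)
print(sorted(result))
```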

Here is the full working code:

from selenium import webdriver
from selenium.webdriver.support.ui import Select
import pandas as pd
import time

# --- functions ---

def get_data(start_date, end_date, product):

    # Select `Variation Report`
    driver.find_element_by_id('ctl00_MainContent_Rbl_Rpt_type_1').click()

    # Select `Daily Variation`
    element_variation = driver.find_element_by_id('ctl00_MainContent_Ddl_Rpt_Option1')
    
    ...

