Looping through a series of rows by utilizing ws.iter_rows within the highly efficient openpyxl reader

Question

I need to read an xlsx file that contains 10 x 5324 cells.

This is roughly what I was trying to do:

from openpyxl import load_workbook
filename = 'file_path'

wb = load_workbook(filename)
ws = wb.get_sheet_by_name('LOG')

col = {'Time':0 ...}

for i in ws.columns[col['Time']][1:]:
    print i.value.hour

The code ran far slower than expected (I was doing real work on each cell, not just printing), and I eventually lost patience and killed it.

How can I do this with the optimized reader? I need to iterate over a specific range of rows, not all rows. This is what I tried, but it is wrong:

wb = load_workbook(filename, use_iterators = True)
ws = wb.get_sheet_by_name('LOG')
for i in ws.iter_rows[1:]:
    print i[col['Time']].value.hour

Is there a way to achieve this without using the range function?

One approach I considered is:

for i in ws.iter_rows[1:]:
    if i.row == startrow:
        continue
    print i[col['Time']].value.hour
    if i.row == endrow:
        break

Is there a more elegant solution? (This approach doesn't work either.)

python excel xlsx openpyxl

Answer №1

If you only need a lower bound (for example, to skip the header row), the simplest solution is:

# Your code:
from openpyxl import load_workbook
filename = 'file_path'
wb = load_workbook(filename, use_iterators=True)
ws = wb.get_sheet_by_name('LOG')

# Solution 1:
for row in ws.iter_rows(row_offset=1):
    pass  # code to execute per row...

Here is an alternative method to achieve the same outcome using the enumerate function:

# Solution 2:
start, stop = 1, 100    # lower and upper limits (zero-based, both exclusive)
for index, row in enumerate(ws.iter_rows()):
    if start < index < stop:
        pass  # code to execute per row...

The index variable tracks the zero-based position of the current row (the worksheet row number minus one), so it can stand in for range or xrange. Unlike slicing, this works with iterators, which cannot be indexed. You can also drop the upper limit and test only the lower one if that is all you need.
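The same windowing can also be sketched with itertools.islice from the standard library, which is not part of the answer above but stops consuming the iterator once the upper limit is reached instead of testing every remaining row. The fake_rows generator and the rows_between helper below are hypothetical stand-ins; with a real worksheet you would pass ws.iter_rows() instead. Note that islice includes the start position, whereas the comparison in Solution 2 excludes it.

```python
from itertools import islice


def fake_rows(n):
    """Hypothetical stand-in for ws.iter_rows(): yields one tuple per row."""
    for row_number in range(1, n + 1):
        yield (row_number, "value-%d" % row_number)


def rows_between(rows, start, stop):
    """Return rows whose zero-based position p satisfies start <= p < stop."""
    return islice(rows, start, stop)


# Skip the first row (position 0) and take the next three.
selected = list(rows_between(fake_rows(10), 1, 4))
print(selected)
```

Because islice is lazy, nothing past the requested window is read from the underlying iterator.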

Answer №2

In the documentation, it is mentioned:

Keep in mind: A worksheet created in memory starts out empty, with no cells until they are accessed for the first time. This approach helps minimize memory usage by only creating objects that are actually needed.

Be careful: By scrolling through cells instead of accessing them directly, all cells will be generated in memory even if they remain unused. For example:
>>> for i in xrange(0,100):
...     for j in xrange(0,100):
...         ws.cell(row = i, column = j)
This code snippet will create unnecessary 100x100 cells in memory.

However, there are methods available to clean up these excess cells, which we will explore later on.

It's important to note that accessing the columns or rows of a worksheet may load numerous additional cells into memory. It is recommended to access only the specific cells you require.

For instance:

col_name = 'A'
start_row = 1
end_row = 99

range_expr = "{col}{start_row}:{col}{end_row}".format(
    col=col_name, start_row=start_row, end_row=end_row)

for (time_cell,) in ws.iter_rows(range_string=range_expr):
    print time_cell.value.hour
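For readers on a recent openpyxl release, here is a hedged Python 3 sketch of the same idea. It assumes a version in which load_workbook accepts read_only=True and iter_rows takes the min_row, max_row, min_col and max_col keyword arguments in place of range_string. The workbook is built in memory with made-up data purely so the example is self-contained; it is not the asker's file.

```python
from datetime import datetime
from io import BytesIO

from openpyxl import Workbook, load_workbook

# Build a small stand-in workbook in memory (hypothetical data).
source = Workbook()
sheet = source.active
sheet.title = "LOG"
sheet.append(["Time", "Reading"])            # header, worksheet row 1
for hour in range(8, 14):                    # worksheet rows 2 to 7
    sheet.append([datetime(2020, 1, 1, hour, 30), hour * 10])

buffer = BytesIO()
source.save(buffer)
buffer.seek(0)

# Re-open with the streaming reader and visit only rows 3 to 5 of column A.
wb = load_workbook(buffer, read_only=True)
ws = wb["LOG"]

start_row, end_row = 3, 5
hours = []
for (time_cell,) in ws.iter_rows(min_row=start_row, max_row=end_row,
                                 min_col=1, max_col=1):
    hours.append(time_cell.value.hour)

wb.close()
print(hours)
```

Restricting both the row and the column bounds means each yielded tuple holds a single cell, which is why the loop can unpack it as (time_cell,).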
