TarInfo objects are leaking memory

I have a Python tool that reads through a tar.xz file and processes each file inside it. The input file is 15 MB compressed and expands to 740 MB uncompressed.

On one server with very little memory, the program crashes because it runs out of memory. To troubleshoot, I used objgraph to see which objects are being created. It appears that the TarInfo instances are never released. The main loop looks something like this:

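import tarfile
import objgraph

count = 0
with tarfile.open(...) as tar:
    while True:
        member = tar.next()
        if member is None:  # end of archive
            break
        stream = tar.extractfile(member)
        process_stream(stream)
        count += 1
        if count % 1000 == 0:
            objgraph.show_growth(limit=10)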

The diagnostic output consistently shows:

TarInfo     2040     +1000
TarInfo     3040     +1000
TarInfo     4040     +1000
TarInfo     5040     +1000
TarInfo     6040     +1000
TarInfo     7040     +1000
TarInfo     8040     +1000
TarInfo     9040     +1000
TarInfo    10040     +1000
TarInfo    11040     +1000
TarInfo    12040     +1000

This pattern persists until all 30,000 files are processed.

To investigate further, I removed the lines that create and process the streams. The growth pattern stayed the same, so the TarInfo instances leak regardless of what is done with each file.

This happens on Ubuntu, OS X, and Windows, with Python 3.4.1.

Answer №1

It appears that this behavior is intentional. The TarFile object keeps every TarInfo object it has read in its `members` attribute. Each time you call the `next` method, the TarInfo object it reads from the archive is appended to this list:

def next(self):
    """Return the next member of the archive as a TarInfo object, when
       TarFile is opened for reading. Return None if there is no more
       available.
    """
    self._check("ra")
    if self.firstmember is not None:
        m = self.firstmember
        self.firstmember = None
        return m

    # Read the next block.
    self.fileobj.seek(self.offset)
    tarinfo = None
    ... <snip>

    if tarinfo is not None:
        self.members.append(tarinfo)  # <-- the TarInfo instance is added to members
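
To see this caching in action, here is a minimal sketch that builds a small archive in memory and records `len(tar.members)` after each call to `next`. The helper names `build_archive` and `count_cached_members` are made up for this illustration and are not part of tarfile:

```python
import io
import tarfile


def build_archive(n_files):
    """Hypothetical helper: create an in-memory tar archive with n_files small files."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for i in range(n_files):
            data = b"payload %d" % i
            info = tarfile.TarInfo(name="file_%d.txt" % i)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    buf.seek(0)
    return buf


def count_cached_members(n_files):
    """Walk the archive with next() and record len(tar.members) after each call."""
    sizes = []
    with tarfile.open(fileobj=build_archive(n_files), mode="r") as tar:
        while True:
            member = tar.next()
            if member is None:  # end of archive
                break
            sizes.append(len(tar.members))
    return sizes


sizes = count_cached_members(5)
print(sizes)  # the cached list grows by one entry per member read
```

The recorded lengths increase by one for every member returned, which matches the steady +1000 growth that objgraph reports in the question.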

The members list therefore grows by one entry for every member read from the archive. This cache is useful for methods like getmembers and getmember, but it becomes a problem when an archive holds many files and memory is tight. One suggested workaround is to clear the members attribute while iterating:

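count = 0
with tarfile.open(...) as tar:
    while True:
        member = tar.next()
        if member is None:  # end of archive
            break
        stream = tar.extractfile(member)
        process_stream(stream)
        count += 1
        tar.members = []  # Clear the members list so TarInfo objects can be freed
        if count % 1000 == 0:
            objgraph.show_growth(limit=10)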
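
A self-contained way to check the effect of the workaround, without objgraph or a real archive, might look like the sketch below. It assumes an in-memory test archive, and `build_archive`, `process_stream`, and `process_all` are hypothetical names introduced only for this example; `process_stream` stands in for whatever per-file work the real tool does.

```python
import io
import tarfile


def build_archive(n_files):
    """Hypothetical helper: create an in-memory tar archive with n_files small files."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for i in range(n_files):
            data = b"payload %d" % i
            info = tarfile.TarInfo(name="file_%d.txt" % i)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    buf.seek(0)
    return buf


def process_stream(stream):
    """Hypothetical stand-in for the real per-file work: return the byte count."""
    return len(stream.read())


def process_all(archive, clear_members=True):
    """Process every regular file; return (files_processed, peak_len_of_members)."""
    processed = 0
    peak = 0
    with tarfile.open(fileobj=archive, mode="r") as tar:
        while True:
            member = tar.next()
            if member is None:  # end of archive
                break
            if member.isfile():  # extractfile() returns None for directories etc.
                with tar.extractfile(member) as stream:
                    process_stream(stream)
                processed += 1
            peak = max(peak, len(tar.members))
            if clear_members:
                tar.members = []  # drop the cached TarInfo objects
    return processed, peak


print(process_all(build_archive(20), clear_members=True))
print(process_all(build_archive(20), clear_members=False))
```

With clearing enabled the list never holds more than the member just read, while without it the list ends up holding every member. Note that `getmembers` and `getmember` rely on this cache, so the trick is best kept to loops that only walk the archive once with `next`.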
