What could be causing the increase in file size after running PCA on the image?

I am currently working on developing an image classification model to identify different species of deer in the United States. As part of this process, I am utilizing Principal Component Analysis (PCA) to reduce the memory size of the images and optimize the run time of the model.

However, I have encountered a puzzling issue where all the new PCA-compressed images generated by my Deer_PCA function are larger in file size compared to the original images. For instance, the original image was 128 KB, but the compressed version after running it with n_components = 150 now stands at 293 KB. Can anyone shed some light on why this unexpected outcome is happening?

Below is the image that was processed using the function; make sure to place the image in an empty folder before executing the code:

Here is the resulting compressed image obtained after applying the Deer_PCA function:

Displayed below is the code implementation:

# Required packages

import os, sys

import cv2
from sklearn.decomposition import PCA

# Function to perform PCA on images within a specific folder and save them in another folder

def Deer_PCA(inpath, outpath,n_comp):
    for image_path in os.listdir(inpath):

        # Read input file
        input_path = os.path.join(inpath, image_path)
        print(input_path)
        
        # cv2.imread returns BGR, which matches the split order below
        # and the channel order cv2.imwrite expects when saving
        w_deer = cv2.imread(input_path)

        # Split channels
        blue_2,green_2,red_2 = cv2.split(w_deer)

        # Scale channels
        w_blue = blue_2/255
        w_green = green_2/255
        w_red = red_2/255

        # Perform PCA on each channel
        pca_b2 = PCA(n_components=n_comp)
        pca_b2.fit(w_blue)            
        trans_pca_b2 = pca_b2.transform(w_blue)

        pca_g2 = PCA(n_components=n_comp)
        pca_g2.fit(w_green)
        trans_pca_g2 = pca_g2.transform(w_green)

        pca_r2 = PCA(n_components=n_comp)
        pca_r2.fit(w_red)
        trans_pca_r2 = pca_r2.transform(w_red)

        # Reconstruct each channel at its original size, then merge the channels
        b_arr2 = pca_b2.inverse_transform(trans_pca_b2)
        g_arr2 = pca_g2.inverse_transform(trans_pca_g2)
        r_arr2 = pca_r2.inverse_transform(trans_pca_r2)

        img_reduced2 = (cv2.merge((b_arr2, g_arr2, r_arr2)))
        
        print("Merge Successful")

        # Save output
        fullpath = os.path.join(outpath, 'PCA_'+image_path)
        cv2.imwrite(fullpath, img_reduced2*255)
        
        print("Successfully saved\n")
        

# Check image sizes 

original_image_path = '/Users/matthew_macwan/Downloads/CIS/I_Class_Deer/mule_deer_doe/mule deer doe_1.jpeg'

PCA_compressed_image_path = '/Users/matthew_macwan/Downloads/CIS/I_Class_Deer/mule_deer_doe/PCA_mule deer doe_1.jpeg'

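# os.path.getsize returns the file size on disk in bytes;
# sys.getsizeof would only measure the path string object in memory
print('Original Image:', os.path.getsize(original_image_path))

print('PCA Image:', os.path.getsize(PCA_compressed_image_path))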

Answer №1

There seems to be a misconception here. Performing PCA on a single image channel treats each row as an individual observation (scikit-learn's PCA takes rows as samples and columns as features). The transform does reduce each channel to 150 columns, which decreases the data volume and potentially the information content.

However, when you reconstruct the image from the PCA with inverse_transform, you end up with an array of the same size as the original, and that array is what you save as a JPEG file. There are no fewer data points to store. The overall information in the image may have decreased, but PCA discards information in a different way than JPEG compression does, so the JPEG encoder is unlikely to benefit from it and squeeze the data into fewer bytes.
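
The point about array sizes can be checked with a small sketch. It uses plain NumPy (an SVD-based PCA) instead of scikit-learn, and the 300x400 random channel and the helper name pca_roundtrip are assumptions made up for illustration, not part of the question's code:

```python
import numpy as np

def pca_roundtrip(channel, n_comp):
    """Project a 2-D channel onto n_comp principal components and reconstruct it."""
    mean = channel.mean(axis=0)                 # per-column mean, used to center the features
    centered = channel - mean
    # Rows of vt are the principal directions of the centered data
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:n_comp]                         # shape (n_comp, width)
    projected = centered @ basis.T              # shape (height, n_comp)
    reconstructed = projected @ basis + mean    # shape (height, width) again
    return projected, reconstructed

rng = np.random.default_rng(0)
channel = rng.random((300, 400))                # hypothetical 300x400 channel scaled to [0, 1]
projected, reconstructed = pca_roundtrip(channel, n_comp=150)
print(projected.shape)       # (300, 150)
print(reconstructed.shape)   # (300, 400)
```

Only the projected array is smaller. The reconstructed array, which corresponds to what Deer_PCA writes to disk, has exactly as many values as the input.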

If the output JPEG ends up larger than the input, either the changes PCA made to the image make it harder for JPEG to compress, or the quality setting used when writing the file is responsible. Adjusting the JPEG quality setting is the most effective way to reduce file size.

To actually use PCA for image compression, you would have to save the PCA basis vectors together with the image projected onto those vectors, instead of the reconstructed image. However, this approach may not be the most effective method for compressing images.
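
A back-of-the-envelope count shows what that would mean in practice. The 600x800 channel size is an assumption chosen for illustration, and pca_stored_values is a hypothetical helper:

```python
def pca_stored_values(height, width, n_comp):
    """Count the values needed to rebuild one channel: basis + projection + column means."""
    basis = n_comp * width          # n_comp basis vectors of length width
    projection = height * n_comp    # one weight per row and component
    mean = width                    # per-column mean removed before projecting
    return basis + projection + mean

height, width, n_comp = 600, 800, 150      # hypothetical channel size
raw = height * width
compressed = pca_stored_values(height, width, n_comp)
print(raw, compressed)                     # 480000 210800
```

Note that the stored values would be floating-point numbers, typically 4 or 8 bytes each, while the raw channel needs 1 byte per pixel, so at n_components = 150 the saving in bytes can disappear entirely.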

An alternative image compression technique involves converting a large collection of images into vectors by arranging their sample values in rows and then applying PCA to the entire dataset. Each image can then be represented as a linear combination of these basis vectors, necessitating storage of only the weights per basis vector. While this method showcases how PCA functions, its effectiveness is not guaranteed. It is advisable to stick to established image compression methods like JPEG and JPEG2000.
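
A minimal sketch of this dataset-wide approach, again with plain NumPy and a made-up dataset of 50 random 32x32 grayscale images (real images would be needed for the basis to be meaningful):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical dataset: 50 grayscale images of 32x32 pixels
images = rng.random((50, 32, 32))
X = images.reshape(len(images), -1)            # one flattened image per row, shape (50, 1024)

mean = X.mean(axis=0)
_, _, vt = np.linalg.svd(X - mean, full_matrices=False)
k = 10
basis = vt[:k]                                 # shared basis vectors, shape (10, 1024)
weights = (X - mean) @ basis.T                 # per-image weights, shape (50, 10)

# Approximate reconstruction of the first image from its 10 weights
approx = (weights[0] @ basis + mean).reshape(32, 32)
print(weights.shape, approx.shape)             # (50, 10) (32, 32)
```

The basis and the mean are stored once for the whole collection, and each image then costs only k weights.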


Regarding your stated aim of doing this "with the goal of reducing memory usage and enhancing the model's runtime efficiency during later stages":

It should be noted that the file size has no direct impact on the workload of the model. When the image is loaded from the file into memory, a specific number of pixels are acquired, which the model must analyze. The storage space occupied by the data on disk is inconsequential at this stage. To improve the model's speed, consider reducing the pixel count through subsampling. However, ensure that the essential recognition features remain intact post-resampling. Overly aggressive pixel reduction may hinder the model's ability to distinguish between different objects effectively!
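
One simple way to reduce the pixel count is to average non-overlapping blocks of pixels. The sketch below assumes an H x W x C NumPy array; the function name downsample, the 481x640 input, and the factor of 4 are arbitrary choices for illustration:

```python
import numpy as np

def downsample(image, factor):
    """Shrink an H x W x C image by averaging non-overlapping factor x factor blocks."""
    h, w, c = image.shape
    h, w = h - h % factor, w - w % factor       # crop so the size divides evenly
    blocks = image[:h, :w].reshape(h // factor, factor, w // factor, factor, c)
    return blocks.mean(axis=(1, 3)).astype(image.dtype)

img = np.zeros((481, 640, 3), dtype=np.uint8)   # hypothetical input image
small = downsample(img, 4)
print(small.shape)                              # (120, 160, 3)
```

A factor of 4 leaves the model with 16 times fewer pixels to process. With OpenCV, cv2.resize with interpolation=cv2.INTER_AREA is a library alternative for shrinking images.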
