Improving the efficiency of cosine similarity calculations among rows in a dataframe

I have a pandas DataFrame with a large amount of data (~150k rows) in two columns: Id and Features. Each value in the Features column is a 50-element NumPy array. My goal is to pick one random feature vector from the dataset and compute its cosine similarity with every other vector.

My current code does this, but it is slow:

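import numpy as np
import pandas as pd
from scipy.spatial.distance import cosine  # assumed source of cosine(); it returns a distance

# `copy` is the DataFrame holding the Id and Features columns
# Extract a random row as input data
sample = copy.sample()
# Repeat the sampled row so it has as many rows as the original
scaled = pd.DataFrame(np.repeat(sample.values, len(copy), axis=0), columns=['Id', 'Features'])
# Copy the repeated Features column into the original frame
copy['Features_compare'] = scaled['Features']
# Compute cosine similarity row by row (1 - cosine distance)
copy['cosine'] = copy.apply(lambda row: 1 - cosine(row['Features'], row['Features_compare']), axis=1)
copy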

Output:

Id      Features                    Features_compare                cosine
27834   [-21.722315, 11.017685...]  [-25.151268, 1.0155457...]      0.452093
27848   [-24.009565, 2.7699656...]  [-25.151268, 1.0155457...]      0.528901
27895   [-10.533865, 18.835657...]  [-25.151268, 1.0155457...]      0.266685
27900   [-18.124702, 2.6769984...]  [-25.151268, 1.0155457...]      0.381307
27957   [-10.765628, -11.368319...] [-25.151268, 1.0155457...]      0.220016
Elapsed Time: 14.1623s

Are there more efficient methods to compute similarity values for a large dataset using dataframes?

Answer №1

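You can speed this up by vectorizing the computation: pass all feature vectors to scikit-learn's cosine_similarity in a single call instead of applying a function row by row.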

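import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Sample one feature vector; s is a one-element list holding that array
s = df['Features'].sample(n=1).tolist()

# Optional: repeat the sampled vector on every row for display
df['Features_compare'] = np.tile(s, (len(df), 1)).tolist()
# cosine_similarity returns an (n, 1) array; flatten it to one value per row
df['cosine'] = cosine_similarity(df['Features'].tolist(), s).ravel()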

      Id                                         Features                               Features_compare    cosine
0  27834  [-21.722315, 11.017685, -23.340332, -4.7431817]  [-18.124702, 2.6769984, -11.778319, -7.14392]  0.937401
1  27848    [-24.009565, 2.7699656, -10.014014, 9.293142]  [-18.124702, 2.6769984, -11.778319, -7.14392]  0.776474
2  27895  [-10.533865, 18.835657, -10.094039, -7.5399566]  [-18.124702, 2.6769984, -11.778319, -7.14392]  0.722914
3  27900    [-18.124702, 2.6769984, -11.778319, -7.14392]  [-18.124702, 2.6769984, -11.778319, -7.14392]  1.000000
4  27957  [-10.765628, -11.368319, -19.968557, 2.1252406]  [-18.124702, 2.6769984, -11.778319, -7.14392]  0.659093
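
For reference, the same vectorized idea can be sketched with NumPy alone, without scikit-learn. This is a minimal sketch, not the answer's method: the helper name cosine_to_all and the synthetic data are assumptions made for illustration, and it assumes every row of Features holds an array of the same length with a non-zero norm.

```python
import numpy as np
import pandas as pd


def cosine_to_all(features: pd.Series, query: np.ndarray) -> np.ndarray:
    """Cosine similarity between `query` and every vector in `features`.

    Hypothetical helper for illustration; assumes equal-length,
    non-zero vectors.
    """
    # Stack the per-row arrays into one (n_rows, n_dims) matrix.
    matrix = np.vstack(features.tolist())
    # One matrix-vector product gives the dot product of every row with the query.
    dots = matrix @ query
    # Divide by the product of the Euclidean norms.
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query)
    return dots / norms


# Synthetic stand-in for the real data (random values, not the asker's).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Id": np.arange(1000),
    "Features": list(rng.normal(size=(1000, 50))),
})

# Pick one random feature vector and compare it with every row.
query = df["Features"].sample(n=1, random_state=0).iloc[0]
df["cosine"] = cosine_to_all(df["Features"], query)
```

Most of the cost in the original code comes from calling a Python function once per row; here the arithmetic runs inside NumPy on a single 2D array. If the data may contain all-zero vectors, the division needs a guard, since their norm is zero.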
