Scipy: Generating a sparse indicator matrix from one or more arrays

Is there a more efficient way to compute a sparse boolean matrix I from one or two arrays a, b, where I[i,j]==True if a[i]==b[j]? The following is fast but memory-inefficient:

I = a[:,None]==b

Converting the result to a sparse matrix is slower and still memory-inefficient during creation, because the dense comparison is built first:

I = csr_matrix((a[:,None]==b),shape=(len(a),len(b)))

The following yields the row and column indices needed for a better csr_matrix initialization, but it still builds the full dense matrix and is just as slow:

z = np.argwhere((a[:,None]==b))

Do you have any suggestions for improvement?

Answer №1

One approach is to first find the values that a and b have in common by using sets. This works best when a and b contain only a limited number of distinct values. Then loop over the shared values (stored in the variable values) and use np.argwhere to find the indices in a and in b where each value occurs. The 2D indices for the sparse matrix can then be built with np.repeat and np.tile:

import numpy as np
from scipy import sparse

a = np.random.randint(0, 10, size=(400,))
b = np.random.randint(0, 10, size=(300,))

## Matrix generation following the OP's method
I1 = sparse.csr_matrix((a[:,None]==b), shape=(len(a), len(b)))

## Identifying common values in a and b:
values = set(np.unique(a)) & set(np.unique(b))

## Collecting indices in a and b where values match:
rows, cols = [], []

## Looping over shared values, finding their indices in a and b,
## and generating 2D indices of the sparse matrix with np.repeat and np.tile
for value in values:
    x = np.argwhere(a==value).ravel()
    y = np.argwhere(b==value).ravel()
    ## Pair every index in x with every index in y (len(x)*len(y) pairs)
    rows.append(np.repeat(x, len(y)))
    cols.append(np.tile(y, len(x)))

## Concatenating indices for different values and creating a 1D vector
## of True values for final matrix generation
rows = np.hstack(rows)
cols = np.hstack(cols)
data = np.ones(len(rows), dtype=bool)

## Generating sparse matrix
I3 = sparse.csr_matrix( (data, (rows, cols)), shape=(len(a), len(b)) )

## Verifying correct generation of the matrix:
print((I1 != I3).nnz==0)

The syntax for creating the csr matrix is sourced from the documentation. The test for equality in sparse matrices is inspired by this post.
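As a small illustration of how np.repeat and np.tile combine into index pairs, here is a minimal sketch with made-up index arrays (x and y are hypothetical positions, not taken from the data above). Each index in x has to be repeated once per index in y, and the y indices have to be tiled once per index in x, so that both outputs have length len(x) * len(y):

```python
import numpy as np

# Hypothetical positions where one shared value occurs
x = np.array([0, 3])      # positions in a
y = np.array([1, 2, 5])   # positions in b

# Repeat each row index once per matching column index
rows = np.repeat(x, len(y))
# Tile the column indices once per matching row index
cols = np.tile(y, len(x))

# Every (row, col) combination appears exactly once
pairs = list(zip(rows.tolist(), cols.tolist()))
print(pairs)
```

Both arrays must end up the same length, otherwise csr_matrix cannot interpret them as coordinate pairs.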

Previous Response:

I cannot say much about performance, but you can at least avoid constructing the full dense matrix by using a simple generator expression. The code below first builds the sparse matrix the way the OP does and then uses a generator expression that tests all element pairs for equality:

import numpy as np
from scipy import sparse

a = np.random.randint(0, 10, size=(400,))
b = np.random.randint(0, 10, size=(300,))

## Matrix generation following OP's method
I1 = sparse.csr_matrix((a[:, None] == b), shape=(len(a), len(b)))

## Matrix creation using generator
data, rows, cols = zip(
    *((True, i, j) for i, A in enumerate(a) for j, B in enumerate(b) if A == B)
)
I2 = sparse.csr_matrix((data, (rows, cols)), shape=(len(a), len(b)))

## Checking for matrix equality
print((I1 != I2).nnz == 0)  ## --> True

It seems like the double loop is unavoidable, and it would be ideal to optimize this within numpy. However, using a generator expression does somewhat streamline the loops...
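One possible way to push the double loop into numpy is to sort b and look up each element of a with np.searchsorted. The following is only a sketch of that idea, not part of the original answer; the function name sparse_indicator is hypothetical, and it assumes a and b are 1D arrays of sortable values without NaN:

```python
import numpy as np
from scipy import sparse

def sparse_indicator(a, b):
    """Sketch: sparse I with I[i, j] == True where a[i] == b[j]."""
    a = np.asarray(a)
    b = np.asarray(b)
    # Sort b once and remember where each sorted element came from
    order = np.argsort(b, kind="stable")
    b_sorted = b[order]
    # For each a[i], the run b_sorted[left[i]:right[i]] equals a[i]
    left = np.searchsorted(b_sorted, a, side="left")
    right = np.searchsorted(b_sorted, a, side="right")
    counts = right - left
    # Row index i is repeated once per match of a[i] in b
    rows = np.repeat(np.arange(len(a)), counts)
    # Offsets 0..counts[i]-1 inside each run, computed without a loop
    starts = np.cumsum(counts) - counts
    offsets = np.arange(counts.sum()) - np.repeat(starts, counts)
    # Map positions in the sorted array back to original positions in b
    cols = order[np.repeat(left, counts) + offsets]
    data = np.ones(len(rows), dtype=bool)
    return sparse.csr_matrix((data, (rows, cols)), shape=(len(a), len(b)))

a = np.random.randint(0, 10, size=(400,))
b = np.random.randint(0, 10, size=(300,))
I4 = sparse_indicator(a, b)
# Compare against the dense construction
print((sparse.csr_matrix(a[:, None] == b) != I4).nnz == 0)
```

Memory use here is proportional to len(a) + len(b) plus the number of matches, so it only helps when the result is actually sparse.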

Answer №2

If you want to compare values with a small tolerance, you can utilize numpy.isclose:

np.isclose(a,b)

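Alternatively, if a and b are pandas objects, you can use pandas.DataFrame.eq:

a.eq(b)

Keep in mind that both return a dense array of True and False values, not a sparse matrix. Both also compare element-wise, so to compare every a[i] with every b[j] the arrays must first be broadcast against each other, for example with a[:,None].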
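If a tolerance-based comparison is needed but the dense result is too large, one option is to compare a in row chunks and convert each chunk to sparse before moving on. This is a rough sketch under that assumption; sparse_isclose and the chunk parameter are hypothetical names, and extra keyword arguments are simply passed through to np.isclose:

```python
import numpy as np
from scipy import sparse

def sparse_isclose(a, b, chunk=1024, **kwargs):
    """Sketch: sparse matrix of np.isclose(a[i], b[j]), built chunk by chunk."""
    a = np.asarray(a)
    b = np.asarray(b)
    blocks = []
    for start in range(0, len(a), chunk):
        # Only a (chunk x len(b)) dense block exists at any one time
        dense = np.isclose(a[start:start + chunk, None], b, **kwargs)
        blocks.append(sparse.csr_matrix(dense))
    if not blocks:
        # Empty a: return an empty matrix with the right number of columns
        return sparse.csr_matrix((0, len(b)), dtype=bool)
    return sparse.vstack(blocks, format="csr")

a = np.array([0.0, 1.0, 2.0])
b = np.array([1.0 + 1e-12, 5.0])
M = sparse_isclose(a, b, chunk=2)
print(M.toarray())
```

Peak memory is then bounded by chunk * len(b) booleans plus the sparse result, at the cost of still performing len(a) * len(b) comparisons.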
