Scipy: Generating a sparse indicator matrix from one or more arrays

Question

Is there a more efficient way to compute a sparse boolean matrix I from one or two arrays a, b, where I[i,j]==True if a[i]==b[j]? The following is fast but not memory-efficient, because it builds the full dense matrix:

I = a[:,None]==b

Another option is slower and still memory-inefficient during creation:

I = csr((a[:,None]==b),shape=(len(a),len(b)))

The following yields the row and column indices that could feed a better csr_matrix initialization, but it still builds the full dense matrix and is slow:

z = np.argwhere((a[:,None]==b))

Do you have any suggestions for improvement?

python numpy scipy sparse-matrix indicator

Answer №1

One approach is to first find the values that a and b have in common, using sets. This works best when a and b can take only a small number of distinct values. Then loop over the shared values (stored in the variable values) and use np.argwhere to find the indices in a and b where each value occurs. The 2D indices for the sparse matrix can then be built with np.repeat and np.tile:

import numpy as np
from scipy import sparse

a = np.random.randint(0, 10, size=(400,))
b = np.random.randint(0, 10, size=(300,))

## Creating the matrix with the original (dense) method
I1 = sparse.csr_matrix((a[:,None]==b), shape=(len(a), len(b)))

## Identifying common values in a and b:
values = set(np.unique(a)) & set(np.unique(b))

## Collecting indices in a and b where values match:
rows, cols = [], []

## Looping over shared values, finding their indices in a and b,
## and generating 2D indices of the sparse matrix with np.repeat and np.tile.
## Each index in x is paired with every index in y, so x is repeated
## len(y) times and y is tiled len(x) times.
for value in values:
    x = np.argwhere(a==value).ravel()
    y = np.argwhere(b==value).ravel()
    rows.append(np.repeat(x, len(y)))
    cols.append(np.tile(y, len(x)))

## Concatenating indices for different values and creating a 1D vector
## of True values for final matrix generation
rows = np.hstack(rows)
cols = np.hstack(cols)
data = np.ones(len(rows), dtype=bool)

## Generating sparse matrix
I3 = sparse.csr_matrix( (data, (rows, cols)), shape=(len(a), len(b)) )

## Verifying correct generation of the matrix:
print((I1 != I3).nnz==0)

The syntax for creating the csr matrix is sourced from the documentation. The test for equality in sparse matrices is inspired by this post.
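
To make the index construction easier to follow, here is a small worked example on toy inputs. The arrays and the value 7 are made up for illustration and are not part of the original answer; np.flatnonzero is used as a shorthand for np.argwhere(...).ravel().

```python
import numpy as np
from scipy import sparse

# Hypothetical toy inputs: the value 7 sits at positions 0 and 2 of `a`
# and at positions 1, 3 and 4 of `b`.
a = np.array([7, 1, 7])
b = np.array([0, 7, 2, 7, 7])

x = np.flatnonzero(a == 7)   # indices 0, 2
y = np.flatnonzero(b == 7)   # indices 1, 3, 4

# Every index in x has to be paired with every index in y.
rows = np.repeat(x, len(y))  # 0, 0, 0, 2, 2, 2
cols = np.tile(y, len(x))    # 1, 3, 4, 1, 3, 4
data = np.ones(len(rows), dtype=bool)

# Coordinate-style constructor: entry (rows[k], cols[k]) receives data[k].
I_small = sparse.csr_matrix((data, (rows, cols)), shape=(len(a), len(b)))

# Reference built the dense way. `!=` between two sparse matrices gives a
# sparse matrix of mismatches, so zero stored entries means they agree.
I_dense = sparse.csr_matrix(a[:, None] == b)
matches = (I_small != I_dense).nnz == 0
print(matches)
```

Both index vectors end up with len(x) * len(y) entries, which is the number of True cells contributed by one shared value.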

Previous Response:

I can't vouch for the performance, but you can at least avoid constructing the full dense matrix by using a simple generator expression. The code below first generates the sparse matrix the way the OP did, then uses a generator expression to test all pairs of elements for equality:

import numpy as np
from scipy import sparse

a = np.random.randint(0, 10, size=(400,))
b = np.random.randint(0, 10, size=(300,))

## Matrix generation following OP's method
I1 = sparse.csr_matrix((a[:, None] == b), shape=(len(a), len(b)))

## Matrix creation using generator
data, rows, cols = zip(
    *((True, i, j) for i, A in enumerate(a) for j, B in enumerate(b) if A == B)
)
I2 = sparse.csr_matrix((data, (rows, cols)), shape=(len(a), len(b)))

## Checking for matrix equality
print((I1 != I2).nnz == 0)  ## --> True

It seems like the double loop is unavoidable, and it would be ideal to optimize this within numpy. However, using a generator expression does somewhat streamline the loops...
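
One possible way to push the pairing out of Python loops, offered only as a sketch and not taken from the original answer: encode a and b as sparse one-hot matrices over a shared set of integer codes and multiply them. The helper name sparse_indicator is hypothetical, and the sketch assumes a and b are 1-D arrays whose values np.unique can sort.

```python
import numpy as np
from scipy import sparse


def sparse_indicator(a, b):
    """Sparse boolean matrix I with I[i, j] == True where a[i] == b[j].

    Hypothetical helper; assumes `a` and `b` are 1-D and sortable.
    """
    a = np.asarray(a)
    b = np.asarray(b)

    # Map every element of a and b to an integer code shared by both arrays.
    uniques, codes = np.unique(np.concatenate([a, b]), return_inverse=True)
    codes = codes.ravel()
    codes_a, codes_b = codes[:len(a)], codes[len(a):]
    n_codes = len(uniques)

    # One-hot encode: row i holds a single 1 in the column of its code.
    A = sparse.csr_matrix(
        (np.ones(len(a), dtype=np.int32), (np.arange(len(a)), codes_a)),
        shape=(len(a), n_codes),
    )
    B = sparse.csr_matrix(
        (np.ones(len(b), dtype=np.int32), (np.arange(len(b)), codes_b)),
        shape=(len(b), n_codes),
    )

    # (A @ B.T)[i, j] is 1 exactly when a[i] and b[j] share a code.
    return (A @ B.T).astype(bool)


# Usage on the same kind of random data as above.
a = np.random.randint(0, 10, size=(400,))
b = np.random.randint(0, 10, size=(300,))
I4 = sparse_indicator(a, b)
print(I4.shape, I4.nnz)
```

Only the one-hot matrices and the result are stored, so memory should scale with len(a) + len(b) plus the number of matches rather than with len(a) * len(b). Whether it is actually faster on your data is something to measure.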

Answer №2

If you want to compare values with a small tolerance, you can use numpy.isclose:

np.isclose(a,b)

Alternatively, if a is a pandas DataFrame, you can use pandas.DataFrame.eq:

a.eq(b)

Keep in mind that both compare element-wise and return a dense result of True and False values, not a sparse matrix.
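
If a tolerance-based comparison is what you need, one way to keep memory bounded is to compare a block of rows at a time and stack the sparse pieces. This is only a sketch: the helper name sparse_isclose and the block_size parameter are made up for illustration, and extra keyword arguments are simply forwarded to np.isclose.

```python
import numpy as np
from scipy import sparse


def sparse_isclose(a, b, block_size=1024, **isclose_kwargs):
    """Sparse boolean matrix I with I[i, j] == np.isclose(a[i], b[j]).

    Hypothetical helper; assumes `a` and `b` are 1-D numeric arrays.
    """
    a = np.asarray(a)
    b = np.asarray(b)
    blocks = []
    for start in range(0, len(a), block_size):
        # Dense comparison of one block only: at most block_size x len(b).
        dense_block = np.isclose(
            a[start:start + block_size, None], b, **isclose_kwargs
        )
        blocks.append(sparse.csr_matrix(dense_block))
    if not blocks:
        # Empty `a`: return an empty matrix with the right column count.
        return sparse.csr_matrix((0, len(b)), dtype=bool)
    return sparse.vstack(blocks, format="csr")


# Usage: the last element of `a` differs from 1.0 only by rounding noise.
a = np.array([0.0, 1.0, 2.0, 1.0 + 1e-12])
b = np.array([1.0, 3.0, 0.0])
I5 = sparse_isclose(a, b, block_size=2)
print(I5.toarray())
```

The peak dense allocation is block_size * len(b) booleans, so block_size trades memory for the number of Python-level iterations.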
