Skip to content Skip to sidebar Skip to footer

Fill Matrix Of Occurences From Column/row Arrays Of Indexes

I'm searching for an efficient way to create a matrix of occurrences from two arrays that contains indexes, one represents the row indexes in this matrix, the other, the column one

Solution 1:

Use np.add.at, specifying a tuple of indices:

>>> np.add.at(new_matrix, (rows, columns), 1)
>>> new_matrix
array([[ 1.,  0.,  0.],
       [ 0.,  2.,  0.],
       [ 0.,  1.,  2.],
       [ 2.,  1.,  5.]])

np.add.at operates on the array in-place, adding 1 as many times to the indices as specified by the (row, columns) tuple.


Solution 2:

Approach #1

We can convert those pairs to linear indices and then use np.bincount -

def bincount_app(rows, columns, n_rows, n_columns):
    # Get linear index equivalent
    lidx = (columns.max()+1)*rows + columns

    # Use binned count on the linear indices
    return np.bincount(lidx, minlength=n_rows*n_columns).reshape(n_rows,n_columns)

Sample run -

In [242]: n_rows    = 4
     ...: n_columns = 3
     ...: 
     ...: rows    = np.array([0, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3])
     ...: columns = np.array([0, 1, 1, 1, 2, 2, 0, 1, 2, 0, 2, 2, 2, 2])

In [243]: bincount_app(rows, columns, n_rows, n_columns)
Out[243]: 
array([[1, 0, 0],
       [0, 2, 0],
       [0, 1, 2],
       [2, 1, 5]])

Approach #2

Alternatively, we can sort the linear indices and get the counts using slicing to have our second approach, like so -

def mask_diff_app(rows, columns, n_rows, n_columns):
    lidx = (columns.max()+1)*rows + columns
    lidx.sort()
    mask = np.concatenate(([True],lidx[1:] != lidx[:-1],[True]))
    count = np.diff(np.flatnonzero(mask))
    new_matrix = np.zeros([n_rows, n_columns],dtype=int)
    new_matrix.flat[lidx[mask[:-1]]] = count
    return new_matrix

Approach #3

This seems like a straight-forward one with sparse matrix csr_matrix as well, as it does accumulation on its own for repeated indices. The benefit is the memory efficiency, given that it's a sparse matrix, which would be noticeable if you are filling a small number of places in the output and a sparse matrix output is okay.

The implementation would look something like this -

from scipy.sparse import csr_matrix

def sparse_matrix_app(rows, columns, n_rows, n_columns):
    out_shp = (n_rows, n_columns)
    data = np.ones(len(rows),dtype=int)
    return csr_matrix((data, (rows, columns)), shape=out_shp)

If you need a regular/dense array, simply do -

sparse_matrix_app(rows, columns, n_rows, n_columns).toarray()

Sample output -

In [319]: sparse_matrix_app(rows, columns, n_rows, n_columns).toarray()
Out[319]: 
array([[1, 0, 0],
       [0, 2, 0],
       [0, 1, 2],
       [2, 1, 5]])

Benchmarking

Other approach(es) -

# @cᴏʟᴅsᴘᴇᴇᴅ's soln
def add_at_app(rows, columns, n_rows, n_columns):
    new_matrix = np.zeros([n_rows, n_columns],dtype=int)
    np.add.at(new_matrix, (rows, columns), 1)

Timings

Case #1 : Output array of shape (1000, 1000) and no. of indices = 10k

In [307]: # Setup
     ...: n_rows = 1000
     ...: n_columns = 1000
     ...: rows = np.random.randint(0,1000,(10000))
     ...: columns = np.random.randint(0,1000,(10000))

In [308]: %timeit add_at_app(rows, columns, n_rows, n_columns)
     ...: %timeit bincount_app(rows, columns, n_rows, n_columns)
     ...: %timeit mask_diff_app(rows, columns, n_rows, n_columns)
     ...: %timeit sparse_matrix_app(rows, columns, n_rows, n_columns)
1000 loops, best of 3: 1.05 ms per loop
1000 loops, best of 3: 424 µs per loop
1000 loops, best of 3: 1.05 ms per loop
1000 loops, best of 3: 1.41 ms per loop

Case #2 : Output array of shape (1000, 1000) and no. of indices = 100k

In [309]: # Setup
     ...: n_rows = 1000
     ...: n_columns = 1000
     ...: rows = np.random.randint(0,1000,(100000))
     ...: columns = np.random.randint(0,1000,(100000))

In [310]: %timeit add_at_app(rows, columns, n_rows, n_columns)
     ...: %timeit bincount_app(rows, columns, n_rows, n_columns)
     ...: %timeit mask_diff_app(rows, columns, n_rows, n_columns)
     ...: %timeit sparse_matrix_app(rows, columns, n_rows, n_columns)
100 loops, best of 3: 11.4 ms per loop
1000 loops, best of 3: 1.27 ms per loop
100 loops, best of 3: 7.44 ms per loop
10 loops, best of 3: 20.4 ms per loop

Case #3 : Sparse-ness in output

As stated earlier, for the sparse method to work better, we would need sparse-ness. Such a case would be like this -

In [314]: # Setup
     ...: n_rows = 5000
     ...: n_columns = 5000
     ...: rows = np.random.randint(0,5000,(1000))
     ...: columns = np.random.randint(0,5000,(1000))

In [315]: %timeit add_at_app(rows, columns, n_rows, n_columns)
     ...: %timeit bincount_app(rows, columns, n_rows, n_columns)
     ...: %timeit mask_diff_app(rows, columns, n_rows, n_columns)
     ...: %timeit sparse_matrix_app(rows, columns, n_rows, n_columns)
100 loops, best of 3: 11.7 ms per loop
100 loops, best of 3: 11.1 ms per loop
100 loops, best of 3: 11.1 ms per loop
1000 loops, best of 3: 269 µs per loop

If you need a dense array, we lose the memory efficiency and hence performance one as well -

In [317]: %timeit sparse_matrix_app(rows, columns, n_rows, n_columns).toarray()
100 loops, best of 3: 11.7 ms per loop

Post a Comment for "Fill Matrix Of Occurences From Column/row Arrays Of Indexes"