Sparse
This implements sparse arrays of arbitrary dimension on top of
numpy and
scipy.sparse
. It generalizes the
scipy.sparse.coo_matrix
and
scipy.sparse.dok_matrix
layouts, but
extends beyond just rows and columns to an arbitrary number of
dimensions.
Additionally, this project maintains compatibility with the
numpy.ndarray
interface rather than the
numpy.matrix
interface used in
scipy.sparse
These differences make this project useful in certain situations where
scipy.sparse matrices are not well suited, but it should not be
considered a full replacement. The data structures in pydata/sparse
complement and can be used in conjunction with the fast linear algebra
routines inside scipy.sparse
. A format conversion or copy may be
required.
Motivation
Sparse arrays, or arrays that are mostly empty or filled with zeros, are common in many scientific applications. To save space we often avoid storing these arrays in traditional dense formats, and instead choose different data structures. Our choice of data structure can significantly affect our storage and computational costs when working with these arrays.
Design
The main data structure in this library follows the Coordinate List (COO) layout for sparse matrices, but extends it to multiple dimensions.
The COO layout, which stores the row index, column index, and value of every element:
row | col | data |
---|---|---|
0 | 0 | 10 |
0 | 2 | 13 |
1 | 3 | 9 |
3 | 8 | 21 |
It is straightforward to extend the COO layout to an arbitrary number of dimensions:
dim1 | dim2 | dim3 | ... | data |
---|---|---|---|---|
0 | 0 | 0 | . | 10 |
0 | 0 | 3 | . | 13 |
0 | 2 | 2 | . | 9 |
3 | 1 | 4 | . | 21 |
This makes it easy to store a multidimensional sparse array, but we still need to reimplement all of the array operations like transpose, reshape, slicing, tensordot, reductions, etc., which can be challenging in general.
This library also includes several other data structures. Similar to COO, the Dictionary of Keys (DOK) format for sparse matrices generalizes well to an arbitrary number of dimensions. DOK is well-suited for writing and mutating. Most other operations are not supported for DOK. A common workflow may involve writing an array with DOK and then converting to another format for other operations.
The Compressed Sparse Row/Column (CSR/CSC) formats are widely used in scientific computing are now supported by pydata/sparse. The CSR/CSC formats excel at compression and mathematical operations. While these formats are restricted to two dimensions, pydata/sparse supports the GCXS sparse array format, based on GCRS/GCCS from which generalizes CSR/CSC to n-dimensional arrays. Like their two-dimensional CSR/CSC counterparts, GCXS arrays compress well. Whereas the storage cost of COO depends heavily on the number of dimensions of the array, the number of dimensions only minimally affects the storage cost of GCXS arrays, which results in favorable compression ratios across many use cases.
Together these formats cover a wide array of applications of sparsity.
Additionally, with each format complying with the
numpy.ndarray
interface and following
the appropriate dispatching protocols, pydata/sparse arrays can interact
with other array libraries and seamlessly take part in
pydata-ecosystem-based workflows.
LICENSE
This library is licensed under BSD-3.