sparse¶
Introduction¶
In many scientific applications, one encounters arrays that are mostly empty or filled with zeros. These arrays are aptly named sparse arrays. How they are stored, however, is a matter of choice. One may store the full array, i.e., with all the zeros included, but this incurs a significant cost in memory and performance when working with these arrays.
An alternative is to store them in a standalone data structure that keeps track of only the nonzero entries. This often improves memory consumption and performance, but most operations on sparse arrays then have to be re-written. sparse tries to provide one such data structure. It isn’t the only library that does this: notably, scipy.sparse and Pysparse do so as well.
Motivation¶
So why use sparse? Well, the other libraries mentioned are mostly limited to two-dimensional arrays. In addition, inter-compatibility with numpy is hit-or-miss. sparse strives for inter-compatibility with numpy.ndarray and provides mostly the same API. It defers to scipy.sparse when it is convenient to do so, and provides custom implementations of operations where this isn’t possible. It also supports general N-dimensional arrays.
Where to from here?¶
If you’re new to this library, you can visit the user manual page. If you’re already familiar with this library, or you want to dive straight in, you can jump to the API reference. You can also see the contents in the sidebar.
User Manual¶
The main class in this package is the COO array type. Learning a few things about this class can be very useful when using this library. This section attempts to document some common things about this object.
Installing¶
sparse can be obtained from pip via
pip install sparse
You can also get sparse from its current source on GitHub, to get all the latest and greatest features. sparse is under active development, and many new features are being added; note, however, that the API is currently unstable.
git clone https://github.com/mrocklin/sparse.git
cd ./sparse/
pip install .
Getting Started¶
COO arrays can be constructed from numpy.ndarray objects and scipy.sparse.spmatrix objects. For example, to generate the identity matrix,
import numpy as np
import scipy.sparse
import sparse
sps_identity = scipy.sparse.eye(5)
identity = sparse.COO.from_scipy_sparse(sps_identity)
COO arrays can have operations performed on them just like numpy.ndarray objects. For example, to add two COO arrays:
z = x + y
You can also apply any numpy.ufunc to COO arrays.
sin_x = np.sin(x)
However, operations that would convert the sparse array into a dense one aren’t currently supported. For example, the following raises a ValueError.
y = x + 5
However, if you’re sure you want to convert a sparse array to a dense one, you can do this (which will result in a numpy.ndarray):
y = x.todense() + 5
That’s it! You can move on to the user manual to see what part of this library interests you, or you can jump straight in with the API reference.
Constructing COO arrays¶
From coordinates and data¶
This is the preferred way of constructing COO arrays. The COO constructor (see COO.__init__) can create these objects from two main variables: coords and data.
coords contains the indices where the data is nonzero, and data contains the values corresponding to those indices. For example, the following code will generate a \(5 \times 5\) identity matrix:
from sparse import COO

coords = [[0, 1, 2, 3, 4], [0, 1, 2, 3, 4]]
data = [1, 1, 1, 1, 1]
s = COO(coords, data)
In general, coords should be a (ndim, nnz) shaped array: each row of coords corresponds to one dimension of the desired sparse array, and each column holds the indices of one nonzero element. data contains the nonzero values corresponding to the indices in coords, and its shape should be (nnz,). You can, and should, pass in numpy.ndarray objects for coords and data.
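To make the coords/data layout concrete, here is a plain-numpy sketch (the variable names are ours, not part of the sparse API) that scatters the identity-matrix example above into a dense array. This is the same mapping a COO array stores without materializing the zeros:

```python
import numpy as np

coords = np.array([[0, 1, 2, 3, 4],   # indices along dimension 0
                   [0, 1, 2, 3, 4]])  # indices along dimension 1
data = np.array([1, 1, 1, 1, 1])

# Scatter the nonzero values: column i of coords gives the position of data[i].
dense = np.zeros((5, 5), dtype=data.dtype)
dense[tuple(coords)] = data

print(np.array_equal(dense, np.eye(5)))  # True
```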
In this case, the shape of the resulting array was determined from the maximum index in each dimension. If the desired shape extends beyond the maximum index in coords, you should supply the shape explicitly. For example, the following without the shape keyword argument would result in a \(4 \times 5\) matrix, but maybe we wanted one that was actually \(5 \times 5\).
coords = [[0, 3, 2, 1], [4, 1, 2, 0]]
data = [1, 4, 2, 1]
s = COO(coords, data, shape=(5, 5))
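The inferred shape is simply one more than the maximum index along each dimension. A short numpy sketch (our own illustration, not library code) makes this concrete for the example above:

```python
import numpy as np

coords = np.array([[0, 3, 2, 1],
                   [4, 1, 2, 0]])

# Shape inference as described above: maximum index plus one, per dimension.
inferred_shape = tuple(coords.max(axis=1) + 1)
print(inferred_shape)  # (4, 5), hence the explicit shape=(5, 5) above
```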
From scipy.sparse.spmatrix objects¶
To construct a COO array from a scipy.sparse.spmatrix object, you can use the COO.from_scipy_sparse method. As an example, if x is a scipy.sparse.spmatrix, you can do the following to get an equivalent COO array:
s = COO.from_scipy_sparse(x)
From numpy.ndarray objects¶
To construct COO arrays from numpy.ndarray objects, you can use the COO.from_numpy method. As an example, if x is a numpy.ndarray, you can do the following to get an equivalent COO array:
s = COO.from_numpy(x)
Generating random COO objects¶
The sparse.random method can be used to create random COO arrays. For example, the following will generate a \(10 \times 10\) matrix with \(10\) nonzero entries, each in the interval \([0, 1)\).
s = sparse.random((10, 10), density=0.1)
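One way to picture what sparse.random produces is the following rough numpy sketch (our own helper, not the library’s implementation): sample distinct flat positions for the nonzeros, then unravel them into coordinates.

```python
import numpy as np

def random_coords_data(shape, density, seed=None):
    """Sketch: pick nnz distinct positions, then draw values in [0, 1)."""
    rng = np.random.default_rng(seed)
    size = int(np.prod(shape))
    nnz = int(density * size)
    flat = rng.choice(size, size=nnz, replace=False)  # distinct flat positions
    coords = np.stack(np.unravel_index(flat, shape))  # shape (ndim, nnz)
    data = rng.random(nnz)                            # values in [0, 1)
    return coords, data

coords, data = random_coords_data((10, 10), density=0.1, seed=0)
print(coords.shape, data.shape)  # (2, 10) (10,)
```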
Building COO Arrays from DOK Arrays¶
It’s possible to build COO arrays from DOK arrays, which is useful when it is not easy to construct coords and data directly. DOK arrays provide a simple builder interface for constructing COO arrays, but at this time they can do little else.
You can get started by defining the shape (and optionally, the datatype) of the DOK array. If you do not specify a dtype, it is inferred from the value dictionary, or set to dtype('float64') if that is not present.
import numpy as np
from sparse import DOK

s = DOK((6, 5, 2))
s2 = DOK((2, 3, 4), dtype=np.float64)
After this, you can build the array by assigning arrays or scalars to elements or slices of the original array. Broadcasting rules are followed.
s[1:3, 3:1:-1] = [[6, 5]]
At the end, you can convert the DOK array to a COO array, and perform arithmetic or other operations on it.
s2 = COO(s)
In addition, it is possible to access single elements of the DOK array using normal Numpy indexing.
s[1, 2, 1] # 5
s[5, 1, 1] # 0
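The idea behind a DOK array can be sketched with a plain dictionary that maps index tuples to values (our own illustration, not the library’s implementation); converting to COO then amounts to splitting the dictionary into coords and data:

```python
import numpy as np

# Hypothetical minimal dict-of-keys builder for a 2-d array.
dok = {}
dok[(1, 3)] = 6.0  # assign single elements, one at a time
dok[(2, 2)] = 5.0

keys = sorted(dok)                       # index tuples of the nonzero entries
coords = np.array(keys).T                # shape (ndim, nnz)
data = np.array([dok[k] for k in keys])  # shape (nnz,)

print(coords.tolist())  # [[1, 2], [3, 2]]
print(data.tolist())    # [6.0, 5.0]
```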
Operations on COO arrays¶
You can perform a number of operations on COO arrays. These include basic operations with operators, element-wise operations, reductions and other common operations.
Basic Operations¶
COO objects can have a number of operators applied to them. They support operations with scalars, scipy.sparse.spmatrix objects, and other COO objects. For example, to get the sum of two COO objects, you would do the following:
z = x + y
Note that in-place operators are currently not supported. For example, x += y will not work.
Auto-Densification¶
Operations that would result in dense matrices, such as binary operations with numpy.ndarray objects or certain operations with scalars, are not allowed and will raise a ValueError. For example, all of the following will raise a ValueError. Here, x and y are COO objects.
x == y
x + 5
x == 0
x != 5
x / y
However, all of the following are valid operations.
x + 0
x != y
x + y
x == 5
5 * x
x / 7.3
x != 0
If densification is needed, it must be explicit. In other words, you must call COO.todense on the COO object. If both operands are COO, both must be densified.
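The reason x + 5 densifies while 5 * x does not is that adding a nonzero scalar turns every stored zero into a nonzero value. A dense-numpy illustration (plain numpy, to stay independent of the sparse API) shows the effect on the number of nonzeros:

```python
import numpy as np

x = np.eye(5)  # dense stand-in for a sparse 5x5 identity matrix

print(np.count_nonzero(x))      # 5: only the diagonal is nonzero
print(np.count_nonzero(5 * x))  # 5: multiplication keeps zeros at zero
print(np.count_nonzero(x + 5))  # 25: every entry becomes nonzero
```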
Broadcasting¶
All binary operators support broadcasting. This means that (under certain conditions) you can perform binary operations on arrays with unequal shapes: namely, when a shape is missing a dimension, or when a dimension is 1. For example, performing a binary operation on two COO arrays with shapes (4,) and (5, 1) yields an object of shape (5, 4). The same happens with arrays of shape (1, 4) and (5, 1). However, (4, 1) and (5, 1) will raise a ValueError.
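These are the usual numpy broadcasting rules, which you can check with plain numpy arrays:

```python
import numpy as np

# Compatible shapes broadcast to a common result shape.
print(np.broadcast(np.empty((4,)), np.empty((5, 1))).shape)    # (5, 4)
print(np.broadcast(np.empty((1, 4)), np.empty((5, 1))).shape)  # (5, 4)

# Incompatible shapes raise, just as described above.
try:
    np.broadcast(np.empty((4, 1)), np.empty((5, 1)))
except ValueError:
    print("shapes (4, 1) and (5, 1) do not broadcast")
```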
Full List of Operators¶
Here, x and y can be COO arrays, scipy.sparse.spmatrix objects or scalars, keeping in mind the auto-densification rules. The following operators are supported:
Basic algebraic operations
- operator.add (x + y)
- operator.neg (-x)
- operator.sub (x - y)
- operator.mul (x * y)
- operator.truediv (x / y)
- operator.floordiv (x // y)
- operator.pow (x ** y)
Comparison operators
- operator.eq (x == y)
- operator.ne (x != y)
- operator.gt (x > y)
- operator.ge (x >= y)
- operator.lt (x < y)
- operator.le (x <= y)
Bitwise operators
- operator.and_ (x & y)
- operator.or_ (x | y)
- operator.xor (x ^ y)
Bit-shifting operators
- operator.lshift (x << y)
- operator.rshift (x >> y)
Element-wise Operations¶
COO arrays support a variety of element-wise operations. However, as with operators, operations that map zero to a nonzero value are not supported. To illustrate, the following are all possible, and will produce another COO array:
x.abs()
np.sin(x)
np.sqrt(x)
x.conj()
x.expm1()
np.log1p(x)
However, the following are all unsupported and will raise a ValueError:
x.exp()
np.cos(x)
np.log(x)
Note that you can apply any unary or binary numpy.ufunc to COO arrays, scipy.sparse.spmatrix objects and scalars, and it will work as long as the result is not dense.
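Whether a ufunc is allowed therefore comes down to whether it maps zero to zero. A hypothetical helper (our own, not part of the sparse API) makes the check explicit:

```python
import numpy as np

def preserves_zero(ufunc):
    """Return True if the ufunc maps a zero input to zero."""
    return ufunc(0.0) == 0.0

print(preserves_zero(np.sin))   # True: safe on sparse arrays
print(preserves_zero(np.sqrt))  # True: safe on sparse arrays
print(preserves_zero(np.exp))   # False: the result would be dense
print(preserves_zero(np.cos))   # False: the result would be dense
```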
COO.elemwise¶
This method allows you to apply an arbitrary unary or binary function where the first operand is COO, and the second is a scalar, COO, or a scipy.sparse.spmatrix. For example, the following will add two COO objects:
x.elemwise(np.add, y)
Partial List of Supported numpy.ufuncs¶
Although any unary or binary numpy.ufunc should work if the result is not dense, the following operations are supported when calling in the form x.func():
Reductions¶
COO objects support a number of reductions. However, not all important reductions are currently implemented (help welcome!). All of the following currently work:
x.sum(axis=1)
np.max(x)
np.min(x, axis=(0, 2))
x.prod()
Note
If you are performing multiple reductions along the same axes, it may be beneficial to call COO.enable_caching.
COO.reduce¶
This method can take an arbitrary numpy.ufunc and perform a reduction using it. For example, the following will perform a sum:
x.reduce(np.add, axis=1)
Note
sparse currently performs reductions by grouping together all coordinates along the supplied axes and reducing those groups. Then, if the number of entries in a group falls short, it reduces an extra time with zero. As a result, this method can be inaccurate for reductions whose result changes when additional zeros are included; however, it works in most cases.
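The grouping strategy can be sketched in plain numpy for a sum along axis 1 of a COO-style (coords, data) pair (our own illustration, not the library’s code):

```python
import numpy as np

# A 3x4 array with three nonzeros: (0, 1) -> 2, (0, 3) -> 3, (2, 0) -> 7.
coords = np.array([[0, 0, 2],
                   [1, 3, 0]])
data = np.array([2.0, 3.0, 7.0])

# Summing along axis 1: group entries by their remaining (axis 0) coordinate
# and reduce each group with np.add; empty groups stay at the zero fill value.
result = np.zeros(3)
np.add.at(result, coords[0], data)

print(result)  # [5. 0. 7.]
```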
Indexing¶
COO arrays can be indexed just like regular numpy.ndarray objects. They support integer, slice and boolean indexing. However, numpy advanced indexing is currently not properly supported. This means that all of the following work like in Numpy, except that they will produce COO arrays rather than numpy.ndarray objects, and will produce scalars where expected. Assume that z.shape is (5, 6, 7).
z[0]
z[1, 3]
z[1, 4, 3]
z[:3, :2, 3]
z[::-1, 1, 3]
z[-1]
z[[True, False, True, False, True], 3, 4]
All of the following will raise an IndexError, like in Numpy 1.13 and later.
z[6]
z[3, 6]
z[1, 4, 8]
z[-6]
z[[True, True, False, True], 3, 4]
Note
Numpy advanced indexing is currently not supported.
Other Operations¶
COO arrays support a number of other common operations. Among them are dot, tensordot, concatenate and stack, COO.transpose and COO.reshape. You can view the full list on the API reference page for sparse.
Converting COO objects to other Formats¶
COO arrays can be converted to numpy.ndarray objects, or to some scipy.sparse.spmatrix subclasses, via the following methods:
- COO.todense: Converts to a numpy.ndarray unconditionally.
- COO.maybe_densify: Converts to a numpy.ndarray based on certain constraints.
- COO.to_scipy_sparse: Converts to a scipy.sparse.coo_matrix if the array is two dimensional.
- COO.tocsr: Converts to a scipy.sparse.csr_matrix if the array is two dimensional.
- COO.tocsc: Converts to a scipy.sparse.csc_matrix if the array is two dimensional.
Contributing to sparse¶
General Guidelines¶
sparse is a community-driven project. You can find our repository on GitHub. Feel free to open issues for new features or bugs, or open a pull request to fix a bug or add a new feature.
If you haven’t contributed to open-source before, we recommend you read this excellent guide by GitHub on how to contribute to open source. The guide is long, so you can gloss over things you’re familiar with.
If you’re not already familiar with it, we follow the fork and pull model on GitHub.
Running/Adding Unit Tests¶
It is best if all new functionality and/or bug fixes have unit tests added with each use-case.
Since we support both Python 2.7 and Python 3.5 and newer, it is recommended to test with at least these two versions before committing your code or opening a pull request. We use pytest as our unit testing framework, with the pytest-cov extension to check code coverage and pytest-flake8 to check code style. You don’t need to configure these extensions yourself. Once you’ve configured your environment, you can just cd to the root of your repository and run
py.test
Adding/Building the Documentation¶
If a feature is stable and relatively finalized, it is time to add it to the documentation. If you are adding any private/public functions, it is best to add docstrings, to aid in reviewing code and also for the API reference.
We use Numpy style docstrings and Sphinx to document this library. Sphinx, in turn, uses reStructuredText as its markup language for adding code.
We use the Sphinx Autosummary extension to generate API references. In particular, you may want to look at the docs/generated directory to see how these files look and where to add new functions, classes or modules. For example, if you add a new function to the sparse.COO class, you would open up docs/generated/sparse.COO.rst and add in the name of the function where appropriate.
To build the documentation, you can cd into the docs directory and run
sphinx-build -b html . _build/html
After this, you can find an HTML version of the documentation in docs/_build/html/index.html.
Changelog¶
0.2.0 / 2018-01-25¶
- Add Elementwise broadcasting and broadcast_to (GH#35) Hameer Abbasi
- Add Bitwise ops (GH#38) Hameer Abbasi
- Add slicing support for Ellipsis and None (GH#37) Matthew Rocklin
- Add triu and tril and tests (GH#40) Hameer Abbasi
- Extend gitignore file (GH#42) Nils Werner
- Update MANIFEST.in (GH#45) Matthew Rocklin
- Remove auto densification and unify operator code (GH#46) Hameer Abbasi
- Fix nnz for scalars (GH#48) Hameer Abbasi
- Update README (GH#50) (GH#53) Hameer Abbasi
- Fix large concatenations and stacks (GH#50) Hameer Abbasi
- Add __array_ufunc__ for __call__ and reduce (GH#r9) Hameer Abbasi
- Update documentation (GH#54) Hameer Abbasi
- Flake8 and coverage in pytest (GH#59) Nils Werner
- Copy constructor (GH#55) Nils Werner
- Add random function (GH#41) Nils Werner
- Add lots of indexing features (GH#57) Hameer Abbasi
- Validate .transpose axes (GH#61) Nils Werner
- Simplify axes normalization logic Nils Werner
- Use higher density for sparse.random in tests (GH#64) Keisuke Fujii
- Support left-side np.number elemwise operations (GH#67) Keisuke Fujii
- Support len on COO (GH#68) Nils Werner
- Update scipy version in requirements (GH#70) Hameer Abbasi
- Documentation (GH#43) Nils Werner and Hameer Abbasi
- Use Tox for cross Python-version testing (GH#77) Nils Werner
- Support mixed sparse-dense when result is sparse (GH#75) Hameer Abbasi
- Update contributing.rst (GH#76) Hameer Abbasi
- Size and density properties (GH#69) Nils Werner
- Fix large sum (GH#83) Hameer Abbasi
- Add DOK (GH#85) Hameer Abbasi
- Implement __array__ protocol (GH#87) Matthew Rocklin