Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure csr matrices are in "canonical format" before impact calculation #893

Merged
merged 15 commits into from
Jul 17, 2024
Merged
Show file tree
Hide file tree
Changes from 13 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -23,6 +23,7 @@ CLIMADA tutorials. [#872](https://github.com/CLIMADA-project/climada_python/pull
- Centroids complete overhaul. Most function should be backward compatible. Internal data is stored in a geodataframe attribute. Raster are now stored as points, and the meta attribute is removed. Several methds were deprecated or removed. [#787](https://github.com/CLIMADA-project/climada_python/pull/787)
- Improved error messages produced by `ImpactCalc.impact()` in case impact function in the exposures is not found in impf_set [#863](https://github.com/CLIMADA-project/climada_python/pull/863)
- Changed module structure: `climada.hazard.Hazard` has been split into the modules `base`, `io` and `plot` [#871](https://github.com/CLIMADA-project/climada_python/pull/871)
- Ensure `csr_matrix` stored in `climada.hazard.Hazard` have consistent data format and store no explicit zeros [#893](https://github.com/CLIMADA-project/climada_python/pull/893)

peanutfun marked this conversation as resolved.
Show resolved Hide resolved
### Fixed

Expand Down
4 changes: 4 additions & 0 deletions climada/engine/impact_calc.py
Original file line number Diff line number Diff line change
Expand Up @@ -47,6 +47,8 @@ def __init__(self,
The dimension of the imp_mat variable must be compatible with the
exposures and hazard objects.

This will call :py:meth:`climada.hazard.base.Hazard.check_matrices`.

Parameters
----------
exposures : climada.entity.Exposures
Expand All @@ -61,6 +63,8 @@ def __init__(self,
self.exposures = exposures
self.impfset = impfset
self.hazard = hazard
self.hazard.check_matrices()

# exposures index to use for matrix reconstruction
self._orig_exp_idx = np.arange(self.exposures.gdf.shape[0])

Expand Down
10 changes: 10 additions & 0 deletions climada/engine/test/test_impact_calc.py
Original file line number Diff line number Diff line change
Expand Up @@ -70,6 +70,16 @@ def test_init(self):
np.testing.assert_array_equal(HAZ.event_id, icalc.hazard.event_id)
np.testing.assert_array_equal(HAZ.event_name, icalc.hazard.event_name)

# Test check matrices
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# Test check matrices
# Test check matrices
# Check fraction and intensity have the same shape
# Check that explicit zeroes are pruned

I am a bit confused why the pruning is tested here...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can also check via a mock if check_matrices is called, but I recall strong sentiments against such tests 😄

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think my confusion comes from the fact that I am uneasy with having the impact class changing the hazard objects. See comment below.

hazard = deepcopy(HAZ)
hazard.intensity[0, hazard.intensity.indices[0]] = 0
hazard.fraction = sparse.csr_matrix(np.ones((1, 1)))
with self.assertRaisesRegex(
ValueError, "Intensity and fraction matrices must have the same shape"
):
ImpactCalc(ENT.exposures, ENT.impact_funcs, hazard)
self.assertEqual(hazard.intensity.nnz, HAZ.intensity.nnz - 1) # was pruned

def test_metrics(self):
"""Test methods to get impact metrics"""
mat = sparse.csr_matrix(np.array(
Expand Down
42 changes: 42 additions & 0 deletions climada/hazard/base.py
Original file line number Diff line number Diff line change
Expand Up @@ -49,6 +49,21 @@ class Hazard(HazardIO, HazardPlot):
Contains events of some hazard type defined at centroids. Loads from
files with format defined in FILE_EXT.

Attention
---------
This class uses instances of
`scipy.sparse.csr_matrix
<https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html>`_
to store :py:attr:`intensity` and :py:attr:`fraction`. This data types comes with
its particular pitfalls. Depending on how the objects are instantiated and modified,
a matrix might end up in a "non-canonical" state. In this state, its ``.data``
attribute does not necessarily represent the values apparent in the final matrix.
In particular, a "non-canonical" matrix may store "duplicates", i.e. multiple values
that map to the same matrix position. This is supported, and the default behavior is
to sum up these values. To avoid any inconsistencies, call :py:meth:`check_matrices`
before accessing the ``data`` attribute of either matrix. This will explicitly sum
all values at the same matrix position and eliminate explicit zeros.

Attributes
----------
haz_type : str
Expand Down Expand Up @@ -191,6 +206,33 @@ def __init__(self,
if self.pool:
LOGGER.info('Using %s CPUs.', self.pool.ncpus)

def check_matrices(self):
"""Ensure that matrices are consistently shaped and stored

It is good practice to call this method before accessing the ``data`` attribute
of either :py:attr:`intensity` or :py:attr:`fraction`.

See Also
--------
:py:func:`climada.util.checker.prune_csr_matrix`

Todo
-----
* Check consistency with centroids
peanutfun marked this conversation as resolved.
Show resolved Hide resolved

Raises
------
ValueError
If matrices are ill-formed or ill-shaped in relation to each other
"""
u_check.prune_csr_matrix(self.intensity)
u_check.prune_csr_matrix(self.fraction)
if self.fraction.nnz > 0:
if self.intensity.shape != self.fraction.shape:
raise ValueError(
"Intensity and fraction matrices must have the same shape"
)

@classmethod
def get_default(cls, attribute):
"""Get the Hazard type default for a given attribute.
Expand Down
2 changes: 1 addition & 1 deletion climada/hazard/io.py
Original file line number Diff line number Diff line change
Expand Up @@ -1008,7 +1008,7 @@ def write_hdf5(self, file_name, todense=False):
if var_name == 'centroids':
# Centroids have their own write_hdf5 method,
# which is invoked at the end of this method (s.b.)
pass
continue
elif isinstance(var_val, sparse.csr_matrix):
if todense:
hf_data.create_dataset(var_name, data=var_val.toarray())
Expand Down
38 changes: 32 additions & 6 deletions climada/hazard/test/test_base.py
Original file line number Diff line number Diff line change
Expand Up @@ -124,18 +124,18 @@ def test_check_wrongFreq_fail(self):
def test_check_wrongInten_fail(self):
"""Wrong hazard definition"""
self.hazard.intensity = sparse.csr_matrix([[1, 2], [1, 2]])

with self.assertRaises(ValueError) as cm:
with self.assertRaisesRegex(
ValueError, "Invalid Hazard.intensity row size: 3 != 2."
):
self.hazard.check()
self.assertIn('Invalid Hazard.intensity row size: 3 != 2.', str(cm.exception))

def test_check_wrongFrac_fail(self):
"""Wrong hazard definition"""
self.hazard.fraction = sparse.csr_matrix([[1], [1], [1]])

with self.assertRaises(ValueError) as cm:
with self.assertRaisesRegex(
ValueError, "Invalid Hazard.fraction column size: 2 != 1."
):
self.hazard.check()
self.assertIn('Invalid Hazard.fraction column size: 2 != 1.', str(cm.exception))

def test_check_wrongEvName_fail(self):
"""Wrong hazard definition"""
Expand Down Expand Up @@ -212,6 +212,32 @@ def test_get_date_strings_pass(self):
self.assertEqual(haz.get_event_date()[560],
u_dt.date_to_str(haz.date[560]))

def test_check_matrices(self):
"""Test the check_matrices method"""
hazard = Hazard("TC")
hazard.fraction = sparse.csr_matrix(np.zeros((2, 2)))
hazard.check_matrices() # No error, fraction.nnz = 0
hazard.fraction = sparse.csr_matrix(np.ones((2, 2)))
with self.assertRaisesRegex(
ValueError, "Intensity and fraction matrices must have the same shape"
):
hazard.check_matrices()
hazard.intensity = sparse.csr_matrix(np.ones((2, 3)))
with self.assertRaisesRegex(
ValueError, "Intensity and fraction matrices must have the same shape"
):
hazard.check_matrices()

# Check that matrices are pruned
hazard.intensity[:] = 0
hazard.fraction = sparse.csr_matrix(([0], [0], [0, 1, 1]), shape=(2, 3))
hazard.check_matrices()
for attr in ("intensity", "fraction"):
with self.subTest(matrix=attr):
matrix = getattr(hazard, attr)
self.assertEqual(matrix.nnz, 0)
self.assertTrue(matrix.has_canonical_format)

class TestRemoveDupl(unittest.TestCase):
"""Test remove_duplicates method."""

Expand Down
36 changes: 35 additions & 1 deletion climada/util/checker.py
Original file line number Diff line number Diff line change
Expand Up @@ -23,7 +23,8 @@
'size',
'shape',
'array_optional',
'array_default'
'array_default',
'prune_csr_matrix',
]

import logging
Expand Down Expand Up @@ -162,3 +163,36 @@ def array_default(exp_len, var, var_name, def_val):
else:
size(exp_len, var, var_name)
return res

def prune_csr_matrix(matrix: sparse.csr_matrix):
"""Ensure that the matrix is in the "canonical format".

Depending on how the matrix was instantiated or modified, it might be in a
"non-canonical" state. This only relates to its internal storage. In this state,
multiple values might be stored for a single "apparent" value in the matrix.
Also, the matrix might store zeros explicitly, which could be removed.
Calling this function makes sure that the matrix is in the "canonical state", and
brings it into this state, if possible.

See Also
--------
`csr_matrix.has_canonical_format
<https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.has_canonical_format.html#scipy.sparse.csr_matrix.has_canonical_format>`_

Parameters
----------
matrix : csr_matrix
The matrix to check. It will be modified *inplace*. Its ``.data`` attribute
might change, but apparent matrix values will stay the same.

Raises
------
ValueError
If
`csr_matrix.check_format
<https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.check_format.html#scipy.sparse.csr_matrix.check_format>`_
fails
"""
matrix.check_format()
matrix.eliminate_zeros()
matrix.sum_duplicates()
22 changes: 22 additions & 0 deletions climada/util/test/test_checker.py
Original file line number Diff line number Diff line change
Expand Up @@ -87,6 +87,28 @@ def test_check_optionals_fail(self):
u_check.check_optionals(dummy.__dict__, dummy.vars_opt, "DummyClass.", dummy.id.size)
self.assertIn('Invalid DummyClass.list size: 25 != 3.', str(cm.exception))

def test_prune_csr_matrix(self):
"""Check that csr matrices are brought into canonical format"""
# Non-canonical: First three data points will be summed onto the first matrix
# entry, fourth will be an explicit zero entry
data = [0, 1, 2, 0]
indices = [0, 0, 0, 1]
indptr = [0, 4, 4, 4]
matrix = sparse.csr_matrix((data, indices, indptr), shape=(3, 2))

# These checks just make sure that we understand how csr_matrix works
np.testing.assert_array_equal(matrix.data, data)
np.testing.assert_array_equal(matrix[0, [0, 1]].toarray(), [[3, 0]])
self.assertEqual(matrix.nnz, 4)
self.assertFalse(matrix.has_canonical_format)

# Now test our function
u_check.prune_csr_matrix(matrix)
self.assertTrue(matrix.has_canonical_format)
self.assertEqual(matrix[0, 0], 3)
np.testing.assert_array_equal(matrix.data, [3])
self.assertEqual(matrix.nnz, 1)

# Execute Tests
if __name__ == "__main__":
TESTS = unittest.TestLoader().loadTestsFromTestCase(TestChecks)
Expand Down
Loading