
Add RasterMetricSpace for improved pairwise sampling of large 2D grids #114

Merged Jul 23, 2021 · 18 commits

Conversation

@rhugonnet (Contributor) commented Jul 14, 2021

Discussion started in #111 with @mmaelicke and @redhog

Added RasterEquidistantMetricSpace

Performs subsampling adapted to Raster/Grid data to improve the empirical variogram estimation.

What the subclass does:

  • Selects a disk sample centered on a random point of the grid, with a radius based on the resolution of the grid and the number of samples specified in samples,
  • Selects successive equidistant samples outward from the central disk until max_dist is reached; in 2D this corresponds to several subsamplings within rings around the center point,
  • Calculates the pairwise distances between the two samples.

This is performed iteratively for several runs, each corresponding to a different center point on the grid.
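
Schematically, the selection for a single run looks like this (an illustrative numpy sketch with made-up radii and a random point cloud, not the actual class internals):

import numpy as np

# One run: a central disk sample and the first equidistant ring around a random center
rng = np.random.default_rng(42)
coords = rng.uniform(0, 1000, size=(10000, 2))   # stand-in for the grid coordinates
center = coords[rng.integers(len(coords))]       # random center point

r_center = 5.0                                   # radius of the central disk (made-up value)
r_in, r_out = r_center, r_center * np.sqrt(2)    # first ring, radius increased by sqrt(2)

dist = np.sqrt(((coords - center) ** 2).sum(axis=1))
disk_sample = coords[dist < r_center]                   # points in the central disk
ring_sample = coords[(dist >= r_in) & (dist < r_out)]   # points in the first ring
# pairwise distances are then computed between disk_sample and ring_sample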

Example

I used GSTools to generate random correlated fields, as I couldn't find an easy way to do it with skgstat!

import gstools as gs
import numpy as np
import skgstat as skg
import matplotlib.pyplot as plt

# Grid/Raster of 1000 x 1000 pixels
shape = (1000, 1000)
x = np.arange(0, 1000)
y = np.arange(0, 1000)

# Using GSTools, let's generate a correlated signal at two different length scales: 5 and 100 (spherical)
model = gs.Spherical(dim=2, var=0.5, len_scale=5)
srf = gs.SRF(model, seed=42)

model2 = gs.Spherical(dim=2, var=0.5, len_scale=100)
srf2 = gs.SRF(model2, seed=42)

# We combine the two random correlated fields (e.g., short-range could represent resolution, and long-range the noise)
field = srf.structured([x, y]) + srf2.structured([x, y])

# Let's see how the correlated field looks
plt.imshow(field, vmin=-3, vmax=3)
plt.colorbar()

[Figure 1: the combined correlated field]

# Get grid coordinates, and flatten everything because we don't care about the 2D structure here
xx, yy = np.meshgrid(x, y)
coords = np.dstack((xx.flatten(), yy.flatten())).squeeze()

# We generate a RasterEquidistantMetricSpace instance, providing the shape and coordinate extent of the grid.
# We run 100 samplings of pairwise distances between a random center sample of 25 points and several samples of
# equidistant points
rems = skg.RasterEquidistantMetricSpace(coords, shape=shape, extent=(x[0], x[-1], y[0], y[-1]), samples=25,
                                        runs=100)

# We compute the variogram with custom bins to look at all ranges more precisely
custom_bins = np.concatenate([np.arange(0, 20)] + [np.arange(20, 200, 20)] + [np.arange(200, 1200, 200)])

V1 = skg.Variogram(rems, field.flatten(), bin_func=custom_bins)

# How many pairwise distances did we compute:
print(np.sum(V1.bin_count))

Number of pairwise distances computed:
411183

# Wrapper plotting function to set a log x axis
def plot_vgm(V):
    fig = plt.figure()
    grid = plt.GridSpec(10, 10, wspace=0.5, hspace=0.5)
    ax1 = fig.add_subplot(grid[:3, :])
    ax2 = fig.add_subplot(grid[3:, :])
    ax1.set_xscale('log')
    ax2.set_xscale('log')
    V.plot(axes=[ax2, ax1], grid=grid)

# Plot the empirical variogram (ignore the single-range model fit)
plot_vgm(V1)

[Figure 2: empirical variogram from RasterEquidistantMetricSpace sampling]

# Let's compare with a completely random subsampling
pms = skg.ProbabalisticMetricSpace(coords, samples=1000)
V2 = skg.Variogram(pms, field.flatten(), bin_func=custom_bins)

# Check we computed a similar amount of pairwise distances
print(np.sum(V2.bin_count))

Number of pairwise distances computed:
489951

# Plot the empirical variogram (ignore the single-range model fit)
plot_vgm(V2)

[Figure 3: empirical variogram from fully random subsampling]

We can see that the sampling is largely improved at shorter distances, and is quite similar at larger distances. The method is much more efficient than a fully random sampling relative to the pixel size and extent of a Raster/Grid.
The results could be further improved by increasing the number of samples drawn, but I kept it small here to limit the computing time of the example.

Update

  • Added tests
  • Optimized computing times
  • Added parameters + descriptions for the spatial sampling
  • Expanded the class documentation

Resolves #111

@redhog (Collaborator) commented Jul 14, 2021

Cool! Reading up on this now.
Btw, the easiest way to generate a randomly correlated field in scikit-gstat is to randomly generate a few points and krige them using a variogram you design yourself (i.e. choose a sill and range).
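
A minimal sketch of that approach, assuming skgstat's OrdinaryKriging API and a coarse target grid to keep it fast (model choice, parameter values and sizes below are only placeholders):

import numpy as np
import skgstat as skg

# Krige a handful of spatially structured points onto a coarse grid to obtain a correlated field
rng = np.random.default_rng(1)
pts = rng.uniform(0, 100, size=(100, 2))
vals = np.sin(pts[:, 0] / 20) + np.cos(pts[:, 1] / 20) + 0.1 * rng.normal(size=100)

# One could also set the range/sill manually instead of relying on the automatic fit
V = skg.Variogram(pts, vals, model='spherical', n_lags=15)
ok = skg.OrdinaryKriging(V, min_points=3, max_points=15)

# Coarse 50 x 50 target grid (a 1000 x 1000 grid would be very slow)
xx, yy = np.meshgrid(np.linspace(0, 100, 50), np.linspace(0, 100, 50))
field = ok.transform(xx.flatten(), yy.flatten()).reshape(xx.shape)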

@redhog (Collaborator) commented Jul 14, 2021

There is an example in the tests: https://github.com/mmaelicke/scikit-gstat/blob/master/skgstat/tests/test_variogram.py#L24

The advantage of this approach for testing is that the range/sill you should get from creating a variogram from the generated correlated data is known (approximately).

@rhugonnet (Contributor, Author)

> There is an example in the tests: https://github.com/mmaelicke/scikit-gstat/blob/master/skgstat/tests/test_variogram.py#L24
>
> The advantage of this approach for testing is that the range/sill you should get from creating a variogram from the generated correlated data is known (approximately).

I tried to use this example with 100 random points, setting the Variogram parameters and then kriging with OK on the same (1000, 1000) grid as above, but the calculation kept running for several minutes so I gave up! 😅

@redhog (Collaborator) commented Jul 14, 2021

Ah... this is possibly a bug/limitation in OK: it needs to allocate distance pairs between all points to krige from and all points to krige to, so if the destination array is big, it can eat a lot of RAM and start swapping... Batching could help with this, I guess?

@redhog (Collaborator) commented Jul 14, 2021

As for your code, I have two main pieces of feedback:

  • You loop over runs in Python. Would it be possible to vectorize this in numpy (I haven't looked too deeply at this)?
  • You represent the grid coordinates as a normal coords array with all the coords, when it could be represented using just width and height (self.coords would have to be a property that uses meshgrid).

@rhugonnet (Contributor, Author)

Thanks for the feedback!
I looked a bit at the performance.

Findings

  1. The removal of duplicates was what took most of the computing time (95%+) for small to medium pairwise sample sizes (10,000 - 1,000,000), typically between 5 and 100 seconds. And it was still ~80% of the computing time for sample sizes above this.

The implementation was as follows, based on set(), the most performant method when order does not need to be preserved (https://stackoverflow.com/questions/480214/how-do-you-remove-duplicates-from-a-list-whilst-preserving-order), but it requires zipping a tuple of coordinates + data. It performs the same as dict.fromkeys(list).

# remove possible duplicates (that would be summed by default)
c, eq, d = zip(*set(zip(c, eq, d)))
dists = sparse.csr_matrix((d, (c, eq)), shape=(len(self.coords), len(self.coords)))

Using the recently updated solution in https://stackoverflow.com/questions/28677162/ignoring-duplicate-entries-in-sparse-matrix reduces that computing time by a factor of 5 (1 and 20 seconds compared to the previous 5 and 100 seconds), as it is based solely on the coordinates:

# Solution 5+ times faster than the preceding, but relies on _update() which might change in scipy (which
# only has an implemented method for summing duplicates, and not ignoring them yet)
dok = sparse.dok_matrix((len(self.coords), len(self.coords)))
dok._update(zip(zip(c, eq), d))
dists = dok.tocsr()
  2. Deriving the random indexes through cidx and eqidx (the for loop you mention, @redhog) to sample the pairwise distances was what took the most computing time for medium to large sample sizes (1,000,000 - 100,000,000+).

Because it is random sampling without replacement, it is not straightforward to vectorize the index operations in numpy, since we have no idea in advance how many points we need to subsample (coordinates can be on the edges of the grid, NaN data, ...).
Therefore, I vectorized all the calculations except for the random sampling, that is, all the steps that derive the distances from the center points.
In cidx, from:

dist_center = np.sqrt((self.coords[:, 0] - self._center[0]) ** 2 + (
        self.coords[:, 1] - self._center[1]) ** 2)
idx1 = dist_center < self._center_radius

To:

dist_center = np.sqrt((self.coords[None, :, 0] - self._centers[:, 0, None]) ** 2 + (
        self.coords[None, :, 1] - self._centers[:, 1, None]) ** 2)
idx1 = dist_center < self._center_radius

And the same for eqidx, with an additional dimension for the different radii:

# Get distances from all centers
dist_center = np.sqrt((self.coords[None, :, 0] - self._centers[:, 0, None]) ** 2 + (
        self.coords[None, :, 1] - self._centers[:, 1, None]) ** 2)

# Select samples in a ring of inside radius and outside radius for all runs
idx = np.logical_and(dist_center[None, :] >= np.array(list_in_radius)[:, None, None],
                     dist_center[None, :] < np.array(list_out_radius)[:, None, None])

The downside of this method is that it needs everything in memory to resolve all distances simultaneously. So for 1000 runs, a grid of 1000 x 1000 coordinates and a dozen radii for the sampling rings, one rapidly ends up with arrays of 10,000,000,000 elements, which results in memory errors. Without vectorizing these steps (i.e., keeping the for loop), one doesn't run into these errors as rapidly.
And it is only slightly faster than the for loop (10% faster with 1,000,000 samples, 50% with 100,000,000).
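
For a rough check of that order of magnitude (assuming one byte per entry for a numpy boolean mask):

runs, n_coords, n_rings = 1000, 1000 * 1000, 12
n_entries = runs * n_coords * n_rings          # 1.2e10 entries for the ring-selection mask alone
print(n_entries, "entries, ~", n_entries / 1e9, "GB as numpy bools")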

Summary

  • The new removal of duplicates saves a lot of time, but it is still what takes the most time. I'm not sure if there is a way to remove duplicates a priori using some other approach... there are so many possible pairwise combinations that I don't see a way to sample only independent samples!
  • I did not add the vectorized approach because it is limited by memory usage. An option would be to add a multiprocessing wrapper to allow running the subsamplings in parallel. Using several cores would increase memory usage, but this can be tuned a lot more easily by the user, and it would help speed up calculations, including the random samplings. What do you think?

Overall, I think the speed is satisfactory for the amount of pairwise distances that are sampled.

@redhog What would be the benefit of using width/height for the coords? Isn't it still width * height coordinate combinations?
I hesitated at first, but decided to provide coordinates directly to avoid dealing with NaNs in the grid data.

@rhugonnet (Contributor, Author)

I have now added optional input parameters to allow users to tweak the subsampling as they like, with default values that are described.
Also added some basic tests, and a full description of the functionality and parameters directly in the subclass doc 😄

@mmaelicke (Owner)

Hey Romain,
Thanks for this nice PR. I will happily review it, once you consider it finished. Just request my review and I will try to free some time...

@rhugonnet (Contributor, Author) commented Jul 19, 2021

Hi Mirko,

PR finished and ready to be reviewed!
(couldn't press the GitHub review button for some reason, maybe this is only available to a specific group)

I have moved the calculations that were in self.eqidx and self.cidx into functions outside the class. This feels cleaner: those variables didn't fit as @property since they were updated within the loop. Now the output given by those is the fixed, final list of pairs of ensembles. Same for the ctree and eqtree.

It was also the occasion to add a parallel option for the subsampling. In the end, parallel processing doesn't bring much improvement, because 90% of the computing time comes from the duplicate removal.
I'm thinking of leaving it in, though, as it does not change anything for the default processing on a single core (ncores=1). And it might still be useful at a later stage if a better solution is found for duplicates (e.g., using the duplicate pairwise samples several times instead of removing them? This is currently not possible because CSR is the format expected by the Variogram class, and CSR sparse matrices automatically sum duplicates, thus providing wrong pairwise distances).
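
To illustrate that CSR behavior (a tiny standalone example, not code from this PR):

import numpy as np
from scipy import sparse

# Duplicate (row, col) entries are summed when building a CSR matrix,
# so a duplicated pairwise distance of 2.5 would wrongly become 5.0
rows, cols, vals = np.array([0, 0]), np.array([1, 1]), np.array([2.5, 2.5])
m = sparse.csr_matrix((vals, (rows, cols)), shape=(2, 2))
print(m[0, 1])  # 5.0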

@mmaelicke self-requested a review July 20, 2021 06:26
@mmaelicke (Owner) left a comment

I really like your code, it's clean and very well documented. I have just a few comments; after answering (and maybe changing), we can merge this PR.

BTW, I always had the plan to implement some sampling strategies (mainly for virtual landscapes) into the package. Do you think it is easier to inherit/extend this class, making the rings just one of many options, or shall I go for one class per strategy? What do you think?

ratio_subsamp=0.2,
runs=None,
ncores=1,
exp_increase_fac=np.sqrt(2),
@mmaelicke (Owner):

why sqrt(2)

@rhugonnet (Contributor, Author):

Because the surface area of successive rings is preserved when increasing the radius by a factor of exactly sqrt(2). The equation for the area of the ring (disk minus disk), once the pis that cancel out are removed, looks like this: (sqrt(2)*R)**2 - R**2 = R**2

It's briefly described in the class description and in the __init__ function at the line where the factor is used.
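
A quick numeric check of that identity (with an arbitrary radius):

import numpy as np

R = 3.0  # arbitrary radius
disk_area = np.pi * R ** 2
ring_area = np.pi * (np.sqrt(2) * R) ** 2 - np.pi * R ** 2
print(np.isclose(disk_area, ring_area))  # True: the ring between R and sqrt(2)*R has the same area as the disk of radius R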

@rhugonnet (Contributor, Author)

> I really like your code, it's clean and very well documented. I have just a few comments; after answering (and maybe changing), we can merge this PR.
>
> BTW, I always had the plan to implement some sampling strategies (mainly for virtual landscapes) into the package. Do you think it is easier to inherit/extend this class, making the rings just one of many options, or shall I go for one class per strategy? What do you think?

Thanks for the nice feedback (and quick review)! 😊

For other sampling strategies: it really depends on what they are. I think this class with equal distance to a center sample could easily be extended to a higher number of dimensions (with spheres in 3D, etc.).
For other shapes, it would be doable to integrate those into the same class, but the problem would be managing the many possible input parameters the user could pass... My guess is that you'd reach a high number of those and need to pass them all as a dictionary or kwargs and select only the ones of interest. Not so difficult to code, but not so easy to use! So maybe creating different classes for different shapes would be preferable here, not sure.

@mmaelicke (Owner) left a comment

Thanks for your quick changes!

@mmaelicke merged commit 258b687 into mmaelicke:master Jul 23, 2021
@mmaelicke (Owner)

Just released as 0.6.7.

I quickly changed the from __future__ import annotations to from typing import Tuple, List, as the annotations feature needs Python >= 3.7. SciKit-GStat still supports 3.6. However, if you specifically need the annotations import, we can open an issue and discuss. 3.6 might be dropped soon anyway (and EOL is in December, I think).
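
For reference, a minimal sketch of the 3.6-compatible style (the function name and signature here are hypothetical, just for illustration):

# With "from __future__ import annotations" (Python >= 3.7), annotations such as
# "tuple[float, float, float, float]" would be allowed; for Python 3.6 compatibility,
# the equivalent uses explicit typing imports instead:
from typing import List, Tuple

def ring_radii(extent: Tuple[float, float, float, float], n_rings: int) -> List[float]:  # hypothetical signature
    ...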

Linked issue that this PR may close: Easier support for raster data: custom bins, user-defined maxlag, return count, etc...