RAM optimizations, additional parameters (discussion) #33

Open · wants to merge 6 commits into master
Conversation

@PicoJr (Contributor) commented Dec 1, 2021

Motivation

The purpose of this PR is to discuss a few possible optimizations (RAM usage, coverage) and additional parameters.

This PR is probably not mature enough to be merged.
It probably contains bugs and breaks parts of the existing API.

@foobarbecue feel free to cherry-pick what you deem useful ;)

RAM optimizations

Parsing NAC index

  • This PR adds chunked processing of the NAC index

It keeps memory usage low while parsing the NAC index.

It works thanks to the chunksize parameter provided by pandas.read_csv:

        # read nac_index as chunks instead of reading everything at once in memory
        nac_index = pandas.read_csv(indfilepath, header=None, names=col_list, chunksize=chunksize)

Then the join operation is done over each chunk of lines:

        filtered = []
        chunks_metadata = load_nac_metadata.load_nac_index(
            indfilepath=indfilepath, lblfilepath=lblfilepath
        )
        for (
            chunk_metadata
        ) in chunks_metadata:  # iterate over chunks instead of loading all CSV into RAM
            chunk_footprints = footprints.join(
                chunk_metadata, how="inner", lsuffix="_ode", rsuffix=""
            )

            # redacted...

            filtered.append(chunk_footprints)
        return pandas.concat(filtered).dropna()
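
For context, load_nac_metadata.load_nac_index can be written as a generator that yields one parsed chunk at a time. The following is only a sketch: parse_columns_from_lbl is a hypothetical stand-in for reading column names from the .LBL file, and the product id column name is an assumption, not the PR's exact code.

    from typing import Iterator, List

    import pandas


    def load_nac_index(
        indfilepath: str, lblfilepath: str, chunksize: int = 100_000
    ) -> Iterator[pandas.DataFrame]:
        # Column names come from the .LBL label file; parse_columns_from_lbl
        # is a hypothetical helper standing in for that parsing step.
        col_list: List[str] = parse_columns_from_lbl(lblfilepath)
        # read nac_index as chunks instead of reading everything at once in memory
        for chunk in pandas.read_csv(
            indfilepath, header=None, names=col_list, chunksize=chunksize
        ):
            # Index each chunk by product id so it can be joined against footprints
            # (the exact column name is an assumption).
            yield chunk.set_index("PRODUCT_ID")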

Computing pairs (overlay)

geopandas.overlay (and to some extent geopandas itself) has performance issues.

When we find many image candidates for pairs, geopandas.overlay explodes (RAM and computation time).

It wastes a lot of time computing a huge number of pairs that are mostly discarded by later filters (sun geometry and area).

This PR adds a generator that yields chunks of pairs and filters them as they are generated.

That way RAM usage is kept low and it is possible to abort early once enough pairs have been found.

import itertools
from typing import Iterator

import geopandas
import tqdm
from geopandas import GeoDataFrame


def pairs_iter_from_image_search(imagesearch: "ImageSearch") -> Iterator[GeoDataFrame]:
    gdf: GeoDataFrame = imagesearch.results.dropna()
    # Store index (product id) in column so that it's preserved in spatial join operation
    gdf["prod_id"] = gdf.index

    chunk_row_size = 100
    chunks = [
        gdf[i : i + chunk_row_size] for i in range(0, gdf.shape[0], chunk_row_size)
    ]
    for (chunk_1, chunk_2) in tqdm.tqdm(
        itertools.product(chunks, chunks), total=len(chunks) ** 2
    ):
        pairs = geopandas.overlay(chunk_1, chunk_2, how="union", keep_geom_type=True)
        # redacted ...
        yield pairs

A generator that yields chunks of pairs.

And then the generator is used here:

    filtered_pairs = []
    filtered_pairs_count = 0
    for pairs_chunk in pairs_iter_from_image_search(imgs):
        filtered_chunk_pairs = filter_small_overlaps(
            filter_sun_geometry(
                pairs_chunk, incidence_range=(incidence_range_low, incidence_range_high)
            )
        )
        filtered_pairs.append(filtered_chunk_pairs)
        filtered_pairs_count += len(filtered_chunk_pairs)
        if filtered_pairs_count > max_pairs:
            if verbose:
                print(f"found {filtered_pairs_count} pairs > --max-pairs={max_pairs}")
            break

    pairs = pandas.concat(filtered_pairs)

Improved coverage

When looking for a minimal set of pairs that covers an area, the way the existing code chooses a new search point seems flawed.

Instead of:

# always pick the same point even if it could not find a pair to cover it...
search_point = remaining_uncovered_poly.representative_point()

This PR uses:

# pick a random point in the remaining uncovered poly
search_point = random_points_in_polygon(remaining_uncovered_poly, 1)[0]

This change helps with coverage close to the equator, where finding pairs is harder.
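
random_points_in_polygon is a helper added by this PR; its implementation is not shown above, but a minimal rejection-sampling sketch over a Shapely polygon could look like the following (the signature is an assumption based on the call site):

    import random
    from typing import List

    from shapely.geometry import Point, Polygon


    def random_points_in_polygon(polygon: Polygon, count: int) -> List[Point]:
        # Rejection sampling: draw candidates uniformly from the bounding box
        # and keep only those that actually fall inside the polygon.
        min_x, min_y, max_x, max_y = polygon.bounds
        points: List[Point] = []
        while len(points) < count:
            candidate = Point(random.uniform(min_x, max_x), random.uniform(min_y, max_y))
            if polygon.contains(candidate):
                points.append(candidate)
        return points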

New parameters

  • indfilepath and lblfilepath: paths for INDEX.TAB and INDEX.LBL
  • max_pairs: stop looking for pairs as soon as at least max_pairs have been found
  • miss_limit: how many times we may fail to cover a point when providing --find-covering=True
  • incidence_range_low, incidence_range_high: filter out pairs for which the image sun incidence is outside this range. This helps find better pairs close to the north/south poles.
  • json_output: path for dumping the JSON containing the pairs.
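
For illustration, these parameters could be exposed on the command line roughly as follows (the argparse wiring and default values are assumptions, not the PR's actual code):

    import argparse

    parser = argparse.ArgumentParser(description="find stereo pairs")
    # Paths to the NAC cumulative index table and its label file.
    parser.add_argument("--indfilepath", help="path to INDEX.TAB")
    parser.add_argument("--lblfilepath", help="path to INDEX.LBL")
    # Stop looking for pairs as soon as at least this many have been found.
    parser.add_argument("--max-pairs", type=int, default=1000)
    # How many times we may fail to cover a point when --find-covering=True.
    parser.add_argument("--miss-limit", type=int, default=10)
    # Keep only pairs whose image sun incidence lies inside this range.
    parser.add_argument("--incidence-range-low", type=float, default=0.0)
    parser.add_argument("--incidence-range-high", type=float, default=90.0)
    # Where to dump the JSON containing the pairs.
    parser.add_argument("--json-output", help="path for the output JSON")
    args = parser.parse_args()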

Misc

This PR also modifies download_NAC.py so that it can read pairs from the JSON written by find_stereo_pairs.py and download the corresponding images (in parallel).
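
As a rough sketch of that download flow (the JSON layout, URL field, and helper names below are assumptions for illustration, not the PR's actual code), parallel downloads could look like:

    import json
    from concurrent.futures import ThreadPoolExecutor
    from pathlib import Path

    import requests


    def download_one(url: str, out_dir: Path) -> Path:
        # Stream a single image to disk, naming the file after the last URL segment.
        out_path = out_dir / url.rsplit("/", 1)[-1]
        with requests.get(url, stream=True) as resp:
            resp.raise_for_status()
            with open(out_path, "wb") as f:
                for block in resp.iter_content(chunk_size=1 << 20):
                    f.write(block)
        return out_path


    def download_pairs(json_path: Path, out_dir: Path, workers: int = 4) -> None:
        # Hypothetical layout: a list of pairs, each holding the image URLs to fetch.
        pairs = json.loads(json_path.read_text())
        urls = {url for pair in pairs for url in pair["urls"]}
        with ThreadPoolExecutor(max_workers=workers) as pool:
            list(pool.map(lambda url: download_one(url, out_dir), urls))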

Commits

* Parse NAC table using chunks: do not load the whole table at once in RAM, this helps keep RAM usage low.
* Fix unsupported `GeometryCollection` when calling `polygonize` with latest Shapely version (1.8.0)
* remove hard coded paths
* fix covering_set_search
* add parameters
* reformat using black
…r download_NAC.py, support downloading NAC from JSON