RAM optimizations, additional parameters (discussion) #33

Open · wants to merge 6 commits into master
Conversation

@PicoJr (Contributor) commented Dec 1, 2021

Motivation

The purpose of this PR is to discuss a few possible optimizations (RAM usage, coverage) and additional parameters.

This PR is probably not mature enough to be merged.
It probably contains bugs and breaks parts of the existing API.

@foobarbecue feel free to cherry-pick what you deem useful ;)

RAM optimizations

Parsing NAC index

  • This PR adds chunked processing of the NAC index

It keeps memory usage low while parsing the NAC index.

It works thanks to the chunksize parameter provided by pandas.read_csv:

        # read nac_index as chunks instead of reading everything at once in memory
        nac_index = pandas.read_csv(indfilepath, header=None, names=col_list, chunksize=chunksize)

Then the join operation is done over each chunk of lines:

        filtered = []
        chunks_metadata = load_nac_metadata.load_nac_index(
            indfilepath=indfilepath, lblfilepath=lblfilepath
        )
        for (
            chunk_metadata
        ) in chunks_metadata:  # iterate over chunks instead of loading all CSV into RAM
            chunk_footprints = footprints.join(
                chunk_metadata, how="inner", lsuffix="_ode", rsuffix=""
            )

            # redacted...

            filtered.append(chunk_footprints)
        return pandas.concat(filtered).dropna()
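
For context, load_nac_metadata.load_nac_index can be written as a generator that yields one parsed chunk at a time. The following is only a sketch: parse_columns_from_lbl is a hypothetical stand-in for reading column names from the .LBL file, and the product id column name is an assumption, not the PR's exact code.

    from typing import Iterator, List

    import pandas


    def load_nac_index(
        indfilepath: str, lblfilepath: str, chunksize: int = 100_000
    ) -> Iterator[pandas.DataFrame]:
        # Column names come from the .LBL label file; parse_columns_from_lbl
        # is a hypothetical helper standing in for that parsing step.
        col_list: List[str] = parse_columns_from_lbl(lblfilepath)
        # read nac_index as chunks instead of reading everything at once in memory
        for chunk in pandas.read_csv(
            indfilepath, header=None, names=col_list, chunksize=chunksize
        ):
            # Index each chunk by product id so it can be joined against footprints
            # (the exact column name is an assumption).
            yield chunk.set_index("PRODUCT_ID")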

Computing pairs (overlay)

geopandas.overlay (and to some extent geopandas itself) has performance issues.

When we find many image candidates for pairs, geopandas.overlay explodes (RAM and computation time).

It wastes a lot of time computing a huge number of pairs that are mostly discarded by later filters (sun geometry and area).

This PR adds a generator that yields chunks of pairs and filters them as they are generated.

That way RAM usage is kept low and it is possible to abort early once enough pairs have been found.

import itertools
from typing import Iterator

import geopandas
import tqdm
from geopandas import GeoDataFrame


def pairs_iter_from_image_search(imagesearch: "ImageSearch") -> Iterator[GeoDataFrame]:
    gdf: GeoDataFrame = imagesearch.results.dropna()
    # Store index (product id) in column so that it's preserved in spatial join operation
    gdf["prod_id"] = gdf.index

    chunk_row_size = 100
    chunks = [
        gdf[i : i + chunk_row_size] for i in range(0, gdf.shape[0], chunk_row_size)
    ]
    for (chunk_1, chunk_2) in tqdm.tqdm(
        itertools.product(chunks, chunks), total=len(chunks) ** 2
    ):
        pairs = geopandas.overlay(chunk_1, chunk_2, how="union", keep_geom_type=True)
        # redacted ...
        yield pairs

A generator that yields chunks of pairs.

And then the generator is used here:

    filtered_pairs = []
    filtered_pairs_count = 0
    for pairs_chunk in pairs_iter_from_image_search(imgs):
        filtered_chunk_pairs = filter_small_overlaps(
            filter_sun_geometry(
                pairs_chunk, incidence_range=(incidence_range_low, incidence_range_high)
            )
        )
        filtered_pairs.append(filtered_chunk_pairs)
        filtered_pairs_count += len(filtered_chunk_pairs)
        if filtered_pairs_count > max_pairs:
            if verbose:
                print(f"found {filtered_pairs_count} pairs > --max-pairs={max_pairs}")
            break

    pairs = pandas.concat(filtered_pairs)

Improved coverage

When looking for a minimal set of pairs that covers an area, the way the existing code chooses a new search point seems flawed.

Instead of:

# always pick the same point even if it could not find a pair to cover it...
search_point = remaining_uncovered_poly.representative_point()

This PR uses:

# pick a random point in the remaining uncovered poly
search_point = random_points_in_polygon(remaining_uncovered_poly, 1)[0]

This change helps with coverage close to the equator, where finding pairs is harder.
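
random_points_in_polygon is a helper added by this PR; its implementation is not shown above, but a minimal rejection-sampling sketch over a Shapely polygon could look like the following (the signature is an assumption based on the call site):

    import random
    from typing import List

    from shapely.geometry import Point, Polygon


    def random_points_in_polygon(polygon: Polygon, count: int) -> List[Point]:
        # Rejection sampling: draw candidates uniformly from the bounding box
        # and keep only those that actually fall inside the polygon.
        min_x, min_y, max_x, max_y = polygon.bounds
        points: List[Point] = []
        while len(points) < count:
            candidate = Point(random.uniform(min_x, max_x), random.uniform(min_y, max_y))
            if polygon.contains(candidate):
                points.append(candidate)
        return points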

New parameters

  • indfilepath and lblfilepath: paths for INDEX.TAB and INDEX.LBL
  • max_pairs: stop looking for pairs as soon as at least max_pairs have been found
  • miss_limit: how many times we may fail to cover a point when providing --find-covering=True
  • incidence_range_low, incidence_range_high: filter out pairs for which the image sun incidence is outside this range. This helps find better pairs close to the north/south poles.
  • json_output: path for dumping the JSON containing the pairs.
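
For illustration, these parameters could be exposed on the command line roughly as follows (the argparse wiring and default values are assumptions, not the PR's actual code):

    import argparse

    parser = argparse.ArgumentParser(description="find stereo pairs")
    # Paths to the NAC cumulative index table and its label file.
    parser.add_argument("--indfilepath", help="path to INDEX.TAB")
    parser.add_argument("--lblfilepath", help="path to INDEX.LBL")
    # Stop looking for pairs as soon as at least this many have been found.
    parser.add_argument("--max-pairs", type=int, default=1000)
    # How many times we may fail to cover a point when --find-covering=True.
    parser.add_argument("--miss-limit", type=int, default=10)
    # Keep only pairs whose image sun incidence lies inside this range.
    parser.add_argument("--incidence-range-low", type=float, default=0.0)
    parser.add_argument("--incidence-range-high", type=float, default=90.0)
    # Where to dump the JSON containing the pairs.
    parser.add_argument("--json-output", help="path for the output JSON")
    args = parser.parse_args()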

Misc

This PR also modifies download_NAC.py so that it can read pairs from the JSON written by find_stereo_pairs.py and download the corresponding images (in parallel).
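
As a rough sketch of that download flow (the JSON layout, URL field, and helper names below are assumptions for illustration, not the PR's actual code), parallel downloads could look like:

    import json
    from concurrent.futures import ThreadPoolExecutor
    from pathlib import Path

    import requests


    def download_one(url: str, out_dir: Path) -> Path:
        # Stream a single image to disk, naming the file after the last URL segment.
        out_path = out_dir / url.rsplit("/", 1)[-1]
        with requests.get(url, stream=True) as resp:
            resp.raise_for_status()
            with open(out_path, "wb") as f:
                for block in resp.iter_content(chunk_size=1 << 20):
                    f.write(block)
        return out_path


    def download_pairs(json_path: Path, out_dir: Path, workers: int = 4) -> None:
        # Hypothetical layout: a list of pairs, each holding the image URLs to fetch.
        pairs = json.loads(json_path.read_text())
        urls = {url for pair in pairs for url in pair["urls"]}
        with ThreadPoolExecutor(max_workers=workers) as pool:
            list(pool.map(lambda url: download_one(url, out_dir), urls))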

Commits

* Parse NAC table using chunks: do not load the whole table at once in RAM, this helps keep RAM usage low.
* Fix unsupported `GeometryCollection` when calling `polygonize` with latest Shapely version (1.8.0)
* remove hard coded paths
* fix covering_set_search
* add parameters
* reformat using black
…r download_NAC.py, support downloading NAC from JSON