Long DSS Response Times on Requests with Large AOIs #1121

gsmithCAL opened this issue Sep 18, 2024 · 9 comments

gsmithCAL commented Sep 18, 2024

Describe the bug
When making a request to the DSS that includes an area_of_interest consisting of a circle with a radius of 100,000 m, the DSS takes upwards of 30 seconds to respond. This behavior is observed on DSS v0.17.0 with CRDB v24.1.3. It is important to note that this behavior is not seen when using DSS v0.17.0 with CRDB v21.2.7.

To Reproduce
Steps to reproduce the behavior:

  1. Deploy DSS v0.17.0 and CRDB v24.1.3 Docker containers. I did this using the InterUSS monitoring local deployment:
  • Clone the monitoring repo and check out the main branch
  • Navigate to the build/dev directory
  • Modify the scd/rid bootstrappers and dss services to use the dss:v0.17.0 images
  • Execute the ./run_locally.sh script to bring the services up
  2. Send the DSS a POST request to query all operational intent references. I did this using Postman:
  • Request an OAuth token with the utm.strategic_coordination scope
  • The request to the DSS is as follows:
    POST http://localhost:8082/dss/v1/operational_intent_references/query
    Body: {"area_of_interest": {"volume": {"outline_circle": {"center": {"lng": -95.28, "lat": 32.38}, "radius": {"value": 100000, "units": "M"}}, "altitude_lower": {"value": 0, "reference": "W84", "units": "M"}, "altitude_upper": {"value": 400, "reference": "W84", "units": "M"}}, "time_start": {"value": "2024-09-18T18:27:19.420Z", "format": "RFC3339"}, "time_end": {"value": "2024-09-19T18:27:19.420Z", "format": "RFC3339"}}}
  3. Wait for a response from the DSS. I have seen this take anywhere from 30 seconds up to 90 seconds.

Expected behavior
Using CRDB v21.2.7, these requests usually take around 700 ms to complete.

If anything needs further clarification, don't hesitate to leave a comment here or send a message on Slack.

@jctrouble

Running this query against the Wing internal instance times out after a minute. CRDB debugging bundle attached.

I believe it's this query:

SELECT
  scd_operations.id,
  scd_operations.owner,
  scd_operations.version,
  scd_operations.url,
  scd_operations.altitude_lower,
  scd_operations.altitude_upper,
  scd_operations.starts_at,
  scd_operations.ends_at,
  scd_operations.subscription_id,
  scd_operations.updated_at,
  scd_operations.state,
  scd_operations.cells
FROM
  scd_operations
WHERE
  cells && $1
  AND COALESCE(scd_operations.altitude_upper >= $2, true)
  AND COALESCE(scd_operations.altitude_lower <= $3, true)
  AND COALESCE(scd_operations.ends_at >= $4, true)
  AND COALESCE(scd_operations.starts_at <= $5, true)
LIMIT
  $6

where the culprit is this part:

cells && $1

where there are some 25,000+ S2 cells passed to the intersection query (in $1).

On CRDB 21.2.7 and below, the INVERTED INDEX cell_idx (cells) on scd_operations wasn't consulted for these sorts of queries (leading to #1070). My suspicion is that because newer CRDB versions actually use the inverted index, the massive number of S2 cells being passed in leads to some pathological behavior.
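
For a rough sense of where the 25,000 figure comes from, here is a back-of-envelope estimate (the assumption that the AOI is covered at roughly S2 level 13 is mine; the actual coverer configuration may differ):

import math

# Back-of-envelope estimate of the probe set size for a 100 km radius AOI.
# Assumption (mine): the area is covered at roughly S2 level 13.
EARTH_AREA_KM2 = 510_065_622                     # Earth's total surface area
LEVEL13_CELL_KM2 = EARTH_AREA_KM2 / 6 / 4**13    # 6 cube faces, 4 children per level

aoi_km2 = math.pi * 100.0**2                     # area of a 100 km radius circle
print(round(aoi_km2 / LEVEL13_CELL_KM2))         # ~24,800 cells, i.e. "25,000+"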

@jctrouble

In fact, looking at the code that was added in cockroachdb/cockroach#77418, I think there's some nested looping behavior going on, such that for every (or some large number of) S2 cells in the inverted index, we're likely iterating over the entirety of the passed-in cell array (of 25,000 items). That's likely leading to some n^2 behavior.
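
As a toy illustration of that nested-loop cost (synthetic IDs in plain Python, not CRDB's actual code path), runtime roughly quadruples each time the array sizes double:

import random
import time

def nested_loop_intersection(stored, probes):
    # Naive membership test: for each stored cell, scan the whole probe array.
    return [c for c in stored if any(c == p for p in probes)]

for n in (1_000, 2_000, 4_000):
    stored = random.sample(range(10**12), n)
    probes = random.sample(range(10**12), n)
    t0 = time.time()
    nested_loop_intersection(stored, probes)
    print(f"n={n}: {time.time() - t0:.2f}s")  # extrapolating to n=25,000 gets painful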

@BenjaminPelletier
Member

@gsmithCAL The InterUSS DSS is generally designed to support the use cases described in F3548-21. Are you planning a flight that is going to cover 12,000 square miles? If not, could you elaborate on why your planning area is so large (more than 1/4 the size of Ohio)?

I confirm I can reproduce using the code below with a local make start-locally DSS deployment from the latest commit on main:

import time

from monitoring.monitorlib import auth

import requests


adapter = auth.make_auth_adapter("DummyOAuth(http://localhost:8085/token,big_planning_client)")
token = adapter.issue_token("localhost", ["utm.strategic_coordination"])
body = {
    "area_of_interest": {
        "volume": {
            "outline_circle": {
                "center": {
                    "lng": -95.28,
                    "lat": 32.38
                },
                "radius": {
                    "value": 100000,
                    "units": "M"
                }
            },
            "altitude_lower": {
                "value": 0,
                "reference": "W84",
                "units": "M"
            },
            "altitude_upper": {
                "value": 400,
                "reference": "W84",
                "units": "M"
            }
        },
        "time_start": {
            "value": "2024-09-18T18:27:19.420Z",
            "format": "RFC3339"
        },
        "time_end": {
            "value": "2024-09-19T18:27:19.420Z",
            "format": "RFC3339"
        }
    }
}
t0 = time.time()
resp = requests.post(
    url="http://localhost:8082/dss/v1/operational_intent_references/query",
    json=body,
    headers={"Authorization": f"Bearer {token}"},
    timeout=300,
)
t1 = time.time()
print(f"Response code {resp.status_code} after {t1 - t0}s")

Using similar code, it does look like processing time scales with n^2 where n relates to area:

(Screenshot: plot of DSS response time vs. requested area, showing roughly quadratic growth)
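
A sketch of the kind of sweep that produces such a measurement (reusing token and body from the snippet above; the radius values are illustrative):

# Sketch only: vary the circle radius and time each query, reusing token and
# body from the previous snippet. Radius values are illustrative.
for radius_m in (12_500, 25_000, 50_000, 100_000):
    body["area_of_interest"]["volume"]["outline_circle"]["radius"]["value"] = radius_m
    t0 = time.time()
    resp = requests.post(
        url="http://localhost:8082/dss/v1/operational_intent_references/query",
        json=body,
        headers={"Authorization": f"Bearer {token}"},
        timeout=300,
    )
    print(f"radius={radius_m} m -> {resp.status_code} after {time.time() - t0:.1f}s")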

@gsmithCAL
Author

@BenjaminPelletier I don't specifically have a use case for this. I originally came across this behavior when our USS attempted to create a subscription over a large area. I agree that in a "production" environment you would be unlikely to ever need to request an AOI as large as this.

@Shastick
Contributor

Given that the nested loops seem to be used only for the && operator, we might be able to work around this problem by using some form of subquery?

For example, there is an unnest function that will turn an array into a set of rows, on which I assume we could simply join, in the hope that CRDB will use a more efficient approach.

I'll give that a try tomorrow: it should be easy to check if it helps or not.
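
To make the "easy to check" part concrete, here is a minimal sketch of how one might inspect the plan for an unnest-based rewrite. The assumptions are mine: the local dev CRDB is reachable insecurely on port 26257, the SCD tables live in a database named scd, the probe cell IDs below are placeholders for a real AOI covering, and the rewritten query is only one possible shape of the idea.

import psycopg2

# Sketch of checking whether an unnest-based rewrite still gets an inverted-index
# plan. Assumptions (mine): insecure local CRDB on port 26257, SCD tables in a
# database named "scd", and placeholder probe cell IDs instead of a real covering.
REWRITTEN = """
EXPLAIN
SELECT DISTINCT o.id
FROM scd_operations AS o
JOIN unnest(%s::INT8[]) AS probe (cell)
  ON o.cells @> ARRAY[probe.cell]
"""

probe_cells = [1234567890123456789, 1234567890123456790]  # placeholders

with psycopg2.connect("postgresql://root@localhost:26257/scd?sslmode=disable") as conn:
    with conn.cursor() as cur:
        cur.execute(REWRITTEN, (probe_cells,))
        for (line,) in cur.fetchall():
            print(line)  # look for the inverted index vs. a full table scan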

@jctrouble

@Shastick keep an eye out for #1070 as you do your experiment, as we still need to ensure we use the inverted index, otherwise we'll end up doing a full table scan and hurt the non-pathological cases.

@BenjaminPelletier
Member

I think it's worth investigating why a notional performance improvement (implementation of inverted indices) resulted in a substantial decrease in performance -- especially determining if that might hurt us in more normal circumstances as well. But, I think support for extremely large planning areas would not be particularly high on the priority list as our implementation is targeted at USSs following one or more of the ASTM standards (F3411 and/or F3548) and I don't think we've identified a true use case for this kind of query in that context.

@jctrouble

On CRDB v21.2.7, this query would have been implemented using a full table scan + post filter, or more likely an index scan (of the altitude & start/end times) + post filter. In both cases (full or index scan), the pathologically-large cell probe set is applied to a subset of the table. My guess is the other indexes also kick in and the bad S2 cell set is applied to a much smaller number of rows, leading to the "good" 700ms latency.

It's also quite possible that the logic that was doing the post-filtering was able to do smart things like building a hash map and probing, or using sorting of the arrays to do a binary search rather than a naive loop. Neither of those techniques are applied in the inverted index logic described above.

Post CRDB v21.2.7, the pathologically-large cell probe set is applied to every single row in the database, regardless of other filters (because it happens at index-consultation time). In more nominal cases, the inefficiencies of the indexing loops are going to be dwarfed by the cost of performing a full table scan, shipping data over the wire, and then doing the filtering.

This is at least partly a CRDB bug - it should handle this sort of case better than the current nested loop in the index path. However, I don't foresee it being an issue for "normal" queries - see the huge performance gains in #1070 as proof - it exercises this same inefficiency.
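
For comparison, toy versions of the cheaper post-filter strategies mentioned above (hash probing and binary search over a sorted array) look roughly like this; synthetic data, not CRDB's code:

import bisect
import random

# Toy versions of the cheaper membership strategies mentioned above.
stored = random.sample(range(10**12), 25_000)
probes = random.sample(range(10**12), 25_000) + stored[:10]  # guarantee a few hits

probe_set = set(probes)                  # hash map: O(1) expected per lookup
hits_hash = [c for c in stored if c in probe_set]

sorted_probes = sorted(probes)           # sorted array: O(log n) per lookup

def in_sorted(arr, x):
    i = bisect.bisect_left(arr, x)
    return i < len(arr) and arr[i] == x

hits_bisect = [c for c in stored if in_sorted(sorted_probes, c)]
assert hits_hash == hits_bisect          # both find the same handful of overlaps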

@Shastick
Contributor

> @Shastick keep an eye out for #1070 as you do your experiment, as we still need to ensure we use the inverted index, otherwise we'll end up doing a full table scan and hurt the non-pathological cases.

From a quick try, and subsequently looking at the CRDB web UI, it seems that the naive subquery approach results in full table scans. I don't know whether this is due to the low number of records generated by the qualifier, or whether that would still be the case on a real-world deployment with many records.
