Long DSS Response Times on Requests with Large AOIs #1121

gsmithCAL opened this issue Sep 18, 2024 · 9 comments

gsmithCAL commented Sep 18, 2024

Describe the bug
When making a request to the DSS that includes an area_of_interest consisting of a circle with a radius of 100,000 m, the DSS takes upwards of 30 seconds to respond. This behavior is observed on DSS v0.17.0 with CRDB v24.1.3. It is important to note that this behavior is not seen when using DSS v0.17.0 with CRDB v21.2.7.

To Reproduce
Steps to reproduce the behavior:

  1. Deploy DSS v0.17.0 and CRDB v24.1.3 Docker containers. I did this using the InterUSS monitoring local deployment:
  • Clone the monitoring repo and check out the main branch
  • Navigate to the build/dev directory
  • Modify the scd/rid bootstrappers and dss services to use the dss:v0.17.0 images
  • Execute the ./run_locally.sh script to bring the services up
  2. Send the DSS a POST request to query all operational intent references. I did this using Postman:
  • Request an OAuth token with the utm.strategic_coordination scope
  • The request to the DSS is as follows:
    POST http://localhost:8082/dss/v1/operational_intent_references/query
    Body: {"area_of_interest": {"volume": {"outline_circle": {"center": {"lng": -95.28, "lat": 32.38}, "radius": {"value": 100000, "units": "M"}}, "altitude_lower": {"value": 0, "reference": "W84", "units": "M"}, "altitude_upper": {"value": 400, "reference": "W84", "units": "M"}}, "time_start": {"value": "2024-09-18T18:27:19.420Z", "format": "RFC3339"}, "time_end": {"value": "2024-09-19T18:27:19.420Z", "format": "RFC3339"}}}
  3. Wait for a response from the DSS. I have seen this take anywhere from 30 seconds up to 90 seconds.

Expected behavior
Using CRDB v21.2.7, these requests usually take around 700 ms to complete.

If anything needs further clarification, don't hesitate to leave a comment here or send a message on Slack.

@jctrouble

Running this query against the Wing internal instance times out after a minute. CRDB debugging bundle attached.

I believe it's this query:

SELECT
  scd_operations.id,
  scd_operations.owner,
  scd_operations.version,
  scd_operations.url,
  scd_operations.altitude_lower,
  scd_operations.altitude_upper,
  scd_operations.starts_at,
  scd_operations.ends_at,
  scd_operations.subscription_id,
  scd_operations.updated_at,
  scd_operations.state,
  scd_operations.cells
FROM
  scd_operations
WHERE
  cells && $1
  AND COALESCE(scd_operations.altitude_upper >= $2, true)
  AND COALESCE(scd_operations.altitude_lower <= $3, true)
  AND COALESCE(scd_operations.ends_at >= $4, true)
  AND COALESCE(scd_operations.starts_at <= $5, true)
LIMIT
  $6

where the culprit is this part:

cells && $1

where there are some 25,000+ S2 cells passed to the intersection query (in $1).

On CRDB 21.2.7 and below, the INVERTED INDEX cell_idx (cells) on scd_operations wasn't consulted for these sorts of queries (leading to #1070). My suspicion is that because newer CRDB versions actually use the inverted index, the massive number of S2 cells being passed in leads to some pathological behavior.
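
For a rough sense of where the 25,000 figure comes from, here is a back-of-envelope estimate (the assumption that the AOI is covered at roughly S2 level 13 is mine; the actual coverer configuration may differ):

import math

# Back-of-envelope estimate of the probe set size for a 100 km radius AOI.
# Assumption (mine): the area is covered at roughly S2 level 13.
EARTH_AREA_KM2 = 510_065_622                     # Earth's total surface area
LEVEL13_CELL_KM2 = EARTH_AREA_KM2 / 6 / 4**13    # 6 cube faces, 4 children per level

aoi_km2 = math.pi * 100.0**2                     # area of a 100 km radius circle
print(round(aoi_km2 / LEVEL13_CELL_KM2))         # ~24,800 cells, i.e. "25,000+"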

@jctrouble

In fact, looking at the code that was added in cockroachdb/cockroach#77418, I think there's some nested looping behavior going on, such that for every (or some large number of) S2 cells in the inverted index, we're likely iterating over the entirety of the passed-in cell array (of 25,000 items). That's likely leading to some n^2 behavior.
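
As a toy illustration of that nested-loop cost (synthetic IDs in plain Python, not CRDB's actual code path), runtime roughly quadruples each time the array sizes double:

import random
import time

def nested_loop_intersection(stored, probes):
    # Naive membership test: for each stored cell, scan the whole probe array.
    return [c for c in stored if any(c == p for p in probes)]

for n in (1_000, 2_000, 4_000):
    stored = random.sample(range(10**12), n)
    probes = random.sample(range(10**12), n)
    t0 = time.time()
    nested_loop_intersection(stored, probes)
    print(f"n={n}: {time.time() - t0:.2f}s")  # extrapolating to n=25,000 gets painful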

@BenjaminPelletier
Member

@gsmithCAL The InterUSS DSS is generally designed to support the use cases described in F3548-21. Are you planning a flight that is going to cover 12,000 square miles? If not, could you elaborate on why your planning area is so large (more than 1/4 the size of Ohio)?

I confirm I can reproduce using the code below with a local make start-locally DSS deployment from the latest commit on main:

import time

from monitoring.monitorlib import auth

import requests


adapter = auth.make_auth_adapter("DummyOAuth(http://localhost:8085/token,big_planning_client)")
token = adapter.issue_token("localhost", ["utm.strategic_coordination"])
body = {
    "area_of_interest": {
        "volume": {
            "outline_circle": {
                "center": {
                    "lng": -95.28,
                    "lat": 32.38
                },
                "radius": {
                    "value": 100000,
                    "units": "M"
                }
            },
            "altitude_lower": {
                "value": 0,
                "reference": "W84",
                "units": "M"
            },
            "altitude_upper": {
                "value": 400,
                "reference": "W84",
                "units": "M"
            }
        },
        "time_start": {
            "value": "2024-09-18T18:27:19.420Z",
            "format": "RFC3339"
        },
        "time_end": {
            "value": "2024-09-19T18:27:19.420Z",
            "format": "RFC3339"
        }
    }
}
t0 = time.time()
resp = requests.post(
    url="http://localhost:8082/dss/v1/operational_intent_references/query",
    json=body,
    headers={"Authorization": f"Bearer {token}"},
    timeout=300,
)
t1 = time.time()
print(f"Response code {resp.status_code} after {t1 - t0}s")

Using similar code, it does look like processing time scales with n^2 where n relates to area:

(Screenshot: plot of DSS response time vs. requested area, showing roughly quadratic growth)
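
A sketch of the kind of sweep that produces such a measurement (reusing token and body from the snippet above; the radius values are illustrative):

# Sketch only: vary the circle radius and time each query, reusing token and
# body from the previous snippet. Radius values are illustrative.
for radius_m in (12_500, 25_000, 50_000, 100_000):
    body["area_of_interest"]["volume"]["outline_circle"]["radius"]["value"] = radius_m
    t0 = time.time()
    resp = requests.post(
        url="http://localhost:8082/dss/v1/operational_intent_references/query",
        json=body,
        headers={"Authorization": f"Bearer {token}"},
        timeout=300,
    )
    print(f"radius={radius_m} m -> {resp.status_code} after {time.time() - t0:.1f}s")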

@gsmithCAL
Author

@BenjaminPelletier I don't specifically have a use case for this. I originally came across this behavior when our USS attempted to create a subscription over a large area. I agree that in a "production" environment you would be unlikely to ever need to request an AOI as large as this.

@Shastick
Contributor

Given that the nested loops seem to be used only for the && operator, we might be able to work around this problem by using some form of subquery?

For example, there is an unnest function that will turn an array into a set of rows, on which I assume we could simply join, in the hope that CRDB will use a more efficient approach.

I'll give that a try tomorrow: it should be easy to check if it helps or not.
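
To make the "easy to check" part concrete, here is a minimal sketch of how one might inspect the plan for an unnest-based rewrite. The assumptions are mine: the local dev CRDB is reachable insecurely on port 26257, the SCD tables live in a database named scd, the probe cell IDs below are placeholders for a real AOI covering, and the rewritten query is only one possible shape of the idea.

import psycopg2

# Sketch of checking whether an unnest-based rewrite still gets an inverted-index
# plan. Assumptions (mine): insecure local CRDB on port 26257, SCD tables in a
# database named "scd", and placeholder probe cell IDs instead of a real covering.
REWRITTEN = """
EXPLAIN
SELECT DISTINCT o.id
FROM scd_operations AS o
JOIN unnest(%s::INT8[]) AS probe (cell)
  ON o.cells @> ARRAY[probe.cell]
"""

probe_cells = [1234567890123456789, 1234567890123456790]  # placeholders

with psycopg2.connect("postgresql://root@localhost:26257/scd?sslmode=disable") as conn:
    with conn.cursor() as cur:
        cur.execute(REWRITTEN, (probe_cells,))
        for (line,) in cur.fetchall():
            print(line)  # look for the inverted index vs. a full table scan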

@jctrouble

@Shastick keep an eye out for #1070 as you do your experiment, as we still need to ensure we use the inverted index, otherwise we'll end up doing a full table scan and hurt the non-pathological cases.

@BenjaminPelletier
Member

I think it's worth investigating why a notional performance improvement (implementation of inverted indices) resulted in a substantial decrease in performance -- especially determining if that might hurt us in more normal circumstances as well. But, I think support for extremely large planning areas would not be particularly high on the priority list as our implementation is targeted at USSs following one or more of the ASTM standards (F3411 and/or F3548) and I don't think we've identified a true use case for this kind of query in that context.

@jctrouble

On CRDB v21.2.7, this query would have been implemented using a full table scan + post filter, or more likely an index scan (of the altitude & start/end times) + post filter. In both cases (full or index scan), the pathologically-large cell probe set is applied to a subset of the table. My guess is the other indexes also kick in and the bad S2 cell set is applied to a much smaller number of rows, leading to the "good" 700ms latency.

It's also quite possible that the logic that was doing the post-filtering was able to do smart things like building a hash map and probing, or using sorting of the arrays to do a binary search rather than a naive loop. Neither of those techniques are applied in the inverted index logic described above.

Post CRDB v21.2.7, the pathologically-large cell probe set is applied to every single row in the database, regardless of other filters (because it happens at index-consultation time). In more nominal cases, the inefficiencies of the indexing loops are going to be dwarfed by the cost of performing a full table scan, shipping data over the wire, and then doing the filtering.

This is at least partly a CRDB bug - it should handle this sort of case better than the current nested loop in the index path. However, I don't foresee it being an issue for "normal" queries - see the huge performance gains in #1070 as proof - it exercises this same inefficiency.
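
For comparison, toy versions of the cheaper post-filter strategies mentioned above (hash probing and binary search over a sorted array) look roughly like this; synthetic data, not CRDB's code:

import bisect
import random

# Toy versions of the cheaper membership strategies mentioned above.
stored = random.sample(range(10**12), 25_000)
probes = random.sample(range(10**12), 25_000) + stored[:10]  # guarantee a few hits

probe_set = set(probes)                  # hash map: O(1) expected per lookup
hits_hash = [c for c in stored if c in probe_set]

sorted_probes = sorted(probes)           # sorted array: O(log n) per lookup

def in_sorted(arr, x):
    i = bisect.bisect_left(arr, x)
    return i < len(arr) and arr[i] == x

hits_bisect = [c for c in stored if in_sorted(sorted_probes, c)]
assert hits_hash == hits_bisect          # both find the same handful of overlaps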

@Shastick
Contributor

> @Shastick keep an eye out for #1070 as you do your experiment, as we still need to ensure we use the inverted index, otherwise we'll end up doing a full table scan and hurt the non-pathological cases.

From a quick try, and subsequently looking at the CRDB web UI, it seems that the naive subquery approach results in full table scans. I don't know whether this is due to the low number of records generated by the qualifier, or whether that would still be the case on a real-world deployment with many records.
