Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore(OAuth2): refactor for custom OAuth2 clients #27880

Merged
merged 2 commits into from
Apr 5, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion superset/common/query_object.py
Original file line number Diff line number Diff line change
Expand Up @@ -108,7 +108,7 @@ class QueryObject: # pylint: disable=too-many-instance-attributes
time_range: str | None
to_dttm: datetime | None

def __init__( # pylint: disable=too-many-arguments,too-many-locals
def __init__( # pylint: disable=too-many-locals, too-many-arguments
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of curiosity, was this change automatically applied by one of our linters?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I think what happened is that pylint got bumped. When that happens, the new rules are not enforced in the same PR, since it only runs in files that have been modified. So only in the next PR that the rules will be checked. I think @mistercrunch is working on fixing this.

Here I fixed the lint manually, then someone else fixed in master, but using a different order.

self,
*,
annotation_layers: list[dict[str, Any]] | None = None,
Expand Down
20 changes: 15 additions & 5 deletions superset/config.py
Original file line number Diff line number Diff line change
Expand Up @@ -1409,12 +1409,20 @@ def EMAIL_HEADER_MUTATOR( # pylint: disable=invalid-name,unused-argument

# Details needed for databases that allows user to authenticate using personal
# OAuth2 tokens. See https://github.com/apache/superset/issues/20300 for more
# information
DATABASE_OAUTH2_CREDENTIALS: dict[str, dict[str, Any]] = {
# information. The scope and URIs are optional.
DATABASE_OAUTH2_CLIENTS: dict[str, dict[str, Any]] = {
# "Google Sheets": {
# "CLIENT_ID": "XXX.apps.googleusercontent.com",
# "CLIENT_SECRET": "GOCSPX-YYY",
# "BASEURL": "https://accounts.google.com/o/oauth2/v2/auth",
# "id": "XXX.apps.googleusercontent.com",
# "secret": "GOCSPX-YYY",
# "scope": " ".join(
# [
# "https://www.googleapis.com/auth/drive.readonly",
# "https://www.googleapis.com/auth/spreadsheets",
# "https://spreadsheets.google.com/feeds",
# ]
# ),
# "authorization_request_uri": "https://accounts.google.com/o/oauth2/v2/auth",
# "token_request_uri": "https://oauth2.googleapis.com/token",
# },
}
# OAuth2 state is encoded in a JWT using the alogorithm below.
Expand All @@ -1425,6 +1433,8 @@ def EMAIL_HEADER_MUTATOR( # pylint: disable=invalid-name,unused-argument
# applications. In that case, the proxy can forward the request to the correct instance
# by looking at the `default_redirect_uri` attribute in the OAuth2 state object.
# DATABASE_OAUTH2_REDIRECT_URI = "http://localhost:8088/api/v1/database/oauth2/"
# Timeout when fetching access and refresh tokens.
DATABASE_OAUTH2_TIMEOUT = timedelta(seconds=30)

# Enable/disable CSP warning
CONTENT_SECURITY_POLICY_WARNING = True
Expand Down
2 changes: 1 addition & 1 deletion superset/connectors/sqla/utils.py
Original file line number Diff line number Diff line change
Expand Up @@ -145,7 +145,7 @@ def get_columns_description(
cursor = conn.cursor()
query = database.apply_limit_to_sql(query, limit=1)
cursor.execute(query)
db_engine_spec.execute(cursor, query, database.id)
db_engine_spec.execute(cursor, query, database)
result = db_engine_spec.fetch_data(cursor, limit=1)
result_set = SupersetResultSet(result, cursor.description, db_engine_spec)
return result_set.columns
Expand Down
6 changes: 5 additions & 1 deletion superset/databases/api.py
Original file line number Diff line number Diff line change
Expand Up @@ -1115,9 +1115,13 @@ def oauth2(self) -> FlaskResponse:
if database is None:
return self.response_404()

oauth2_config = database.get_oauth2_config()
if oauth2_config is None:
raise OAuth2Error("No configuration found for OAuth2")

token_response = database.db_engine_spec.get_oauth2_token(
oauth2_config,
parameters["code"],
state,
)

# delete old tokens
Expand Down
62 changes: 25 additions & 37 deletions superset/db_engine_specs/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -547,65 +547,53 @@ Alternatively, it's also possible to impersonate users by implementing the `upda

Support for authenticating to a database using personal OAuth2 access tokens was introduced in [SIP-85](https://github.com/apache/superset/issues/20300). The Google Sheets DB engine spec is the reference implementation.

To add support for OAuth2 to a DB engine spec, the following attribute and methods are needed:
Note that this API is still experimental and evolving quickly, subject to breaking changes. Currently, to add support for OAuth2 to a DB engine spec, the following attributes are needed:

```python
class BaseEngineSpec:

supports_oauth2 = True
oauth2_exception = OAuth2RedirectError

@staticmethod
def is_oauth2_enabled() -> bool:
return False

@staticmethod
def get_oauth2_authorization_uri(state: OAuth2State) -> str:
raise NotImplementedError()

@staticmethod
def get_oauth2_token(code: str, state: OAuth2State) -> OAuth2TokenResponse:
raise NotImplementedError()

@staticmethod
def get_oauth2_fresh_token(refresh_token: str) -> OAuth2TokenResponse:
raise NotImplementedError()
oauth2_scope = " ".join([
"https://example.org/scope1",
"https://example.org/scope2",
])
oauth2_authorization_request_uri = "https://example.org/authorize"
oauth2_token_request_uri = "https://example.org/token"
```

The `oauth2_exception` is an exception that is raised by `cursor.execute` when OAuth2 is needed. This will start the OAuth2 dance when `BaseEngineSpec.execute` is called, by returning the custom error `OAUTH2_REDIRECT` to the frontend. If the database driver doesn't have a specific exception, it might be necessary to overload the `execute` method in the DB engine spec, so that the `BaseEngineSpec.start_oauth2_dance` method gets called whenever OAuth2 is needed.

The first method, `is_oauth2_enabled`, is used to inform if the database supports OAuth2. This can be dynamic; for example, the Google Sheets DB engine spec checks if the Superset configuration has the necessary section:

```python
from flask import current_app

The DB engine should implement logic in either `get_url_for_impersonation` or `update_impersonation_config` to update the connection with the personal access token. See the Google Sheets DB engine spec for a reference implementation.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For some forthcoming major version we might want to consider renaming these methods to make it clear they don't only encapsulate impersonation logic. Maybe call it something that clarifies that it updates the URL to include user information (=JWT in the case of OAuth2 or user impersonation in the more traditional approach).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But OAuth2 is still a form of user impersonation, we're running a query on a connection that is built on behalf of the user. Putting this under user impersonation means that the cache is not shared between users. This is what we want most of the time, but not always, which is why I think we should decouple the "user-level cache" from "user-impersonation" and communicate better what's happening in these flows.

Having said that, I agree that the naming of these methods is confusing, and personally I think they could be combined in a single one. Currently the former modifies the URL and the latter modifies the params passed to create_engine, but they could be consolidated.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may be splitting hairs, but I perceive user impersonation to be a very specific execution form, which means "query is executed using a service account, but is executed as if a particular user us executing it". In the case of OAuth2, the query is in fact executed by the user, not the service account, and makes it more secure.

I also agree that consolidating these methods could make sense, as having separate methods for mutating the URL and the other for params to create_engine often causes confusion for me, and are IMO really the same thing.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see.

Interestingly, I always understood it is "Superset is impersonating a user", while your perception is that "the service account is impersonating a user". I agree those are two very different scenarios — OAuth2 starts from zero permissions and increases them; traditional impersonation starts with full permissions and lowers them.

But it just confirms your point, the naming is confusing, and we should definitely clarify it and probably make these two separate things.


class GSheetsEngineSpec(ShillelaghEngineSpec):
@staticmethod
def is_oauth2_enabled() -> bool:
return "Google Sheets" in current_app.config["DATABASE_OAUTH2_CREDENTIALS"]
```

Where the configuration for OAuth2 would look like this:
Currently OAuth2 needs to be configured at the DB engine spec level, ie, with one client for each DB engien spec. The configuration lives in `superset_config.py`:

```python
# superset_config.py
DATABASE_OAUTH2_CREDENTIALS = {
DATABASE_OAUTH2_CLIENTS = {
"Google Sheets": {
"CLIENT_ID": "XXX.apps.googleusercontent.com",
"CLIENT_SECRET": "GOCSPX-YYY",
"id": "XXX.apps.googleusercontent.com",
"secret": "GOCSPX-YYY",
"scope": " ".join(
[
"https://www.googleapis.com/auth/drive.readonly",
"https://www.googleapis.com/auth/spreadsheets",
"https://spreadsheets.google.com/feeds",
],
),
"authorization_request_uri": "https://accounts.google.com/o/oauth2/v2/auth",
"token_request_uri": "https://oauth2.googleapis.com/token",
},
}
DATABASE_OAUTH2_JWT_ALGORITHM = "HS256"
DATABASE_OAUTH2_REDIRECT_URI = "http://localhost:8088/api/v1/database/oauth2/"
DATABASE_OAUTH2_TIMEOUT = timedelta(seconds=30)
```

The second method, `get_oauth2_authorization_uri`, is responsible for building the URL where the user is sent to initiate OAuth2. This method receives a `state`. The state is an encoded JWT that is passed to the OAuth2 provider, and is received unmodified when the user is redirected back to Superset. The default state contains the user ID and the database ID, so that Superset can know where to store the received OAuth2 tokens.

Additionally, the state also contains a `tab_id`, which is a random UUID4 used as a shared secret for communication between browser tabs. When OAuth2 starts, Superset will open a new browser tab, where the user will grant permissions to Superset. When authentication is complete and successful this opened tab will send a message to the original tab, so that the original query can be re-run. The `tab_id` is sent by the opened tab and verified by the original tab to prevent malicious messages from other sites. As an additional security measure the origin of the message should match the OAuth2 redirect URL.

State also contains a `defaul_redirect_uri`, which is the enpoint in Supeset that receives the tokens from the OAuth2 provider (`/api/v1/database/oauth2/`). The redirect URL can be overwritten in the config file via the `DATABASE_OAUTH2_REDIRECT_URI` parameter. This might be useful where you have multiple Superset instances. Since the OAuth2 provider requires the redirect URL to be registered a priori, it might be easier (or needed) to register a single URL for a proxy service; the proxy service can then inspect the JWT and redirect the request to `defaul_redirect_uri`.
When configuring a client only the ID and secret are required; the DB engine spec should have default values for the scope and endpoints. The `DATABASE_OAUTH2_REDIRECT_URI` attribute is optional, and defaults to `/api/v1/databases/oauth2/` in Superset.

Finally, `get_oauth2_token` and `get_oauth2_fresh_token` are used to actually retrieve a token and refresh an expired token, respectively.
In the future we plan to support adding custom clients via the Superset UI, and being able to manually assign clients to specific databases.

### File upload

Expand Down
Loading
Loading